Naive Bayesian classification

From Wikipedia, the free encyclopedia in Sundanese

The naive Bayes classifier is a simple probabilistic classification method. A more precise term for the underlying probability model is the independent feature model. The term naive Bayes refers to the fact that the model's probabilities can be derived using Bayes' theorem (named after Thomas Bayes) and that it makes strong independence assumptions that rarely hold in the real world; the model is therefore (deliberately) naive. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with a naive Bayes model without subscribing to Bayesian probability or using any Bayesian methods.

The naive Bayes probability model

Abstractly, the probability model for a classifier is a conditional model

p(C \vert F_1,\dots,F_n)\,

over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F_1 through F_n. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.
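A quick sanity check of the table-size argument: with n features each taking v values, a full conditional table needs one entry per combination of feature values, while the independence assumption derived below needs only on the order of n·v parameters per class. The numbers n and v here are arbitrary illustrations.

```python
# Full joint table vs. per-feature parameters (illustrative numbers only).
n, v = 30, 10
full_table_cells = v ** n    # one entry per combination of feature values
naive_bayes_params = n * v   # roughly n*v per class under independence
print(full_table_cells)      # 10**30 cells: hopelessly large
print(naive_bayes_params)    # 300 parameters: easily manageable
```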

Using Bayes' theorem, we write

p(C \vert F_1,\dots,F_n) = \frac{p(C) \ p(F_1,\dots,F_n\vert C)}{p(F_1,\dots,F_n)}

In practice we are only interested in the numerator, since the denominator does not depend on C and the values of the features F_i are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model

p(C, F_1, \dots, F_n)\,

which can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F_1, \dots, F_n)\,
= p(C) \ p(F_1,\dots,F_n\vert C)
= p(C) \ p(F_1\vert C) \ p(F_2,\dots,F_n\vert C, F_1)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3,\dots,F_n\vert C, F_1, F_2)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3\vert C, F_1, F_2) \ p(F_4,\dots,F_n\vert C, F_1, F_2, F_3)

and so forth. Now the "naive" conditional independence assumption comes into play: assume that each feature F_i is conditionally independent of every other feature F_j for j\neq i. This means that

p(F_i \vert C, F_j) = p(F_i \vert C)\,

and the joint model can be expressed as

p(C, F_1, \dots, F_n)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C) \ p(F_3\vert C) \ \dots
= p(C) \prod_{i=1}^n p(F_i \vert C)

This means that under the independence assumption above, the conditional distribution over the class variable C can be expressed as:

p(C \vert F_1,\dots,F_n) = \frac{1}{Z}  p(C) \prod_{i=1}^n p(F_i \vert C)

where Z is a scaling factor dependent only on F_1,\dots,F_n, i.e., a constant if the values of the feature variables are known.
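The posterior above can be computed directly: multiply the prior by the per-feature conditionals, then normalize by Z. A minimal sketch with binary features, where the class names and all probability values are invented for illustration:

```python
# Normalized naive Bayes posterior p(C | F_1..F_n); toy parameters only.
priors = {"spam": 0.4, "ham": 0.6}       # p(C)
likelihoods = {                          # p(F_i = 1 | C) for two binary features
    "spam": [0.8, 0.3],
    "ham":  [0.1, 0.5],
}

def posterior(features):
    """Return p(C | F_1..F_n) for a binary feature vector."""
    unnorm = {}
    for c in priors:
        p = priors[c]
        for f, p1 in zip(features, likelihoods[c]):
            p *= p1 if f == 1 else (1.0 - p1)
        unnorm[c] = p
    z = sum(unnorm.values())             # the scaling factor Z
    return {c: p / z for c, p in unnorm.items()}

print(posterior([1, 0]))
```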

Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(F_i\vert C). If there are k classes and if a model for each p(F_i) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k - 1) + n r k parameters. In practice, k=2 (binary classification) and r=1 (Bernoulli variables as features) are common, and the total number of parameters of the naive Bayes model is then 2n+1, where n is the number of binary features used for prediction.

Parameter estimation

In a supervised learning setting, one wants to estimate the parameters of the distribution model. Because of the independent feature assumption, it suffices to estimate the class prior and the conditional feature models separately, using the method of maximum likelihood, Bayesian inference or other parameter estimation procedures.
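For Bernoulli features, the maximum-likelihood estimates are just relative frequencies: the prior is each class's share of the training set, and p(F_i = 1 | C) is the fraction of class-C examples with feature i set. A sketch over an invented toy data set:

```python
# Maximum-likelihood estimation for a naive Bayes model with binary features.
from collections import Counter

# Toy training set: (class label, binary feature vector). Invented data.
data = [
    ("spam", [1, 1]), ("spam", [1, 0]),
    ("ham",  [0, 1]), ("ham",  [0, 0]), ("ham",  [0, 1]),
]

def fit(data):
    n_features = len(data[0][1])
    class_counts = Counter(label for label, _ in data)
    total = len(data)
    priors = {c: k / total for c, k in class_counts.items()}  # ML estimate of p(C)
    likelihoods = {}
    for c, k in class_counts.items():
        ones = [0] * n_features
        for label, feats in data:
            if label == c:
                for i, f in enumerate(feats):
                    ones[i] += f
        likelihoods[c] = [o / k for o in ones]                # ML estimate of p(F_i = 1 | C)
    return priors, likelihoods

priors, likelihoods = fit(data)
print(priors)
print(likelihoods)
```

Note that the raw relative-frequency estimate assigns probability zero to any feature value never seen with a class; in practice one usually smooths these counts.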

Constructing a classifier from the probability model

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier is the function \mathit{classify} defined as follows:

\mathit{classify}(f_1,\dots,f_n) = \mathop{\mathrm{argmax}}_c \ p(C=c) \prod_{i=1}^n p(F_i=f_i\vert C=c)
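The MAP rule transcribes almost directly into code: score each class by its prior times the product of feature conditionals, and return the argmax. The parameters below are invented toy values.

```python
# MAP decision rule for a naive Bayes classifier; toy parameters only.
priors = {"spam": 0.4, "ham": 0.6}                       # p(C = c)
likelihoods = {"spam": [0.8, 0.3], "ham": [0.1, 0.5]}    # p(F_i = 1 | C = c)

def classify(features):
    def score(c):
        p = priors[c]
        for f, p1 in zip(features, likelihoods[c]):
            p *= p1 if f == 1 else (1.0 - p1)
        return p
    return max(priors, key=score)                        # argmax over classes c

print(classify([1, 0]))   # "spam" wins under these toy parameters
```

Since the denominator Z is the same for every class, it can be dropped from the scores without changing the argmax.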

Discussion

The naive Bayes classifier has several properties that make it surprisingly useful in practice, despite the fact that the far-reaching independence assumptions are often violated. Like all probabilistic classifiers under the MAP decision rule, it arrives at the correct classification as long as the correct class is more probable than any other class; class probabilities do not have to be estimated very well. In other words, the overall classifier is robust to serious deficiencies of its underlying naive probability model. Other reasons for the observed success of the naive Bayes classifier are discussed in the literature cited below.

In real life, the naive Bayes approach is more powerful than might be expected from the extreme simplicity of its model; in particular, it is fairly robust in the presence of non-independent attributes. Recent theoretical analysis has shown why the naive Bayes classifier is so robust.

Example: document classification

Here is a worked example of naive Bayesian classification applied to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam e-mails. Imagine that documents are drawn from a number of classes of documents which can be modelled as sets of words, where the (independent) probability that the i-th word of a given document occurs in a document from class C can be written as

p(w_i \vert C)\,

(For this treatment, we simplify things further by assuming that the probability of a word in a document is independent of the length of a document, or that all documents are of the same length).

Then the probability of a given document D, given a class C, is

p(D\vert C)=\prod_i p(w_i \vert C)\,
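Under the bag-of-words assumptions stated above, this likelihood is just a product of per-word probabilities. A minimal sketch, with invented word probabilities for one class:

```python
# p(D | C) as a product of per-word probabilities; toy values only.
from math import prod

word_probs = {"buy": 0.2, "now": 0.1, "hello": 0.05}   # p(w_i | C), invented

def doc_likelihood(words, word_probs):
    """Likelihood of a document (list of words) under one class model."""
    return prod(word_probs[w] for w in words)

print(doc_likelihood(["buy", "now"], word_probs))   # 0.2 * 0.1
```

In real implementations this product underflows quickly for long documents, which is one reason the log-space formulation further below is preferred.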

The question that we desire to answer is: "what is the probability that a given document D belongs to a given class C?"

Now, by the definition of conditional probability (see Probability axiom),

p(D\vert C)={p(D\cap C)\over p(C)}

and

p(C\vert D)={p(D\cap C)\over p(D)}

Bayes' theorem manipulates these into a statement of probability in terms of likelihood.

p(C\vert D)={p(C)\over p(D)}\,p(D\vert C)


Assume for the moment that there are only two classes, S and ¬S.

p(D\vert S)=\prod_i p(w_i \vert S)\,

and

p(D\vert\neg S)=\prod_i p(w_i\vert\neg S)\,

Using the Bayesian result above, we can write:

p(S\vert D)={p(S)\over p(D)}\,\prod_i p(w_i \vert S)
p(\neg S\vert D)={p(\neg S)\over p(D)}\,\prod_i p(w_i \vert\neg S)

Dividing one by the other gives:

{p(S\vert D)\over p(\neg S\vert D)}={p(S)\,\prod_i p(w_i \vert S)\over p(\neg S)\,\prod_i p(w_i \vert\neg S)}

This can be re-factored as:

{p(S)\over p(\neg S)}\,\prod_i {p(w_i \vert S)\over p(w_i \vert\neg S)}

Thus, the probability ratio p(S | D) / p(¬S | D) can be expressed in terms of a series of likelihood ratios. The actual probability p(S | D) can be easily computed from log (p(S | D) / p(¬S | D)) based on the observation that p(S | D) + p(¬S | D) = 1.

Taking the logarithm of all these ratios, we have:

\ln{p(S\vert D)\over p(\neg S\vert D)}=\ln{p(S)\over p(\neg S)}+\sum_i \ln{p(w_i\vert S)\over p(w_i\vert\neg S)}

This technique of "log-likelihood ratios" is a common technique in statistics. In the case of two mutually exclusive alternatives (such as this example), the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.
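The whole two-class pipeline above fits in a few lines: sum the log-likelihood ratios, then push the log odds through a sigmoid to recover p(S | D), using p(S | D) + p(¬S | D) = 1. All probabilities below are invented for illustration.

```python
# Log-likelihood-ratio spam test; all probability values are toy numbers.
from math import log, exp

p_spam, p_ham = 0.4, 0.6            # p(S), p(not S)
w_spam = {"buy": 0.2, "now": 0.1}   # p(w | S)
w_ham  = {"buy": 0.02, "now": 0.05} # p(w | not S)

def log_odds(words):
    """ln(p(S|D) / p(not S|D)) as prior log odds plus per-word log ratios."""
    lo = log(p_spam / p_ham)
    for w in words:
        lo += log(w_spam[w] / w_ham[w])
    return lo

def p_spam_given_doc(words):
    """Sigmoid of the log odds recovers the actual probability p(S | D)."""
    return 1.0 / (1.0 + exp(-log_odds(words)))

print(p_spam_given_doc(["buy", "now"]))
```

Working with sums of logarithms rather than products of probabilities also avoids the numerical underflow that the raw product form suffers on long documents.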

See also

References

  • Pedro Domingos and Michael Pazzani. "On the optimality of the simple Bayesian classifier under zero-one loss". Machine Learning, 29:103–130, 1997. (also online at CiteSeer: [1])
  • Irina Rish. "An empirical study of the naive Bayes classifier". IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. (available online: PDF, PostScript)

External links