# Bayes' theorem


Bayes' theorem is a result in probability theory, which gives the conditional probability distribution of a random variable A given B in terms of the conditional probability distribution of the variable B given A and the marginal probability distribution of A alone.

In the context of Bayesian probability theory and statistical inference, the marginal probability distribution of A alone is usually called the prior probability distribution or simply the prior. The conditional distribution of A given the "data" B is called the posterior probability distribution or just the posterior.

As a mathematical theorem, Bayes' theorem is valid regardless of whether one adopts a frequentist or a Bayesian interpretation of probability. However, there is disagreement as to what kinds of variables can be substituted for A and B in the theorem; this topic is treated at greater length in the articles on Bayesian probability and frequentist probability.

## Historical remarks

Bayes' theorem is named after the Reverend Thomas Bayes (1702–61). Bayes worked on the problem of computing a distribution for the parameter of a binomial distribution (to use modern terminology); his work was edited and presented posthumously (1763) by his friend Richard Price, in An Essay towards solving a Problem in the Doctrine of Chances. Bayes' results were replicated and extended by Laplace in an essay of 1774, who apparently was not aware of Bayes' work.

One of Bayes' results (Proposition 5) gives a simple description of conditional probability, and shows that it does not depend on the order in which things occur:

> If there be two subsequent events, the probability of the second b/N and the probability of both together P/N, and it being first discovered that the second event has also happened, the probability I am right [i.e. the conditional probability of the first event being true given that the second has happened] is P/b.

The main result (Proposition 9 in the essay) derived by Bayes is the following: assuming a uniform distribution for the prior distribution of the binomial parameter p, the probability that p is between two values a and b is

${\displaystyle {\frac {\int _{a}^{b}{\begin{pmatrix}n+m\\m\end{pmatrix}}p^{m}(1-p)^{n}\,dp}{\int _{0}^{1}{\begin{pmatrix}n+m\\m\end{pmatrix}}p^{m}(1-p)^{n}\,dp}}}$

where m is the number of observed successes and n the number of observed failures. His preliminary results, in particular Propositions 3, 4, and 5, imply the result now called Bayes' Theorem (as described below), but it does not appear that Bayes himself emphasized or focused on that result.
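Proposition 9's ratio of integrals can be evaluated numerically. A minimal sketch, assuming a simple midpoint-rule integrator (the function name `prop9_probability` is illustrative, not from the essay):

```python
def prop9_probability(m, n, a, b, steps=100_000):
    """Posterior probability that the binomial parameter p lies in (a, b),
    given m observed successes and n observed failures, under a uniform
    prior on p (Bayes' Proposition 9), by midpoint-rule integration."""
    def integral(lo, hi):
        h = (hi - lo) / steps
        # The binomial coefficient cancels between numerator and denominator,
        # so it suffices to integrate p^m * (1 - p)^n.
        total = 0.0
        for i in range(steps):
            p = lo + (i + 0.5) * h
            total += p ** m * (1 - p) ** n * h
        return total
    return integral(a, b) / integral(0.0, 1.0)
```

For instance, with 7 observed successes and 3 failures, `prop9_probability(7, 3, 0.5, 1.0)` is about 0.887, matching the referendum example worked later in the article.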

What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter p. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter p depend on a random event, he cleverly escapes a philosophical quagmire that he most likely was not even aware was an issue.

## Statement of Bayes' theorem

Bayes' theorem is a relation among conditional and marginal probabilities. It can be viewed as a means of incorporating information, from an observation, for example, to produce a modified or updated probability distribution.

Suppose the marginal probability density function or probability mass function of a random variable X is

${\displaystyle f_{X}(x)\;}$

(be very careful to distinguish between the capital X and the lower-case x above!). This is the prior probability distribution of X. Suppose the conditional probability density function or probability mass function of Y given X = x (a function of y) is

${\displaystyle f_{Y\mid X=x}(y).}$

As a function of x, this is the likelihood function

${\displaystyle L_{X\mid Y=y}(x)=f_{Y\mid X=x}(y).}$

The likelihood function is not a probability density function or a probability mass function for X, since it need not integrate (or sum) over x to produce 1.

Bayes' theorem says:

To get the posterior probability distribution of X (i.e., the conditional probability distribution of X given Y), multiply the prior probability density function (or mass function) for X by the likelihood function, and then normalize to produce a probability distribution.

"Normalize" means to multiply or divide by a constant to make the resulting function a probability density function or a probability mass function. Thus the posterior probability density function is

${\displaystyle f_{X\mid Y=y}(x)={f_{X}(x)L_{X\mid Y=y}(x) \over {\mbox{constant}}}.}$

The normalizing constant in the denominator is

${\displaystyle \int _{-\infty }^{\infty }f_{X}(x)L_{X\mid Y=y}(x)\,dx.}$

In the discrete case, one would have a sum rather than an integral. If one takes the measure-theoretic viewpoint, either is an integral.
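In the discrete case, the recipe above — multiply the prior mass function by the likelihood, then normalize — is a few lines of code. A sketch, where the coin-identity scenario and all of its numbers are illustrative assumptions:

```python
def posterior(prior, likelihood):
    """Bayes' theorem for a discrete X: multiply the prior mass function by
    the likelihood, then divide by the normalizing constant (the sum) so
    the result is again a probability mass function."""
    unnormalized = {x: prior[x] * likelihood[x] for x in prior}
    constant = sum(unnormalized.values())  # the normalizing constant
    return {x: v / constant for x, v in unnormalized.items()}

# Hypothetical example: is a coin fair or biased, after observing one head?
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5, "biased": 0.9}   # P(heads | coin type)
print(posterior(prior, likelihood))          # "biased" becomes more probable
```

Note that the likelihood values need not sum to 1 over the hypotheses; the final normalization takes care of that, exactly as described above.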

### Example

Suppose the proportion R of voters who will vote "yes" in a referendum is uniformly distributed between 0 and 1. That is the prior probability distribution of R. A random sample of 10 voters is taken, and it is found that seven of them will vote "yes". The conditional distribution of the number X of voters in this small sample who will vote "yes", given that (capital) R is some particular number (lower-case) r, is a binomial distribution with parameters 10 and r, i.e., it is the distribution of the number of "successes" in 10 independent Bernoulli trials with probability r of success on each trial. One therefore has

${\displaystyle f_{X\mid R=r}(x)={10 \choose x}r^{x}(1-r)^{10-x}.}$

Since X was observed to be 7, the likelihood function is

${\displaystyle L(r)={10 \choose 7}r^{7}(1-r)^{3}}$

for 0 ≤ r ≤ 1. The prior probability density function is

${\displaystyle f_{R}(r)=1\ {\mbox{if}}\ 0\leq r\leq 1}$

and 0 otherwise. Multiplying the prior by the likelihood, we get

${\displaystyle f_{R}(r)L(r)={10 \choose 7}r^{7}(1-r)^{3}}$

if 0 ≤ r ≤ 1, and 0 otherwise. Integrating, we get

${\displaystyle \int _{0}^{1}r^{7}(1-r)^{3}\,dr=1/1320,}$

so the posterior probability density function is

${\displaystyle f_{R\mid X=7}(r)=1320r^{7}(1-r)^{3}}$

for r between 0 and 1, and 0 otherwise.
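The value 1/1320 can be checked against the standard beta-integral identity (a well-known result, stated here for convenience):

```latex
\int_0^1 r^m (1-r)^n \, dr = \frac{m!\,n!}{(m+n+1)!},
\qquad
\int_0^1 r^7 (1-r)^3 \, dr = \frac{7!\,3!}{11!} = \frac{30240}{39916800} = \frac{1}{1320}.
```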

One may be interested in the probability that more than half the voters will vote "yes". The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution. The posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll—that seven of the 10 voters questioned will vote "yes"—is

${\displaystyle 1320\int _{1/2}^{1}r^{7}(1-r)^{3}\,dr=0.88671875}$

about an "89% chance".
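The posterior probability 0.88671875 is exactly 227/256, which can be confirmed with exact rational arithmetic by expanding (1 − r)^3 and integrating term by term. A verification sketch (the function name is illustrative):

```python
from fractions import Fraction
from math import comb

def yes_majority_probability():
    """Exact P(R > 1/2 | X = 7) = 1320 * integral_{1/2}^{1} r^7 (1-r)^3 dr,
    computed by expanding (1-r)^3 binomially and integrating each power."""
    total = Fraction(0)
    for k in range(4):                        # (1-r)^3 = sum_k C(3,k) (-r)^k
        power = 7 + k
        term = Fraction(comb(3, k) * (-1) ** k)
        # integral_{1/2}^{1} r^power dr = (1 - (1/2)^(power+1)) / (power + 1)
        term *= (1 - Fraction(1, 2 ** (power + 1))) / (power + 1)
        total += term
    return 1320 * total

print(yes_majority_probability())             # 227/256
print(float(yes_majority_probability()))      # 0.88671875
```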

## Derivation in the discrete case

To derive Bayes' theorem in the discrete case, note first from the definition of conditional probability that

${\displaystyle P(A|B)P(B)=P(A,B)=P(B|A)P(A)\,}$

denoting by P(A,B) the joint probability of A and B. Dividing the left- and right-hand sides by P(B), we obtain

${\displaystyle P(A|B)={\frac {P(B|A)P(A)}{P(B)}}}$

which is Bayes' theorem.

Each term in Bayes' theorem has a conventional name. The term P(A) is called the prior probability of A. It is "prior" in the sense that it precedes any information about B. P(A) is also the marginal probability of A. The term P(A|B) is called the posterior probability of A, given B. It is "posterior" in the sense that it is derived from or entailed by the specified value of B. The term P(B|A), for a specific value of B, is called the likelihood function for A given B and can also be written as L(A|B). The term P(B) is the prior or marginal probability of B, and acts as the normalizing constant.

### Alternative forms of Bayes' theorem

Bayes' theorem is often embellished by noting that

${\displaystyle P(B)=P(A,B)+P(A^{C},B)=P(B|A)P(A)+P(B|A^{C})P(A^{C})\,}$

so the theorem can be restated as

${\displaystyle P(A|B)={\frac {P(B|A)P(A)}{P(B|A)P(A)+P(B|A^{C})P(A^{C})}}\,,}$

where A^C is the complementary event of A. More generally, where {Ai} forms a partition of the event space,

${\displaystyle P(A_{i}|B)={\frac {P(B|A_{i})P(A_{i})}{\sum _{j}P(B|A_{j})P(A_{j})}}\,,}$

for any Ai in the partition.

See also the law of total probability.
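The partition form can be computed directly; its denominator is exactly the law of total probability. A sketch, where the function name and the three-way partition numbers are illustrative assumptions:

```python
def posteriors_over_partition(priors, likelihoods):
    """Given priors P(A_i) for a partition {A_i} and likelihoods P(B | A_i),
    return the posteriors P(A_i | B) via Bayes' theorem."""
    # Law of total probability: P(B) = sum_j P(B | A_j) P(A_j)
    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / p_b for p, l in zip(priors, likelihoods)]

# Hypothetical three-event partition:
print(posteriors_over_partition([0.5, 0.3, 0.2], [0.9, 0.5, 0.1]))
```

Because the same denominator divides every term, the posteriors always sum to 1 regardless of how the likelihoods are scaled.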

### Bayes' theorem for probability densities

There is also a version of Bayes' theorem for continuous distributions. It is somewhat harder to derive, since probability densities, strictly speaking, are not probabilities, so Bayes' theorem has to be established by a limit process; see Papoulis, Section 7.3, for an elementary derivation. Bayes' theorem for probability densities is formally similar to the theorem for probabilities:

${\displaystyle f(x|y)={\frac {f(y|x)\,f(x)}{f(y)}}}$

and, with the denominator expanded by the continuous analogue of the law of total probability,

${\displaystyle f(x|y)={\frac {f(y|x)\,f(x)}{\int _{-\infty }^{\infty }f(y|x)\,f(x)\,dx}}}$

As in the discrete case, the terms have standard names. f(x, y) is the joint distribution of X and Y, f(x|y) is the posterior distribution of X given Y=y, f(y|x) = L(x|y) is (as a function of x) the likelihood function of X given Y=y, and f(x) and f(y) are the marginal distributions of X and Y respectively, with f(x) being the prior distribution of X.

Here we have indulged in a conventional abuse of notation, using f for each one of these terms, although each one is really a different function; the functions are distinguished by the names of their arguments.

### Extensions of Bayes' theorem

Theorems analogous to Bayes' theorem hold in problems with more than two variables. These theorems are not given distinct names, as they may be mass-produced by applying the laws of probability. The general strategy is to work with a decomposition of the joint probability, and to marginalize (integrate) over the variables that are not of interest. Depending on the form of the decomposition, it may be possible to prove that some integrals must be 1, and thus they fall out of the decomposition; exploiting this property can reduce the computations very substantially. A Bayesian network is essentially a mechanism for automatically generating the extensions of Bayes' theorem that are appropriate for a given decomposition of the joint probability.

## Examples

Typical examples that use Bayes' theorem assume the philosophy underlying Bayesian probability that uncertainty and degrees of belief can be measured as probabilities. For worked-out examples, please see the article on the examples of Bayesian inference.

## References

### Versions of the essay

• Thomas Bayes (1763), "An Essay towards solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London, 53.
• Thomas Bayes (1763/1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:296-315 (Bayes's essay in modernized notation)
• Thomas Bayes "An essay towards solving a Problem in the Doctrine of Chances" (Bayes's essay in the original notation)

### Commentaries

• G.A. Barnard. (1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:293-295 (biographical remarks)
• Daniel Covarrubias "An Essay Towards Solving a Problem in the Doctrine of Chances" (an outline and exposition of Bayes's essay)
• Stephen M. Stigler (1982) "Thomas Bayes' Bayesian Inference," Journal of the Royal Statistical Society, Series A, 145:250-258 (Stigler argues for a revised interpretation of the essay; recommended)
• Isaac Todhunter (1865) A History of the Mathematical Theory of Probability from the time of Pascal to that of Laplace, Macmillan. Reprinted 1949, 1956 by Chelsea and 2001 by Thoemmes.