**From:** Ben Goertzel (*ben@goertzel.org*)

**Date:** Fri Aug 30 2002 - 09:43:38 MDT

**Next message:** Ben Goertzel: "RE: Metarationality (was: JOIN: Alden Streeter)"
**Previous message:** Christian L.: "Autistic savants (was: Metarationality)"
**In reply to:** Christian L.: "RE: Bayesian Pop Quiz"
**Messages sorted by:** [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Christian wrote:

> Bayes' Theorem supposes that we have a universal set U which is
> subdivided into disjoint subsets H_1, ..., H_n. Then, given an event A,
> the probability of H_i when A has happened, P(H_i | A), can be
> calculated as
>
> P(H_i | A) = P(H_i)*P(A | H_i) / (\sum_j P(H_j)*P(A | H_j))
>
> From what I have heard, the controversy that sometimes arises out of
> the use of this theorem is due to the fact that the probabilities P(H_j)
> are often very difficult to calculate, so you can distort your data by
> setting the probabilities P(H_j) in a sloppy fashion.
>
> Am I correct in saying that the different Bayesian philosophies are
> concerned with methods of setting these probabilities (are these the
> "priors" you discuss?) in a careful way? Or is this too simplistic?

From http://ic.arc.nasa.gov/ic/projects/bayes-group/html/bayes-theorem-long.html :

"Bayes' theorem gives the rule for updating belief in a Hypothesis H (i.e.
the probability of H) given additional evidence E, and background
information (context) I:

    p(H|E,I) = p(H|I)*p(E|H,I)/p(E|I)    [Bayes Rule]

The left-hand term, p(H|E,I), is called the posterior probability, and it
gives the probability of the hypothesis H after considering the effect of
evidence E in context I. The p(H|I) term is just the prior probability of H
given I alone; that is, the belief in H before the evidence E is considered.
The term p(E|H,I) is called the likelihood, and it gives the probability of
the evidence assuming the hypothesis H and background information I is true.
The last term, 1/p(E|I), is independent of H, and can be regarded as a
normalizing or scaling constant. The information I is a conjunction of (at
least) all of the other statements relevant to determining p(H|I) and
p(E|I)."
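To make the roles of prior, likelihood, and normalizing constant concrete, here is a minimal numeric sketch of the rule above. The disease-test numbers are invented for illustration; they are not from the quoted page.

```python
# A numeric instance of Bayes' rule, p(H|E) = p(H)*p(E|H)/p(E),
# for a binary hypothesis. All numbers below are hypothetical.

def posterior(prior_h, lik_e_given_h, lik_e_given_not_h):
    """Return p(H|E) via Bayes' rule for a yes/no hypothesis H."""
    # p(E) expands over the two disjoint cases H and not-H
    p_e = prior_h * lik_e_given_h + (1 - prior_h) * lik_e_given_not_h
    return prior_h * lik_e_given_h / p_e

# A condition with 1% prevalence, a test with 95% sensitivity and a
# 5% false-positive rate: one positive test yields a modest posterior.
p = posterior(0.01, 0.95, 0.05)
print(round(p, 3))  # → 0.161
```

The small posterior despite a "95% accurate" test is exactly the kind of result that makes the prior p(H|I) so consequential.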

So, yeah, it's often the setting of the priors P(H_i) [in your multivariate
example] that is controversial. MaxEnt (maximum entropy) is one way of doing
this.
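For a finite set of disjoint hypotheses and no other constraints, the MaxEnt prior is simply the uniform distribution. A small sketch of the multivariate formula from the quote, with invented likelihoods P(A|H_i):

```python
# The quoted formula: P(H_i|A) = P(H_i)*P(A|H_i) / sum_j P(H_j)*P(A|H_j),
# with the uniform (MaxEnt-under-no-constraints) prior. Likelihood
# values are hypothetical.

def posteriors(priors, likelihoods):
    """Posterior over disjoint hypotheses H_1..H_n given an event A."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)  # the normalizing sum over j
    return [j / z for j in joint]

n = 4
uniform_prior = [1.0 / n] * n        # MaxEnt prior over n hypotheses
likelihoods = [0.8, 0.4, 0.2, 0.1]   # invented P(A|H_i) values

post = posteriors(uniform_prior, likelihoods)
print([round(p, 3) for p in post])  # → [0.533, 0.267, 0.133, 0.067]
```

With a flat prior the posterior is just the normalized likelihood; a different (sloppy) choice of priors would tilt these numbers, which is the controversy Christian describes.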

Choice of Bayesian versus parametric stats methods often comes down to a
matter of taste: does one make heuristic assumptions about the priors
(MaxEnt, invariance-principle-based assumptions, etc.), or does one make a
heuristic assumption regarding what pdf one is dealing with (Gaussian,
hypergeometric, whatever...)?

Another controversial point is the making of conditional independence
assumptions. For instance, it's handy to simplify

    p(H|E1,E2,E3,I) = p(H|I)*p(E1,E2,E3|H,I) / p(E1,E2,E3|I)

                      p(H|I)*p(E1|H,I)*p(E2|E1,H,I)*p(E3|E2,E1,H,I)
                    = ---------------------------------------------
                             p(E1|I)*p(E2|E1,I)*p(E3|E2,E1,I)

by assuming the E_i are conditionally independent, both given I alone and
given H together with I; e.g.

    p(E2|E1,I) = p(E2|I) and p(E2|E1,H,I) = p(E2|H,I)

and similarly for E3. But it's not always correct...
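Under those independence assumptions the chained likelihoods collapse into per-item factors, which is the "naive Bayes" form. A minimal sketch for a binary hypothesis (all numbers invented):

```python
# p(H|E1,...,En) under conditional independence of the E_i given H
# and given not-H. The normalization over the two cases plays the
# role of dividing by p(E1,E2,E3|I). All numbers are hypothetical.

def naive_bayes_posterior(prior_h, lik_given_h, lik_given_not_h):
    """Posterior for binary H given evidence items assumed independent."""
    p_h, p_not = prior_h, 1 - prior_h
    for lh, ln in zip(lik_given_h, lik_given_not_h):
        p_h *= lh      # accumulate the p(E_i|H,I) factors
        p_not *= ln    # accumulate the p(E_i|not-H,I) factors
    return p_h / (p_h + p_not)

# Three evidence items, each individually favoring H:
p = naive_bayes_posterior(0.5, [0.9, 0.8, 0.7], [0.2, 0.3, 0.4])
print(round(p, 3))  # → 0.955
```

When the evidence items are in fact correlated, this factorization over-counts them, which is why the assumption is "not always correct."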

This leads one into Bayesian networks, a popular AI technique in which one
constructs a directed acyclic graph (DAG) of events, so that any two events
in the graph are independent conditional on their ancestors in the graph.
See

http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html

for basic Bayes nets info. Cyc, for example, uses Bayes nets ideas to make
parts of its knowledge base probabilistic.
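To make the DAG factorization concrete, here is a toy chain-structured net, Cloudy -> Rain -> WetGrass, where the joint decomposes along the graph as p(c,r,w) = p(c)*p(r|c)*p(w|r). The conditional probability tables are invented:

```python
from itertools import product

# A three-node Bayes net as a DAG factorization. CPT values are
# hypothetical, chosen only to illustrate the mechanics.
p_cloudy = {True: 0.5, False: 0.5}
p_rain_given_cloudy = {True: 0.8, False: 0.1}  # p(rain=True | cloudy)
p_wet_given_rain = {True: 0.9, False: 0.2}     # p(wet=True | rain)

def joint(c, r, w):
    """p(c, r, w) = p(c) * p(r|c) * p(w|r), read off the DAG."""
    pr = p_rain_given_cloudy[c] if r else 1 - p_rain_given_cloudy[c]
    pw = p_wet_given_rain[r] if w else 1 - p_wet_given_rain[r]
    return p_cloudy[c] * pr * pw

# Marginal p(wet=True): sum the joint over the unobserved nodes.
p_wet = sum(joint(c, r, True) for c, r in product([True, False], repeat=2))
print(round(p_wet, 3))  # → 0.515
```

The point of the DAG is exactly that the joint never has to be stored whole: each node carries only a table conditioned on its parents.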

A problem with Bayes nets is that real knowledge bases often aren't easily
decomposable into DAG hierarchies. Thus, there have arisen things like
"loopy Bayes nets." My own AI system Novamente uses a variant of
probabilistic inference called Probabilistic Term Logic (PTL), which we
created ourselves, and which is vaguely along the lines of loopy Bayes nets,
but fits better into an integrative AI framework.

Specifically, whereas Bayes nets (even loopy ones) assume all inference
occurs within a single universal set U, PTL allows for a distributed network
of inferences, each of which may occur within a different U. So it doesn't
assume a single consistent probability model, but rather a family of
overlapping probability models.

In Novamente, some probabilities are detected by "direct evaluation of
evidence" (which includes the results of some nonprobabilistic cognitive
methods). Then other probabilities are extrapolated from these using
probability-theoretic rules (which incorporate Bayes' rule among other
algebraic identities...). The "nonprobabilistic cognitive methods," from a
Bayesian perspective, could be interpreted as setting prior probabilities.
This is not how we usually think about the system's operations, though...
We usually think of it as a family of cognitive, perceptual and action
processes going on in the system, cooperating in revising the same pool of
procedural and declarative knowledge, with explicitly probabilistic methods
being just one member of the family. Eliezer points out that all the members
of the family can in principle be viewed in probabilistic terms, and it's
true, but I don't find this observation all that useful.

-- Ben G


*This archive was generated by hypermail 2.1.5: Wed Jul 17 2013 - 04:00:40 MDT*