Bayesian Learning


  • Provides practical learning algorithms
  • Provides a useful conceptual framework

Bayes Theorem



\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

Choosing Hypotheses



\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

Generally want the most probable hypothesis given the training data

Maximum a posteriori hypothesis hMAP:

\begin{eqnarray*}
h_{MAP} &=& \arg \max_{h \in H} P(h\vert D) \\
 &=& \arg \max_{h \in H} \frac{P(D\vert h) P(h)}{P(D)} \\
 &=& \arg \max_{h \in H} P(D\vert h) P(h)
\end{eqnarray*}


If we assume $P(h_i) = P(h_j)$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis


\begin{displaymath}h_{ML} = \arg \max_{h_{i} \in H} P(D\vert h_{i}) \end{displaymath}

Bayes Theorem


Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only $98\%$ of the cases in which the disease is actually present, and a correct negative result in only $97\%$ of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008
P($\neg$ cancer) = .992
P(+ $\mid$ cancer) = .98
P(- $\mid$ cancer) = .02
P(+ $\mid$ $\neg$ cancer) = .03
P(- $\mid$ $\neg$ cancer) = .97
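
A worked computation from the numbers above, for a patient whose test comes back positive:

\begin{displaymath}P(+\vert cancer)P(cancer) = .98 \times .008 = .0078 \end{displaymath}

\begin{displaymath}P(+\vert \neg cancer)P(\neg cancer) = .03 \times .992 = .0298 \end{displaymath}

so $h_{MAP} = \neg cancer$; normalizing by the sum gives $P(cancer \vert +) = .0078/(.0078 + .0298) \approx .21$.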

Basic Formulas for Probabilities
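
The standard identities used throughout are the product rule, sum rule, and theorem of total probability:

\begin{displaymath}P(A \wedge B) = P(A \vert B) P(B) = P(B \vert A) P(A) \end{displaymath}

\begin{displaymath}P(A \vee B) = P(A) + P(B) - P(A \wedge B) \end{displaymath}

\begin{displaymath}\mbox{if } A_{1}, \ldots, A_{n} \mbox{ are mutually exclusive with } \sum_{i=1}^{n} P(A_{i}) = 1 \mbox{, then } P(B) = \sum_{i=1}^{n} P(B \vert A_{i}) P(A_{i}) \end{displaymath}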


Brute Force MAP Hypothesis Learner


1.
For each hypothesis h in H, calculate the posterior probability

\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

2.
Output the hypothesis hMAP with the highest posterior probability

\begin{displaymath}h_{MAP} = \mathop{\rm argmax}_{h \in H} P(h\vert D)\end{displaymath}
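
A minimal Python sketch of this brute-force learner, assuming the hypothesis space is small enough to enumerate and that the caller supplies prior(h) = P(h) and likelihood(data, h) = P(D|h) (hypothetical helper names, not part of the algorithm statement above):

def brute_force_map(hypotheses, data, prior, likelihood):
    """Return (h_MAP, dict of posteriors P(h|D)) by scoring every hypothesis."""
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}  # P(D|h) P(h)
    z = sum(scores.values())                        # P(D), via total probability
    posteriors = {h: s / z for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors

# Example: which coin bias best explains 8 heads in 10 flips?
hypotheses = [0.3, 0.5, 0.7]                        # candidate values of P(heads)
data = [1] * 8 + [0] * 2
prior = lambda h: 1.0 / len(hypotheses)             # uniform prior over H

def likelihood(d, h):                               # P(D|h) for i.i.d. coin flips
    p = 1.0
    for flip in d:
        p *= h if flip else (1 - h)
    return p

print(brute_force_map(hypotheses, data, prior, likelihood)[0])   # -> 0.7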

Most Probable Classification of New Instances


So far we've sought the most probable hypothesis given the data D (i.e., hMAP)


Given new instance x, what is its most probable classification?


Consider:

Bayes Optimal Classifier


Bayes optimal classification:


\begin{displaymath}\arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\vert h_{i}) P(h_{i}\vert D)\end{displaymath}

Example:


P(h1|D)=.4, P(-|h1)=0, P(+|h1)=1  
P(h2|D)=.3, P(-|h2)=1, P(+|h2)=0  
P(h3|D)=.3, P(-|h3)=1, P(+|h3)=0  

therefore
\begin{displaymath}\sum_{h_{i} \in H} P(+\vert h_{i}) P(h_{i}\vert D) = .4 \qquad \sum_{h_{i} \in H} P(-\vert h_{i}) P(h_{i}\vert D) = .6 \end{displaymath}

and

\begin{displaymath}\arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\vert h_{i}) P(h_{i}\vert D) = - \end{displaymath}
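
A minimal Python sketch of this weighted vote, assuming the posteriors P(h_i|D) and the per-hypothesis predictions P(v_j|h_i) are supplied as plain dictionaries (an illustrative data layout, not prescribed by the slides):

def bayes_optimal(values, posteriors, predictions):
    """values: class values v_j; posteriors: h -> P(h|D);
    predictions: (v, h) -> P(v|h). Returns the value with the largest
    posterior-weighted support."""
    def support(v):
        return sum(predictions[(v, h)] * p_h for h, p_h in posteriors.items())
    return max(values, key=support)

# The three-hypothesis example above:
posteriors = {"h1": .4, "h2": .3, "h3": .3}
predictions = {("+", "h1"): 1, ("-", "h1"): 0,
               ("+", "h2"): 0, ("-", "h2"): 1,
               ("+", "h3"): 0, ("-", "h3"): 1}
print(bayes_optimal(["+", "-"], posteriors, predictions))   # -> "-"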

Naive Bayes Classifier


Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.


When to use

Successful applications:

Naive Bayes Classifier


Assume target function $f: X \rightarrow V$, where each instance x described by attributes $\langle a_{1}, a_{2} \ldots a_{n} \rangle$.

Most probable value of f(x) is:

\begin{eqnarray*}
v_{MAP} &=& \mathop{\rm argmax}_{v_{j} \in V} P(v_{j} \vert a_{1}, a_{2} \ldots a_{n}) \\
 &=& \mathop{\rm argmax}_{v_{j} \in V} \frac{P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) P(v_{j})}{P(a_{1}, a_{2} \ldots a_{n})} \\
 &=& \mathop{\rm argmax}_{v_{j} \in V} P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) P(v_{j})
\end{eqnarray*}


Naive Bayes assumption:

\begin{displaymath}P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) = \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

which gives


\begin{displaymath}\mbox{\bf Naive Bayes classifier: } v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

Naive Bayes Algorithm


Naive_Bayes_Learn(examples)

For each target value vj
$\hat{P}(v_j) \leftarrow$ estimate P(vj)
For each attribute value ai of each attribute a
$\hat{P}(a_i\vert v_j) \leftarrow$ estimate P(ai|vj)







Classify_New_Instance(x)

\begin{displaymath}v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} \hat{P}(v_{j}) \prod_{a_i \in x} \hat{P}(a_{i} \vert v_{j}) \end{displaymath}
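
A minimal Python sketch of these two procedures, assuming each example arrives as a pair (attribute dict, target value) and probabilities are estimated by simple relative frequencies:

from collections import defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attrs, v), where attrs maps attribute name -> value."""
    class_counts = defaultdict(int)                 # count of each target value v_j
    cond_counts = defaultdict(int)                  # count of (v_j, attribute, value)
    for attrs, v in examples:
        class_counts[v] += 1
        for a, val in attrs.items():
            cond_counts[(v, a, val)] += 1
    n = len(examples)
    p_v = {v: c / n for v, c in class_counts.items()}                   # P^(v_j)
    p_a = {k: c / class_counts[k[0]] for k, c in cond_counts.items()}   # P^(a_i|v_j)
    return p_v, p_a

def classify_new_instance(x, p_v, p_a):
    """x: dict attribute name -> value. Returns v_NB."""
    def score(v):
        s = p_v[v]
        for a, val in x.items():
            s *= p_a.get((v, a, val), 0.0)          # unseen value -> 0 (see Subtleties)
        return s
    return max(p_v, key=score)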

Naive Bayes: Example


Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Consider PlayTennis again, and new instance


\begin{displaymath}\langle Outlk=sun, Temp=cool, Humid=high, Wind=strong \rangle \end{displaymath}


Want to compute:

\begin{displaymath}v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}
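
The estimates needed, read directly from the table:

\begin{displaymath}P(y) = 9/14 = .64 \qquad P(n) = 5/14 = .36 \end{displaymath}

\begin{displaymath}P(sun\vert y) = 2/9 \quad P(cool\vert y) = 3/9 \quad P(high\vert y) = 3/9 \quad P(strong\vert y) = 3/9 \end{displaymath}

\begin{displaymath}P(sun\vert n) = 3/5 \quad P(cool\vert n) = 1/5 \quad P(high\vert n) = 4/5 \quad P(strong\vert n) = 3/5 \end{displaymath}

which give the two products below.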



\begin{displaymath}P(y)\ P(sun\vert y)\ P(cool\vert y)\ P(high\vert y)\ P(strong\vert y) = .005 \end{displaymath}


\begin{displaymath}P(n)\ P(sun\vert n)\ P(cool\vert n)\ P(high\vert n)\ P(strong\vert n) = .021 \end{displaymath}



\begin{displaymath}\rightarrow v_{NB} = n \end{displaymath}

Naive Bayes: Subtleties


1.
Conditional independence assumption is often violated

\begin{displaymath}P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) = \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}


...but it works surprisingly well anyway. Note that we don't need the estimated posteriors $\hat{P}(v_{j}\vert x)$ to be correct; we need only that the argmax be preserved:

\begin{displaymath}\mathop{\rm argmax}_{v_{j} \in V} \hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \vert v_{j}) = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) P(a_{1} \ldots a_{n} \vert v_{j}) \end{displaymath}


Naive Bayes: Subtleties


2.
What if none of the training instances with target value vj have attribute value ai? Then

\begin{displaymath}\hat{P}(a_i\vert v_j) = 0 \mbox{, and...}\end{displaymath}


\begin{displaymath}\hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \vert v_{j}) = 0 \end{displaymath}

Typical solution is Bayesian estimate for $\hat{P}(a_{i} \vert v_{j})$

\begin{displaymath}\hat{P}(a_{i} \vert v_{j}) \leftarrow\frac{n_{c} + mp}{n + m} \end{displaymath}

where

  • n is the number of training examples for which $v = v_{j}$
  • $n_{c}$ is the number of those examples for which $a = a_{i}$
  • p is a prior estimate for $\hat{P}(a_{i} \vert v_{j})$
  • m is the weight given to the prior (an equivalent number of "virtual" examples)
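
For example, with $n = 5$, $n_{c} = 0$, a uniform prior $p = .5$ (two possible attribute values), and $m = 1$:

\begin{displaymath}\hat{P}(a_{i} \vert v_{j}) = \frac{0 + 1 \times .5}{5 + 1} \approx .08 \end{displaymath}

rather than 0, so a single unseen attribute value no longer forces the whole product to zero.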

Learning to Classify Text


Why?


Naive Bayes is among the most effective algorithms


What attributes shall we use to represent text documents?

Learning to Classify Text


Target concept $Interesting? : Document \rightarrow\{+,-\}$

1.
Represent each document by vector of words
2.
Learning: Use training examples to estimate P(+), P(-), $P(doc\vert +)$, and $P(doc\vert -)$

Naive Bayes conditional independence assumption:


\begin{displaymath}P(doc\vert v_j) = \prod_{i=1}^{length(doc)} P(a_i=w_k \vert v_j) \end{displaymath}

where P(ai=wk| vj) is probability that word in position i is wk, given vj


one more assumption: $P(a_i=w_k\vert v_j) = P(a_m=w_k\vert v_j), \forall i,m$

Pseudocode


LEARN_NAIVE_BAYES_TEXT(Examples, V)

1. collect all words and other tokens that occur in Examples

  • $Vocabulary \leftarrow$ all distinct words and other tokens in Examples

2. calculate the required P(vj) and P(wk|vj) probability terms

  • For each target value vj in V do
  • $docs_{j} \leftarrow$ subset of Examples for which the target value is vj

  • $P(v_{j}) \leftarrow\frac{\vert docs_{j}\vert}{\vert Examples\vert}$

  • $Text_{j} \leftarrow$ a single document created by concatenating all members of docsj

  • $n \leftarrow$ total number of words in Textj (counting duplicate words multiple times)

  • for each word wk in Vocabulary
    • $n_{k} \leftarrow$ number of times word wk occurs in Textj

    • $P(w_{k}\vert v_{j}) \leftarrow\frac{n_{k} + 1}{n + \vert Vocabulary\vert}$


CLASSIFY_NAIVE_BAYES_TEXT(Doc)

  • $positions \leftarrow$ all word positions in Doc that contain tokens found in Vocabulary

  • Return $v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) \prod_{i \in positions} P(a_{i}\vert v_{j})$
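
A minimal Python sketch of both procedures, assuming each training example is a pair (list of tokens, target value); the add-one estimate matches the $\frac{n_{k}+1}{n+\vert Vocabulary \vert}$ formula above, and logs are used only to avoid floating-point underflow on long documents:

from collections import Counter
from math import log

def learn_naive_bayes_text(examples, V):
    """examples: list of (tokens, v); V: collection of target values."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    p_v, p_w = {}, {}
    for v in V:
        docs_v = [tokens for tokens, label in examples if label == v]
        p_v[v] = len(docs_v) / len(examples)
        text_v = [w for tokens in docs_v for w in tokens]    # concatenation Text_j
        n = len(text_v)
        counts = Counter(text_v)
        for w in vocabulary:                                 # (n_k + 1)/(n + |Vocabulary|)
            p_w[(w, v)] = (counts[w] + 1) / (n + len(vocabulary))
    return vocabulary, p_v, p_w

def classify_naive_bayes_text(doc, vocabulary, p_v, p_w):
    """doc: list of tokens. Returns the most probable target value."""
    words = [w for w in doc if w in vocabulary]              # ignore unknown tokens
    def log_score(v):
        return log(p_v[v]) + sum(log(p_w[(w, v)]) for w in words)
    return max(p_v, key=log_score)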

Twenty NewsGroups


Given 1000 training documents from each group

Learn to classify new documents according to which newsgroup they came from


comp.graphics misc.forsale
comp.os.ms-windows.misc rec.autos
comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball
comp.windows.x rec.sport.hockey
   
alt.atheism sci.space
soc.religion.christian sci.crypt
talk.religion.misc sci.electronics
talk.politics.mideast sci.med
talk.politics.misc  
talk.politics.guns  


Naive Bayes: 89% classification accuracy

Article from rec.sport.hockey


Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most 
obvious candidate for pleasant surprise is Alex
Zhitnik. He came highly touted as a defensive 
defenseman, but he's clearly much more than that. 
Great skater and hard shot (though wish he were 
more accurate). In fact, he pretty much allowed 
the Kings to trade away that huge defensive 
liability Paul Coffey. Kelly Hrudey is only the 
biggest disappointment if you thought he was any 
good to begin with. But, at best, he's only a 
mediocre goaltender. A better choice would be 
Tomas Sandstrom, though not through any fault of 
his own, but because some thugs in Toronto decided

Learning Curve for 20 Newsgroups


\psfig{figure=figures/bayes-text-results.ps}

Accuracy vs. Training set size (1/3 withheld for test)

Bayesian Belief Networks


Interesting because:

  • Naive Bayes assumption of conditional independence too restrictive
  • But it's intractable without some such assumptions...
  • Bayesian belief networks describe conditional independence among subsets of variables
  $\rightarrow$ allows combining prior knowledge about (in)dependencies among variables with observed training data


(also called Bayes Nets)

Conditional Independence


Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

\begin{displaymath}(\forall x_i,y_j,z_k) \ P(X = x_i \vert Y = y_j, Z = z_k) = P(X = x_i \vert Z = z_k) \end{displaymath}

more compactly, we write

P(X | Y,Z) = P(X | Z)




Example: Thunder is conditionally independent of Rain, given Lightning

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)


Naive Bayes uses cond. indep. to justify

P(X,Y|Z) = P(X|Y,Z) P(Y|Z)
         = P(X|Z) P(Y|Z)

Bayesian Belief Network


\psfig{figure=figures/bayesnet.ps}

Network represents a set of conditional independence assertions:

  • Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
  • Directed acyclic graph

Bayesian Belief Networks


A belief network represents the dependence between variables.

* nodes: one node per random variable

* links: direct dependence of a variable on its parents

* conditional probability tables: for each node, $P(\mbox{node} \vert \mbox{Parents(node)})$
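
A minimal Python sketch of such a representation, using a small hypothetical three-variable network (the variable names are illustrative only): each node stores its parent list and a conditional probability table, and a full joint assignment is scored as the product of $P(\mbox{node} \vert \mbox{Parents(node)})$ over all nodes.

# Hypothetical network A -> C <- B; each entry: (parents, CPT giving P(node=True | parents)).
network = {
    "A": ([], {(): 0.3}),
    "B": ([], {(): 0.6}),
    "C": (["A", "B"], {(True, True): 0.9, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.1}),
}

def joint_probability(network, assignment):
    """P(assignment) = product over nodes of P(node | its parents);
    assignment maps every variable name to True/False."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else (1 - p_true)
    return p

print(joint_probability(network, {"A": True, "B": False, "C": True}))  # 0.3 * 0.4 * 0.7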

In the Spotlight


Online Airline Pricing