On ideas behind Bayesian Learning


When one reads about Machine Learning (ML) or Artificial Intelligence (AI), it is common to come across a list of approaches to AI such as symbolic reasoning, Bayesian learning, artificial neural networks and so on. Seeing this list, I was puzzled by the notion of Bayesian Learning. The only thing I knew with Bayes in its name was the famous Bayes's theorem, which states that
\[
P(H\mid E) = \frac{P(E\mid H)P(H)}{P(E)}
\]
where $P(H)$ and $P(H\mid E)$ denote the prior and posterior probabilities of hypothesis $H$, respectively. What puzzled me was how such a simple-looking formula could be applied to draw patterns from data. In order to learn the basic principles of Bayesian learning, I decided to consult Wikipedia and a few online tutorials. This post presents the results of that exploration.
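To see the theorem in action before getting to learning, here is a minimal sketch in Python with made-up numbers (the prior and likelihoods below are assumptions chosen purely for illustration):

```python
# Hypothetical numbers: H is some hypothesis, E is observed evidence.
p_h = 0.01              # prior P(H)
p_e_given_h = 0.95      # likelihood P(E | H)
p_e_given_not_h = 0.05  # likelihood P(E | not H)

# Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes's theorem: P(H|E) = P(E|H)P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)  # roughly 0.16: the evidence raised P(H) from 0.01
```

Even strong evidence only lifts a very small prior so far; this interplay between prior and likelihood is the essence of the "update".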

Wikipedia says
Bayesian learning (inference?) is a method of statistical inference in which Bayes's theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
The key phrase in the above description seems to be 'update the probability'. To understand this term better, let us first describe the probabilistic worldview.

Probabilistic worldview
It is possible to interpret many events using probabilistic language. At first this approach might seem a bit odd, but it becomes quite natural with practice. For example, let $A$ be a boolean variable. Instead of saying that $A=true$, one could say that $A$ corresponds to the probability distribution on $\{0,1\}$ such that $P(0) = 0$ and $P(1) = 1$. The general idea is that probability distributions can capture patterns in objects. Let us move to more interesting examples. Imagine a machine learning setting with an input space $X$, an output space $Y$ and a training set $D\subset X\times Y$. Our aim is to estimate the output for a new input $x_{new}$, denoted $\hat{y}(x_{new})$.

Bayesian Approach
The approach taken by Bayesian Learning differs from estimating $y_{new}$ directly. Instead, we wish to estimate the distribution $P(Y\mid X=x_{new})$. Clearly the distribution provides more information about $y_{new}$ than a point estimate alone. The question is how we should come up with such a distribution.

Updates in probability distributions
To be more precise, we are interested in finding $P(Y\mid X=x_{new}, D)$ rather than $P(Y\mid X=x_{new})$. That is, we assume some prior distribution $P_0$ for $P(Y\mid X=x_{new})$ and update it towards a posterior $P_1$ for $P(Y\mid X=x_{new},D)$.

How to update?
Suppose that the prior distribution $P_0$ can be parametrised by $\theta$, which is taken to be discrete-valued for simplicity. So we have
\begin{align*}
P(Y\mid X=x_{new}, D) &= \sum_m P(Y, \theta=m\mid X=x_{new}, D)\\
&= \sum_m P(Y\mid \theta = m, X=x_{new}, D)\,P(\theta=m\mid X=x_{new},D)\\
&= \sum_m P(Y\mid \theta=m, X=x_{new})\,P(\theta=m\mid D)
\end{align*}
In the above, we made two assumptions. The first says that the distribution of $Y$ is independent of $D$ given $\theta$ and $X$. The second says that the distribution of $\theta$ is independent of $X$ given $D$.
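The final line of the derivation is just a weighted average: each candidate $\theta=m$ predicts a distribution for $Y$, and we weight it by how plausible $m$ is after seeing $D$. A minimal sketch, with made-up numbers for a binary $Y$ and three candidate values of $\theta$ (all probabilities below are assumptions for illustration):

```python
# Assumed inputs at some fixed x_new:
p_y1_given_theta = [0.2, 0.5, 0.9]  # P(Y=1 | theta=m, X=x_new) for each m
p_theta_given_d = [0.1, 0.3, 0.6]   # P(theta=m | D), a posterior over theta

# Marginalise theta out:
# P(Y=1 | X=x_new, D) = sum_m P(Y=1 | theta=m, x_new) * P(theta=m | D)
p_y1 = sum(py * pt for py, pt in zip(p_y1_given_theta, p_theta_given_d))
print(p_y1)  # close to 0.71: a mixture of the three candidate predictions
```

Note that no single candidate model is chosen; the prediction blends all of them according to their posterior weights.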

Application of Bayes's theorem
By Bayes's theorem we know that
\[
P(\theta = m\mid D) = \frac{P(D\mid \theta=m)P(\theta=m)}{P(D)}
\]
If we further assume that training examples are drawn independently, we have
\[
P(D\mid \theta=m) = \prod_{(x_i,y_i)\in D} P(x_i,y_i\mid \theta = m)
\]
Computing $P(D)$ poses some challenges, as we need to sum (or integrate, in the continuous case) over all values of $\theta$. The standard technique for countering this challenge is the use of Markov Chain Monte Carlo (MCMC) sampling methods, such as Gibbs sampling.
In this way we can compute the posterior probability, which in turn gives some idea about $y_{new}$.
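For a discrete $\theta$ the whole pipeline fits in a few lines, with no sampling needed. A sketch, assuming a coin-flip model where the data are independent flips and $\theta$ is the (hypothetical) heads probability, restricted to three candidate values:

```python
# Candidate parameter values and a uniform prior over them (assumed).
thetas = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]
data = [1, 1, 0, 1]  # observed flips, 1 = heads

def likelihood(theta, data):
    """P(D | theta), assuming independent draws: a product over examples."""
    p = 1.0
    for y in data:
        p *= theta if y == 1 else 1 - theta
    return p

# Bayes's theorem: posterior is likelihood * prior, normalised by P(D).
# Here P(D) is a plain sum over theta; in the continuous case it would
# be an integral, which is where MCMC methods come in.
unnorm = [likelihood(t, data) * p for t, p in zip(thetas, prior)]
p_d = sum(unnorm)
posterior = [u / p_d for u in unnorm]

# Posterior predictive for the next flip: marginalise theta out.
p_next_heads = sum(t * p for t, p in zip(thetas, posterior))
print(posterior, p_next_heads)
```

With three heads in four flips, the posterior shifts most of its weight to $\theta = 0.75$, yet the prediction still hedges across all three candidates.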

Summary
Even though many details were not covered in this post, I hope it gives a taste of how Bayesian learning is done.

