Bayesian Machine Learning (Part 1)
Introduction
As a data scientist, I am curious about understanding different analytical processes from a probabilistic point of view. There are two popular ways of looking at any event, namely Bayesian and Frequentist. While Frequentist researchers look at an event in terms of its frequency of occurrence, Bayesian researchers focus more on the probability of the event happening.
I am starting this series of blog posts to illustrate the Bayesian way of performing analytics. I will try to cover as much theory as possible, with illustrative examples and sample code, so that readers can learn and practice simultaneously.
Let’s start !!!
Defining Bayes' Rule
As we all know, Bayes' rule is one of the most popular probability equations. It is defined as :
P(a given b) = P(a intersection b) / P(b) ….. (1)
Here a and b are two events.
In the above equation I have highlighted the words given and intersection, as these words carry the major significance in Bayes' rule. Given indicates that event b has already happened and we now need to determine the probability of event a happening. Intersection indicates the occurrence of events a and b simultaneously.
Another form in which the above equation can be written is as follows:
P(a given b) = P(b given a) * P(a) / P(b) …. (2)
(equation 2 follows easily from equation 1: applying the same rule with a and b swapped gives P(a intersection b) = P(b given a) * P(a), and substituting this into equation 1 yields equation 2)
The above equation formulates the foundation of Bayesian inference.
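Before moving on, here is a quick numerical sanity check of equation 2 in Python; the probabilities below are made-up values chosen purely for illustration:

```python
# Bayes' rule: P(a given b) = P(b given a) * P(a) / P(b)
# All numbers below are made up purely for illustration.
p_a = 0.3            # P(a): prior probability of event a
p_b_given_a = 0.8    # P(b given a)
p_b = 0.5            # P(b)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.48
```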
Understanding Bayes' Rule from an Analytics Perspective
In analytics we always try to capture real-world behavior through models. These models are mathematical equations with some parameters in them. These parameters are estimated based upon the behavior of events, or the evidence we collect from the world. This evidence is popularly known as Data.
So the question arises: how do Bayesian methods help in identifying these parameters?
Let us first see how Bayes' rule can incorporate these models. We will now take theta and X as our events in Bayes' rule and re-write equation 2.
P(theta given X) = P(X given theta) * P(theta) / P(X) ….. (3)
Now, let us define all the different components of the above equation:
- P(theta given X) : Posterior Distribution**
- P(X given theta) : Likelihood
- P(theta) : Prior Distribution**
- P(X) : Evidence
** We can use the term distribution as all these terms are probabilities ranging from 0 to 1. In the above equation, theta represents the parameters of the model we need to compute, and X is the data on which the model is trained.
Equation 3 can be re-written as :
posterior distribution = likelihood * prior distribution / evidence ….. (4)
Now let us see all the above components individually.
Prior Distribution : We consider the prior distribution of theta as the information we have regarding theta before we even start the model fitting process. This information is mostly based upon experience. Usually we take a Normal distribution with mean = 0 and variance = 1 as the prior distribution of theta.
Posterior Distribution : This is the distribution we obtain over theta given our data. That is, once we have trained our model on the given data, we end up with tuned parameters. The posterior distribution is the distribution over the estimated theta. (This is again a big difference between the frequentist and Bayesian ways of inference: the frequentist approach gives a point estimate of theta, while the Bayesian approach gives a full distribution over it.)
Likelihood : This term is not a probability distribution over theta. Rather, it is the probability of observing the data given theta. In other words, given some theta, how likely are we to see the given data; that is, how well our model with the given theta as parameters explains the given data.
Evidence : It is the probability of the occurrence of the data itself.
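Putting these four pieces together, here is a minimal Python sketch of equation 4 for a single parameter theta. The observed data X, the theta grid, and the Gaussian form of the likelihood are assumptions made only for this illustration; the Normal(0, 1) prior follows the remark above:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical observed data (illustrative values only)
X = np.array([0.5, 1.2, 0.9, 1.5])

# Discretize theta so each term becomes a simple array
thetas = np.linspace(-3, 3, 601)

# Prior distribution: Normal with mean = 0 and variance = 1
prior = norm.pdf(thetas, loc=0, scale=1)

# Likelihood P(X given theta): here each x is assumed to be Normal(theta, 1)
likelihood = np.array([norm.pdf(X, loc=t, scale=1).prod() for t in thetas])

# Evidence P(X): the normalizing constant over the theta grid
evidence = np.sum(likelihood * prior)

# Posterior = likelihood * prior / evidence (equation 4)
posterior = likelihood * prior / evidence

# theta value with the highest posterior probability on the grid
print(thetas[np.argmax(posterior)])
```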
Now that we have our definitions in place, let us see an example showing how Bayesian reasoning can help in selecting between hypotheses, given the data.
Let us suppose we have the following data:
X = {2, 4, 8, 32, 64}
And we propose the following two hypotheses:
1) 2^n where n ranges from 0 to 9
2) 2*n where n ranges from 1 to 50
Now let us see how we can use Bayes' rule.
Note : as we have no prior information, we assign equal prior probability to each hypothesis.
—– Hypothesis 2^n where n ranges from 0 to 9 —–
This hypothesis takes the following values : 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
- prior 1 : 1 / 2
- Likelihood 1 : (1/10)*(1/10)*(1/10)*(1/10)*(1/10), since each of the 5 data points is assumed to be drawn uniformly from the 10 values above
- evidence : constant for all hypotheses as the input data is fixed
- posterior 1 : (1/10)*(1/10)*(1/10)*(1/10)*(1/10) * (1/2) / evidence
—– Hypothesis 2*n where n ranges from 1 to 50 —–
This hypothesis takes the following values : 2, 4, 6, 8, 10, 12, 14, 16 … 100.
- prior 2 : 1 / 2
- Likelihood 2 : (1/50)*(1/50)*(1/50)*(1/50)*(1/50)
- evidence : constant for all hypotheses as the input data is fixed
- posterior 2 : (1/50)*(1/50)*(1/50)*(1/50)*(1/50) * (1/2) / evidence
From the above analysis we can easily see that Posterior 1 >> Posterior 2, which means Hypothesis 1 explains the data much better than Hypothesis 2.
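To verify this numerically, here is a short Python sketch of the same calculation. It assumes, as above, that each data point is drawn uniformly from the chosen hypothesis set; dividing by the evidence (the sum of the two unnormalized posteriors) gives the normalized posteriors:

```python
# Data and the two hypotheses from the example above
X = [2, 4, 8, 32, 64]
h1 = [2 ** n for n in range(0, 10)]   # 2^n, n = 0..9  -> 10 values
h2 = [2 * n for n in range(1, 51)]    # 2*n, n = 1..50 -> 50 values

prior = 0.5  # equal prior probability for each hypothesis

def likelihood(data, hypothesis):
    # Each data point is assumed to be drawn uniformly from the hypothesis set
    if not all(x in hypothesis for x in data):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(data)

unnormalized_1 = likelihood(X, h1) * prior
unnormalized_2 = likelihood(X, h2) * prior

# The evidence is the sum of the unnormalized posteriors
evidence = unnormalized_1 + unnormalized_2
print(unnormalized_1 / evidence)  # posterior 1, roughly 0.9997
print(unnormalized_2 / evidence)  # posterior 2, roughly 0.0003
```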
If we look closely at the evaluation of the posterior for both hypotheses, we will note that the major difference came from the likelihood term. Later we will see that maximizing this likelihood helps in tuning parameters; this method is popularly known as Maximum Likelihood Estimation.
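As a small preview of that idea, here is a minimal sketch that picks, from a grid of candidate theta values, the one that maximizes the likelihood of some made-up coin-flip data (both the data and the Bernoulli model are assumptions made purely for illustration):

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails (illustrative only)
flips = np.array([1, 1, 0, 1, 1, 0, 1, 1])
heads = flips.sum()
tails = len(flips) - heads

# Candidate values for theta = probability of heads
thetas = np.linspace(0.01, 0.99, 99)

# Likelihood of the observed flips under each candidate theta
likelihoods = thetas ** heads * (1 - thetas) ** tails

# Maximum Likelihood Estimate: the theta that makes the data most probable
print(thetas[np.argmax(likelihoods)])  # close to 6/8 = 0.75
```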
So in this post I introduced Bayes' Rule. In the next post we will see how to use it for estimating the parameters of linear regression, with an example.
Thanks For Reading !!!