Introduction To Machine Learning - Naive Bayes Classifier

This is my first post on machine learning algorithms, so please forgive my brevity on some concepts/examples.

The Naive Bayes model (NBM) is probably the simplest of all machine learning algorithms that involve Bayesian statistics. It is a great primer model, in the sense that it should be used as a first take on your data. The NBM is a favorite for text mining/classification and fits those kinds of applications really well. It makes oversimplifying assumptions, but is nonetheless a great tool for gaining preliminary insights into the data.

We start off by defining conditional independence. Suppose we have three random variables A, B and C. These could be anything. For example, let them be the following:
A - will I have pizza for lunch today? (boolean variable, yes/no)
B - did I visit the cafeteria which has my favorite pizzeria (say, Tom's Delicious Pizzas)? (boolean variable, yes/no)
C - what's the name of the place on my receipt where I picked up lunch from today? (multivalued variable; say I usually visit 8 places in the cafeteria for lunch, of which Tom's Delicious Pizzas is one)

Now, if you look closely at the inter-relationship between these three variables, you'll observe that if I know whether or not my receipt has "Tom's Delicious Pizzas" on it, the variable B does not tell me anything new towards predicting A! This is exactly what conditional independence means. "Conditioned" on the fact that I know C, the knowledge of B does not help me in any way to predict A. Hence, we can completely ignore B! In probabilistic terms, this is written as:

P(A | B, C) = P(A | C), provided A and B are conditionally independent given C

Note: The "given C" part is really important. without any knowledge of C, B would suddenly become very relevant to predicting A. It is absolutely important for us to know C for the above formula to hold. If the following is true:

P(A|B) = P(A)

then it means that A and B are completely independent: B really has nothing to do with A. You can imagine A and B to be something like:
A - do you have a cold? (boolean variable, yes/no)
B - did the cow in your farm moo? (boolean variable, yes/no)
As you can see, A and B have nothing to do with each other; they are completely independent.
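To make this concrete, here is a minimal Python sketch of the pizza example. The probabilities in it are made up purely for illustration; the point is that once the joint distribution is built so that A and B are conditionally independent given C, computing P(A | B, C) and P(A | C) from it gives the same answer for every value of C.

# Build a toy joint distribution P(A, B, C) in which A and B are conditionally
# independent given C, then check that P(A | B, C) = P(A | C).
# All probabilities below are made up for illustration.
from itertools import product

p_c = {"toms": 0.3, "other": 0.7}            # C: place on the receipt (simplified to 2 values)
p_b_given_c = {"toms": 0.9, "other": 0.4}    # P(B = yes | C): did I visit that cafeteria?
p_a_given_c = {"toms": 0.8, "other": 0.1}    # P(A = yes | C): did I have pizza?

# Joint distribution, built so that A and B are independent once C is known
joint = {}
for c, a, b in product(p_c, [True, False], [True, False]):
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    joint[(a, b, c)] = p_c[c] * pa * pb

def prob(match):
    # Sum the joint probability over all outcomes (a, b, c) accepted by match
    return sum(p for outcome, p in joint.items() if match(*outcome))

for c in p_c:
    p_a_given_b_and_c = prob(lambda a, b, cc: a and b and cc == c) / prob(lambda a, b, cc: b and cc == c)
    p_a_given_c_only = prob(lambda a, b, cc: a and cc == c) / prob(lambda a, b, cc: cc == c)
    print(c, round(p_a_given_b_and_c, 4), round(p_a_given_c_only, 4))   # the two numbers match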



Now that we have independence concepts out of the way, let us get to the Bayes Rule, which states the following:

P(A|B) = (P(B|A).P(A)) / P(B)

Also, we have the chain rule for any random variables X1, X2, ..., Xn:

P(X1, X2, ..., Xn) = P(X1).P(X2 | X1).P(X3 | X1, X2) ... P(Xn | X1, X2, X3, ..., X(n-1))

As puzzling as these rules may look, they are nonetheless true. The proofs are a bit more involved (at least if I had to explain them in layman's terms), so you can look them up in any introductory probability text.
The basics of probability theory can be found on a lot of websites, such as Wikipedia, uploaded university course notes, etc. Please spend some time reading/scrolling through such material before proceeding.
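Before moving on, these two rules are also easy to sanity-check numerically. Here is a short Python sketch (the joint distribution is made up for illustration) that verifies Bayes Rule and the two-variable case of the chain rule:

# A quick numerical sanity check of Bayes Rule and the chain rule, using a
# made-up joint distribution over two boolean variables A and B.
joint = {
    (True, True): 0.10, (True, False): 0.25,
    (False, True): 0.30, (False, False): 0.35,
}

p_a = sum(p for (a, b), p in joint.items() if a)   # P(A)
p_b = sum(p for (a, b), p in joint.items() if b)   # P(B)
p_ab = joint[(True, True)]                         # P(A, B)
p_a_given_b = p_ab / p_b                           # P(A | B)
p_b_given_a = p_ab / p_a                           # P(B | A)

# Bayes Rule: P(A | B) = P(B | A).P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12

# Chain rule with two variables: P(A, B) = P(A).P(B | A)
assert abs(p_ab - p_a * p_b_given_a) < 1e-12
print("Bayes Rule and the chain rule both check out.")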

Now that probability theory is out of the way, we can focus on NBMs. How is all of this probability theory relevant to NBMs anyway? Usually, with machine learning, you have a set of variables called features (together referred to as a "feature vector") that you select to describe your object of interest. Let's go with a historical example. Suppose you are designing a spam filter that classifies whether the email you just received is spam or not. To help you, you have thousands of emails classified as spam/not spam by many users all over the world over the past few years. There are two ways you could model the scenario:
1. Let our feature vector be all words in the English language's dictionary. That means we have approximately 58,000 boolean variables. There are two issues with this model. Firstly, that is a very large number of parameters to handle, given that most emails are only a few hundred words long anyway. Secondly, and following up on that, most of these boolean variables will be zero for most documents (for example, how many emails do you expect to contain the word "aardvark"?). How do we handle a case where most of the variables we are predicting on are zeros?
2. Let the i'th word position in the email be our variable. That means we have a feature vector only as long as the email itself (which is very good), and each of the variables can possibly take approximately 58,000 values (which doesn't sound so good, but as we see later on, this is not a problem). Note that each feature vector is going to be of a different length (since our emails are not all of the same length, of course!) but again, this won't be a big problem. A small sketch of both representations follows right after this list.
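Here is a rough Python sketch of the two representations side by side. The vocabulary and the email below are tiny, made-up stand-ins for a real dictionary and a real message:

# A rough sketch of the two feature representations described above.
vocabulary = ["aardvark", "buy", "cheap", "hello", "meeting", "now", "pills", "tomorrow"]
email = "buy cheap pills now buy now".split()

# Representation 1: one boolean per dictionary word ("does this word appear?")
bag_of_words = [int(word in email) for word in vocabulary]
print(bag_of_words)    # [0, 1, 1, 0, 0, 1, 1, 0] -- mostly zeros for a real 58,000-word dictionary

# Representation 2: one variable per word position, each taking a value out of
# the whole vocabulary (here shown as an index into it)
positional = [vocabulary.index(word) for word in email]
print(positional)      # [1, 2, 6, 5, 1, 5] -- exactly as long as the email itself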

So it is easy to infer from the above two points that the second method, at first glance, looks better. So for a particular document d, let X(d) = {X(1), X(2), ..., X(n)} represent our feature vector, and Y(d) = {spam / not spam} be our result variable (the variable of interest, which we want to predict). For a new document, we are then interested in finding:

P(Y(new) | X(new))

From Bayes Rule above, we can write this as:

P(Y(new) | X(new)) = (P(X(new) | Y(new)).P(Y(new))) / P(X(new))

Now let us split X(new) into its constituent variables:

P(Y(new) | X(new)) = (P(X(1), X(2), X(3), ...., X(n) | Y(new)).P(Y(new))) / P(X(1), X(2), ..., X(n))

Which, using the chain rule from above, becomes (note: the brackets and the word "new" have been left out for space constraints):

P(Y | X) = ((P(X1|Y).P(X2|X1, Y).P(X3|X1, X2, Y).....P(Xn|X1, X2, ...., Xn-1, Y)).P(Y)) / P(X1, X2, ..., Xn)

This is where we make two assumptions (and this is where the beauty of NBMs lies):
1. X(i) and X(j) are conditionally independent given Y, for all i = 1, 2, ..., n and j = 1, 2, ..., n with i ≠ j
2. Each word position is generated identically and independently. This means that if the probability of the word "aardvark" appearing at, say, position 5 is k, then the probability of "aardvark" appearing at position 3 (or 30, or 231, etc.) is also k. It helps to think of a document of n words as generated by a series of n die throws, where the die, instead of being 6-sided, is 58,000-sided (since the English language contains approximately 58,000 words). So we have a huge die that is somehow rolled n times, each roll's outcome is recorded, and thus we get our document. This, in essence, translates to positional independence (a tiny simulation of this generative story follows below).
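Here is a tiny Python simulation of that generative story. The vocabulary and the word probabilities are made up for illustration; the important part is that every position is filled by an independent draw from the same distribution (the same "die"):

# Every word position is an independent, identically distributed roll of the
# same "die" over the vocabulary.  A real spam model would use a ~58,000-sided
# die; this sketch uses a 6-sided one with made-up probabilities.
import random

vocabulary = ["pizza", "buy", "cheap", "meeting", "tomorrow", "now"]
word_probs = [0.05, 0.30, 0.25, 0.10, 0.10, 0.20]   # one die shared by all positions
rng = random.Random(0)

def generate_document(n_words):
    # n_words rolls of the die, one per word position
    return rng.choices(vocabulary, weights=word_probs, k=n_words)

print(generate_document(8))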

So, now we can write our formula above as:

P(Y | X) = ((P(X1 | Y).P(X2 | Y).P(X3 | Y).....P(Xn | Y)).P(Y)) / P(X1, X2, ..., Xn)

Notice the simplifications introduced in the formula.

For a new document then, we basically find:
P(Y=spam | X(new)) = (P(X1 | Y=spam).P(X2 | Y=spam)....P(Xn | Y=spam).P(Y=spam)) / P(X1, X2, ..., Xn)

P(Y=not spam | X(new)) = (P(X1 | Y=not spam).P(X2 | Y=not spam)....P(Xn | Y=not spam).P(Y=not spam)) / P(X1, X2, ..., Xn)

And we assign our new document the value for Y which has a larger P(Y | X(new)).
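Two practical notes: the denominator P(X1, X2, ..., Xn) is the same for both classes, so it can be ignored when comparing them, and the products are usually computed as sums of logarithms to avoid numerical underflow on long emails. Here is a minimal Python sketch of this decision rule, assuming the probabilities P(word | class) and P(class) have already been estimated (the dictionaries at the bottom are made-up toy values):

import math

def classify(words, word_probs, priors):
    # words:      list of words in the new email
    # word_probs: {class: {word: P(word | class)}}
    # priors:     {class: P(class)}
    scores = {}
    for label in priors:
        score = math.log(priors[label])              # log P(Y)
        for w in words:
            # crude floor for unseen words; the smoothing discussed below is the proper fix
            score += math.log(word_probs[label].get(w, 1e-9))
        scores[label] = score                        # log of the numerator above
    return max(scores, key=scores.get)

# Made-up toy probabilities, just to show the call
word_probs = {"spam": {"buy": 0.4, "cheap": 0.3, "now": 0.3},
              "not spam": {"meeting": 0.5, "tomorrow": 0.3, "buy": 0.2}}
priors = {"spam": 0.3, "not spam": 0.7}
print(classify("buy cheap now".split(), word_probs, priors))   # -> spam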
So how do we calculate the probabilities P(Xi | Y) and P(Y)?

For document classification, this is quite straightforward. The following formulas hold:

P(Xi = Wj | Y = k) = (# of documents labeled as k which have the word Wj at position i) / (# of documents labeled as k in the entire corpus)

P(Y = k) = (# of documents that are labeled as k) / (total number of documents in the entire corpus)
  
There are a few problems with these formulas, especially the first one. What if, in our new document, we have a word Wj at position i that never appeared in the training dataset? This means,

P(Xi = Wj | Y = k) = 0.0

which means, P(Y=k | X(new)) = (something).(something else).0.(something)... / (something) = 0

We are out of luck! We just got a zero probability that the new document belongs to class Y=k, just because one word which had never appeared at that position came up in this new document. So we need to modify the formulas above:

P(Xi = Wj | Y = k) = ((# of documents labeled as k which have the word Wj at position i) + l) / ((# of documents labeled as k in the entire corpus) + m)

P(Y = k) = ((# of documents that are labeled as k) + l) / ((total number of documents in the entire corpus) + m)

where l and m are constants chosen by the user (through prior experience, etc.)
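Putting everything together, here is a rough end-to-end Python sketch of training and prediction with the smoothed counts. The tiny training corpus is made up for illustration, and the counts are pooled across word positions (which the positional independence assumption allows), so the word-probability denominator uses total word counts per class plus a smoothing term, rather than per-position document counts; this is just one convenient variant of the formulas above:

import math
from collections import Counter, defaultdict

# A rough end-to-end sketch: estimate the smoothed probabilities by counting,
# then predict.  The tiny "corpus" below is made up for illustration.
train = [
    ("buy cheap pills now", "spam"),
    ("cheap pills buy buy", "spam"),
    ("meeting moved to tomorrow", "not spam"),
    ("see you at the meeting", "not spam"),
]
l, m = 1, 2   # smoothing constants chosen by the user; here m = number of classes

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def p_class(k):
    # ((# of documents labeled k) + l) / ((total # of documents) + m)
    return (class_counts[k] + l) / (len(train) + m)

def p_word_given_class(w, k):
    # Counts pooled across positions: the denominator is the total number of
    # word slots in class k plus a smoothing term (l per vocabulary word)
    total = sum(word_counts[k].values())
    return (word_counts[k][w] + l) / (total + l * len(vocab))

def predict(text):
    scores = {}
    for k in class_counts:
        score = math.log(p_class(k))
        for w in text.split():
            score += math.log(p_word_given_class(w, k))
        scores[k] = score
    return max(scores, key=scores.get)

print(predict("cheap pills for you"))      # -> spam
print(predict("are you at the meeting"))   # -> not spam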

All in all, that is all there is to document classification with NBMs! Some excellent resources that guided me in writing this article are:

1. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
2. http://www.cs.cmu.edu/~tom/10601_fall2012/slides/NBayes-9-13-2012_ann.pdf
3. http://scs.hosted.panopto.com/Panopto/Pages/Viewer/Default.aspx?id=a666b6e6-ad23-4fa3-96ce-ae50a42f45a3

In particular, if you have time, definitely go through the video (the third link)!

Thanks for reading, and more coming in on machine learning soon!
