Introduction To Machine Learning - The Distributed Naive Bayes Classifier

In my last blog, we discussed the Naive Bayes algorithm. In this blog, we think about how we can parallelize the work on a distributed system to solve classification problems for enormous data sources.

So say we have documents of 3 classes (just to be more general than the usual binary classification case). The first thing we want to do is compute the priors for our document classes. In MapReduce, this is an easy task:

MAP(docid, doc)
    EMIT(doc.class, 1)

So we basically emit a partial count for the class label for all documents residing on different nodes. On the reducer side, we now need to collect all these partial counts, and add them up:

REDUCE(doc.class, counts [1, 1, ....., 1, 1])
    sum = 0
    for each count in counts
        sum = sum + count
    EMIT(doc.class, sum / M)
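To make the data flow concrete, here is a minimal Python sketch that simulates both phases of this job on a single machine. The document shape (a dict with a "class" field) is an assumption of the sketch, not anything mandated by Hadoop:

```python
from collections import Counter

def compute_priors(docs, M):
    """Locally simulate the prior-computing MapReduce job.

    docs is an iterable of (docid, doc) pairs, where each doc is a dict
    with a "class" key (a hypothetical shape for this sketch); M is the
    total corpus size in documents.
    """
    # Map phase: emit a (class, 1) pair for every document.
    emitted = [(doc["class"], 1) for _, doc in docs]

    # Shuffle + reduce phase: sum the partial counts per class,
    # then divide by M to turn counts into priors.
    counts = Counter()
    for cls, partial in emitted:
        counts[cls] += partial
    return {cls: n / M for cls, n in counts.items()}
```

For example, with two "spam" documents and one "ham" document, `compute_priors` returns priors of 2/3 and 1/3 respectively.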

We assume above that the text corpus size is M documents. That gives us our priors. Now, to get the word counts per document class, we need to iterate through each document and, for each word, output a ({word, class}, 1) pair. Let's see how this is done:

MAP(docid, doc)
    for each word w in doc
        EMIT( (word, doc.class) , 1)

The {word, doc.class} pair within the emitted pair is a data structure that holds two values (almost trivial to implement in Java). Note that, to work as a key in Hadoop, this pair must implement the WritableComparable interface — Hadoop uses its own Writable serialization mechanism rather than Java's Serializable.
The reducer now just has to count the total number of 1's occurring for a given word-document class pair.

REDUCE((word, doc.class), counts [1, 1, 1, ..., 1])
    sum = 0
    for each count in counts
        sum = sum + count
    EMIT((word, doc.class), sum / (PRIOR(doc.class) * M))

Here, PRIOR(doc.class) is the prior for the document class that we calculated above. PRIOR(doc.class) * M gives us the total number of docs present in the entire corpus whose label matches doc.class.
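The whole second job can likewise be sketched in Python. Again, the field names ("class", "words") are assumptions of the sketch; the priors argument is the class-to-prior mapping produced by the first job:

```python
from collections import Counter

def word_probabilities(docs, priors, M):
    """Locally simulate the word-counting MapReduce job.

    docs: iterable of (docid, doc), where doc is a dict with "class" and
    "words" keys (hypothetical field names for this sketch).
    priors: mapping class -> prior, as computed by the first job.
    M: total corpus size in documents.
    """
    # Map phase: emit ((word, class), 1) for every word occurrence.
    emitted = [((w, doc["class"]), 1)
               for _, doc in docs
               for w in doc["words"]]

    # Shuffle + reduce phase: sum the counts per (word, class) key,
    # then normalise by the number of docs in that class, PRIOR(class) * M.
    counts = Counter()
    for key, partial in emitted:
        counts[key] += partial
    return {(w, c): n / (priors[c] * M)
            for (w, c), n in counts.items()}
```

Note that the normalisation here divides by a document count, per the pseudocode above, so the result is an average per-document occurrence rate rather than a probability bounded by 1; a production implementation would typically also add smoothing for unseen words.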

By pre-computing these word probabilities for enormous texts, classification can then run in O(kn) time, where k is the number of document classes and n is the length of the document (in words).
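To see where the O(kn) bound comes from, here is a hedged sketch of the classification step: for each of the k classes we sum n log-probability lookups, each O(1) against the precomputed table. The `eps` floor for unseen (word, class) pairs is a stand-in for a proper smoothing scheme, an assumption of this sketch:

```python
import math

def classify(words, priors, word_probs, eps=1e-9):
    """Pick the highest-scoring class in O(k * n) table lookups.

    priors: mapping class -> prior probability.
    word_probs: mapping (word, class) -> precomputed word probability.
    eps: floor used for unseen pairs (placeholder for real smoothing).
    """
    best_class, best_score = None, float("-inf")
    for cls, prior in priors.items():            # k classes
        score = math.log(prior)
        for w in words:                          # n words
            score += math.log(word_probs.get((w, cls), eps))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```

Working in log space avoids floating-point underflow when multiplying many small probabilities together.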

That's all for now. Stay tuned for some code in my next blog (also on machine learning)!
