
An explanation and example of Naive Bayes

Here I embark on a slow, carefree explanation of Naive Bayes; if you're just interested in the code, or the single line of math Naive Bayes boils down to, look to the bottom. Naive Bayes is an algorithm for classifying things. It's probably most popular for its use in spam classification, but you can use it for pretty much anything else, and it's somewhat embarrassing how successful it has been all these years compared to anything else. Moving away from Naive Bayes to, say, a fully Bayesian system carries a large computational cost for frequently little benefit on the types of problems Naive Bayes is good at.

Allow me a brief biblical tangent. In Genesis, God created Adam, the first man. God saw it was not good for the man to be alone, so he decided to create a helper for Adam. At the same time, he had Adam name, i.e. classify, all the creatures brought before him:

And out of the ground the LORD God formed every beast of the field, and every fowl of the air; and brought them unto Adam to see what he would call them: and whatsoever Adam called every living creature, that was the name thereof.
And Adam gave names to all cattle, and to the fowl of the air, and to every beast of the field; but for Adam there was not found an help meet for him.
--Genesis 2:19-20(KJV)


Then God made a new creature from Adam's rib, and Adam called it a woman.

I think it's an interesting note that the first thing man was tasked to do by his supposed creator was to classify things, and one of the things we create and use 'intelligent' algorithms for is to classify things. Anyway, that's enough of that, back to the math.

Recall Bayes' Theorem:

[math]\begin{align}
prob(X|A,I) &= prob(X|I) * \frac{prob(A|X,I)}{prob(A|I)} \\
prob(model\ |\ data,\ background\ information) &= prob(model\ |\ background\ information) * \frac{prob(data\ |\ model,\ background\ information)}{prob(data\ |\ background\ information)}
\end{align}
[/math]

In case you don't remember the notation, $$prob(A|B)$$ is read as "the probability of A given that B is true." (Technically some people treat A and B as "events" rather than "propositions", so then it deals with predicting future events given that past events have already occurred. But you can always embed a human or other agent in the problem, and predict the agent saying A is true given that the agent said B is true...)

Hence you might ask $$prob(rain|cloudy)$$ (the probability of it raining given that the sky is currently cloudy) and $$prob(cloudy|rain)$$; these are clearly two separate statements and they shouldn't necessarily be equal. I will probably write a post or two on this theorem directly at some point, but for now take it as a given. It relates reason with reality--the left side is our state of knowledge about some model or hypothesis we have to explain our objectively observed data, weighted by a factor of prior belief (which may be our old state of knowledge, so we can "update" to new states of knowledge with new data this way).
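
To make the notation concrete, here's a quick worked example with made-up numbers (they're purely illustrative, not real weather statistics): say our prior belief in rain is 20%, rain is accompanied by cloudy skies 90% of the time, and the sky is cloudy 40% of the time overall. Then:

[math]prob(rain|cloudy,I) = prob(rain|I)*\frac{prob(cloudy|rain,I)}{prob(cloudy|I)} = 0.2*\frac{0.9}{0.4} = 0.45[/math]

Seeing clouds more than doubles our belief in rain, precisely because clouds are so much more common when it rains than they are in general.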

The specific problem I'm going to use the Naive Bayes classifier algorithm for is determining which tag(s) on this site a blog post I'm writing fits best. To do that, the algorithm needs to rank how probable it is that my current blog post fits under each tag, given my blog post as data and some background information--namely, the already existing tags and the posts associated with them. Reducing the problem even more, I'm basically giving the algorithm my "post" as a sequence of words, and asking it to estimate what tags it fits under. It achieves this by counting how many times each word appears in the given post, and seeing how many times that word appears in the training data along with a given tag.

I'm calculating $$prob(tag\ fits|this\ post,\ and\ training\ posts)$$.

Using Bayes' Theorem, I calculate that like so:

[math]prob(tag|post,training) = prob(tag|training)*\frac{prob(post|tag,training)}{prob(post|training)}[/math]

Again, $$prob(tag|post,training)$$ is the probability that a given post belongs with our tag. Equivalently, it's the probability that our tag is a correct model for our post. If the post is mathy, like this one, prob(math|post) should be fairly high, and much higher than prob(shakespeare|post). Translating probabilities to English can be a tricky thing--the standard notation is, I think, a mistake.

$$prob(tag|training)$$ is our prior probability that some tag matches a post without actually looking at the post. We might notice the tag cloud and how "programming" seems to have a lot of posts for it, while "music" does not. So if we're asked to give a prior estimate that any new post fits the tag "programming", it should probably be much higher than the prior estimate that it's "music". This represents our subjective "best-guess" ahead of time, but it could also represent our previous result since the old post would get merged with the training data to affect how we reason about a new post.

Since we actually have the training data available to us, we'll calculate this prior using the reasonable notion that the more posts a tag has going for it compared to all others, the more likely that tag is to be the one for the current post. Expressed in math, we're saying that $$prob(tag|training) = \frac{number\ of\ posts\ for\ tag\ in\ training}{total\ number\ of\ posts}$$.

As of this writing, the prior probability that any new post I write is a "programming" post is 46/206, or about 22%, and "music" is 1/206, or about half a percent. Does this seem reasonable? Another measure we might use is the total number of words across all the posts in a tag compared to the total number of words in the whole blog. I tried this out; it didn't affect my results very much, because I have enough data that the prior becomes nearly irrelevant, but the selection of priors can be important sometimes.
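
In code, that prior is a one-liner. Here's a minimal sketch, where $tag_post_counts and $total_posts are hypothetical stand-ins for whatever your data source provides (the real implementation below pulls these from a tag list and a COUNT query):

// Minimal sketch of the prior: fraction of all posts carrying this tag.
// $tag_post_counts maps tag names to post counts; both inputs are hypothetical.
function tag_prior($tag, $tag_post_counts, $total_posts) {
    $count = isset($tag_post_counts[$tag]) ? $tag_post_counts[$tag] : 0;
    return $count / $total_posts;
}

// e.g. tag_prior('programming', array('programming' => 46, 'music' => 1), 206)
// is about 0.22.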

I belabor the point that the prior probability is subjective, and I want to belabor even harder the point that this subjectivity is the same subjectivity involved in making an initial guess in Newton's square root method--you may be off at first, but over time you will approach the best answer. That doesn't mean you should be stupid about choosing the prior, though. If we had no knowledge, we'd use a uniform prior and assume all tags were equally likely, but our training data gives us knowledge. (It counts as a given, since it sits on the right side of the bar.)

$$prob(post|tag,training)$$ is the more interesting part of the problem right now. It represents the probability that, given some tag, the words in the input post appear in that tag's data set. It's sometimes called the "likelihood function", and it's the part we can actually go out and measure. Unlike our prior, where we're justified in just looking at post counts for each tag, here we actually care about the content of those posts rather than the posts themselves. I'll explain how to calculate this shortly.

$$prob(post|training)$$ is the least interesting part of the equation, and for this application we'll ignore it. It's going to be identical for every tag we consider, since it doesn't depend on the tag at all--each new post is as likely as the next, as I could write about anything. So we can factor it out as a scaling constant and ignore it entirely, since we're only interested in the ordering of the tag probabilities (suggesting, say, the top 5 for me to use) as opposed to the actual probability that a tag matches. Of course, one could try to calculate it (perhaps as a function of how many times each word in the post appears across all posts ever), but that seems like overkill here, just as more effort on the prior seems like overkill. Your application may vary in how important this is. Anyway, to be safe with the math, we'll downgrade our previous equation to a proportionality:

[math]\begin{align}
prob(tag|post,training) &\propto prob(tag|training) * prob(post|tag,training) \\
prob(tag|post,training) &\propto \frac{number\ of\ posts\ for\ tag\ in\ training}{total\ number\ of\ posts} * prob(post|tag,training)
\end{align}[/math]

We can calculate the prior probability of a tag with the fraction of posts in that tag over all posts, so how do we measure the likelihood, the last prob() value? What can we say about a post's likelihood of falling into a chosen tag? To measure this, we must look at the post's content, and use that content as our objective information along with our training data to see how often this post's content appears in a chosen tag's content.

Here is the naive step of Naive Bayes. We're going to say that a post's content is defined entirely by its words (strings of text separated by spaces), and that each word appears independently of the others. This is clearly false, but it works well enough, and as long as we stuff this assumption into our background information along with the training data we'll be okay.

Anyway, mathematically, if we're calculating some post fitting into a given tag, we're really calculating that post's first word, and second word, and third word, and so on, fitting into a given tag. With the rest of the naive assumption, we can calculate by just multiplying together the probability that the post's first word is in a given tag, times the prob that the second word fits in a given tag, and so on.

[math]\begin{align}
prob(post|tag,training) &= prob(word1,word2,word3,\cdots,wordM|tag,training) \\
&(by\ naive\ assumption\ of\ independence) \\
&= prob(word1|tag,training)*prob(word2|tag,training)*\cdots*prob(wordM|tag,training)
\end{align}[/math]

Here $$M$$ is the number of words in the post.

The probability of a single word for a given tag is determined by the number of times that word appears in that tag's content from the training data divided by the number of all words in that tag's content. In a sense, we're just computing histograms and comparing each column's count with the total count, only with words this time instead of posts like we do for the prior probability.
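
As a minimal sketch of that relative frequency in code (where $tag_histo is a hypothetical word-count histogram for the tag and $tag_wordcount is its total word count--the same shapes the implementation below stores in its JSON file):

// Minimal sketch: prob(word|tag,training) as a relative frequency.
// $tag_histo and $tag_wordcount are hypothetical stand-ins here.
function word_likelihood($word, $tag_histo, $tag_wordcount) {
    $count = isset($tag_histo[$word]) ? $tag_histo[$word] : 0;
    return $count / $tag_wordcount;
}

Note the zero it returns for an unseen word--that's exactly the problem discussed a couple of paragraphs down.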

Remember, this is a very naive assumption! If you know anything about language, you'll know that one particular word being in a sentence depends heavily on the words around it. When we actually implement this, we'll "massage" the data a bit to compensate for the deficiencies of this naive assumption. For example, the word "the" likely appears under every tag many times, possibly even as the dominant word, but no matter how many times it appears for a tag, it likely offers no useful information for our problem.

One problem you may have noticed: what happens if a word has never been said for a given tag? Then our measured $$prob(word|tag,training)$$ is 0, and that obliterates all the other probabilities in the multiplication! Just because a particular word hasn't been said for a tag doesn't mean that tag doesn't apply, especially considering all the other words, so this seems like a stupid thing to happen, right? How can we avoid it? We certainly can't use an actual 0, but using an epsilon value that's almost 0 (like 0.0001) doesn't do the trick either when we're multiplying.

If you remember some high school algebra, you may have studied logarithms. Logs have the interesting property that $$log(A*B) = log(A) + log(B)$$. (And $$log(A/B) = log(A) - log(B)$$.) So when our implementation is going through all the words and calculating their likelihoods given a chosen tag, it should sum up the log (whatever base) values of the likelihoods instead of trying to multiply them.

You may see an additional problem again though: $$log(0)$$ is undefined. (It heads off to negative infinity as we approach 0 from the positive side.) But unlike before, where we couldn't use an actual 0, here we can safely insert an epsilon (like 0.0001) as the count without distorting the other likelihoods and without ignoring the word completely, and compute $$log(0.0001)$$. And of course if we have an actual word count because the word exists, we compute $$log(word\ count)$$ as normal. (My epsilon was chosen experimentally (0.1 made "music" seem likely for this post!) and also intuitively--while a word that hasn't been said before for a tag doesn't disqualify that tag, it should still penalize it quite a bit, especially when that tag might not have many words to begin with.)
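
Putting the log trick and the epsilon together, the per-post log-likelihood looks something like this sketch (same hypothetical $tag_histo/$tag_wordcount shapes as before, and the same 0.0001 fudge the full implementation below uses):

// Minimal sketch: sum the logs of each word's relative frequency,
// substituting an epsilon count of 0.0001 for words the tag has never seen.
function log_likelihood($post_words, $tag_histo, $tag_wordcount) {
    $log_sum = 0;
    foreach ($post_words as $word) {
        $count = isset($tag_histo[$word]) ? $tag_histo[$word] : 0.0001;
        $log_sum += log($count / $tag_wordcount);
    }
    return $log_sum;
}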

You'll note that the value becomes more negative as the count goes down. This corresponds to the idea that if a word makes up only a small fraction of the total words in a tag, it has less importance to that tag. "keyboard" may appear in a "fruit" tag a couple of times, whereas "orange" may appear many times; "keyboard" is thus less important. And for a word that has never appeared in the "fruit" tag at all, seeing it out of the blue (as our naive assumption lets us treat it) is reason to believe it has even less to do with "fruit" than "keyboard" does.

The closer to 0 our log result gets, the closer to certainty that word becomes. Of course, all our results will be pretty far from 0 due to the assumptions we're making and my particular data set, so I end up dividing the logs by 1000 just to get a nicer-looking number. (Since we only care about the ordering, this rescaling is harmless--one more reason we settle for a proportionality.)

Let's make sure we're being explicit with our math:

[math]\begin{align}
prob(tag|post,training) &\propto \frac{number\ of\ posts\ for\ tag\ in\ training}{total\ number\ of\ posts} * prob(word1|tag,training)*prob(word2|tag,training)*\cdots*prob(wordM|tag,training) \\
\log(prob(tag|post,training)) &\propto \log\Big[\frac{number\ of\ posts\ for\ tag\ in\ training}{total\ number\ of\ posts} * prob(word1|tag,training)*prob(word2|tag,training)*\cdots*prob(wordM|tag,training)\Big] \\
&\propto \log(\frac{number\ of\ posts\ for\ tag\ in\ training}{total\ number\ of\ posts}) + \log(prob(word1|tag,training)) + \log(prob(word2|tag,training)) + \cdots + \log(prob(wordM|tag,training))
\end{align}
[/math]

So our final Naive Bayes classifier we've taken all this time to go over is expressed in one line of math with a trivial simplification:


[math]\begin{align}
\log(prob(tag|post,training)) &\propto \log(\frac{number\ of\ posts\ in\ tag}{number\ of\ posts}) + \log(\frac{number\ of\ word1\ in\ tag}{words\ in\ tag}) + \log(\frac{number\ of\ word2\ in\ tag}{words\ in\ tag}) + \cdots + \log(\frac{number\ of\ wordM\ in\ tag}{words\ in\ tag}) \\
\log(prob(tag|post,training)) &\propto \log(\frac{number\ of\ posts\ in\ tag}{number\ of\ posts}) - M*\log(words\ in\ tag) + \log(number\ of\ word1\ in\ tag) + \cdots + \log(number\ of\ wordM\ in\ tag)
\end{align}[/math]

Remembering that $$M$$ is the number of words in a post, and noting that if "number of wordX in tag" is 0 we fudge it to 0.0001. Now that we have the equation for one tag, it's just a matter of going through each tag, weighing the probabilities, and selecting the top 10 or so most likely tags for the post. The rest of the implementation is left as an exercise to the reader.
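
Before getting to real code, here's that line in action on toy numbers (entirely made up for illustration): take a two-word post and a tag covering 3 of 10 training posts, with 100 words of content under the tag, where word1 appears 5 times and word2 never appears (so its count gets fudged to 0.0001). With natural logs:

[math]\log(prob(tag|post,training)) \propto \log(\tfrac{3}{10}) - 2\log(100) + \log(5) + \log(0.0001) \approx -1.20 - 9.21 + 1.61 - 9.21 \approx -18.0[/math]

The same calculation run for every other tag gives a list of scores, and only their relative order matters.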

Okay, "exercise to the reader" jokes aside--this is my blog and I'll actually get some use out of it. (I also have to demonstrate that it actually works by running this post through it!) What follows is a portion of the code I'm actually running. (Further optimization and performance work, beyond what I care about for a tool only I use when writing new posts, really is the reader's job to worry about.) There is one piece of low-hanging optimization fruit I'm going to take advantage of, though: pre-computing the training-data values words-in-tag and number-of-particular-word-in-tag, which I can then add to over time. HostGator sucks, and making a lot of DB queries at once to recompute those every time kills my site, so I'm offloading the data into a private server file to read and write.

Because of that optimization, the code is divided into two parts. The first part defines our training data structure (just simple JSON storing histograms, basically) and a function to add new data to it (plus a one-time initial load); the second part simply reads back out the values we need for our Naive Bayes equation. All functions are implicitly in a class.


/**
 * Associates more words with a tag in our training data file.
 * Our training data file 'naive_bayes.json' contains tags and their
 * associated words along with the counts of how many times each word appears
 * in the tag:
 * {"tags": {
 *   "foo": {"wordcount": total, "words": {"word1": count, ... }},
 *   "bar": ...
 * }}
 */
private function naive_bayes_data_add($tag, $txt) {
    $fn = APP_URI . 'naive_bayes.json';
    if (file_exists($fn)) {
        $training = json_decode(file_get_contents($fn), 1);
    } else {
        $training = array('tags' => array());
    }
    if (!isset($training['tags'][$tag]))
        $training['tags'][$tag] = array('wordcount' => 0, 'words' => array());
    $words = $this->txt2words($txt);
    $training['tags'][$tag]['wordcount'] += count($words);
    foreach (array_count_values($words) as $word => $count) {
        if (isset($training['tags'][$tag]['words'][$word]))
            $training['tags'][$tag]['words'][$word] += $count;
        else
            $training['tags'][$tag]['words'][$word] = $count;
    }
    file_put_contents($fn, json_encode($training));
}


/**
 * Takes input $txt and returns an array of words, filtering out the
 * 100 most common English words we don't care about (like 'the')
 * as well as punctuation.
 */
private function txt2words($txt) {
    $bad = 'the of and a to in is you that it he was for on are as with his they I at be this have from or one had by word but not what all were we when your can said there use an each which she do how their if will up other about out many then them these so some her would make like him into time has look two more write go see number no way could people my than first water been call who oil its now find long down day did get come made may part';
    // Treat the stop list as one big string: any word that shows up as a
    // substring of it gets filtered. Loose, but cheap.
    return array_filter(str_word_count($txt, 1), function($word) use (&$bad) {
        return mb_stripos($bad, $word) === FALSE;
    });
}


I've omitted the code to generate the initial file, as it should be obvious and depends on your data source anyway. (Let me know in the comments if you really want it.) (Edit: I've since updated the above function so that it only opens and writes on command and passes the array-to-write by reference.) Obviously I could have improved my add function to avoid reading/writing every time so a bulk load could happen, but my data isn't big enough for me to really care. Another interesting optimization would be keeping a separate array of all the unique words across all content, so each word could be stored as an index into it instead of the full string. But I'm not worried about storage and the speedup would likely be minimal; the JSON file is only about 1 MB.

Now for the second part: reading in the data, running Naive Bayes, calculating a couple of other things, and voilà!


/**
 * Computes a score proportional to prob(tag|post,training) for all tags.
 *
 * @param $post_content - content of post.
 * @param $N - how many top tags to return; 0 for all.
 * @return Returns minimally
 *   {'tags': [{'tag_name': 'tag1', 'score': -200.3}, ...]} for the $N
 *   highest-scoring tags, with possible extra measures.
 */
private function naive_bayes($post_content, $N=0) {
    global $dbc;
    $fn = APP_URI . 'naive_bayes.json';
    if (!file_exists($fn))
        return;
    $training = json_decode(file_get_contents($fn), 1);

    $hs = new HomeService(1);
    $tags = $hs->get_tag_list_detailed(array());
    $tagl = $tags['tag_list']; // has 'tag_name' of all tags and
                               // 'total' (number of posts for tag)

    $q = "SELECT COUNT(id) FROM posts";
    $r = mysqli_query($dbc, $q);
    list($total_posts) = mysqli_fetch_row($r);

    $post_words = $this->txt2words($post_content);
    $post_wordcount = count($post_words);

    foreach ($tagl as $key => $tag_info) {
        $tag = $tag_info['tag_name'];
        $posts_in_tag = $tag_info['total'];
        $prior = $posts_in_tag / $total_posts;

        $tag_wordcount = $training['tags'][$tag]['wordcount'];
        $tag_histo = $training['tags'][$tag]['words'];

        $log_sum = 0;
        $words_shared = 0;

        foreach ($post_words as $word) {
            $val = 0.0001;
            if (isset($tag_histo[$word])) {
                $val = $tag_histo[$word];
                $words_shared++;
            }
            $log_sum += log($val);
        }
        $log_sum -= $post_wordcount * log($tag_wordcount);

        // Naive Bayes:
        $tagl[$key]['score'] = round((log($prior) + $log_sum) / 1000, 6);
        // (Factor of 1000 so we have smaller numbers.)

        // Extra measures:
        $tagl[$key]['percent_shared_words'] = round(100*$words_shared/$post_wordcount, 2);
        $tagl[$key]['words_shared'] = $words_shared;
        $tagl[$key]['words_missed'] = $post_wordcount - $words_shared;
        $tagl[$key]['percent_prior'] = round(100*$prior, 2);
    }

    usort($tagl, function($lhs, $rhs) {
        if ($lhs['score'] == $rhs['score'])
            return 0;
        return ($lhs['score'] < $rhs['score']) ? 1 : -1; // sort DESC
    });

    if ($N > 0)
        $tagl = array_slice($tagl, 0, $N);
    return array('tags' => $tagl);
}
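
Using it is then just a matter of passing in a draft and reading the scores back out. This is only a hypothetical sketch of how the suggestions might be surfaced ($draft_content and the echo formatting aren't from the real site):

// Hypothetical usage: score a draft and print the top five tag suggestions.
$result = $this->naive_bayes($draft_content, 5);
foreach ($result['tags'] as $suggestion) {
    echo $suggestion['tag_name'] . ': ' . $suggestion['score'] . "\n";
}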


I ran my algorithm on the post content above this paragraph to get an idea of how well it was doing. My criterion for "good enough" was that it had "programming" and "math" in the top 5. Do you think it performed well?

Here's the data for the top 5:

programming: score is -20.726721, percent_shared_words is 72.76, words_shared is 1386, words_missed is 519, our percent_prior was 22.33.

philosophy: score is -20.771194, percent_shared_words is 74.91, words_shared is 1427, words_missed is 478, percent_prior was 18.93.

math: score is -21.396616, percent_shared_words is 62.26, words_shared is 1186, words_missed is 719, percent_prior was 7.28.

personal: score is -21.467156, percent_shared_words is 67.72, words_shared is 1290, words_missed is 615, percent_prior was 12.62.

thought: score is -23.006031, percent_shared_words is 54.38, words_shared is 1036, words_missed is 869, percent_prior was 7.77.

I'm calling it a win, though I was surprised PHP didn't quite make it as a recommended tag. But remember, it's not just how many words are shared, but how many times those shared words appear. Maybe next time, once this post has been folded into the training data, it'll crop up. (The good news is that for an existing post, the post's existing tags come out basically at 100%.)

I hope my math and implementation are accurate; I believe they are. If not, I'm sure someone will come by and chew me out, or I'll discover for myself that it's not working as well as it should!

Oh, for a similar treatment of this process--less detailed, and in my opinion a bit more confusing for people who lack Bayes' Theorem intuition or just aren't that great with math symbols--check out this article, which also starts with Bayes and uses logs: Naive Bayesian Text Classification


Posted on 2012-01-12 by Jach

Tags: bayes, math, philosophy, php, programming

Permalink: https://www.thejach.com/view/id/227

Trackback URL: https://www.thejach.com/view/2012/1/an_explanation_and_example_of_naive_bayes


Mohit October 09, 2013 07:57:51 AM You Rock buddy! I have been searching for this explanation a long time and your post is immensely helpful! Thanks a Lot!