Naive Bayes Classifier
The Naive Bayes classifier is a type of classifier that is mostly used as a baseline for text classification. It is a probabilistic model built on Bayes Theorem. It is called Naive Bayes because the order of words in a sentence does not impact the result of the classifier. For example, the sentences:
The boy is kind, not so?
and
The boy is not so kind?
are considered the same by Naive Bayes. But we know that the first sentence does not mean the same as the second! This property of the Naive Bayes model is called the independence assumption of Naive Bayes.
Though Naive Bayes does not consider the order of words, it does consider the multiplicities of the words in a sentence. For example:
what a great day
is not the same as
what a great great day.
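Both behaviours can be illustrated with word counts, a so-called bag-of-words representation. The sketch below is an illustration added to this tutorial; it uses Python's Counter to show that reordering words leaves the counts unchanged while repeating a word changes them.

```python
from collections import Counter

def bag_of_words(sentence):
    # A bag of words keeps per-word counts but discards word order.
    return Counter(sentence.lower().split())

# Word order is ignored: both sentences map to the same counts.
a = bag_of_words("the boy is kind not so")
b = bag_of_words("the boy is not so kind")
print(a == b)  # True

# Multiplicity is kept: the repeated "great" makes the counts differ.
c = bag_of_words("what a great day")
d = bag_of_words("what a great great day")
print(c == d)  # False
```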
Now, let us look at Bayes Theorem.
Bayes Theorem
Let A and B be events with P(A) > 0 and P(B) > 0. Then we define the following conditional probabilities:

P(A|B) = P(A∩B) / P(B)    (1)

P(B|A) = P(A∩B) / P(A)    (2)
Note: P(A∩B) is called the joint probability of A and B.
Since equations (1) and (2) have a common numerator, P(A∩B), we can put them together and have:

P(A|B) = P(B|A) P(A) / P(B)    (3)
Equation (3) is Bayes Theorem, where:
P(A) is the prior probability — probability before carrying out a test;
P(B) is the marginal probability — the probability of the evidence;
P(B|A) is the likelihood of B given A — the probability of the evidence given that A is true; and
P(A|B) is the posterior probability — the probability of A after the evidence has been seen.
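To make these four quantities concrete, here is a small numeric check of equation (3). The numbers are made up for illustration: suppose 40% of reviews are positive (the prior), the word "great" appears in 30% of all reviews (the marginal), and in 60% of positive reviews (the likelihood).

```python
p_a = 0.4          # prior P(A): a review is positive
p_b = 0.3          # marginal P(B): a review contains "great"
p_b_given_a = 0.6  # likelihood P(B|A): "great" appears in a positive review

# Equation (3): posterior P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.8
```

So under these invented numbers, seeing "great" raises the probability that the review is positive from 0.4 to 0.8.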
In the Naive Bayes text classification problem, we are interested in finding the probability of a label given the words. Hence we can rewrite equation (3) in terms of the label and the words:

P(label|words) = P(words|label) P(label) / P(words)    (4)
Simplifying further, since P(words) is the same for every label, we can drop it and pick the label that maximizes the numerator:

label* = argmax over labels of P(words|label) P(label)    (5)
Naive Bayes Classifier
We want the model to be easy to estimate, so from equation (5) we make a naive assumption that the probability of a word given a class is independent of the other words, and then we have:

P(words|label) = P(w1|label) × P(w2|label) × … × P(wn|label)    (6)
This assumption is what makes us call the model naive as we know that this assumption is not true in many cases because the probability of one word given a class may be dependent on the probability of another word given that class.
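Equations (5) and (6) are enough to build a working classifier. Below is a minimal sketch, not part of the original tutorial: the training sentences and labels are invented, probabilities are summed in log space to avoid numerical underflow, and add-one (Laplace) smoothing is used so that an unseen word does not force a probability of zero — two practical details not covered above.

```python
import math
from collections import Counter, defaultdict

def train(sentences, labels):
    """Estimate the counts needed for P(label) and P(word|label)."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[label][word]
    vocab = set()
    for sentence, label in zip(sentences, labels):
        words = sentence.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(sentence, label_counts, word_counts, vocab):
    """Equation (5)/(6): argmax over labels of log P(label) + sum of log P(w|label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            # Add-one smoothed likelihood P(word|label)
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data, for illustration only.
sentences = ["what a great day", "truly great movie",
             "what a boring movie", "so boring"]
labels = ["pos", "pos", "neg", "neg"]
model = train(sentences, labels)
print(predict("great day", *model))      # pos
print(predict("boring movie", *model))   # neg
```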
Simple Example
Suppose we have 5 sentences from a movie review as follows:
And we want to predict the review of a new sentence:
We can construct a table to summarize the reviews from Table 1 as follows:
And calculate the most probable review as follows:
From the above, we see that the label with the higher probability is the positive review.
The above tutorial is inspired by the NLP courses taught at the African Masters in Machine Intelligence programme by Armand Joulin and Edouard Grave of Facebook AI Research.