Building a Spam Classifier Using the Naive Bayes Algorithm

Written by binita | Published 2021/03/01
Tech Story Tags: spam-filter | programming | python | tutorial | data-science | artificial-intelligence | machine-learning | bayesian

TLDR An unsolicited email sent in bulk is a spam email. We do not generally want spam emails, so spam classifiers throw them into spam folders. Naive Bayes classifiers are classifiers based on Bayes’ theorem, a theorem that gives the probability of an event based on prior knowledge of conditions related to the event. It can be used to build a naive but good-enough spam classifier, and we will see its use with a Python machine learning library, Sklearn.

In our daily life, we get lots of emails. Some emails are useful and some are not. An unsolicited email sent in bulk is a spam email. We do not generally want spam emails, so spam classifiers throw them into spam folders before they reach our inbox.
According to Statista, around 29% of the emails sent in 2019 were spam. Studies have found that spam emails impede economic growth and cause losses of billions of dollars of GDP. Rao and Reiley estimate that the economic loss would exceed $1 trillion if firms were not investing in anti-spam technology.
These statistics underscore the importance of spam filters. As machine learning and deep learning continue to advance, spam filters have adopted them to protect customers, and they have been largely successful. From saving email-reading time to protecting customers from fraud, deceit, and phishing, spam filters have done excellent work in preventing losses and increasing efficiency.

Email Classification Using Naive Bayes Classifiers

Today, let’s scratch the surface of spam email classification using one of the simplest techniques, naive Bayes classification. Naive Bayes classifiers are classifiers based on Bayes’ theorem, a theorem that gives the probability of an event based on prior knowledge of conditions related to the event. It can be used to build a naive but good-enough spam classifier, and we will see its use with a Python machine learning library, Sklearn.
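To make the theorem concrete, here is a tiny toy illustration for a single word; the probabilities below are made up purely for illustration, whereas the real classifier estimates them from the training data.
# Toy illustration of Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
# All probabilities below are made-up numbers, for illustration only
p_spam = 0.29              # prior probability that an email is spam
p_word_given_spam = 0.80   # chance the word "winner" appears in a spam email
p_word_given_ham = 0.05    # chance the word "winner" appears in a non-spam email
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)   # roughly 0.87: seeing "winner" makes spam much more likely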
First, let’s import relevant libraries, sub-packages, modules, and classes.
import matplotlib.pyplot as plt
import nltk 
import numpy as np
import pandas as pd 
import seaborn as sns 
In addition, let's import some methods, functions, and classes from Scikit-learn (Sklearn), one of the most widely used libraries in data science.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.multiclass import unique_labels
Now, let’s download the email dataset (around 5,500 rows) from the dataset URL below, which I got from AiDevNepal’s GitHub repository. The dataset contains non-spam emails and spam emails. Also, let’s convert the labels to numerical values: 1 for spam and 0 for non-spam.

data = pd.read_csv('https://raw.githubusercontent.com/AiDevNepal/ai-saturdays-workshop-8/master/data/spam.csv')

data['target'] = np.where(data['target'] == 'spam', 1, 0)
Shall we peek into the data?
data.head(10)
Before training, let’s divide the dataset into training and testing sets. By default, Sklearn’s train_test_split holds out 25% of the data for testing, i.e. a 75:25 split.

X_train, X_test, Y_train, Y_test = train_test_split(data['text'], 
                                                    data['target'], 
                                                    random_state=0)
Our raw dataset consists of email messages. We cannot feed such raw text to machine learning algorithms: they train models by doing computation, and that computation requires numerical values. So, let’s extract features from the raw dataset for training. To do that, we transform all the email messages into vectorized form using the CountVectorizer class. Here, we take unigrams and bigrams, fitting the vectorizer on the training examples.
# extract features
vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
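If you are curious about what the vectorizer produced, a quick peek like the one below works; the exact numbers depend on the dataset and the split.
# Inspect the extracted features (exact sizes depend on the data and the split)
print(len(vectorizer.vocabulary_))   # number of distinct unigrams and bigrams
print(X_train_vectorized.shape)      # (number of training emails, vocabulary size)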
Now, we create a multinomial naive Bayes model using the Sklearn API and train it on the dataset we created. Naive Bayes is a performant machine learning algorithm on small datasets: it generalizes well from a small number of training examples, where complex models like neural networks often struggle.
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, Y_train)
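The alpha parameter controls additive (Laplace/Lidstone) smoothing, which keeps n-grams that never appear in a class from getting zero probability. If you want a feel for its effect, a rough sweep like the sketch below works; a proper search would use a separate validation set rather than the test split.
# Rough sweep over the smoothing parameter (a proper search would use a validation set)
for alpha in [0.01, 0.1, 0.5, 1.0]:
    nb = MultinomialNB(alpha=alpha).fit(X_train_vectorized, Y_train)
    print(alpha, nb.score(vectorizer.transform(X_test), Y_test))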
Let’s test the model by making predictions on the test set, transforming the raw test data with the vectorizer we created earlier.
predictions = model.predict(vectorizer.transform(X_test))
print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')
The accuracy of our model on the testing data is a whopping 98.99%. WOW!!!
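Accuracy alone can be misleading when spam is the minority class, so it is worth also checking the confusion matrix, precision, recall, and ROC AUC with the metrics we imported earlier. Here is one way to do it.
# A fuller picture than accuracy: confusion matrix, precision, recall, and ROC AUC
print(confusion_matrix(Y_test, predictions))
print("Precision:", precision_score(Y_test, predictions))
print("Recall:", recall_score(Y_test, predictions))
spam_probabilities = model.predict_proba(vectorizer.transform(X_test))[:, 1]
print("ROC AUC:", roc_auc_score(Y_test, spam_probabilities))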
Now, let's test our model with real-life emails and see what it predicts.

model.predict(vectorizer.transform(
    [
        "Thank you, ABC. Can you also share your LinkedIn profile? As you are a good at programming at pyhthon, would be willing to see your personal/college projects.",
        "Hi y’all, We have a Job Openings in the positions of software engineer, IT officer at ABC Company.Kindly, send us your resume and the cover letter as soon as possible if you think you are an eligible candidate and meet the criteria.",
        "Dear ABC, Congratulations! You have been selected as a SOftware Developer at XYZ Company. We were really happy to see your enthusiasm for this vision and mission. We are impressed with your background and we think you would make an excellent addition to the team.",
    ]
))
Are you eager to see what our model predicts? Here it is: the model outputs 0 for all three emails, and as we defined earlier, 0 means non-spam. That’s right! I tested it with emails I received from my employers, colleagues, and friends.
Okay, what about spam emails in my spam folder? Let’s test them.
model.predict(vectorizer.transform(
    [
        "congratulations, you became today's lucky winner",
        "1-month unlimited calls offer Activate now",
        "Ram wants your phone number",
    ]
))
The output for the above example is 1 for every message. Nailed it! The model flags all three as spam. You are a savior ❤
As we saw, the classifier turned out to be a savior for me in the end; otherwise, I might have fallen victim to fraud or phishing attempts. Now, how about testing your own emails and seeing how this naive algorithm performs?
