## Email Spam Detection Using Naive Bayes Classifier: A Case Study

*By Dhiraj Bezbaruah | Nov. 2, 2020, 6:11 a.m.*

Email spam is the practice of sending unwanted messages to large numbers of email users. The history of email spam begins before 2004, but the developments described here are the major factors that made spam what it is today. The commercialization of the Internet and the adoption of email as an everyday means of communication brought with them a flood of unwanted information and messages. One subset of UBE (Unsolicited Bulk Email) is UCE (Unsolicited Commercial Email). The opposite of "spam", email that one wants, is called "ham", usually when referring to the automated analysis of messages (for example, Bayesian filtering).

Email spam targets individual users with direct mail messages. Email spam lists are often created by scanning Usenet postings, stealing Internet mailing lists, or searching the Web for addresses. Email spam typically costs users money out of pocket to receive. Many people, namely anyone with measured phone service, read or receive their mail while the meter is running, so to speak, so spam costs them additional money. On top of that, it costs money for ISPs and online services to transmit spam, and these costs are passed directly on to subscribers.

Increasingly, email spam is sent via "zombie networks", networks of virus- or worm-infected personal computers in homes and offices around the world. Many modern worms install a backdoor that allows the spammer to access the computer and use it for malicious purposes. This complicates attempts to control the spread of spam, because in many cases the spam does not obviously originate from the spammer. In November 2008 an ISP, McColo, which was providing service to botnet operators, was taken offline, and spam dropped by 50 to 75 percent worldwide. At the same time, it is becoming clear that malware authors, spammers, and phishers are learning from one another, and possibly forming various kinds of partnerships.

There are two main types of spam, Usenet spam and the email spam described above, and they have different effects on Internet users. Cancellable Usenet spam is a single message sent to 20 or more Usenet newsgroups. (Through long experience, Usenet users have found that any message posted to so many newsgroups is often not relevant to most or all of them.) Usenet spam is aimed at "lurkers", people who read newsgroups but rarely or never post and give their address away. Usenet spam robs users of the utility of the newsgroups by overwhelming them with a barrage of advertising or other irrelevant posts. Furthermore, Usenet spam subverts the ability of system administrators and owners to manage the topics they accept on their systems.

**Email Filtering/Spam Filtering:**

A spam filter is a program that detects unsolicited and unwanted emails and prevents them from reaching a user’s inbox. Like other filtering programs, it looks for certain criteria on which it bases its judgments. The input to email filtering software is email. As output, a filter may pass the message through unchanged for delivery to the user's mailbox. Some mail filters are also able to edit messages during processing. Mail filters vary in how configurable they are. Sometimes they make decisions by matching a regular expression. At other times, keywords in the message body are used, or perhaps the email address of the sender. Some more advanced filters, particularly anti-spam filters, use statistical document classification techniques such as the naive Bayes classifier. Image filtering can also be used, applying complex image-analysis algorithms to detect skin tones and specific body shapes normally associated with pornographic images. Mail filters can be installed by the user, either as separate programs or as part of their email program (email client).

In email programs, users can create personal, "manual" filters that then automatically filter mail according to the chosen criteria. Most email programs now also have an automatic spam filtering function. Internet service providers can also install mail filters in their mail transfer agents as a service to all of their customers. Because of the growing threat of fraudulent websites, Internet service providers filter URLs in email messages to remove the threat before users click. Common uses for mail filters include organizing incoming email and removing spam and computer viruses. A less common use is to inspect outgoing email at some companies to ensure that employees comply with appropriate laws. Users may also employ a mail filter to prioritize messages and to sort them into folders based on subject matter or other criteria.
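The rule-based criteria mentioned above (a regular expression, keywords in the body, the sender's address) can be sketched in a few lines of Python. This is only an illustration: the addresses, keywords, and pattern are hypothetical examples, and `is_spam` is a made-up helper name rather than part of any mail program.

```python
import re

# Hypothetical rules, chosen only for illustration.
BLOCKED_SENDERS = {"offers@example.com"}
SPAM_KEYWORDS = {"lottery", "prize"}
SUBJECT_PATTERN = re.compile(r"\b(free|win(ner)?)\b.*!{2,}", re.IGNORECASE)


def is_spam(sender, subject, body):
    """Return True if any hand-written rule matches the message."""
    if sender.lower() in BLOCKED_SENDERS:
        return True  # rule 1: sender address is on the block list
    if SUBJECT_PATTERN.search(subject):
        return True  # rule 2: subject matches the regular expression
    words = set(re.findall(r"[a-z]+", body.lower()))
    return bool(words & SPAM_KEYWORDS)  # rule 3: keyword found in the body
```

Rules like these are easy to write but must be maintained by hand, which is one reason statistical classifiers are attractive for spam filtering.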

**The Solution to the Problem:**

To address the problem described above, this project uses the Naive Bayes classifier to separate spam from non-spam emails. The Naive Bayes classifier is one of the most popular and simplest methods for classification. Naive Bayes classifiers are highly scalable: the number of parameters they require is linear in the number of features of the learning problem. Training on a large data sample is straightforward and takes very little time compared with other classifiers. Using the Naive Bayes classifier also improves the accuracy of the system.

**Applications of Naive Bayes Algorithms:**

- Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.
- Multi-class Prediction: The algorithm is also well known for multi-class prediction: it can predict the probability of each of several classes of the target variable.
- Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification, where (owing to good results in multi-class problems and the independence assumption) they often have a higher success rate than other algorithms. As a result, they are widely used in spam filtering (identifying spam email) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
- Recommendation System: A Naive Bayes classifier and collaborative filtering can together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

**Aim and Objectives**

Today, email spam is a headache for everyone. To solve this problem, we need spam classification techniques. In this project, we develop an application that distinguishes spam from non-spam email using the Naive Bayes classifier. Although email service providers already use this technology to separate spam from non-spam, our main focus is on making it more accurate and useful with an existing dataset. We also aim to study how the Naive Bayes algorithm behaves on new user input once it has been implemented.

We will develop a command-line program in which users can check whether a text input is spam or not. The program will also calculate and display the accuracy of the machine learning model.

We will also develop an easy-to-use graphical user interface (GUI) for the model, in which the user can calculate the model's accuracy score and enter text to be classified as spam or ham.

To address the problem of spam email, this project uses a machine learning classification method to identify the spam and ham in the dataset. The work involves the following steps:

- Input Dataset: The dataset used here is the Enron-spam dataset. It is split into spam and ham categories containing 3672 ham and 1500 spam emails. The two kinds of email can be recognized by their file names: ham emails contain ‘farmer.ham’ and spam emails contain ‘GP.spam’.
- Pre-processing: This is the first step, in which we remove from each document the words that may not contribute to the information we want to extract. Emails may contain many undesirable tokens such as punctuation marks, stop words, and digits, which may not help in detecting spam. After that, a word dictionary is created containing the 3000 most frequently occurring words.
- Feature Extraction: Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each email in the training set. Each word count vector contains the frequencies of the 3000 dictionary words in that training file. As a smaller illustration, suppose the dictionary has only 500 words, so that each word count vector holds the frequencies of those 500 words. If the text in a training file is “Get the work done, work done”, it will be encoded as [0,0,0,0,0,…,0,0,2,0,0,0,…,0,0,1,0,0,…,0,0,1,0,0,…,2,0,0,0,0,0]. Here, the nonzero word counts sit at the 296th, 359th, 415th, and 495th positions of the 500-length word count vector, and the rest are zero.
- Training the classifier: We train the classifiers with scikit-learn, an open-source Python machine learning library that is installed separately with pip; once it is installed, we only need to import it into the program to train on the data. Once the classifiers are trained, we can check the performance of the models on the test set: we extract the word count vector for each email in the test set and predict its class (ham or spam) with the trained NB classifier. The ratio of testing to training data is 20:80.
- New user input: The user can now give new text input to the trained model, which checks it and reports whether the input is spam or ham.
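The pre-processing, feature extraction, and 80:20 split steps above can be sketched with the Python standard library alone. This is a minimal illustration rather than the project's code: the function names are hypothetical, the three-sentence corpus is a stand-in for the real emails, and stop-word removal is left out so that the example sentence keeps all four of its words.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (drops digits and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(emails, size=3000):
    """Return up to `size` of the most frequent words across all emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]


def word_count_vector(text, dictionary):
    """Encode one email as a list of counts, one entry per dictionary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in dictionary]


def split_80_20(items, seed=0):
    """Shuffle a copy of the items and return (train, test) in an 80:20 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Toy corpus standing in for the real emails (illustrative only).
emails = ["Get the work done, work done", "Win a free prize now", "The work is done"]
dictionary = build_dictionary(emails)
vector = word_count_vector("Get the work done, work done", dictionary)
```

With this toy dictionary the vector has only ten entries, but the idea is the same as in the 500-word illustration: "work" and "done" each get a count of 2, "get" and "the" a count of 1, and every other position is zero.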

**Methodology:**

Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help to provide us with a better understanding of data on a large scale.

**Naive Bayes Classifier**

Naive Bayes classification is a machine learning algorithm for classification problems. It is primarily used for text classification, which involves high-dimensional data. Examples include spam email detection and the classification of news articles. Assigning a document to a particular category remains challenging because of the very large number of features in the datasets. Naive Bayes is popular in commercial and open-source anti-spam email filters and is well suited to serve as a document classification model.

Several machine learning techniques have been employed in anti-spam email filtering, including algorithms considered top performers in text classification, such as boosting, support vector machines, decision trees, neural networks, and logistic regression. Nevertheless, Naive Bayes classifiers currently appear to be particularly popular in commercial and open-source spam filters. This is probably due to their simplicity, which makes them easy to implement, their linear computational complexity, and their accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms.

A Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. For example, a fruit may be considered to be orange if it is orange in color, round, and about 5" in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an orange. Basically, it's "Naive" because it makes assumptions that may or may not turn out to be 100% correct.
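In symbols, the independence assumption lets the posterior probability of a class factor into one term per feature. The following is the standard textbook statement of the model, written with assumed notation (class $C$, features $x_1, \dots, x_n$) that does not appear elsewhere in this article:

$$
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

The predicted class is the one for which the right-hand side is largest. For the fruit example, the probability of "orange" is built from separate factors for color, shape, and diameter, with no term describing how those features relate to one another.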

**Pros and Cons of Naive Bayes:**

Pros:

- It is easy and fast to predict the class of test data-set. It also performs well in multi-class prediction.
- When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression and you need less training data and time.
- It performs well in the case of categorical input variables compared to numerical variables. For numerical variables, the normal distribution is assumed (bell curve, which is a strong assumption).

Cons:

- If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this problem, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
- Naive Bayes is also known to be a poor probability estimator, so its probability outputs should not be taken too seriously.
- Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
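The zero-frequency problem and its Laplace fix can be shown with a few lines of Python. The word counts and vocabulary size below are hypothetical, and `word_probability` is an illustrative helper rather than a library function; `alpha=1` corresponds to Laplace (add-one) estimation.

```python
from collections import Counter


def word_probability(word, class_counts, vocabulary_size, alpha=1):
    """Estimate P(word | class) with additive smoothing; alpha=0 disables it."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocabulary_size)


# Hypothetical word counts observed in spam training emails.
spam_counts = Counter({"free": 3, "prize": 2, "now": 1})
vocabulary_size = 5  # assumed: the 3 spam words plus 2 words seen only in ham

# "meeting" never appeared in spam, so the unsmoothed estimate is exactly zero.
unsmoothed = word_probability("meeting", spam_counts, vocabulary_size, alpha=0)
# With add-one smoothing the estimate becomes 1 / (6 + 5), small but nonzero.
smoothed = word_probability("meeting", spam_counts, vocabulary_size, alpha=1)
```

Because Naive Bayes multiplies the per-word probabilities together, a single zero would wipe out the whole product, which is why the smoothed estimate matters.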

**Types of Naïve Bayes:**

- Multinomial NB: It is used when we have discrete count data. For example, in a text classification problem we can take the Bernoulli trial one step further: instead of “word occurs in the document”, we “count how often the word occurs in the document”. You can think of it as the “number of times outcome number x_i is observed over the n trials”.
- Bernoulli NB: This model is useful if your feature vectors are binary (i.e. zeros and ones). One application is text classification with a ‘bag of words’ model where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document” respectively.
- Gaussian NB: It is used for classification with continuous features, and it assumes that the features follow a normal distribution.
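To make the multinomial variant concrete, here is a from-scratch sketch that works on word counts, uses Laplace smoothing, and sums log probabilities to avoid numerical underflow. It is an illustration under assumed names and a four-email toy training set, not the project's implementation; scikit-learn offers a ready-made `MultinomialNB` class for real use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_emails):
    """Count documents and words per class from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_emails:
        doc_counts[label] += 1
        words = tokenize(text)
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log posterior (up to a constant)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            count = word_counts[label][word]
            # Laplace-smoothed log likelihood, added once per occurrence
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(labeled_emails, model):
    """Fraction of emails whose predicted label matches the true label."""
    hits = sum(predict(text, model) == label for text, label in labeled_emails)
    return hits / len(labeled_emails)


# Toy training set (illustrative only).
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train(training)
```

A Bernoulli version would differ mainly in the features: each word would contribute once per email depending on whether it is present or absent, instead of once per occurrence.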

**Bayes Theorem:**

The name “Bayes” refers to the statistician and philosopher Thomas Bayes and to the theorem named after him, Bayes' theorem, which is the basis of the Naive Bayes algorithm.

Let us understand what Bayes' theorem says:

P(A | X) = P(X | A) * P(A) / P(X)

Where,

P(A | X) = probability of event A occurring given that event X is true

P(A) and P(X) = probabilities of events A and X occurring on their own

P(X | A) = probability of event X occurring given that event A is true

**How Bayes Theorem Works?**

Let’s understand it by using an example. Below we have a training data set of weather and corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on weather conditions. Let’s follow the below steps to perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create a likelihood table by computing probabilities, for example the probability of Overcast = 0.29 and the probability of playing = 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the above-discussed method of the posterior probability.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny) = 1 - 0.60 = 0.40, so the prediction is that players will play.
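The same arithmetic can be checked in Python with exact fractions. The sketch uses only the counts stated in the example (14 days, 9 "Yes" days, 5 sunny days, 3 of them "Yes"); the remaining counts follow by subtraction, and the variable and function names are illustrative.

```python
from fractions import Fraction

# Counts taken from the worked example.
total_days = 14
yes_days = 9
sunny_days = 5
sunny_and_yes = 3

# The remaining counts follow by subtraction.
no_days = total_days - yes_days            # 5
sunny_and_no = sunny_days - sunny_and_yes  # 2

p_sunny = Fraction(sunny_days, total_days)


def posterior(joint_count, class_count):
    """P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny)."""
    likelihood = Fraction(joint_count, class_count)
    prior = Fraction(class_count, total_days)
    return likelihood * prior / p_sunny


p_yes_given_sunny = posterior(sunny_and_yes, yes_days)
p_no_given_sunny = posterior(sunny_and_no, no_days)
print(float(p_yes_given_sunny), float(p_no_given_sunny))  # prints 0.6 0.4
```

Working with exact fractions gives 3/5 directly and avoids the small rounding error that comes from multiplying the two-decimal values 0.33, 0.64, and 0.36.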

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and problems having multiple classes.

**Conclusion:**

Email spam is one of the most serious problems in online communication, and it causes many difficulties. Spam is simply an unwanted message or email that the end user does not want in their mailbox. To address this problem, we built an email spam classification system that identifies spam and non-spam emails. It uses the Naive Bayes classifier, with features extracted by a word-count algorithm. In our evaluation, the Naive Bayes classifier reached an accuracy score of 94%, which we consider a good result, with a correspondingly low error rate. We therefore conclude that the Naive Bayes classifier performs well for email classification. Through this study, we have also gained a better understanding of machine learning algorithms and the Python programming language.