Bayesian Email Spam Filter

NOTES:

a) You will implement an “Email Spam Filter” using the principle of Bayesian Logic. The algorithm for the spam filter is illustrated with an example in the link below:

· Bayesian Filtering Example: Process Software Inc.

b) For this exercise, we will implement a slightly modified version of the algorithm as described in the above link.

Background:

An email that is not spam is called a ham email or simply, ham.

We will begin by training our filter with five emails that are known to be spam and five emails that are known to be ham. After the training, we will read an input email, split it into individual words and compute the spamicity for every word. Spamicity for a word, W is defined by the following formula:

· Spamicity = (A)/(A+B)

Where:

· A is the frequency of W in the spam emails that were used for training.

· B is the frequency of W in the ham emails that were used for training.

As we compute the spamicity for every word, we will store its value in an array list. Using the Collections API, we will sort the array list in ascending order and store its minimum value, C (first value in the array list) and maximum value, D (last value in the array list).

Finally, we will compute the probability of the input email being spam using the Bayes formula:

· P (Email is Spam) = (C * D)/ ((C * D) + (1-C) * (1-D))

Programming Details:

Follow the steps below:

1) Text files for training the spam filter:

o Create five text files that contain spam emails. Name them as spam1.txt through spam5.txt

o Create five text files that contain ham emails. Name them as ham1.txt through ham5.txt

o Spam and Ham text files for training.

o Store the above ten files in the same folder as your Java program.

2) We will create two Hash maps as below:

a. HashMap <String, Integer> spam = new HashMap <String, Integer>();

§ Read through spam1.txt through spam5.txt and store the frequency of each word in this map.

b. HashMap <String, Integer> ham = new HashMap <String, Integer>();

§ Read through ham1.txt through ham5.txt and store the frequency of each word in this map.

3) To check if an input email is spam or ham, we will create a text file, “test.txt” and fill it up with text.

4) For this step, we will read through “test.txt” and compute the spamicity (of type double) for every word, W. To store the spamicity of every word in “test.txt”, we will create a hash map, testWords as defined below:

§ HashMap <String, Double> testWords = new HashMap <String, Double>();

We will also add the spamicity of every word, W in “test.txt” to an array list, list (ArrayList<Double>).

To compute the spamicity of word, W, implement the steps below:

a. Start by checking if maps, spam and ham do not contain the word, W by using the following syntax:

§ if (!spam.containsKey(W) && !ham.containsKey(W))

If the above statement is true, the word, W is not contained in both spam and ham. For such words, we will assign a spamicity of 0.4 for W.

b. For the scenarios where W is contained in spam but not in ham (ham frequency of W = 0), contained in ham but not in spam (spam frequency of W = 0) and contained in both spam and ham, we will compute the spamicity as below:

§ spamicity = (Frequency of W in spam/5.0)/ ((Frequency of W in spam/5.0) + (Frequency of W in ham /5.0));

The number 5.0 in the above equation represents the number of emails that were used to train the spam filter. If the spamicity for W is greater than 1.0, we will assign a spamicity of 1.0 for W.

c. Store the spamicity of W in the map, testWords as below:

§ testWords.put(W, spamicity);

d. Add the spamicity of W to list.

5) Sort the list, list in ascending order. Find the minimum value, A (first) and maximum value, B (last) of list. Compute the probability of the input email being spam by using the Bayes formula.

Solution:

· Java Implementation

Program Output:

Test Case #1:

· test.txt: Check your bank account for a million dollars from Bank of Nigeria.

Spamicity values for every word in test.txt:

Word Spamicity

Bank 1.000

Check 0.400

Nigeria 0.600

a 0.500

account 1.000

bank 1.000

dollars 1.000

for 1.000

from 1.000

million 1.000

of 1.000

your 1.000

Sorted values of Spamicity

0.400 0.500 0.600 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

Probability of test.txt being spam = 1.0

· Detailed output

Test Case #2:

· test.txt: Nigeria is a country in West Africa.

Input Email Spamicity

Word Spamicity

Africa 0.400

Nigeria 0.600

West 0.400

a 0.500

country 0.400

in 0.000

is 0.000

Sorted values of Spamicity

0.000 0.000 0.400 0.400 0.400 0.500 0.600

Probability of test.txt being spam = 0.0

· Detailed output