Bayesian Email Spam Filter
NOTES:
a)
You will
implement an “Email Spam Filter” using the principle of Bayesian Logic. The algorithm
for the spam filter is illustrated with an example in the link below:
·
Bayesian Filtering Example:
Process Software Inc.
b) For this exercise, we will implement a slightly modified version of the algorithm as described in the above link.
Background:
An email
that is not spam is called a ham email or simply, ham.
We will
begin by training our filter with five emails that are known to be spam and
five emails that are known to be ham. After the training, we will read an input
email, split it into individual words and compute the spamicity for every
word. Spamicity for a word, W is defined
by the following formula:
·
Spamicity = (A)/(A+B)
Where:
·
A is the
frequency of W in the spam emails that were used for training.
·
B is the
frequency of W in the ham emails that were used for training.
As we compute
the spamicity for every word, we will store its value in an array list. Using
the Collections API, we will sort the array list in ascending order and store
its minimum value, C (first value in the array list) and maximum value, D (last
value in the array list).
Finally, we
will compute the probability of the input email being spam using the Bayes
formula:
·
P (Email is
Spam) = (C * D)/ ((C * D) + (1-C) *
(1-D))
Programming Details:
Follow the
steps below:
1)
Text files
for training the spam filter:
o
Create five
text files that contain spam emails. Name them as spam1.txt through spam5.txt
o
Create five
text files that contain ham emails. Name them as ham1.txt through ham5.txt
o
Spam and Ham text files for training.
o
Store the
above ten files in the same folder as your Java program.
2)
We will
create two Hash maps as below:
a.
HashMap
<String, Integer> spam = new HashMap <String, Integer>();
§ Read through spam1.txt through
spam5.txt and store the frequency of each word in this map.
b.
HashMap
<String, Integer> ham = new
HashMap <String, Integer>();
§ Read through ham1.txt through
ham5.txt and store the frequency of each word in this map.
3)
To check if
an input email is spam or ham, we will create a text file, “test.txt” and fill
it up with text.
4)
For this
step, we will read through “test.txt” and compute the spamicity (of type double) for every word, W. To store the spamicity of every word in
“test.txt”, we will create a hash map, testWords as defined below:
§ HashMap <String, Double> testWords = new HashMap <String, Double>();
We will also add the spamicity of every word, W in “test.txt” to
an array list, list (ArrayList<Double>).
To compute the spamicity of word, W, implement the steps below:
a.
Start by
checking if maps, spam and ham do not contain the word, W by using
the following syntax:
§ if (!spam.containsKey(W)
&& !ham.containsKey(W))
If the above statement is true, the word, W is not contained in
both spam and ham. For such words, we will assign a spamicity of 0.4 for W.
b.
For the
scenarios where W is contained in spam
but not in ham (ham frequency of W =
0), contained in ham but not in spam (spam frequency of W = 0) and
contained in both spam and ham, we will compute the spamicity as
below:
§ spamicity = (Frequency of W in spam/5.0)/ ((Frequency of W in spam/5.0) + (Frequency of W in ham /5.0));
The number 5.0 in the above equation represents the number of
emails that were used to train the spam filter. If the spamicity for W is
greater than 1.0, we will assign a spamicity of 1.0 for W.
c.
Store the
spamicity of W in the map, testWords as
below:
§ testWords.put(W, spamicity);
d. Add the spamicity of W to list.
5)
Sort the
list, list in ascending order. Find
the minimum value, A (first) and maximum value, B (last) of list. Compute the probability of the
input email being spam by using the Bayes formula.
Solution:
Program Output:
Test Case #1:
·
test.txt:
Check your bank account for a million dollars from Bank of Nigeria.
Spamicity values
for every word in test.txt:
Word Spamicity
Bank 1.000
Check 0.400
Nigeria 0.600
a 0.500
account 1.000
bank 1.000
dollars 1.000
for 1.000
from 1.000
million 1.000
of 1.000
your 1.000
Sorted values of Spamicity
0.400 0.500 0.600 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Probability of test.txt being
spam = 1.0
Test Case #2:
·
test.txt:
Nigeria is a country in West Africa.
Input Email
Spamicity
Word Spamicity
Africa 0.400
Nigeria 0.600
West 0.400
a 0.500
country 0.400
in 0.000
is 0.000
Sorted
values of Spamicity
0.000 0.000
0.400 0.400 0.400
0.500 0.600
Probability
of test.txt being spam = 0.0