The program operates in two phases. First, in the training phase, your program reads in two different text files containing training examples for the two classes. These two files together are called the training set. Your program will examine the emails in the training set and, for each class, tabulate information about the number of training examples and how many times each word in the vocabulary appears in the examples for that class. (The "vocabulary" is the set of all words that appear in the training set emails.)
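To give a concrete picture, here is a minimal sketch of that tabulation in Python; the names are illustrative, not required, and it assumes each email has already been reduced to the set of words it contains:

from collections import Counter

def tabulate(emails):
    # For one class: count the training examples and, for each word,
    # how many of those examples contain it.  `emails` is a list of
    # sets, one set of words per email (an assumed representation).
    num_examples = len(emails)
    word_counts = Counter()
    for words in emails:
        word_counts.update(words)
    return num_examples, word_counts

def build_vocabulary(spam_emails, ham_emails):
    # The vocabulary: every word that appears anywhere in the training set.
    vocab = set()
    for words in spam_emails + ham_emails:
        vocab |= words
    return vocab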
Second, in the testing phase, your program will read in another two text files containing a new set of examples for each of the two classes. These two files together are called the testing set. Your program will run the classifier on each email in the testing set and classify it into one of the two classes by identifying the MAP hypothesis. Your code will report, for each email in the testing set, the log-probability of the email belonging to each of the two classes, the predicted class, and whether or not the prediction was correct. At the end, your program will report the number of examples in the testing set that were classified correctly.
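Here is a minimal sketch of that classification step, assuming presence/absence features and add-one smoothing (one common choice for estimating the likelihoods, not the only one):

import math

def log_posterior(email_words, prior, num_examples, word_counts, vocab):
    # log P(class) plus, for every vocabulary word, log P(word present
    # or absent | class), estimated from the training counts with
    # add-one smoothing (an assumed choice).
    total = math.log(prior)
    for word in vocab:
        p_present = (word_counts.get(word, 0) + 1) / (num_examples + 2)
        total += math.log(p_present if word in email_words else 1.0 - p_present)
    return total

def classify(email_words, spam_model, ham_model, vocab):
    # Each model is a (prior, num_examples, word_counts) tuple.
    log_spam = log_posterior(email_words, *spam_model, vocab)
    log_ham = log_posterior(email_words, *ham_model, vocab)
    prediction = "spam" if log_spam > log_ham else "ham"
    return prediction, log_spam, log_ham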
It is not realistic to think that your classifier will perform perfectly. After all, the only information your classifier will have to work with is which of the vocabulary words are present in an email. Even more sophisticated spam classifiers make mistakes, too! Therefore, do not be concerned if the program reports a classifier accuracy below 100%; that does not necessarily mean your program is doing something wrong.
<SUBJECT>
a single line with the subject of email 1 (might be blank)
</SUBJECT>
<BODY>
line 1 of the body
line 2 of the body
more lines.... possibly blank
</BODY>
<SUBJECT>
a single line with the subject of email 2 (might be blank)
</SUBJECT>
<BODY>
line 1 of the body
line 2 of the body
more lines.... possibly blank
</BODY>
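A file in this format can be read one email at a time; the sketch below assumes whitespace tokenization and lowercasing, which is just one reasonable choice:

def read_emails(filename):
    # Return a list of emails from one training or testing file, each
    # represented as the set of words in its subject and body.
    emails, words = [], []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line in ("<SUBJECT>", "</SUBJECT>", "<BODY>"):
                continue                    # markers themselves carry no words
            elif line == "</BODY>":
                emails.append(set(words))   # one complete email
                words = []
            else:
                words.extend(line.lower().split())
    return emails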
TEST 144 66/78082 features true -344.684 -331.764 ham right

I'm asking for this specific format because I will use the "diff" utility to compare your output against the correct output, so the format has to match exactly.
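If it helps, a line like that could be produced with something like the sketch below; the field meanings assumed here (email number, number of vocabulary words present out of the vocabulary size, the two log-probabilities, the predicted class, and right/wrong) are inferred from the sample line, so check the result against the correct output with diff before trusting it:

def test_line(index, num_present, vocab_size, log_spam, log_ham, prediction, correct):
    # Field order and number formatting are inferred from the sample line; verify with diff.
    return (f"TEST {index} {num_present}/{vocab_size} features true "
            f"{log_spam:.3f} {log_ham:.3f} {prediction} "
            f"{'right' if correct else 'wrong'}")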
At the very end of your output, report the number of emails in the testing set that were classified correctly.
You can have other things in the output, such as diagnostic information about priors or likelihoods or anything you want that helps you debug your project, as long as the specific output lines above are there.
To summarize: your output must contain the TEST lines and, at the end, the number of emails classified correctly; any other diagnostic messages are optional.
train-spam-small.txt
train-ham-small.txt
test-spam-small.txt
test-ham-small.txt
[ output ]
[ output with only the TEST lines ]
Useful statistics: vocab size = 3, priors = 0.6 and 0.4, 2 out of 2 test set emails classified correctly.
train-spam.txt
train-ham.txt
test-spam.txt
test-ham.txt
[ output ]
[ output with only the TEST lines ]
Useful statistics: vocab size = 78082, priors = 0.7866966480154788 and 0.21330335198452124, 1179 out of 1551 test set emails classified correctly.
For extra credit (up to 10 percentage points), add additional features to the model, or change the existing features as you see fit, to try to improve the accuracy of the classification. You should still use the Naive Bayes setup, so don't muck around with the math, but you can add new features such as the length of the email, whether words appear in the subject vs. the body, word counts instead of just presence/absence, etc.
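As one illustration only (these particular features are my examples, not requirements), extra features can be encoded as pseudo-words so that the rest of the Naive Bayes machinery stays unchanged:

def extra_features(subject_words, body_words):
    # Hypothetical extra-credit features, each encoded as a pseudo-word
    # so it can be tabulated and used exactly like a vocabulary word.
    feats = set()
    # Bucket the body length into a small number of discrete features.
    n = len(body_words)
    if n < 50:
        feats.add("__LEN_SHORT__")
    elif n < 300:
        feats.add("__LEN_MEDIUM__")
    else:
        feats.add("__LEN_LONG__")
    # Treat a word in the subject as a distinct feature from the same
    # word in the body.
    feats |= {"SUBJ::" + w for w in subject_words}
    return feats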
If you attempt the extra credit, turn in a separate copy of your program for this part, along with a writeup explaining what features you added, why you thought they would help, and sample output showing whether they did (report how the classification accuracy changed).