Spam or Ham with Naive Bayes


Introduction
In this assignment you will implement a Naive Bayes classifier and use it for spam filtering
on a text message data set.

Data Set
The data set you will train and evaluate your classifier on contains 5574 SMS text messages,
747 of which are spam. You will use three files, train.txt, dev.txt, and test.txt,
containing training, development, and testing data respectively.

Data File Format
Each line in a data file represents one input/output sample. A line contains two fields,
separated by a tab character (“\t”). The first field contains the output label, “spam” or “ham”. The
second field contains the actual message text. For example,
spam Congratulations ur awarded either £500 of CD gift vouchers & Free entry 2 our £100 weekly draw txt M
USIC to 87066 TnCs www.Ldew.com 1 win150ppmx3age16
ham Just hopeing that wasn‘t too pissed up to remember and has gone off to his sisters or something!

Getting Started
Download the file homework3.py. You will complete the class NbClassifier in this file,
which represents a Naive Bayes classifier.
You will eventually run this file as follows, to train a model on the training data and test it
on the validation data.
$ python homework3.py train.txt dev.txt
Precision:0.0 Recall:0.0 F-Score:0.0 Accuracy:0.0
Obviously, instead of 0.0, your program should eventually print the actual results once you
are done with this problem.
Part 1 – Extracting Attributes


The attributes you will use for the classifier will be individual words in the text.
• First, write the function extract_words, which should read in the text of an SMS message
and return a list of individual words (strings). Pre-processing is fairly important for text
classification tasks. For now, simply lower-case the text, remove any punctuation, and
then split it on whitespace. Make sure to remove any leading and trailing whitespace
from each word. We will improve the pre-processing step later.
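As a concrete illustration of the pre-processing described above, here is one possible sketch of extract_words, assuming Python's string.punctuation is an acceptable definition of “punctuation” for this data:

```python
import string

def extract_words(text):
    # Lower-case, strip punctuation, then split on whitespace.
    # split() without arguments also discards leading/trailing
    # whitespace, so no extra strip() per word is needed.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(extract_words("  Congratulations! You WON a prize.  "))
# ['congratulations', 'you', 'won', 'a', 'prize']
```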
• Next, write the method collect_attribute_types, which should compute a vocabulary
consisting of the set of unique words occurring at least k times in the training data. Start
by setting k to 1. You can tune the value of k later on the validation set. The method
should return nothing, but it should populate the self.attribute_types attribute of the
NbClassifier object.

Part 2 – Training the Classifier

Training a Naive Bayes classifier amounts to estimating two probability distributions from
the training data.
1. The prior for the labels, i.e. P(Label=y) for each label y.
2. The conditional probability for each attribute, given the label, in our case P(Word=w |
Label=y) for each label y and word w that occurs in the vocabulary, i.e.
the self.attribute_types set.
• Write the method train, which should read in the training data (using extract_words to
preprocess it) and then populate the self.label_prior and self.word_given_label
dictionaries. The keys for label_prior should just be the two labels (i.e. “ham” or
“spam”). The keys for word_given_label should be (word, label) tuples, for example
('congratulations', 'spam').
• Compute the conditional probability as P(Word=w|Label=y) = count(w,y) / [total number of words in any training example with label y].
The intuition here is that each word is generated by its own event.
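This count-based estimate can be checked on a toy corpus (two made-up, pre-tokenized messages, for illustration only):

```python
# Toy corpus of two labelled, pre-tokenized messages (made up).
data = [("spam", ["win", "cash", "win"]),
        ("ham", ["see", "you"])]

word_counts = {}   # count(w, y)
total_words = {}   # total number of words in examples with label y
for label, words in data:
    for w in words:
        word_counts[(w, label)] = word_counts.get((w, label), 0) + 1
    total_words[label] = total_words.get(label, 0) + len(words)

# P(Word="win" | Label="spam") = count("win", spam) / total spam words
p = word_counts[("win", "spam")] / total_words["spam"]
print(p)  # 2/3, since "win" occurs twice among the 3 spam words
```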
• Unfortunately, not all word/label combinations appear in the training data.
Your conditional probability estimates should be “smoothed”. You can use the following
simple additive smoothing approach (also known as Laplacian smoothing):
P(Word=w|Label=y) = (count(w,y) + c) / ([total number of words in any training example with label y] + c * [vocabulary size]).
The idea is to add c extra occurrences of each word/label combination, giving a little
probability mass to unseen combinations. Start by setting c to one.
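To make the smoothing formula concrete, here is a small worked example with made-up counts (a three-word vocabulary and hypothetical spam counts, not values from the real data set):

```python
# Hypothetical counts, for illustration only.
c = 1               # smoothing constant
vocab_size = 3      # |V|, e.g. the vocabulary {"win", "cash", "hi"}
count_w_y = 0       # count("hi", "spam"): never observed in spam
total_words_y = 10  # total words over all spam training messages

# Unsmoothed, P("hi" | spam) would be 0/10 = 0, which would zero out
# the probability of any spam message containing "hi".
p_smoothed = (count_w_y + c) / (total_words_y + c * vocab_size)
print(p_smoothed)  # 1/13, about 0.077
```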

Part 3 – Testing the Classifier

• Write the method predict, which should take the text of an SMS as input (as a string). We
want to compute the probability for each label given the words in the string as:
P(y | word1, word2, … wordn) = alpha * P(y) * P(word1 | y) * P(word2 | y) * … * P(wordn | y)
where alpha is a normalization factor to make sure that the probabilities for all labels sum
up to 1.
Because we are only interested in classification, your function can just compute the joint
probability of the label and words:
P(y, word1, word2, … wordn) = P(y) * P(word1 | y) * P(word2 | y) * … * P(wordn | y)
Unfortunately, for long inputs, the probabilities become extremely small. To be able to
still compare the two labels we can run the computation in log space. Instead of the
expression above, compute the following:
log(P(y, word1, word2, … wordn)) = log(P(y)) + log(P(word1 | y)) + log(P(word2 | y)) + … + log(P(wordn | y)).
You might want to import the math module and use the math.log function.
The method should then return a dictionary mapping each label to the log probability for
that label. Here is an example in which the input would be classified as spam:
{“ham”: -197.80923461412195, “spam”: -158.7005945688615}
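The log-space computation can be sketched with made-up model values (the priors and word probabilities below are hypothetical; in your classifier they come from training):

```python
import math

# Made-up model values for a two-word message (hypothetical numbers).
priors = {"ham": 0.87, "spam": 0.13}
p_word = {("free", "ham"): 1e-4, ("prize", "ham"): 2e-5,
          ("free", "spam"): 3e-3, ("prize", "spam"): 1e-3}

scores = {}
for label in priors:
    s = math.log(priors[label])               # log P(y)
    for word in ["free", "prize"]:
        s += math.log(p_word[(word, label)])  # + log P(word | y)
    scores[label] = s

# The label with the larger (less negative) log score wins; the raw
# joint probabilities are already tiny here and shrink with every
# additional word, which is why we compare in log space.
prediction = max(scores, key=scores.get)
print(prediction, scores)
```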
• Write the method evaluate, which takes the filename of the validation or test set as a
parameter and returns (a tuple of) four values: precision for predicting spam, recall for
predicting spam, F-score, and classification accuracy.
In the method, call the predict method for each input sample, and then count how many
true positive, false positive, true negative, and false negative predictions are made (with
respect to predicting “spam”).
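To see how the four metrics follow from the confusion counts, here is a worked example with hypothetical counts (these particular numbers happen to reproduce the k = 1 validation results reported in the solution comment, but any counts work the same way):

```python
# Hypothetical confusion counts with respect to predicting "spam".
tp, fp, fn, tn = 58, 3, 9, 487

precision = tp / (tp + fp)  # of predicted spam, fraction truly spam
recall = tp / (tp + fn)     # of truly spam, fraction we caught
fscore = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, fscore, accuracy)
# roughly 0.9508 0.8657 0.9062 0.9785
```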
• Run your classifier on the validation set dev.txt. You should achieve accuracies in the
high 90% range and an F-score in the low 0.90s.

Part 4 – Tuning the Classifier
The classifier has at least two different parameters that can be tuned: the smoothing
parameter c and the word cutoff k. Experiment with these parameters on the validation set
until you achieve better performance. Your final submission should include a comment at
the beginning of the file, stating your original performance (with the defaults) on the
validation set and your improved performance after tuning, on both the validation and test sets.
Part 5 – Improving Pre-Processing

Our classifier contains features for words like “to”, “it”, or “and” (so-called stop words), which occur frequently in any English text and are probably not very informative for distinguishing between the two classes. A common improvement for Naive Bayes text classification is to remove such stop words from the feature set. The file stopwords_mini.txt contains a very short list of stop words. Modify your code so that the stop word list is read in when NbClassifier is instantiated and stop words are excluded from the feature set. Do not modify the function or method signatures.
You can modify the list of stopwords to improve performance on the dev set.

import sys

def extract_words(text):
    return None  # should return a list of words in the data sample

class NbClassifier(object):

    def __init__(self, training_filename, stopword_file=None):
        self.attribute_types = set()
        self.label_prior = {}
        self.word_given_label = {}
        self.collect_attribute_types(training_filename, 1)
        self.train(training_filename)

    def collect_attribute_types(self, training_filename, k):
        self.attribute_types = set()  # replace this

    def train(self, training_filename):
        self.label_prior = {}  # replace this
        self.word_given_label = {}  # replace this

    def predict(self, text):
        return {}  # replace this

    def evaluate(self, test_filename):
        precision = 0.0
        recall = 0.0
        fscore = 0.0
        accuracy = 0.0
        return precision, recall, fscore, accuracy

def print_result(result):
    print("Precision:{} Recall:{} F-Score:{} Accuracy:{}".format(*result))

if __name__ == "__main__":
    classifier = NbClassifier(sys.argv[1])
    result = classifier.evaluate(sys.argv[2])
    print_result(result)

Solution 

# Performance:
#  Not removing stop words, c = 1, k = 1, validation set:
#   Precision:0.9508196721311475 Recall:0.8656716417910447
#   F-Score:0.9062499999999999 Accuracy:0.9784560143626571
#
#  Not removing stop words, c = 1, k = 3, validation set:
#   Precision:0.9365079365079365 Recall:0.8805970149253731
#   F-Score:0.9076923076923077 Accuracy:0.9784560143626571
#
#  Not removing stop words, c = 1, k = 3, test set:
#   Precision:0.9649122807017544 Recall:0.873015873015873
#   F-Score:0.9166666666666667 Accuracy:0.9820466786355476

import sys
import string
import math

def extract_words(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

class NbClassifier(object):

    def __init__(self, training_filename, stopword_file=None):
        self.attribute_types = set()
        self.label_prior = {}
        self.word_given_label = {}
        self.stopwords = set()
        with open("stopwords_mini.txt") as f:
            for line in f:
                self.stopwords.update(line.split())
        self.collect_attribute_types(training_filename, 3)
        self.train(training_filename)

    def collect_attribute_types(self, training_filename, k):
        counts = {}
        with open(training_filename) as f:
            for line in f:
                words = extract_words(line)
                for word in words[1:]:
                    if word in self.stopwords:
                        continue
                    counts[word] = counts.get(word, 0) + 1
                    # Add the word exactly once, when its count reaches k.
                    if counts[word] == k:
                        self.attribute_types.add(word)

    def train(self, training_filename):
        c = 1
        # Counts for (word, label) tuples.
        word_counts = dict()
        # Counts for labels.
        label_counts = dict()
        # Total number of words in any training example, by label.
        lw_counts = dict()
        with open(training_filename) as f:
            for line in f:
                words = extract_words(line)
                label = words[0]
                # Remove the label and words which are not in the
                # attribute_types set.
                words = [w for w in words[1:] if w in self.attribute_types]
                # Accumulate word counts.
                for word in words:
                    word_counts[(word, label)] = word_counts.get((word, label), 0) + 1
                lw_counts[label] = lw_counts.get(label, 0) + len(words)
                label_counts[label] = label_counts.get(label, 0) + 1
        # Calculate smoothed conditional probabilities.
        for word in self.attribute_types:
            for label in label_counts:
                n = word_counts.get((word, label), 0) + c
                d = lw_counts[label] + c * len(self.attribute_types)
                self.word_given_label[(word, label)] = n / d
        # Calculate label priors.
        total = sum(label_counts.values())
        for label in label_counts:
            self.label_prior[label] = label_counts[label] / total

    def predict(self, text):
        result = {}
        for label in self.label_prior:
            s = math.log(self.label_prior[label])
            for word in extract_words(text):
                p = self.word_given_label.get((word, label), 0)
                # Words outside the vocabulary are skipped.
                if p:
                    s += math.log(p)
            result[label] = s
        return result

    def evaluate(self, test_filename):
        # Number of true positives.
        tpositives = 0
        # Number of false positives.
        fpositives = 0
        # Number of false negatives.
        fnegatives = 0
        # Number of true negatives.
        tnegatives = 0
        with open(test_filename) as f:
            for line in f:
                words = extract_words(line)
                label = words[0]
                result = self.predict(" ".join(words[1:]))
                prediction = max(result, key=result.get)
                if label == "spam":
                    if prediction == "spam":
                        tpositives += 1
                    else:
                        fnegatives += 1
                else:
                    if prediction == "spam":
                        fpositives += 1
                    else:
                        tnegatives += 1
        precision = tpositives / (tpositives + fpositives)
        recall = tpositives / (tpositives + fnegatives)
        fscore = 2 * precision * recall / (precision + recall)
        accuracy = (tpositives + tnegatives) / (fpositives + fnegatives + tpositives + tnegatives)
        return precision, recall, fscore, accuracy

def print_result(result):
    print("Precision:{} Recall:{} F-Score:{} Accuracy:{}".format(*result))

if __name__ == "__main__":
    classifier = NbClassifier(sys.argv[1])
    result = classifier.evaluate(sys.argv[2])
    print_result(result)