Find Most Common Words and Length of Words

  1. Word counting in a file

Your goal is to write a function that takes a file handle as input and returns the number of words in the file that are one letter long, two letters, three letters, and so on up to the longest word length. Assume that words are separated by spaces.

Test this function with a program that takes a filename as input and writes the word-length distribution to a file called "filename_sizeDistribution".

Example: for the input file test.txt, the output file is test_sizeDistribution.txt.
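
To make the expected contract concrete, here is a minimal sketch under the stated assumptions (whitespace-separated words); the sample text, the io.StringIO fake file handle, and the helper name length_distribution are illustrative only, and the full solution appears in p1.py below:

import io

def length_distribution(handle):
    """ Minimal sketch: map word length -> number of words of that length. """
    freq = {}
    for line in handle:
        for word in line.split():
            freq[len(word)] = freq.get(len(word), 0) + 1
    return freq

# made-up one-line file: two 3-letter words and three 5-letter words
print(length_distribution(io.StringIO("the quick brown fox jumps")))
# -> {3: 2, 5: 3}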

  2. Most common words

Your goal is to write a Python function that takes a file handle as input and returns the most common words in the text file. Your program should first build a Python dictionary that tracks the number of occurrences of every word in the book. Assume that words are separated by spaces.

Test this with a program that takes a filename as input and prints i) the 5 most common words and ii) the 5 most common words of length greater than 5. Your code should print the results for the 3 sample files.
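
The dictionary-based solution appears in p2.py below. As a more compact alternative sketch, collections.Counter can do both the counting and the selection directly; the helper name most_common_words and its parameters are made up for illustration:

from collections import Counter

def most_common_words(handle, n=5, min_len=0):
    """ Return the n most common words whose length is greater than min_len. """
    counts = Counter(word
                     for line in handle
                     for word in line.split()
                     if len(word) > min_len)
    return counts.most_common(n)

Calling most_common_words(handle, 5) covers part i); since a file handle can only be iterated once, reopen the file (or seek back to the start) before calling most_common_words(handle, 5, min_len=5) for part ii).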

  3. Uniqueness of each book

We will use a very simple definition of the uniqueness of a book: the number of unique words that occur in the book but do not occur in any of the other books, as a percentage of the total number of words in the book. For example, if a book contains 10,000 distinct words and 1,500 of them appear in no other book, its uniqueness is 15%.
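
The computation reduces to set operations, as in this minimal sketch (the helper name uniqueness and the sample sets are made up; like the p3.py solution below, it measures the percentage over the book's distinct words):

def uniqueness(book_words, other_books):
    """ Fraction of the distinct words in book_words found in no other book. """
    others = set().union(*other_books)  # every word from the other books
    unique = book_words - others        # words that occur only in this book
    return len(unique) / len(book_words) if book_words else 0.0

# tiny made-up example: only "whale" is unique to the first book
print(uniqueness({"whale", "ship", "sea"}, [{"ship", "sea", "river"}]))  # 0.333...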

Solution 

p1.py

def count_len_freq(handle):
    """ Return the number of words in the file for each word length. """
    freq = {}
    for line in handle:
        words = line.split()
        for word in words:
            # increase the count for the corresponding length
            freq[len(word)] = freq.get(len(word), 0) + 1
    return freq

if __name__ == '__main__':
    # test program: ask the user for the input file name
    # and save the output to a file
    import os

    filename = input('Enter the name of the input file (.txt): ')
    # split the file name into base and extension and construct
    # the name of the output file
    base, extension = os.path.splitext(filename)
    output = base + "_sizeDistribution" + extension
    # open the files
    handle1 = open(filename, 'r', encoding='ISO-8859-2')
    handle2 = open(output, 'w')
    counts = count_len_freq(handle1)
    # write the distribution in increasing order of word length
    for length, count in sorted(counts.items()):
        handle2.write("size %d: %d\n" % (length, count))
    handle1.close()
    handle2.close()
    print("output is saved into file", output)
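
A hypothetical session, assuming test.txt contains the single line "the quick brown fox jumps" (file contents made up for illustration):

Enter the name of the input file (.txt): test.txt
output is saved into file test_sizeDistribution.txt

with test_sizeDistribution.txt then containing:

size 3: 2
size 5: 3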

p2.py

def count_word_freq(handle):
    """ Return the counts of the words in the file. Words are case-sensitive.
    The result is a list of (count, word) pairs in decreasing order of count. """
    freq = {}
    for line in handle:
        words = line.split()
        for word in words:
            # increase the count for the corresponding word
            freq[word] = freq.get(word, 0) + 1
    # convert the dictionary into a list so we can sort it by the counts
    freq = [(v, k) for k, v in freq.items()]
    freq.sort(reverse=True)
    return freq

if __name__ == '__main__':
    # test program: ask the user for the input file name
    filename = input('Enter the name of the input file (.txt): ')
    handle = open(filename, 'r', encoding='ISO-8859-2')
    freq = count_word_freq(handle)

    # display the 5 most common words
    print("\n\n5 most common words in file %s\n" % (filename))
    for i in range(min(5, len(freq))):
        print("%-4d %s" % (freq[i][0], freq[i][1]))

    # display the 5 most common words of length greater than 5
    print("\n\n5 most common words of length greater than 5\n")
    k = 0
    for count, word in freq:
        if k >= 5:
            break
        if len(word) > 5:  # the word's length is more than 5
            print("%-4d %s" % (count, word))
            k = k + 1
    handle.close()
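
When only the top few words are needed, the full sort in count_word_freq can be avoided: heapq.nlargest selects the k largest items in O(n log k) time. A sketch of this variant (the helper name top_words is made up; freq_dict is the word-to-count dictionary built inside count_word_freq before it is converted to a list):

import heapq

def top_words(freq_dict, n=5, min_len=0):
    """ n most frequent words longer than min_len, without a full sort. """
    candidates = ((count, word) for word, count in freq_dict.items()
                  if len(word) > min_len)
    return heapq.nlargest(n, candidates)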

p3.py

def get_word_set(handle):
    """ Return the set of words in the file. Words are case-sensitive. """
    value = set()
    for line in handle:
        words = line.split()
        for word in words:
            # add the word to the word set
            value.add(word)
    return value

def get_percent_unique(i, sets):
    """ Return the fraction of the words in sets[i] that occur in no other set. """
    words = sets[i]     # the words to check
    allwords = set()    # all words except the words in sets[i]
    for j in range(len(sets)):
        if j != i:
            allwords |= sets[j]
    # count the words that are not in allwords (the unique words)
    count = 0
    for word in words:
        if word not in allwords:
            count = count + 1
    # calculate and return the fraction of unique words
    if len(words) > 0:
        return count / len(words)
    return 0

if __name__ == '__main__':
    # test program: ask the user for the input file names
    count = int(input("Enter the number of files to check: "))
    # read the file names and collect the word set for each file
    filenames = []
    wordsets = []
    for i in range(count):
        filename = input("Enter the file name (.txt): ")
        # open the file and get the set of words from it
        handle = open(filename, 'r', encoding='ISO-8859-2')
        words = get_word_set(handle)
        handle.close()
        # save the file name and the word set into the lists
        filenames.append(filename)
        wordsets.append(words)
    # display the uniqueness of each book
    print("\n\nUniqueness of each book:\n")
    for i in range(len(wordsets)):
        uniqueness = get_percent_unique(i, wordsets)
        print("%-30s: %.2f%%" % (filenames[i], uniqueness * 100))
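
A quick non-interactive check of get_percent_unique with made-up word sets (assumes the function above is in scope):

books = [{"whale", "ship", "sea"}, {"ship", "sea", "river"}]
print("%.2f%%" % (get_percent_unique(0, books) * 100))  # 33.33% ("whale")
print("%.2f%%" % (get_percent_unique(1, books) * 100))  # 33.33% ("river")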