Word Frequency Analysis

This exercise has been adapted from Think Python Ch. 13.1. The goal of this toolbox exercise will be to write a Python program that can automatically analyze the linguistic characteristics of a book. Along the way we will learn a bit about reading files.

Get Set

Grab the starter code for this toolbox exercise via the normal fork-and-clone method from https://github.com/olin-toolboxes/ToolBox-WordFrequency

The starter code will be in frequency.py.

Download your favorite book

Go to Project Gutenberg and download your favorite out-of-copyright book in plain text format. The file pg32325.txt has been placed in the word_frequency_analysis directory to give you an example of the type of file you should download.

Complete the declared function get_word_list

The function should read the specified Project Gutenberg text file, strip out whitespace, header comments, and punctuation and return a list of all words in the book in order. In addition, the words should all be converted to lowercase.

Hints:

“The string module provides strings named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:”

>>> import string
>>> print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Also, you might consider using the string methods strip, replace, split, and translate. These are documented in the string methods section of the Python standard library documentation.
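As a small illustration of how these methods fit together, the sketch below deletes punctuation from a sample sentence with translate, lowercases it, and splits it into words. The helper name clean_words and the sample sentence are made up for this example and are not part of the starter code; treat it as one possible approach rather than the required one.

```python
import string

def clean_words(text):
    """Return the lowercase words in text with punctuation removed.

    clean_words is a hypothetical helper for illustration only.
    """
    # Build a translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    # split() with no arguments splits on any run of whitespace
    return text.translate(table).lower().split()

sample = "Hello, World! It's a test."
print(clean_words(sample))  # ['hello', 'world', 'its', 'a', 'test']
```

Note that deleting punctuation turns "It's" into "its"; whether that is acceptable depends on how you want to treat contractions.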

More Hints:

The first step is loading the file and stripping away the header comment. Here is some code that does just this and stores the resultant list of lines in a variable called lines. Make sure you understand what it is doing, and modify it if you need to:

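with open(file_name, 'r') as f:  # the with block closes the file automatically
  lines = f.readlines()
# advance to the line that marks the end of the header
curr_line = 0
while lines[curr_line].find('START OF THIS PROJECT GUTENBERG EBOOK') == -1:
  curr_line += 1
# keep only the lines after the marker
lines = lines[curr_line+1:]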
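To see what the header-stripping loop does without opening a file, here is a minimal sketch that applies the same idea to a short in-memory list of lines. The function name skip_header and the sample lines are assumptions made for this example. Unlike the loop above, which raises an IndexError when no line contains the marker, this version returns the lines unchanged in that case. If your file words the marker differently, pass the text your file uses.

```python
def skip_header(lines, marker='START OF THIS PROJECT GUTENBERG EBOOK'):
    """Return the lines that follow the first line containing marker.

    skip_header is a hypothetical helper for illustration only.
    If no line contains the marker, all lines are returned unchanged.
    """
    for index, line in enumerate(lines):
        if marker in line:
            # Everything after the marker line is the body of the book
            return lines[index + 1:]
    return lines

# Made-up sample standing in for the result of f.readlines()
sample_lines = [
    'Some license text\n',
    '*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'First line of the book\n',
    'Second line of the book\n',
]
print(skip_header(sample_lines))
# ['First line of the book\n', 'Second line of the book\n']
```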

Get Top 100 Words

Next, fill out the implementation of the function get_top_n_words. It takes as input the list of words computed by your get_word_list function, finds the n most frequently used words, and returns a list of these words ordered from most to least frequently occurring.

Hints: you will probably want to process the raw list of words into a dictionary where the key is a particular word and the value is the number of times it occurs in the input word_list. Suppose you have created such a dictionary and its name is word_counts. You can sort the words by frequency of occurrence using the Python code:

ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
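For example, the dictionary could be built with a simple loop before sorting. This is a minimal sketch on a made-up word list; the helper name count_words is an assumption for illustration and is not part of the starter code.

```python
def count_words(word_list):
    """Return a dictionary mapping each word to the number of times it occurs.

    count_words is a hypothetical helper for illustration only.
    """
    word_counts = {}
    for word in word_list:
        # get returns 0 the first time a word is seen
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

# Made-up sample standing in for the output of get_word_list
words = ['the', 'cat', 'and', 'the', 'hat', 'and', 'the', 'bat']
word_counts = count_words(words)

# Sort the keys by their counts, highest first
ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
print(ordered_by_frequency[:2])  # ['the', 'and']
```

Slicing the sorted list, as in ordered_by_frequency[:2], is one way to keep only the first n words.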

Finishing your program

Add some code that calls the two functions you just wrote to get the words in your Project Gutenberg text, compute the 100 most frequently occurring words, and print that list. Once you have done this, push your finished code to your repository and submit a pull request to get your toolbox exercise checked off by a NINJA.

Making it Cooler (optional)

If you want to do some more advanced word frequency analysis, try the rest of the exercises in Think Python 13.1.