A series of programming assignments from the University of Washington course "Introduction to Data Science" (Join the data revolution)

Problem 4: Compute Term Frequency

Write a Python script, frequency.py, to compute the term frequency histogram of the livestream data you harvested in Problem 1. The frequency of a term can be calculated with the following formula: [# of occurrences of the term in all tweets]/[# of occurrences of all terms in all tweets]
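
As a quick illustration of the formula, here is a minimal sketch that computes frequencies for two made-up tweets. The function name term_frequencies and the sample tweets are hypothetical, and splitting on whitespace is only one assumed way of tokenizing.

```python
from collections import Counter


def term_frequencies(tweets):
    """Map each whitespace-separated term to count / total term count."""
    counts = Counter()
    for text in tweets:
        counts.update(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}  # no terms at all, so nothing to report
    return {term: n / total for term, n in counts.items()}


# Two made-up tweets: 5 terms in total, "data" appears twice.
sample = ["data science rocks", "data revolution"]
freqs = term_frequencies(sample)
print(freqs["data"])  # 2 / 5 = 0.4
```

Because every term is counted in both the numerator and the denominator here, the frequencies add up to 1.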

frequency.py should take a file of tweets as an input and be usable in the following way:

      $ python frequency.py <tweet_file>



Assume the tweet file contains data formatted the same way as the livestream data.

Your script should print to stdout each term-frequency pair, one pair per line, in the following format:

      <term:string> <frequency:float>



For example, if you have the pair (bar, 0.1245) it should appear in the output as:

      bar 0.1245
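
One way to produce a line in this format is sketched below. The helper name format_pair is hypothetical, and the fixed four decimal places are an assumption made to match the example; the assignment does not specify a precision.

```python
def format_pair(term, frequency):
    """Return one output line: the term, a space, then the frequency."""
    return "{} {:.4f}".format(term, frequency)


print(format_pair("bar", 0.1245))  # bar 0.1245
```

With a large tweet file many frequencies are very small, so more decimal places (or plain str(frequency)) may be a better fit than four.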



Frequency measurements may take phrases into account, but this is not required. We only ask that you compute frequencies for individual tokens.

Depending on your method of parsing, you may end up with frequencies for hashtags, links, stop words, phrases, etc. Some noise is acceptable for the sake of keeping parsing simple.
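
If you want less noise, one optional approach is to normalize tokens before counting them. The sketch below is not part of the assignment: the helper name normalize_tokens and its rules (lowercasing, dropping links, stripping surrounding punctuation) are illustrative assumptions.

```python
import string


def normalize_tokens(text):
    """Split on whitespace, lowercase, drop links, strip edge punctuation."""
    tokens = []
    for raw in text.split():
        if raw.startswith(("http://", "https://")):
            continue  # skip links
        token = raw.lower().strip(string.punctuation)
        if token:  # drop tokens that were punctuation only
            tokens.append(token)
    return tokens


print(normalize_tokens("Loving #DataScience!! http://example.com/x so much..."))
# ['loving', 'datascience', 'so', 'much']
```

Note that stripping punctuation also removes the leading # from hashtags, so a hashtag and the plain word are counted as the same term.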

import sys
import json


def test(tf):
    # Collect the "text" field of every tweet; the file has one JSON object per line.
    texts = []
    for line in tf:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        # Records without a "text" field are skipped.
        if "text" in tweet:
            texts.append(tweet["text"])
    return texts


def calc(texts):
    total_words = 0.0
    words = {}
    for text in texts:
        for word in text.split():
            # The total counts every token, including ones filtered out below.
            total_words += 1
            if word in words:
                words[word] += 1.0
            elif word.isalnum() or "," in word:
                words[word] = 1.0

    # Print one "<term> <frequency>" pair per line.
    for word, count in words.items():
        print(word + " " + str(count / total_words))


def main():
    with open(sys.argv[1], encoding="utf-8") as tweet_file:
        calc(test(tweet_file))


if __name__ == '__main__':
    main()
