Series of programming assignments from "Introduction to Data Science" course - Join the data revolution by University of Washington
Problem 4: Compute Term Frequency
Write a Python script, frequency.py, to compute the term frequency histogram of the livestream data you harvested from Problem 1. The frequency of a term can be calculated with the following formula: [# of occurrences of the term in all tweets] / [# of occurrences of all terms in all tweets]
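As a small worked example of the formula, the sketch below computes frequencies for two made-up tweets. The function name term_frequencies and the sample data are hypothetical and not part of the assignment. It assumes Python 3 and plain whitespace tokenization.

```python
from collections import Counter


def term_frequencies(tweets):
    """Map each term to (its count) / (count of all terms) over all tweets."""
    counts = Counter()
    for text in tweets:
        # Whitespace tokenization: an assumption made to keep the sketch simple.
        counts.update(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Hypothetical sample data: 5 terms in total, "bar" occurs 3 times.
sample_tweets = ["foo bar", "bar baz bar"]
freqs = term_frequencies(sample_tweets)
print(freqs["bar"])  # 3 occurrences out of 5 terms
```

Because every token is counted in both the numerator and the denominator, the frequencies in this sketch sum to 1.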
frequency.py should take a file of tweets as an input and be usable in the following way:
$ python frequency.py <tweet_file>
Assume the tweet file contains data formatted the same way as the livestream data.
Your script should print to stdout each term-frequency pair, one pair per line, in the following format:
<term:string> <frequency:float>
For example, if you have the pair (bar, 0.1245) it should appear in the output as:
bar 0.1245
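One possible way to produce lines in this format is sketched below. The helper name format_pairs is hypothetical, and sorting by term is only a choice made here to get a stable order. The assignment does not ask for any particular ordering.

```python
def format_pairs(freqs):
    """Return one "<term> <frequency>" line per pair, sorted by term."""
    return ["{} {}".format(term, freq) for term, freq in sorted(freqs.items())]


# Hypothetical frequencies, including the (bar, 0.1245) pair from the example.
lines = format_pairs({"foo": 0.5, "bar": 0.1245})
print("\n".join(lines))
```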
Frequency measurements may take phrases into account, but this is not required. We only ask that you compute frequencies for individual tokens.
Depending on your method of parsing, you may end up with frequencies for hashtags, links, stop words, phrases, etc. Some noise is acceptable for the sake of keeping parsing simple.
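To illustrate the kind of noise a simple parser lets through, here is a rough sketch comparing plain whitespace splitting with an optional cleanup pass. The tokenize function, its normalize flag, and the sample tweet are assumptions made for illustration. The assignment does not require any normalization.

```python
import string


def tokenize(text, normalize=False):
    """Split on whitespace; optionally lowercase and trim edge punctuation."""
    tokens = text.split()
    if not normalize:
        return tokens
    # Trimming punctuation also removes the leading '#' from hashtags.
    cleaned = (t.lower().strip(string.punctuation) for t in tokens)
    return [t for t in cleaned if t]


tweet = "Loving the #DataScience course, really!"  # hypothetical tweet text
raw_tokens = tokenize(tweet)
clean_tokens = tokenize(tweet, normalize=True)
print(raw_tokens)
print(clean_tokens)
```

With plain splitting, "course," and "really!" keep their punctuation and count as terms distinct from "course" and "really", which is the sort of noise the assignment tolerates.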
# frequency.py (Python 3)
import sys
import json


def test(tf):
    # Collect the "text" field of every tweet; the file holds one JSON object per line.
    texts = []
    for line in tf:
        if not line.strip():
            continue  # skip blank lines in the stream
        tweet = json.loads(line)
        if "text" in tweet:
            texts.append(tweet["text"])
    return texts


def calc(texts):
    # total_words counts every whitespace-separated token, including tokens
    # that the filter below keeps out of the table, so the printed
    # frequencies can sum to less than 1.
    total_words = 0.0
    words = {}
    for text in texts:
        for word in text.split():
            total_words += 1
            if word in words:
                words[word] += 1.0
            elif word.isalnum() or "," in word:
                # Track only alphanumeric tokens or tokens containing a comma.
                words[word] = 1.0
    for word, count in words.items():
        print(word + " " + str(count / total_words))


def main():
    with open(sys.argv[1], encoding="utf-8") as tweet_file:
        calc(test(tweet_file))


if __name__ == '__main__':
    main()