Series of programming assignments from "Introduction to Data Science: Join the Data Revolution", a course by the University of Washington.

Problem 4: Compute Term Frequency

Write a Python script, frequency.py, to compute the term-frequency histogram of the livestream data you harvested in Problem 1. The frequency of a term can be calculated with the following formula:

[# of occurrences of the term in all tweets] / [# of occurrences of all terms in all tweets]
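As a sanity check on the formula, here is a minimal sketch using a few made-up "tweets" (the strings below are illustrative, not real livestream data):

```python
from collections import Counter

# Toy example: three short "tweets", tokenized by splitting on whitespace.
tweets = ["data science is fun", "data is everywhere", "fun with data"]

counts = Counter()
for tweet in tweets:
    counts.update(tweet.split())

total = sum(counts.values())  # 10 tokens across all tweets
for term, count in counts.items():
    print(term, count / total)
```

Here "data" occurs 3 times out of 10 tokens, so its frequency is 0.3.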

frequency.py should take a file of tweets as an input and be usable in the following way:

```
$ python frequency.py <tweet_file>
```

Assume the tweet file contains data formatted the same way as the livestream data.
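The livestream file holds one JSON object per line, and not every object is a tweet (delete notices and rate-limit messages carry no "text" field), so each line should be parsed and checked before use. A minimal sketch, with a hypothetical input line:

```python
import json

# A hypothetical line from the tweet file: one JSON object per line,
# as produced by the livestream harvester in Problem 1.
line = '{"text": "Join the data revolution", "lang": "en"}'

tweet = json.loads(line)
if "text" in tweet:  # skip objects without a "text" field
    tokens = tweet["text"].split()
    print(tokens)
```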

Your script should print to stdout each term-frequency pair, one pair per line, in the following format:

```
<term:string> <frequency:float>
```

For example, if you have the pair (bar, 0.1245) it should appear in the output as:

```
bar 0.1245
```

Frequency measurements may take phrases into account, but this is not required. We only ask that you compute frequencies for individual tokens.

Depending on your method of parsing, you may end up with frequencies for hashtags, links, stop words, phrases, etc. Some noise is acceptable for the sake of keeping parsing simple.

```
import sys
import json


def extract_texts(tweet_file):
    """Collect the "text" field of every tweet in the file."""
    texts = []
    for line in tweet_file:
        tweet = json.loads(line)
        if "text" in tweet:  # skip delete notices and other objects without text
            texts.append(tweet["text"])
    return texts


def calc(texts):
    """Count every whitespace-delimited token, then print term-frequency pairs."""
    total_words = 0
    counts = {}
    for text in texts:
        for word in text.split():
            total_words += 1
            counts[word] = counts.get(word, 0) + 1

    for word, count in counts.items():
        print(word, count / total_words)


def main():
    with open(sys.argv[1]) as tweet_file:
        calc(extract_texts(tweet_file))


if __name__ == '__main__':
    main()
```