A series of programming assignments from the "Introduction to Data Science" course ("Join the data revolution") by the University of Washington
Problem 6: Top ten hash tags
Write a Python script, top_ten.py, that computes the ten most frequently occurring hash tags from the data you gathered in Problem 1.
top_ten.py should take a file of tweets as input and be usable in the following way:
$ python top_ten.py <tweet_file>
Assume the tweet file contains data formatted the same way as the livestream data.
In the tweet file, each line is a Tweet object, as described in the Twitter documentation. You should not be parsing the “text” field; hashtags are available in the “entities” field instead.
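As a minimal sketch of reading hashtags without touching the “text” field, the snippet below pulls them from the "hashtags" list under "entities" in one parsed line. The helper name extract_hashtags and the sample line are hypothetical, made up for illustration; the field layout is the one the script in this post relies on.

```python
import json


def extract_hashtags(line):
    """Return the hashtag strings found in one JSON-encoded tweet line.

    extract_hashtags is a hypothetical helper, not part of the assignment.
    """
    tweet = json.loads(line)
    # Some stream lines (e.g. delete notices) carry no "entities" field.
    entities = tweet.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]


# Made-up sample line shaped like a Tweet object.
sample = ('{"text": "hello #baz", '
          '"entities": {"hashtags": [{"text": "baz", "indices": [6, 10]}]}}')
print(extract_hashtags(sample))
```

Falling back to an empty dict keeps the helper from raising on lines that are valid JSON but are not tweets.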
Your script should print to stdout each hashtag-count pair, one per line, in the following format:
<hashtag:string> <count:float>
For example, if you have the pair (baz, 30) it should appear in the output as:
baz 30.0
Remember your output must contain floats, not ints.
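One way to meet the float requirement is to convert each count just before printing. The sketch below uses collections.Counter from the standard library; format_top and the sample hashtags are hypothetical names and data, used for illustration only.

```python
from collections import Counter


def format_top(hashtags, n=10):
    """Return 'tag count' lines for the n most common hashtags, counts as floats.

    format_top is a hypothetical helper, not part of the assignment.
    """
    counts = Counter(hashtags)
    # most_common(n) yields (tag, count) pairs, highest count first.
    return ["{} {}".format(tag, float(count))
            for tag, count in counts.most_common(n)]


# Made-up data for illustration.
lines = format_top(["baz", "foo", "baz", "baz", "foo", "bar"])
for line in lines:
    print(line)
```

Converting with float() at output time keeps the counting itself in integers, so only the printed form changes.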
import sys
import json


def test(tf):
    """Print the ten most frequent hashtags found in the tweet file tf."""
    tweets = []
    for line in tf:
        line = line.strip()
        if not line:
            continue  # skip blank lines between tweets
        y = json.loads(line)
        # Read hashtags from "entities" rather than parsing the "text" field.
        if "entities" in y and y["entities"]["hashtags"]:
            for tag in y["entities"]["hashtags"]:
                # Keep only purely alphanumeric hashtags.
                if tag["text"].isalnum():
                    tweets.append(tag["text"])

    # Count how often each hashtag occurs.
    newTweets = {}
    for i in tweets:
        newTweets[i] = newTweets.get(i, 0) + 1

    # Sort hashtags by count, highest first, and keep the first ten.
    topTen = [(w, newTweets[w])
              for w in sorted(newTweets, key=newTweets.get, reverse=True)]
    topTen = topTen[0:10]

    for (tag, count) in topTen:
        # The assignment asks for float counts, e.g. "baz 30.0".
        print(tag + " " + str(float(count)))


def sfDict(sf):
    """Parse a tab-separated file into a list of (column 1, column 2) pairs.

    Not called by this script.
    """
    x = []
    for s in sf:
        y = s.split("\t")
        x.append((y[0], y[1]))
    return x


def check(decodedText, op):
    """Print the state whose tweets have the highest total score.

    Not called by this script. decodedText is a list of (tweet, state)
    pairs; op is a list of (term, score) pairs, such as those returned
    by sfDict.
    """
    states = {}
    for (tweet, st) in decodedText:
        val = 0.0
        for (x, y) in op:
            # Match the term when it is followed or preceded by a space.
            if (x + " ") in tweet or (" " + x) in tweet:
                val = val + float(y)
        states[st] = states.get(st, 0.0) + val

    x = 0.0
    finalState = ""
    for key, value in states.items():
        if value > x:
            finalState = key
            x = value
    print(finalState)


def main():
    with open(sys.argv[1]) as tweet_file:
        test(tweet_file)


if __name__ == '__main__':
    main()