A series of programming assignments from the "Introduction to Data Science" course ("Join the Data Revolution") by the University of Washington

  • Problem 0: Query Twitter with Python
  • Problem 1: Get Twitter Data
  • Problem 2: Derive the sentiment of EACH tweet

Problem 0: Query Twitter with Python

To retrieve recent tweets associated with the term "microsoft", you use this URL:

http://search.twitter.com/search.json?q=microsoft

To access this url in Python and parse the response, you can use the following snippet:

import urllib
import json

response = urllib.urlopen("http://search.twitter.com/search.json?q=microsoft")
print json.load(response)

The format of the result is JSON, which stands for JavaScript Object Notation. It is a simple format for representing nested structures of data: lists of lists of dictionaries of lists of... you get the idea. As you might imagine, it is fairly straightforward to convert JSON data into a Python data structure. Indeed, there is a convenient library to do so, called json, which we will use.
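For instance, the json library turns a JSON string into nested Python structures (objects become dictionaries, arrays become lists; the sample data here is made up for illustration):

```python
import json

# Parse a JSON document into nested Python structures.
data = json.loads('{"results": [{"text": "hi", "id": 1}]}')
print(type(data).__name__)          # dict
print(data["results"][0]["text"])   # hi
```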

Twitter provides only partial documentation for understanding this data format, but it's not difficult to deduce the structure.

Using this library, the JSON data is parsed and converted to a Python dictionary representing the entire result set. (If needed, take a moment to read the documentation for Python dictionaries.) The "results" key of this dictionary holds the actual tweets; each tweet is itself another dictionary.

a) Write a program, print.py, to print out the text of each tweet in the result.

b) Generalize your program, print.py, to fetch and print 10 pages of results. Note that you can return a different page of results by passing an additional argument in the url:

http://search.twitter.com/search.json?q=microsoft&page=2

print.py should be executable in the following way:

      $ python print.py



When executed, the script should print each tweet on its own line to stdout.

What to turn in: Nothing. This is a warmup exercise.
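One way to structure print.py is sketched below, assuming the result shape described above (the search.twitter.com endpoint has since been retired, and the helper names here are illustrative, not part of the handout):

```python
import json

def page_url(query, page):
    # Build the legacy search URL for one page of results.
    return "http://search.twitter.com/search.json?q=%s&page=%d" % (query, page)

def tweet_texts(result):
    # The "results" key holds the tweets; each tweet is a dict with a "text" field.
    return [tweet["text"] for tweet in result.get("results", [])]

def fetch_all(query, pages=10):
    # Fetch and print `pages` pages of results.
    # (Requires the legacy search API and Python 2's urllib.urlopen.)
    import urllib
    for page in range(1, pages + 1):
        response = urllib.urlopen(page_url(query, page))
        for text in tweet_texts(json.load(response)):
            print(text)
```

Keeping the URL construction and the parsing in separate helpers makes part (b) a one-line loop over page numbers.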

Problem 1: Get Twitter Data

To access the live stream, you will need to install the oauth2 library so you can properly authenticate.

This library is already installed on the class virtual machine. Or you can install it yourself in your Python environment.

The steps below will help you set up your twitter account to be able to access the live 1% stream.

● Create a twitter account if you do not already have one.

● Go to https://dev.twitter.com/apps and log in with your twitter credentials.

● Click "create an application"

● Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.

● On the next page, scroll down and click "Create my access token"

● Copy your "Consumer key" and your "Consumer secret" into twitterstream.py

● You can read more about OAuth authorization in Twitter's developer documentation.

● Open twitterstream.py and set the variables corresponding to the consumer key, consumer secret, access token, and access secret.

access_token_key = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""

● Run the following and make sure you see data flowing and that no errors occur. Stop the program with Ctrl-C once you are satisfied.

$ python twitterstream.py

You can pipe the output to a file, wait a few minutes, then terminate the program to generate a sample. Use the following command:

$ python twitterstream.py > output.txt

Let this script run for a minimum of 10 minutes. Keep the file output.txt for the duration of the assignment; we will reuse it in later problems.

Don't use someone else's file; we will check for uniqueness in other parts of the assignment.

What to turn in: The first 20 lines of your file. You can get the first 20 lines by using the following command:

$ head -n 20 output.txt
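Before moving on, a quick sanity check can confirm the sample looks right. The sketch below (the function name is illustrative) assumes the stream writes one JSON object per line and counts how many lines parse as actual tweets, i.e. objects carrying a "text" field:

```python
import json

def count_tweets(path):
    # Count non-blank lines, and how many of them parse as JSON objects
    # with a "text" field (tweets, as opposed to e.g. delete notices).
    total = tweets = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += 1
            try:
                obj = json.loads(line)
            except ValueError:
                continue
            if isinstance(obj, dict) and "text" in obj:
                tweets += 1
    return total, tweets
```

If the tweet count is near zero after ten minutes, something is wrong with your credentials or connection.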

Problem 2: Derive the sentiment of each tweet

For this part, you will compute the sentiment of each tweet based on the sentiment scores of the terms in the tweet. The sentiment of a tweet is the sum of the sentiment scores of its terms.

You are provided with a skeleton file, tweet_sentiment.py, which takes the sentiment file and the tweet file as arguments and can be executed using the following command:

$ python tweet_sentiment.py <sentiment_file> <tweet_file>

The file AFINN-111.txt contains a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase found in a tweet but not in AFINN-111.txt should be given a sentiment score of 0. See the file AFINN-README.txt for more information.

To use the data in the AFINN-111.txt file, you may find it useful to build a dictionary. Note that the file is tab-delimited, meaning that the term and the score are separated by a tab character, written "\t" in Python. The following snippet may be useful:

afinnfile = open("AFINN-111.txt")
scores = {}  # initialize an empty dictionary
for line in afinnfile:
    term, score = line.split("\t")  # the file is tab-delimited; "\t" means "tab character"
    scores[term] = int(score)  # convert the score to an integer

print scores.items()  # print every (term, score) pair in the dictionary

Assume the tweet file contains data formatted the same way as the livestream data.

Your script should print to stdout the sentiment of each tweet in the file, one sentiment per line:

      <sentiment:float>



NOTE: You must provide a score for every tweet in the sample file, even if that score is zero. (The sample file will only include English tweets.)

The first sentiment corresponds to the first tweet in the input file, the second sentiment corresponds to the second tweet in the input file, and so on.
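As a worked example with a hypothetical two-term score dictionary (not the real AFINN values), a tweet's sentiment is the sum of its terms' scores, with unknown terms contributing 0:

```python
scores = {"good": 3, "bad": -3}  # hypothetical scores, for illustration only

def sentiment(text, scores):
    # Sum the score of each term; terms not in the dictionary score 0.
    return float(sum(scores.get(word, 0) for word in text.split()))

print(sentiment("good day bad traffic", scores))  # 3 + 0 + (-3) + 0 = 0.0
print(sentiment("good good day", scores))         # 3 + 3 + 0 = 6.0
```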

Hints: The json.loads function parses a JSON string into a Python object.

Refer to the twitter documentation in order to determine what field to parse.

import sys
import json

def tweet_texts(tweet_file):
    # Each line of the tweet file is one JSON object; keep only actual
    # tweets, i.e. objects that carry a "text" field.
    texts = []
    for line in tweet_file:
        tweet = json.loads(line)
        if "text" in tweet:
            texts.append(tweet["text"])
    return texts

def sentiment_scores(sent_file):
    # Build a term -> score dictionary from the tab-delimited AFINN file.
    scores = {}
    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)
    return scores

def tweet_sentiment(text, scores):
    # Sum the score of each term in the tweet; terms missing from the
    # dictionary score 0. (Note: a simple whitespace split will not match
    # the multi-word phrases that appear in AFINN-111.txt.)
    return float(sum(scores.get(word, 0) for word in text.split()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    scores = sentiment_scores(sent_file)
    for text in tweet_texts(tweet_file):
        print tweet_sentiment(text, scores)

if __name__ == '__main__':
    main()
