Algorithms in MapReduce1: Inverted Index

Algorithms in MapReduce1: Inverted Index

In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks.

Algorithms in MapReduce: Instructions Help

In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks.

The MapReduce programming model (and a corresponding system) was proposed in a 2004 paper from a team at Google as a simpler abstraction for processing very large datasets in parallel. The goal of this assignment is to give you experience “thinking in MapReduce.” We will be using small datasets that you can inspect directly to determine the correctness of your results and to internalize how MapReduce works. In the next assignment, you will have the opportunity to use a MapReduce-based system to process the very large datasets for which is was designed.

As always, the first thing to do is to update your provided course materials using git pull. These resources may have changed since the last time you interacted with the datasci_course_materials repository. Github Instructions.

Next, review the lectures to make sure you understand the programming model.

You may also want to experiment running your queries on the JSMapReduce service for this assignment to see what it would be like to run a MapReduce query on a cluster of machines.

Python MapReduce Framework

You will be provided with a python library called MapReduce.py that implements the MapReduce programming model. The framework faithfully implements the MapReduce programming model, but it executes entirely on a single machine -- it does not involve parallel computation.

Here is the word count example discussed in class implemented as a MapReduce program using the framework:

# Part 1
mr = MapReduce.MapReduce()
 
# Part 2
def mapper(record):
    # key: document identifier
    # value: document contents
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
      mr.emit_intermediate(w, 1)
 
# Part 3
def reducer(key, list_of_values):
    # key: word
    # value: list of occurrence counts
    total = 0
    for v in list_of_values:
      total += v
    mr.emit((key, total))
 
# Part 4
inputdata = o p e n (sys.argv[1])
mr.e x e c u t e (inputdata, mapper, reducer)

In Part 1, we create a MapReduce object that is used to pass data between the map function and the reduce function; you won’t need to use this object directly.

In Part 2, the mapper function tokenizes each document and emits a key-value pair. The key is a word formatted as a string and the value is the integer 1 to indicate an occurrence of word.

In Part 3, the reducer function sums up the list of occurrence counts and emits a count for word. Since the mapper function emits the integer 1 for each word, each element in the list_of_values is the integer 1.

The list of occurrence counts is summed and a (word, total) tuple is emitted where word is a string and total is an integer.

In Part 4, the code loads the json file and executes the MapReduce query which prints the result to stdout.

Create an Inverted index. Given a set of documents, an inverted index is a dictionary
where each word is associated with a list of the document identifiers in which that word appears.
''''
import MapReduce
import sys
 
mr = MapReduce.MapReduce()
 
 
def mapper(record):
    ''''
The input is a 2 element list: [document_id, text]
document_id: document identifier formatted as a string
text: text of the document formatted as a string
''''
    key = record[0]
    value = record[1]
 
    for word in value.split():
       mr.emit_intermediate(word,key)
 
def reducer(key, list_of_values):
    ''''
The output should be a (word, document ID list) tuple where word is a String and document ID list is a list of Strings.
''''
    result = []
 
    #want to remove duplicates
    #might use set(), but got some bug which did not want to solve
    for document_ID in list_of_values:
        if document_ID not in result:
            result.append(document_ID)
    mr.emit( (key,result) )
 
 
inputdata = o p e n(sys.argv[1])
mr.e x e c u t e(inputdata, mapper, reducer)

Tags: MOOC, python, Introduction to Data Science, MapReduce

Su	Mo	Tu	We	Th	Fr	Sa
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Leave a Comment

CV