In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks. Problem 5: Consider a set of key-value pairs where each key is sequence id and each value is a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA.... Write a MapReduce query to remove the last 10 characters from each string of nucleotides, then remove any duplicates generated.

Map Input

The input is a 2 element list: [sequence id, nucleotides]

sequence id: Unique identifier formatted as a string

nucleotides: Sequence of nucleotides formatted as a string Reduce Output

The output from the reduce function should be the unique trimmed nucleotide strings.

You can test your solution to this problem using dna.json:

    python unique_trims.py dna.json

You can verify your solution against unique_trims.json.

import MapReduce
import sys
 
"""
Word Count Example in the Simple Python MapReduce Framework
"""
 
mr = MapReduce.MapReduce()
 
# =============================
# Do not modify above this line
def mapper(record):
    # key: document identifier
    # value: document contents
    trim_nucleotid = record[1][:-10]
    mr.emit_intermediate(trim_nucleotid, 1 )
 
def reducer(trim_nucleotid, list_of_values):
    # key: word
    # value: list of occurrence counts
    #mr.emit((person,len(list_of_values)) )
    mr.emit(trim_nucleotid)
 
 
# Do not modify below this line
# =============================
if __name__ == '__main__':
  inputdata = open(sys.argv[1])
  mr.e xecute(inputdata, mapper, reducer)

Leave a Comment

Fields with * are required.

Please enter the letters as they are shown in the image above.
Letters are not case-sensitive.