Algorithms in MapReduce 5: Unique Trimmed DNA Sequences

In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks.

Problem 5: Consider a set of key-value pairs where each key is a sequence id and each value is a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA.... Write a MapReduce query to remove the last 10 characters from each string of nucleotides, then remove any duplicates generated.

Map Input

The input is a two-element list: [sequence id, nucleotides]

sequence id: Unique identifier formatted as a string

nucleotides: Sequence of nucleotides formatted as a string

Reduce Output

The output from the reduce function should be the unique trimmed nucleotide strings.

You can test your solution to this problem using dna.json:

    python unique_trims.py dna.json

You can verify your solution against unique_trims.json.
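Because the order in which results are emitted may differ between runs, a line-by-line diff can report false mismatches. The following is a minimal sketch of an order-insensitive comparison. It assumes that both files contain one JSON value per line, which is not stated in the assignment text, and the helper names load_records and same_records are made up for this illustration.

```python
import json


def load_records(path):
    """Read one JSON value per non-blank line (assumed file layout)."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


def same_records(path_a, path_b):
    """Return True if both files hold the same records, ignoring order."""
    def canonical(records):
        # Serialize each record so that any JSON value can be sorted.
        return sorted(json.dumps(record, sort_keys=True) for record in records)

    return canonical(load_records(path_a)) == canonical(load_records(path_b))
```

For example, after redirecting the script's output to a file (say my_output.json, a hypothetical name), calling same_records("my_output.json", "unique_trims.json") should return True if the solution is correct under the assumption above.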

import MapReduce
import sys
 
"""
Unique Trimmed DNA Sequences in the Simple Python MapReduce Framework
"""
 
mr = MapReduce.MapReduce()
 
# =============================
# Do not modify above this line
def mapper(record):
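    # record: [sequence id, nucleotides]
    # Drop the last 10 characters. The trimmed string becomes the key,
    # so identical trims are grouped together; the value is a placeholder.
    trim_nucleotid = record[1][:-10]
    mr.emit_intermediate(trim_nucleotid, 1)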
 
def reducer(trim_nucleotid, list_of_values):
    # key: trimmed nucleotide string
    # value: list of placeholder 1s, one per sequence with this trim
    # Emit each key once; this is what removes the duplicates.
    mr.emit(trim_nucleotid)
 
 
# Do not modify below this line
# =============================
if __name__ == '__main__':
  inputdata = open(sys.argv[1])
  mr.execute(inputdata, mapper, reducer)
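
To see why emitting the trimmed string as the key removes duplicates, here is a minimal standalone sketch that imitates the map, group, and reduce phases with a plain dictionary. It does not use the course's MapReduce module; the run function and the sample records are assumptions made up for this illustration, and the mapper and reducer yield values instead of calling mr.emit_intermediate and mr.emit.

```python
from collections import defaultdict


def mapper(record):
    # record is [sequence id, nucleotides]; the id is discarded.
    yield record[1][:-10], 1


def reducer(key, values):
    # Each distinct trimmed string reaches the reducer exactly once.
    yield key


def run(records):
    # Group intermediate values by key, as a MapReduce shuffle would.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key, values in groups.items():
        results.extend(reducer(key, values))
    return results


# Hypothetical sample data: s1 and s2 differ only in their last 10 characters.
sample = [
    ["s1", "ACGTACGT" + "A" * 10],
    ["s2", "ACGTACGT" + "C" * 10],
    ["s3", "TTGA" + "C" * 10],
]

print(sorted(run(sample)))  # ['ACGTACGT', 'TTGA']
```

Note that the slice [:-10] returns an empty string for any sequence of 10 characters or fewer, so all such sequences collapse into a single empty result.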

Tags: MOOC, python, Introduction to Data Science, MapReduce
