In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks. Problem 5: Consider a set of key-value pairs where each key is sequence id and each value is a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA.... Write a MapReduce query to remove the last 10 characters from each string of nucleotides, then remove any duplicates generated.
Map Input
The input is a 2 element list: [sequence id, nucleotides]
sequence id: Unique identifier formatted as a string
nucleotides: Sequence of nucleotides formatted as a string Reduce Output
The output from the reduce function should be the unique trimmed nucleotide strings.
You can test your solution to this problem using dna.json:
python unique_trims.py dna.json
You can verify your solution against unique_trims.json.
import MapReduce import sys """ Word Count Example in the Simple Python MapReduce Framework """ mr = MapReduce.MapReduce() # ============================= # Do not modify above this line def mapper(record): # key: document identifier # value: document contents trim_nucleotid = record[1][:-10] mr.emit_intermediate(trim_nucleotid, 1 ) def reducer(trim_nucleotid, list_of_values): # key: word # value: list of occurrence counts #mr.emit((person,len(list_of_values)) ) mr.emit(trim_nucleotid) # Do not modify below this line # ============================= if __name__ == '__main__': inputdata = open(sys.argv[1]) mr.e xecute(inputdata, mapper, reducer)
Leave a Comment