Megadata With Python and Hadoop

July 2010 Triangle Hadoop Users Group presentation


    1. Processing Megadata With Python and Hadoop
       July 2010 TriHUG
       Ryan Cox
       www.asciiarmor.com
    2. ~130 GB NCDC climate dataset
       0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
       0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
       0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
       0029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF108991999999999999999999
       0029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF108991999999999999999999
       0029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF108991999999999999999999
       0029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF106991999999999999999999
       0029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF108991999999999999999999
       0029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF108991999999999999999999
       0029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF108991999999999999999999
       0029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF108991999999999999999999
       0029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF108991999999999999999999
       0029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF104991999999999999999999
       0029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF104991999999999999999999
       0029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF104991999999999999999999
       0029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF103991999999999999999999
       0029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF100991999999999999999999
       0029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF100991999999999999999999
       0029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF100991999999999999999999
       0029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF100991999999999999999999
       0029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF100991999999999999999999
       0029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF108991999999999999999999
       0029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF108991999999999999999999
       0035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW1701
       0029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF108991999999999999999999
       0029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF108991999999999999999999
       0029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF108991999999999999999999
       0029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF100991999999999999999999
       0029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF100991999999999999999999
       0029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF100991999999999999999999
       0029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF100991999999999999999999
       0029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF100991999999999999999999
       0029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF100991999999999999999999
       0029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF104991999999999999999999
       0029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF103991999999999999999999
       0029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF100991999999999999999999
       0029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF100991999999999999999999
       0029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF100991999999999999999999
       0029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF100991999999999999999999
       0029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF100991999999999999999999
       0029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF100991999999999999999999
       0029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF100991999999999999999999
       0029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF108991999999999999999999
       0029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF108991999999999999999999
       0035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721
    3. How can we make this scale? (and do more interesting things)

           high_temp = 0
           for line in open('1901'):
               line = line.strip()
               # fixed-width fields: year, air temperature, quality code
               (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
               # +9999 marks a missing reading; keep only accepted quality codes
               if temp != "+9999" and quality in "01459":
                   high_temp = max(high_temp, float(temp))
           print(high_temp)
    4.
    5. Jeffrey Dean – Google – 2004
       “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
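The model in the quote can be sketched in a few lines of plain Python: a map step emits intermediate key/value pairs, a grouping step collects the values that share a key, and a reduce step combines each group. This is a minimal single-process sketch, not Hadoop or Google's implementation; the names `map_reduce`, `temp_mapper`, and `max_reducer` and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def temp_mapper(record):
    # Hypothetical record shape: (year, temperature) already parsed.
    year, temp = record
    yield year, temp


def max_reducer(key, values):
    # Combine all values that share a key into a single result.
    return max(values)


sample = [("1901", -78), ("1901", 22), ("1902", 5), ("1902", -11)]
result = map_reduce(sample, temp_mapper, max_reducer)
print(result)
```

Because each mapper call sees one record and each reducer call sees one key, both steps could in principle run on different machines, which is the property the quote relies on.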
    6.
    7. MapReduce in pure Python

           from functools import reduce  # needed on Python 3; harmless on Python 2.6+

           def mapper(line):
               line = line.strip()
               (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
               if temp != "+9999" and quality in "01459":
                   return float(temp)
               return None

           # drop the None results from rejected records before reducing
           output = [t for t in map(mapper, open('1901')) if t is not None]
           print(reduce(max, output))
    8. Hadoop Streaming

       mapper.py

           import re
           import sys

           for line in sys.stdin:
               val = line.strip()
               (year, temp, q) = (val[15:19], val[87:92], val[92:93])
               if temp != "+9999" and re.match("[01459]", q):
                   print("%s %s" % (year, temp))

       reducer.py

           import sys

           # input arrives sorted, so all lines for one key are adjacent
           (last_key, max_val) = (None, None)
           for line in sys.stdin:
               (key, val) = line.strip().split(" ")
               if key != last_key:
                   if last_key is not None:
                       # key changed: emit the maximum for the previous key
                       print("%s %s" % (last_key, max_val))
                   (last_key, max_val) = (key, int(val))
               else:
                   max_val = max(max_val, int(val))
           if last_key is not None:
               print("%s %s" % (last_key, max_val))

       cat dataFile | python mapper.py | sort | python reducer.py
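The mapper and reducer scripts above only work because `sort` sits between them: the reducer assumes every line for a key arrives consecutively. The sketch below simulates that pipeline in one process, using `itertools.groupby` for the reducer side. It is an illustration under stated assumptions, not part of the talk: `make_record` is a hypothetical helper that pads synthetic lines so the fields land at the offsets the mapper slices, and `run_mapper` / `run_reducer` are made-up names.

```python
from itertools import groupby

VALID_QUALITY = set("01459")


def make_record(year, temp, quality):
    # Hypothetical helper: builds a synthetic line with the year at [15:19],
    # the temperature at [87:92] and the quality code at [92:93].
    return "0" * 15 + year + "0" * 68 + temp + quality


def run_mapper(lines):
    """Mapper stage: emit 'year temp' for each usable record."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and q in VALID_QUALITY:
            yield "%s %s" % (year, temp)


def run_reducer(sorted_lines):
    """Reducer stage: input must be sorted so equal keys are adjacent."""
    pairs = (line.split(" ") for line in sorted_lines)
    for year, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s %s" % (year, max(int(temp) for _, temp in group))


records = [
    make_record("1901", "-0078", "1"),
    make_record("1902", "+0150", "1"),
    make_record("1901", "+0022", "1"),
    make_record("1901", "+9999", "1"),  # missing reading, dropped by the mapper
    make_record("1902", "+0300", "2"),  # quality code 2, dropped by the mapper
]

# sorted() plays the role of the `sort` command in the shell pipeline
output = list(run_reducer(sorted(run_mapper(records))))
print(output)
```

Removing the `sorted()` call would split the 1901 records into two groups, which is a quick way to see why the shuffle/sort step matters.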
    9. Dumbo
    10. Dumbo

            def mapper(key, value):
                line = value.strip()
                (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
                if temp != "+9999" and quality in "01459":
                    yield year, int(temp)

            def reducer(key, values):
                yield key, max(values)

            if __name__ == "__main__":
                import dumbo
                # the reducer is passed a second time to serve as the combiner
                dumbo.run(mapper, reducer, reducer)
    11. Dumbo
          • Ability to pass around Python objects
          • Job / iteration abstraction
          • Counter / status abstraction
          • Simplified joining mechanism
          • Ability to use non-Java combiners
          • Built-in library of mappers / reducers
          • Excellent way to model MR algorithms
    18. Elastic MapReduce
        CLI – API – Web Console
        Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
    19. Elastic MapReduce
    20. Elastic MapReduce
        Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’
    21. Elastic MapReduce
        CloudWatch metrics
    22. Elastic MapReduce
        CloudWatch metrics
    23. Quiz: How would you do this?
    24. MapReduce algorithms ARE different

            BFS(G, s)  // G is the graph and s is the starting node
              for each vertex u ∈ V[G] - {s}
                do color[u] ← WHITE  // color of vertex u
                   d[u] ← ∞  // distance from source s to vertex u
                   π[u] ← NIL  // predecessor of u
              color[s] ← GRAY
              d[s] ← 0
              π[s] ← NIL
              Q ← Ø  // Q is a FIFO queue
              ENQUEUE(Q, s)
              while Q ≠ Ø  // iterates as long as there are gray vertices
                do u ← DEQUEUE(Q)
                   for each v ∈ Adj[u]
                     do if color[v] = WHITE  // discover the undiscovered adjacent vertices
                          then color[v] ← GRAY  // enqueued whenever painted gray
                               d[v] ← d[u] + 1
                               π[v] ← u
                               ENQUEUE(Q, v)
                   color[u] ← BLACK  // painted black whenever dequeued
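The queue-based BFS above depends on shared mutable state (the queue and the color array), which does not map onto independent mappers and reducers. One common reformulation, sketched here as an assumption rather than as the answer given in the talk, runs BFS as repeated map/reduce passes: each pass lets every reached node propose `distance + 1` to its neighbors, and the reducer keeps the smallest proposal per node. All names (`bfs_map`, `bfs_reduce`, `bfs_iteration`, `bfs`) and the sample graph are hypothetical, and the passes run in a single process.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(node, state):
    dist, neighbors = state
    # Re-emit the node itself so the adjacency list survives the pass.
    yield node, (dist, neighbors)
    if dist != INF:
        for neighbor in neighbors:
            # Propose a candidate distance; None marks "no adjacency attached".
            yield neighbor, (dist + 1, None)


def bfs_reduce(node, values):
    best, neighbors = INF, []
    for dist, adjacency in values:
        if adjacency is not None:
            neighbors = adjacency
        best = min(best, dist)
    return best, neighbors


def bfs_iteration(graph):
    """One map/reduce pass over the whole graph."""
    grouped = defaultdict(list)
    for node, state in graph.items():
        for key, value in bfs_map(node, state):
            grouped[key].append(value)
    return {node: bfs_reduce(node, values) for node, values in grouped.items()}


def bfs(adjacency, source):
    graph = {n: (0 if n == source else INF, adj) for n, adj in adjacency.items()}
    while True:
        updated = bfs_iteration(graph)
        if updated == graph:  # no distance changed: converged
            return {node: dist for node, (dist, _) in updated.items()}
        graph = updated


adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": [], "f": []}
distances = bfs(adjacency, "a")
print(distances)
```

The trade-off is visible in the loop: the sequential algorithm touches each edge once, while this version rescans the whole graph on every pass and needs roughly as many passes as the graph is deep.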
    25. Map Reduce Elsewhere

        MongoDB

            > m = function() { emit(this.user_id, 1); }
            > r = function(k,vals) { return 1; }
            > res = db.events.mapReduce(m, r, { query : {type:'sale'} });
            > db[res.result].find().limit(2)
            { "_id" : 8321073716060 , "value" : 1 }
            { "_id" : 7921232311289 , "value" : 1 }

        Riak

            > {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
                                         {<<"groceries">>, <<"yours">>}],
                                        [{'map', {'qfun', Count}, 'none', false},
                                         {'reduce', {'qfun', Merge}, 'none', true}]).
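The MongoDB job above emits `(user_id, 1)` for every event of type `sale` and reduces each group to `1`, so its result holds one document per distinct buyer. The same logic in plain Python, as a rough sketch: the `events` list is made-up sample data, and the grouping loop only stands in for what the database does internally.

```python
# Hypothetical sample events; only the 'sale' ones pass the query filter.
events = [
    {"user_id": 101, "type": "sale"},
    {"user_id": 102, "type": "view"},
    {"user_id": 101, "type": "sale"},
    {"user_id": 103, "type": "sale"},
]


def m(event):
    # Mirrors: emit(this.user_id, 1)
    yield event["user_id"], 1


def r(key, vals):
    # Mirrors: return 1 (collapse all emits for a user into one marker)
    return 1


grouped = {}
for event in events:
    if event["type"] == "sale":  # mirrors { query : {type:'sale'} }
        for key, value in m(event):
            grouped.setdefault(key, []).append(value)

result = [{"_id": key, "value": r(key, vals)} for key, vals in sorted(grouped.items())]
print(result)
```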
    26. Learn More
        Hadoop: The Definitive Guide
        http://www.hadoopbook.com
        Dumbo
        http://dumbotics.com/
        http://github.com/klbostee/dumbo/
        Elastic MapReduce
        http://aws.amazon.com/
        Boto
        http://github.com/boto
        Getting Started Slides
        http://www.slideshare.net/pacoid/getting-started-on-hadoop
    27. DEMO
