Published July 2010: Triangle Hadoop Users Group presentation

1. Processing Megadata with Python and Hadoop
   July 2010 TriHUG
   Ryan Cox
   www.asciiarmor.com
3. high_temp = 0
   for line in open('1901'):
       line = line.strip()
       (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
       if temp != "+9999" and quality in "01459":
           high_temp = max(high_temp, float(temp))
   print high_temp

   How can we make this scale? (and do more interesting things)
5. Jeffrey Dean, Google, 2004:
   "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance."
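The two primitives Dean names can be shown in a few lines of Python; this is a minimal sketch of the functional idea, not Google's implementation (example data is made up):

```python
from functools import reduce

# Map: apply a function to every record independently.
records = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, records))  # [1, 4, 9, 16]

# Reduce: combine the mapped values into a single result.
total = reduce(lambda a, b: a + b, squared)  # 30
```

Because each map call is independent, the map phase can run on many machines at once; the reduce phase then combines the partial results.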
7. def mapper(line):
       line = line.strip()
       (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
       if temp != "+9999" and quality in "01459":
           return float(temp)
       return None

   output = map(mapper, open('1901'))
   print reduce(max, output)

   MapReduce in pure Python
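The same pure-Python version in Python 3, where `reduce` lives in `functools` and `map` returns a lazy iterator. The two records below are fabricated in the NCDC-style fixed-width layout the slides assume (year at columns 15-19, temperature at 87-92, quality flag at 92); real input would come from `open('1901')`:

```python
from functools import reduce

def mapper(line):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        return float(temp)
    return None

# Fabricated fixed-width records: filler, year, filler, signed temp, quality flag.
records = [
    "0" * 15 + "1901" + "0" * 68 + "+0078" + "1",
    "0" * 15 + "1901" + "0" * 68 + "-0011" + "1",
]
temps = [t for t in map(mapper, records) if t is not None]
highest = reduce(max, temps)
print(highest)  # 78.0
```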
8. mapper.py:

   import sys
   import re

   for line in sys.stdin:
       val = line.strip()
       (year, temp, q) = (val[15:19], val[87:92], val[92:93])
       if temp != "+9999" and re.match("[01459]", q):
           print "%s %s" % (year, temp)

   reducer.py:

   import sys

   (last_key, max_val) = (None, 0)
   for line in sys.stdin:
       (key, val) = line.strip().split(" ")
       if last_key and last_key != key:
           print "%s %s" % (last_key, max_val)
           (last_key, max_val) = (key, int(val))
       else:
           (last_key, max_val) = (key, max(max_val, int(val)))
   if last_key:
       print "%s %s" % (last_key, max_val)

   cat dataFile | ./mapper.py | sort | ./reducer.py

   Hadoop Streaming
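The `cat | mapper | sort | reducer` pipeline can also be simulated in-process, which is a convenient way to test streaming scripts before submitting a job. A sketch using in-memory record lists in place of stdin (data and function names are illustrative; the reducer mirrors the script above, including its assumption that values never need a starting maximum below 0):

```python
def map_records(lines):
    # Emulates mapper.py: emit (year, temp) for valid readings.
    for val in (l.strip() for l in lines):
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and q in "01459":
            yield (year, int(temp))

def reduce_sorted(pairs):
    # Emulates reducer.py: pairs arrive sorted by key; track the max per key.
    last_key, max_val = None, 0
    for key, val in pairs:
        if last_key is not None and last_key != key:
            yield (last_key, max_val)
            last_key, max_val = key, val
        else:
            last_key, max_val = key, max(max_val, val)
    if last_key is not None:
        yield (last_key, max_val)

lines = [
    "0" * 15 + "1901" + "0" * 68 + "+0078" + "1",
    "0" * 15 + "1902" + "0" * 68 + "+0121" + "1",
    "0" * 15 + "1901" + "0" * 68 + "+0100" + "1",
]
# cat dataFile | mapper | sort | reducer
result = list(reduce_sorted(sorted(map_records(lines))))
print(result)  # [('1901', 100), ('1902', 121)]
```

The `sorted()` call plays the role of Hadoop's shuffle: it guarantees all pairs with the same key are adjacent before the reducer sees them.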
9. Dumbo
10. def mapper(key, value):
        line = value.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            yield year, int(temp)

    def reducer(key, values):
        yield key, max(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, reducer)

    Dumbo
11. Dumbo
    - Ability to pass around Python objects
    - Job / iteration abstraction
    - Counter / status abstraction
    - Simplified joining mechanism
    - Ability to use non-Java combiners
    - Built-in library of mappers / reducers
    - Excellent way to model MapReduce algorithms
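Under the hood, frameworks like Dumbo hand each reducer a key plus an iterator over that key's values, grouped by the shuffle. `itertools.groupby` over sorted pairs reproduces this contract locally; a sketch of the idea, not Dumbo's actual internals:

```python
from itertools import groupby
from operator import itemgetter

pairs = [("1901", 78), ("1902", 121), ("1901", 100)]

# Sort by key, then group: each reducer call sees (key, iterator-of-values),
# mirroring what the shuffle delivers to Dumbo's reducer(key, values).
grouped = {}
for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
    grouped[key] = max(v for _, v in group)

print(grouped)  # {'1901': 100, '1902': 121}
```

Like Hadoop's shuffle, `groupby` only merges adjacent items, which is why the sort must come first.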
18. CLI / API / Web Console
    Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

    Elastic MapReduce
19. Elastic MapReduce
20. Use Hadoop's JobTracker or Amazon's 'Debugger'

    Elastic MapReduce
21. CloudWatch metrics

    Elastic MapReduce
23. Quiz: How would you do this?
24. MapReduce algorithms ARE different

    BFS(G, s)                        // G is the graph and s is the starting node
        for each vertex u ∈ V[G] - {s}
            color[u] ← WHITE         // color of vertex u
            d[u] ← ∞                 // distance from source s to vertex u
            π[u] ← NIL               // predecessor of u
        color[s] ← GRAY
        d[s] ← 0
        π[s] ← NIL
        Q ← Ø                        // Q is a FIFO queue
        ENQUEUE(Q, s)
        while Q ≠ Ø                  // iterates as long as there are gray vertices
            u ← DEQUEUE(Q)
            for each v ∈ Adj[u]
                if color[v] = WHITE  // discover the undiscovered adjacent vertices
                    color[v] ← GRAY  // enqueued whenever painted gray
                    d[v] ← d[u] + 1
                    π[v] ← u
                    ENQUEUE(Q, v)
            color[u] ← BLACK         // painted black whenever dequeued
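The queue-based BFS above does not translate directly to MapReduce: there is no shared queue, so the standard reformulation runs one MapReduce job per frontier, with the mapper emitting a tentative distance to each neighbor and the reducer keeping the minimum. A sketch of a single iteration over an adjacency-list encoding (function names, graph, and record layout are all illustrative):

```python
def bfs_map(node, dist, neighbors):
    # Re-emit the node's own record, plus a tentative distance to each neighbor.
    yield node, (dist, neighbors)
    if dist is not None:
        for n in neighbors:
            yield n, (dist + 1, None)

def bfs_reduce(node, values):
    # Keep the smallest known distance and recover the node's adjacency list.
    dists = [d for d, _ in values if d is not None]
    neighbors = next((ns for _, ns in values if ns is not None), [])
    return node, (min(dists) if dists else None, neighbors)

# Tiny graph: s -> a, s -> b, a -> b; distance 0 at s, unknown elsewhere.
graph = {"s": (0, ["a", "b"]), "a": (None, ["b"]), "b": (None, [])}

# Simulate the shuffle: collect all mapper emissions grouped by key.
intermediate = {}
for node, (dist, nbrs) in graph.items():
    for key, val in bfs_map(node, dist, nbrs):
        intermediate.setdefault(key, []).append(val)

graph = dict(bfs_reduce(k, v) for k, v in intermediate.items())
print(graph)  # after one iteration, a and b are both at distance 1
```

The driver would rerun this map/reduce pair until no distance changes, one job per BFS level, which is exactly how the algorithm "is different" from its single-machine form.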
25. MongoDB:

    > m = function() { emit(this.user_id, 1); }
    > r = function(k, vals) { return 1; }
    > res = db.events.mapReduce(m, r, { query : { type : 'sale' } });
    > db[res.result].find().limit(2)
    { "_id" : 8321073716060, "value" : 1 }
    { "_id" : 7921232311289, "value" : 1 }

    Riak:

    > {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
                                 {<<"groceries">>, <<"yours">>}],
                                [{'map', {'qfun', Count}, 'none', false},
                                 {'reduce', {'qfun', Merge}, 'none', true}]).

    MapReduce Elsewhere
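The MongoDB example maps each sale event to `(user_id, 1)` and reduces every group back to `1`, so the output is simply one document per distinct buyer. The same shape in plain Python, with made-up event data standing in for the `events` collection:

```python
events = [
    {"type": "sale", "user_id": 8321073716060},
    {"type": "view", "user_id": 8321073716060},
    {"type": "sale", "user_id": 7921232311289},
    {"type": "sale", "user_id": 8321073716060},
]

# query {type: 'sale'} -> m: emit(user_id, 1) -> shuffle groups by user_id
groups = {}
for e in events:
    if e["type"] == "sale":
        groups.setdefault(e["user_id"], []).append(1)

# r: return 1 -- every group collapses to the value 1
result = {user_id: 1 for user_id in groups}
print(result)
```

Whatever the group's size, the reduce emits `1`, so the result's keys are exactly the distinct buyers, matching the two documents shown in the slide.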