My intro talk on Hadoop and how to use it with Python streaming.

Published in: Education, Technology

1. Paul Tarjan, Chief Technical Monkey, Yahoo! http://paulisageek.com
2. Data is ugly
   1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
   29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
   29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
   11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0
3. Summaries are OK
4. Graphs are pretty
5. Word count, yay!
   - Given a document, how many of each word are there?
   - Real world:
     - Given our search logs, how many people click on result 1?
     - Given our Flickr photos, how many cat photos are there?
     - Given our web crawl, what are the 10 most popular words?
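
On a single machine the word count problem fits in a few lines. This is a minimal sketch, assuming words are separated by whitespace and compared case-insensitively (the slide does not specify tokenization); the function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(text):
    """Count how often each whitespace-separated word occurs (case-insensitive)."""
    return Counter(text.lower().split())


if __name__ == "__main__":
    counts = count_words("the cat sat on the mat the end")
    # Print up to the 10 most popular words, most frequent first.
    for word, n in counts.most_common(10):
        print("%s\t%d" % (word, n))
```

This works until the document no longer fits on one machine, which is where the rest of the talk picks up.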
6. Map/Reduce
7. Hadoop streaming
   - Basically, write some code that runs like:
     $ cat data | mapper.py | sort | reducer.py
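
The shape of that pipeline can be mimicked in a single Python process to see what each stage contributes. This is an illustrative sketch only: `map_words`, `reduce_counts`, and `run_pipeline` are hypothetical names, and Python's `sorted` stands in for the `sort` command.

```python
from itertools import groupby


def map_words(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of adjacent pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def run_pipeline(lines):
    """Local stand-in for: cat data | mapper.py | sort | reducer.py"""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    print(run_pipeline(["a b a", "b c"]))
```

The sort in the middle matters: the reduce step only merges adjacent pairs, so equal keys must arrive next to each other.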
8. Python wordcount mapper http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/mapper.py
9. Python wordcount reducer http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/reducer.py
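
The transcript links to the mapper rather than reproducing it. The following is an independent minimal sketch of what a streaming word count mapper might look like, not the code from that repository. It assumes whitespace tokenization and relies on Hadoop streaming's default convention that the text before the first tab is the key; `map_lines` is a hypothetical helper name.

```python
#!/usr/bin/env python
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


if __name__ == "__main__":
    # Hadoop streaming feeds input on stdin and reads records from stdout.
    for record in map_lines(sys.stdin):
        print(record)
```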
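
The reducer is likewise only linked, so here is an independent minimal sketch rather than the repository's code. It assumes its input is already sorted by key and formatted as `word<TAB>count` lines; `reduce_lines` is a hypothetical helper name.

```python
#!/usr/bin/env python
import sys


def reduce_lines(lines):
    """Sum counts for runs of identical keys in sorted 'word<TAB>count' input."""
    current, total = None, 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current:
            # The key changed, so the previous word's total is complete.
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Because the input is sorted, the reducer needs to remember only one word and one running total at a time.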
10. Run
   - $ cat data | mapper.py | sort | reducer.py
   - If that works, run it live:
     - Put the data into Hadoop's Distributed File System (DFS)
     - Run Hadoop
     - Read the output data in the DFS
11. Run Hadoop
   - Stream data through these two files, saving the output back to HDFS:
     $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input_dir -output output_dir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
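
When the job is launched from a script, it can help to build that command as an argument list. This is a sketch that uses only the flags shown on the slide; `streaming_command` is a hypothetical helper, and the jar location under `HADOOP_HOME` is taken from the slide and may differ between installations.

```python
import os


def streaming_command(input_dir, output_dir, mapper, reducer, hadoop_home=None):
    """Build the argument list for the Hadoop streaming invocation."""
    home = hadoop_home or os.environ.get("HADOOP_HOME", "")
    return [
        os.path.join(home, "bin", "hadoop"),
        "jar", os.path.join(home, "hadoop-streaming.jar"),
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
        # Ship both scripts to the cluster nodes along with the job.
        "-file", mapper,
        "-file", reducer,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; on a machine with Hadoop
    # installed the list could be passed to subprocess.check_call.
    print(" ".join(streaming_command("input_dir", "output_dir", "mapper.py", "reducer.py")))
```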
12. View output
   - View output files:
     - $ hadoop dfs -ls output_dir
   - Note multiple output files ("part-00000", "part-00001", etc.)
   - View output file contents:
     - $ hadoop dfs -cat output_dir/part-00000
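
Since the result is split across several part files, a small post-processing step can merge them, for example to find the most popular words. This is a sketch, assuming each output line is a word and a count separated by a tab (an assumption about the reducer's output format); `top_words` is a hypothetical helper.

```python
def top_words(part_lines, n=10):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    counts = {}
    for line in part_lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines that are not 'word<TAB>count'
        counts[word] = counts.get(word, 0) + int(count)
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Stand-in for lines read from the part files.
    sample = ["hadoop\t5\n", "python\t3\n", "yay\t1\n"]
    for word, count in top_words(sample, n=2):
        print("%s\t%d" % (word, count))
```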
13. Do it live!
   - http://github.com/ptarjan/hands-on-hadoop-tutorial/