Daniel Sikar: Hadoop MapReduce - 06/09/2010

1. QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar

2. EC2 S3

5. Elastic MapReduce Ruby library

6. Hadoop

7. s3cmd

8. Hadoop MapReduce Job Tracker + Task Tracker + Slaves HDFS – Distributed file system

9. Hadoop MapReduce usage Data crunching in general Clicks Statistics etc

10. Hadoop Project Mgmt Committee

11. MapReduce ?

12. MapReduce Key Pairs <key,value>

13. MapReduce

14. HTTP Logs
Log file A: (...) FreeTouchScreenNokia5230 (...) (...) GetRidofAllSpeedCameras (...) (...) USManWinsLottery (...) (...) BNPToLaunchElectionManifesto (...)
Log file B: (...) FreeTouchScreenNokia5230 (...) (...) BodyLanguageTellsAll (...)

15. MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
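The sum on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Hadoop is implemented: the function names `map_phase`, `shuffle` and `reduce_phase` are made up for the example, and the two lists stand in for the headline keys visible in log files A and B.

```python
from collections import defaultdict

# Hypothetical stand-ins for the keys found in the two log files.
LOG_A = ["FreeTouchScreenNokia5230", "GetRidofAllSpeedCameras",
         "USManWinsLottery", "BNPToLaunchElectionManifesto"]
LOG_B = ["FreeTouchScreenNokia5230", "BodyLanguageTellsAll"]

def map_phase(records):
    """Emit a <key, 1> pair for every record."""
    return [(record, 1) for record in records]

def shuffle(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

pairs = map_phase(LOG_A) + map_phase(LOG_B)
totals = reduce_phase(shuffle(pairs))
print(totals["FreeTouchScreenNokia5230"])  # 2
```

Each log file can be mapped independently, which is what makes the map phase easy to spread over many machines.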

16. Hadoop Streaming: running MapReduce jobs with .exe files and scripts
$ <list> | mapper | reducer
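The `<list> | mapper | reducer` pipeline can be tried locally before launching a cluster. The sketch below assumes the usual Streaming convention of one tab-separated `key<TAB>value` pair per line, and it adds an explicit sort between the two stages because the reducer relies on identical keys arriving together. The names `mapper`, `reducer` and `run_pipeline` are illustrative only; they are not the Perl scripts used in the talk.

```python
from itertools import groupby

def mapper(lines):
    """Turn each non-empty input line into a 'key<TAB>1' line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"

def reducer(sorted_lines):
    """Sum counts over runs of identical keys; input must be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

def run_pipeline(lines):
    """Local equivalent of: <list> | mapper | sort | reducer"""
    return list(reducer(sorted(mapper(lines))))

result = run_pipeline(["b", "a", "b", "", "a", "b"])
print(result)  # ['a\t2', 'b\t3']
```

In a real Streaming job the two functions would live in separate scripts that read standard input and write standard output.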

18. Real life example of Hadoop Streaming usage

19. Wikipedia Page Access Logs

20. Wine Grape Varieties

21. Wikipedia WGV Page Access Stats

22. Business Decisions

23. Launching a virtual Hadoop Cluster
$ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
Created job flow <job flow id>
$ ec2din
(...)

26. Pseudo-Distributed Operation

27. Fully-Distributed Operation

28. NameNode

29. JobTracker

30. DataNode + TaskTracker

37. Add a step
$ elastic-mapreduce --jobflow <jfid> --stream \
    --step-name "Wiki log crunch" \
    --input s3n://dsikar-wikilogs-2009/dec/ \
    --output s3n://dsikar-wikilogs-output/21 \
    --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl \
    --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
http://<instance public dns>:9100

38. s3cmd
# make bucket
$ s3cmd mb s3://dsikar-wikilogs
# put log files
$ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec/
$ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr/
# list log files
$ s3cmd ls s3://dsikar-wikilogs/
# put scripts
$ s3cmd put *.pl s3://dsikar-wiki-scripts/
# delete log files
$ s3cmd del --recursive --force s3://dsikar-wikilogs/
# remove bucket
$ s3cmd rb s3://dsikar-wikilogs/

39. Elastic MapReduce --create --list --jobflow --describe --stream --terminate

40. Output files part-00000 part-00001 part-00002 (...)

41. Further aggregation
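One way to approach further aggregation is to fold the `part-*` files into a single table of totals. The sketch below is an assumption about what that step could look like, not the script used in the talk: it assumes each output line is `key<TAB>count`, the helper name `aggregate_parts` is hypothetical, and the grape-variety counts are invented sample data written to a temporary directory.

```python
from collections import Counter
from pathlib import Path
import tempfile

def aggregate_parts(directory):
    """Sum counts per key across all part-* files in a directory."""
    totals = Counter()
    for part in sorted(Path(directory).glob("part-*")):
        for line in part.read_text().splitlines():
            if not line.strip():
                continue
            # Split on the last tab so keys containing tabs stay intact.
            key, count = line.rsplit("\t", 1)
            totals[key] += int(count)
    return totals

# Invented sample data standing in for downloaded output files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part-00000").write_text("Merlot\t10\nMalbec\t4\n")
    Path(tmp, "part-00001").write_text("Merlot\t5\nSyrah\t7\n")
    totals = aggregate_parts(tmp)

print(totals.most_common(1))  # [('Merlot', 15)]
```

The same function could combine the outputs of several runs, for example one directory per month of logs, by calling it once per directory and adding the resulting counters.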

42. Conclusion Hadoop MapReduce provides out-of-the-box ready-to-go distributed computing.
