Daniel Sikar: Hadoop MapReduce - 06/09/2010


In this podcast speaker Daniel Sikar talks about Hadoop MapReduce.


  • So, without further ado, let's get this show on the road and run a job concurrently on a few virtual machines.

Transcript

  • 1. QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar
  • 2. EC2 S3
  • 3.  
  • 4. Tools
    • AWS Command line tools
    • 5. Elastic MapReduce Ruby library
    • 6. Hadoop
    • 7. s3cmd
  • 8. Hadoop MapReduce Job Tracker + Task Tracker + Slaves HDFS – Distributed file system
  • 9. Hadoop MapReduce usage Data crunching in general Clicks Statistics etc
  • 10. Hadoop Project Mgmt Committee
  • 11. MapReduce ?
  • 12. MapReduce Key Pairs <key,value>
  • 13. MapReduce
  • 14. HTTP Logs Log file A: (...) FreeTouchScreenNokia5230 (...) (...) GetRidofAllSpeedCameras(...) (...) USManWinsLottery (...) (...) BNPToLaunchElectionManifesto (...) Log file B: (...) FreeTouchScreenNokia5230 (...) (...) BodyLanguageTellsAll (...)
  • 15. MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
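The key-pair arithmetic on slide 15 can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop code: the two lists are hypothetical stand-ins for log files A and B from slide 14 (the elided "(...)" parts are not reconstructed), and the function names `map_phase` and `reduce_phase` are made up for this example.

```python
from collections import defaultdict

# Hypothetical stand-ins for log files A and B on slide 14.
# Only the page titles shown on the slide are used.
log_a = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
log_b = [
    "FreeTouchScreenNokia5230",
    "BodyLanguageTellsAll",
]


def map_phase(titles):
    """Emit a <key, 1> pair for every title seen."""
    return [(title, 1) for title in titles]


def reduce_phase(pairs):
    """Sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Each log is mapped independently, then all pairs are reduced together.
pairs = map_phase(log_a) + map_phase(log_b)
counts = reduce_phase(pairs)

if __name__ == "__main__":
    for key, total in sorted(counts.items()):
        print(f"<{key}, {total}>")
```

The title that appears in both logs ends up with a count of 2, and every other title with a count of 1.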
  • 16. Hadoop Streaming Running MapReduce jobs with .exe files and scripts $ <list> | mapper | reducer
  • 17. Hadoop Streaming Running MapReduce jobs with .exe files and scripts $ <list> | mapper | reducer
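The `<list> | mapper | reducer` pipeline can be imitated locally. The sketch below assumes the usual streaming convention of one tab-separated key/value record per line, and it adds an explicit sort between the two stages, since a reducer of this kind expects identical keys to arrive next to each other. The names `mapper`, `reducer` and `run_pipeline` are illustrative only; in a real streaming job each stage would be a separate script reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Turn each non-empty input line into a 'key<TAB>1' record."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def reducer(sorted_records):
    """Sum the counts for each run of identical keys.

    The input must already be sorted by key.
    """
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


def run_pipeline(lines):
    """Local stand-in for: <list> | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["FreeTouchScreenNokia5230", "BodyLanguageTellsAll",
              "FreeTouchScreenNokia5230"]
    for record in run_pipeline(sample):
        print(record)
```

Keeping the stages as plain generators makes it easy to test the logic on a laptop before uploading scripts to a cluster.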
  • 18. Real life example of Hadoop Streaming usage
  • 19. Wikipedia Page Access Logs
  • 20. Wine Grape Varieties
  • 21. Wikipedia WGV Page Access Stats
  • 22. Business Decisions
  • 23. Launching a virtual Hadoop Cluster
    $ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
    Created job flow <job flow id>
    $ ec2din (...)
  • 24.  
  • 25. Hadoop
    • Standalone Operation
    • 26. Pseudo-Distributed Operation
    • 27. Fully-Distributed Operation
    • 28. NameNode
    • 29. JobTracker
    • 30. DataNode + TaskTracker
  • 31. Hadoop
    • Standalone Operation
    • 32. Pseudo-Distributed Operation
    • 33. Fully-Distributed Operation
    • 34. NameNode
    • 35. JobTracker
    • 36. DataNode + TaskTracker
  • 37. Add a step
    $ elastic-mapreduce --jobflow <jfid> --stream --step-name "Wiki log crunch" --input s3n://dsikar-wikilogs-2009/dec/ --output s3n://dsikar-wikilogs-output/21 --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
    http://<instance public dns>:9100
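The mapper and reducer scripts named in this step are not reproduced in the transcript. As a rough idea of what a dictionary-based mapper for the page access logs might do, here is a Python sketch. It assumes each log line has four space-separated fields (project, page title, view count, bytes), and it uses a made-up dictionary and made-up sample lines; `VARIETIES`, `map_line` and `reduce_records` are hypothetical names, not the contents of wikidictionarymap.pl or wikireduce.pl.

```python
# Hypothetical dictionary of wine grape variety page titles.
VARIETIES = {"Merlot", "Pinot_noir", "Chardonnay"}


def map_line(line, dictionary=VARIETIES):
    """Return 'title<TAB>views' when the page title is in the dictionary.

    Assumes the line format 'project title views bytes'; anything that
    does not match is skipped by returning None.
    """
    fields = line.split()
    if len(fields) != 4:
        return None
    _project, title, views, _size = fields
    if title not in dictionary or not views.isdigit():
        return None
    return f"{title}\t{views}"


def reduce_records(records):
    """Sum the view counts per title."""
    totals = {}
    for record in records:
        title, views = record.split("\t")
        totals[title] = totals.get(title, 0) + int(views)
    return totals


# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    "en Merlot 12 34567",
    "en Main_Page 500 999999",
    "en Merlot 3 8000",
    "de Chardonnay 7 21000",
    "malformed line",
]

mapped = [r for r in (map_line(line) for line in SAMPLE_LINES) if r is not None]
totals = reduce_records(mapped)

if __name__ == "__main__":
    for title, views in sorted(totals.items()):
        print(f"{title}\t{views}")
```

Filtering in the mapper keeps the data passed to the reducers small, because only lines whose title is in the dictionary are emitted.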
  • 38. s3cmd
    # make bucket
    $ s3cmd mb s3://dsikar-wikilogs
    # put log files
    $ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
    $ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
    # list log files
    $ s3cmd ls s3://dsikar-wikilogs/
    # put scripts
    $ s3cmd put *.pl s3://dsikar-wiki-scripts/
    # delete log files
    $ s3cmd del --recursive --force s3://dsikar-wikilogs/
    # remove bucket
    $ s3cmd rb s3://dsikar-wikilogs/
  • 39. Elastic MapReduce --create --list --jobflow --describe --stream --terminate
  • 40. Output files part-00000 part-00001 part-00002 (...)
  • 41. Further aggregation
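One possible form of further aggregation is to merge the part-NNNNN files after downloading them and rank the keys by total. The sketch below assumes each output line is a key and an integer count separated by a tab. The helper names `load_parts`, `top_n` and `demo`, and the sample values written by `demo`, are made up for illustration.

```python
import glob
import os
import tempfile


def load_parts(directory):
    """Read every part-* file in a directory and sum the counts per key."""
    totals = {}
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                key, _, value = line.partition("\t")
                totals[key] = totals.get(key, 0) + int(value)
    return totals


def top_n(totals, n):
    """Return the n keys with the highest totals; ties are broken by key."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def demo():
    """Write two small made-up part files and aggregate them."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "part-00000"), "w", encoding="utf-8") as f:
            f.write("Merlot\t15\n")
        with open(os.path.join(tmp, "part-00001"), "w", encoding="utf-8") as f:
            f.write("Chardonnay\t7\nRiesling\t9\n")
        return top_n(load_parts(tmp), 2)


if __name__ == "__main__":
    for key, total in demo():
        print(f"{key}\t{total}")
```

Summing per key also covers the case where the same key shows up in more than one file, for example when the outputs of several job steps are combined.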
  • 42. Conclusion Hadoop MapReduce provides out-of-the-box ready-to-go distributed computing.