Daniel Sikar: Hadoop MapReduce - 06/09/2010
In this podcast speaker Daniel Sikar talks about Hadoop MapReduce.

Published in: Technology
  • So without further ado, let's get this show on the road and run a job concurrently on a few virtual machines.
  • Transcript

    • 1. QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar
    • 2. EC2 S3
    • 3.  
    • 4. Tools
      • AWS Command line tools
      • 5. Elastic MapReduce Ruby library
      • 6. Hadoop
      • 7. s3cmd
    • 8. Hadoop MapReduce: Job Tracker + Task Tracker + Slaves; HDFS (distributed file system)
    • 9. Hadoop MapReduce usage: data crunching in general (clicks, statistics, etc.)
    • 10. Hadoop Project Mgmt Committee
    • 11. MapReduce ?
    • 12. MapReduce key/value pairs: <key, value>
    • 13. MapReduce
    • 14. HTTP Logs
      • Log file A: (...) FreeTouchScreenNokia5230 (...) (...) GetRidofAllSpeedCameras (...) (...) USManWinsLottery (...) (...) BNPToLaunchElectionManifesto (...)
      • Log file B: (...) FreeTouchScreenNokia5230 (...) (...) BodyLanguageTellsAll (...)
    • 15. MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
    • 16. Hadoop Streaming: running MapReduce jobs with .exe files and scripts $ <list> | mapper | reducer
    • 18. Real life example of Hadoop Streaming usage
    • 19. Wikipedia Page Access Logs
    • 20. Wine Grape Varieties
    • 21. Wikipedia WGV Page Access Stats
    • 22. Business Decisions
    • 23. Launching a virtual Hadoop Cluster
      • $ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
      • Created job flow <job flow id>
      • $ ec2din (...)
    • 24.  
    • 25. Hadoop
      • Standalone Operation
      • 26. Pseudo-Distributed Operation
      • 27. Fully-Distributed Operation
      • 28. NameNode
      • 29. JobTracker
      • 30. DataNode + TaskTracker
    • 37. Add a step
      • $ elastic-mapreduce --jobflow <jfid> --stream --step-name "Wiki log crunch" --input s3n://dsikar-wikilogs-2009/dec/ --output s3n://dsikar-wikilogs-output/21 --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
      • http://<instance public dns>:9100
    • 38. s3cmd
      • # make bucket
      • $ s3cmd mb s3://dsikar-wikilogs
      • # put log files
      • $ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
      • $ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
      • # list log files
      • $ s3cmd ls s3://dsikar-wikilogs/
      • # put scripts
      • $ s3cmd put *.pl s3://dsikar-wiki-scripts/
      • # delete log files
      • $ s3cmd del --recursive --force s3://dsikar-wikilogs/
      • # remove bucket
      • $ s3cmd rb s3://dsikar-wikilogs/
    • 39. Elastic MapReduce --create --list --jobflow --describe --stream --terminate
    • 40. Output files part-00000 part-00001 part-00002 (...)
    • 41. Further aggregation
    • 42. Conclusion: Hadoop MapReduce provides out-of-the-box, ready-to-go distributed computing.
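
The "HTTP Logs" and "Hadoop Streaming" slides describe counting repeated keys with a `<list> | mapper | reducer` pipeline. The following is a minimal local sketch of that idea in Python, not the Perl scripts used in the talk (those are not shown in the transcript). The function names `mapper` and `reducer`, the one-key-per-line input, and the tab-separated record format are assumptions made for illustration; `sorted()` stands in for the sort-by-key that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "<key>\t1" record per non-empty input line (assumed format)."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (record.split("\t") for record in sorted_records)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


# Keys taken from the "HTTP Logs" slide; the elided parts "(...)" are left out.
log_a = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
log_b = ["FreeTouchScreenNokia5230", "BodyLanguageTellsAll"]

# Local stand-in for the shuffle/sort step: identical keys become adjacent.
mapped = sorted(mapper(log_a + log_b))
counts = dict(reducer(mapped))

print(counts["FreeTouchScreenNokia5230"])
```

Because the reducer only compares adjacent records, it relies on its input being sorted by key, which is why the sort step sits between the two functions.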

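
The "Output files" and "Further aggregation" slides do not say how the `part-0000N` files were combined. One plausible approach, sketched here under assumptions, is to sum counts per key across several outputs (for example, the results of separate runs). The helper name `aggregate`, the tab-separated `key<TAB>count` line format, and the sample values below are all hypothetical and are not data from the talk.

```python
from collections import Counter


def aggregate(parts):
    """Sum counts per key across several iterables of "key\tcount" lines."""
    totals = Counter()
    for part in parts:
        for line in part:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab so keys containing tabs stay intact.
            key, _, count = line.rpartition("\t")
            totals[key] += int(count)
    return totals


# Made-up sample lines standing in for the contents of two output files.
part_00000 = ["Merlot\t120\n", "Pinot_noir\t80\n"]
part_00001 = ["Merlot\t30\n", "Riesling\t45\n"]

totals = aggregate([part_00000, part_00001])
top = totals.most_common(2)
print(top)
```

In practice each element of `parts` could be an open file handle, since iterating a text file also yields one line at a time.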