QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD
Daniel Sikar
EC2 S3
 
Tools
- AWS command line tools
- Elastic MapReduce Ruby library
- Hadoop
- s3cmd
Hadoop: MapReduce, JobTracker, HDFS – distributed file system
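HDFS can be driven from the command line much like an ordinary file system. A quick illustration with the standard hadoop fs commands; the directory and file names here are made up for the example, not taken from the talk:

$ hadoop fs -mkdir /wikilogs                               # create a directory in HDFS
$ hadoop fs -put pagecounts-20091201-000000.gz /wikilogs/  # copy a local file in
$ hadoop fs -ls /wikilogs                                  # list its contents
$ hadoop fs -get /wikilogs/pagecounts-20091201-000000.gz . # copy it back out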
Hadoop MapReduce usage: data crunching in general (clicks, statistics, etc.)
Hadoop Project Management Committee
MapReduce?
MapReduce key/value pairs <key, value>
MapReduce
HTTP Logs
Log file A:
(...) FreeTouchScreenNokia5230 (...)
(...) GetRidofAllSpeedCameras (...)
(...) USManWinsLottery (...)
(...) BNPToLaunchElectionManifesto (...)
Log file B:
(...) FreeTouchScreenNokia5230 (...)
(...) BodyLanguageTellsAll (...)
MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
Hadoop Streaming: running MapReduce jobs with .exe files and scripts
$ <list> | mapper | reducer
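With streaming, the mapper and reducer are just programs that read lines on stdin and write <key TAB value> lines on stdout. A minimal sketch of what such a pair could look like for Wikipedia pagecount data; these are illustrative stand-ins, not the wikidictionarymap.pl / wikireduce.pl scripts used in the talk, and they assume input lines of the form "project page_title count bytes":

#!/usr/bin/perl
# mapper.pl - illustrative streaming mapper (stand-in for wikidictionarymap.pl)
# assumes pagecount lines of the form: project page_title count bytes
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    my ($project, $title, $count) = split /\s+/, $line;
    next unless defined $count && $count =~ /^\d+$/;
    print "$title\t$count\n";    # emit <key, value> pairs
}

#!/usr/bin/perl
# reducer.pl - illustrative streaming reducer (stand-in for wikireduce.pl)
# Hadoop sorts mapper output by key, so counts for one page arrive together
use strict;
use warnings;

my ($current, $sum) = (undef, 0);
while (my $line = <STDIN>) {
    chomp $line;
    my ($title, $count) = split /\t/, $line;
    next unless defined $count;
    if (defined $current && $title ne $current) {
        print "$current\t$sum\n";    # flush the finished key
        $sum = 0;
    }
    $current = $title;
    $sum += $count;
}
print "$current\t$sum\n" if defined $current;    # flush the last key

The same pipeline can be smoke-tested locally before shipping it to the cluster:
$ zcat pagecounts-*.gz | ./mapper.pl | sort | ./reducer.pl | head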
Real life example of Hadoop Streaming usage
Wikipedia Page Access Logs
Wine Grape Varieties
Wikipedia WGV Page Access Stats
Business Decisions
Launching a virtual Hadoop cluster
$ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
Created job flow <job flow id>
$ ec2din (...)
 
Hadoop
- Standalone Operation
- Pseudo-Distributed Operation
- Fully-Distributed Operation
- NameNode
- JobTracker
- DataNode + TaskTracker
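For reference, pseudo-distributed mode runs all of the daemons above on a single machine. On the Hadoop 0.20-era releases this deck targets, that boils down to roughly the following single-node settings; the values below are the stock examples from the Hadoop setup docs, not configuration taken from the talk:

<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>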

Add a step
$ elastic-mapreduce --jobflow <jfid> --stream --step-name "Wiki log crunch" --input s3n://dsikar-wikilogs-2009/dec/ --output s3n://dsikar-wikilogs-output/21 --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
JobTracker web UI: http://<instance public dns>:9100
s3cmd
# make bucket
$ s3cmd mb s3://dsikar-wikilogs
# put log files
$ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
$ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
# list log files
$ s3cmd ls s3://dsikar-wikilogs/
# put scripts
$ s3cmd put *.pl s3://dsikar-wiki-scripts/
# delete log files
$ s3cmd del --recursive --force s3://dsikar-wikilogs/
# remove bucket
$ s3cmd rb s3://dsikar-wikilogs/
Elastic MapReduce CLI options used: --create --list --jobflow --describe --stream --terminate
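Assuming the classic Ruby CLI syntax, a typical round trip with those options looks roughly like this (the job flow id is a placeholder):

$ elastic-mapreduce --list                        # recent job flows and their state
$ elastic-mapreduce --describe --jobflow <jfid>   # details for one job flow
$ elastic-mapreduce --terminate --jobflow <jfid>  # shut the cluster down when done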
Output files: part-00000 part-00001 part-00002 (...)
Further aggregation
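One way to handle that further aggregation locally, assuming the part files hold tab-separated <page, count> lines; this is a sketch, not the post-processing used in the talk:

#!/usr/bin/perl
# aggregate.pl - sum per-page counts across all part-* files named on the command line
use strict;
use warnings;

my %total;
while (my $line = <>) {                  # reads every file given as an argument
    chomp $line;
    my ($page, $count) = split /\t/, $line;
    next unless defined $count && $count =~ /^\d+$/;
    $total{$page} += $count;
}
for my $page (sort { $total{$b} <=> $total{$a} } keys %total) {
    print "$page\t$total{$page}\n";      # pages in descending order of total views
}

$ ./aggregate.pl part-0000* > totals.txt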
Conclusion: Hadoop MapReduce provides out-of-the-box, ready-to-go distributed computing.
That's all folks, and thanks for attending: QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD. Daniel Sikar
