Aws Quick Dirty Hadoop Mapreduce Ec2 S3
 

  • So, without further ado, let's get this show on the road and run a job concurrently on a few virtual machines.

Presentation Transcript

  • QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar
  • EC2 S3
  • Tools
    • AWS Command line tools
    • Elastic MapReduce Ruby library
    • Hadoop
    • s3cmd
  • Hadoop
    • MapReduce: Job Tracker
    • HDFS: Distributed file system
  • Hadoop MapReduce usage
    • Data crunching in general
    • Clicks
    • Statistics
    • etc.
  • Hadoop Project Mgmt Committee
  • MapReduce ?
  • MapReduce: key-value pairs <key, value>
  • MapReduce
  • HTTP Logs
    • Log file A:
      • (...) FreeTouchScreenNokia5230 (...)
      • (...) GetRidofAllSpeedCameras (...)
      • (...) USManWinsLottery (...)
      • (...) BNPToLaunchElectionManifesto (...)
    • Log file B:
      • (...) FreeTouchScreenNokia5230 (...)
      • (...) BodyLanguageTellsAll (...)
  • MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
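The summation on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's code: the function names `map_titles` and `reduce_counts` are hypothetical, and the log entries are the placeholder titles from the HTTP Logs slide.

```python
from collections import defaultdict

def map_titles(titles):
    """Map step: emit a <key, 1> pair for every title seen in a log."""
    return [(title, 1) for title in titles]

def reduce_counts(pairs):
    """Reduce step: sum the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Placeholder titles taken from the "HTTP Logs" slide.
log_a = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
log_b = ["FreeTouchScreenNokia5230", "BodyLanguageTellsAll"]

# Each log is mapped independently; the reduce step merges the pairs.
counts = reduce_counts(map_titles(log_a) + map_titles(log_b))
print(counts["FreeTouchScreenNokia5230"])
```

On a cluster the two map calls would run on different machines; only the merged pairs need to meet in the reduce step.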
  • Hadoop Streaming: running MapReduce jobs with .exe files and scripts
    $ <list> | mapper | reducer
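The `<list> | mapper | reducer` pipeline can be tried locally before any cluster is involved. The sketch below is an assumption-laden stand-in, not Hadoop itself: `mapper`, `reducer` and `run_pipeline` are hypothetical names, records are assumed to be tab-separated `key<TAB>count` lines, and an in-memory `sorted()` plays the part of the sort that happens between the map and reduce phases.

```python
import io
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated "key<TAB>1" record per non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"

def reducer(sorted_records):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    parsed = (record.rstrip("\n").split("\t") for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

def run_pipeline(stream):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(stream))))

# Tiny in-memory input standing in for a log file.
sample = io.StringIO("b\na\nb\n")
result = run_pipeline(sample)
print(result)
```

Because both functions only read lines and write lines, the same logic could be split into two standalone scripts that read standard input and write standard output, which is the contract a streaming job expects.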
  • Real life example of Hadoop Streaming usage
  • Wikipedia Page Access Logs
  • Wine Grape Varieties
  • Wikipedia WGV Page Access Stats
  • Business Decisions
  • Launching a virtual Hadoop cluster
    $ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
    Created job flow <job flow id>
    $ ec2din
    (...)
  • Hadoop
    • Standalone Operation
    • Pseudo-Distributed Operation
    • Fully-Distributed Operation
    • NameNode
    • JobTracker
    • DataNode + TaskTracker
  • Add a step
    $ elastic-mapreduce --jobflow <jfid> --stream \
        --step-name "Wiki log crunch" \
        --input s3n://dsikar-wikilogs-2009/dec/ \
        --output s3n://dsikar-wikilogs-output/21 \
        --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl \
        --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
    http://<instance public dns>:9100
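The slides do not show what `wikidictionarymap.pl` contains, so the following is only a guess at the general shape of a dictionary-filtering mapper, written in Python instead of Perl. It assumes each pagecounts line has four space-separated fields (project, page title, view count, bytes transferred); `dictionary_mapper` and the two grape names are hypothetical.

```python
def dictionary_mapper(lines, dictionary):
    """Emit "title<TAB>views" for page titles found in the dictionary.

    Assumed input format (not confirmed by the slides):
        <project> <page_title> <view_count> <bytes>
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        _project, title, views, _size = fields
        if title in dictionary and views.isdigit():
            yield f"{title}\t{views}"

# Hypothetical dictionary of wine grape variety page titles.
grapes = {"Merlot", "Pinot_noir"}

sample = [
    "en Merlot 12 34567",
    "en Main_Page 900 123456",
    "de Merlot 3 4000",
    "malformed line",
]
emitted = list(dictionary_mapper(sample, grapes))
print(emitted)
```

A matching reducer would then sum the view counts per title, in the same way the counts of 1 are summed in the MapReduce slide.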
  • s3cmd
    # make bucket
    $ s3cmd mb s3://dsikar-wikilogs
    # put log files
    $ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
    $ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
    # list log files
    $ s3cmd ls s3://dsikar-wikilogs/
    # put scripts
    $ s3cmd put *.pl s3://dsikar-wiki-scripts/
    # delete log files
    $ s3cmd del --recursive --force s3://dsikar-wikilogs/
    # remove bucket
    $ s3cmd rb s3://dsikar-wikilogs/
  • Elastic MapReduce
    • --create
    • --list
    • --jobflow
    • --describe
    • --stream
    • --terminate
  • Output files
    • part-00000
    • part-00001
    • part-00002
    • (...)
  • Further aggregation
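The slides do not say how the further aggregation was done. One possible approach, sketched here under assumptions, is to download the `part-*` files and merge them locally, for example to combine results from several runs or to rank the totals. The function name `aggregate_parts`, the tab-separated `key<TAB>count` record format and the demo data are all hypothetical.

```python
import glob
import os
import tempfile
from collections import Counter

def aggregate_parts(output_dir):
    """Sum "key<TAB>count" records across every part-* file in output_dir."""
    totals = Counter()
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Split on the last tab so keys containing tabs survive.
                key, _, count = line.rstrip("\n").rpartition("\t")
                if key and count.isdigit():
                    totals[key] += int(count)
    return totals

# Demo with made-up part files written to a temporary directory.
demo_files = {
    "part-00000": "Merlot\t12\nSyrah\t5\n",
    "part-00001": "Merlot\t3\n",
}
with tempfile.TemporaryDirectory() as demo_dir:
    for name, body in demo_files.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    top = aggregate_parts(demo_dir).most_common(2)
print(top)
```

`Counter.most_common` gives a ranked list directly, which is convenient when the goal is a top-N table rather than the full set of totals.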
  • Conclusion: Hadoop MapReduce provides out-of-the-box, ready-to-go distributed computing.
  • That's all folks and thanks for attending: QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar