QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD
Daniel Sikar
EC2 S3
 
Tools: AWS command line tools
Elastic MapReduce Ruby library
Hadoop
s3cmd
Hadoop: MapReduce (JobTracker + TaskTracker + slaves); HDFS (distributed file system)
Hadoop MapReduce usage: data crunching in general (clicks, statistics, etc.)
Hadoop Project Mgmt Committee
MapReduce?
MapReduce: key/value pairs <key, value>
MapReduce
HTTP Logs
Log file A:
(...) FreeTouchScreenNokia5230 (...)
(...) GetRidofAllSpeedCameras (...)
(...) USManWinsLottery (...)
(...) BNPToLaunchElectionManifesto (...)
Log file B:
(...) FreeTouchScreenNokia5230 (...)
(...) BodyLanguageTellsAll (...)
MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
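The addition above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Hadoop itself: the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical, and the log lines are simplified to just the page titles shown on the HTTP Logs slide.

```python
from collections import defaultdict

# Assumption: each log line has already been reduced to the page title;
# the titles are the ones shown on the HTTP Logs slide.
LOG_A = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
LOG_B = [
    "FreeTouchScreenNokia5230",
    "BodyLanguageTellsAll",
]


def map_phase(lines):
    """Emit a <key, 1> pair for every log line."""
    return [(line, 1) for line in lines]


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get <key, total>."""
    return {key: sum(values) for key, values in grouped.items()}


# Each log file is mapped independently, as separate map tasks would be.
pairs = map_phase(LOG_A) + map_phase(LOG_B)
counts = reduce_phase(shuffle(pairs))
print(counts["FreeTouchScreenNokia5230"])
```

On a real cluster the map calls run on different machines and the framework performs the grouping step between map and reduce.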
Hadoop Streaming: running MapReduce jobs with .exe files and scripts
$ <list> | mapper | reducer
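The `mapper | reducer` pipeline can be imitated locally. The sketch below assumes the usual streaming convention of one tab-separated `key<TAB>value` record per line, with the reducer's input sorted by key. The names `mapper`, `reducer` and `run_pipeline` are hypothetical. In a real streaming job the mapper and reducer would be two separate scripts reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Turn each non-empty input line into a 'key<TAB>1' record."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Relies on the input being sorted by key, which is what the sort
    step between mapper and reducer provides.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_pipeline(lines):
    """Local stand-in for: <list> | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical sample input using titles from the log slides.
sample = [
    "FreeTouchScreenNokia5230",
    "BodyLanguageTellsAll",
    "FreeTouchScreenNokia5230",
]
result = run_pipeline(sample)
print("\n".join(result))
```

Keeping the two stages as plain line-in, line-out functions mirrors why streaming works with any executable: each stage only needs to read and write text records.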
Real life example of Hadoop Streaming usage
Wikipedia Page Access Logs
Wine Grape Varieties
Wikipedia WGV Page Access Stats
Business Decisions
Launching a virtual Hadoop Cluster
$ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
Created job flow <job flow id>
$ ec2din (...)
 
Hadoop Standalone Operation
Pseudo-Distributed Operation
Fully-Distributed Operation
NameNode
JobTracker

Daniel Sikar: Hadoop MapReduce - 06/09/2010

Editor's Notes

  • #21 So, without further ado, let's get this show on the road and run a job concurrently on a few virtual machines.