Masterclass Webinar: Amazon Elastic MapReduce (EMR)
This webinar recording explains how to get started with Amazon Elastic MapReduce (EMR). EMR enables fast processing of large structured or unstructured datasets, and in this webinar we demonstrate how to set up an EMR job flow to analyse application logs and perform Hive queries against them. We review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job. We show the security configuration that allows direct access to the Amazon EMR cluster in interactive mode, and see how Hive provides a SQL-like environment while letting you dynamically grow and shrink the amount of compute used for powerful data processing activities.

Amazon EMR YouTube Recording: http://youtu.be/gSPh6VTBEbY

1. Masterclass: Elastic MapReduce
Ryan Shuttleworth – Technical Evangelist
@ryanAWS
2. Masterclass
A technical deep dive beyond the basics
Help educate you on how to get the best from AWS technologies
Show you how things work and how to get things done
Broaden your knowledge in ~45 mins
3. Amazon Elastic MapReduce
A key tool in the toolbox to help with ‘Big Data’ challenges
Makes possible analytics processes previously not feasible
Cost effective when leveraged with EC2 spot market
Broad ecosystem of tools to handle specific use cases
4. What is EMR?
Map-Reduce engine
Integrated with tools
Hadoop-as-a-service
Massively parallel
Cost effective
AWS wrapper
Integrated to AWS services
5. A framework:
Splits data into pieces
Lets processing occur
Gathers the results
6-12. Example: a very large click log (e.g. TBs) containing lots of actions by John Smith
Split the log into many small pieces
Process in an EMR cluster
Aggregate the results from all the nodes
Result: what John Smith did, insight in a fraction of the time
13. HDFS: reliable storage
MapReduce: data analysis
14. Big Data != Hadoop
15. Big Data != Hadoop: Velocity, Variety, Volume
16. Big Data != Hadoop: when you need to innovate to collect, store, analyze, and manage your data
17. Compute, Storage, Big Data: how do you get your slice of it?
AWS Direct Connect: dedicated low latency bandwidth
Queuing: highly scalable event buffering
Amazon Storage Gateway: sync local storage to the cloud
AWS Import/Export: physical media shipping
18. Compute, Storage, Big Data: where do you put your slice of it?
Relational Database Service: fully managed database (MySQL, Oracle, MSSQL)
DynamoDB: NoSQL, schemaless, provisioned throughput database
S3: object datastore up to 5TB per object, 99.999999999% durability
SimpleDB: NoSQL, schemaless, smaller datasets
19. Elastic MapReduce: getting started, tools & operating modes
20-27. Anatomy of an EMR cluster:
Start an EMR cluster using the console or CLI tools
A master instance group is created that controls the cluster (runs MySQL)
A core instance group is created for the life of the cluster
Core instances run the DataNode and TaskTracker daemons (HDFS)
Optional task instances can be added or subtracted to perform work
S3 can be used as the underlying ‘filesystem’ for input/output data
The master node coordinates distribution of work and manages cluster state
Core and task instances read/write to S3
28-29. Within each EC2 instance: input file → map → reduce → output file; the same pipeline runs on many EC2 instances in parallel
30-35. The EMR stack builds up in layers:
EMR with HDFS at its core
Analytics languages such as Pig
Data management: S3, DynamoDB, RDS, RedShift, AWS Data Pipeline
36-37. Job flow lifecycle: Create Job Flow → Run Job Step 1 → Run Job Step 2 → … → Run Job Step n → End Job Flow
Cluster created: master, core & task instance groups
Processing step: execute a script such as Java, a Hive script, Python, etc.
Chained steps: execute a subsequent step using data from the previous step; add steps dynamically or predefine them as a pipeline
Cluster terminated: cluster instances are terminated at the end of the job flow
Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node
A step can be a compiled Java .jar or a script executed by the Hadoop Streamer
Data is typically passed between steps using the HDFS file system on core instance group nodes, or S3
Output data is generally placed in S3 for subsequent use
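Steps can also be appended to a running job flow from code; here is a minimal sketch using the boto Python library that was current at the time of this webinar (the job flow ID, script and bucket names are hypothetical placeholders):

# Sketch: appending a streaming step to a running job flow with boto.
# The job flow ID, script and bucket names are hypothetical placeholders.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # credentials picked up from the environment

step = StreamingStep(
    name='Follow-on analysis',
    mapper='s3://mybucket/scripts/mapper.py',
    reducer='aggregate',
    input='s3://mybucket/output/step1',   # consume the previous step's output
    output='s3://mybucket/output/step2')

conn.add_jobflow_steps('j-XXXXXXXXXXXXX', [step])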
38. Hadoop Streaming
Utility that comes with the Hadoop distribution
Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer
The mapper reads its input from standard input and the reducer writes its output to standard output
By default, each line of input/output represents a record with tab-separated key/value
39-41. (Console walkthrough: set the input source, the output location, and the mapper script location in S3)
42-44. Streaming: word count Python script (stdin → stdout)

#!/usr/bin/python
import sys
import re

def main(argv):
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    # Read words from stdin line by line
    for line in sys.stdin:
        for word in pattern.findall(line):
            # Output to stdout as a tab-delimited record,
            # e.g. "LongValueSum:abacus<TAB>1"
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
45. Reducer script or function
46. Aggregate: sort inputs and add up totals, e.g. “Abacus 1” and “Abacus 1” becomes “Abacus 2”
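The built-in aggregate reducer handles this summing; as a hedged illustration, its LongValueSum behaviour is roughly the following Python, assuming tab-delimited records arrive already sorted by key (the real aggregate library also strips the LongValueSum: prefix from the output key):

#!/usr/bin/python
# Sketch of what the built-in "aggregate" reducer effectively does for
# LongValueSum records. Hadoop sorts records by key before the reducer runs.
import sys

current_key = None
total = 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print current_key + "\t" + str(total)
        current_key = key
        total = 0
    total += int(value)
if current_key is not None:
    print current_key + "\t" + str(total)

Locally the pair can be smoke-tested with a shell pipe such as: cat input.txt | python wordSplitter.py | sort | python reducer.py (file names here are illustrative).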
47-49. Instance groups
Master instance group: manages the job flow, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes.
Core instance group: contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Core nodes run both the DataNode and TaskTracker Hadoop daemons.
Task instance group: contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow, or add a task instance group to a job flow in progress. You can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses.
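Since task nodes can come and go, the task group can also be grown or shrunk from code; a hedged sketch with the era's boto library (all IDs and the bid price are hypothetical placeholders):

# Sketch: adding a spot task instance group to a running job flow, then
# resizing it. All IDs and the bid price are hypothetical placeholders.
from boto.emr.connection import EmrConnection
from boto.emr.instance_group import InstanceGroup

conn = EmrConnection()

task_group = InstanceGroup(
    num_instances=4, role='TASK', type='m1.large',
    market='SPOT', name='Burst capacity', bidprice='0.25')
conn.add_instance_groups('j-XXXXXXXXXXXXX', [task_group])

# Later, shrink the group back down to 2 nodes
conn.modify_instance_groups(['ig-XXXXXXXXXXXX'], [2])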
50. SSH onto the master node with this keypair
51. Keep the job flow running so you can add more to it
52. Sample word count output:
…
aakar 3
abdal 3
abdelaziz 18
abolish 3
aboriginal 12
abraham 6
abseron 9
abstentions 15
accession 73
accord 90
achabar 3
achievements 3
acic 4
acknowledged 6
acquis 9
acquisitions 3
…
53-57. Running the word count sample with the EMR CLI:

./elastic-mapreduce --create --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3n://myawsbucket \
  --reducer aggregate
58. This launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. When your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters.
59-61. Launching a larger, persistent cluster:

elastic-mapreduce \
  --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --log-uri s3n://mybucket/EMR/log

--num-instances and --instance-type set the instance count and type; --alive keeps the cluster running (rather than terminating it) so more steps can be added.
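The same launch can be driven from Python; a minimal sketch with the era's boto library (key pair, bucket and names are placeholders):

# Sketch: launching a persistent ("alive") job flow with boto.
# Key pair, bucket and names are hypothetical placeholders.
from boto.emr.connection import EmrConnection

conn = EmrConnection()
jobflow_id = conn.run_jobflow(
    name='MyJobFlow',
    log_uri='s3n://mybucket/EMR/log',
    ec2_keyname='micro',
    num_instances=5,
    master_instance_type='m2.4xlarge',
    slave_instance_type='m2.4xlarge',
    keep_alive=True)  # equivalent of --alive
print jobflow_id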
62. Adding Hive, Pig and HBase to the job flow:

elastic-mapreduce \
  --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest \
  --hbase \
  --log-uri s3n://mybucket/EMR/log
63. Using a bootstrap action to configure the cluster:

elastic-mapreduce \
  --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,dfs.block.size=1048576" \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest
64. You can specify up to 16 bootstrap actions per job flow by providing multiple --bootstrap-action parameters.

s3://elasticmapreduce/bootstrap-actions/configure-daemons (configure JVM properties):
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

s3://elasticmapreduce/bootstrap-actions/configure-hadoop (configure cluster-wide Hadoop settings):
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"

s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive (configure cluster-wide memory settings):
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
65. s3://elasticmapreduce/bootstrap-actions/run-if (conditionally run a command):
./elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"

Shutdown actions (run scripts on shutdown): a bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. Each script must run and complete within 60 seconds.

Custom actions (run a custom script):
--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
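As an illustration of the shutdown-actions mechanism, a bootstrap action could register a hook like this minimal sketch (the log-copy command inside the generated script is a hypothetical example):

#!/usr/bin/python
# Sketch: a bootstrap action that registers a shutdown action by writing
# an executable script into the shutdown-actions directory. The command
# inside the generated script is a hypothetical example.
import os
import stat

SHUTDOWN_DIR = "/mnt/var/lib/instance-controller/public/shutdown-actions"

path = os.path.join(SHUTDOWN_DIR, "save_logs.sh")
with open(path, "w") as f:
    f.write("#!/bin/bash\n")
    # Each shutdown script must run and complete within 60 seconds
    f.write("hadoop fs -put /mnt/var/log/hadoop/steps s3://mybucket/logs/\n")

os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)  # make it executable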
66. Integrated tools: selecting the right one
67-69. Hadoop streaming
Unstructured data with wide variation of formats, high flexibility on language
Broad language support (Python, Perl, Bash…)
Mapper, reducer, input and output all supplied by the job flow
Scale from 1 to an unlimited number of processes
But hand-rolled MapReduce has drawbacks: doing analytics in Eclipse is wrong; the 2-stage dataflow is rigid; joins are lengthy and error-prone; prototyping/exploration requires compile & deploy
Alternatives:
Hive: structured data which is lower value or where higher latency is tolerable
Pig: semi-structured data which can be mined using set operations
HBase: near real time K/V store for structured data
70. Hive integration: schema on read
71. SQL interface for working with data; a simple way to use Hadoop
Create Table statement references the data location on S3
Language called HiveQL, similar to SQL
An example of a query could be: SELECT COUNT(1) FROM sometable;
Requires you to set up a mapping to the input data
Uses SerDes to make different input formats queryable
Powerful data types (Array & Map…)
72. SQL vs HiveQL:

Feature                 SQL                      HiveQL
Updates                 UPDATE, INSERT, DELETE   INSERT OVERWRITE TABLE
Transactions            Supported                Not supported
Indexes                 Supported                Not supported
Latency                 Sub-second               Minutes
Functions               Hundreds                 Dozens
Multi-table inserts     Not supported            Supported
Create table as select  Not valid SQL-92         Supported
73. JDBC listener started by default on EMR
Port 10003 for Hive 0.8.1; ports 9100-9101 for Hadoop
Download hive-jdbc-0.8.1.jar
Access by: opening the port on the ElasticMapReduce-master security group, or an SSH tunnel
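The tunnel route might look like this from Python (the key file and master DNS name are hypothetical placeholders):

# Sketch: opening an SSH tunnel so a local JDBC client can reach the
# Hive listener on the master node via localhost:10003. The key file
# and master DNS name are hypothetical placeholders.
import subprocess

subprocess.Popen([
    "ssh", "-i", "micro.pem",
    "-N",                           # forward ports only, no remote command
    "-L", "10003:localhost:10003",  # local port -> Hive JDBC listener
    "hadoop@ec2-xx-xx-xx.eu-west-1.compute.amazonaws.com",
])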
74. Running a Hive script as a job flow step:

./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

(s3://myawsbucket/myquery.q is the HiveQL to execute)
75. Starting an interactive Hive session:

./elastic-mapreduce \
  --create \
  --alive \
  --name "Hive job flow" \
  --num-instances 5 \
  --instance-type m1.large \
  --hive-interactive
76-78. Creating the table structure (happens fast, as it is just a mapping to the source data in S3):

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ('paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip')
LOCATION 's3://mybucketsource/tables/impressions';
79. hive> select * from impressions limit 5;
(selecting from the source data directly via Hadoop)
80. Tool comparison:

                    Streaming   Hive       Pig        DynamoDB        Redshift
Unstructured data   ✓                      ✓
Structured data                 ✓          ✓          ✓               ✓
Language support    Any*        HQL        Pig Latin  Client          SQL
SQL                             ✓                                     ✓
Volume              Unlimited   Unlimited  Unlimited  Relatively low  1.6 PB
Latency             Medium      Medium     Medium     Ultra low       Low
81. Cost considerations: making the most of EC2 resources
82. 1 instance for 100 hours = 100 instances for 1 hour
83. Small instance = $8
84. 1 instance for 1,000 hours = 1,000 instances for 1 hour
85. Small instance = $80
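The equivalence checks out arithmetically; a quick verification in Python (the $0.08/hour rate is inferred from the $8 / 100-hour figure above, not quoted from a price list):

# Quick check of the cost equivalence above. The hourly rate is inferred
# from the slide's own figures, not an authoritative price.
rate = 8.0 / 100              # $8 for 100 small-instance hours -> $0.08/hr
print rate * 1 * 1000         # 1 instance for 1,000 hours:  80.0
print rate * 1000 * 1         # 1,000 instances for 1 hour:  80.0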
86-89. Achieving economies of scale (utilisation over time): reserved capacity covers the baseline, on-demand absorbs the peaks above it, and spot adds opportunistic capacity on top.
90-93. EMR with spot instances:

Scenario #1 (without spot): job flow duration 14 hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2 (with spot): job flow duration 7 hours
4 instances * 7 hrs * $0.50 = $14, plus
5 spot instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time savings: 50%; cost savings: ~19%
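Those figures verify directly:

# Verifying the spot scenario arithmetic from the slide.
on_demand = 4 * 14 * 0.50                  # scenario 1: 28.0
with_spot = 4 * 7 * 0.50 + 5 * 7 * 0.25    # scenario 2: 14 + 8.75 = 22.75
print with_spot                            # 22.75
print (on_demand - with_spot) / on_demand  # ~0.19 cost saving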
94. Some other things… hints and tips
95. Managing big data:
Dynamic creation of clusters for processing jobs
Bulk of data stored in S3
Use Reduced Redundancy Storage for lower value / re-creatable data, and lower cost by 30%
DynamoDB for high performance data access
Migrate data to Glacier for archive and periodic access
VPC & IAM for network security
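For example, Reduced Redundancy Storage is chosen per object at upload time; a minimal boto sketch (bucket and key names are hypothetical placeholders):

# Sketch: uploading re-creatable data to S3 with Reduced Redundancy
# Storage via boto. Bucket and key names are hypothetical placeholders.
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection()
bucket = conn.get_bucket('mybucket')

key = Key(bucket, 'intermediate/part-00000')
key.set_contents_from_filename('part-00000', reduced_redundancy=True)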
96. Shark can return results up to 30 times faster than the same queries run on Hive.

elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh \
  --bootstrap-name "install Mesos/Spark/Shark" \
  --hive-interactive --hive-versions 0.7.1.3 \
  --ami-version 2.0 \
  --instance-type m1.xlarge --instance-count 3 \
  --key-pair ec2-keypair
97. Summary

98. What is EMR?
Map-Reduce engine
Integrated with tools
Hadoop-as-a-service
Massively parallel
Cost effective
AWS wrapper
Integrated to AWS services

99. Try the tutorial: aws.amazon.com/articles/2855
Find out more: aws.amazon.com/big-data