Richard Cole of Amazon Gives Lightning Talk at BigDataCamp


  • We also support Hadoop 0.18
  • Hi, I’m Richard Cole, a software engineer on the Amazon Elastic MapReduce team. I’m going to run through some of the features of Elastic MapReduce. At the end of the talk I’ll give you the URL to these slides so you can download them; that way you don’t need to note down URLs.
  • Here’s an overview. First I’ll talk a little about what Amazon Elastic MapReduce is. Then I’ll explain how to get set up with EMR. Next I’ll run through an example of developing a bootstrap action. I’ll then go through a quick example using Hive. My intention is to take you through many of the useful features of our service.
  • Now I want to show you briefly how to get started with Elastic MapReduce: how to sign up for EMR and SimpleDB.
  • This is the main page for Amazon Web Services. Click the orange sign-up button on the right.
  • This is the main page for Amazon Elastic MapReduce. Click the orange sign-up button on the right.
  • This is the main page for Amazon SimpleDB. Click the sign-up button on the right. SimpleDB is required for Hadoop debugging.
  • Next download the Elastic MapReduce command line client. Click the download button.
  • To install the command line client you need to have Ruby installed. You basically unzip the client into a directory and create a credentials file either there or in your home directory. The credentials file needs to be filled in with some details that we’ll fetch in the next few slides: your AWS credentials, an EC2 key pair, and a log-uri, which is where log files from your job flow will be uploaded.
  • Next we need a copy of the access credentials. Copy your access ID and private key into the credentials file.
  • To create an EC2 key pair we’re going to use the AWS Management Console. Click the orange button on the right.
  • Click on the EC2 tab. The EC2 key pair is required to SSH to the cluster. Click Create a New Key Pair. Save the secret key somewhere safe. Copy the name of the key pair and the location of the key pair file into the credentials.json file.
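Putting the last few notes together, a minimal sketch of creating the credentials file the command line client reads. All field values are the placeholders from the slide’s example; substitute your own access ID, private key, key pair name and file, and log bucket.

```shell
# Create the credentials.json the EMR command line client reads.
# Every value below is a placeholder -- fill in your own details.
mkdir -p "$HOME/elastic-mapreduce"
cat > "$HOME/elastic-mapreduce/credentials.json" <<'EOF'
{
  "access-id": "1111111111111111111",
  "private-key": "ababababababababababababababaaba",
  "key-pair": "emr-demo",
  "key-pair-file": "/home/richcole/emr-demo.pem",
  "log-uri": "s3://emr-demo/logs"
}
EOF
```

The client also looks for credentials.json in the directory it was unzipped into, which is why the sketch writes it there.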
  • You don’t need to use the command line client. You can also call the web service from Java. Here’s the AWS SDK for Java. To download it you click the yellow button on the right.
  • Here’s a recap of what we just did.
  • A job flow is what we call a Hadoop cluster that is running, or that ran at some time. Log files from the cluster are stored in Amazon S3 so that they remain accessible after the job flow has shut down. Typically a job flow runs in batch mode: it executes a series of MapReduce jobs and then terminates. The batch job might, for example, analyse log files over some period of time and produce data in a structured format that is stored in S3. You might also run a job flow in interactive mode. The typical use case for an interactive job flow is developing a batch process: you start with a smaller job flow and a small portion of your data, run the Hadoop jobs that are under development, and check the results you get. Another reason to run an interactive job flow is ad-hoc analysis, where each query you run suggests the next query to run. Finally, you can run a job flow as an always-on, long-running job flow. In this case you persist data to Amazon S3 so that you can recover in the event of a master failure, but in the normal case you pull data continuously into your data warehouse and run a variety of batch-mode and ad-hoc processing on the job flow.
  • Job flows have steps. A step specifies a jar located in Amazon S3 to be run on the master node. The jar is like a Hadoop job jar: it has a main function that is either specified in the manifest of the jar or on the command line, and it can contain lib jars in the same way that a Hadoop job jar does. Typically a step uses the Hadoop APIs to create one or more Hadoop jobs and waits for them to terminate. Steps are executed sequentially. A step jar indicates failure by returning a non-zero value. A step property called ActionOnFailure says what to do after a step fails. The options are: CONTINUE, which continues on to the next step, effectively ignoring the error; CANCEL_AND_WAIT, which cancels all following steps; and TERMINATE_JOBFLOW, which terminates the job flow regardless of the setting of KeepJobFlowAliveWhenNoSteps. This last property belongs to the job flow; it decides what to do once all the steps have been executed or cancelled. If you want an interactive or long-lived cluster, you need to set it to true.
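The step semantics above can be sketched as a toy shell function. This is a model of the behavior, not the service’s real code: steps run in order, a non-zero exit marks failure, and the ActionOnFailure setting decides what happens to the remaining steps.

```shell
# Toy model of step execution: steps run sequentially; a non-zero exit
# signals failure, and ActionOnFailure decides what happens next.
run_steps() {
  local action="$1"; shift          # first arg: the ActionOnFailure setting
  for step in "$@"; do
    if ! bash -c "$step"; then      # a step jar fails by returning non-zero
      case "$action" in
        CONTINUE)          continue ;;                               # ignore the error
        CANCEL_AND_WAIT)   echo "cancelled remaining steps"; return 1 ;;
        TERMINATE_JOBFLOW) echo "terminated job flow";       return 2 ;;
      esac
    fi
  done
}

run_steps CONTINUE "echo step1" "false" "echo step3"          # step3 still runs
run_steps CANCEL_AND_WAIT "false" "echo never runs" || true   # stops after the failure
```

The real service applies the per-step ActionOnFailure value; this sketch uses one value for the whole run just to keep the model short.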
  • Steps only run on the master node; bootstrap actions run on all nodes. They run after Hadoop is configured but before Hadoop is started, so you can use them to modify the site config to set settings that are not settable on a per-job basis. You can also use bootstrap actions to install additional software on the nodes or to modify the machine configuration, for example to add more swap space. Bootstrap actions run as the hadoop user; however, the hadoop user can escalate to root without a password using sudo, so within bootstrap actions you really have complete control over the nodes.
  • Bootstrap actions are typically scripts located in Amazon S3. They can use Hadoop to download additional software to execute from S3. They indicate failure by returning a non-zero value. If a bootstrap action fails, the node is discarded. Be careful though: if more than 10% of your nodes fail their bootstrap action, the job flow fails.
  • Next I want to show you an example of developing a bootstrap action. Let’s say your application requires the MySQL client library for Ruby: you have a streaming job that needs to fetch some parameters from a running Amazon RDS instance, so you want a bootstrap action that installs the MySQL client library. First you create an install script. We’re going to use bash, but you could use Ruby, Python, Perl, or whatever is your favorite. The script first does set -e -x to turn on tracing and to make the script fail with a non-zero value if any command in it fails. Next it escalates to root using sudo and installs the library using apt-get. The nodes run Debian/stable, and the tool for installing software under Debian is apt-get. We’ll put this script in a file and upload it to S3.
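Here’s a minimal, locally runnable skeleton of that script pattern. A deliberately failing command stands in for the apt-get install, to show how set -e turns any failure into the non-zero exit that marks the bootstrap action (and hence the node) as failed:

```shell
# Skeleton of a bootstrap action, runnable locally. `false` stands in for a
# real install command such as `sudo apt-get install libmysql-ruby`.
cat > /tmp/bootstrap-demo.sh <<'EOF'
#!/bin/bash
set -e -x              # -x traces commands; -e aborts on the first failure
echo "configuring node"
false                  # a failing command: the script exits non-zero here
echo "never reached"
EOF
chmod a+x /tmp/bootstrap-demo.sh
/tmp/bootstrap-demo.sh || echo "bootstrap action failed; this node would be discarded"
```

Replace the `false` line with the real install commands before uploading the script to S3.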
  • So next let’s run an interactive job flow using the command line client. The --alive option makes the job flow keep running even when all steps are finished; it is important for an interactive job flow. Next we ssh to the master node and copy our script from Amazon S3, where we uploaded it. Then we make the script executable and execute it.
  • Next we’ll run a job flow, specifying the bootstrap action script on the command line. The script will then run on all nodes in the job flow and install the Ruby MySQL client for us.
  • Test on a small subset so you don’t waste lots of money
  • Logs are delayed by 5 minutes
  • Log directory must be a bucket that you own

    1. 1. What is Amazon Elastic MapReduce ? <ul><li>A webservice for running Hadoop clusters in the Amazon Web Services </li></ul><ul><ul><li>Easy to use </li></ul></ul><ul><ul><li>Reliable and Secure </li></ul></ul><ul><ul><li>Scalable On-Demand Service </li></ul></ul><ul><ul><li>Extensible – A service for others to build upon </li></ul></ul><ul><ul><li>Customizable </li></ul></ul>
    2. 2. Amazon Elastic MapReduce <ul><li>Runs Hadoop 0.20, Hive 0.5, Pig 0.6 </li></ul><ul><li>Takes care of provisioning nodes </li></ul><ul><ul><li>Installed and Configured Hadoop customized to Amazon EC2 and S3 </li></ul></ul><ul><ul><li>Bootstrap actions let you install software and further configure nodes </li></ul></ul><ul><li>Monitors your cluster for job completion </li></ul><ul><li>Pushes logs into Amazon S3 and provides Web UI tools for debugging and analysis </li></ul>
    3. 3. <ul><li>In my talk I’ll cover </li></ul><ul><ul><li>Getting setup and activating your AWS credits </li></ul></ul><ul><ul><li>Using Bootstrap Actions to Customize your cluster </li></ul></ul><ul><ul><li>Tips and tricks </li></ul></ul><ul><li>Watch out for </li></ul><ul><ul><li>Amazon Elastic MapReduce Keynote tomorrow </li></ul></ul><ul><ul><li>Hadoop Day in Seattle, August 14 </li></ul></ul><ul><ul><ul><li>Watch events on </li></ul></ul></ul>
    4. 4. Getting Started With Amazon Elastic MapReduce
    5. 5. Overview <ul><li>What is Amazon Elastic MapReduce? </li></ul><ul><li>Getting Setup </li></ul><ul><li>Developing a Bootstrap Action </li></ul><ul><li>Tips and Tricks </li></ul>
    6. 6. What is Amazon Elastic MapReduce <ul><li>A webservice for running Hadoop in AWS </li></ul><ul><li>Runs Hadoop 0.20, Hive 0.5, Pig 0.6 </li></ul><ul><li>Takes care of provisioning nodes </li></ul><ul><ul><li>Installed and Configured Hadoop customized to Amazon EC2 and Amazon S3 </li></ul></ul><ul><ul><li>Bootstrap actions let you install software and further configure nodes </li></ul></ul><ul><li>Monitors your Hadoop jobs </li></ul><ul><li>Pushes logs into Amazon S3 and provides Web UI tools for debugging and analysis </li></ul>
    7. 7. Getting Started with Elastic MapReduce <ul><li>Sign up for Elastic MapReduce and SimpleDB </li></ul><ul><li>Install the Elastic MapReduce Command Line Client </li></ul><ul><li>Create an EC2 KeyPair to SSH to the Master Node </li></ul><ul><li>Retrieve Your AWS Credential </li></ul>
    8. 8. Create your AWS Account
    9. 9. Claim your AWS Credits <ul><li> </li></ul>
    10. 10. Sign up for Amazon Elastic MapReduce
    11. 11. Sign up for Amazon SimpleDB
    12. 12. Download the Command Line Client
    13. 13. Install the Command Line Client cd $HOME mkdir -p elastic-mapreduce cd elastic-mapreduce wget unzip export PATH=$PATH:$(pwd) { &quot;access-id&quot;: &quot;1111111111111111111&quot;, &quot;private-key&quot;: &quot;ababababababababababababababaaba&quot;, &quot;key-pair&quot;: &quot;emr-demo&quot;, &quot;key-pair-file&quot;: &quot;/home/richcole/emr-demo.pem&quot;, &quot;log-uri&quot;: &quot;s3://emr-demo/logs&quot; } credentials.json
    14. 14. Obtaining your AWS Credentials
    15. 15. AWS Management Console
    16. 16. Create EC2 Keypair
    17. 17. Fill out credentials.json file <ul><li>Add key-pair, access-id and private key to your credentials.json file </li></ul>{ &quot;access-id&quot;: &quot;1111111111111111111&quot;, &quot;private-key&quot;: &quot;ababababababababababababababaaba&quot;, &quot;key-pair&quot;: &quot;emr-demo&quot;, &quot;key-pair-file&quot;: &quot;/home/richcole/emr-demo.pem&quot;, &quot;log-uri&quot;: &quot;s3://emr-demo/logs&quot; } credentials.json
    18. 18. AWS Java SDK
    19. 19. Recap <ul><li>Signed up for the Amazon Web Services </li></ul><ul><li>Installed Amazon Elastic MapReduce Command Line Client </li></ul><ul><li>Filled out credentials.json file </li></ul><ul><ul><li>Specify AWS Credentials </li></ul></ul><ul><ul><li>Specify an EC2 KeyPair created using the AWS Management Console </li></ul></ul><ul><ul><li>Specified a Log URI </li></ul></ul><ul><li>Downloaded the AWS Java SDK </li></ul>
    20. 20. Terminology <ul><li>Job flow – A Hadoop Cluster running in AWS </li></ul><ul><li>Step </li></ul><ul><ul><li>Hadoop Jars executing jobs </li></ul></ul><ul><ul><li>Hadoop Streaming </li></ul></ul><ul><ul><li>A jar that does something else </li></ul></ul><ul><li>Bootstrap Actions – Configure Hadoop </li></ul><ul><li>AWS Management Console – Create, Debug, and Analyze job flows </li></ul>
    21. 21. Job Flow <ul><li>A job flow is a Hadoop cluster that ran at some time in Amazon EC2 </li></ul><ul><ul><li>Log files are stored in S3 and may be retrieved later </li></ul></ul><ul><li>Three types of job flow </li></ul><ul><ul><li>Batch job flows (read from S3, run, write to S3) </li></ul></ul><ul><ul><li>Interactive job flow (development or ad-hoc analysis) </li></ul></ul><ul><ul><li>Long running job flows (data warehouse) </li></ul></ul>
    22. 22. Job Flow Steps <ul><li>Steps specifies a jar run on the master node </li></ul><ul><ul><li>The jar typically runs one or more Hadoop jobs </li></ul></ul><ul><ul><li>Steps are executed sequentially </li></ul></ul><ul><ul><li>Failure is indicated by returning non-zero </li></ul></ul><ul><ul><li>ActionOnFailure says what to do next </li></ul></ul><ul><ul><ul><li>CONTINUE </li></ul></ul></ul><ul><ul><ul><li>CANCEL_AND_WAIT </li></ul></ul></ul><ul><ul><ul><li>TERMINATE_JOBFLOW </li></ul></ul></ul><ul><ul><li>KeepJobFlowAliveWhenNoSteps </li></ul></ul>
    23. 23. Bootstrap Actions <ul><li>Run on all nodes during boot </li></ul><ul><ul><li>Hadoop is installed and configured but not running yet </li></ul></ul><ul><li>Typical uses </li></ul><ul><ul><li>Modify setting in site-config that may not be overridden in jobconf </li></ul></ul><ul><ul><li>Install additional software </li></ul></ul><ul><ul><li>Modify machine configuration, e.g. add swap </li></ul></ul>
    24. 24. Bootstrap Actions <ul><li>Bootstrap actions are typically scripts </li></ul><ul><ul><li>Download additional software to execute </li></ul></ul><ul><li>Indicate failure by returning non-zero </li></ul><ul><ul><li>Failure means the slave node will be discarded </li></ul></ul><ul><ul><li>More than 10% failures of slave nodes causes the job flow to fail </li></ul></ul><ul><ul><li>Failure of the master causes the job flow to fail </li></ul></ul>
    25. 25. Developing a Bootstrap Action <ul><li>Elastic MapReduce nodes do not include libmysql-ruby </li></ul><ul><li>Let’s say you want to install libmysql-ruby because your steps need to query an Amazon RDS database. </li></ul>#!/bin/bash set -e -x sudo apt-get install libmysql-ruby s3://emr-demo/scripts/
    26. 26. Test on Interactive Job Flow <ul><li>Run an interactive jobflow and ssh to master </li></ul>elastic-mapreduce --create --alive --name "My Development JobFlow" --enable-debugging elastic-mapreduce --ssh j-ABABABABABABA <ul><li>Test the script on the master node </li></ul>rm -rf test-tmp && mkdir test-tmp && cd test-tmp hadoop fs -copyToLocal s3://emr-demo/scripts/ . chmod a+x ./
    27. 27. Testing a Bootstrap Action <ul><li>Start a jobflow that installs the bootstrap action </li></ul>elastic-mapreduce --create --alive --name "My Development JobFlow" --enable-debugging --bootstrap-script s3://emr-demo/scripts/ --bootstrap-name "Install Ruby MySQL" <ul><li>You can chain up to 16 bootstrap actions when creating a jobflow </li></ul>
    28. 28. Predefined Bootstrap Actions <ul><li>Configure Hadoop </li></ul><ul><ul><li>Allows you to override Hadoop site-config settings </li></ul></ul><ul><ul><li>e.g. change the number of map tasks to run concurrently on node </li></ul></ul><ul><li>Configure Daemons </li></ul><ul><ul><li>Change the memory allocated to Hadoop Daemons </li></ul></ul><ul><li>Run-if </li></ul><ul><ul><li>Conditionally run another bootstrap action based on whether the node is the master or slave </li></ul></ul>
    29. 29. Recap -- Bootstrap Action <ul><li>Create a script and upload to S3 </li></ul><ul><li>Test the script manually first on an interactive jobflow </li></ul><ul><li>Once you’re sure the script works, run it as a bootstrap action on a jobflow </li></ul><ul><li>Use predefined bootstrap actions to modify Hadoop Site Config and Daemon settings </li></ul>
    30. 30. Tips and Tricks <ul><li>The Hadoop UI is also available using FoxyProxy </li></ul><ul><li>Use the Amazon S3 Tab in the AWS Management Console </li></ul><ul><li>Enable Hadoop Debugging </li></ul><ul><li>Unit test components </li></ul><ul><li>Run on a small subset of your data first </li></ul><ul><li>Use higher level abstractions like Hive, Pig, or Cascading </li></ul>
    31. 31. Questions? <ul><li>Resources , Articles, Tutorials and Developer Guide </li></ul><ul><ul><li> </li></ul></ul><ul><li>These slides available from </li></ul><ul><ul><li> </li></ul></ul>
    32. 32. Hive Example - Outline <ul><li>Run an example query on data stored as tables in Amazon S3 </li></ul><ul><li>Use the Debugging feature of the Management Console to Inspect Tasks </li></ul><ul><li>Access the Hadoop UI </li></ul><ul><li>Run the query in batch mode </li></ul>
    33. 33. Hive Example <ul><li>We have an Adserver that serves ads on the internet </li></ul><ul><li>It records impressions (showing of an ad on a web page) </li></ul><ul><li>and clicks (clicks by users on the ad) </li></ul><ul><li>Every impression contains a product, we want to estimate the clickthrough (chance of a click given a product) for every product </li></ul>
    34. 34. Example Continued select impressions.adId as adId, count(distinct clickId) / count(1) as clickthrough from impressions left outer join clicks on impressions.impressionId = clicks.impressionId group by impressions.adId ; impression_id, user_id, ad_id, … i-ABABABAB, u-ABABA, a-ABABABA … impression_id, click_id, … i-ABABABA, c-ABABA, … … impressions clicks
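To make the arithmetic concrete, here is the same join-and-divide the Hive query performs, computed with awk over a few made-up rows (illustrative data only; the real query also deduplicates click IDs with count(distinct clickId)):

```shell
# Tiny stand-in for the Hive query: join clicks to impressions on
# impression_id, then per ad compute clicked impressions / impressions.
cat > /tmp/impressions.csv <<'EOF'
i-1,u-1,a-1
i-2,u-2,a-1
i-3,u-3,a-2
i-4,u-4,a-1
EOF
cat > /tmp/clicks.csv <<'EOF'
i-1,c-1
i-4,c-2
EOF
awk -F, '
  NR == FNR { click[$1] = $2; next }   # first file: impression_id -> click_id
  {
    imp[$3]++                          # impressions seen per ad
    if ($1 in click) clk[$3]++         # clicked impressions per ad
  }
  END { for (ad in imp) printf "%s %.2f\n", ad, clk[ad] / imp[ad] }
' /tmp/clicks.csv /tmp/impressions.csv | sort
# a-1 0.67   (2 clicks / 3 impressions)
# a-2 0.00   (0 clicks / 1 impression)
```

Hive does the same left outer join and per-ad division, just distributed across the cluster instead of in one process.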
    35. 35. Partitioned Tables in Amazon S3 s3://elasticmapreduce/samples/hive-ads/tables/clicks/ dt=2009-04-14-13-00/ dt=2009-04-14-13-01/
    36. 36. SSH To The Master Node $ chmod og-rwx $HOME/emr-demo.pem $ export PATH=$PATH:$HOME/elastic-mapreduce $ elastic-mapreduce --list --active j-1FGYJOQRLQ7OH WAITING My Interactive JobFlow COMPLETED Setup Hadoop Debugging COMPLETED Setup Hive $ elastic-mapreduce --ssh --jobflow j-1FGYJOQRLQ7OH ssh -o StrictHostKeyChecking=no -i /home/... ... hadoop@ip-10-242-235-81:~$ hive
    37. 37. Start Hive hive -d SAMPLE=s3://elasticmapreduce/samples/hive-ads -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://mybucket/samples/output
    38. 38. Declare the Impressions Table ADD JAR ${SAMPLE}/libs/jsonserde.jar ; CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde '' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION '${SAMPLE}/tables/impressions' ; ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ;
    39. 39. Declare Clicks Table CREATE EXTERNAL TABLE clicks ( impressionId string, clickId string ) PARTITIONED BY (dt string) ROW FORMAT SERDE '' WITH SERDEPROPERTIES ( 'paths'='impressionId, number' ) LOCATION '${SAMPLE}/tables/clicks' ; ALTER TABLE clicks ADD PARTITION (dt='2009-04-13-08-05') ;
    40. 40. Execute Hive Query INSERT OVERWRITE DIRECTORY &quot;s3://emr-demo/output/clickthough&quot; SELECT impressions.adId as adId, count(distinct clickId) / count(1) as clickthrough FROM impressions left outer join clicks on impressions.impressionId = clicks.impressionId GROUP BY impressions.adId ORDER by clickthrough desc ; Ended Job = job_201006270056_0011 2868 Rows loaded to s3://emr-demo/output/clickthough
    41. 41. Viewing Steps
    42. 42. Viewing Hadoop Jobs
    43. 43. Viewing Tasks
    44. 44. Viewing Task Attempts
    45. 45. Download the Output
    46. 46. Accessing the Hadoop UI ssh -i c:/Users/richcole/emr-demo.pem -ND 8157 [email_address] Install FoxyProxy Leave the Default proxy setting as is, add a new proxy - select SOCKS Proxy, and SOCKS 5 - select localhost and port 8157 - add a whitelist rule for http://*ec2** - add a whitelist rule for http://*ec2.internal*
    47. 47. The Hadoop UI through FoxyProxy
    48. 48. Viewing Live Task Attempts
    49. 49. Running a Hive Job from CLI <ul><li>Upload Hive Script to S3 </li></ul><ul><ul><li>Use the AWS Management Console to upload the script to s3://emr-demo/scripts/myscript.q </li></ul></ul><ul><li>Then run </li></ul><ul><ul><li>SAMPLE=s3://elasticmapreduce/samples/hive-ads </li></ul></ul><ul><ul><li>elastic-mapreduce --jobflow j-ABABABABA </li></ul></ul><ul><ul><li>--hive-script s3://emr-demo/scripts/myscript.q </li></ul></ul><ul><ul><li>--args -d,SAMPLE=$SAMPLE </li></ul></ul><ul><ul><li>--args -d,OUTPUT=s3://emr-demo/output/clickthough </li></ul></ul>
    50. 50. Recap <ul><li>Ran Hive interactively and tested the components of our script </li></ul><ul><li>Inspected Hadoop Jobs/Tasks </li></ul><ul><ul><li>in AWS Management Console </li></ul></ul><ul><ul><li>In the Hadoop UI using FoxyProxy </li></ul></ul><ul><li>Submitted Hive Job by adding a step to a running job flow using the command line tool </li></ul>
    51. 51. The End <ul><ul><li>Slides available from: </li></ul></ul><ul><ul><li> </li></ul></ul>
    52. 52. Starting an Interactive Job Flow
    53. 53. Choose Interactive Session
    54. 54. Select Keypair, Log Path, and Enable Debugging
    55. 55. Proceed with no bootstrap actions
    56. 56. Final Review of Selections