The document provides an overview of Amazon Elastic MapReduce (EMR), a web service that allows users to easily and cost-effectively process large amounts of data using a Hadoop cluster. It discusses how EMR handles provisioning Hadoop clusters on EC2 instances, installing and configuring Hadoop, monitoring jobs, and providing debugging tools. It also covers how to get started with EMR, including signing up for accounts, installing the command line client, and filling out credentials. Tips are provided on customizing clusters using bootstrap actions and developing MapReduce applications on EMR.
The Future is Now: Leveraging the Cloud with Ruby - Robert Dempsey
My presentation from the Ruby Hoedown on cloud computing and how Ruby developers can take advantage of cloud services to build scalable web applications.
Castles in the Cloud: Developing with Google App Engine - catherinewall
App Engine offers developers the opportunity to deploy systems on Google's robust and scalable server farms. App Engine provides a higher-level platform than Amazon Web Services, with automated scaling and true pay-per-use billing.
The poster child of App Engine, "BuddyPoke", has gained over thirty million users.
With App Engine, Google has released the first public API to BigTable, its planetary datastore, which performs successfully at petabyte scale across diverse applications from search to finance to Google Earth.
This presentation will cover App Engine's features and limitations, and how to exploit this new and evolving platform.
Amazon Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity in the cloud and is often the starting point for your first week using AWS. This session will introduce these concepts, along with the fundamentals of EC2, by employing an agile approach that is made possible by the cloud. Attendees will experience the reality of what a first week on EC2 looks like from the perspective of someone deploying an actual application on EC2. You will follow them as they progress from deploying their entire application from an EC2 AMI on day 1 to more advanced features and patterns available in EC2 by day 5. Throughout the process we will identify cloud best practices that can be applied to your first week on EC2 and beyond.
How to Upgrade Your Database Plan on Heroku and Rails Setup? - Katy Slemon
Heroku is a one-stop solution for upgrading your database plan and deploying it using RoR. Let's walk through the steps to upgrade your database plan on Heroku with a Rails setup.
Slides for an introductory workshop on cloud computing for a web app developer audience at FOWA Miami 09 (http://events.carsonified.com/fowa/2009/miami/workshops#workshop_36)
Master Chef class: learn how to quickly cook delightful CQ/AEM infrastructures - François Le Droff
ConnectCon 2014 presentation
François and Nicolas share their latest experiment coding AEM 6 infrastructure with Chef. Learn how to start from bare servers (virtual, physical, or cloud) and turn them, in a matter of minutes, into a production-ready AEM 6 infrastructure: think author and publish farms, optional SSL, dispatcher, and clustering with MongoDB. Along the way you'll be given a comprehensive overview of Chef resources and techniques enabling you to accelerate, scale, simplify, and secure your development and release workflow.
Optimize Site Deployments with Drush (DrupalCamp WNY 2011) - Jon Peck
When a site goes live, are you crossing your fingers or are you confident that everything is configured? Are you looking to manage and optimize site deployments like any other operational process? Do you find it impossible to create development, test and production environments that act the same every time? Do you have a custom set of modules or configurations that you rely on for all your sites?
This session will teach you how to optimize your site deployments with open tools such as drush, drush make, and features, while leveraging version control systems such as Subversion and Git. Beyond these projects, the session will show you how to develop your own custom modules for consistent and precise deployments, including variables, users, content types, nodes, imagecache presets, menus, blocks, theme configuration, and more.
Using these techniques you can automate and optimize your deployment procedures, giving you technical flexibility and saving valuable time.
AWS Cloud Design Patterns (CDPs) are general, repeatable solutions to commonly occurring problems in cloud architecture. In this session, we introduce CDPs and explain how you can apply them in practical scenarios such as photo sharing, e-commerce, and web site campaigns.
Deep Learning for Developers (December 2017) - Julien SIMON
Talk @ Code Europe, Poland, December 5th, 2017
- An introduction to Deep Learning
- An introduction to Apache MXNet
- Demos using Jupyter notebooks on Amazon SageMaker
- Resources
Deep Dive on Amazon Elastic Container Service (ECS) | AWS Summit Tel Aviv 2019 - AWS Summits
This talk will dive deep into Amazon ECS. We will take a look at recently added ECS features, like target tracking autoscaling, service discovery, daemon scheduling, task networking, and GPU pinning, including live demos!
A case analysis exploring eBay's strategic options. Comparisons are made against Amazon.com's 1500%+ growth over the past decade versus eBay's 50%+ growth, covering revenues, margins, ownership of key assets, supply chain, and more.
Managed services such as AWS Lambda and API Gateway allow developers to focus on value-adding development instead of IT heavy lifting. This workshop introduces how to build a simple REST blog backend using AWS technologies and the Serverless Framework.
Taking a look at different cloud providers and how easy it is to deploy a basic Grails application to them. Created for the http://sfgrails.com meetup Feb 2011.
Scaling Drupal horizontally and in the cloud - Vladimir Ilic
Vancouver Drupal group presentation for April 25, 2013.
How to deploy Drupal on:
- multiple web servers,
- multiple web and database servers,
and how to join all that together and deploy the site on the Amazon cloud (Virtual Private Cloud) in:
- one availability zone, or
- multiple availability zones.
The session covers what you need in order to get Drupal deployed on separate servers, the issues and concerns involved, and how to solve them.
(SDD420) Amazon WorkSpaces: Advanced Topics and Deep Dive | AWS re:Invent 2014 - Amazon Web Services
Amazon WorkSpaces is an enterprise desktop computing service in the cloud. In this session, we dive deep into configuration, administration, and advanced networking topics for WorkSpaces. We also discuss integrating WorkSpaces with your corporate Active Directory and best practices for enabling your WorkSpaces to access resources on your corporate intranet.
(BDT208) A Technical Introduction to Amazon Elastic MapReduce - Amazon Web Services
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively.
Business intelligence is often described as a set of methodologies and technologies that transform raw data into meaningful and useful information for business purposes. But this simple description hides many technical challenges IT teams struggle with. This session will show how to build business intelligence applications leveraging AWS, from raw data import, consumption, and storage through to information production. We will also cover best practices for services such as Amazon Redshift and Amazon RDS, and how to use applications such as SAP HANA, Jaspersoft, and others.
Continuous Integration and Deployment Best Practices on AWS (ARC307) | AWS re... - Amazon Web Services
With AWS, companies now have the ability to develop and run their applications with speed and flexibility like never before. Working with an infrastructure that can be 100 percent API driven enables businesses to use lean methodologies and realize these benefits. This in turn leads to greater success for those who make use of these practices. In this session, we talk about some key concepts and design patterns for continuous deployment and continuous integration, two elements of lean development of applications and infrastructures.
The purpose of this paper is to demonstrate that it is possible to have an Odoo deployment that costs less than $100/month for 50 concurrent users. Moreover, the system will be always available, fault-tolerant, and highly scalable, all thanks to its cloud architecture.
Serverless in production, an experience report (linuxing in london) - Yan Cui
AWS Lambda has changed the way we deploy and run software, but this new serverless paradigm has brought new challenges to old problems: how do you test a cloud-hosted function locally? How do you monitor them? What about logging and config management? And how do we start migrating from existing architectures?
In this talk Yan and Scott will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.
Serverless in production, an experience report (Going Serverless) - Yan Cui
In this talk Yan Cui shares his experience of migrating an existing monolithic architecture for a social network to AWS Lambda, how it empowered a small team to deliver features quickly, and how they addressed operational concerns such as CI/CD, logging, monitoring, and config management.
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ... - Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
Speaker:
Paul Armstrong, Solutions Architect, Amazon Web Services
Serverless in production, an experience report (JeffConf) - Yan Cui
In this talk Yan Cui shares his experience of migrating an existing monolithic architecture for a social network to AWS Lambda, how it empowered a small team to deliver features quickly, and how they addressed operational concerns such as CI/CD, logging, monitoring, and config management.
Ingest, Transform & Visualize with Amazon Web Services - BigDataCamp
Startups and enterprises derive actionable insights from a myriad of data sources that are growing rapidly and must be processed quickly. While most organizations understand the pivotal role that data can play for their business, the process of transforming data and analytics from a concept to an actual business driver is often less clear. Arun will talk about how organizations can determine what they need from their IT infrastructure to turn raw data into valuable answers and insights.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides from Nordic Testing Days, 6.6.2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI-powered automation technology capabilities of UiPath. Also, hosted by our local partner Marc Ellis, you will enjoy a half-day packed with industry insights and networking with automation peers.
📕 Curious about our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards more flexible and future-proof PHP development.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- libxml's xmllint, a tool for parsing xml documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are the slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW 2022).
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to part 4 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerabilities and security breaches. This needs to be achieved with existing toolchains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
34. Example Continued

select impressions.adId as adId,
       count(distinct clickId) / count(1) as clickthrough
from impressions
left outer join clicks on impressions.impressionId = clicks.impressionId
group by impressions.adId;

[Slide illustration: sample rows from the impressions table (impression_id, user_id, ad_id, ...) and the clicks table (impression_id, click_id, ...).]
38. Declare the Impressions Table

ADD JAR ${SAMPLE}/libs/jsonserde.jar;

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip'
)
LOCATION '${SAMPLE}/tables/impressions';

ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05');
39. Declare Clicks Table

CREATE EXTERNAL TABLE clicks (
  impressionId string,
  clickId string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='impressionId, number'
)
LOCATION '${SAMPLE}/tables/clicks';

ALTER TABLE clicks ADD PARTITION (dt='2009-04-13-08-05');
40. Execute Hive Query

INSERT OVERWRITE DIRECTORY "s3://emr-demo/output/clickthough"
SELECT impressions.adId as adId,
       count(distinct clickId) / count(1) as clickthrough
FROM impressions
LEFT OUTER JOIN clicks ON impressions.impressionId = clicks.impressionId
GROUP BY impressions.adId
ORDER BY clickthrough desc;

Ended Job = job_201006270056_0011
2868 Rows loaded to s3://emr-demo/output/clickthough
46. Accessing the Hadoop UI

ssh -i c:/Users/richcole/emr-demo.pem -ND 8157 [email_address]

Install FoxyProxy: https://addons.mozilla.org/en-US/firefox/addon/2464/
Leave the default proxy setting as is, then add a new proxy:
- select SOCKS Proxy, and SOCKS 5
- select localhost and port 8157
- add a whitelist rule for http://*ec2*.amazonaws.com*
- add a whitelist rule for http://*ec2.internal*
Hi, I'm Richard Cole, a software engineer on the Amazon Elastic MapReduce team. I'm going to run through some of the features of Elastic MapReduce. At the end of the talk I'll give you the URL to these slides so you can download them. That way you don't need to note down URLs.
Here's an overview. First I'll talk a little about what Amazon Elastic MapReduce is. Then I'll explain how to get set up to use EMR. Next I'll run through an example of developing a bootstrap action. I'll then go through a quick example using Hive. My intention here is to take you through many of the useful features of our service.
We also support Hadoop 0.18.
Now I want to show you briefly how to get started with Elastic MapReduce. I'm going to show you how to sign up for EMR and SimpleDB. You should be able to use your existing AWS account.
Go to aws.amazon.com. This is the main page for Amazon Web Services. Click the orange sign up button on the right.
This is the page for Amazon Elastic MapReduce. Click the orange sign up button on the right.
This is the main page for Amazon SimpleDB. Click the sign up button on the right. SimpleDB is required for Hadoop debugging.
Next download the Elastic MapReduce command line client. Click the download button.
To install the command line client you need to have Ruby installed. You basically unzip the client into a directory and create a credentials file either there or in your home directory. The credentials file needs to be filled in with some details that we'll fetch in the next few slides: you need your AWS credentials, you need an EC2 keypair, and you also need to specify a log URI, which is where log files from your job flow will be uploaded.
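As a rough sketch, a filled-in credentials.json might look like the following. The key names here follow the old Ruby command line client as best I recall, so treat them as assumptions and check the client's README; the keypair name, key file path, and bucket are placeholders tied to the examples in this deck.

{
  "access_id": "<your AWS access key ID>",
  "private_key": "<your AWS secret access key>",
  "keypair": "emr-demo",
  "key-pair-file": "c:/Users/richcole/emr-demo.pem",
  "log_uri": "s3n://emr-demo/logs/",
  "region": "us-east-1"
}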
Next we need a copy of the access credentials. Copy your access key ID and private key into the credentials file.
To create an EC2 keypair we’re going to the AWS Management Console. Click the orange button on the right.
Click on the EC2 tab. The EC2 key pair is required to SSH to the cluster. Click create a new Key Pair. Save the secret key somewhere safe. Copy the name of the key pair and the location of the key pair file into the credentials.json file.
You don’t need to use the command line client. You can also call the web service from Java. Here’s the AWS SDK for Java. To download it you click the yellow button on the right.
Here’s a recap of what we just did.
A job flow is what we call a Hadoop cluster that is running or ran at some point in time. Log files from the cluster are stored in S3 so that they're accessible later, after the job flow has shut down. Typically a job flow runs in batch mode: it executes a series of MapReduce jobs and then terminates. The batch job might, for example, analyse log files over some period of time and produce data in a structured format that is stored in S3. You might also run a job flow in interactive mode. The typical use case for an interactive-mode job flow is when you're developing a batch process. Here you might start with a smaller job flow and a small portion of your data, run the Hadoop jobs that are under development, and check the results that you get. Another reason to run an interactive job flow is ad-hoc analysis: you might be investigating some aspect of your data, and each query that you run suggests the next query to be run. You could also choose to run a job flow as an always-on, long-running job flow. In this case you persist data to Amazon S3 so that you can recover in the event of a master failure, but in the normal case you pull data continuously into your data warehouse and run a variety of batch-mode and ad-hoc processing on the job flow.
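As a rough sketch of the two modes with the command line client (the flags below match the Ruby client of that era as best I recall, and the bucket and jar names are hypothetical):

# Batch mode: run the steps, then terminate when they finish
elastic-mapreduce --create --name "Nightly log analysis" \
  --num-instances 4 --instance-type m1.small \
  --jar s3://emr-demo/jobs/log-analysis.jar \
  --arg s3://emr-demo/input --arg s3://emr-demo/output

# Interactive mode: --alive keeps the job flow running after all steps finish
elastic-mapreduce --create --alive --name "Dev cluster" \
  --num-instances 4 --instance-type m1.small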
Job flows have steps. A step specifies a jar located in Amazon S3 to be run on the master node. The jar is like a Hadoop job jar: it has a main function that is either specified in the manifest of the jar or on the command line, and it can contain lib jars in the same way that a Hadoop job jar does. Typically a step will use the Hadoop APIs to create one or more Hadoop jobs and wait for them to terminate. Steps are executed sequentially. A step jar indicates failure by returning a non-zero value. There is a step property called ActionOnFailure that says what to do after a step fails. The options are: CONTINUE, which just continues on to the next step, effectively ignoring the error; CANCEL_AND_WAIT, which cancels all following steps; and TERMINATE_JOBFLOW, which terminates the job flow regardless of the setting KeepJobFlowAliveWhenNoSteps. This last property is a property of a job flow; it is used to decide what to do once all the steps have been executed or cancelled. If you want an interactive or long-lived cluster then you need to set this property to true.
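For example, adding a step to a running job flow from the command line might look like this (a sketch; the job flow ID, jar location, and main class are hypothetical placeholders):

# Add a step to an existing job flow; the step jar runs on the master node
elastic-mapreduce --jobflow j-XXXXXXXX \
  --jar s3://emr-demo/jobs/log-analysis.jar \
  --main-class com.example.LogAnalysis \
  --arg s3://emr-demo/input --arg s3://emr-demo/output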
Steps only run on the master node. Bootstrap actions run on all nodes. They are run after Hadoop is configured but before Hadoop is started, so you can use them to modify the site config to set settings that are not settable on a per-job basis. You can also use bootstrap actions to install additional software on the nodes or to modify the machine configuration; for example, you might want to add more swap space to the nodes. Bootstrap actions run as the hadoop user; however, the hadoop user can escalate to root without a password using sudo. So really, within bootstrap actions you have complete control over the nodes.
Bootstrap actions are typically scripts located in Amazon S3. They can use Hadoop to download additional software to execute from S3. They indicate failure by returning a non-zero value. If a bootstrap action fails then the node will be discarded. Be careful though: if more than 10% of your nodes fail their bootstrap action then the job flow will fail.
Next I want to show you an example of developing a bootstrap action. Let's say that your application requires the mysql client library for Ruby: you have a streaming job and it needs to fetch some parameters from an Amazon RDS instance that is running. So you want to make a bootstrap action that will install the mysql client library. First you create an install script; we're going to use bash, but you could use ruby, or python, or perl, or whatever is your favorite. This script first does set -e -x to turn on tracing and to make the script fail with a non-zero value if any command in the script fails. Next it escalates to root using sudo and then installs the library using apt-get. The nodes run Debian/stable, and the tool for installing software under Debian is called apt-get. We'll put this script in a file and upload it to S3.
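Put together, the install script would look roughly like this (a sketch; the Debian package name for the Ruby mysql client is an assumption):

#!/bin/bash
# Trace each command (-x) and exit with a non-zero value if any command fails (-e)
set -e -x
# Escalate to root and install the Ruby mysql client library; package name assumed
sudo apt-get update
sudo apt-get install -y libmysql-ruby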
So next let's run an interactive job flow using the command line client. The --alive option makes the job flow keep running even when all steps are finished; it is important for an interactive job flow. Next we ssh to the master node and copy our script from Amazon S3, where we uploaded it. Then we make the script executable and execute it.
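That sequence might look something like this (the job flow ID, bucket, and script name are placeholders; --ssh as a convenience flag of the Ruby client is an assumption):

# Start an interactive job flow that stays alive after its steps finish
elastic-mapreduce --create --alive --name "Bootstrap action dev"

# SSH to the master node, then fetch the script from S3 and test it there
elastic-mapreduce --jobflow j-XXXXXXXX --ssh
hadoop fs -copyToLocal s3://emr-demo/scripts/install-mysql-ruby.sh .
chmod +x install-mysql-ruby.sh
./install-mysql-ruby.sh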
Next we'll run a job flow specifying the bootstrap action script on the command line. The script will then be run on all nodes in the job flow and install the Ruby mysql client for us.
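Something like the following (again a sketch, with a placeholder bucket and script name):

elastic-mapreduce --create --alive --name "With mysql client" \
  --bootstrap-action s3://emr-demo/scripts/install-mysql-ruby.sh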
Test on a small subset of your data so you don't waste lots of money.