AWS Office Hours: Amazon Elastic MapReduce

Elastic MapReduce Office Hours April 13th, 2011

Introduction
    Richard Cole, Lead Engineer
    Adam Gray, Product Manager

Office Hours IS
    Simply, Office Hours is a program that enables a technical audience to interact with AWS technical experts. We look to improve this program by soliciting feedback from Office Hours attendees. Please let us know what you would like to see.

Office Hours is NOT
    Support: If you have a problem with your current deployment, please visit our forums or our support website: http://aws.amazon.com/premiumsupport/
    A place to find out about upcoming services: We do not typically disclose information about our services until they are available.

Agenda
    What's New
    How-to Demonstrations
        Resize a running job flow
        Launch a Hive-based Data Warehouse (Contextual Advertising Example)
    Question and Answer
    Please begin submitting questions now

S3 Multipart Upload

Can begin upload process before a Hadoop task has finished

Parallel uploads and earlier start mean data-intensive applications finish significantly faster

Usage:
    ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"
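The split size in the usage line, 524288000 bytes, is 500 MiB. As a rough illustration of what that setting implies, the sketch below counts how many parts a file would be cut into and lists their byte ranges. It assumes the split size acts as the maximum size of each part; the slides do not spell out the exact behavior, and the function names are hypothetical.

```python
# Split size from the slide's --args value, in bytes (500 * 1024 * 1024).
SPLIT_SIZE = 524288000


def part_count(file_size, split_size=SPLIT_SIZE):
    """Number of parts, assuming each part holds at most split_size bytes."""
    if file_size <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size + split_size - 1) // split_size


def part_ranges(file_size, split_size=SPLIT_SIZE):
    """Byte ranges (start, end-exclusive) of each part, under the same assumption."""
    return [
        (start, min(start + split_size, file_size))
        for start in range(0, file_size, split_size)
    ]


# A 2 GiB output file would be cut into five parts under this model.
print(part_count(2 * 1024**3))
```

Because the parts are independent byte ranges, they can be sent in parallel, which is the property the slide relies on.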

Elastic IP Address Integration

Reference the same IP address for a long-running job flow even if it has to be restarted

Reference transient job flows in a consistent way each time they are launched

Usage:
    ./elastic-mapreduce --create --eip [existing_ip]

EC2 Instance Tagging

Useful if you are running multiple job flows concurrently or managing large numbers of Amazon EC2 instances

Example 1 – Resize a running job flow

Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)

[Figure: job flow timelines. Labels: Allocate 4 instances; Expand to 9 instances; Expand to 25 instances; Shrink to 9 instances; Data Warehouse (Steady State); Data Warehouse (Batch Processing); Time remaining: 14 Hours, 7 Hours, 3 Hours]

Launch a Job Flow

    # Launch a job flow with 8 core nodes
    ./elastic-mapreduce --create --alive \
        --instance-group master --instance-type m2.2xlarge --instance-count 1 \
        --instance-group core --instance-type m1.small --instance-count 8
    Created job flow j-ABABABABAB

    # Describe job flow
    ./elastic-mapreduce --describe j-ABABABABAB

Resize Job Flow

    # Add 8 m2.4xlarge TASK Nodes
    ./elastic-mapreduce --jobflow j-ABABABABAB --add-instance-group task --instance-type m2.4xlarge --instance-count 8

    # Expand CORE Nodes from 8 to 12
    ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group core --instance-count 12

    # Shrink TASK Nodes from 8 to 6
    ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group task --instance-count 6
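To keep track of what the three resize commands do to the cluster, the following sketch replays them against a simple table of instance groups and prints the total instance count after each step. It is only bookkeeping under the assumption that each command sets the named group to the given count; `apply_resize` is a hypothetical helper, not part of the elastic-mapreduce tool.

```python
def apply_resize(groups, action, role, count):
    """Return a new role -> instance-count mapping after one resize step."""
    updated = dict(groups)
    if action == "add":
        if role in updated:
            raise ValueError("instance group already exists: " + role)
        updated[role] = count
    elif action == "modify":
        if role not in updated:
            raise KeyError("no such instance group: " + role)
        updated[role] = count
    else:
        raise ValueError("unknown action: " + action)
    return updated


# Starting point: the job flow launched with 1 master and 8 core nodes.
groups = {"master": 1, "core": 8}

# The three commands from the slide, in order.
steps = [
    ("add", "task", 8),      # --add-instance-group task --instance-count 8
    ("modify", "core", 12),  # --modify-instance-group core --instance-count 12
    ("modify", "task", 6),   # --modify-instance-group task --instance-count 6
]

totals = [sum(groups.values())]
for action, role, count in steps:
    groups = apply_resize(groups, action, role, count)
    totals.append(sum(groups.values()))

print(totals)  # [9, 17, 21, 19]
```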

Terminate Job Flow

    # Terminate Job Flow
    ./elastic-mapreduce --terminate j-ABABABABAB

Example 2 – Launch a Hive-based Data Warehouse

Apache Hive
    Data warehouse for Hadoop
    Open source project started at Facebook
    Turns data on Hadoop into a virtually limitless data warehouse
    Provides data summarization, ad hoc querying and analysis
    Enables SQL-like queries on structured and unstructured data
    Inherits linear scalability of Hadoop

Launch a Hive Cluster (Contextual Advertising Example)

    # Launch a Hive cluster with cluster compute nodes
    ./elastic-mapreduce --create --alive --hive-interactive --name "Hive Job Flow" --instance-type cc1.4xlarge
    Created job flow j-ABABABABAB

    # SSH to Master Node
    ./elastic-mapreduce --ssh --jobflow j-ABABABABAB

    # Run a Hive Session on the Master Node
    hadoop@domU-12-31-39-07-D2-14:~$ hive \
        -d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
        -d DAY=2009-04-13 -d HOUR=08 \
        -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 \
        -d OUTPUT=s3://mybucket/samples/output
    hive>

Define Impressions Table

Show Tables

    hive> show tables;
    OK
    Time taken: 3.51 seconds
    hive>

Define Impressions Table

    ADD JAR ${SAMPLE}/libs/jsonserde.jar ;

    CREATE EXTERNAL TABLE impressions (
      requestBeginTime string, adId string, impressionId string, referrer string,
      userAgent string, userCookie string, ip string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
    LOCATION '${SAMPLE}/tables/impressions' ;
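The statements above refer to `${SAMPLE}`, one of the variables defined with `hive -d` when the session was started. The sketch below mimics that substitution with Python's `string.Template`, which happens to use the same `${NAME}` syntax, to show the paths the statements would resolve to. It illustrates the idea only and is not how Hive performs the substitution internally; `expand` is a hypothetical helper.

```python
from string import Template

# Values copied from the `hive -d` options used to start the session.
variables = {
    "SAMPLE": "s3://elasticmapreduce/samples/hive-ads",
    "DAY": "2009-04-13",
    "HOUR": "08",
    "NEXT_DAY": "2009-04-13",
    "NEXT_HOUR": "09",
    "OUTPUT": "s3://mybucket/samples/output",
}


def expand(text, values=variables):
    """Replace every ${NAME} in text; raises KeyError for an undefined name."""
    return Template(text).substitute(values)


jar_path = expand("${SAMPLE}/libs/jsonserde.jar")
table_location = expand("${SAMPLE}/tables/impressions")

print(jar_path)
print(table_location)
```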

Recover Partitions

The table is partitioned based on time. We can tell Hive about the existence of a single partition using the following statement.

    ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ;

If we were to query the table at this point, the results would contain data from just the single partition. We can instruct Hive to recover all partitions by inspecting the data stored in Amazon S3 using the RECOVER PARTITIONS statement.

    ALTER TABLE impressions RECOVER PARTITIONS ;
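Conceptually, recovering partitions means looking at what is stored under the table location and registering one partition per `dt=...` directory. The sketch below shows that idea over a plain list of object keys. It is an illustration only, not Hive's implementation: the key names are made up, and it assumes the usual `column=value` directory layout.

```python
def discover_partitions(keys, table_prefix, column="dt"):
    """Return the sorted, distinct partition values found under table_prefix."""
    marker = column + "="
    found = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        # The first path component below the table location names the partition.
        relative = key[len(table_prefix):].lstrip("/")
        first = relative.split("/", 1)[0]
        if first.startswith(marker) and len(first) > len(marker):
            found.add(first[len(marker):])
    return sorted(found)


# Hypothetical object keys; the real sample bucket's file names may differ.
keys = [
    "tables/impressions/dt=2009-04-13-08-05/part-0000.log",
    "tables/impressions/dt=2009-04-13-08-05/part-0001.log",
    "tables/impressions/dt=2009-04-13-08-10/part-0000.log",
    "tables/clicks/dt=2009-04-13-08-05/part-0000.log",
]

for value in discover_partitions(keys, "tables/impressions"):
    print("ALTER TABLE impressions ADD PARTITION (dt='%s') ;" % value)
```

Each printed line corresponds to the single-partition statement shown earlier, which is why one RECOVER PARTITIONS statement can stand in for many ADD PARTITION statements.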

Define Clicks Table

We follow the same process to define the clicks table and recover its partitions.

    CREATE EXTERNAL TABLE clicks (
      impressionId string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ( 'paths'='impressionId' )
    LOCATION '${SAMPLE}/tables/clicks' ;

    ALTER TABLE clicks RECOVER PARTITIONS ;

Define Output Table

We are going to combine the clicks and impressions tables so that we have a record of whether or not each impression resulted in a click. We'd like this data stored in Amazon S3 so that it can be used as input to other job flows.

    CREATE EXTERNAL TABLE joined_impressions (
      requestBeginTime string, adId string, impressionId string, referrer string,
      userAgent string, userCookie string, ip string, clicked boolean
    )
    PARTITIONED BY (day string, hour string)
    STORED AS SEQUENCEFILE
    LOCATION '${OUTPUT}/joined_impressions' ;
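The intent of the output table, one row per impression with a `clicked` flag, can be sketched in a few lines of Python. This shows only the combining logic on in-memory records with made-up values. It is not the Hive query from the presentation, and `join_impressions` is a hypothetical name.

```python
def join_impressions(impressions, clicks):
    """Copy each impression and add clicked=True if a click refers to it."""
    clicked_ids = {click["impressionId"] for click in clicks}
    return [
        dict(impression, clicked=impression["impressionId"] in clicked_ids)
        for impression in impressions
    ]


# Made-up records using column names from the table definitions.
impressions = [
    {"impressionId": "imp-1", "adId": "ad-7"},
    {"impressionId": "imp-2", "adId": "ad-7"},
    {"impressionId": "imp-3", "adId": "ad-9"},
]
clicks = [{"impressionId": "imp-2"}]

joined = join_impressions(impressions, clicks)
for row in joined:
    print(row["impressionId"], row["clicked"])
```

Every impression is kept whether or not it was clicked, which matches the stated goal of recording the outcome of each impression.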
