AWS Office Hours: Amazon Elastic MapReduce


This deck covers what is new with Amazon Elastic MapReduce and includes a few brief tutorials.



  1. Elastic MapReduce Office Hours
     April 13th, 2011
  2. Introduction
     Richard Cole, Lead Engineer
     Adam Gray, Product Manager
  3. Office Hours IS
     Simply put, Office Hours is a program that enables a technical audience to interact with AWS technical experts.
     We look to improve this program by soliciting feedback from Office Hours attendees. Please let us know what you would like to see.
  4. Office Hours is NOT
     • Support: if you have a problem with your current deployment, please visit our forums or our support website.
     • A place to find out about upcoming services: we do not typically disclose information about our services until they are available.
  5. Agenda
     • What's New
     • How-to Demonstrations
       – Resize a running job flow
       – Launch a Hive-based Data Warehouse (Contextual Advertising Example)
     • Question and Answer
       – Please begin submitting questions now
  6. What's New?
  7. S3 Multipart Upload
     • Breaks objects into chunks and uploads two or more of those chunks to S3 concurrently
     • Can begin the upload process before a Hadoop task has finished
     • Parallel uploads and an earlier start mean data-intensive applications finish significantly faster
     Usage:
     ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
       --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"
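The bootstrap action above sets a 500 MB (524,288,000-byte) split size. As a rough illustration of what that setting means (plain Python arithmetic, not part of the EMR tooling), the number of parts an output object is uploaded in is the ceiling of its size divided by the split size:

```python
import math

SPLIT_SIZE = 524_288_000  # 500 MB, matching the bootstrap-action setting above

def part_count(object_size: int, split_size: int = SPLIT_SIZE) -> int:
    """Number of multipart chunks an object of `object_size` bytes splits into."""
    return max(1, math.ceil(object_size / split_size))

# A 2 GB task output uploads as 5 parts, which S3 can receive concurrently.
print(part_count(2 * 1024**3))  # 5
```

A smaller split size yields more parts and therefore more upload parallelism, at the cost of extra per-part overhead.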
  10. Elastic IP Address Integration
     • Allocate a static IP address and dynamically assign it to the master node of a running job flow
     • Reference the same IP address for a long-running job flow even if it has to be restarted
     • Reference transient job flows in a consistent way each time they are launched
     Usage:
     ./elastic-mapreduce --create --eip [existing_ip]
  13. EC2 Instance Tagging
     • EC2 instances are tagged with, and filterable by, job flow ID and instance group role
     • Useful if you are running multiple job flows concurrently or managing a large number of Amazon EC2 instances

  How-To
  15. Example 1 – Resize a running job flow
     • Speed up job flow execution in response to changing requirements
     • Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
     [Diagram: a data warehouse job flow is allocated 4 instances in steady state, expands to 25 instances for batch processing, and later shrinks to 9 instances, with the time remaining falling from 14 hours to 7 hours to 3 hours.]
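The intuition behind resizing can be sketched with a back-of-the-envelope model. This assumes the remaining work parallelizes linearly across instances, which real Hadoop jobs only approximate (shuffle, skew, and HDFS rebalancing all interfere):

```python
def remaining_hours(hours_left: float, old_instances: int, new_instances: int) -> float:
    """Estimate time remaining after a resize, assuming perfectly linear scaling."""
    return hours_left * old_instances / new_instances

# Under this idealized model, expanding from 9 to 25 instances
# shrinks 14 hours of remaining work to roughly 5 hours.
print(round(remaining_hours(14, 9, 25), 1))  # 5.0
```

The same formula explains the reverse move: shrinking the cluster overnight trades completion time for cost.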
  17. Job Flow Architecture
  18. Launch a Job Flow
     // Launch a job flow with 1 master node and 8 core nodes
     ./elastic-mapreduce --create --alive \
       --instance-group master --instance-type m2.2xlarge --instance-count 1 \
       --instance-group core --instance-type m1.small --instance-count 8
     Created job flow j-ABABABABAB
     // Describe the job flow
     ./elastic-mapreduce --describe j-ABABABABAB
  19. Resize Job Flow
     // Add 8 m2.4xlarge TASK nodes
     ./elastic-mapreduce --jobflow j-ABABABABAB --add-instance-group task --instance-type m2.4xlarge --instance-count 8
     // Expand CORE nodes from 8 to 12
     ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group core --instance-count 12
     // Shrink TASK nodes from 8 to 6
     ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group task --instance-count 6
  20. Terminate Job Flow
     // Terminate the job flow
     ./elastic-mapreduce --terminate j-ABABABABAB
  21. Example 2 – Launch a Hive-based Data Warehouse
     Apache Hive
     • Data warehouse for Hadoop
     • Open-source project started at Facebook
     • Turns data on Hadoop into a virtually limitless data warehouse
     • Provides data summarization, ad hoc querying, and analysis
     • Enables SQL-like queries on structured and unstructured data
       – e.g., arbitrary field separators are possible, such as "," in CSV file formats
     • Inherits the linear scalability of Hadoop
  22. Launch a Hive Cluster (Contextual Advertising Example)
     // Launch a Hive cluster with cluster compute nodes
     ./elastic-mapreduce --create --alive --hive-interactive --name "Hive Job Flow" --instance-type cc1.4xlarge
     Created job flow j-ABABABABAB
     // SSH to the master node
     ./elastic-mapreduce --ssh --jobflow j-ABABABABAB
     // Run a Hive session on the master node
     hadoop@domU-12-31-39-07-D2-14:~$ hive -d SAMPLE=s3://elasticmapreduce/samples/hive-ads -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://mybucket/samples/output
     hive>
  23. Define Impressions Table
     Show tables:
     hive> show tables;
     OK
     Time taken: 3.51 seconds
     hive>
     Define the impressions table:
     ADD JAR ${SAMPLE}/libs/jsonserde.jar ;

     CREATE EXTERNAL TABLE impressions (
       requestBeginTime string, adId string, impressionId string, referrer string,
       userAgent string, userCookie string, ip string
     )
     PARTITIONED BY (dt string)
     ROW FORMAT
       SERDE ''
       WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
     LOCATION '${SAMPLE}/tables/impressions' ;
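The serde's 'paths' property maps top-level JSON fields in each log line to the table's columns, with missing fields becoming NULL. The mapping can be sketched in plain Python (illustrative only; the sample record below is hypothetical, and the serde class name is elided on the slide):

```python
import json

# Column names matching the 'paths' serde property above
PATHS = ["requestBeginTime", "adId", "impressionId",
         "referrer", "userAgent", "userCookie", "ip"]

def to_row(line: str) -> list:
    """Map one JSON log line to a row; fields absent from the JSON become None (NULL)."""
    record = json.loads(line)
    return [record.get(path) for path in PATHS]

sample = '{"requestBeginTime": "1239610346000", "adId": "m9nwdhk", "ip": "67.189.90.77"}'
print(to_row(sample))
# ['1239610346000', 'm9nwdhk', None, None, None, None, '67.189.90.77']
```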
  24. Recover Partitions
     The table is partitioned by time, so we can tell Hive about the existence of a single partition using the following statement:
     ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ;
     If we were to query the table at this point, the results would contain data from just that single partition. We can instruct Hive to recover all partitions by inspecting the data stored in Amazon S3 using the RECOVER PARTITIONS statement:
     ALTER TABLE impressions RECOVER PARTITIONS ;
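Conceptually, recovering partitions amounts to scanning the table's storage prefix for `dt=...` path segments and registering each distinct value. A minimal sketch of that discovery step over a list of object keys (the key names are hypothetical):

```python
def recover_partitions(keys):
    """Collect the distinct dt= partition values present under a table prefix."""
    parts = set()
    for key in keys:
        for segment in key.split("/"):
            if segment.startswith("dt="):
                parts.add(segment[len("dt="):])
    return sorted(parts)

keys = [
    "tables/impressions/dt=2009-04-13-08-05/ec2-0.log",
    "tables/impressions/dt=2009-04-13-08-05/ec2-1.log",
    "tables/impressions/dt=2009-04-13-08-10/ec2-0.log",
]
print(recover_partitions(keys))  # ['2009-04-13-08-05', '2009-04-13-08-10']
```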
  25. Define Clicks Table
     We follow the same process to define the clicks table and recover its partitions.
     CREATE EXTERNAL TABLE clicks (
       impressionId string
     )
     PARTITIONED BY (dt string)
     ROW FORMAT
       SERDE ''
       WITH SERDEPROPERTIES ( 'paths'='impressionId' )
     LOCATION '${SAMPLE}/tables/clicks' ;

     ALTER TABLE clicks RECOVER PARTITIONS ;
  26. Define Output Table
     We are going to combine the clicks and impressions tables so that we have a record of whether or not each impression resulted in a click. We'd like this data stored in Amazon S3 so that it can be used as input to other job flows.
     CREATE EXTERNAL TABLE joined_impressions (
       requestBeginTime string, adId string, impressionId string, referrer string,
       userAgent string, userCookie string, ip string, clicked boolean
     )
     PARTITIONED BY (day string, hour string)
     STORED AS SEQUENCEFILE
     LOCATION '${OUTPUT}/joined_impressions' ;
  27. Define Local Impressions Table
     Next, we create temporary tables in the job flow's local HDFS partition to store intermediate impression and click data.
     CREATE TABLE tmp_impressions (
       requestBeginTime string, adId string, impressionId string, referrer string,
       userAgent string, userCookie string, ip string
     )
     STORED AS SEQUENCEFILE ;
     We insert data from the impressions table for the time range we're interested in. Note that because the impressions table is partitioned, only the relevant partitions will be read.
     INSERT OVERWRITE TABLE tmp_impressions
     SELECT
       from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime,
       i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
     FROM impressions i
     WHERE i.dt >= '${DAY}-${HOUR}-00'
       AND i.dt < '${NEXT_DAY}-${NEXT_HOUR}-00' ;
  28. Define Local Clicks Table
     For clicks, we extend the join window by 20 minutes, meaning we accept a click that occurred up to 20 minutes after the impression.
     CREATE TABLE tmp_clicks (
       impressionId string
     ) STORED AS SEQUENCEFILE ;

     INSERT OVERWRITE TABLE tmp_clicks
     SELECT impressionId
     FROM clicks c
     WHERE c.dt >= '${DAY}-${HOUR}-00'
       AND c.dt < '${NEXT_DAY}-${NEXT_HOUR}-20' ;
  29. Join Tables
     Now we combine the impressions and clicks tables using a left outer join, so that any impressions that did not result in a click are preserved.
     INSERT OVERWRITE TABLE joined_impressions
     PARTITION (day='${DAY}', hour='${HOUR}')
     SELECT
       i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent,
       i.userCookie, i.ip, (c.impressionId is not null) clicked
     FROM tmp_impressions i
     LEFT OUTER JOIN tmp_clicks c ON i.impressionId = c.impressionId ;
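The effect of the query can be sketched in miniature: every impression appears in the output exactly once, and `clicked` is true only when a click with the same impressionId exists. A small in-memory analogue of that left outer join (toy ids, not the sample data):

```python
def join_impressions(impressions, clicks):
    """LEFT OUTER JOIN: keep every impression; clicked is True iff its id was clicked."""
    clicked_ids = set(clicks)  # clicks table holds just impressionIds
    return [(imp_id, imp_id in clicked_ids) for imp_id in impressions]

impressions = ["i1", "i2", "i3"]
clicks = ["i2"]  # only i2 resulted in a click
print(join_impressions(impressions, clicks))
# [('i1', False), ('i2', True), ('i3', False)]
```

An inner join would have dropped i1 and i3; the left outer join keeps them with clicked = False, which is exactly what a click-through-rate computation needs.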
  30. Terminate Interactive Session
     // Terminate the job flow
     ./elastic-mapreduce --terminate j-ABABABABAB
  31. Question &amp; Answer
     Visit to watch recorded sessions and to sign up for upcoming sessions.