AWS Office Hours: Amazon Elastic MapReduce
 

This presentation covers what is new with Amazon Elastic MapReduce and includes a few brief tutorials.



Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Elastic MapReduce Office Hours
      April 13th, 2011
    • Introduction
      Richard Cole, Lead Engineer
      Adam Gray, Product Manager
    • Office Hours IS
      Simply put, Office Hours is a program that enables a technical audience to interact with AWS technical experts.
      We look to improve this program by soliciting feedback from Office Hours attendees. Please let us know what you would like to see.
    • Office Hours is NOT
      Support
      If you have a problem with your current deployment please visit our forums or our support website http://aws.amazon.com/premiumsupport/
      A place to find out about upcoming services
      We do not typically disclose information about our services until they are available
    • Agenda
      What’s New
      How-to Demonstrations
      Resize a running job flow
      Launch a Hive-based Data Warehouse (Contextual Advertising Example)
      Question and Answer
      Please begin submitting questions now
    • What’s New?
    • S3 Multipart Upload
      • Breaks objects into chunks and uploads two or more of those chunks to S3 concurrently
      • Can begin upload process before a Hadoop task has finished
      • Parallel uploads and earlier start mean data-intensive applications finish significantly faster
      Usage:
      ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
      --args "-c,fs.s3n.multipart.uploads.enabled=true,
      -c,fs.s3n.multipart.uploads.split.size=524288000"
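The split size above (524288000 bytes, i.e. 500 MB) determines how many parts each object is uploaded in. A minimal Python sketch of that relationship, assuming simple ceiling division (not part of the original deck):

```python
import math

# Multipart split size from the bootstrap action above: 500 MB.
SPLIT_SIZE = 524288000

def part_count(object_size: int) -> int:
    """Number of multipart chunks needed for an object of the given size."""
    return max(1, math.ceil(object_size / SPLIT_SIZE))

# A 2 GB Hadoop output file would be uploaded in 4 concurrent parts.
print(part_count(2_000_000_000))  # -> 4
```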
    • Elastic IP Address Integration
      • Allocate a static IP address and dynamically assign it to the master node of a running job flow
      • Reference the same IP address for a long-running job flow even if it has to be restarted
      • Reference transient job flows in a consistent way each time they are launched
      Usage:
      ./elastic-mapreduce --create --eip [existing_ip]
    • EC2 Instance Tagging
      • EC2 instances are tagged with, and filterable by, job flow id and instance group role
      • Useful if you are running multiple job flows concurrently or managing a large number of Amazon EC2 instances
    • How-To
    • Example 1 – Resize a running job flow
      • Speed up job flow execution in response to changing requirements
      • Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
      [Diagram: three job-flow examples. A batch-processing data warehouse is allocated 4 instances, then expanded to 25 instances, cutting remaining processing time from 14 hours to 7 hours to 3 hours. Steady-state data warehouses are shrunk to 9 instances or expanded to 25 instances as load changes.]
    • Job Flow Architecture
    • Launch a Job Flow
      // Launch a job flow with 8 core nodes
      ./elastic-mapreduce --create --alive --instance-group master --instance-type m2.2xlarge --instance-count 1 --instance-group core --instance-type m1.small --instance-count 8
      Created job flow j-ABABABABAB
      // Describe job flow
      ./elastic-mapreduce --describe j-ABABABABAB
    • Resize Job Flow
      // Add 8 m2.4xlarge TASK Nodes
      ./elastic-mapreduce --jobflow j-ABABABABAB --add-instance-group task --instance-type m2.4xlarge --instance-count 8
      // Expand CORE Nodes from 8 to 12
      ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group core --instance-count 12
      // Shrink TASK Nodes from 8 to 6
      ./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group task --instance-count 6
    • Terminate Job Flow
      // Terminate Job Flow
      ./elastic-mapreduce --terminate j-ABABABABAB
    • Example 2 – Launch a Hive-based Data Warehouse
      Apache Hive
      Data warehouse for Hadoop
      Open source project started at Facebook
      Turns data on Hadoop into a virtually limitless data warehouse
      Provides data summarization, ad hoc querying and analysis
      Enables SQL-like queries on structured and unstructured data
      • e.g., arbitrary field separators are possible, such as "," in CSV file formats
      Inherits linear scalability of Hadoop
    • Launch a Hive Cluster(Contextual Advertising Example)
      // Launch a Hive cluster with cluster compute nodes
      ./elastic-mapreduce --create --alive --hive-interactive --name "Hive Job Flow" --instance-type cc1.4xlarge
      Created job flow j-ABABABABAB
      // SSH to Master Node
      ./elastic-mapreduce --ssh --jobflow j-ABABABABAB
      // Run a Hive Session on the Master Node
      hadoop@domU-12-31-39-07-D2-14:~$ hive -d SAMPLE=s3://elasticmapreduce/samples/hive-ads -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://mybucket/samples/output
      hive>
    • Define Impressions Table
      Show Tables
      hive> show tables;
      OK
      Time taken: 3.51 seconds
      hive>
      Define Impressions Table
      ADD JAR ${SAMPLE}/libs/jsonserde.jar ;
      CREATE EXTERNAL TABLE impressions (
      requestBeginTime string, adId string, impressionId string, referrer string,
      userAgent string, userCookie string, ip string
      )
      PARTITIONED BY (dt string)
      ROW FORMAT
      SERDE 'com.amazon.elasticmapreduce.JsonSerde'
      WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId,
      referrer, userAgent, userCookie, ip' )
      LOCATION '${SAMPLE}/tables/impressions' ;
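The SerDe maps the top-level JSON fields named in 'paths' onto the table's columns, in order. A rough Python illustration of that extraction (the sample record below is invented for illustration, not taken from the dataset):

```python
import json

# Fields listed in the SerDe's 'paths' property, in column order.
PATHS = ["requestBeginTime", "adId", "impressionId", "referrer",
         "userAgent", "userCookie", "ip"]

# Invented sample impression record in a JSON log format.
record = ('{"requestBeginTime": "1239609600000", "adId": "ad-42", '
          '"impressionId": "imp-1", "referrer": "example.com", '
          '"userAgent": "Mozilla/5.0", "userCookie": "c-7", "ip": "10.0.0.1"}')

def to_row(line: str) -> list:
    """Extract the configured paths from one JSON record; missing fields become null."""
    obj = json.loads(line)
    return [obj.get(p) for p in PATHS]

print(to_row(record))
```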
    • Recover Partitions
      The table is partitioned by time. We can tell Hive about the existence of a single partition using the following statement.
      ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ;
      If we were to query the table at this point, the results would contain data from just that single partition. We can instruct Hive to recover all partitions by inspecting the data stored in Amazon S3, using the RECOVER PARTITIONS statement.
      ALTER TABLE impressions RECOVER PARTITIONS ;
    • Define Clicks Table
      We follow the same process to define the clicks table and recover its partitions.
      CREATE EXTERNAL TABLE clicks (
      impressionId string
      )
      PARTITIONED BY (dt string)
      ROW FORMAT
      SERDE 'com.amazon.elasticmapreduce.JsonSerde'
      WITH SERDEPROPERTIES ( 'paths'='impressionId' )
      LOCATION '${SAMPLE}/tables/clicks' ;
      ALTER TABLE clicks RECOVER PARTITIONS ;
    • Define Output Table
      We are going to combine the clicks and impressions tables so that we have a record of whether or not each impression resulted in a click. We'd like this data stored in Amazon S3 so that it can be used as input to other job flows.
      CREATE EXTERNAL TABLE joined_impressions (
      requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string, clicked boolean
      )
      PARTITIONED BY (day string, hour string)
      STORED AS SEQUENCEFILE
      LOCATION '${OUTPUT}/joined_impressions'
      ;
    • Define Local Impressions Table
      Next, we create some temporary tables in the job flow's local HDFS partition to store intermediate impression and click data.
      CREATE TABLE tmp_impressions (
      requestBeginTime string, adId string, impressionId string, referrer string,
      userAgent string, userCookie string, ip string
      )
      STORED AS SEQUENCEFILE ;
      We insert data from the impressions table for the time duration we're interested in. Note that, because the impressions table is partitioned, only the relevant partitions will be read.
      INSERT OVERWRITE TABLE tmp_impressions
      SELECT
      from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int))
      requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent,
      i.userCookie, i.ip
      FROM impressions i
      WHERE i.dt >= '${DAY}-${HOUR}-00'
      AND i.dt < '${NEXT_DAY}-${NEXT_HOUR}-00' ;
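The from_unixtime(cast(... / 1000)) expression above converts the logs' millisecond epoch timestamps into readable date strings. A quick Python equivalent as a sketch, assuming UTC (Hive's from_unixtime actually uses the cluster's local timezone):

```python
from datetime import datetime, timezone

def begin_time_string(request_begin_time: str) -> str:
    """Mirror from_unixtime(cast((cast(t as bigint) / 1000) as int))."""
    seconds = int(request_begin_time) // 1000  # milliseconds -> whole seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# 2009-04-13 08:00:00 UTC expressed in epoch milliseconds.
print(begin_time_string("1239609600000"))  # -> 2009-04-13 08:00:00
```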
    • Define Local Clicks Table
      For clicks, we extend the period of time over which we join by 20 minutes, meaning we accept a click that occurred up to 20 minutes after the impression.
      CREATE TABLE tmp_clicks (
      impressionId string
      ) STORED AS SEQUENCEFILE ;
      INSERT OVERWRITE TABLE tmp_clicks
      SELECT
      impressionId
      FROM
      clicks c
      WHERE
      c.dt >= '${DAY}-${HOUR}-00'
      AND c.dt < '${NEXT_DAY}-${NEXT_HOUR}-20'
      ;
    • Join Tables
      Now we combine the impressions and clicks tables using a left outer join. This way any impressions that did not result in a click are preserved.
      INSERT OVERWRITE TABLE joined_impressions
      PARTITION (day='${DAY}', hour='${HOUR}')
      SELECT
      i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent,
      i.userCookie, i.ip, (c.impressionId is not null) clicked
      FROM
      tmp_impressions i
      LEFT OUTER JOIN
      tmp_clicks c ON i.impressionId = c.impressionId
      ;
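The left outer join keeps every impression and flags whether a matching click exists. A minimal Python sketch of the same logic, on invented sample rows:

```python
# Invented sample data: impression ids, and the subset that was clicked.
impressions = ["imp-1", "imp-2", "imp-3"]
clicks = {"imp-2"}

# LEFT OUTER JOIN tmp_clicks c ON i.impressionId = c.impressionId,
# with clicked = (c.impressionId is not null): every impression is kept,
# and clicked is true only when a matching click row exists.
joined = [(imp, imp in clicks) for imp in impressions]

print(joined)  # -> [('imp-1', False), ('imp-2', True), ('imp-3', False)]
```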
    • Terminate Interactive Session
      // Terminate Job Flow
      ./elastic-mapreduce --terminate j-ABABABABAB
    • Question & Answer
      Visit http://aws.amazon.com/officehours to watch recorded sessions and to sign up for upcoming sessions.