Amazon Elastic MapReduce - Ian Meyers

In this talk, Ian Meyers covers Amazon Elastic MapReduce and how it integrates with other AWS services in a big data stack.

Transcript

  • 1. London Hadoop User Group
  • 2. About Amazon Web Services: How did Amazon… …get into cloud computing? Deep experience in building and operating global web-scale systems
  • 3. Utility computing On demand Pay as you go Uniform Available
  • 4. Utility computing On demand Pay as you go Uniform Available
  • 5. Utility computing
  • 6. Utility computing: On demand, Pay as you go, Uniform, Available. Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN
  • 7. Cloud computing benefits: No Up-Front Capital Expense; Pay Only for What You Use; Self-Service Infrastructure; Easily Scale Up and Down; Improve Agility & Time-to-Market; Low Cost; Deploy
  • 8. Your IT needs (chart of Capacity over Time): Traditional IT capacity vs. Elastic capacity
  • 9. Elastic capacity: On and Off, Fast Growth, Variable peaks, Predictable peaks
  • 10. Elastic capacity: On and Off, Fast Growth, Predictable peaks, Variable peaks. WASTE. CUSTOMER DISSATISFACTION
  • 11. Elastic capacity: Fast Growth, On and Off, Predictable peaks, Variable peaks
  • 12. 40 servers to 5000 in 3 days (chart: Number of EC2 Instances, 4/12/2008 to 4/20/2008). Annotations: Steady state of ~40 instances; Launch of Facebook modification; "Techcrunched"; EC2 scaled to peak of 5000 instances
  • 13. Global Infrastructure. Compute, Storage, Database, App Services, Deployment & Administration, Networking, AWS Global Infrastructure
  • 14. Global Infrastructure Region US-WEST (N. California) EU-WEST (Ireland) ASIA PAC (Tokyo) ASIA PAC (Singapore) US-WEST (Oregon) SOUTH AMERICA (Sao Paulo) US-EAST (Virginia) GOV CLOUD ASIA PAC (Sydney)
  • 15. Availability Zone Global Infrastructure
  • 16. Customer Needs: Store Any Amount of Data (Without Capacity Planning); Perform Complex Analysis on Any Data (Scale on Demand); Store Data Securely; Decrease Time to Market (Build Environments Quickly); Reduce Costs (Reduce Capital Expenditure); Enable Global Reach
  • 17. Ingestion | Integration
  • 18. Simple Storage Service: Highly scalable object storage for the internet; 1 byte to 5TB in size; 99.999999999% durability. Availability: 99.99%. Durability: 99.999999999%. Is a Web Store; Not a file system; No Single Points of Failure; Eventually consistent. Paradigm: Object store. Performance: Very Fast. Redundancy: Across Availability Zones. Security: Public Key / Private Key. Pricing: $0.095/GB/month. Typical use case: Write once, read many. Limits: 100 Buckets, Unlimited Storage, 5TB Objects. Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  • 19. Objects in S3. Total Number of Objects Stored in Amazon S3 (chart, Q4 2006 to Q4 2012): 14 Billion, 40 Billion, 102 Billion, 262 Billion, 762 Billion, 1.3 Trillion. Peak Requests: 830,000+ per second
  • 20. Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability. Durability: 99.999999999%. Designed for Archival; Not a file system; Vaults & Archives; 3-5 Hour Retrieval Time. Paradigm: Archive Store. Performance: Configurable - Low. Redundancy: Across Availability Zones. Security: Public Key / Private Key. Pricing: $0.011/GB/month. Typical use case: Write once, read infrequently (< 10% / Month). Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  • 21. Storage Lifecycle Integration. Simple Storage Service: Highly scalable object storage; 1 byte to 5TB in size; 99.999999999% durability. Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability
  • 22. Structured Data Management
  • 23. Database. Relational Database Service (RDS): Managed Oracle, MySQL & SQL Server. DynamoDB: Managed NoSQL Database. Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
  • 24. Database: Relational Database Service. Database-as-a-Service; No need to install or manage database instances; Scalable and fault tolerant configurations; Integration with Data Pipeline
  • 25. Database: DynamoDB. Provisioned throughput NoSQL database; Fast, predictable, configurable performance; Fully distributed, fault tolerant HA architecture; Integration with EMR & Hive
  • 26. Database: Redshift. Managed Massively Parallel Petabyte Scale Data Warehouse; Streaming Backup/Restore to S3; Extensive Security; 2 TB -> 1.6 PB
  • 27. Unstructured Data … Parallel ETL
  • 28. Application Services: Elastic MapReduce. Managed, elastic Hadoop cluster; Integrates with S3 & DynamoDB; Leverage Hive & Pig analytics scripts; Support for Spot Instances; Integrated HBase NoSQL Database
  • 29. Launching Clusters. AWS Web Console. Command Line: elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log
  • 30. Launching Clusters. Enabling Tools: elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log
  • 31. Launching Clusters. Hadoop Configuration Bootstrap Action: elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name IanMM-Test-3 --instance-group core --instance-count 2 --instance-type m2.4xlarge --instance-group task --instance-count 2 --instance-type m2.4xlarge --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log
  • 32. Amazon Data Pipeline. Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc. Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule. Output Datanode: This supports all the same data sources as the input datanode, but they don't have to be the same type.
  • 33. Orchestration with Data Pipeline. Input: RDS Table; Table: User-Demographics; SQL Precondition: "Select last_update from table" > #{YY-MM-DD}. Input: DynamoDB Table; Table: User-Event-Data-#{year-month}. Activity: EMR Transform; Hive Query: user-metrics.hql; Frequency: Daily. Output: S3 file; Path: s3://trend-data/#{year-month-day}.csv. Success Notification: metrics@example.com. Failure Notification: emr-admin@example.com. Delay Notification: emr-admin@example.com
  • 34. Analytics Pipeline. Services: S3, RDS, DynamoDB, Redshift, EMR, Data Pipeline. Stages: …collect & store, …orchestrate, …process & analyse
  • 35. Benefits only possible in the Cloud: Pay as you Go; Lower Overall Costs; Stop Guessing Capacity; Agility / Speed / Innovation; Avoid Undifferentiated Heavy Lifting; Go Global in Minutes (✔ for all six). "Private Cloud" / On Premises: X for all six
  • 36. Agility & Global Reach at the Core of EMR
  • 37. Ease of Operation: Compute Infrastructure, Hadoop Configuration, Local Disk, Operating System Config, HDFS, Networking, Hive, Pig, HBase, User Defined Software Installation
  • 38. Ease of Operation: Compute Infrastructure, Hadoop Configuration, Local Disk, Operating System Config, HDFS, Networking, Hive, Pig, HBase, User Defined Software Installation. Multiple Hadoop Distributions (Open Source & MapR); Clusters Launched with 1 Command; Up in 5 Minutes; Hard Partitioned per Customer on CPU, Memory and Disk; Dynamic Cluster Resizing; In any of 8 Regions around the Globe
  • 39. Lower Overall Costs Cheaper | Spot Market Management
  • 40. Lower TCO. June 2013 Study by Accenture Technology Labs. Not Sponsored or Funded by Amazon. "Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…" http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
  • 41. Spot 101 - What are Spot Instances? Spot allows customers to bid on unused EC2 capacity. The Spot price is based on supply and demand for instance types in an Availability Zone. Customers are fulfilled when their bid price is higher than the Spot price. Instances will be interrupted when the Spot price exceeds the bid price.
  • 42. elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4
  • 43. Reducing Hadoop Costs with Spot. Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption. Scenario #1, cost without Spot: job flow duration 14 hours; 4 instances * 14 hrs * $0.50 = $28. Scenario #2, cost with Spot: job flow duration 7 hours; 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; total = $22.75. Time Savings: 50%. Cost Savings: ~20%. Other EMR + Spot use cases: run entire cluster on Spot for biggest cost savings; reduce the cost of application testing
  • 44. Stop Guessing Capacity Dynamic Clusters
  • 45. Extend on-premise environments…
  • 46. with Amazon VPC…
  • 47. Populate as demand dictates…
  • 48. Connect over dedicated links…
  • 49. And turn it off when you are done
  • 50. EMR is Hadoop… …cheaper, easier, and more agile
  • 51. What's New? MapR M7 Introduction; Optimised for HBase Clusters; Failure Recovery; Point in Time Recovery Snapshotting; Low Latency Hadoop Optimisations; HBase Mirroring; NFS + HDFS; MapR M5 Price Drop; Support for Pig 0.11.1; RANK, CUBE & ROLLUP capability; Groovy UDFs; Support for Guava Functions; Performance Improvements; Spark/Shark Bootstrap Action; In Memory Hadoop; Spark Scripting (similar to Pig); Shark Shell with Hive Interoperability
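
Editor's sketch for slides 41 and 43: the Spot rules and the cost comparison can be checked with a few lines of Python. This is only an illustration under stated assumptions. The function names are hypothetical and not part of any AWS tool, and the prices, instance counts and durations are the figures quoted on slide 43.

```python
# Illustrative only: function names are hypothetical, figures come from slide 43.

def is_fulfilled(bid_price: float, spot_price: float) -> bool:
    """Slide 41: a request is fulfilled when the bid is higher than the Spot price."""
    return bid_price > spot_price


def is_interrupted(bid_price: float, spot_price: float) -> bool:
    """Slide 41: an instance is interrupted when the Spot price exceeds the bid."""
    return spot_price > bid_price


def job_cost(groups):
    """Sum instances * hours * hourly price over (count, hours, price) tuples."""
    return sum(count * hours * price for count, hours, price in groups)


# Scenario #1: 4 On-Demand instances for 14 hours at $0.50 per hour.
on_demand_only = job_cost([(4, 14, 0.50)])

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances at $0.25,
# with the job flow finishing in 7 hours.
with_spot = job_cost([(4, 7, 0.50), (5, 7, 0.25)])

cost_saving = 1 - with_spot / on_demand_only
time_saving = 1 - 7 / 14

print(f"without Spot: ${on_demand_only:.2f}")
print(f"with Spot:    ${with_spot:.2f}")
print(f"cost saving:  {cost_saving:.2%}, time saving: {time_saving:.0%}")
```

The exact cost saving works out to 18.75%, which the slide rounds to ~20%.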
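
Editor's sketch for slides 18 to 21: the per-gigabyte prices quoted for S3 and Glacier suggest why the storage lifecycle integration matters. The comparison below is a rough illustration, assuming a flat per-GB monthly rate and ignoring request, retrieval and transfer charges, which the slides do not quote. The 1000 GB archive size and the helper name are hypothetical.

```python
# Illustrative only: flat-rate storage cost using the prices on slides 18 and 20.
S3_PRICE_PER_GB_MONTH = 0.095       # slide 18
GLACIER_PRICE_PER_GB_MONTH = 0.011  # slide 20


def monthly_storage_cost(gigabytes: float, price_per_gb_month: float) -> float:
    """Storage-only cost per month; request and retrieval fees are not modelled."""
    return gigabytes * price_per_gb_month


archive_gb = 1000  # hypothetical data set size, not from the talk

s3_cost = monthly_storage_cost(archive_gb, S3_PRICE_PER_GB_MONTH)
glacier_cost = monthly_storage_cost(archive_gb, GLACIER_PRICE_PER_GB_MONTH)

print(f"S3:      ${s3_cost:.2f} per month")
print(f"Glacier: ${glacier_cost:.2f} per month")
```

At these list prices, data that is written once and read infrequently costs far less to keep in Glacier, at the price of the 3-5 hour retrieval time noted on slide 20.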