Amazon Elastic Map Reduce - Ian Meyers

In this talk, Ian covers Amazon Elastic MapReduce and how it integrates with other AWS services in a big data stack.

Published in: Technology, Business

  1. London Hadoop User Group
  2. About Amazon Web Services: deep experience in building and operating global web-scale systems. How did Amazon get into cloud computing?
  3. Utility computing: On demand; Pay as you go; Uniform; Available
  4. Utility computing: On demand; Pay as you go; Uniform; Available
  5. Utility computing
  6. Utility computing: On demand; Pay as you go; Uniform; Available. Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN
  7. Cloud computing benefits: No Up-Front Capital Expense; Pay Only for What You Use; Self-Service Infrastructure; Easily Scale Up and Down; Improve Agility & Time-to-Market; Low Cost; Deploy
  8. Traditional IT capacity; Elastic capacity; Your IT needs (chart axes: Capacity, Time)
  9. Elastic capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks
  10. Elastic capacity: On and Off; Fast Growth; Predictable peaks; Variable peaks. WASTE; CUSTOMER DISSATISFACTION
  11. Elastic capacity: Fast Growth; On and Off; Predictable peaks; Variable peaks
  12. Number of EC2 Instances (chart, 4/12/2008 to 4/20/2008): 40 servers to 5000 in 3 days. EC2 scaled to peak of 5000 instances; “Techcrunched”; Launch of Facebook modification; Steady state of ~40 instances
  13. Global Infrastructure. Compute; Storage; AWS Global Infrastructure; Database; App Services; Deployment & Administration; Networking
  14. Global Infrastructure, Region: US-WEST (N. California); EU-WEST (Ireland); ASIA PAC (Tokyo); ASIA PAC (Singapore); US-WEST (Oregon); SOUTH AMERICA (Sao Paulo); US-EAST (Virginia); GOV CLOUD; ASIA PAC (Sydney)
  15. Global Infrastructure: Availability Zone
  16. Customer Needs • Store Any Amount of Data – Without Capacity Planning • Perform Complex Analysis on Any Data – Scale on Demand • Store Data Securely • Decrease Time to Market – Build Environments Quickly • Reduce Costs – Reduce Capital Expenditure • Enable Global Reach
  17. Ingestion | Integration
  18. Simple Storage Service: Highly scalable object storage for the internet; 1 byte to 5TB in size; 99.999999999% durability. Availability: 99.99%. Durability: 99.999999999%. Is a Web Store; Not a file system; No Single Points of Failure; Eventually consistent. Paradigm: Object store. Performance: Very Fast. Redundancy: Across Availability Zones. Security: Public Key / Private Key. Pricing: $0.095/GB/month. Typical use case: Write once, read many. Limits: 100 Buckets, Unlimited Storage, 5TB Objects. Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  19. Objects in S3: Total Number of Objects Stored in Amazon S3. Peak Requests: 830,000+ per second. Chart values: 14 Billion; 40 Billion; 102 Billion; 262 Billion; 762 Billion; 1.3 Trillion. Chart axis: Q4 2006, Q4 2007, Q4 2008, Q4 2009, Q4 2010, Q4 2011, Q4 2012
  20. Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability. Durability: 99.999999999%. Designed for Archival; Not a file system; Vaults & Archives; 3-5 Hour Retrieval Time. Paradigm: Archive Store. Performance: Configurable - Low. Redundancy: Across Availability Zones. Security: Public Key / Private Key. Pricing: $0.011/GB/month. Typical use case: Write once, read infrequently, < 10% / Month. Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  21. Storage Lifecycle Integration. Simple Storage Service: Highly scalable object storage; 1 byte to 5TB in size; 99.999999999% durability. Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability
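The per-gigabyte prices quoted on slides 18 and 20 can be compared with a little arithmetic. The following is a minimal sketch, assuming the slide prices ($0.095/GB/month for S3, $0.011/GB/month for Glacier) and a hypothetical 1,000 GB data set; the function and constant names are illustrative, and request, retrieval, and transfer charges are deliberately ignored.

```python
# Prices as quoted on the slides (2013); treat them as historical, not current.
S3_PRICE_PER_GB_MONTH = 0.095
GLACIER_PRICE_PER_GB_MONTH = 0.011


def monthly_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage charge for size_gb at a flat per-GB price."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    size_gb = 1000  # hypothetical data set size, not taken from the slides
    s3 = monthly_cost(size_gb, S3_PRICE_PER_GB_MONTH)
    glacier = monthly_cost(size_gb, GLACIER_PRICE_PER_GB_MONTH)
    print(f"S3:      ${s3:.2f} per month")
    print(f"Glacier: ${glacier:.2f} per month")
```

At these list prices, archiving to Glacier costs roughly a ninth of keeping the same data in S3, which is the motivation for the storage lifecycle integration on slide 21.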
  22. Structured Data Management
  23. Database: Relational Database Service (Managed Oracle, MySQL & SQL Server); DynamoDB (Managed NoSQL Database); Amazon Redshift (Massively Parallel Petabyte Scale Data Warehouse)
  24. Database, Relational Database Service: Database-as-a-Service; No need to install or manage database instances; Scalable and fault tolerant configurations; Integration with Data Pipeline
  25. Database, DynamoDB: Provisioned throughput NoSQL database; Fast, predictable, configurable performance; Fully distributed, fault tolerant HA architecture; Integration with EMR & Hive
  26. Database, Redshift: Managed Massively Parallel Petabyte Scale Data Warehouse; Streaming Backup/Restore to S3; Extensive Security; 2 TB -> 1.6 PB
  27. Unstructured Data … Parallel ETL
  28. Application Services, Elastic MapReduce: Managed, elastic Hadoop cluster; Integrates with S3 & DynamoDB; Leverage Hive & Pig analytics scripts; Support for Spot Instances; Integrated HBase NoSQL Database
  29. Launching Clusters • AWS Web Console • Command Line: elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log
  30. Launching Clusters • Enabling Tools: elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log
  31. Launching Clusters • Hadoop Configuration Bootstrap Action: elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name IanMM-Test-3 --instance-group core --instance-count 2 --instance-type m2.4xlarge --instance-group task --instance-count 2 --instance-type m2.4xlarge --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log
  32. Amazon Data Pipeline. Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc. Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule. Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type.
  33. Orchestration with Data Pipeline. Input: RDS Table; Table: User-Demographics; SQL Precondition: “Select last_update from table” > #{YY-MM-DD}. Input: DynamoDB Table; Table: User-Event-Data-#{year-month}. Activity: EMR Transform; Hive Query: user-metrics.hql; Frequency: Daily. Output: S3 file; Path: s3://trend-data/#{year-month-day}.csv. Success Notification: metrics@example.com. Failure Notification: emr-admin@example.com. Delay Notification: emr-admin@example.com
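To make the date placeholders on slide 33 concrete, here is a minimal sketch that records the slide's pipeline as plain Python data and expands `#{year-month-day}` and `#{year-month}` for one scheduled run. The dictionary layout and the `expand` helper are hypothetical illustrations; they are not the Data Pipeline definition format or its expression language, and the run date is made up.

```python
from datetime import date

# Pipeline from slide 33, written as plain data. The layout is an assumption
# made for illustration only.
PIPELINE = {
    "inputs": [
        {"type": "RDS Table", "table": "User-Demographics"},
        {"type": "DynamoDB Table", "table": "User-Event-Data-#{year-month}"},
    ],
    "activity": {
        "type": "EMR Transform",
        "hive_query": "user-metrics.hql",
        "frequency": "Daily",
    },
    "output": {"type": "S3 file", "path": "s3://trend-data/#{year-month-day}.csv"},
}


def expand(template: str, run_date: date) -> str:
    """Hypothetical helper: substitute the slide's date placeholders."""
    return (
        template
        .replace("#{year-month-day}", run_date.strftime("%Y-%m-%d"))
        .replace("#{year-month}", run_date.strftime("%Y-%m"))
    )


if __name__ == "__main__":
    run_date = date(2013, 7, 4)  # hypothetical daily run
    print(expand(PIPELINE["output"]["path"], run_date))
    print(expand(PIPELINE["inputs"][1]["table"], run_date))
```

Under these assumptions, each daily run reads the current month's DynamoDB table and writes one dated CSV file to S3.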
  34. Analytics Pipeline: Redshift; S3; RDS; EMR; Data Pipeline; DynamoDB. …collect & store; …orchestrate; …process & analyse
  35. Benefits only possible in the Cloud: Pay as you Go; Lower Overall Costs; Stop Guessing Capacity; Agility / Speed / Innovation; Avoid Undifferentiated Heavy Lifting; Go Global in Minutes (✔ for all six). “Private Cloud” / On Premises: X for all six
  36. Agility & Global Reach at the Core of EMR
  37. Ease of Operation: Compute Infrastructure; Hadoop Configuration; Local Disk; Operating System Config; HDFS; Networking; Hive; Pig; HBase; User Defined Software Installation
  38. Ease of Operation: Compute Infrastructure; Hadoop Configuration; Local Disk; Operating System Config; HDFS; Networking; Hive; Pig; HBase; User Defined Software Installation. Multiple Hadoop Distributions - Open Source & MapR; Clusters Launched with 1 Command; Up in 5 Minutes; Hard Partitioned per Customer on CPU, Memory and Disk; Dynamic Cluster Resizing; In any of 8 Regions around the Globe
  39. Lower Overall Costs: Cheaper | Spot Market Management
  40. Lower TCO: June 2013 Study by Accenture Technology Labs. Not Sponsored or Funded by Amazon. “Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…” http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
  41. Spot 101 - What are Spot Instances • Spot allows customers to bid on unused EC2 capacity • Spot price based on supply/demand of instance types in an Availability Zone • Customers are fulfilled when their bid price is higher than the Spot Price • Instances will be interrupted when the Spot price exceeds the bid price
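The fulfilment and interruption rules on slide 41 amount to a comparison between a fixed bid and a moving Spot price. The sketch below is a simplified model under stated assumptions: the price series is invented, `spot_states` is a hypothetical name, and because the slide does not say what happens when the two prices are equal, the model leaves the instance state unchanged in that case.

```python
from typing import List


def spot_states(bid: float, spot_prices: List[float]) -> List[bool]:
    """Return, per period, whether a Spot request at `bid` is running.

    Rules taken from the slide:
      - fulfilled when the bid is higher than the Spot price
      - interrupted when the Spot price exceeds the bid
    Equality is not covered by the slide; this model keeps the prior state.
    """
    running = False
    states = []
    for price in spot_prices:
        if bid > price:
            running = True
        elif price > bid:
            running = False
        states.append(running)
    return states


if __name__ == "__main__":
    # Hypothetical hourly Spot prices in dollars; the 0.40 bid echoes the
    # --bid-price value used in the talk's command-line example.
    prices = [0.25, 0.30, 0.45, 0.40, 0.35]
    for price, running in zip(prices, spot_states(0.40, prices)):
        print(f"spot={price:.2f} running={running}")
```

This is why the talk pairs Spot capacity with On-Demand instances: work placed on Spot nodes has to tolerate the periods in which the model reports the instance as not running.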
  42. elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4
  43. Reducing Hadoop Costs with Spot: Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption. Scenario #1, cost without Spot: Job Flow Duration 14 Hours; 4 instances * 14 hrs * $0.50 = $28. Scenario #2, cost with Spot: Job Flow Duration 7 Hours; 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75. Time Savings: 50%. Cost Savings: ~20%. Other EMR + Spot Use Cases: Run entire cluster on Spot for biggest cost savings; Reduce the cost of application testing
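The arithmetic on slide 43 can be checked with a short worked example. This sketch uses only the figures on the slide (instance counts, hours, and the $0.50 and $0.25 hourly rates); the function and variable names are illustrative.

```python
from typing import Iterable, Tuple

ON_DEMAND_RATE = 0.50  # dollars per instance-hour, from the slide
SPOT_RATE = 0.25       # dollars per instance-hour, from the slide


def cluster_cost(groups: Iterable[Tuple[int, float, float]]) -> float:
    """Sum instances * hours * hourly rate over each instance group."""
    return sum(count * hours * rate for count, hours, rate in groups)


# Scenario #1: 4 On-Demand instances for 14 hours.
without_spot = cluster_cost([(4, 14, ON_DEMAND_RATE)])

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances, 7 hours.
with_spot = cluster_cost([(4, 7, ON_DEMAND_RATE), (5, 7, SPOT_RATE)])

time_saving = 1 - 7 / 14
cost_saving = 1 - with_spot / without_spot

if __name__ == "__main__":
    print(f"Without Spot: ${without_spot:.2f}")
    print(f"With Spot:    ${with_spot:.2f}")
    print(f"Time saving:  {time_saving:.0%}")
    print(f"Cost saving:  {cost_saving:.2%}")
```

The totals match the slide ($28 and $22.75). The exact cost saving works out to 18.75%, which the slide rounds to roughly 20%.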
  44. Stop Guessing Capacity: Dynamic Clusters
  45. Extend on-premise environments…
  46. with Amazon VPC…
  47. Populate as demand dictates…
  48. Connect over dedicated links…
  49. And turn it off when you are done
  50. EMR is Hadoop… …cheaper, easier, and more agile
  51. What’s New? • MapR M7 Introduction • Optimised for HBase Clusters • Failure Recovery • Point in Time Recovery Snapshotting • Low Latency Hadoop Optimisations • HBase Mirroring • NFS + HDFS • MapR M5 Price Drop • Support for Pig 0.11.1 • RANK, CUBE & ROLLUP capability • Groovy UDFs • Support for Guava Functions • Performance Improvements • Spark/Shark Bootstrap Action • In Memory Hadoop • Spark Scripting (similar to Pig) • Shark Shell with Hive Interoperability