London Hadoop User Group
About Amazon Web Services
How did Amazon get into cloud computing?
Deep experience in building and operating global web-scale systems
Utility computing
On demand | Pay as you go | Uniform | Available
Compute | Storage | Security | Scaling | Database | Networking | Monitoring | Messaging | Workflow | DNS | Load Balancing | Backup | CDN
Cloud computing benefits
•  No Up-Front Capital Expense
•  Pay Only for What You Use
•  Self-Service Infrastructure
•  Easily Scale Up and Down
•  Improve Agility & Time-to-Market
•  Low Cost
•  Deploy
Your IT needs
[Figure: capacity against time, comparing traditional IT capacity with elastic capacity.]
[Figures: elastic capacity for four demand patterns: on and off, fast growth, variable peaks, predictable peaks. Where capacity is fixed, the gaps between capacity and demand are labelled WASTE and CUSTOMER DISSATISFACTION.]
40 servers to 5000 in 3 days
[Figure: number of EC2 instances per day, 4/12/2008 to 4/20/2008. Annotations: steady state of ~40 instances; launch of Facebook modification; "Techcrunched"; EC2 scaled to peak of 5000 instances.]
Compute | Storage | Database | Networking | App Services | Deployment & Administration | AWS Global Infrastructure
Global Infrastructure
Region
•  US-WEST (N. California)
•  EU-WEST (Ireland)
•  ASIA PAC (Tokyo)
•  ASIA PAC (Singapore)
•  US-WEST (Oregon)
•  SOUTH AMERICA (Sao Paulo)
•  US-EAST (Virginia)
•  GOV CLOUD
•  ASIA PAC (Sydney)
Availability Zone
Customer Needs
•  Store Any Amount of Data
–  Without Capacity Planning
•  Perform Complex Analysis on Any Data
–  Scale on Demand
•  Store Data Securely
•  Decrease Time to Market
–  Build Environments Quickly
•  Reduce Costs
–  Reduce Capital Expenditure
•  Enable Global Reach
Ingestion | Integration
Elastic Block Store
High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities
Simple Storage Service
Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
Availability: 99.99%
Durability: 99.999999999%
Is a Web Store
Not a file system
No Single Points of Failure
Eventually consistent
Paradigm: Object store
Performance: Very Fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 Buckets, Unlimited Storage, 5TB Objects
Objects in S3
Peak Requests: 830,000+ per second
[Figure: total number of objects stored in Amazon S3, x-axis Q4 2006 to Q4 2012. Plotted values: 14 billion, 40 billion, 102 billion, 262 billion, 762 billion, 1.3 trillion.]
Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
Durability: 99.999999999%
Designed for Archival
Not a file system
Vaults & Archives
3-5 Hour Retrieval Time
Paradigm: Archive Store
Performance: Configurable - Low
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / Month)
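The two per-gigabyte prices quoted on the slides ($0.095/GB/month for S3, $0.011/GB/month for Glacier) can be compared with a short calculation. This is a sketch under simplifying assumptions: a flat rate per gigabyte, no request, retrieval or transfer charges, and a hypothetical data set of 1000 GB.

```python
from decimal import Decimal

# Prices per GB per month as quoted on the slides.
S3_PRICE = Decimal("0.095")
GLACIER_PRICE = Decimal("0.011")


def monthly_cost(gigabytes, price_per_gb):
    """Flat-rate storage cost in dollars for one month."""
    return Decimal(gigabytes) * price_per_gb


stored_gb = 1000  # hypothetical data set size

print("S3:     ", monthly_cost(stored_gb, S3_PRICE))
print("Glacier:", monthly_cost(stored_gb, GLACIER_PRICE))
```

`Decimal` keeps the currency arithmetic exact, which a binary float would not do for values such as 0.095.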
Storage Lifecycle Integration
Simple Storage Service: highly scalable object storage, 1 byte to 5TB in size, 99.999999999% durability
Glacier: long term object archive, extremely low cost per gigabyte, 99.999999999% durability
Structured Data Management
Database
Relational Database Service (RDS): Managed Oracle, MySQL & SQL Server
DynamoDB: Managed NoSQL Database
Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
Relational Database Service
Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Extensive Security
2 TB -> 1.6 PB
Unstructured Data … Parallel ETL
Application Services
Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Support for Spot Instances
Integrated HBase NoSQL Database
Launching Clusters
•  AWS Web Console
•  Command Line
elastic-mapreduce --create --key-pair micro --region eu-west-1 \
  --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge \
  --alive --log-uri s3n://meyersi-ire/EMR/log
•  Enabling Tools
elastic-mapreduce --create --key-pair micro --region eu-west-1 \
  --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest \
  --hbase \
  --log-uri s3n://meyersi-ire/EMR/log
•  Hadoop Configuration Bootstrap Action
elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,dfs.block.size=1048576" \
  --key-pair micro --region eu-west-1 --name IanMM-Test-3 \
  --instance-group core --instance-count 2 --instance-type m2.4xlarge \
  --instance-group task --instance-count 2 --instance-type m2.4xlarge \
  --alive --pig-interactive --hive-interactive \
  --log-uri s3n://meyersi-ire/EMR/log
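The bootstrap action sets `dfs.block.size=1048576`, which is 1 MiB expressed in bytes. HDFS stores each file as a sequence of blocks of at most that size, so the setting determines how many blocks a file occupies. A minimal sketch of the arithmetic follows; the function name `block_count` and the example file sizes are hypothetical.

```python
BLOCK_SIZE = 1048576  # bytes, the dfs.block.size value from the command (1 MiB)


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed for a file; the last block may be partial."""
    # Integer ceiling division avoids floating point for very large files.
    return (file_size_bytes + block_size - 1) // block_size


one_gib = 1024 ** 3  # hypothetical 1 GiB input file
print(block_count(one_gib))
```

A smaller block size produces more blocks per file, and therefore more input splits for a MapReduce job reading it with default settings.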
Amazon Data Pipeline
Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type.
Orchestration with Data Pipeline
Input: RDS Table
  Table: User-Demographics
  SQL Precondition: “Select last_update from table“ > #{YY-MM-DD}
Input: DynamoDB Table
  Table: User-Event-Data-#{year-month}
Activity: EMR Transform
  Hive Query: user-metrics.hql
  Frequency: Daily
Output: S3 file
  Path: s3://trend-data/#{year-month-day}.csv
Success Notification: metrics@example.com
Failure Notification: emr-admin@example.com
Delay Notification: emr-admin@example.com
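The pipeline on this slide uses date placeholders such as `#{year-month-day}` in the S3 path and `#{year-month}` in the DynamoDB table name, so each scheduled run reads and writes a different location. The sketch below shows the idea only. It is not the Data Pipeline expression language: the `PLACEHOLDERS` mapping, the `resolve` function and the run date are all hypothetical.

```python
from datetime import date

# Hypothetical mapping from the slide's placeholder names to strftime formats.
PLACEHOLDERS = {
    "#{year-month-day}": "%Y-%m-%d",
    "#{year-month}": "%Y-%m",
}


def resolve(template, run_date):
    """Substitute each known placeholder with the formatted run date."""
    for placeholder, fmt in PLACEHOLDERS.items():
        template = template.replace(placeholder, run_date.strftime(fmt))
    return template


run_date = date(2013, 7, 1)  # hypothetical scheduled run date

print(resolve("s3://trend-data/#{year-month-day}.csv", run_date))
print(resolve("User-Event-Data-#{year-month}", run_date))
```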
Analytics Pipeline
[Figure: analytics pipeline built from S3, DynamoDB, RDS, Redshift, EMR and Data Pipeline, with stages labelled "…collect & store", "…process & analyse" and "…orchestrate".]
Benefits only possible in the Cloud
Each benefit below is marked ✔ for the cloud and X for “Private Cloud” / On Premises:
•  Pay as you Go
•  Lower Overall Costs
•  Stop Guessing Capacity
•  Agility / Speed / Innovation
•  Avoid Undifferentiated Heavy Lifting
•  Go Global in Minutes
Agility & Global Reach at the Core of EMR
Ease of Operation
Compute Infrastructure | Hadoop Configuration | Local Disk | Operating System Config | HDFS | Networking | Hive | Pig | HBase | User Defined Software Installation
Multiple Hadoop Distributions - Open Source & MapR
Clusters Launched with 1 Command
Up in 5 Minutes
Hard Partitioned per Customer on CPU, Memory and Disk
Dynamic Cluster Resizing
In any of 8 Regions around the Globe
Lower Overall Costs
Cheaper | Spot Market Management
Lower TCO
June 2013 Study by Accenture Technology Labs
Not Sponsored or Funded by Amazon
“Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…”
http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
Spot 101 - What are Spot Instances
•  Spot allows customers to bid on unused EC2 capacity
•  Spot price based on supply/demand of instance types in an Availability Zone
•  Customers are fulfilled when their bid price is higher than the Spot price
•  Instances will be interrupted when the Spot price exceeds the bid price
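The fulfilment and interruption rules on this slide can be expressed as a small state machine. This is a sketch of the two rules as stated, not of the actual Spot market; the price history and the function name `simulate` are hypothetical. When the bid equals the Spot price neither rule on the slide applies, so the sketch leaves the state unchanged.

```python
def simulate(bid_price, spot_prices):
    """Return a list of booleans: is the Spot request running at each price point?"""
    running = False
    states = []
    for price in spot_prices:
        if bid_price > price:
            running = True   # bid is higher than the Spot price: fulfilled
        elif price > bid_price:
            running = False  # Spot price exceeds the bid: interrupted
        # Equal prices: neither rule applies, so the state carries over.
        states.append(running)
    return states


prices = [0.25, 0.30, 0.45, 0.40, 0.35]  # hypothetical Spot price history ($/hour)
print(simulate(0.40, prices))
```

A higher bid reduces the number of interruptions at the cost of a higher worst-case hourly price, which is why the talk suggests mixing Spot with On-Demand instances.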
Reducing Hadoop Costs with Spot
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4
Other EMR + Spot Use Cases
•  Run entire cluster on Spot for biggest cost savings
•  Reduce the cost of application testing
Scenario #1: Cost without Spot
Job Flow Duration: 14 Hours
4 instances * 14 hrs * $0.50 = $28
Scenario #2: Cost with Spot
Job Flow Duration: 7 Hours
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50%
Cost Savings: ~20%
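The arithmetic behind the two scenarios can be reproduced directly. The sketch below uses only the instance counts, durations and hourly rates shown on the slide; the helper name `job_cost` is hypothetical. It gives an exact cost saving of 18.75%, which the slide rounds to ~20%.

```python
from decimal import Decimal

ON_DEMAND_RATE = Decimal("0.50")  # $ per instance-hour, from the slide
SPOT_RATE = Decimal("0.25")       # $ per instance-hour, from the slide


def job_cost(groups):
    """Total cost for an iterable of (instances, hours, hourly_rate) groups."""
    return sum(Decimal(n) * Decimal(h) * rate for n, h, rate in groups)


# Scenario #1: 4 On-Demand instances for 14 hours.
without_spot = job_cost([(4, 14, ON_DEMAND_RATE)])

# Scenario #2: 4 On-Demand plus 5 Spot instances for 7 hours.
with_spot = job_cost([(4, 7, ON_DEMAND_RATE), (5, 7, SPOT_RATE)])

cost_saving = (without_spot - with_spot) / without_spot
time_saving = Decimal(14 - 7) / Decimal(14)

print("without Spot:", without_spot)
print("with Spot:   ", with_spot)
print("cost saving: ", cost_saving)
print("time saving: ", time_saving)
```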
Stop Guessing Capacity

Dynamic Clusters
Extend on-premise environments…
with Amazon VPC…
Populate as demand dictates…
Connect over dedicated links…
And turn it off when you are done
EMR is Hadoop…

…cheaper, easier, and more agile
What’s New?
•  MapR M7 Introduction
•  Optimised for HBase Clusters
•  Failure Recovery
•  Point in Time Recovery Snapshotting
•  Low Latency Hadoop Optimisations
•  HBase Mirroring
•  NFS + HDFS
•  MapR M5 Price Drop
•  Support for Pig 0.11.1
•  RANK, CUBE & ROLLUP capability
•  Groovy UDFs
•  Support for Guava Functions
•  Performance Improvements
•  Spark/Shark Bootstrap Action
•  In Memory Hadoop
•  Spark Scripting (similar to Pig)
•  Shark Shell with Hive Interoperability

Amazon Elastic Map Reduce - Ian Meyers

  • 2. Deep experience in building and operating global web scale systems About  Amazon   Web  Services   ? …get into cloud computing? How did Amazon…
  • 3. Utility computing On demand Pay as you go Uniform Available
  • 4. Utility computing On demand Pay as you go Uniform Available
  • 6. Utility computing: On demand; Pay as you go; Uniform; Available. Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN
  • 7. Cloud computing benefits: No Up-Front Capital Expense; Pay Only for What You Use; Self-Service Infrastructure; Easily Scale Up and Down; Improve Agility & Time-to-Market; Low Cost; Deploy
  • 8. Elastic capacity compared with Traditional IT capacity and Your IT needs (chart axes: Capacity, Time)
  • 9. Elastic capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks
  • 10. Elastic capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks. WASTE; CUSTOMER DISSATISFACTION
  • 11. Elastic capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks
  • 12. 40 servers to 5000 in 3 days: Steady state of ~40 instances; Launch of Facebook modification; “Techcrunched”; EC2 scaled to peak of 5000 instances (chart: Number of EC2 Instances, 4/12/2008 to 4/20/2008)
  • 13. Global Infrastructure. AWS Global Infrastructure: Compute; Storage; Database; App Services; Deployment & Administration; Networking
  • 14. Global Infrastructure, Region: US-WEST (N. California); EU-WEST (Ireland); ASIA PAC (Tokyo); ASIA PAC (Singapore); US-WEST (Oregon); SOUTH AMERICA (Sao Paulo); US-EAST (Virginia); GOV CLOUD; ASIA PAC (Sydney)
  • 16. Customer Needs
      • Store Any Amount of Data
        – Without Capacity Planning
      • Perform Complex Analysis on Any Data
        – Scale on Demand
      • Store Data Securely
      • Decrease Time to Market
        – Build Environments Quickly
      • Reduce Costs
        – Reduce Capital Expenditure
      • Enable Global Reach
  • 18. Simple Storage Service: Highly scalable object storage for the internet; 1 byte to 5TB in size; 99.999999999% durability
      • Availability: 99.99%
      • Durability: 99.999999999%
      • Is a Web Store; Not a file system
      • No Single Points of Failure; Eventually consistent
      • Paradigm: Object store
      • Performance: Very Fast
      • Redundancy: Across Availability Zones
      • Security: Public Key / Private Key
      • Pricing: $0.095/GB/month
      • Typical use case: Write once, read many
      • Limits: 100 Buckets, Unlimited Storage, 5TB Objects
      • Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  • 19. Objects in S3. Total Number of Objects Stored in Amazon S3, charted by quarter from Q4 2006 to Q4 2012; data labels: 14 Billion, 40 Billion, 102 Billion, 262 Billion, 762 Billion, 1.3 Trillion. Peak Requests: 830,000+ per second
  • 20. Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability
      • Durability: 99.999999999%
      • Designed for Archival; Not a file system
      • Vaults & Archives; 3-5 Hour Retrieval Time
      • Paradigm: Archive Store
      • Performance: Configurable - Low
      • Redundancy: Across Availability Zones
      • Security: Public Key / Private Key
      • Pricing: $0.011/GB/month
      • Typical use case: Write once, read infrequently (< 10% / Month)
      • Elastic Block Store: High performance block storage device; 1GB to 1TB in size; Mount as drives to instances with snapshot/cloning functionalities
  • 21. Storage Lifecycle Integration
      • Simple Storage Service: Highly scalable object storage; 1 byte to 5TB in size; 99.999999999% durability
      • Glacier: Long term object archive; Extremely low cost per gigabyte; 99.999999999% durability
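The per-gigabyte prices quoted on the S3 and Glacier slides can be compared with a few lines of arithmetic. The following is a minimal sketch only: the 1,000 GB volume is a hypothetical figure chosen for illustration, the prices are the ones printed on the slides (not current pricing), and request and retrieval charges are ignored.

```python
from decimal import Decimal

# Per-GB monthly storage prices as quoted on the slides (not current pricing).
S3_PRICE_PER_GB_MONTH = Decimal("0.095")
GLACIER_PRICE_PER_GB_MONTH = Decimal("0.011")


def monthly_storage_cost(gigabytes: int, price_per_gb_month: Decimal) -> Decimal:
    """Storage-only monthly cost; request and retrieval charges are ignored."""
    return gigabytes * price_per_gb_month


# Hypothetical data volume, chosen purely for illustration.
volume_gb = 1000

s3_cost = monthly_storage_cost(volume_gb, S3_PRICE_PER_GB_MONTH)
glacier_cost = monthly_storage_cost(volume_gb, GLACIER_PRICE_PER_GB_MONTH)
saving_fraction = float(1 - glacier_cost / s3_cost)

print(f"S3:      ${s3_cost:.2f} per month")
print(f"Glacier: ${glacier_cost:.2f} per month")
print(f"Saving:  {saving_fraction:.1%}")
```

Decimal is used so that the dollar amounts stay exact rather than picking up binary floating-point rounding.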
  • 23. Database: Relational Database Service (RDS): Managed Oracle, MySQL & SQL Server; DynamoDB: Managed NoSQL Database; Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
  • 24. Database, Relational Database Service: Database-as-a-Service; No need to install or manage database instances; Scalable and fault tolerant configurations; Integration with Data Pipeline
  • 25. Database, DynamoDB: Provisioned throughput NoSQL database; Fast, predictable, configurable performance; Fully distributed, fault tolerant HA architecture; Integration with EMR & Hive
  • 26. Database, Redshift: Managed Massively Parallel Petabyte Scale Data Warehouse; Streaming Backup/Restore to S3; Extensive Security; 2 TB -> 1.6 PB
  • 27. Unstructured Data … Parallel ETL
  • 28. Application Services, Elastic MapReduce: Managed, elastic Hadoop cluster; Integrates with S3 & DynamoDB; Leverage Hive & Pig analytics scripts; Support for Spot Instances; Integrated HBase NoSQL Database
  • 29. Launching Clusters
      • AWS Web Console
      • Command Line
        elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log
  • 30. Launching Clusters
      • Enabling Tools
        elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log
  • 31. Launching Clusters
      • Hadoop Configuration Bootstrap Action
        elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name IanMM-Test-3 --instance-group core --instance-count 2 --instance-type m2.4xlarge --instance-group task --instance-count 2 --instance-type m2.4xlarge --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log
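The first two launch commands differ only in the optional tool flags that the second one adds. As a rough sketch of that relationship, the hypothetical helper `build_create_command` below assembles the argument list using only flags and values that appear on the slides. It is not part of any AWS tooling, and it only prints the command line rather than running it.

```python
import shlex


def build_create_command(name, interactive_tools=(), hbase=False):
    """Assemble an elastic-mapreduce --create command from flags shown on the slides.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    argv = [
        "elastic-mapreduce", "--create",
        "--key-pair", "micro",
        "--region", "eu-west-1",
        "--name", name,
        "--num-instances", "5",
        "--instance-type", "m2.4xlarge",
        "--alive",
    ]
    # The "Enabling Tools" slide adds interactive Pig and Hive at the latest versions.
    for tool in interactive_tools:
        argv += [f"--{tool}-interactive", f"--{tool}-versions", "latest"]
    if hbase:
        argv.append("--hbase")
    argv += ["--log-uri", "s3n://meyersi-ire/EMR/log"]
    return argv


basic = build_create_command("IanMM-Test1")
with_tools = build_create_command(
    "IanMM-Test1", interactive_tools=("pig", "hive"), hbase=True
)

# shlex.join quotes each argument so the printed line is safe to paste into a shell.
print(shlex.join(basic))
print(shlex.join(with_tools))
```

Keeping the command as a list of arguments until the last moment avoids the quoting and dash-mangling problems that creep in when commands are copied out of slides.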
  • 32. Amazon Data Pipeline
      • Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
      • Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
      • Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type.
  • 33. Orchestration with Data Pipeline
      • Input: DynamoDB Table; Table: User-Event-Data-#{year-month}
      • Input: RDS Table; Table: User-Demographics; SQL Precondition: “Select last_update from table” > #{YY-MM-DD}
      • Activity: EMR Transform; Hive Query: user-metrics.hql; Frequency: Daily
      • Output: S3 file; Path: s3://trend-data/#{year-month-day}.csv
      • Success Notification: metrics@example.com
      • Failure Notification: emr-admin@example.com
      • Delay Notification: emr-admin@example.com
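The `#{...}` placeholders on the orchestration slide stand for values filled in on each scheduled run. The sketch below illustrates the idea for the two date placeholders only. It mimics the slide's simplified notation rather than the actual Data Pipeline expression language, and the function name, the format mapping, and the run date are all assumptions made for illustration.

```python
import re
from datetime import date

# Placeholder names as written on the slide; the mapping to concrete
# date formats is an assumption made for this sketch.
PLACEHOLDER_FORMATS = {
    "year-month-day": "%Y-%m-%d",
    "year-month": "%Y-%m",
}


def expand_placeholders(template: str, run_date: date) -> str:
    """Replace known #{...} placeholders with values derived from run_date."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in PLACEHOLDER_FORMATS:
            return match.group(0)  # leave unknown placeholders untouched
        return run_date.strftime(PLACEHOLDER_FORMATS[name])

    return re.sub(r"#\{([^}]+)\}", substitute, template)


# Hypothetical run date for a daily schedule.
run_date = date(2013, 7, 1)

output_path = expand_placeholders("s3://trend-data/#{year-month-day}.csv", run_date)
input_table = expand_placeholders("User-Event-Data-#{year-month}", run_date)

print(output_path)
print(input_table)
```

With a daily frequency, each run would write a new dated CSV while reading from the event table for the current month.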
  • 34. Analytics Pipeline: …collect & store; …orchestrate; …process & analyse. Services shown: S3, DynamoDB, RDS, Redshift, Data Pipeline, EMR
  • 35. Benefits only possible in the Cloud: Pay as you Go; Lower Overall Costs; Stop Guessing Capacity; Agility / Speed / Innovation; Avoid Undifferentiated Heavy Lifting; Go Global in Minutes. Each is marked ✔ for the Cloud and X for “Private Cloud” / On Premises
  • 36. Agility & Global Reach at the Core of EMR
  • 37. Ease of Operation: Compute Infrastructure; Hadoop Configuration; Local Disk; Operating System Config; HDFS; Networking; Hive; Pig; HBase; User Defined Software Installation
  • 38. Ease of Operation: Compute Infrastructure; Hadoop Configuration; Local Disk; Operating System Config; HDFS; Networking; Hive; Pig; HBase; User Defined Software Installation
      • Multiple Hadoop Distributions - Open Source & MapR
      • Clusters Launched with 1 Command, Up in 5 Minutes
      • Hard Partitioned per Customer on CPU, Memory and Disk
      • Dynamic Cluster Resizing
      • In any of 8 Regions around the Globe
  • 39. Lower Overall Costs Cheaper | Spot Market Management
  • 40. Lower TCO: June 2013 Study by Accenture Technology Labs; Not Sponsored or Funded by Amazon. “Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…” http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
  • 41. Spot 101 - What are Spot Instances
      • Spot allows customers to bid on unused EC2 capacity
      • Spot price is based on supply/demand of instance types in an Availability Zone
      • Customers are fulfilled when their bid price is higher than the Spot price
      • Instances will be interrupted when the Spot price exceeds the bid price
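The fulfilment and interruption rules above can be read as a small state machine. The sketch below is one possible interpretation, not a description of how EC2 implements Spot. The price series is made up, the bid of 0.40 is borrowed from the `--bid-price .4` example on the next slide, and keeping the current state when the two prices are exactly equal is an assumption, since the slide does not cover that case.

```python
def simulate_spot(bid_price: float, spot_prices: list[float]) -> list[bool]:
    """Return, for each observed Spot price, whether the instance is running.

    Rules taken from the slide:
      - a request is fulfilled when the bid is higher than the Spot price
      - a running instance is interrupted when the Spot price exceeds the bid
    Assumption: when the two prices are equal, the current state is kept.
    """
    running = False
    states = []
    for spot_price in spot_prices:
        if not running and bid_price > spot_price:
            running = True   # request fulfilled
        elif running and spot_price > bid_price:
            running = False  # instance interrupted
        states.append(running)
    return states


# Hypothetical Spot price history (dollars per hour) for one Availability Zone.
prices = [0.25, 0.35, 0.40, 0.55, 0.30]
states = simulate_spot(bid_price=0.40, spot_prices=prices)

for price, running in zip(prices, states):
    print(f"spot={price:.2f} -> {'running' if running else 'not running'}")
```

The interruption at 0.55 is the reason the later slides suggest mixing Spot with On-Demand instances for work that must not stop.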
  • 42. elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4
  • 43. Reducing Hadoop Costs with Spot: Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
      • Scenario #1, cost without Spot: Job Flow Duration 14 Hours; 4 instances * 14 hrs * $0.50 = $28
      • Scenario #2, cost with Spot: Job Flow Duration 7 Hours; 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75
      • Time Savings: 50%; Cost Savings: ~20%
      • Other EMR + Spot Use Cases: Run entire cluster on Spot for biggest cost savings; Reduce the cost of application testing
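The two scenarios can be checked with a short calculation. This sketch uses only the instance counts, durations, and hourly rates printed on the slide, and the helper name `cluster_cost` is made up for illustration. It shows that the exact cost saving is 18.75%, which the slide rounds to ~20%.

```python
from decimal import Decimal

# Hourly rates as printed on the slide.
ON_DEMAND_RATE = Decimal("0.50")
SPOT_RATE = Decimal("0.25")


def cluster_cost(instances: int, hours: int, hourly_rate: Decimal) -> Decimal:
    """Cost of running a group of identical instances for a number of hours."""
    return instances * hours * hourly_rate


# Scenario #1: 4 On-Demand instances for 14 hours.
cost_without_spot = cluster_cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances, for 7 hours.
cost_with_spot = cluster_cost(4, 7, ON_DEMAND_RATE) + cluster_cost(5, 7, SPOT_RATE)

time_savings = 1 - Decimal(7) / Decimal(14)
cost_savings = 1 - cost_with_spot / cost_without_spot

print(f"Without Spot: ${cost_without_spot:.2f}")
print(f"With Spot:    ${cost_with_spot:.2f}")
print(f"Time savings: {float(time_savings):.0%}")
print(f"Cost savings: {float(cost_savings):.2%}")
```

The saving comes from halving the duration: the added Spot capacity costs $8.75, but the On-Demand portion drops from $28 to $14.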
  • 47. Populate as demand dictates…
  • 49. And turn it off when you are done
  • 50. EMR is Hadoop… …cheaper, easier, and more agile
  • 51. What’s New?
      • MapR M7 Introduction
      • Optimised for HBase Clusters
      • Failure Recovery
      • Point in Time Recovery Snapshotting
      • Low Latency Hadoop Optimisations
      • HBase Mirroring
      • NFS + HDFS
      • MapR M5 Price Drop
      • Support for Pig 0.11.1
      • RANK, CUBE & ROLLUP capability
      • Groovy UDFs
      • Support for Guava Functions
      • Performance Improvements
      • Spark/Shark Bootstrap Action
      • In Memory Hadoop
      • Spark Scripting (similar to Pig)
      • Shark Shell with Hive Interoperability