Markku Lepistö
Technology Evangelist
Amazon Web Services
@markkulepisto
#1
●○○○○
We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation
form whe...
Collect,
Store,
Organize,
Analyze &
Share
Volume

Velocity

Variety
3Vs!
The Role of Data
is Changing
Data
Actionable Information
Generated data
Available for analysis
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastr...
1.1M peak
requests/sec
lunch hours last year?
select productId, count(*)
from page_hits
where hour in (12,13)
group by productId
order by count(*) desc
cat ...
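The query above counts page hits per product during the lunch hours and lists the most viewed products first. A minimal Python sketch of the same aggregation follows. It assumes each page hit is a dict with `productId` and `hour` keys; that record layout and the function name `top_products` are illustrative assumptions, not something the slides define.

```python
from collections import Counter


def top_products(page_hits, hours=(12, 13)):
    """Count hits per product for the given hours, most viewed first.

    Mirrors: select productId, count(*) from page_hits
             where hour in (12,13) group by productId order by count(*) desc
    """
    counts = Counter(
        hit["productId"] for hit in page_hits if hit["hour"] in hours
    )
    # most_common() returns (productId, count) pairs sorted by count, descending
    return counts.most_common()


# Hypothetical sample data, only to show the shape of the input.
sample_hits = [
    {"productId": "a", "hour": 12},
    {"productId": "b", "hour": 13},
    {"productId": "a", "hour": 13},
    {"productId": "c", "hour": 9},
]
print(top_products(sample_hits))
```

This works on a single machine only while the hit log fits in memory, which is the limitation the following slides address.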
1PB = 10^15 (1,000,000,000,000,000) bytes
1 PB = 231 days at 50MB/s
Solution: Massively Parallel Processing
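The 231-day figure can be checked with a few lines of arithmetic. The sketch below also shows how the time shrinks when the read is spread over many workers, under the simplifying assumption of perfectly linear scaling with no coordination overhead; the worker count of 1,000 is an illustrative value, not a number from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_scan(num_bytes, bytes_per_second, workers=1):
    """Days needed to read num_bytes, assuming ideal linear scaling."""
    return num_bytes / (bytes_per_second * workers) / SECONDS_PER_DAY


PETABYTE = 10**15          # bytes, as defined on the slide
RATE = 50 * 10**6          # 50 MB/s

single = days_to_scan(PETABYTE, RATE)
parallel = days_to_scan(PETABYTE, RATE, workers=1000)  # assumed worker count

print(f"1 worker: {single:.1f} days")
print(f"1000 workers: {parallel * 24:.1f} hours")
```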
#2
○●○○○
HDFS
Reliable storage
MapReduce
Data analysis
Very large log (e.g. TBs)
Very large log (e.g. TBs)
Lots of actions by John
Very large log (e.g. TBs)
Split into small pieces
Lots of actions by John
Very large log (e.g. TBs)
Process in a Hadoop cluster
Split into small pieces
Lots of actions by John
Very large log (e.g. TBs)
John’s history
Process in a Hadoop cluster
Aggregate the results
Split into small pieces
Lot...
map
Input
file reduce
Output
file
Worker node
map
Input
file reduce
Output
file
map
Input
file reduce
Output
file
map
Input
file reduce
Output
file
Worker node
Worker n...
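The split, map, and reduce steps pictured above can be sketched in plain Python on one machine. This is only an in-memory illustration of the idea, not Hadoop code: the tab-separated `user<TAB>action` log format and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import islice


def split_into_chunks(lines, chunk_size):
    """Split the large log into small pieces."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Map step: emit (user, action) pairs for one piece of the log."""
    pairs = []
    for line in chunk:
        user, _, action = line.partition("\t")
        if action:  # skip lines without a tab-separated action
            pairs.append((user, action))
    return pairs


def shuffle(mapped_chunks):
    """Group all emitted values by key (user)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for user, action in pairs:
            grouped[user].append(action)
    return grouped


def reduce_user(actions):
    """Reduce step: aggregate one user's actions into a small summary."""
    return {"count": len(actions), "history": actions}


def run(lines, chunk_size=2):
    # On a cluster, each chunk would be mapped on a different worker node.
    mapped = [map_chunk(c) for c in split_into_chunks(lines, chunk_size)]
    return {user: reduce_user(acts) for user, acts in shuffle(mapped).items()}


log = ["john\tlogin", "mary\tsearch", "john\tview", "john\tbuy"]
print(run(log)["john"])
```

In Hadoop the same three stages run across worker nodes, with HDFS holding the input and output files.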
How can we help John?
Very large log (e.g. TBs)
Actionable Insight
Deploying a Hadoop Cluster is Hard
#3
♥
○○●○○
Elastic On Demand
Pay as you go
Focus on
YOUR
business
[Figure: November capacity chart; labels: Provisioned capacity, 76%, 24%]
On and Off, Fast Growth
Variable Peaks, Predictable Peaks
On and Off, Fast Growth
Predictable Peaks, Variable Peaks
WASTE
CUSTOMER DISSATISFACTION
Fast Growth, On and Off
Predictable Peaks, Variable Peaks
#4
○○○●○
EMR is Hadoop in the Cloud!
Media/
Advertising
Targeted
Advertising
Image and
Video
Processing
Oil & Gas
Seismic
Analysis
Retail
Recommendations
Trans...
[Chart: axis from 0 to 6,000,000]
Versions
1.0.3
0.20.205
0.20
0.18
Distributions
Apache Hadoop
Job Flows
Custom JAR
Cascading
Streaming
Ruby, Perl, Python, PHP, R, Bash, C++
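With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, which is why scripting languages can be used. The sketch below shows that convention in Python as two generator functions so it can run locally. The choice of the third tab-separated field as the key and the function names are assumptions for illustration; in a real streaming job each function would be its own script looping over `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for the third tab-separated field of each line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            yield f"{fields[2]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key,
    which is what the framework's shuffle/sort phase provides."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local simulation of map -> sort -> reduce on hypothetical log lines.
log_lines = ["t1\tu1\tp1", "t2\tu2\tp2", "t3\tu1\tp1"]
for out in reducer(sorted(mapper(log_lines))):
    print(out)
```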
Data Warehouse for Hadoop
SQL-like query language
Hive
High-level programming
Ideal for data flow / ETL
Pig
Near real time key/value
store for structured data
HBase
Distributed monitoring
of cluster and nodes
Ganglia
Statistical computing
and graphics
Machine learning library
discover Value in Data
Data Strategist
Unknown Unknowns
Elastic On Demand
Pay as you go
Focus on
YOUR
business
Undifferentiated
Heavy Lifting
Focus on
YOUR
business
elastic-mapreduce
--create
--key-pair micro
--region eu-west-1
--name MyJobFlow
--num-instances 5
--instance-type m2.4xlar...
elastic-mapreduce
--create
--key-pair micro
--region eu-west-1
--name MyJobFlow
--num-instances 5
--instance-type m2.4xlar...
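The same job flow settings can also be expressed programmatically. The sketch below builds a request dictionary using parameter names from the Amazon EMR `RunJobFlow` API, which the boto3 SDK exposes as `run_job_flow`. Using boto3 here is an assumption of this example, since the slides only show the `elastic-mapreduce` command line tool. The helper name `build_job_flow_request` is hypothetical, and a real request needs further fields (for example a release label and IAM roles) that are omitted.

```python
def build_job_flow_request(name, key_pair, instance_type, instance_count,
                           log_uri, keep_alive=True):
    """Build a partial RunJobFlow request from the CLI-style settings.

    Assumption: the same instance type is used for master and worker nodes,
    as the single --instance-type flag on the slide suggests.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_pair,
            # Counterpart of --alive: keep the cluster running with no steps
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
    }


request = build_job_flow_request(
    name="MyJobFlow",
    key_pair="micro",
    instance_type="m2.4xlarge",
    instance_count=5,
    log_uri="s3n://mybucket/EMR/log",
)
print(request)

# Not executed here (needs AWS credentials and the omitted required fields):
#   import boto3
#   boto3.client("emr", region_name="eu-west-1").run_job_flow(**request)
```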
Elastic On Demand
Pay as you go
Focus on
YOUR
business
1 instance for 1000 hours
=
1000 instances for 1 hour
…to Thousands
Turn Off the Resources and Stop Paying
Elastic On Demand
Pay as you go
Focus on
YOUR
business
Source: IDC Whitepaper, sponsored by Amazon, The Business Value of Amazon Web Services Accelerates Over Time. July 2012
70...
Save more money by using Spot Instances
[Figure: Bid Dist...; axes: Percentage of the Distribution (0% to 20%), Bid Price as Percentage of the On-Demand Price]
14 hrs
Without Spot
4 instances * 14 hrs * $0.50 = $28
EMR with Spot Instances
14 hrs
Without Spot
4 instances * 14 hrs * $0.50 = $28
EMR with Spot Instances
14 hrs
14 hrs
Without Spot
4 instances * 14 hrs * $0.50 = $28
7 hrs
EMR with Spot Instances
With Spot
4 instances * 7 hrs * $0.50 = $14 +
14 hrs
Without Spot
4 instances * 14 hrs * $0.50 = $28
EMR with Spot ...
With Spot
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
14 hrs
Without Spot...
Time -50%
Cost -22%
With Spot
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75...
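The Spot comparison is simple instance-hour arithmetic and can be reproduced directly from the numbers on the slides. In the sketch below the function name `cluster_cost` is an illustrative assumption; the instance counts, hours, and hourly prices are the ones shown above. Computed from these totals, the cost saving comes to 18.75%, which is somewhat lower than the cost figure on the summary slide.

```python
def cluster_cost(instances, hours, hourly_price):
    """Cost of running a group of instances for a number of hours."""
    return instances * hours * hourly_price


# Without Spot: 4 On-Demand instances for 14 hours
without_spot = cluster_cost(4, 14, 0.50)

# With Spot: 4 On-Demand plus 5 Spot instances, finishing in 7 hours
with_spot = cluster_cost(4, 7, 0.50) + cluster_cost(5, 7, 0.25)

time_saving = 1 - 7 / 14
cost_saving = 1 - with_spot / without_spot

print(f"Without Spot: ${without_spot:.2f}")
print(f"With Spot:    ${with_spot:.2f}")
print(f"Time saving:  {time_saving:.0%}")
print(f"Cost saving:  {cost_saving:.2%}")
```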
#5
○○○○●
What kind of movies do people like ?
More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3
...
10 TB of streaming data per day
Data Center
S3
Netflix Data Center
Legacy data from on-premise
data center
Legacy Data
Customer dimension data
stored in Cassandra
~1 PB of data stored in Amazon S3
S3
Wide range of processing languages used
EMR
Prod Cluster
(EMR)
S3
Data consumed in multiple ways
S3
EMR
Prod Cluster
(EMR)
Recommendation
Engine
Ad-hoc
Analysis Personalization
EMR
S3
EMR
EMR
Prod Cluster
(EMR)
Query Cluster
(EMR)
EMR
EMR
Durability
Versioning
Foursquare…
33 million users
1.3 million businesses
…generates a lot of Data
3.5 billion check-ins
15M+ venues,
Terabytes ...
Uses EMR for
Evaluation of new features
Machine learning
Exploratory analysis
Daily customer usage reporting
Long-term tre...
Benefits of EMR
Ease-of-Use
“We have decreased the processing time for urgent data-analysis”
Flexibility
To deal with chan...
Application Stack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/
Flat Files
Data...
[Figure: charts by Gender (Female, Male) and Age (0 to 80); value axis 0 to 0.6]
[Figure: Gorilla Coffee, Gray's Papaya, Amorino; Thursday, Friday, Saturday, Sunday]
Who is using our service?
Finding signal in the noise of logs
Python library
https://github.com/Yelp/mrjob
Log files
250 EMR clusters spun up
and down every week
Common Crawl
1000 Genomes Project
Census Data
54 other datasets
http://aws.amazon.com/publicdatasets/
Challenge:
Large amounts of computing resources needed for short periods of time; significant data storage costs
Solution...
Challenge:
Volatile weather is deadly to crops like grapes
Solution:
Built a predictive model based on freely available ...
150B Soil
Observations
3M Daily Weather
Measurements
850K Precision Rainfall
Grids Tracked
200 TB in Amazon S3
Training
Videos
Basic Overview
Documentation
Getting Started Guide
Developer Guide
API Reference
FAQs
Think Big Training
(...
Amazon Elastic MapReduce
Elastic and scalable
No upfront CapEx
Pay per use
+
+
On demand
+
=
Remove
constraints
Remove constraints = More experimentation
More experimentation = More innovation
Focus on your business
Leave undifferentiated heavy lifting to us
Thank you!
aws.amazon.com/big-data
@markkulepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
Published in: Sports, Technology, Business

  1. 1. Markku Lepistö Technology Evangelist Amazon Web Services @markkulepisto
  2. 2. #1 ●○○○○
  3. 3. We are constantly producing more data
  4. 4. From all types of industries
  5. 5. Collect, Store, Organize, Analyze & Share
  6. 6. Volume, Velocity, Variety: 3Vs!
  7. 7. The Role of Data is Changing
  8. 8. Until now, the questions you ask drove the data model. The new model is to collect as much data as possible: “Data-First Philosophy”
  9. 9. Data is the new raw material for any business, on par with capital, people, and labor
  10. 10. Data Actionable Information
  11. 11. Generated data Available for analysis Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  12. 12. 1.1M peak requests/sec
  13. 13. lunch hours last year?
  14. 14. select productId, count(*)
 from page_hits
 where hour in (12,13)
 group by productId
 order by count(*) desc
 cat *-(12|13) | cut -f3 | sort | uniq -c > out
 Hit <enter>?
  15. 15. 1PB = 10^15 (1,000,000,000,000,000) bytes 1 PB = 231 days at 50MB/s
  16. 16. Solution: Massively Parallel Processing
  17. 17. #2 ○●○○○
  18. 18. HDFS Reliable storage MapReduce Data analysis
  19. 19. Very large log (e.g. TBs)
  20. 20. Very large log (e.g. TBs). Lots of actions by John
  21. 21. Very large log (e.g. TBs). Split into small pieces. Lots of actions by John
  22. 22. Very large log (e.g. TBs). Process in a Hadoop cluster. Split into small pieces. Lots of actions by John
  23. 23. Very large log (e.g. TBs). John’s history. Process in a Hadoop cluster. Aggregate the results. Split into small pieces. Lots of actions by John
  24. 24. map Input file reduce Output file Worker node
  25. 25. map Input file reduce Output file map Input file reduce Output file map Input file reduce Output file Worker node Worker node Worker node
  26. 26. How can we help John? Very large log (e.g. TBs). Actionable Insight
  27. 27. Deploying a Hadoop Cluster is Hard
  28. 28. #3 ♥ ○○●○○
  29. 29. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
  30. 30. Elastic On Demand Pay as you go Focus on YOUR business
  31. 31. Elastic On Demand Pay as you go Focus on YOUR business
  32. 32. November
  33. 33. Provisioned capacity November
  34. 34. 76% 24% Provisioned capacity November
  35. 35. November
  36. 36. On and Off, Fast Growth, Variable Peaks, Predictable Peaks
  37. 37. On and Off, Fast Growth, Predictable Peaks, Variable Peaks: WASTE, CUSTOMER DISSATISFACTION
  38. 38. Fast Growth, On and Off, Predictable Peaks, Variable Peaks
  39. 39. #4 ○○○●○
  40. 40. EMR is Hadoop in the Cloud!
  41. 41. Media/Advertising: Targeted Advertising, Image and Video Processing. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transactions Analysis. Life Sciences: Genome Analysis. Financial Services: Monte Carlo Simulations, Risk Analysis. Security: Anti-virus, Fraud Detection, Image Recognition. Social Network/Gaming: User Demographics, Usage analysis, In-game metrics
  42. 42. [Chart: axis from 0 to 6,000,000]
  43. 43. Versions 1.0.3 0.20.205 0.20 0.18 Distributions Apache Hadoop
  44. 44. Job Flows Custom JAR Cascading Streaming Ruby, Perl, Python, PHP, R, Bash, C++
  45. 45. Data Warehouse for Hadoop SQL-like query language Hive
  46. 46. High-level programming Ideal for data flow / ETL Pig
  47. 47. Near real time key/value store for structured data HBase
  48. 48. Distributed monitoring of cluster and nodes Ganglia
  49. 49. Statistical computing and graphics Machine learning library discover Value in Data
  50. 50. Data Strategist
  51. 51. Unknown Unknowns
  52. 52. Elastic On Demand Pay as you go Focus on YOUR business
  53. 53. Undifferentiated Heavy Lifting. Focus on YOUR business
  54. 54. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log (Instance type/count)
  55. 55. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log (Adding Hive, Pig and HBase to the job flow)
  56. 56. Elastic On Demand Pay as you go Focus on YOUR business
  57. 57. 1 instance for 1000 hours = 1000 instances for 1 hour
  58. 58. …to Thousands
  59. 59. Turn Off the Resources and Stop Paying
  60. 60. Elastic On Demand Pay as you go Focus on YOUR business
  61. 61. Source: IDC Whitepaper, sponsored by Amazon, The Business Value of Amazon Web Services Accelerates Over Time. July 2012. 70% lower 5 year TCO per app: On-premises $3.01M, AWS $0.90M. 50% reduction in analytics costs
  62. 62. Save more money by using Spot Instances
  63. 63. Typical Spot Bidding Strategies [Figure: Bid Distribution; axes: Percentage of the Distribution (0% to 20%), Bid Price as Percentage of the On-Demand Price]
  64. 64. EMR with Spot Instances. 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28
  65. 65. EMR with Spot Instances. 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28. 14 hrs
  66. 66. EMR with Spot Instances. 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28. 7 hrs
  67. 67. EMR with Spot Instances. With Spot: 4 instances * 7 hrs * $0.50 = $14 + ... 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28. 7 hrs
  68. 68. EMR with Spot Instances. With Spot: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75. 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28. 7 hrs
  69. 69. EMR with Spot Instances. Time -50%, Cost -22%. With Spot: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75. 14 hrs. Without Spot: 4 instances * 14 hrs * $0.50 = $28. 7 hrs
  70. 70. #5 ○○○○●
  71. 71. What kind of movies do people like?
  72. 72. More than 25 Million Streaming Members; 50 Billion Events Per Day; 30 Million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; Device location, time, day, week etc.; Social data
  73. 73. 10 TB of streaming data per day
  74. 74. Data Center, S3. Netflix Data Center: Legacy data from on-premise data center. Legacy Data
  75. 75. Customer dimension data stored in Cassandra
  76. 76. ~1 PB of data stored in Amazon S3 S3
  77. 77. Wide range of processing languages used. EMR, Prod Cluster (EMR), S3
  78. 78. Data consumed in multiple ways. S3, EMR, Prod Cluster (EMR): Recommendation Engine, Ad-hoc Analysis, Personalization
  79. 79. EMR, S3, EMR, EMR, Prod Cluster (EMR), Query Cluster (EMR), EMR, EMR
  80. 80. Durability
  81. 81. Versioning
  82. 82. Foursquare… 33 million users 1.3 million businesses …generates a lot of Data 3.5 billion check-ins 15M+ venues, Terabytes of log data
  83. 83. Uses EMR for Evaluation of new features Machine learning Exploratory analysis Daily customer usage reporting Long-term trend analysis
  84. 84. Benefits of EMR Ease-of-Use “We have decreased the processing time for urgent data-analysis” Flexibility To deal with changing requirements & dynamically expand reporting clusters Costs “We have reduced our analytics costs by over 50%”
  85. 85. Application Stack: Scala/Liftweb API Machines, WWW Machines, Batch Jobs, Scala Application code, Mongo/Postgres/Flat Files, Databases, Logs. Data Stack: Amazon S3, Database Dumps, Log Files, Hadoop Elastic Map Reduce, Hive/Ruby/Mahout, Analytics Dashboard, Map Reduce Jobs, mongoexport, postgres dump, Flume
  89. 89. [Figure: charts by Gender (Female, Male) and Age (0 to 80); value axis 0 to 0.6]
  90. 90. Gorilla Coffee, Gray's Papaya, Amorino; Thursday, Friday, Saturday, Sunday
  91. 91. Who is using our service?
  92. 92. Finding signal in the noise of logs
  93. 93. Python library https://github.com/Yelp/mrjob
  94. 94. Log files 250 EMR clusters spun up and down every week
  95. 95. Common Crawl 1000 Genomes Project Census Data 54 other datasets http://aws.amazon.com/publicdatasets/
  96. 96. Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs. Solution: Clusters of 100s of nodes on EMR running 4-5 hours at a time. Leverages 1000 Genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.
  97. 97. Challenge: Volatile weather is deadly to crops like grapes. Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TBs of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
  98. 98. 150B Soil Observations 3M Daily Weather Measurements 850K Precision Rainfall Grids Tracked 200 TB in Amazon S3
  99. 99. Training Videos Basic Overview Documentation Getting Started Guide Developer Guide API Reference FAQs Think Big Training (3-day Dev Course) EMR Bootcamp (on-site consulting)
  100. 100. Amazon Elastic MapReduce
  101. 101. Elastic and scalable No upfront CapEx Pay per use + + On demand + = Remove constraints
  102. 102. Remove constraints = More experimentation
  103. 103. More experimentation = More innovation
  104. 104. Focus on your business Leave undifferentiated heavy lifting to us
  105. 105. Thank you! aws.amazon.com/big-data @markkulepisto
