Big Data and Hadoop in the Cloud

Big Data and Hadoop in the Cloud - Presentation given at the Colombia 3.0 conference in Bogotá, Colombia


  1. 1. Jose Papo, Amazon Evangelist, @josepapo
  2. 2. HANDS-ON DEMOS AFTER THE BIG DATA SESSION
  3. 3. The cloud is the driver of the new technology trends
  4. 4. Accelerating the startup boom
  5. 5. Optimizing the corporate world
  6. 6. #1 ●○○○○
  7. 7. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance. We are constantly producing more data
  8. 8. From all types of industries
  9. 9. Collect, Store, Organize, Analyze & Share
  10. 10. 3Vs
  11. 11. 27 TB per day Large Hadron Collider – CERN
  12. 12. The Role of Data is Changing
  13. 13. Until now, the questions you asked drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”.
  14. 14. Data is the new raw material for any business, on par with capital, people, and labor
  15. 15. Data Actionable Information
  16. 16. Generated data / Available for analysis. Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
  17. 17. Data Strategist
  18. 18. 1.1M peak requests/sec
  19. 19. lunch hours last year?
  20. 20. select productId, count(*) from page_hits where hour in (12,13) group by productId order by count(*) desc cat *-(12|13) | cut -f3 | sort | uniq -c > out Hit <enter>?
  21. 21. 1 PB = 10^15 (1,000,000,000,000,000) bytes. Reading 1 PB at 50 MB/s takes about 231 days
  22. 22. Solution: Massively Parallel Processing
  23. 23. #2 ○●○○○
  24. 24. HDFS Reliable storage MapReduce Data analysis
  25. 25. Very large log (e.g., TBs)
  26. 26. Very large log (e.g., TBs) Lots of actions by John
  27. 27. Very large log (e.g., TBs) Split into small pieces Lots of actions by John
  28. 28. Very large log (e.g., TBs) Process in a Hadoop cluster Split into small pieces Lots of actions by John
  29. 29. Very large log (e.g., TBs) John’s history Process in a Hadoop cluster Aggregate the results Split into small pieces Lots of actions by John
  30. 30. map Input file reduce Output file Worker node
  31. 31. map Input file reduce Output file map Input file reduce Output file map Input file reduce Output file Worker node Worker node Worker node
  32. 32. How can we help John? Very large log (e.g., TBs) Actionable Insight
  33. 33. Deploying a Hadoop Cluster is Hard
  34. 34. #3 ♥ ○○●○○
  36. 36. Elastic On Demand Pay as you go Focus on YOUR business
  37. 37. Elastic On Demand Pay as you go Focus on YOUR business
  38. 38. November
  39. 39. Provisioned capacity November
  40. 40. 76% 24% Provisioned capacity November
  41. 41. November
  42. 42. On and Off Fast Growth Variable Peaks Predictable Peaks
  43. 43. On and Off Fast Growth Predictable Peaks Variable Peaks WASTE CUSTOMER DISSATISFACTION
  44. 44. Fast Growth On and Off Predictable peaks Variable peaks
  45. 45. #4 ○○○●○
  46. 46. EMR is Hadoop in the Cloud
  47. 47. Media/Advertising Targeted Advertising Image and Video Processing Oil & Gas Seismic Analysis Retail Recommendations Transactions Analysis Life Sciences Genome Analysis Financial Services Monte Carlo Simulations Risk Analysis Security Anti-virus Fraud Detection Image Recognition Social Network/Gaming User Demographics Usage analysis In-game metrics
  48. 48. [Chart; axis from 0 to 6,000,000]
  49. 49. Versions 1.0.3 0.20.205 0.20 0.18 Distributions Apache Hadoop
  50. 50. Job Flows Custom JAR Cascading Streaming Ruby, Perl, Python, PHP, R, Bash, C++
  51. 51. Data Warehouse for Hadoop SQL-like query language Hive
  52. 52. High-level programming Ideal for data flow / ETL Pig
  53. 53. Near real time key/value store for structured data HBase
  54. 54. Distributed monitoring of cluster and nodes Ganglia
  55. 55. Statistical computing and graphics Machine learning library discover Value in Data
  56. 56. Unknown Unknowns
  57. 57. Elastic On Demand Pay as you go Focus on YOUR business
  58. 58. Undifferentiated Heavy Lifting Focus on YOUR business
  59. 59. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log (Instance type/count)
  60. 60. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log (Adding Hive, Pig and HBase to the job flow)
  61. 61. Elastic On Demand Pay as you go Focus on YOUR business
  62. 62. 1 instance for 1000 hours = 1000 instances for 1 hour
  63. 63. …to Thousands
  64. 64. Turn Off the Resources and Stop Paying
  65. 65. Elastic On Demand Pay as you go Focus on YOUR business
  66. 66. 70% lower 5-year TCO per app: on-premises $3.01M vs. AWS $0.90M. 50% reduction in analytics costs. Source: IDC Whitepaper, sponsored by Amazon, “The Business Value of Amazon Web Services Accelerates Over Time,” July 2012.
  67. 67. Save more money by using Spot Instances
  68. 68. 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 EMR with Spot Instances
  69. 69. 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 EMR with Spot Instances 14 hrs
  70. 70. 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 7 hrs EMR with Spot Instances
  71. 71. With Spot 4 instances * 7 hrs * $0.50 = $14 + 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 EMR with Spot Instances 7 hrs
  72. 72. With Spot 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75 Total = $22.75 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 EMR with Spot Instances 7 hrs
  73. 73. Time -50% Cost -19% With Spot 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75 Total = $22.75 14 hrs Without Spot 4 instances * 14 hrs * $0.50 = $28 EMR with Spot Instances 7 hrs
  74. 74. #5 ○○○○●
  75. 75. “What kind of movies do people like?”
  76. 76. More than 25 million streaming members; 50 billion events per day; 30 million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; device location, time, day, week, etc.; social data
  77. 77. 10 TB of streaming data per day
  78. 78. ~1 PB of data stored in Amazon S3 S3
  79. 79. Wide range of processing languages used EMR Prod Cluster (EMR) S3
  80. 80. Data consumed in multiple ways S3 EMR Prod Cluster (EMR) Recommendation Engine Ad-hoc Analysis Personalization
  81. 81. EMR S3 EMR EMR Prod Cluster (EMR) Query Cluster (EMR) EMR EMR
  82. 82. Durability
  83. 83. Versioning
  84. 84. Foursquare… 33 million users 1.3 million businesses …generates a lot of Data 3.5 billion check-ins 15M+ venues, Terabytes of log data
  85. 85. Uses EMR for Evaluation of new features Machine learning Exploratory analysis Daily customer usage reporting Long-term trend analysis
  86. 86. Benefits of EMR Ease-of-Use “We have decreased the processing time for urgent data-analysis” Flexibility To deal with changing requirements & dynamically expand reporting clusters Costs “We have reduced our analytics costs by over 50%”
  87. 87. Application Stack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases Logs Data Stack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  91. 91. [Charts: share by gender (Female, Male), scale 0 to 0.6; distribution by age, 0 to 80]
  92. 92. [Chart: Gorilla Coffee, Gray's Papaya, Amorino by day, Thursday to Sunday]
  93. 93. Python library https://github.com/Yelp/mrjob
  94. 94. Log files 250 EMR clusters spun up and down every week
  95. 95. Common Crawl 1000 Genomes Project Census Data 54 other datasets http://aws.amazon.com/publicdatasets/
  96. 96. Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs. Solution: Clusters of 100s of nodes on EMR running 4-5 hours at a time. Leverages the 1000 Genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.
  97. 97. Challenge: Volatile weather is deadly to crops like grapes. Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TB of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
  98. 98. 150B Soil Observations 3M Daily Weather Measurements 850K Precision Rainfall Grids Tracked 200 TB in Amazon S3
  99. 99. Big Data and AWS Cloud
  100. 100. Elastic and scalable + No upfront CapEx + Pay per use + On demand = Remove constraints
  101. 101. Remove constraints = More experimentation
  102. 102. More experimentation = More innovation
  103. 103. Focus on your business Leave undifferentiated heavy lifting to us
  104. 104. THANK YOU! slideshare.net/AmazonWebServicesLATAM http://aws.amazon.com/es/big-data/ José Papo AWS Tech Evangelist @josepapo
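
The SQL query on slide 20 (hits per product during lunch hours) can be sketched in plain Python. This is only an illustration: the function name `lunch_hour_hits` and the sample rows are hypothetical, and the rows are assumed to be dictionaries carrying the `productId` and `hour` columns that the query names.

```python
from collections import Counter


def lunch_hour_hits(page_hits, hours=(12, 13)):
    """Count hits per productId for rows whose hour is in `hours`, most-hit first."""
    counts = Counter(
        row["productId"] for row in page_hits if row["hour"] in hours
    )
    # most_common() sorts by count descending, like "order by count(*) desc".
    return counts.most_common()


# Hypothetical sample rows standing in for the page_hits table.
sample_hits = [
    {"productId": "A", "hour": 12},
    {"productId": "B", "hour": 12},
    {"productId": "A", "hour": 13},
    {"productId": "C", "hour": 9},
    {"productId": "A", "hour": 18},
]

print(lunch_hour_hits(sample_hits))
```

On a single machine this works only while the data fits in memory, which is the limitation the slides go on to address.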
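
The 231-day figure on slide 21 follows from dividing 10^15 bytes by 50 MB/s. The sketch below reproduces that arithmetic; the 1,000-reader case is an assumed example (not from the slides) to show why slide 22 turns to parallel processing, and it ignores coordination overhead.

```python
PETABYTE_BYTES = 10**15
SECONDS_PER_DAY = 86_400


def days_to_read(total_bytes, bytes_per_second):
    """Days needed to read total_bytes at a sustained rate of bytes_per_second."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


single_reader_days = days_to_read(PETABYTE_BYTES, 50 * 10**6)

# Assumption: 1,000 readers working in parallel, each at 50 MB/s.
parallel_hours = days_to_read(PETABYTE_BYTES, 1000 * 50 * 10**6) * 24

print(f"one reader: {single_reader_days:.1f} days")
print(f"1,000 readers: {parallel_hours:.1f} hours")
```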
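
Slides 25 to 31 describe splitting a large log into pieces, processing each piece, and aggregating the results into John's history. A minimal single-process sketch of that flow follows. It is not Hadoop code: the tab-separated `user<TAB>action` log format, the function names, and the sample lines are all assumptions made for illustration, and on a real cluster the map calls would run in parallel on worker nodes.

```python
from collections import defaultdict
from itertools import chain


def split(lines, chunk_size):
    """Split the log into small pieces of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (user, action) pair for each 'user<TAB>action' line."""
    pairs = []
    for line in chunk:
        user, action = line.split("\t", 1)
        pairs.append((user, action))
    return pairs


def reduce_by_user(pairs):
    """Reduce step: aggregate all actions under the user who performed them."""
    history = defaultdict(list)
    for user, action in pairs:
        history[user].append(action)
    return dict(history)


# Hypothetical log lines; a real log would be terabytes.
log = ["john\tview:42", "mary\tview:7", "john\tbuy:42", "john\tview:9"]

chunks = split(log, 2)
mapped = [map_chunk(chunk) for chunk in chunks]  # one map call per piece
history = reduce_by_user(chain.from_iterable(mapped))

print(history["john"])
```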
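
The pay-as-you-go arithmetic on slide 62 and the Spot comparison on slides 68 to 73 can be checked with a few lines. The helper name `cluster_cost` is hypothetical; the instance counts, hours, and hourly prices are the ones shown on the slides. With those inputs the cost saving works out to 18.75%.

```python
def cluster_cost(instances, hours, hourly_price):
    """Total cost of running `instances` machines for `hours` at `hourly_price` each."""
    return instances * hours * hourly_price


# Slide 62: the bill depends on instance-hours, not on how they are arranged.
one_for_a_thousand_hours = cluster_cost(1, 1000, 0.50)
a_thousand_for_one_hour = cluster_cost(1000, 1, 0.50)

# Slides 68-73: 4 on-demand instances for 14 hours, versus the same
# 4 instances plus 5 Spot instances finishing in 7 hours.
without_spot = cluster_cost(4, 14, 0.50)
with_spot = cluster_cost(4, 7, 0.50) + cluster_cost(5, 7, 0.25)

cost_saving = (without_spot - with_spot) / without_spot
time_saving = (14 - 7) / 14

print(f"without Spot: ${without_spot:.2f}, with Spot: ${with_spot:.2f}")
print(f"cost saving: {cost_saving:.2%}, time saving: {time_saving:.0%}")
```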