Big Data and Hadoop in the Cloud

Big Data and Hadoop in the Cloud - Presentation made in the conference Colombia 3.0 in Bogotá, Colombia

Big Data and Hadoop in the Cloud: Presentation Transcript

  • Jose Papo, Amazon Evangelist, @josepapo
  • HANDS-ON DEMOS AFTER THE BIG DATA SESSION
  • The cloud is the driver of new technology trends
  • Accelerating the startup boom
  • Optimizing the corporate world
  • #1 ●○○○○
  • We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance. We are constantly producing more data
  • From all types of industries
  • Collect, Store, Organize, Analyze & Share
  • 3Vs
  • 27 TB per day Large Hadron Collider – CERN
  • The Role of Data is Changing
  • Until now, the questions you asked drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”
  • Data is the new raw material for any business, on par with capital, people, and labor
  • Data → Actionable Information
  • Chart: generated data vs. data available for analysis. Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
  • Data Strategist
  • 1.1M peak requests/sec
  • lunch hours last year?
  • select productId, count(*) from page_hits where hour in (12,13) group by productId order by count(*) desc
    cat *-(12|13) | cut -f3 | sort | uniq -c > out
    Hit <enter>?
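On a log small enough for one machine, the same question can be answered in a few lines, which makes a useful baseline before reaching for a cluster. The following is a minimal sketch, assuming a hypothetical record layout of `(hour, product_id)` pairs; only the `page_hits` name comes from the slide's query, and the sample data is invented for illustration.

```python
from collections import Counter

def top_lunch_products(page_hits, lunch_hours=(12, 13)):
    """Count hits per product during the lunch hours, most-viewed first.

    page_hits is an iterable of (hour, product_id) pairs; this layout is an
    assumption made for the example, not the schema used in the talk.
    """
    counts = Counter(product_id for hour, product_id in page_hits
                     if hour in lunch_hours)
    # most_common() sorts by descending count, like ORDER BY count(*) DESC
    return counts.most_common()

# Invented sample data: two lunch-hour hits for "book", one for "lamp"
sample_hits = [(12, "book"), (13, "book"), (9, "book"), (12, "lamp"), (18, "lamp")]
print(top_lunch_products(sample_hits))
```

This approach stops being practical once the log no longer fits on, or cannot be read quickly by, a single machine.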
  • 1 PB = 10^15 (1,000,000,000,000,000) bytes; 1 PB = 231 days at 50 MB/s
  • Solution: Massively Parallel Processing
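The 231-day figure follows directly from the slide's numbers, and the same arithmetic shows why parallel reads help. This is a rough sketch that assumes perfectly even parallelism with no coordination overhead; the count of 1000 readers is an illustrative assumption, not a number from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def scan_days(total_bytes, bytes_per_second, readers=1):
    """Days needed to read total_bytes when `readers` workers read in parallel."""
    return total_bytes / (bytes_per_second * readers) / SECONDS_PER_DAY

ONE_PB = 10**15      # decimal petabyte, as on the slide
RATE = 50 * 10**6    # 50 MB/s per reader, as on the slide

single = scan_days(ONE_PB, RATE)
parallel = scan_days(ONE_PB, RATE, readers=1000)  # 1000 readers is an assumed figure

print(f"1 reader: {single:.0f} days")
print(f"1000 readers: {parallel * 24:.1f} hours")
```

Real clusters will not scale this cleanly, but the order-of-magnitude gap is the point of the slide.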
  • #2 ○●○○○
  • HDFS: reliable storage. MapReduce: data analysis.
  • A very large log (e.g., TBs) holds lots of actions by John. Split it into small pieces, process them in a Hadoop cluster, and aggregate the results to get John's history.
  • Each worker node: input file → map → reduce → output file (repeated across three worker nodes)
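The split, map, and aggregate steps can be imitated in memory to make the flow concrete. This is only a sketch under stated assumptions: the `"user action"` line format and all function names are hypothetical, and the chunks are processed one after another here rather than on separate worker nodes as Hadoop would do.

```python
from collections import defaultdict

def split_into_chunks(lines, chunk_size):
    """Split the log into small pieces, one per (simulated) worker."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_chunk(chunk):
    """Map step: turn raw 'user action' lines into (user, action) pairs."""
    pairs = []
    for line in chunk:
        user, action = line.split(" ", 1)
        pairs.append((user, action))
    return pairs

def reduce_pairs(mapped_chunks):
    """Reduce step: group actions by user across all mapped chunks."""
    history = defaultdict(list)
    for pairs in mapped_chunks:
        for user, action in pairs:
            history[user].append(action)
    return dict(history)

# Invented sample log; the format is an assumption for this example
log = [
    "john view:shoes",
    "mary view:hats",
    "john cart:shoes",
    "john buy:shoes",
]
chunks = split_into_chunks(log, chunk_size=2)
history = reduce_pairs(map_chunk(c) for c in chunks)
print(history["john"])
```

Because each chunk is mapped independently, the map step is the part that a cluster can spread across many nodes.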
  • How can we help John? Very large log (e.g., TBs) → actionable insight
  • Deploying a Hadoop Cluster is Hard
  • #3 ♥ ○○●○○
  • Elastic, On Demand, Pay as you go, Focus on YOUR business
  • Chart: provisioned capacity in November (labels: 76%, 24%)
  • On and Off, Fast Growth, Variable Peaks, Predictable Peaks (chart labels: WASTE, CUSTOMER DISSATISFACTION)
  • #4 ○○○●○
  • EMR is Hadoop in the Cloud
  • Media/Advertising: targeted advertising, image and video processing
    Oil & Gas: seismic analysis
    Retail: recommendations, transactions analysis
    Life Sciences: genome analysis
    Financial Services: Monte Carlo simulations, risk analysis
    Security: anti-virus, fraud detection, image recognition
    Social Network/Gaming: user demographics, usage analysis, in-game metrics
  • Versions: 1.0.3, 0.20.205, 0.20, 0.18. Distributions: Apache Hadoop
  • Job flows: custom JAR, Cascading, Streaming (Ruby, Perl, Python, PHP, R, Bash, C++)
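Streaming jobs are what let scripting languages take part: Hadoop Streaming feeds input lines to a mapper program, sorts the mapper's tab-separated key/value output, and feeds it to a reducer program. The sketch below mimics that contract with plain Python generators so it runs without a cluster. The `hour<TAB>product` input format is an assumption, and in a real job each function would read `sys.stdin` and print to standard output instead.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'product<TAB>1' line per hit; input lines are 'hour<TAB>product'."""
    for line in lines:
        _hour, product = line.rstrip("\n").split("\t")
        yield f"{product}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key; relies on the input being sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Invented sample input in the assumed format
raw = ["12\tbook", "13\tlamp", "12\tbook"]
# sorted() stands in for the shuffle/sort phase that runs between the two steps
result = list(reducer(sorted(mapper(raw))))
print(result)
```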
  • Hive: data warehouse for Hadoop, SQL-like query language
  • Pig: high-level programming, ideal for data flow / ETL
  • HBase: near-real-time key/value store for structured data
  • Ganglia: distributed monitoring of cluster and nodes
  • Statistical computing and graphics; machine learning library; discover value in data
  • Unknown Unknowns
  • Elastic, On Demand, Pay as you go, Focus on YOUR business
  • Undifferentiated Heavy Lifting Focus on YOUR business
  • Instance type/count
    elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log
  • Adding Hive, Pig and HBase to the job flow
    elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log
  • Elastic, On Demand, Pay as you go, Focus on YOUR business
  • 1 instance for 1000 hours = 1000 instances for 1 hour
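The equality holds whenever billing is purely per instance-hour. A small sketch, assuming a flat hourly rate with no minimum charges or volume discounts; the $0.50 rate is an illustrative assumption.

```python
def cluster_cost(instances, hours, hourly_rate):
    """Cost under simple per-instance-hour billing (no minimums or discounts)."""
    return instances * hours * hourly_rate

RATE = 0.50  # assumed hourly price in USD, for illustration only

slow = cluster_cost(instances=1, hours=1000, hourly_rate=RATE)
fast = cluster_cost(instances=1000, hours=1, hourly_rate=RATE)
print(slow, fast)
```

Both runs cost the same, but the second one delivers its answer in an hour instead of about six weeks.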
  • …to Thousands
  • Turn Off the Resources and Stop Paying
  • Elastic, On Demand, Pay as you go, Focus on YOUR business
  • 70% lower 5-year TCO per app: on-premises $3.01M vs. AWS $0.90M. 50% reduction in analytics costs. Source: IDC Whitepaper, sponsored by Amazon, “The Business Value of Amazon Web Services Accelerates Over Time,” July 2012
  • Save more money by using Spot Instances
  • EMR with Spot Instances
    Without Spot: 4 instances * 14 hrs * $0.50 = $28 (14 hrs)
    With Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; total = $22.75 (7 hrs)
    Time -50%, cost -19%
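The Spot comparison can be recomputed from the instance counts, hours, and prices on the slide. This is a simplified sketch: it treats the Spot price as fixed at the slide's $0.25 and ignores interruptions, both of which vary in practice. With these inputs the cost saving works out to 18.75%.

```python
def job_cost(groups):
    """Total cost for a list of (instances, hours, hourly_rate) groups."""
    return sum(instances * hours * rate for instances, hours, rate in groups)

without_spot = job_cost([(4, 14, 0.50)])
with_spot = job_cost([(4, 7, 0.50),    # on-demand instances
                      (5, 7, 0.25)])   # extra Spot instances

cost_saving = (without_spot - with_spot) / without_spot
time_saving = (14 - 7) / 14

print(f"${without_spot:.2f} -> ${with_spot:.2f}")
print(f"cost saving {cost_saving:.2%}, time saving {time_saving:.0%}")
```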
  • #5 ○○○○●
  • “What kind of movies do people like ?”
  • More than 25 million streaming members; 50 billion events per day; 30 million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; device location, time, day, week, etc.; social data
  • 10 TB of streaming data per day
  • ~1 PB of data stored in Amazon S3
  • Wide range of processing languages used (diagram: S3, Prod Cluster (EMR))
  • Data consumed in multiple ways: Recommendation Engine, Ad-hoc Analysis, Personalization (diagram: S3, Prod Cluster (EMR))
  • Diagram: S3, Prod Cluster (EMR), Query Cluster (EMR)
  • Durability
  • Versioning
  • Foursquare: 33 million users and 1.3 million businesses generate a lot of data: 3.5 billion check-ins, 15M+ venues, terabytes of log data
  • Uses EMR for: evaluation of new features, machine learning, exploratory analysis, daily customer usage reporting, long-term trend analysis
  • Benefits of EMR
    Ease of use: “We have decreased the processing time for urgent data-analysis”
    Flexibility: to deal with changing requirements and dynamically expand reporting clusters
    Costs: “We have reduced our analytics costs by over 50%”
  • Application stack: Scala/Liftweb API machines, WWW machines, batch jobs; Scala application code; Mongo/Postgres/flat files; databases; logs
    Data stack: Amazon S3 (database dumps, log files); Hadoop / Elastic Map Reduce; Hive/Ruby/Mahout; analytics dashboard; Map Reduce jobs
    Data transfer: mongoexport, postgres dump, Flume
  • Charts: Gender (Female, Male) and Age (0 to 80)
  • Chart: Gorilla Coffee, Gray's Papaya, Amorino; Thursday, Friday, Saturday, Sunday
  • Python library https://github.com/Yelp/mrjob
  • Log files 250 EMR clusters spun up and down every week
  • Common Crawl, 1000 Genomes Project, Census Data, and 54 other datasets: http://aws.amazon.com/publicdatasets/
  • Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs. Solution: Clusters of 100s of nodes on EMR running 4-5 hours at a time. Leverages the 1000 Genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.
  • Challenge: Volatile weather is deadly to crops like grapes. Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TB of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
  • 150B soil observations; 3M daily weather measurements; 850K precision rainfall grids tracked; 200 TB in Amazon S3
  • Big Data and AWS Cloud
  • Elastic and scalable + No upfront CapEx + Pay per use + On demand = Remove constraints
  • Remove constraints = More experimentation
  • More experimentation = More innovation
  • Focus on your business. Leave undifferentiated heavy lifting to us.
  • THANK YOU! slideshare.net/AmazonWebServicesLATAM, http://aws.amazon.com/es/big-data/. José Papo, AWS Tech Evangelist, @josepapo