COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto

Published in: Sports, Technology, Business

  1. 1. Markku Lepistö, Technology Evangelist, Amazon Web Services, @markkulepisto
  2. 2. #1 ●○○○○
  3. 3. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance. We are constantly producing more data
  4. 4. From all types of industries
  5. 5. Collect, Store, Organize, Analyze & Share
  6. 6. Volume, Velocity, Variety: the 3Vs!
  7. 7. The Role of Data is Changing
  8. 8. Until now, the questions you ask drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”
  9. 9. Data is the new raw material for any business, on par with capital, people, and labor
  10. 10. Data → Actionable Information
  11. 11. Generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  12. 12. 1.1M peak requests/sec
  13. 13. lunch hours last year?
  14. 14. select productId, count(*)
 from page_hits
 where hour in (12,13)
 group by productId
 order by count(*) desc
 cat *-(12|13) | cut -f3 | sort | uniq -c > out
 Hit <enter>?
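The query on this slide can be sketched in a few lines of Python for a data set that still fits on one machine. This is a minimal illustration, not the presenter's code: the record layout `(hour, product_id)` and the sample data are assumptions made up for the example.

```python
from collections import Counter


def top_products_at_lunch(page_hits, lunch_hours=(12, 13)):
    """Count hits per product for the given hours, most-hit product first.

    `page_hits` is an iterable of (hour, product_id) tuples; this record
    layout is a hypothetical stand-in for the page_hits table on the slide.
    """
    counts = Counter(
        product_id for hour, product_id in page_hits if hour in lunch_hours
    )
    return counts.most_common()


# Hypothetical sample data: (hour of day, product id)
sample_hits = [(12, "A"), (13, "B"), (12, "A"), (9, "C"), (13, "A")]
print(top_products_at_lunch(sample_hits))
```

On a single machine this works only while the whole log can be scanned in reasonable time, which is the limitation the following slides address.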
  15. 15. 1 PB = 10^15 (1,000,000,000,000,000) bytes; 1 PB = 231 days at 50 MB/s
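The 231-day figure follows directly from the two numbers on the slide. A short sketch of the arithmetic, assuming decimal units (1 MB = 10^6 bytes) and, for the parallel case, an idealized perfectly linear speed-up across workers that real clusters will not fully reach:

```python
PETABYTE_BYTES = 10**15
THROUGHPUT_BYTES_PER_SEC = 50 * 10**6  # 50 MB/s, decimal megabytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_read(total_bytes, bytes_per_sec, workers=1):
    """Days needed to scan `total_bytes`, assuming ideal linear scaling."""
    return total_bytes / (bytes_per_sec * workers) / SECONDS_PER_DAY


single = days_to_read(PETABYTE_BYTES, THROUGHPUT_BYTES_PER_SEC)
parallel = days_to_read(PETABYTE_BYTES, THROUGHPUT_BYTES_PER_SEC, workers=1000)

print(f"1 reader:      {single:.1f} days")
print(f"1,000 readers: {parallel * 24:.1f} hours (idealized)")
```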
  16. 16. Solution: Massively Parallel Processing
  17. 17. #2 ○●○○○
  18. 18. HDFS: reliable storage. MapReduce: data analysis
  19. 19. Very large log (e.g. TBs)
  20. 20. Very large log (e.g. TBs). Lots of actions by John
  21. 21. Very large log (e.g. TBs). Split into small pieces. Lots of actions by John
  22. 22. Very large log (e.g. TBs). Process in a Hadoop cluster. Split into small pieces. Lots of actions by John
  23. 23. Very large log (e.g. TBs). John’s history. Process in a Hadoop cluster. Aggregate the results. Split into small pieces. Lots of actions by John
  24. 24. Worker node: Input file → map → reduce → Output file
  25. 25. Three worker nodes, each: Input file → map → reduce → Output file
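The split, process, and aggregate steps from the preceding slides can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop itself: the log format (`"<user> <action>"` per line) and all function names are assumptions for the example, and in a real cluster each `map_piece` call would run on a different worker node.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(lines, piece_size):
    """Split the log into small pieces that could go to separate workers."""
    return [lines[i:i + piece_size] for i in range(0, len(lines), piece_size)]


def map_piece(piece, user):
    """Map step: count one user's actions within a single piece of the log."""
    counts = Counter()
    for line in piece:
        who, action = line.split(maxsplit=1)
        if who == user:
            counts[action] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical log lines in the assumed "<user> <action>" format
log = ["john view", "mary buy", "john buy", "john view", "mary view"]

pieces = split_into_pieces(log, piece_size=2)
partials = [map_piece(piece, "john") for piece in pieces]
history = reduce(reduce_counts, partials, Counter())
print(dict(history))
```

Because each piece is processed independently and the merge is associative, the map calls can run in any order and on any number of machines.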
  26. 26. How can we help John? Very large log (e.g. TBs). Actionable Insight
  27. 27. Deploying a Hadoop Cluster is Hard
  28. 28. #3 ♥ ○○●○○
  30. 30. Elastic On Demand Pay as you go Focus on YOUR business
  32. 32. November
  33. 33. Provisioned capacity November
  34. 34. 76% 24% Provisioned capacity November
  35. 35. November
  36. 36. On and Off; Fast Growth; Variable Peaks; Predictable Peaks
  37. 37. On and Off; Fast Growth; Predictable Peaks; Variable Peaks. WASTE. CUSTOMER DISSATISFACTION
  38. 38. Fast Growth; On and Off; Predictable peaks; Variable peaks
  39. 39. #4 ○○○●○
  40. 40. EMR is Hadoop in the Cloud!
  41. 41. Media/Advertising: Targeted Advertising, Image and Video Processing. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transactions Analysis. Life Sciences: Genome Analysis. Financial Services: Monte Carlo Simulations, Risk Analysis. Security: Anti-virus, Fraud Detection, Image Recognition. Social Network/Gaming: User Demographics, Usage analysis, In-game metrics
  42. 42. [Chart: axis from 0 to 6,000,000]
  43. 43. Versions: 1.0.3, 0.20.205, 0.20, 0.18. Distributions: Apache Hadoop
  44. 44. Job Flows: Custom JAR; Cascading; Streaming (Ruby, Perl, Python, PHP, R, Bash, C++)
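Streaming job flows let the map and reduce steps be ordinary scripts that read lines and write tab-separated key/value lines, with the framework sorting by key in between. The sketch below imitates that contract in one Python file so it can be run locally; the word-count task, the function names, and the `run_locally` helper are illustrative assumptions, not part of EMR or of the presentation.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the values of consecutive lines that share a key.

    Relies on the input being sorted by key, which the framework
    guarantees between the map and reduce phases.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["pig hive pig", "hive hbase pig"]))
```

In an actual streaming job the mapper and reducer would typically be two separate scripts reading from standard input and writing to standard output.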
  45. 45. Hive: Data Warehouse for Hadoop; SQL-like query language
  46. 46. Pig: High-level programming; ideal for data flow / ETL
  47. 47. HBase: Near real time key/value store for structured data
  48. 48. Ganglia: Distributed monitoring of cluster and nodes
  49. 49. Statistical computing and graphics; Machine learning library; discover Value in Data
  50. 50. Data Strategist
  51. 51. Unknown Unknowns
  52. 52. Elastic On Demand Pay as you go Focus on YOUR business
  53. 53. Undifferentiated Heavy Lifting. Focus on YOUR business
  54. 54. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log (Instance type/count)
  55. 55. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log (Adding Hive, Pig and HBase to the job flow)
  56. 56. Elastic On Demand Pay as you go Focus on YOUR business
  57. 57. 1 instance for 1000 hours = 1000 instances for 1 hour
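The equality on this slide is plain instance-hour arithmetic: cost depends on the product of instances and hours, while wall-clock time depends only on the hours. A minimal sketch, assuming a flat hourly price (the $0.50 rate is a hypothetical value chosen for illustration) and ignoring startup overhead:

```python
def cluster_cost(instances, hours, hourly_rate):
    """Total cost of running `instances` machines for `hours` each."""
    return instances * hours * hourly_rate


HOURLY_RATE = 0.50  # hypothetical price per instance-hour

one_for_long = cluster_cost(instances=1, hours=1000, hourly_rate=HOURLY_RATE)
many_for_short = cluster_cost(instances=1000, hours=1, hourly_rate=HOURLY_RATE)

print(one_for_long, many_for_short)  # same cost, 1000x less waiting
```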
  58. 58. …to Thousands
  59. 59. Turn Off the Resources and Stop Paying
  60. 60. Elastic On Demand Pay as you go Focus on YOUR business
  61. 61. 70% lower 5 year TCO per app (AWS $0.90M vs. on-premises $3.01M). 50% reduction in analytics costs. Source: IDC Whitepaper, sponsored by Amazon, The Business Value of Amazon Web Services Accelerates Over Time. July 2012
  62. 62. Save more money by using Spot Instances
  63. 63. Typical Spot Bidding Strategies [Chart: Bid Distribution; x-axis: Bid Price as Percentage of the On-Demand Price; y-axis: Percentage of the Distribution, 0% to 20%]
  64. 64. EMR with Spot Instances. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28
  65. 65. EMR with Spot Instances. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28. 14 hrs
  66. 66. EMR with Spot Instances. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28. 7 hrs
  67. 67. EMR with Spot Instances. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28. With Spot (7 hrs): 4 instances * 7 hrs * $0.50 = $14 +
  68. 68. EMR with Spot Instances. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28. With Spot (7 hrs): 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75. Total = $22.75
  69. 69. EMR with Spot Instances. Time -50%. Cost -22%. Without Spot (14 hrs): 4 instances * 14 hrs * $0.50 = $28. With Spot (7 hrs): 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75. Total = $22.75
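The Spot comparison can be reproduced from the slide's own numbers. The sketch below only restates that arithmetic; the function name is an assumption, and the prices are the ones shown on the slide rather than current rates. Computed against the $28 baseline, these figures give a cost reduction of 18.75%, which differs somewhat from the percentage quoted on the slide.

```python
def run_cost(instances, hours, hourly_rate):
    """Cost of one group of instances at a flat hourly price."""
    return instances * hours * hourly_rate


# Without Spot: 4 on-demand instances for 14 hours
without_spot = run_cost(4, 14, 0.50)

# With Spot: the same 4 on-demand instances plus 5 Spot instances, for 7 hours
with_spot = run_cost(4, 7, 0.50) + run_cost(5, 7, 0.25)

time_saved = 1 - 7 / 14
cost_saved = 1 - with_spot / without_spot

print(f"without Spot: ${without_spot:.2f}, with Spot: ${with_spot:.2f}")
print(f"time saved: {time_saved:.0%}, cost saved: {cost_saved:.2%}")
```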
  70. 70. #5 ○○○○●
  71. 71. What kind of movies do people like?
  72. 72. More than 25 Million Streaming Members; 50 Billion Events Per Day; 30 Million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; Device location, time, day, week etc.; Social data
  73. 73. 10 TB of streaming data per day
  74. 74. Legacy data from on-premise data center [Diagram: Netflix Data Center (Legacy Data), S3]
  75. 75. Customer dimension data stored in Cassandra
  76. 76. ~1 PB of data stored in Amazon S3
  77. 77. Wide range of processing languages used [Diagram: S3, Prod Cluster (EMR)]
  78. 78. Data consumed in multiple ways: Recommendation Engine, Ad-hoc Analysis, Personalization [Diagram: S3, Prod Cluster (EMR)]
  79. 79. [Diagram: S3, Prod Cluster (EMR), Query Cluster (EMR), further EMR clusters]
  80. 80. Durability
  81. 81. Versioning
  82. 82. Foursquare… 33 million users, 1.3 million businesses …generates a lot of Data: 3.5 billion check-ins, 15M+ venues, Terabytes of log data
  83. 83. Uses EMR for: Evaluation of new features; Machine learning; Exploratory analysis; Daily customer usage reporting; Long-term trend analysis
  84. 84. Benefits of EMR. Ease-of-Use: “We have decreased the processing time for urgent data-analysis”. Flexibility: To deal with changing requirements & dynamically expand reporting clusters. Costs: “We have reduced our analytics costs by over 50%”
  85. 85. Application Stack: Scala/Liftweb API Machines, WWW Machines, Batch Jobs (Scala Application code); Mongo/Postgres/Flat Files (Databases, Logs). Data Stack: Amazon S3 (Database Dumps, Log Files); Hadoop Elastic Map Reduce; Hive/Ruby/Mahout; Analytics Dashboard; Map Reduce Jobs. Transfer: mongoexport, postgres dump, Flume
  89. 89. [Charts: Gender (Female, Male); Age]
  90. 90. Gorilla Coffee; Gray's Papaya; Amorino. Thursday, Friday, Saturday, Sunday
  91. 91. Who is using our service?
  92. 92. Finding signal in the noise of logs
  93. 93. Python library https://github.com/Yelp/mrjob
  94. 94. Log files 250 EMR clusters spun up and down every week
  95. 95. Common Crawl; 1000 Genomes Project; Census Data; 54 other datasets. http://aws.amazon.com/publicdatasets/
  96. 96. Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs. Solution: Clusters of 100s of nodes on EMR running 4-5 hours at a time. Leverages 1000 genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.
  97. 97. Challenge: Volatile weather is deadly to crops like grapes. Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TBs of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
  98. 98. 150B Soil Observations; 3M Daily Weather Measurements; 850K Precision Rainfall Grids Tracked; 200 TB in Amazon S3
  99. 99. Training Videos; Basic Overview; Documentation; Getting Started Guide; Developer Guide; API Reference; FAQs; Think Big Training (3-day Dev Course); EMR Bootcamp (on-site consulting)
  100. 100. Amazon Elastic MapReduce
  101. 101. Elastic and scalable + No upfront CapEx + Pay per use + On demand = Remove constraints
  102. 102. Remove constraints = More experimentation
  103. 103. More experimentation = More innovation
  104. 104. Focus on your business Leave undifferentiated heavy lifting to us
  105. 105. Thank you! aws.amazon.com/big-data @markkulepisto
