Hadoop and HBase on Amazon Web Services

Introducing big data and analytics with Hadoop, HBase and Amazon Elastic MapReduce.

Published in: Technology

  1. 1. Hadoop & HBase with Amazon Web Services. Dr. Matt Wood, matthew@amazon.com
  2. 2. Thank you.
  3. 3. Introducing Hadoop
  4. 4. Introducing Hadoop, HBase on AWS
  5. 5. Introducing Hadoop, HBase on AWS, Cost optimization
  6. 6. Data for competitive advantage.
  7. 7. Using data: Customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence...
  8. 8. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  9. 9. Cost of data generation is falling.
  10. 10. Lower cost, increased throughput. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  11. 11. Generation. HIGHLY CONSTRAINED. Collection & storage, Analytics & computation, Collaboration & sharing
  12. 12. Very high barrier to turning data into information.
  13. 13. Move from a data generation challenge to analytics challenge.
  14. 14. Enter the AWS Cloud.
  15. 15. Remove the constraints.
  16. 16. Enable data-driven innovation.
  17. 17. Move to a distributed data approach.
  18. 18. Maturation of two things.
  19. 19. Maturation of two things. Software for distributed storage and analysis
  20. 20. Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis
  21. 21. Software: Frameworks for data-intensive workloads. Distributed by design.
  22. 22. Infrastructure: Platform for data-intensive workloads. Distributed by design.
  23. 23. Support the data life cycle.
  24. 24. Generation. HIGHLY CONSTRAINED. Collection & storage, Analytics & computation, Collaboration & sharing
  25. 25. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  26. 26. Lower the barrier to entry.
  27. 27. Accelerate time to market and increase agility.
  28. 28. Enable new business opportunities.
  29. 29. Washington Post, Pinterest, NASA
  30. 30. “AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly” Michael Miller, Pfizer
  31. 31. Introducing Hadoop
  32. 32. Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis
  33. 33. Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis
  34. 34. Apache Hadoop: Software for distributed storage and analysis; Implements the map/reduce pattern; Focus on your data
  35. 35. Built for uncertainty: Hadoop provides tools to navigate data; Allows discovery; Query flexibility at scale
  36. 36. Built for flexibility: Java native; Executes code in any language; Just a distribution mechanism
  37. 37. Rich ecosystem: Diverse tools; Machine learning, recommendations, predictive analytics, segmentation, real time analysis; Lots of innovation
  38. 38. But... A very big project; 500k+ lines of code; Challenging to configure and optimize
  39. 39. Undifferentiated heavy lifting
  40. 40. Amazon Elastic MapReduce
  41. 41. Amazon Elastic MapReduce: Web service for data processing; Hosted Hadoop; Configured and optimized
  42. 42. Amazon Elastic MapReduce: Job flows; Elastic platform; Maintain clusters or run once and terminate; Debugging tools
  43. 43. S3: Input data
  44. 44. S3: Input data, Code. Elastic MapReduce
  45. 45. S3: Input data, Code. Elastic MapReduce. Name node
  46. 46. S3: Input data, Code. Elastic MapReduce. Name node. Elastic cluster
  47. 47. S3: Input data, Code. Elastic MapReduce. Name node. HDFS. Elastic cluster
  48. 48. S3: Input data, Code. Elastic MapReduce. Name node. HDFS. Elastic cluster. Queries + BI via JDBC, Pig, Hive
  49. 49. S3: Input data, Code. Elastic MapReduce. Name node. HDFS. Elastic cluster. Queries + BI via JDBC, Pig, Hive. Output: S3 + SimpleDB
  50. 50. S3: Input data. Output: S3 + SimpleDB
  51. 51. Hadoop all the way down: Amazon Hadoop distribution; HDFS; Streaming interface; Hive, Pig, Mahout, Spark, Shark
  52. 52. Data integration: Optimized and integrated into AWS environment; Reads and writes to S3; Analytics on DynamoDB data; Can process data from any source: Cassandra, Mongo, Couch, Amazon RDS
  53. 53. Data movement: Multi-part upload; Import/Export; AWS Direct Connect; Aspera
  54. 54. Cluster scalability: Resize running job flows; Add capacity for shorter runs; Remove capacity during off peak hours; Balance scale and cost
  55. 55. Cluster scalability: 14 hours remaining
  56. 56. Cluster scalability: 7 hours remaining
  57. 57. Cluster scalability: 3 hours remaining
  58. 58. Cluster scalability: Steady state; Steady state; Large batch task
  59. 59. Cluster availability: Canonical source of data; Anyone in the engineering team; IAM integration; Monitoring
  60. 60. Click stream analysis for retail: 3.5 billion records; 71 million unique cookies; 1.7 million targeted ads; 13 TB of clickstream logs; Each day
  61. 61. Click stream analysis for retail: Workflow time from 2 days to 8 hours; Procurement time from 2 months to 5 minutes; $13k per month; 500% increase in return on advertising spend
  62. 62. Log data stored in Amazon S3. Amazon S3: Months of user click-through data; Search terms; Ads displayed; Premium listing inventory
  63. 63. Elastic MapReduce spins up a 200-instance cluster. Hadoop Cluster, Amazon S3, Amazon EMR
  64. 64. Find patterns across logs. Write results to S3. Hadoop Cluster, Amazon S3, Amazon EMR
  65. 65. Hadoop in the AWS Cloud: Elastic MapReduce for hosted Hadoop; Optimized, configured, ready to roll; Focus on the business benefit of data; Hadoop all the way down
  66. 66. Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis
  67. 67. HBase on AWS
  68. 68. Vibrant ecosystem: Mahout for machine learning; Mesos for cluster management; Spark for fast analytics; HBase for unstructured data
  69. 69. HBase: NoSQL data store; Runs on top of HDFS; Scalable; Rapid retrieval across large datasets
  70. 70. Architecture: Huge, distributed map/hash; Distributed; Implements Bloom filters; Sortable
  71. 71. Column based: Columns are similar to fields; Rows are records
  72. 72. Built for data: Built to scale across billions of rows; The more data, the better the relative performance
  73. 73. But... Large, complex project; Running in production can be challenging; Distributed system
  74. 74. Undifferentiated heavy lifting
  75. 75. HBase for Elastic MapReduce
  76. 76. Using HBase: Social media firehose; Customer information; Usage and application logs; Hadoop analytics
  77. 77. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  78. 78. Amazon DynamoDB: NoSQL database service; Provisioned throughput; Unlimited storage; Very easy to use
  79. 79. DynamoDB & Amazon EMR: SQL-like queries; Query flexibility at scale; Integrate queries across datasets; Hive
  80. 80. NoSQL on the AWS Marketplace: CouchDB; Cassandra; MongoDB; aws.amazon.com/marketplace
  81. 81. Cost optimization
  82. 82. Lowered prices 19 times in the past six years.
  83. 83. On-demand
  84. 84. Reserved capacity
  85. 85. 100% Reserved capacity
  86. 86. 100% On-demand Reserved capacity
  87. 87. 100% On-demand Reserved capacity
  88. 88. Spot market
  89. 89. $0.08 vs $0.007 (yesterday evening)
  90. 90. Reserved Instance Marketplace
  91. 91. Introducing Hadoop, HBase on AWS, Cost optimization
  92. 92. aws.amazon.com/elasticmapreduce
  93. 93. Thank you. aws.amazon.com/rds matthew@amazon.com @mza
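
Slide 34 says Hadoop implements the map/reduce pattern, and slides 60 to 64 describe finding patterns across click logs. The deck does not show any code, so the following is only a minimal single-process sketch of the pattern in plain Python, not Hadoop or Elastic MapReduce code. The record layout, the sample data, and every function name here are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical click-log records: (cookie_id, search_term, ad_clicked).
# The real log format is not given in the slides.
RECORDS = [
    ("c1", "shoes", True),
    ("c2", "shoes", False),
    ("c1", "boots", True),
    ("c3", "shoes", True),
]


def map_record(record):
    """Map step: emit a (search_term, 1) pair for every clicked ad."""
    _cookie, term, clicked = record
    if clicked:
        yield term, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(RECORDS))  # {'shoes': 2, 'boots': 1}
```

On a real cluster the map and reduce functions would run on many nodes and the framework would handle the shuffle; here all three steps run in one process so the data flow is easy to follow.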
