
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013


Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.



  1. Scaling Your Analytics with Amazon Elastic MapReduce. Peter Sirota, General Manager, Amazon Elastic MapReduce. November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Agenda • Amazon EMR: Hadoop in the cloud • Hadoop ecosystem on Amazon EMR • Customer use cases
  3. Hadoop is the right system for Big Data • Scalable and fault tolerant • Flexibility for multiple languages and data formats • Open source • Ecosystem of tools • Batch and real-time analytics
  4. Challenges with Hadoop, on premises or on Amazon EC2 • Manage HDFS, upgrades, and system administration • Pay for expensive support contracts • Select hardware in advance and stick with predictions • Difficult to integrate with AWS storage services • Independently manage and monitor clusters
  5. Amazon EMR is the easiest way to run Hadoop in the cloud
  6. Why Amazon EMR? • Managed services • Easy to tune clusters and trim costs • Support for multiple data stores • Unique features and ecosystem support
  7.–13. (Animated diagram built up across slides 7–13) Input data flows from S3, DynamoDB, or Redshift into Elastic MapReduce along with your code; EMR provisions a name node and an elastic cluster backed by S3/HDFS; queries and BI tools connect via JDBC, Pig, and Hive; output is written back to S3, DynamoDB, or Redshift.
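The flow in that diagram (input data and code go in, a name node and elastic cluster are provisioned, results come back out) maps onto a single cluster-launch request. A minimal sketch, assuming the boto3-style `run_job_flow` parameters that post-date this 2013 talk; every bucket name and script path below is a placeholder:

```python
# Sketch of an EMR job flow: read input from Amazon S3, run a Hive step,
# write output back to S3. All bucket names and paths are placeholders.
job_flow = {
    "Name": "analytics-cluster",
    "LogUri": "s3://example-bucket/emr-logs/",      # hypothetical bucket
    "Instances": {
        "MasterInstanceType": "m1.large",           # the "name node" above
        "SlaveInstanceType": "m1.large",            # elastic worker nodes
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,       # transient cluster
    },
    "Steps": [
        {
            "Name": "hive-report",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f",
                         "s3://example-bucket/queries/report.q"],
            },
        }
    ],
}

# With AWS credentials configured, this request would launch the cluster:
# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**job_flow)
print(job_flow["Steps"][0]["Name"])  # hive-report
```

Because `KeepJobFlowAliveWhenNoSteps` is false, the cluster terminates after the step finishes, matching the pay-for-what-you-use model described later in the deck.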
  14. Elastic clusters: customize size and type to reduce costs
  15. Choose your instance types. Try out different configurations to find your optimal architecture. CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge. Memory: m1.large, m2.2xlarge, m2.4xlarge. Disk: hs1.8xlarge.
  16. Long-running or transient clusters. Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.
  17.–20. Resizable clusters: easy to add and remove compute capacity on your cluster. (Animated diagram across slides 17–20: a job sized for 10 hours is shortened to 6 hours by adding capacity, then cluster sizing is matched to peak compute demand.)
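The resizing example above is node-hours arithmetic: for a parallelizable job, the total work is roughly fixed, so adding nodes shortens the runtime proportionally. A toy model of that claim (real jobs rarely scale this perfectly):

```python
def estimated_runtime_hours(baseline_nodes, baseline_hours, resized_nodes):
    """Assume a perfectly parallel job: total node-hours stay constant."""
    node_hours = baseline_nodes * baseline_hours
    return node_hours / resized_nodes

# A job that takes 10 hours on 4 nodes takes about 6.7 hours on 6 nodes,
# and 5 hours on 8 nodes, under the constant-node-hours assumption:
print(estimated_runtime_hours(4, 10, 6))  # ~6.67
print(estimated_runtime_hours(4, 10, 8))  # 5.0
```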
  21. Use Spot and Reserved Instances: minimize costs by supplementing on-demand pricing
  22. Easy to use Spot Instances. Name-your-price supercomputing to minimize costs: Spot Instances for task nodes (up to 90% off Amazon EC2 on-demand pricing), on-demand for core nodes (standard Amazon EC2 pricing for on-demand capacity).
  23. 24/7 clusters on Reserved Instances. Minimize cost for consistent capacity: Reserved Instances for long-running clusters, up to 65% off on-demand pricing.
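The mixed-pricing cluster on slide 22 (on-demand core nodes holding HDFS data, Spot task nodes for extra compute) yields a simple hourly cost model. The instance price and the 90% Spot discount below are illustrative placeholders, not real AWS prices:

```python
def hourly_cluster_cost(on_demand_price, core_nodes, task_nodes,
                        spot_discount=0.90):
    """Core nodes run on-demand (they hold HDFS data); task nodes run on
    Spot at a discount off the on-demand price."""
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1 - spot_discount)
    return core_cost + task_cost

# Illustrative: $0.40/hr instances, 4 on-demand core + 16 Spot task nodes
cost = hourly_cluster_cost(0.40, core_nodes=4, task_nodes=16)
print(round(cost, 2))  # 2.24 (vs 8.00 if all 20 nodes ran on-demand)
```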
  24. Your data, your choice: easy to integrate Amazon EMR with your data stores
  25. Using Amazon S3 and HDFS. Data from your sources is aggregated and stored in Amazon S3; a long-running EMR cluster holds data in HDFS for interactive Hive ad-hoc queries; a transient EMR cluster runs batch map/reduce jobs for daily and weekly reports.
  26. Use Amazon EMR with Amazon Redshift and Amazon S3. Daily data from your sources is aggregated in Amazon S3, an Amazon EMR cluster processes it, and the processed data is loaded into an Amazon Redshift data warehouse.
  27. Use the Hadoop ecosystem on Amazon EMR: leverage a diverse set of tools to get the most out of your data
  28. Hadoop 2.x and much more … • Databases • Machine learning • Metadata stores • Exchange formats • Diverse query languages
  29. Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3 • Data warehouse for Hadoop • Integration with Amazon S3 for better performance reading and writing to Amazon S3 • SQL-like query language to make iterative queries easier • Easy to scale in HDFS on a persistent Amazon EMR cluster
  30. Use HBase on a persistent Amazon EMR cluster as a column-oriented, scalable data store • Billions of rows and millions of columns • Backup to and restore from Amazon S3 • Flexible datatypes • Modify your HBase tables when adding new data to your system
  31. Use ad-hoc queries on your cluster to drive insights in real time. Spark / Shark • In-memory MapReduce for faster queries • Use HiveQL to interact with your data
  32. (continued) Impala (coming soon!) • Parallel database engine for Hadoop • Use SQL to query data in HDFS on your cluster in real time
  33. “Hadoop-as-a-Service [Amazon EMR] offers a better price-performance ratio [than bare-metal Hadoop].” 1. Elastic clusters and cost optimization 2. Rapid, tuned provisioning 3. Agility for experimentation 4. Easy integration with diverse datastores
  34. Diverse set of partners building on Amazon EMR (partner logo grid): BI / visualization, Hadoop distributions, monitoring, business intelligence, data transformation, data transfer, performance tuning, ETL tools, graphical IDEs, and encryption; many are available on AWS Marketplace, and one is available as a distribution in Amazon Elastic MapReduce.
  35. Thousands of customers
  36. How Netflix Scales Its Big Data Platform on Amazon EMR. Eva Tse, Director of Big Data Platform, Netflix. November 14, 2013.
  37. The Hadoop ecosystem as our data analytics platform in the cloud
  38. How did we get here?
  39. How do we scale?
  40. Separate compute and storage layers
  41. Amazon S3 as our DW
  42. (Diagram) Amazon S3 as the source of truth
  43. (Diagram) Amazon S3, S3mper-enabled, as the source of truth
  44. Multiple clusters
  45. (Diagram) Ad hoc and SLA clusters in zones x and y, with Amazon S3 as the source of truth
  46. (Diagram) Ad hoc, SLA, and bonus clusters across zones x, y, and z, all backed by Amazon S3 as the source of truth
  47. Unified and global big data collection pipeline
  48. (Diagram) Cloud apps feed an events pipeline (Suro, Ursula) and a dimension pipeline (Aegisthus) into Amazon S3, the source of truth, which serves the SLA, bonus, and ad hoc clusters
  49. Innovate – services and tools
  50. (Diagram) Sting, CLIs, gateways
  51. Putting it into perspective … • Billions of viewing hours of data • ~3,000-node clusters • A hundred billion events / day • A few petabytes of DW on Amazon S3 • Thousands of jobs / day
  52. Ad hoc querying
  53. Simple reporting
  54. (Diagram) ETL
  55. Analytics and statistical modeling
  56. Open Connect
  57. What works for us? Scalability
  58. What works for us? Hadoop integration on Amazon EC2 / AWS
  59. What works for us? Lets us focus on innovation and building a solution
  60. What works for us? Tight engagement with the Amazon EMR and Amazon EC2 teams on tactical issues and the strategic roadmap
  61. Next steps … • Heterogeneous node clusters • Automatic expand and shrink • Richer monitoring infrastructure
  62. We strive to build the best-in-class big data platform in the cloud
  63. Big Data at Channel 4: Amazon Elastic MapReduce for Competitive Advantage. Bob Harris, Channel 4 Television. 14th November 2013.
  64. Channel 4 – Background • Channel 4 is a public-service, commercially funded, not-for-profit broadcaster. • We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media. • We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters and, increasingly, with emerging Internet-based providers. • Our content is available across our portfolio of around 10 core and time-shift channels, and our on-demand service 4oD is accessible across multiple devices and platforms.
  65. Why Big Data at C4
  66. Business Intelligence at C4 • Well-established business intelligence capability • Based on industry-standard proprietary products • Real-time data warehousing • Comprehensive business reporting • Excellent internal skills • Good external skills availability
  67. Big Data technology at C4 • 2011 – Embarked on Big Data initiative: ran in-house and cloud-based PoCs; selected Amazon EMR • 2012 – Ran Amazon EMR in parallel with conventional BI: Hive deployed to data analysts; Amazon EMR workflows deployed to production • 2013 – Amazon EMR confirmed as primary Big Data platform: Amazon EMR usage growing, with a focus on automation; experimenting with Mahout for machine learning
  68. What problems are we solving? A single view of the viewer: recognising them across devices, serving relevant content, and personalising the viewer experience
  69. How are we doing this? • Principal tasks: audience segmentation, personalisation, recommendations • What data do we process: website clickstream logs and 4oD activity and viewing history, for over 9m registered users, with the majority of activity now from “logged-in” users
  70. High-Level Architecture (diagram)
  71. High-Level Architecture • Amazon EMR and existing BI technology are complementary • Process billions of data rows in Amazon EMR, store millions of result rows in the RDBMS • No need to “rip and replace”; existing technology investment is protected • Amazon EMR will continue to underpin major growth in data volumes and processing complexity
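The "billions of rows in, millions of result rows out" pattern above is ordinary map/reduce aggregation. A toy, single-process Python sketch of the idea (the row fields are invented for illustration; on EMR the same shape of computation runs distributed across the cluster):

```python
from collections import Counter

# Hypothetical clickstream rows, one dict per page view
clickstream = [
    {"user": "u1", "page": "/4od/show/42"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/home"},
]

# Map: emit (user, 1) per row. Reduce: sum the counts per user.
# Many input rows collapse into one small result row per user,
# which is what lands in the RDBMS.
views_per_user = Counter(row["user"] for row in clickstream)

print(dict(views_per_user))  # {'u1': 2, 'u2': 1}
```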
  72. Where next? • Continued growth in usage of Amazon EMR • Migrate to Hadoop 2.x • Adopt Amazon Redshift • Improved integration between C4 and AWS • Shift toward “near real-time” processing
  73. Please give us your feedback on this presentation (BDT301). As a thank-you, we will select prize winners daily for completed surveys!
