AWS Summit 2011: Big Data Analytics in the AWS cloud



  1. Analytics in the Cloud
     Peter Sirota, GM, Elastic MapReduce
  2. Data-Driven Decision Making
     Data is the new raw material for any business, on par with capital, people, and labor.
  3. What is Big Data?
     Terabytes of semi-structured log data in which businesses want to:
     - find correlations / perform pattern matching
     - generate recommendations
     - calculate advanced statistics (e.g., TP99)
     Twitter "Firehose"
     - 50 million tweets per day
     - 1,400% growth per year
     - How can advertisers drink from it?
     Social graphs
     - Value increases with exponential growth in data connections
     Big Data is full of valuable, unanswered questions!
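The TP99 mentioned above is the 99th-percentile latency, i.e. the response time that 99% of requests stay at or below. A minimal sketch of computing it with the nearest-rank method (the helper name and sample latencies are illustrative, not from the deck):

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0-100) of values via nearest-rank."""
    ranked = sorted(values)
    # Nearest-rank: take the ceil(p/100 * N)-th smallest value, at least the 1st.
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

# 1,000 fake request latencies in ms: mostly fast, with a slow tail.
latencies = [10] * 990 + [250] * 10
tp99 = percentile(latencies, 99)  # 99% of requests are at or below this value
```

At Big Data scale the same statistic is computed over distributed logs (e.g., with a MapReduce job) rather than an in-memory list, but the definition is identical.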
  4. Why is Big Data Hard (and Getting Harder)?
     Today's data warehouses
     - Need to consolidate multiple data sources in multiple formats across multiple businesses
     - Unconstrained growth of this business-critical information
     Today's users
     - Expect faster response times on fresher data
     - Sampling is not good enough, and history is important
     - Demand inexpensive experimentation with new data
     - Are becoming increasingly sophisticated data scientists
     Current systems don't scale (and weren't meant to)
     - Long lead times to provision more infrastructure
     - Specialized DB expertise required
     - Expensive and inelastic solutions
     We need tools built specifically for Big Data!
  5. What is this thing called Hadoop?
     Dealing with Big Data requires two things:
     - Distributed, scalable storage
     - Inexpensive, flexible analytics
     Apache Hadoop is an open-source software platform that addresses both of these needs:
     - Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
     - Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets
     Key benefits
     - Affordable: cost per TB is a fraction of traditional options
     - Proven at scale: numerous petabyte implementations in production; linear scalability
     - Flexible: data can be stored with or without a schema
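The MapReduce technique the slide names can be illustrated without a cluster. A toy, single-process sketch of the three phases, using the classic word count (Hadoop's contribution is distributing exactly these steps across many machines and HDFS blocks; the function names here are illustrative):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values; here, sum the counts per word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is valuable"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Because map and reduce operate on independent key groups, each phase parallelizes naturally, which is what gives Hadoop its linear scalability.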
  6. RDBMS vs. MapReduce/Hadoop
     RDBMS
     - Predefined schema
     - Strategic data placement for query tuning
     - Exploits indexes for fast retrieval
     - SQL only
     - Doesn't scale linearly
     MapReduce/Hadoop
     - No schema is required
     - Random data placement
     - Fast scan of the entire dataset
     - Uniform query performance
     - Linearly scales for reads and writes
     - Supports many languages, including SQL
     Complementary technologies
  7. Why Amazon Elastic MapReduce?
     Managed Apache Hadoop web service
     - Monitors thousands of clusters per day
     - Use cases span from university students to the Fortune 50
     Reduces the complexity of Hadoop management
     - Handles node provisioning, customization, and shutdown
     - Tunes Hadoop to your hardware and network
     - Provides tools to debug and monitor your Hadoop clusters
     Provides tight integration with AWS services
     - Improved performance working with S3
     - Automatic re-provisioning on node failure
     - Dynamic expanding/shrinking of cluster size
     - Spot integration
  8. Elastic MapReduce Key Features
     Simplified cluster configuration/management
     - Resize running job flows
     - Support for EIP/IAM/tagging
     - Workload-specific configurations
     - Bootstrap actions
     Enhanced monitoring/debugging
     - Free CloudWatch metrics and alarms
     - Hadoop metrics in the console
     - Ganglia support
     Improved performance
     - S3 multipart upload
     - Cluster Compute instances
  9. Analytics Use Cases
     - Targeted advertising / clickstream analysis
     - Data warehousing applications
     - Bioinformatics (genome analysis)
     - Financial simulation (Monte Carlo simulation)
     - File processing (e.g., resizing JPEGs)
     - Web indexing
     - Data mining and BI
  10. Apache Hive: Data Warehouse for Hadoop
     - Open-source project started at Facebook
     - Turns data on Hadoop into a virtually limitless data warehouse
     - Provides data summarization, ad hoc querying, and analysis
     - Enables SQL-like queries on structured and unstructured data (e.g., arbitrary field separators are possible, such as "," in CSV file formats)
     - Inherits the linear scalability of Hadoop
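To give a feel for the kind of SQL-style, ad hoc querying Hive layers over raw comma-separated files, here is a local stand-in sketch using Python's sqlite3; the table name, columns, and log lines are made up for illustration, and Hive itself would compile a similar query into a MapReduce job over data in HDFS or S3:

```python
import csv
import io
import sqlite3

# Raw comma-separated log records, as they might sit in S3/HDFS.
raw = "user,page,ms\nalice,/home,120\nbob,/home,80\nalice,/buy,200\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user TEXT, page TEXT, ms INTEGER)")
rows = list(csv.reader(io.StringIO(raw)))[1:]  # parse CSV, skip header row
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Ad hoc aggregation, similar in spirit to a HiveQL GROUP BY over log files.
result = conn.execute(
    "SELECT page, COUNT(*), AVG(ms) FROM logs GROUP BY page ORDER BY page"
).fetchall()
```

The point of Hive is that this familiar declarative style survives the jump from megabytes on one machine to terabytes on a Hadoop cluster.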
  11. AWS Data Warehousing Architecture
  12. Elastic Data Warehouse
     - Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
     - Reduce costs by increasing server utilization
     - Improve performance during high-usage periods
     [Diagram: a data warehouse at steady state expands to 25 instances for batch processing, then shrinks back to 9 instances.]
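The day/night resizing described above amounts to a simple scheduling rule. A sketch, where the 9-instance steady state and 25-instance batch size come from the slide but the overnight window boundaries are assumed:

```python
def target_cluster_size(hour, steady=9, batch=25):
    """Return the desired instance count for a given hour of day (0-23):
    steady state for daytime queries, expanded for the assumed
    overnight batch window of 22:00-06:00."""
    overnight = hour >= 22 or hour < 6
    return batch if overnight else steady

# Desired size for every hour of the day.
sizes = {h: target_cluster_size(h) for h in range(24)}
```

In practice a driver would compare this target against the current cluster size and call Elastic MapReduce's resize operation to expand or shrink the job flow.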
  13. Reducing Costs with Spot Instances
     Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.
     Scenario #1: cost without Spot
     - 4 instances * 14 hrs * $0.50 = $28
     - Duration: 14 hours
     Scenario #2: cost with Spot
     - 4 instances * 7 hrs * $0.50 = $14
     - 5 instances * 7 hrs * $0.25 = $8.75
     - Total = $22.75
     - Duration: 7 hours
     Time savings: 50%; cost savings: ~19%
     Other EMR + Spot use cases
     - Run the entire cluster on Spot for the biggest cost savings
     - Reduce the cost of application testing
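The cost comparison above is simple arithmetic; a sketch of the mix-and-match math, using the slide's instance counts and hours with its assumed $0.50 On-Demand and $0.25 Spot hourly prices:

```python
def jobflow_cost(groups):
    """Total cost of a job flow, where each group is (instances, hours, hourly_price)."""
    return sum(n * hours * price for n, hours, price in groups)

# Scenario #1: On-Demand only, 4 instances for 14 hours.
on_demand_only = jobflow_cost([(4, 14, 0.50)])

# Scenario #2: 4 On-Demand plus 5 Spot instances, finishing in 7 hours.
with_spot = jobflow_cost([(4, 7, 0.50), (5, 7, 0.25)])

savings = 1 - with_spot / on_demand_only  # fraction saved by adding Spot capacity
```

Because the added Spot capacity halves the runtime, the On-Demand portion of the bill also halves, which is where most of the savings comes from.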
  14. Monitoring Clusters with CloudWatch
     Free CloudWatch metrics and alarms
     - Track Hadoop job progress
     - Alarm on degradations in cluster health
     - Monitor aggregate Elastic MapReduce usage
  15. Big Data Ecosystem and Tools
     We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
     - Business intelligence: MicroStrategy, Pentaho
     - Analytics: Datameer, Karmasphere, Quest
     - Open source: Ganglia, SQuirreL SQL
  16. Resources
     Amazon Elastic MapReduce