AWS Summit 2011: Big Data Analytics in the AWS cloud
Analytics in the Cloud
Peter Sirota, GM, Elastic MapReduce
Data-Driven Decision Making
Data is the new raw material for any business, on par with capital, people, and labor.
What is Big Data?
- Terabytes of semi-structured log data in which businesses want to:
  - find correlations / perform pattern matching
  - generate recommendations
  - calculate advanced statistics (e.g., TP99; a sketch follows this list)
- Twitter "Firehose"
  - 50 million tweets per day, growing 1,400% per year
  - How can advertisers drink from it?
- Social graphs
  - Value increases with the exponential growth in data connections

Big Data is full of valuable, unanswered questions!
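As a quick illustration of those "advanced statistics": TP99 is the 99th-percentile latency, the value that 99% of samples fall at or below. A minimal nearest-rank calculation might look like this (the sample data is hypothetical):

```python
import math

def tp99(latencies_ms):
    """Nearest-rank 99th percentile: sort the samples and take the
    value at rank ceil(0.99 * N)."""
    ordered = sorted(latencies_ms)
    rank = int(math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds
samples = [12, 15, 11, 250, 14, 13, 16, 12, 900, 15]
print(tp99(samples))  # -> 900
```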
Why is Big Data Hard (and Getting Harder)?
- Today's data warehouses
  - need to consolidate data from multiple sources, in multiple formats, across multiple businesses
  - face unconstrained growth of this business-critical information
- Today's users
  - expect faster response times on fresher data
  - sampling is not good enough, and history is important
  - demand inexpensive experimentation with new data
  - are becoming increasingly sophisticated data scientists
- Current systems don't scale (and weren't meant to)
  - long lead times to provision more infrastructure
  - specialized DB expertise required
  - expensive and inelastic solutions

We need tools built specifically for Big Data!
What is this thing called Hadoop?
Dealing with Big Data requires two things:
- Distributed, scalable storage
- Inexpensive, flexible analytics

Apache Hadoop is an open-source software platform that addresses both of these needs:
- Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
- Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets (a word-count example follows this list)

Key benefits:
- Affordable: cost per TB is a fraction of traditional options
- Proven at scale: numerous petabyte implementations in production; linear scalability
- Flexible: data can be stored with or without a schema
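To make MapReduce concrete, here is the classic word count written as a Hadoop Streaming mapper and reducer; this is a minimal sketch, not code from the talk. Streaming lets you write both phases in any language that reads stdin and writes stdout; Hadoop shuffles and sorts the mapper output by key before the reducer sees it.

```python
#!/usr/bin/env python
# mapper.py -- emit "<word>\t1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so equal words are
# adjacent; sum each word's counts and emit one total per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print("%s\t%d" % (current_word, count))
        count = 0
    current_word = word
    count += int(n)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

The same pair of scripts runs unchanged on a local pipe (`cat input.txt | ./mapper.py | sort | ./reducer.py`) or as a streaming step on a cluster.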
RDBMS vs. MapReduce/Hadoop

RDBMS:
- Predefined schema
- Strategic data placement for query tuning
- Exploits indexes for fast retrieval
- SQL only
- Doesn't scale linearly

MapReduce/Hadoop:
- No schema required
- Random data placement
- Fast scan of the entire dataset; uniform query performance
- Supports many languages, including SQL
- Scales linearly for reads and writes

They are complementary technologies.
Why Amazon Elastic MapReduce?
Managed Apache Hadoop web service (launching a job flow is sketched below)
- We monitor thousands of clusters per day
- Use cases span from university students to the Fortune 50

Reduces the complexity of Hadoop management
- Handles node provisioning, customization, and shutdown
- Tunes Hadoop to your hardware and network
- Provides tools to debug and monitor your Hadoop clusters

Provides tight integration with AWS services
- Improved performance working with S3
- Automatic re-provisioning on node failure
- Dynamic expanding/shrinking of cluster size
- Spot integration
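Launching the streaming job above as a managed EMR job flow takes only a few calls against the EMR API. A sketch using the 2011-era boto library (the bucket names and paths are hypothetical):

```python
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # reads AWS credentials from the environment

step = StreamingStep(
    name='Word count',
    mapper='s3n://my-bucket/scripts/mapper.py',
    reducer='s3n://my-bucket/scripts/reducer.py',
    input='s3n://my-bucket/input/',
    output='s3n://my-bucket/output/')

jobflow_id = conn.run_jobflow(
    name='EMR demo',
    log_uri='s3n://my-bucket/logs/',
    steps=[step],
    num_instances=4,
    master_instance_type='m1.small',
    slave_instance_type='m1.small')
print(jobflow_id)  # e.g. 'j-...'
```

EMR then provisions the nodes, runs the step, and by default shuts the cluster down when the job flow finishes.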
Analytics Use Cases
- Targeted advertising / clickstream analysis
- Data warehousing applications
- Bio-informatics (genome analysis)
- Financial simulation (Monte Carlo simulation)
- File processing (e.g., resizing JPEGs)
- Web indexing
- Data mining and BI
Apache Hive: Data Warehouse for Hadoop
- Open-source project started at Facebook
- Turns data on Hadoop into a virtually limitless data warehouse
- Provides data summarization, ad hoc querying, and analysis
- Enables SQL-like queries on structured and unstructured data (example below)
  - e.g., arbitrary field separators are possible, such as "," in CSV file formats
- Inherits the linear scalability of Hadoop
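Here is a sketch of what those SQL-like queries look like: a HiveQL script that maps comma-separated log files in S3 onto a table and summarizes them. The table, column, and bucket names are hypothetical; the script is built and saved from Python so it could be submitted to a cluster with `hive -f report.q`:

```python
# Hypothetical HiveQL: expose CSV logs in S3 as a table, then
# summarize them with an ordinary GROUP BY.
hiveql = """
CREATE EXTERNAL TABLE impressions (
    ad_id   STRING,
    user_id STRING,
    ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/impressions/';

SELECT ad_id, COUNT(*) AS views
FROM impressions
GROUP BY ad_id
ORDER BY views DESC
LIMIT 10;
"""

with open("report.q", "w") as f:
    f.write(hiveql)
# Submit with: hive -f report.q
```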
Elastic Data Warehouse
- Customize cluster size to support varying resource needs, e.g. query support during the day versus batch processing overnight (see the resize sketch below)
- Reduce costs by increasing server utilization
- Improve performance during high-usage periods

[Diagram: a data warehouse running at steady state on 9 instances expands to 25 instances for overnight batch processing, then shrinks back to 9 for steady state.]
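That resize can be scripted against the EMR API. A sketch using boto's modify_instance_groups call (the instance-group ID is hypothetical; the sizes match the diagram):

```python
from boto.emr.connection import EmrConnection

conn = EmrConnection()

# Expand the cluster's (hypothetical) instance group to 25 instances
# for overnight batch processing...
conn.modify_instance_groups(['ig-EXAMPLE'], [25])

# ...and shrink it back to 9 for steady-state daytime queries.
conn.modify_instance_groups(['ig-EXAMPLE'], [9])
```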
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1: job flow without Spot
- 4 instances * 14 hrs * $0.50 = $28
- Duration: 14 hours

Scenario #2: job flow with Spot
- 4 On-Demand instances * 7 hrs * $0.50 = $14
- 5 Spot instances * 7 hrs * $0.25 = $8.75
- Total = $22.75
- Duration: 7 hours

Time savings: 50%. Cost savings: ~19% (worked through in the sketch below).

Other EMR + Spot use cases:
- Run the entire cluster on Spot for the biggest cost savings
- Reduce the cost of application testing
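The arithmetic behind the two scenarios, as a tiny sketch (the prices are the example's, not current Spot quotes):

```python
def jobflow_cost(groups):
    """Total cost given (instance_count, hours, dollars_per_hour) tuples."""
    return sum(n * hours * rate for n, hours, rate in groups)

on_demand_only = jobflow_cost([(4, 14, 0.50)])  # Scenario #1
mixed = jobflow_cost([(4, 7, 0.50),             # On-Demand nodes
                      (5, 7, 0.25)])            # added Spot nodes

print(on_demand_only)              # 28.0
print(mixed)                       # 22.75
print(1 - mixed / on_demand_only)  # 0.1875 -> ~19% cheaper, twice as fast
```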
Monitoring Clusters with CloudWatch
- Free CloudWatch metrics and alarms (see the query sketch below)
- Track Hadoop job progress
- Alarm on degradations in cluster health
- Monitor aggregate Elastic MapReduce usage
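Those metrics can be pulled programmatically as well as viewed in the console. A sketch with boto's CloudWatch client; the metric name and dimension here are assumptions for illustration, so check the EMR CloudWatch documentation for the exact names:

```python
from datetime import datetime, timedelta
from boto.ec2.cloudwatch import CloudWatchConnection

conn = CloudWatchConnection()
end = datetime.utcnow()

# Average running map tasks over the last hour, in 5-minute buckets.
# 'RunningMapTasks' and the 'JobFlowId' dimension are assumed names.
stats = conn.get_metric_statistics(
    period=300,
    start_time=end - timedelta(hours=1),
    end_time=end,
    metric_name='RunningMapTasks',
    namespace='AWS/ElasticMapReduce',
    statistics=['Average'],
    dimensions={'JobFlowId': 'j-EXAMPLE'})
print(stats)
```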
Big Data Ecosystem and Tools
We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
- Business intelligence: MicroStrategy, Pentaho
- Analytics: Datameer, Karmasphere, Quest
- Open source: Ganglia, SQuirreL SQL