AWS Summit 2011: Big Data Analytics in the AWS cloud
  1. Analytics in the Cloud
     Peter Sirota, GM Elastic MapReduce
  2. Data-Driven Decision Making
     Data is the new raw material for any business, on par with capital, people, and labor.
  3. What is Big Data?
     Terabytes of semi-structured log data in which businesses want to:
     - find correlations / perform pattern matching
     - generate recommendations
     - calculate advanced statistics (e.g., TP99; see the sketch below)
     Twitter "Firehose"
     - 50 million tweets per day
     - 1,400% growth per year
     - How can advertisers drink from it?
     Social graphs
     - Value increases with exponential growth in data connections
     Big Data is full of valuable, unanswered questions!
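     To make the "advanced statistics" bullet concrete: TP99 is simply the 99th-percentile latency. The minimal Python sketch below computes it over a small in-memory list; the latency values are invented for illustration, and at Big Data scale the same aggregation would have to run across terabytes of log lines, which is where the rest of this deck comes in.

         # Sketch: computing a TP99 (99th-percentile) latency.
         # The latencies are illustrative; in practice they would be parsed
         # out of terabytes of log lines on a distributed system.
         import math

         latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 80, 120, 250, 900]

         def percentile(values, p):
             """Return the p-th percentile (0-100) using the nearest-rank method."""
             ordered = sorted(values)
             rank = max(1, math.ceil(p / 100 * len(ordered)))
             return ordered[rank - 1]

         print("TP99:", percentile(latencies_ms, 99), "ms")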
  4. Why is Big Data Hard (and Getting Harder)?
     Today's Data Warehouses
     - Need to consolidate multiple data sources, in multiple formats, across multiple businesses
     - Unconstrained growth of this business-critical information
     Today's Users
     - Expect faster response times on fresher data
     - Sampling is not good enough, and history is important
     - Demand inexpensive experimentation with new data
     - Become increasingly sophisticated Data Scientists
     Current systems don't scale (and weren't meant to)
     - Long lead time to provision more infrastructure
     - Specialized DB expertise required
     - Expensive and inelastic solutions
     We need tools built specifically for Big Data!
  5. What is this thing called Hadoop?
     Dealing with Big Data requires two things:
     - Distributed, scalable storage
     - Inexpensive, flexible analytics
     Apache Hadoop is an open source software platform that addresses both of these needs:
     - Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
     - Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets (a minimal example follows below)
     Key benefits
     - Affordable: cost per TB is a fraction of traditional options
     - Proven at scale: numerous petabyte implementations in production; linear scalability
     - Flexible: data can be stored with or without a schema
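     To make the MapReduce bullet concrete, here is a minimal word-count job written for Hadoop Streaming, the standard Hadoop facility that lets the map and reduce phases be plain scripts reading stdin and writing stdout. The file name and invocation are illustrative:

         #!/usr/bin/env python3
         # wordcount.py - minimal Hadoop Streaming sketch (illustrative file name).
         # Run as the mapper with "wordcount.py map" and as the reducer with
         # "wordcount.py reduce"; Hadoop Streaming pipes records through
         # stdin/stdout and sorts the mapper output by key before reducing.
         import sys
         from itertools import groupby

         def mapper():
             for line in sys.stdin:
                 for word in line.split():
                     print("%s\t1" % word.lower())

         def reducer():
             # Input arrives sorted by key, so equal words are adjacent.
             pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
             for word, group in groupby(pairs, key=lambda kv: kv[0]):
                 print("%s\t%d" % (word, sum(int(count) for _, count in group)))

         if __name__ == "__main__":
             mapper() if sys.argv[1] == "map" else reducer()

     Hadoop runs many copies of the mapper in parallel across the cluster, sorts the intermediate key/value pairs, and feeds each key's group to a reducer; that is the entire programming model.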
  6. RDBMS vs. MapReduce/Hadoop
     RDBMS:
     - Predefined schema
     - Strategic data placement for query tuning
     - Exploits indexes for fast retrieval
     - SQL only
     - Doesn't scale linearly
     MapReduce/Hadoop:
     - No schema required
     - Random data placement
     - Fast scan of the entire dataset
     - Uniform query performance
     - Scales linearly for reads and writes
     - Supports many languages, including SQL
     Complementary technologies
  7. Why Amazon Elastic MapReduce?
     Managed Apache Hadoop Web Service
     - Monitors thousands of clusters per day
     - Use cases span from university students to the Fortune 50
     Reduces the complexity of Hadoop management
     - Handles node provisioning, customization, and shutdown
     - Tunes Hadoop to your hardware and network
     - Provides tools to debug and monitor your Hadoop clusters
     Provides tight integration with AWS services
     - Improved performance working with S3
     - Automatic re-provisioning on node failure
     - Dynamic expanding/shrinking of cluster size
     - Spot integration
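     A minimal sketch of what "managed Hadoop" means in practice: a single API call provisions a cluster, runs the streaming step above, and shuts everything down. This uses boto3, the current AWS SDK for Python (newer than the tooling shown in this deck); the bucket names, roles, release label, and instance sizes are assumptions:

         # Sketch: launch an EMR cluster, run one streaming step, then terminate.
         import boto3

         emr = boto3.client("emr", region_name="us-east-1")

         response = emr.run_job_flow(
             Name="wordcount-example",
             ReleaseLabel="emr-6.15.0",              # assumed release; pick a current one
             LogUri="s3://my-example-bucket/emr-logs/",
             Instances={
                 "MasterInstanceType": "m5.xlarge",
                 "SlaveInstanceType": "m5.xlarge",
                 "InstanceCount": 3,
                 "KeepJobFlowAliveWhenNoSteps": False,  # shut down when the step finishes
             },
             Steps=[{
                 "Name": "streaming word count",
                 "ActionOnFailure": "TERMINATE_CLUSTER",
                 "HadoopJarStep": {
                     "Jar": "command-runner.jar",
                     "Args": [
                         "hadoop-streaming",
                         "-files", "s3://my-example-bucket/scripts/wordcount.py",
                         "-mapper", "python3 wordcount.py map",
                         "-reducer", "python3 wordcount.py reduce",
                         "-input", "s3://my-example-bucket/input/",
                         "-output", "s3://my-example-bucket/output/",
                     ],
                 },
             }],
             JobFlowRole="EMR_EC2_DefaultRole",
             ServiceRole="EMR_DefaultRole",
         )
         print("Cluster started:", response["JobFlowId"])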
  8. Elastic MapReduce Key Features
     Simplified cluster configuration/management
     - Resize running job flows
     - Support for EIP / IAM / tagging
     - Workload-specific configurations
     - Bootstrap Actions
     Enhanced monitoring/debugging
     - Free CloudWatch metrics and alarms
     - Hadoop metrics in the console
     - Ganglia support
     Improved performance
     - S3 multipart upload
     - Cluster Compute instances
  9. Analytics Use Cases
     - Targeted advertising / clickstream analysis
     - Data warehousing applications
     - Bio-informatics (genome analysis)
     - Financial simulation (Monte Carlo simulation)
     - File processing (e.g., resizing JPEGs)
     - Web indexing
     - Data mining and BI
  10. Apache Hive: Data Warehouse for Hadoop
      - Open source project started at Facebook
      - Turns data on Hadoop into a virtually limitless data warehouse
      - Provides data summarization, ad hoc querying, and analysis
      - Enables SQL-like queries on structured and unstructured data
        (e.g., arbitrary field separators are possible, such as "," in CSV file formats)
      - Inherits the linear scalability of Hadoop
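      A sketch of the kind of HiveQL this slide describes, defining an external table over comma-delimited logs in S3 and running an ad hoc aggregation, then submitting it to a running cluster as a Hive step. The table, bucket, and cluster id are illustrative assumptions, and the step arguments follow the hive-script pattern from the EMR documentation:

          # Sketch: run a HiveQL script as an EMR step.
          import boto3

          HIVE_SCRIPT = """
          -- External table over raw CSV logs already sitting in S3;
          -- Hive imposes the schema at query time (schema-on-read).
          CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
              ad_id   STRING,
              user_id STRING,
              ts      STRING
          )
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          LOCATION 's3://my-example-bucket/logs/impressions/';

          -- Ad hoc aggregation, exactly like a warehouse query.
          SELECT ad_id, COUNT(*) AS views
          FROM impressions
          GROUP BY ad_id
          ORDER BY views DESC
          LIMIT 20;
          """

          s3 = boto3.client("s3")
          s3.put_object(Bucket="my-example-bucket", Key="scripts/top_ads.q",
                        Body=HIVE_SCRIPT.encode("utf-8"))

          emr = boto3.client("emr", region_name="us-east-1")
          emr.add_job_flow_steps(
              JobFlowId="j-EXAMPLE12345",
              Steps=[{
                  "Name": "top ads by impressions",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                      "Jar": "command-runner.jar",
                      "Args": ["hive-script", "--run-hive-script", "--args",
                               "-f", "s3://my-example-bucket/scripts/top_ads.q"],
                  },
              }],
          )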
  11. AWS Data Warehousing Architecture
  12. Elastic Data Warehouse
      - Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight); see the resize sketch below
      - Reduce costs by increasing server utilization
      - Improve performance during high-usage periods
      [Diagram: the data warehouse cluster expands to 25 instances for batch processing and shrinks back to 9 instances in its steady state]
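      A sketch of that resize pattern, assuming a long-running cluster whose TASK instance group is grown before the overnight batch window and shrunk back afterwards; the cluster id and target sizes are illustrative:

          # Sketch: resize an EMR cluster's TASK instance group on a schedule.
          import boto3

          emr = boto3.client("emr", region_name="us-east-1")
          CLUSTER_ID = "j-EXAMPLE12345"

          def resize_task_group(target_count):
              """Set the cluster's TASK instance group to target_count nodes."""
              groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
              task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")
              emr.modify_instance_groups(
                  ClusterId=CLUSTER_ID,
                  InstanceGroups=[{"InstanceGroupId": task_group["Id"],
                                   "InstanceCount": target_count}],
              )

          resize_task_group(25)   # before the overnight batch window
          # ... batch processing runs ...
          resize_task_group(9)    # back to daytime steady state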
  13. Reducing Costs with Spot Instances
      Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.
      Scenario #1: cost without Spot
      - Job flow: 4 instances x 14 hrs x $0.50 = $28
      - Duration: 14 hours
      Scenario #2: cost with Spot
      - Job flow: 4 instances x 7 hrs x $0.50 = $13, plus 5 Spot instances x 7 hrs x $0.25 = $8.75
      - Total = $21.75
      - Duration: 7 hours
      Time savings: 50%; cost savings: ~22%
      Other EMR + Spot use cases
      - Run the entire cluster on Spot for the biggest cost savings
      - Reduce the cost of application testing
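      A sketch of the mixed fleet in Scenario #2: On-Demand core nodes hold HDFS data and are protected from interruption, while the extra nodes that accelerate the job are requested on the Spot market. Instance types, counts, and the bid price are illustrative assumptions:

          # Sketch: launch an EMR cluster mixing On-Demand and Spot instance groups.
          import boto3

          emr = boto3.client("emr", region_name="us-east-1")

          instance_groups = [
              {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
               "InstanceType": "m5.xlarge", "InstanceCount": 1},
              {"Name": "core-on-demand", "InstanceRole": "CORE", "Market": "ON_DEMAND",
               "InstanceType": "m5.xlarge", "InstanceCount": 4},
              {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
               "BidPrice": "0.25",  # maximum hourly price for the Spot nodes
               "InstanceType": "m5.xlarge", "InstanceCount": 5},
          ]

          emr.run_job_flow(
              Name="spot-accelerated-job",
              ReleaseLabel="emr-6.15.0",
              LogUri="s3://my-example-bucket/emr-logs/",
              Instances={"InstanceGroups": instance_groups,
                         "KeepJobFlowAliveWhenNoSteps": False},
              JobFlowRole="EMR_EC2_DefaultRole",
              ServiceRole="EMR_DefaultRole",
          )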
  14. Monitoring Clusters with CloudWatch
      Free CloudWatch metrics and alarms
      - Track Hadoop job progress
      - Alarm on degradations in cluster health
      - Monitor aggregate Elastic MapReduce usage
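      A sketch of such an alarm, assuming MissingBlocks (one of the cluster-health metrics EMR publishes under the AWS/ElasticMapReduce namespace, keyed by JobFlowId); the cluster id and SNS topic are illustrative:

          # Sketch: alarm when the cluster reports HDFS blocks with no replicas left.
          import boto3

          cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

          cloudwatch.put_metric_alarm(
              AlarmName="emr-missing-hdfs-blocks",
              Namespace="AWS/ElasticMapReduce",
              MetricName="MissingBlocks",
              Dimensions=[{"Name": "JobFlowId", "Value": "j-EXAMPLE12345"}],
              Statistic="Average",
              Period=300,
              EvaluationPeriods=1,
              Threshold=0,
              ComparisonOperator="GreaterThanThreshold",
              AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],
          )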
  15. Big Data Ecosystem and Tools
      We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
      - Business intelligence: MicroStrategy, Pentaho
      - Analytics: Datameer, Karmasphere, Quest
      - Open source: Ganglia, SQuirrel SQL
  16. Resources
      Amazon Elastic MapReduce
      - aws.amazon.com/elasticmapreduce
      - aws.amazon.com/articles/Elastic-MapReduce
      - forums.aws.amazon.com/forum.jspa?forumID=52
