AWS Summit 2011: Big Data Analytics in the AWS cloud
 

Presentation Transcript

  • Analytics in the Cloud
    Peter Sirota, GM, Elastic MapReduce
  • Data-Driven Decision Making
    Data is the new raw material for any business, on par with capital, people, and labor.
  • What is Big Data?
    Terabytes of semi-structured log data in which businesses want to:
      • find correlations/perform pattern matching
      • generate recommendations
      • calculate advanced statistics (e.g., TP99)
    Twitter “Firehose”
      • 50 million tweets per day
      • 1,400% growth per year
      • How can advertisers drink from it?
    Social graphs
      • Value increases with exponential growth in data connections
    Big Data is full of valuable, unanswered questions!
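As an illustration of the "advanced statistics" item above, here is a minimal sketch of a TP99 calculation (the 99th-percentile value of a set of measurements). It assumes the nearest-rank definition of a percentile, which is one of several common conventions; the function name and the latency sample are hypothetical and do not come from the slides.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: request latencies of 1..200 ms.
latencies_ms = list(range(1, 201))
tp99 = percentile_nearest_rank(latencies_ms, 99)
print(f"TP99 latency: {tp99} ms")
```

On a single machine this requires sorting the whole sample, which is part of why such statistics become expensive on terabytes of log data.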
  • Why is Big Data Hard (and Getting Harder)?
    Today’s Data Warehouses
      • Need to consolidate from multiple data sources in multiple formats across multiple businesses
      • Unconstrained growth of this business-critical information
    Today’s Users
      • Expect faster response time of fresher data
      • Sampling is not good enough and history is important
      • Demand inexpensive experimentation with new data
      • Become increasingly sophisticated Data Scientists
    Current systems don’t scale (and weren’t meant to)
      • Long time to provision more infrastructure
      • Specialized DB expertise required
      • Expensive and inelastic solutions
    We need tools built specifically for Big Data!
  • What is this thing called Hadoop?
    Dealing with Big Data requires two things:
      • Distributed, scalable storage
      • Inexpensive, flexible analytics
    Apache Hadoop is an open source software platform that addresses both of these needs
      • Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
      • Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets
    Key benefits
      • Affordable: cost/TB is a fraction of traditional options
      • Proven at scale: numerous petabyte implementations in production; linear scalability
      • Flexible: data can be stored with or without schema
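To make the MapReduce technique mentioned above more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a token count over log lines. This is not the Hadoop API: the function names and sample lines are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    for token in line.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three simplified access-log lines.
log_lines = ["GET /index.html", "GET /about.html", "POST /index.html"]
counts = run_job(log_lines)
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, both stages can be split across machines, which is where the linear scalability claimed on the slide comes from.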
  • RDBMS vs. MapReduce/Hadoop
    RDBMS
      • Predefined schema
      • Strategic data placement for query tuning
      • Exploit indexes for fast retrieving
      • SQL only
      • Doesn’t scale linearly
    MapReduce/Hadoop
      • No schema is required
      • Random data placement
      • Fast scan of the entire dataset
      • Uniform query performance
      • Linearly scales for reads and writes
      • Support many languages including SQL
    Complementary technologies
  • Why Amazon Elastic MapReduce?
    Managed Apache Hadoop Web Service
      • Monitor thousands of clusters per day
      • Use cases span from university students to Fortune 50
    Reduces complexity of Hadoop management
      • Handles node provisioning, customization, and shutdown
      • Tunes Hadoop to your hardware and network
      • Provides tools to debug and monitor your Hadoop clusters
    Provides tight integration with AWS services
      • Improved performance working with S3
      • Automatic re-provisioning on node failure
      • Dynamic expanding/shrinking of cluster size
      • Spot integration
  • Elastic MapReduce Key Features
    Simplified Cluster Configuration/Management
      • Resize running job flows
      • Support for EIP/IAM/Tagging
      • Workload-specific configurations
      • Bootstrap Actions
    Enhanced Monitoring/Debugging
      • Free CloudWatch Metrics / Alarms
      • Hadoop Metrics in Console
      • Ganglia Support
    Improved Performance
      • S3 Multipart Upload
      • Cluster Compute Instances
  • Analytics Use Cases
      • Targeted advertising / Clickstream analysis
      • Data warehousing applications
      • Bio-informatics (Genome analysis)
      • Financial simulation (Monte Carlo simulation)
      • File processing (resize jpegs)
      • Web indexing
      • Data mining and BI
  • Apache Hive: Data Warehouse for Hadoop
      • Open source project started at Facebook
      • Turns data on Hadoop into a virtually limitless data warehouse
      • Provides data summarization, ad hoc querying and analysis
      • Enables SQL-like queries on structured and unstructured data, e.g. arbitrary field separators are possible, such as “,” in CSV file formats
      • Inherits linear scalability of Hadoop
  • AWS Data Warehousing Architecture
  • Elastic Data Warehouse
      • Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)
      • Reduce costs by increasing server utilization
      • Improve performance during high usage periods
    [Diagram labels: Data Warehouse (Steady State); Expand to 25 instances; Data Warehouse (Batch Processing); Shrink to 9 instances; Data Warehouse (Steady State)]
  • Reducing Costs with Spot Instances
    Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.
    Scenario #1: Cost without Spot
      • Job flow: 4 instances * 14 hrs * $0.50 = $28
      • Duration: 14 hours
    Scenario #2: Cost with Spot
      • Job flow: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75
      • Total = $22.75
      • Duration: 7 hours
    Time savings: 50%; cost savings: ~19%
    Other EMR + Spot use cases
      • Run entire cluster on Spot for biggest cost savings
      • Reduce the cost of application testing
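The scenario arithmetic can be recomputed directly from the instance counts, durations, and hourly rates stated on the slide. The sketch below does only that; the helper name `job_cost` is hypothetical, and the $0.50 and $0.25 hourly rates are the slide's illustrative prices, not current AWS pricing.

```python
def job_cost(instances, hours, hourly_rate):
    """Cost in dollars of running `instances` machines for `hours` at `hourly_rate`."""
    return instances * hours * hourly_rate


# Scenario #1: On-Demand only, 4 instances for 14 hours at $0.50/hr.
cost_without_spot = job_cost(4, 14, 0.50)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances
# at $0.25/hr, finishing in 7 hours.
on_demand_part = job_cost(4, 7, 0.50)
spot_part = job_cost(5, 7, 0.25)
cost_with_spot = on_demand_part + spot_part

time_savings = (14 - 7) / 14
cost_savings = (cost_without_spot - cost_with_spot) / cost_without_spot

print(f"Without Spot: ${cost_without_spot:.2f}")
print(f"With Spot:    ${cost_with_spot:.2f}")
print(f"Time savings: {time_savings:.0%}, cost savings: {cost_savings:.2%}")
```

With these inputs the On-Demand portion of Scenario #2 comes to $14.00 and the Spot portion to $8.75, giving $22.75 in total against $28.00 without Spot, a cost saving of 18.75%.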
  • Monitoring Clusters with CloudWatch
    Free CloudWatch Metrics and Alarms
      • Track Hadoop job progress
      • Alarm on degradations in cluster health
      • Monitor aggregate Elastic MapReduce usage
  • Big Data Ecosystem and Tools
    We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
      • Business Intelligence: MicroStrategy, Pentaho
      • Analytics: Datameer, Karmasphere, Quest
      • Open source: Ganglia, SQuirreL SQL
  • Resources
    Amazon Elastic MapReduce
      • aws.amazon.com/elasticmapreduce
      • aws.amazon.com/articles/Elastic-MapReduce
      • forums.aws.amazon.com/forum.jspa?forumID=52