Big Data Use Cases and Solutions in the AWS Cloud

The AWS cloud computing platform has disrupted big data. Managing big data applications used to be feasible only for well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. AWS technology is not only available to everyone; it is also self-service and on-demand, with innovative technology and flexible, low-cost pricing models that require no commitments. Learn from customer success stories, as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the resources needed to get you started quickly.

Presentation Transcript

  • © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Big Data Use Cases and Solutions in the AWS Cloud Ben Butler, @bensbutler, Sr. Mgr., Big Data & HPC July 10, 2014
  • Generation Collection & storage Analytics & computation Collaboration & sharing
  • Big Data: Unconstrained data growth 95% of the 1.2 zettabytes of data in the digital universe is unstructured 70% of this is user-generated content Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62% Source: IDC (scale: GB, TB, PB, EB, ZB)
  • The amount of information generated during the first day of a baby’s life today is equivalent to 70 times the information contained in the Library of Congress
  • Lower cost, higher throughput Generation Collection & storage Analytics & computation Collaboration & sharing
  • Highly constrained Lower cost, higher throughput Generation Collection & storage Analytics & computation Collaboration & sharing
  • [Chart: Data volume, 1990–2020: the gap between generated data and data available for analysis] Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  • Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
  • Accelerated Generation Collection & storage Analytics & computation Collaboration & sharing
  • Technologies and techniques for working productively with data, at any scale. Big Data
  • Big data and AWS Cloud computing Big data Cloud computing Variety, volume, and velocity requiring new tools Variety of compute, storage, and networking options
  • Big data and AWS Cloud computing Big data Cloud computing Potentially massive datasets Massive, virtually unlimited capacity
  • Big data and AWS Cloud computing Big data Cloud computing Iterative, experimental style of data manipulation and analysis Iterative, experimental style of infrastructure deployment/usage
  • Big data and AWS Cloud computing Big data Cloud computing Frequently not steady-state workload; peaks and valleys At its most efficient with highly variable workloads
  • Big data and AWS Cloud computing Big data Cloud computing Absolute performance not as critical as “time to results”; shared resources are a bottleneck Parallel compute projects allow each workgroup to have more autonomy, get faster results
  • One tool to rule them all
  • Use the right tools Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon Redshift Amazon Elastic MapReduce
  • Store anything Object storage Scalable 99.999999999% durability Amazon S3
  • Real-time processing High throughput; elastic Easy to use EMR, S3, Redshift, DynamoDB Integrations Amazon Kinesis
  • NoSQL Database Seamless scalability Zero admin Single digit millisecond latency Amazon DynamoDB
  • Relational data warehouse Massively parallel Petabyte scale Fully managed $1,000/TB/Year Amazon Redshift
  • Try Amazon Redshift with BI & ETL for Free! aws.amazon.com/redshift/free-trial 2 months | 750 hours/month | dw2.large SSD instance 160GB of compressed storage per node Try BI & ETL for free from nine partners at aws.amazon.com/redshift/partners
  • Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  • Amazon EMR now ships with ODBC and JDBC drivers for Hive, Impala, and HBase Easier to use popular BI tools like: Microsoft Excel, Tableau, MicroStrategy, and QlikView ODBC and JDBC drivers now for Amazon EMR
  • The right tools. At the right scale. At the right time.
  • HDFS Amazon EMR
  • HDFS Amazon S3 Amazon DynamoDB Amazon EMR AWS Data Pipeline
  • HDFS Amazon S3 Amazon DynamoDB Amazon EMR Amazon Kinesis AWS Data Pipeline Data Sources
  • HDFS Amazon S3 Amazon DynamoDB Amazon EMR Amazon Kinesis AWS Data Pipeline Data Sources Data management Hadoop Ecosystem analytical tools
  • HDFS Amazon Redshift Amazon RDS Amazon S3 Amazon DynamoDB Amazon EMR Amazon Kinesis AWS Data Pipeline Data management Hadoop Ecosystem analytical tools Data Sources
  • HDFS Amazon Redshift Amazon RDS Amazon S3 Amazon DynamoDB Amazon EMR Amazon Kinesis AWS Data Pipeline Data management Hadoop Ecosystem analytical tools Data Sources AWS Data Pipeline
  • Free steak campaign Disaster recovery Web site & media sharing Facebook app Ground campaign SAP & SharePoint Marketing web site Business line of sight Consumer social app IT operations Mars exploration ops Interactive TV apps Media streaming Consumer social app Facebook page Securities Trading Data Archiving Financial markets analytics Web and mobile apps Big data analytics Digital media Ticket pricing optimization Streaming webcasts Mobile analytics Consumer social app Core IT and media
  • Customer Use Cases of Big Data
  • Dropcam is the biggest inbound video service on the Web More data uploaded per minute than YouTube Petabytes of data processed every month Billions of motion events detected
  • 4 months to production 300% speed gain $500k - $1M in CAPEX saved
  • Cost & Scale: 500MM tweets/day = ~20.8MM tweets/hr; 2k/tweet is ~12MB/sec, need 6 shards, ~1TB/day; $0.015/hour per shard, $0.028/million PUTs; Kinesis cost is $0.765/hour; Redshift cost is $0.850/hour (for a 2TB dw1.xlarge); Total: $1.615/hour
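The arithmetic on this slide can be retraced with a short sketch. It uses only the figures quoted on the slide (2014 prices), and it makes two assumptions that the slide does not state: that "2k" means 2,000 bytes and that each tweet is one PUT. The function names are hypothetical, chosen for illustration.

```python
# Back-of-the-envelope check of the Kinesis + Redshift figures on the slide.
# All prices and volumes are the slide's own numbers; function names are
# hypothetical. Assumptions: "2k" = 2,000 bytes, and one PUT per tweet.

TWEETS_PER_DAY = 500e6
BYTES_PER_TWEET = 2_000
SECONDS_PER_DAY = 86_400


def tweets_per_hour(tweets_per_day: float) -> float:
    """Average hourly tweet rate."""
    return tweets_per_day / 24


def ingest_mb_per_sec(tweets_per_day: float, bytes_per_tweet: int) -> float:
    """Average ingest bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return tweets_per_day * bytes_per_tweet / SECONDS_PER_DAY / 1e6


def kinesis_cost_per_hour(
    tweets_per_day: float,
    shards: int,
    shard_price: float = 0.015,            # $/shard-hour, from the slide
    put_price_per_million: float = 0.028,  # $/million PUTs, from the slide
) -> float:
    """Hourly Kinesis cost: shard-hours plus PUT charges."""
    shard_cost = shards * shard_price
    put_cost = tweets_per_hour(tweets_per_day) / 1e6 * put_price_per_million
    return shard_cost + put_cost


if __name__ == "__main__":
    print(f"tweets/hr : {tweets_per_hour(TWEETS_PER_DAY) / 1e6:.1f}MM")
    print(f"ingest    : {ingest_mb_per_sec(TWEETS_PER_DAY, BYTES_PER_TWEET):.1f} MB/s")
    print(f"volume    : {TWEETS_PER_DAY * BYTES_PER_TWEET / 1e12:.1f} TB/day")
    for shards in (6, 12):
        cost = kinesis_cost_per_hour(TWEETS_PER_DAY, shards)
        print(f"Kinesis with {shards:2d} shards: ${cost:.3f}/hour")
    # The slide's own total: Kinesis $0.765/hour + Redshift $0.850/hour.
    print(f"slide total: ${0.765 + 0.850:.3f}/hour")
```

Under these assumptions the rate, bandwidth, and daily volume match the slide. With the 6 shards quoted, the formula gives about $0.67/hour for Kinesis; the slide's $0.765/hour is closer to what the same formula gives for 12 shards, so the slide may rest on assumptions not shown in the transcript.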
  • http://wefeel.csiro.au/#/
  • “THANKS TO AMAZON WEB SERVICES, WE CAN DELIGHT OUR PLAYERS WORLDWIDE.” Sami Yliharju | Services Lead
  • The Climate Corporation - Weather Insurance for Farms Challenge: Volatile weather is deadly to crops like grapes Solution: Built a predictive model based on freely available data: • 60 years of crop data, • 14 TBs of soil data, and • 1M government Doppler radar points • 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
  • 150B Soil Observations 3M Daily Weather Measurements 850K Precision Rainfall Grids Tracked 200 TB in Amazon S3
  • Foursquare… 33 million users 1.3 million businesses …generates a lot of Data 3.5 billion check-ins 15M+ venues, Terabytes of log data
  • Uses EMR for Evaluation of new features Machine learning Exploratory analysis Daily customer usage reporting Long-term trend analysis
  • Benefits of Amazon EMR Ease-of-Use “We have decreased the processing time for urgent data-analysis” Flexibility To deal with changing requirements & dynamically expand reporting clusters Costs “We have reduced our analytics costs by over 50%”
  • Who is checking in? [Charts: share of check-ins by gender (Female, Male) and by age]
  • When do people go to a place? [Chart: Gorilla Coffee, Gray's Papaya, Amorino; Thursday through Sunday]
  • User Sign-ups
  • Generation Collection & storage Analytics & computation Collaboration & sharing
  • Amazon DynamoDB Amazon RDS Amazon Redshift AWS Direct Connect AWS Storage Gateway AWS Import/Export Amazon Glacier S3 Amazon Kinesis Amazon EMR Generation Collection & storage Analytics & computation Collaboration & sharing
  • Amazon EC2 Amazon EMR Amazon Kinesis Generation Collection & storage Analytics & computation Collaboration & sharing
  • Amazon Redshift Amazon DynamoDB Amazon RDS S3 Amazon EC2 Amazon EMR Amazon CloudFront AWS CloudFormation AWS Data Pipeline Generation Collection & storage Analytics & computation Collaboration & sharing
  • DataXu in the Cloud Yekesa Kosuru, V.P. Technology July 10th 2014
  • What is DataXu? • Digital Marketing Platform, Ad Tech Platform • Real-time Multivariate Decision System • 5th Fastest Growing Private Company in U.S. (Inc 500) • Optimize Digital Marketing Campaigns – …put the right ad campaign in front of the right customer – …find customers who left their site without converting – …find more customers who are likely to convert – …offer insight into who, why, when, where are respondents • 950,000 times per second
  • Big Data, Little Decisions: The Evolution of Real-Time Decision Systems [Chart: decision impact (also proportional to risk) vs. decision rate; volume ~ value] 1. 1990’s – “Should we advertise on the Superbowl? Should we run direct mail this qtr.?” Batch mode 2. 2000’s – “How often can we run a permission-based email mktg. campaign?” Rules-based alerts 3. 2010’s – Millions of decisions and actions taken, all in less than a blink of an eye
  • Real Time Bidding Site Auctions Ads, e.g Google User Opens Browser Goes to Sports Site DataXu Bids (others bid too) DataXu Wins Bid Ad Shown, Page loads
  • Quick Statistics • 950K bid requests per second • Billions of impressions per month, Petabyte of data • 100 ms round trip response time • 100+TB of warehouse data • 3000+ Servers powering the platform
  • Why AWS • Automation, API • Costs, Pay As You Go • Auto Scaling (elasticity – up and down) • All Data in One Place (S3 foundational store) • Improved Testability • Security, Privacy • Disaster Recovery and Business Continuity
  • DataXu Stack Campaign Management Business Intelligence Data Mart Interactive Queries Batch Queries Real Time Bidding System Activity Logs 1st Party 3rd Party Distributed Log Ingestion S3/HDFS Warehouse CDN User Profiles Campaign Metadata ETL Attribution Machine Learning Spend Decision System Audience Calculation Uniques/Segment Big Velocity 950K TPS Big Volume Petabyte of Data Big Variety Data Providers
  • High Level Deployment ON PREMISE SSL Meta Amazon S3 RTB System Elastic Load Balancing Availability Zone Route 53 EC2 Auto scaling Group Volumes AMI Availability Zone Log Ingestion System Machine Learning System Auto scaling Group EMR CloudWatch
  • Traditional Hadoop vs EMR • Traditional Hadoop – Anticipate and provision for peaks – Can't de-couple storage and compute – 75% of the cluster is idle – Data Duplication/Multiple Clusters • EMR to the rescue • Monthly savings of 72% using EMR
  • S3 Provides Linearly Scalable Bandwidth • Big volume workloads involve several datasets together and terabytes of data • Aggregate bandwidth matters • S3 scales pretty linearly S3 Streaming Performance (m1.xlarge @ $0.34/hr) 100 VMs; 9.6GB/s; $34/hr 350 VMs; 28.7GB/s; $119/hr 34 secs per terabyte
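The figures on this slide can be cross-checked with a small sketch. It uses only the numbers quoted (VM counts, aggregate GB/s, and the $0.34/hr m1.xlarge price) and assumes 1 TB = 1,000 GB; the function names are hypothetical.

```python
# Cross-check of the S3 streaming figures quoted on the slide.
# Assumption: 1 TB = 1,000 GB. Function names are illustrative only.

PRICE_PER_VM_HOUR = 0.34  # m1.xlarge price quoted on the slide, in $/hr


def cluster_cost_per_hour(vms: int) -> float:
    """Hourly cost of a cluster of identical VMs."""
    return vms * PRICE_PER_VM_HOUR


def per_vm_mb_per_sec(vms: int, aggregate_gb_per_sec: float) -> float:
    """Average throughput per VM in MB/s."""
    return aggregate_gb_per_sec * 1000 / vms


def seconds_per_terabyte(aggregate_gb_per_sec: float) -> float:
    """Time to stream one terabyte at the given aggregate bandwidth."""
    return 1000 / aggregate_gb_per_sec


if __name__ == "__main__":
    for vms, gbps in [(100, 9.6), (350, 28.7)]:
        print(
            f"{vms} VMs: ${cluster_cost_per_hour(vms):.0f}/hr, "
            f"{per_vm_mb_per_sec(vms, gbps):.0f} MB/s per VM, "
            f"{seconds_per_terabyte(gbps):.1f} s per TB"
        )
```

On these numbers, per-VM throughput drops only from about 96 MB/s to about 82 MB/s as the cluster grows from 100 to 350 VMs, which is what "scales pretty linearly" amounts to here. The 350-VM case works out to just under 35 seconds per terabyte, roughly in line with the slide's "34 secs per terabyte".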
  • Thank You www.dataxu.com Yekesa Kosuru, @ykosuru ykosuru@dataxu.com
  • Getting Started with Big Data on AWS
  • AWS is here to help Solution Architects Professional Services Premium Support AWS Partner Network (APN)
  • aws.amazon.com/partners/competencies/big-data Partner with an AWS Big Data expert
  • https://aws.amazon.com/architecture/ Processing large amounts of parallel data using a scalable cluster AWS Architecture Diagrams
  • Big Data Case Studies Learn from other AWS customers aws.amazon.com/solutions/case-studies/big-data
  • AWS Marketplace AWS Online Software Store aws.amazon.com/marketplace Shop the big data category
  • AWS Public Data Sets Free access to big data sets aws.amazon.com/publicdatasets
  • AWS Grants Program AWS in Education aws.amazon.com/grants
  • AWS Big Data Test Drives APN Partner-provided labs aws.amazon.com/testdrive/bigdata
  • https://aws.amazon.com/training AWS Training & Events Webinars, Bootcamps, and Self-Paced Labs aws.amazon.com/events
  • Big Data on AWS Course on Big Data aws.amazon.com/training/course-descriptions/bigdata
  • reinvent.awsevents.com
  • aws.amazon.com/big-data
  • Thank you! Ben Butler, @bensbutler, Sr. Mgr., Big Data July 10, 2014 – http://aws.amazon.com/big-data