AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
The volume, velocity and variety of data have changed drastically in the last decade. Everything generates data today, from your customers on social networks to the instances running your web applications. The tools for collecting, storing, organizing, analyzing and sharing data are all available in a couple of clicks with Amazon Web Services. Attend this session to learn how Big Data in the cloud can help you easily unlock business opportunities hidden in your data today.

AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha: Presentation Transcript

  • Abhishek Sinha Business Development Manager sinhaar@amazon.com @abysinha Big Data Analytics
  • Abhishek Sinha Rajnikant
  • Customary Rajnikant Joke • How would “Rajni Saar” process big data? He could count it on his fingers!
  • An engineer’s definition When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
  • What is the challenge with big data ?
  • Generation Collection & storage Analytics & computation Collaboration & sharing Highly constrained Lower cost, higher throughput
  • [Chart: data volume, generated data vs. data available for analysis. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares]
  • Big Gap in turning data into actionable information
  • Amazon Web Services helps remove constraints
  • 1 instance x 100 hours = 100 instances x 1 hour
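The equivalence on this slide can be checked with a minimal sketch. It assumes the job parallelizes perfectly and that billing is per instance-hour; the hourly rate is a hypothetical figure chosen only for illustration.

```python
# Sketch of "1 instance x 100 hours = 100 instances x 1 hour".
# Assumptions (not from the slide): perfect parallelism, per-instance-hour
# billing, and a hypothetical hourly rate.

HOURLY_RATE = 1.2  # hypothetical price per instance-hour, in dollars


def instance_hours(instances: int, hours: int) -> int:
    """Total billed instance-hours for a fleet running for a given time."""
    return instances * hours


def cost(instances: int, hours: int, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost of the fleet under per-instance-hour billing."""
    return instance_hours(instances, hours) * hourly_rate


one_slow = cost(instances=1, hours=100)
many_fast = cost(instances=100, hours=1)

# Same instance-hours, hence the same bill, but the answer arrives
# in 1 hour instead of 100.
print(one_slow == many_fast)
```

In practice the speed-up is rarely perfectly linear, so the second configuration is best read as an upper bound on the time saved.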
  • Big Data Verticals and Use cases
      • Media/Advertising: Targeted Advertising; Image and Video Processing
      • Oil & Gas: Seismic Analysis
      • Retail: Recommendations; Transactions Analysis
      • Life Sciences: Genome Analysis
      • Financial Services: Monte Carlo Simulations; Risk Analysis
      • Security: Anti-virus; Fraud Detection; Image Recognition
      • Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
  • From data to actionable information
  • “Who is using our service?”
  • Identified early mobile usage Invested heavily in mobile development Finding signal in the noise of logs
  • 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions. In January 2013
  • Autocomplete Search Recommendations Automatic spelling corrections
  • “What kind of movies do people like ?”
  • More than 25 million streaming members; 50 billion events per day; 30 million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; device location, time, day, week, etc.; social data
  • “How do galaxies form?”
  • 1.42 million images from Hubble
  • 680,000 volunteers
  • A spectrum of analytics
  • K-means clustering Cascade correlation neural networks
  • A spectrum of cognition
  • Novice Domain experts
  • Novice Analytics & machine learning Expert
  • Stronger than the sum of their parts
  • 250 million classifications
  • 25 peer reviewed publications
  • Big Data tools on AWS
  • COLLECT | STORE | ANALYSE | SHARE Direct Connect SQS Glacier S3 EC2 Redshift DynamoDB Elastic Map Reduce CloudFront EC2 Basic building blocks for every workload Data pipeline Import Export Compute Fleet
  • Big Data tools on AWS In-memory Hadoop and Friends Managed Services MPP Data Warehouse NoSQL Scale-out processing Hive/Pig/Cascading Shark/Spark DynamoDB HBase Cassandra MongoDB .. EC2 .. SAP HANA One .. Treasure Data Qubole Splunk Storm Sumo Logic Karmasphere .. Redshift Vertica ..
  • EMR is Hadoop in the Cloud
  • How does EMR work?
      • Put the data into S3
      • Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
      • Launch the cluster using the EMR console, CLI, SDK, or APIs
      • Get the output from S3
      • You can also store everything in HDFS
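As a rough illustration of the choices listed on this slide, the sketch below assembles a cluster request as a plain Python dictionary. The field names follow the shape of the EMR `RunJobFlow` request as exposed by boto3's `run_job_flow`, to the best of my understanding; check them against the current documentation before use. The bucket names, JAR path and job name are hypothetical, and nothing here is sent to AWS.

```python
from typing import Any, Dict


def build_job_flow_request(
    name: str,
    input_uri: str,
    output_uri: str,
    log_uri: str,
    node_count: int,
    instance_type: str = "m1.xlarge",
) -> Dict[str, Any]:
    """Assemble a RunJobFlow-style request; does not contact AWS."""
    if node_count < 2:
        raise ValueError("need one master node plus at least one core node")
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": node_count - 1,
                },
            ],
            # False means the cluster terminates once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-input",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    # Hypothetical custom JAR that reads from and writes to S3.
                    "Jar": "s3://example-bucket/jobs/analysis.jar",
                    "Args": [input_uri, output_uri],
                },
            }
        ],
    }


request = build_job_flow_request(
    name="log-analysis",                      # hypothetical job name
    input_uri="s3://example-bucket/input/",   # hypothetical bucket
    output_uri="s3://example-bucket/output/",
    log_uri="s3://example-bucket/logs/",
    node_count=3,
)
```

With boto3 installed and credentials configured, a dictionary of this shape would typically be passed as keyword arguments to an EMR client's `run_job_flow`; a real request also needs settings omitted here, such as IAM roles and a release or AMI version.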
  • What can you run on EMR… S3 EMR EMR Cluster
  • Resize Clusters: you can easily add and remove nodes [Diagram: S3, EMR, EMR cluster]
  • [Charts: usage patterns] On and Off | Fast Growth | Predictable peaks | Variable peaks [Chart labels: WASTE, CUSTOMER DISSATISFACTION]
  • [Charts: usage patterns] On and Off | Fast Growth | Predictable peaks | Variable peaks
  • Resize Nodes with Spot Instances
      • Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168
      • Add 10 nodes on Spot: 20 node cluster running for 7 hours. Cost = 1.2 * 10 * 7 = $84 (on-demand nodes) + 0.6 * 10 * 7 = $42 (Spot nodes) = Total $126
      • 25% reduction in price, 50% reduction in time
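The Spot arithmetic on this slide can be reproduced with a short sketch. It takes the slide's figures at face value: $1.2 per node-hour on demand, $0.6 per node-hour on Spot, and a run time that halves when the node count doubles. Real Spot prices fluctuate and real jobs rarely scale perfectly, so treat the result as illustrative.

```python
from decimal import Decimal

ON_DEMAND_RATE = Decimal("1.2")  # dollars per node-hour, from the slide
SPOT_RATE = Decimal("0.6")       # dollars per node-hour, from the slide


def cluster_cost(nodes: int, hours: int, rate: Decimal) -> Decimal:
    """Cost of running `nodes` nodes for `hours` hours at a flat hourly rate."""
    return nodes * hours * rate


# Baseline: 10 on-demand nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# With Spot: 10 on-demand + 10 Spot nodes, assumed to finish in 7 hours.
on_demand_part = cluster_cost(10, 7, ON_DEMAND_RATE)
spot_part = cluster_cost(10, 7, SPOT_RATE)
mixed = on_demand_part + spot_part

price_reduction = (baseline - mixed) / baseline
time_reduction = Decimal(14 - 7) / Decimal(14)

print(f"baseline ${baseline}, with Spot ${mixed}")
print(f"price reduction {price_reduction:.0%}, time reduction {time_reduction:.0%}")
```

`Decimal` is used instead of floats so the dollar amounts match the slide exactly.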
  • [Chart: capacity over time] Traditional IT capacity vs. analytics needs
  • [Chart: capacity over time] Traditional IT capacity vs. Reserved Instances, On-demand and Spot
  • Run the analysis: run clusters with your data in S3. Data is “streamed” in and intermediate results are stored in HDFS. (1)
  • When done, shut down the cluster: when processing is complete, you can terminate the cluster (and stop paying). (1)
  • You can also run 24/7: if you run your jobs 24 x 7, you can also run a persistent cluster and use Reserved Instance (RI) models to save costs. (2)
  • Option to use S3 along with HDFS (3)
      • S3 provides 99.999999999% durability
      • Elastic
      • Version control against failure
      • Run multiple clusters with a single source of truth
      • Quick recovery from failure
      • Continuously resize clusters
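To give the durability figure some intuition, the sketch below assumes the 99.999999999% (eleven nines) annual per-object durability that AWS publishes as the design target for S3, and treats object losses as independent. The object count is a hypothetical example.

```python
from fractions import Fraction

# 1 - 0.99999999999: assumed annual probability of losing any one object.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


objects = 10_000_000  # hypothetical number of stored objects
print(years_per_expected_loss(objects))
```

Exact fractions are used because subtracting 0.99999999999 from 1 in floating point does not give exactly 1e-11.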
  • Which is the data warehouse here ?  Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
  • Need faster query response time?
      • Spark: separate MapReduce-like engine; in-memory data storage for fast query response time; compatible with the Hadoop storage API
      • Shark: port of Apache Hive on Spark; compatible with existing Hive metastores; similar speed-ups of up to 40x
  • elastic-mapreduce --create --alive --name "Spark/Shark Cluster" --bootstrap-action s3://elasticmapreduce/samples/spark/0.7/install-spark-shark.sh --bootstrap-name "Mesos/Spark/Shark" --instance-type m1.xlarge --instance-count 3
  • Source: https://amplab.cs.berkeley.edu/2013/06/04/comparing-large-scale-query-engines/
  • Generation Collection & storage Analytics & computation Collaboration & sharing Remove Constraints
  • Thank You sinhaar@amazon.com aws.amazon.com/elasticmapreduce aws.amazon.com/datapipeline aws.amazon.com/big-data @abysinha