AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha

The volume, velocity and variety of data have changed drastically in the last decade. Everything generates data today, from your customers on social networks to the instances running your web applications. The tools to support collecting, storing, organizing, analyzing and sharing data are all available in a couple of clicks with Amazon Web Services. Attend this session to learn how Big Data in the cloud can help you easily unlock business opportunities hidden in your data today.

Published in: Sports, Technology, Business

Transcript of "AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha"

  1. 1. Abhishek Sinha Business Development Manager sinhaar@amazon.com @abysinha Big Data Analytics
  2. 2. Abhishek Sinha Rajnikant
  3. 3. Customary Rajnikant Joke • How would “Rajni Saar” process big data?
  4. 4. Customary Rajnikant Joke • How would “Rajni Saar” process big data? He could count it on his fingers!
  5. 5. An engineer’s definition When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
  6. 6. What is the challenge with big data?
  7. 7. Generation Collection & storage Analytics & computation Collaboration & sharing
  8. 8. Generation Collection & storage Analytics & computation Collaboration & sharing Lower cost, higher throughput
  9. 9. Generation Collection & storage Analytics & computation Collaboration & sharing Highly constrained Lower cost, higher throughput
  10. 10. [Chart: data volume, generated data versus data available for analysis] Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  11. 11. Big Gap in turning data into actionable information
  12. 12. Amazon Web Services helps remove constraints
  13. 13. 1 instance x 100 hours = 100 instances x 1 hour
  14. 14. Big Data Verticals and Use cases. Media/Advertising: Targeted Advertising, Image and Video Processing. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transactions Analysis. Life Sciences: Genome Analysis. Financial Services: Monte Carlo Simulations, Risk Analysis. Security: Anti-virus, Fraud Detection, Image Recognition. Social Network/Gaming: User Demographics, Usage analysis, In-game metrics.
  15. 15. From data to actionable information
  16. 16. “Who is using our service?”
  17. 17. Identified early mobile usage Invested heavily in mobile development Finding signal in the noise of logs
  18. 18. 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions. In January 2013
  19. 19. Autocomplete Search Recommendations Automatic spelling corrections
  20. 20. “What kind of movies do people like?”
  21. 21. More than 25 Million Streaming Members; 50 Billion Events Per Day; 30 Million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; Device location, time, day, week etc.; Social data
  22. 22. “How do galaxies form?”
  23. 23. 1.42 million images from Hubble
  24. 24. 680,000 volunteers
  25. 25. A spectrum of analytics
  26. 26. K-means clustering Cascade correlation neural networks
  27. 27. A spectrum of cognition
  28. 28. Novice Domain experts
  29. 29. Novice Analytics & machine learning Expert
  30. 30. Stronger than the sum of their parts
  31. 31. 250 million classifications
  32. 32. 25 peer reviewed publications
  33. 33. Big Data tools on AWS
  34. 34. COLLECT | STORE | ANALYSE | SHARE. Services shown: Direct Connect, SQS, Glacier, S3, EC2, Redshift, DynamoDB, Elastic MapReduce, CloudFront, EC2, Data Pipeline, Import/Export, Compute Fleet. Basic building blocks for every workload.
  35. 35. Big Data tools on AWS: In-memory | Hadoop and Friends | Managed Services | MPP Datawarehouse | NoSQL | Scale out processing
  36. 36. Big Data tools on AWS: In-memory | Hadoop and Friends | Managed Services | MPP Datawarehouse | NoSQL | Scale out processing. Tools shown: Hive/Pig/Cascading, Shark/Spark, DynamoDB, HBase, Cassandra, MongoDB .. EC2 .. SAP HANA One .. Treasure Data, Qubole, Splunk Storm, Sumologic, Karmasphere .. Redshift ..
  37. 37. Big Data tools on AWS: In-memory | Hadoop and Friends | Managed Services | MPP Datawarehouse | NoSQL | Scale out processing. Tools shown: Hive/Pig/Cascading, Shark/Spark, DynamoDB, HBase, Cassandra, MongoDB .. EC2 .. SAP HANA One .. Treasure Data, Qubole, Splunk Storm, Sumologic, Karmasphere .. Redshift, Vertica ..
  38. 38. EMR is Hadoop in the Cloud
  39. 39. How does EMR work? Put the data into S3. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS. [Diagram: S3, EMR, EMR Cluster]
  40. 40. What can you run on EMR… S3 EMR EMR Cluster
  41. 41. EMR EMR Cluster Resize Clusters S3 You can easily add and remove nodes
  42. 42. On and Off; Fast Growth; Predictable peaks; Variable peaks. WASTE. CUSTOMER DISSATISFACTION.
  43. 43. Fast Growth; On and Off; Predictable peaks; Variable peaks
  44. 44. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168
  45. 45. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168. Add 10 nodes on spot: 20 node cluster running for 7 hours. Cost = 1.2 * 10 * 7 = $84 (original 10 nodes) + 0.6 * 10 * 7 = $42 (10 spot nodes)
  46. 46. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168. Add 10 nodes on spot: 20 node cluster running for 7 hours. Cost = 1.2 * 10 * 7 = $84 (original 10 nodes) + 0.6 * 10 * 7 = $42 (10 spot nodes) = Total $126. 25% reduction in price, 50% reduction in time
  47. 47. [Chart: capacity over time] Traditional IT capacity; Analytics needs
  48. 48. [Chart: capacity over time] Traditional IT capacity; Reserved Instances
  49. 49. [Chart: capacity over time] Traditional IT capacity; Reserved Instances; On-demand
  50. 50. [Chart: capacity over time] Traditional IT capacity; Reserved Instances; On-demand; Spot
  51. 51. Run the analysis. Run clusters with your data in S3. Data is “streamed” in and intermediate results are stored in HDFS. [Diagram: S3, EMR Cluster]
  52. 52. When done, shut down the cluster. When processing is complete, you can terminate the cluster (and stop paying). [Diagram: EMR Cluster, S3]
  53. 53. You can also run 24/7. If you run your jobs 24 x 7, you can also run a persistent cluster and use Reserved Instance (RI) models to save costs. [Diagram: EMR, EMR Cluster, S3]
  54. 54. Option to use S3 along with HDFS • S3 provides 99.999999999% durability • Elastic • Version control against failure • Run multiple clusters with a single source of truth • Quick recovery from failure • Continuously resize clusters [Diagram: S3, EMR, EMR Cluster]
  55. 55. Which is the data warehouse here? Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
  56. 56. Need faster query response time. SPARK: separate MapReduce-like engine; in-memory data storage for fast query response time; compatible with the Hadoop storage API. SHARK: port of Apache Hive on Spark; compatible with existing Hive metastores; similar speedups of up to 40x
  57. 57. elastic-mapreduce --create --alive --name "Spark/Shark Cluster" --bootstrap-action s3://elasticmapreduce/samples/spark/0.7/install-spark-shark.sh --bootstrap-name "Mesos/Spark/Shark" --instance-type m1.xlarge --instance-count 3
  58. 58. Source https://amplab.cs.berkeley.edu/2013/06/04/comparing-large-scale-query-engines/
  59. 59. Generation Collection & storage Analytics & computation Collaboration & sharing Remove Constraints
  60. 60. Thank You sinhaar@amazon.com aws.amazon.com/elasticmapreduce aws.amazon.com/datapipeline aws.amazon.com/big-data @abysinha
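
The arithmetic on slides 13 and 44 to 46 can be reproduced with a short Python sketch. It is an illustration only: the rates of 1.2 and 0.6 dollars per node-hour are the example figures from the slides rather than real AWS prices, the function and variable names are hypothetical, and the code assumes (as the slides do) that the job scales linearly, so doubling the node count halves the running time.

```python
# Example rates taken from slides 44-46; they are illustrative, not real AWS prices.
ON_DEMAND_RATE = 1.2  # dollars per node-hour
SPOT_RATE = 0.6       # dollars per node-hour (assumed constant; real spot prices fluctuate)


def cluster_cost(nodes, hours, rate):
    """Cost of running `nodes` instances for `hours` hours at `rate` dollars per node-hour."""
    return nodes * hours * rate


# Slide 13: the same number of instance-hours costs the same either way.
one_by_hundred = cluster_cost(1, 100, ON_DEMAND_RATE)
hundred_by_one = cluster_cost(100, 1, ON_DEMAND_RATE)

# Slide 44: 10 nodes at the base rate for 14 hours.
baseline_hours = 14
baseline_cost = cluster_cost(10, baseline_hours, ON_DEMAND_RATE)

# Slides 45-46: add 10 spot nodes. Assuming linear scaling, 20 nodes need half the time.
mixed_hours = baseline_hours / 2
mixed_cost = (cluster_cost(10, mixed_hours, ON_DEMAND_RATE)
              + cluster_cost(10, mixed_hours, SPOT_RATE))

price_reduction = 1 - mixed_cost / baseline_cost
time_reduction = 1 - mixed_hours / baseline_hours

print(f"1 x 100 h: ${one_by_hundred:.0f}, 100 x 1 h: ${hundred_by_one:.0f}")
print(f"Without spot: ${baseline_cost:.0f} in {baseline_hours} h")
print(f"With 10 spot nodes: ${mixed_cost:.0f} in {mixed_hours:.0f} h")
print(f"Price reduction: {price_reduction:.0%}, time reduction: {time_reduction:.0%}")
```

If the workload does not scale linearly, or the spot price rises above the assumed rate, the saving shrinks accordingly.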
