2. Human Genome Project
Collaborative project to sequence every single letter
of the human genetic code.
13 years and billions of dollars to complete.
Gigabyte scale datasets (transferred between sites on
iPods!)
3. Beyond the Human Genome
45+ species sequenced: mouse, rat, gorilla, rabbit,
platypus, nematode, zebrafish...
Compare genomes between species to identify
biologically interesting areas of the genome.
100GB-scale datasets. Increased computational
requirements.
4. The Next Generation
New sequencing instruments lead to a dramatic
drop in cost and time required to sequence a genome.
Sequence and compare genetic code of individuals to
find areas of variation. Much more interesting.
Terabyte scale datasets. Significant computational
requirements.
5. The 1000 Genomes Project
Public/private consortium to build world’s largest
collection of human genetic variation.
Hugely important dataset to drive new insight into
known genetic traits, and the identification of new ones.
Vast, complex data and computational resources required,
beyond reach of most research groups and hospitals.
6. 1000 Genomes in the Cloud
The 1000 Genomes data made available to all on AWS.
Stored for free as part of the Public Datasets program.
Updated regularly.
200TB. 1,700 individual genomes. As much compute and
storage as required, available to all.
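As a concrete illustration, here is a minimal sketch of browsing the public dataset from Python with boto3 (the successor to the boto library mentioned in the notes later in this deck). The bucket name "1000genomes" and anonymous read access are assumptions about how the public dataset is published.

# Minimal sketch: list a few objects from the 1000 Genomes public dataset on S3.
# Assumptions: the public bucket is named "1000genomes" and allows anonymous reads.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))  # no AWS credentials needed

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])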
17. Challenge: To run a virtual screen with a higher
accuracy algorithm & 21 million compounds
19. Metric / Count
Compute Hours of Work: 109,927 hours
Compute Days of Work: 4,580 days
Compute Years of Work: 12.55 years
Ligand Count: ~21 million ligands
Using CycleCloud & Amazon Cloud, the impossible run finished in...
32. Big Data powered by AWS
Big Data: the collection and analysis of large amounts of data to create a competitive advantage.
33. Big Data powered by AWS
Big Data Verticals
Media/Advertising: targeted advertising, image and video processing
Oil & Gas: seismic analysis
Retail: recommendations, transaction analysis
Life Sciences: genome analysis
Financial Services: Monte Carlo simulations, risk analysis
Security: anti-virus, fraud detection, image recognition
Social Network/Gaming: user demographics, usage analysis, in-game metrics
34. Big Data powered by AWS
Storage Big Data Compute
Challenges start at relatively small volumes: from 100 GB up to 1,000 PB.
35. Big Data powered by AWS
Storage Big Data Compute
When data sets and data analytics need to scale to the
point that you have to start innovating around how to
collect, store, organize, analyze and share it
37. Big Data powered by AWS
Storage Innovation Compute
Storage: S3, DynamoDB, Glacier
Compute: HPC, EMR, Spot
38. Storage Big Data Compute
Unconstrained data growth: from GB through TB, PB and EB to ZB.
95% of the 1.2 zettabytes of data in the digital universe is unstructured.
70% of this is user-generated content.
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008–2012.
Source: IDC
39. Storage Big Data Compute
Why now?
Web sites: blogs, reviews, emails, pictures
Sensor data: weather, water, smart grids
Social graphs: Facebook, LinkedIn, contacts
Images/videos: traffic, security cameras
Application server logs: web sites, games
Twitter: 50m tweets/day, 1,400% growth per year
40. Storage Big Data Compute
Why now?
Mobile connected world (more people using, easier to collect)
41. Storage Big Data Compute
Why now?
More aspects of data (variety, depth, location, frequency)
42. Storage Big Data Compute
Why now?
Possible to understand (not just answer specific questions)
43. Storage Big Data Compute
Why now?
Who is your consumer really?
What do people really like?
What is happening socially with your products?
How do people really use your product?
44. Storage Big Data Compute
Why now?
More data => better results
53. Storage Big Data Compute
Where do you put your slice of it?
Collection / Ingestion:
AWS Direct Connect: dedicated bandwidth between your site and AWS
AWS Import/Export: physical transfer of media into and out of AWS
Queuing: reliable messaging for task distribution & collection
Amazon Storage Gateway: shrink-wrapped gateway for volume synchronization
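For the queuing piece, a minimal sketch of task distribution with Amazon SQS via boto3 follows; the queue name and the S3 path in the message are placeholders for illustration.

# Minimal sketch: reliable task distribution with Amazon SQS (boto3).
# The queue name "ingest-tasks" and the message body are placeholders.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="ingest-tasks")

# Producer: hand a collection task to the worker fleet.
queue.send_message(MessageBody="s3://my-bucket/raw/clicklog-2012-06-01.gz")

# Worker: pull a task, process it, then delete it so it is not redelivered.
for msg in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=1):
    print("processing", msg.body)
    msg.delete()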
54. Storage Big Data Compute
Where do you put your slice of it?
Relational Database Service (RDS): fully managed database (MySQL, Oracle, MSSQL)
DynamoDB: NoSQL, schemaless, provisioned-throughput database
Simple Storage Service (S3): object datastore, up to 5TB per object, 99.999999999% durability
55. Storage Big Data Compute
Where do you put your slice of it?
Glacier
Long term cold storage
From $0.01 per GB/Month
99.999999999% durability
56. Storage Big Data Compute
Glacier: full lifecycle big data management
Data import: physical shipping of devices for creation of data in AWS (e.g. 50TB of seismic data created as EBS volumes in a Gluster file system)
Computation & visualization: HPC & EMR cluster jobs of many thousands of cores (e.g. 200TB of visualization data generated from cluster processing)
Long term archive: once data analysis is complete, the entire resultant dataset is placed in cold storage rather than tape (cost effective compared to tape; retrieval in 3-5 hours if required)
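One common way to do the "archive instead of tape" step is an S3 lifecycle rule that transitions result objects to the Glacier storage class. The sketch below uses boto3; the bucket name, prefix and 30-day threshold are illustrative assumptions.

# Minimal sketch: transition analysis results to Glacier automatically via an
# S3 lifecycle rule. Bucket name, prefix and the 30-day threshold are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-results-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-results",
                "Filter": {"Prefix": "results/"},
                "Status": "Enabled",
                # After 30 days, move objects into cold storage instead of tape.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)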
57. Storage Big Data Compute
How quickly do you need to read it?
DynamoDB (single-digit ms): social-scale applications; provisioned throughput performance; flexible consistency models
S3 (10s-100s ms): any object, any app; 99.999999999% durability; objects up to 5TB in size
Glacier (<5 hours): media & asset archives; extremely low cost; S3 levels of durability
Performance / Scale / Price
58. Storage Big Data Compute
Operate at any scale
Unlimited data
Performance / Scale / Price
59. Storage Big Data Compute
Pay for only what you use
Provisioned IOPS: provisioned read/write performance per DynamoDB table or EBS volume; pay for a given provisioned capacity whether used or not
Volume used: pay for volume stored per month plus puts/gets; no capacity planning required to maintain unlimited storage
Performance / Scale / Price
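As an illustration of provisioning read/write capacity up front, here is a minimal boto3 sketch that creates a DynamoDB table with explicit provisioned throughput; the table name, key schema and capacity numbers are assumptions, not values from this deck.

# Minimal sketch: a DynamoDB table with explicitly provisioned read/write capacity.
# Table name, key schema and the capacity units are illustrative assumptions.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="click-events",
    AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
    # You pay for this capacity whether or not it is used, so size it to the workload.
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 500},
)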
60. Storage Big Data Compute
"Big data" changes the dynamics of computation and data sharing
Collection: How do I acquire it? Where do I put it?
Computation: What horsepower can I apply to it?
Collaboration: How do I work with others on it?
61. Storage Big Data Compute
"Big data" changes the dynamics of computation and data sharing
Collection: How do I acquire it? Where do I put it? (Direct Connect, Import/Export, S3, DynamoDB)
Computation: What horsepower can I apply to it? (EC2, GPUs, Elastic MapReduce)
Collaboration: How do I work with others on it? (CloudFormation, Simple Workflow, S3)
64. Storage Big Data Compute
Hadoop-as-a-Service – Elastic MapReduce
Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Integrates with instance types such
as spot
65. Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Integrates with instance types such as spot
Feature / Details
Scalable: use as many or as few compute instances running Hadoop as you want; modify the number of instances while your job flow is running
Integrated with other services: works seamlessly with S3 as origin and output; integrates with DynamoDB
Comprehensive: supports languages such as Hive and Pig for defining analytics, and allows complex definitions in Cascading, Java, Ruby, Perl, Python, PHP, R, or C++
Cost effective: works with Spot instance types
Monitoring: monitor job flows from within the management console
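A minimal sketch of starting an EMR job flow programmatically with boto3 follows; the cluster name, release label, instance types and counts, S3 paths and the use of the default EMR roles are all illustrative assumptions, not values from this deck.

# Minimal sketch: launch a small EMR (Hadoop) cluster with one streaming step.
# Names, release label, instance sizes/counts and S3 paths are assumptions.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="clicklog-analysis",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                    # can be resized while the job flow runs
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "aggregate-clicks",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://my-bucket/clicklog/",
                     "-output", "s3://my-bucket/output/",
                     "-mapper", "mapper.py", "-reducer", "reducer.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)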
77. Very large click log (e.g. TBs): lots of actions by John Smith.
78. Split the log into many small pieces.
79. Process in an EMR cluster.
80. Aggregate the results from all the nodes.
81. The result: what John Smith did.
82. Insight in a fraction of the time.
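The same split/process/aggregate flow can be expressed in a few lines with mrjob, the open-source library mentioned in the Yelp notes later in this deck. A minimal sketch follows; the tab-separated log layout and field names are assumptions for illustration.

# Minimal sketch: aggregate everything one user did from a huge click log,
# using mrjob. The "user_id<TAB>action" log format is an assumption.
from mrjob.job import MRJob

class UserActions(MRJob):
    def mapper(self, _, line):
        # Each mapper sees one small piece of the split log.
        parts = line.split("\t", 1)
        if len(parts) == 2:
            user_id, action = parts
            yield user_id, action

    def reducer(self, user_id, actions):
        # Aggregation step: collect what this user did across all the pieces.
        yield user_id, list(actions)

if __name__ == "__main__":
    UserActions.run()

Run locally for testing, or point it at EMR with something like: python user_actions.py -r emr s3://my-bucket/clicklog/ (paths and bucket are placeholders).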
86. Features powered by Amazon Elastic
MapReduce:
People Who Viewed this Also Viewed
Review highlights
Auto complete as you type on search
Search spelling suggestions
Top searches
Ads
200 Elastic MapReduce jobs per day
Processing 3TB of data
89. Storage Big Data Compute
Hadoop-as-a-Service – Elastic MapReduce
"With Amazon Elastic MapReduce, there
was no upfront investment in hardware, no
hardware procurement delay, and no need
to hire additional operations staff.
Because of the flexibility of the
platform, our first new online advertising
campaign experienced a 500% increase in
return on ad spend from a similar
campaign a year before.”
90. Data Analytics
3.5 billion records
71 million unique cookies
1.7 million targeted ads required per day
Execute batch processing on data sets ranging in size from dozens of Gigabytes to Terabytes.
Building in-house infrastructure to analyze these click stream datasets requires investment in expensive "headroom" to handle peak demand.
"Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
Example: a user who recently purchased a sports movie and is searching for video games receives a targeted ad (1.7 million per day).
92. "AWS gave us the flexibility to bring a massive amount of capacity online in a short period of time and allowed us to do so in an operationally straightforward way. AWS is now Shazam's cloud provider of choice."
Jason Titus, CTO
DynamoDB: over 500,000 writes per second
Amazon EMR: more than 1 million writes per second
97. Step 1: Tracking. We've created a unique tracking application. It keeps track of all websites visited, software used, and/or ads seen.
Step 2: Panel. We invite members of a research panel to install it. We know not only their digital habits, but also their offline demographics and behavior.
Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real time. Log in, and create beautiful visualizations and useful reports.
100. Rediff uses Amazon EMR along with Amazon S3 to
perform data mining, log processing and analytics for
their online business. Inputs gained are used to power
a better user experience on their portal.
Rediff needed 12-15 hours to run this on a 10-12 node
cluster on premises. AWS gave the choice and flexibility of
an on-demand model that can be scaled up and
down, and shortened the time required to process the data.
102. More than 25 Million Streaming Members
50 Billion Events Per Day
Elasticity works from just 1 EC2 instance to many thousands. Just dial up and down as required.
Horizontal scaling on commodity hardware. Perfect for Hadoop.
We’ve been operating the service for over 3 years now and in the last year alone we’ve operated over 2 MILLION Hadoop clusters
Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site and in total, more than 14 million reviews have been posted by yelpers.

Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also features a wide range of other features that help people discover new businesses (lists, special offers, and events), and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers.

The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP. Local search advertising makes up the majority of Yelp's revenue stream. The search ads are colored light orange and clearly labeled "Sponsored Results." Paying advertisers are not allowed to change or re-order their reviews.

Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. "We were running out of hard drive space and capacity on our Hadoop cluster," says Yelp search and data-mining engineer Dave Marin.

Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed this Also Viewed, review highlights, auto-complete as you type on search, search spelling suggestions, and top searches.

Their jobs are written exclusively in Python, while Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for AWS technical support that helped with their Hadoop application development.

Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. "With AWS, our developers can now do things they couldn't before," says Marin. "Our systems team can focus their energies on other challenges."
The more misspelled words you collect from your customers, the better the spellcheck application you can create. Yelp is using AWS services to regularly process customer-generated data to improve spell check on their web site.
The more searches you collect, the better the recommendations you can provide. Yelp is using AWS services to deliver features such as hotel or restaurant recommendations, review highlights and search hints.
AWS Case Study: Razorfish

Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user's interests. Algorithms are then implemented on systems that can batch execute at the appropriate scale against current data sets ranging in size from dozens of Gigabytes to Terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time.

A common issue Razorfish has found with customer segmentation is the need to process gigantic click stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click stream datasets requires investment in expensive "headroom" to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can't scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure.

Prior to implementing the AWS-based solution, Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run even after the potential investment in human and computing resources.

To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price-sensitive clients, Razorfish decided to move away from their rigid data infrastructure status quo. This migration helped Razorfish process vast amounts of data to handle the need for rapid scaling at both the application and infrastructure levels.
Razorfish selected Ad Serving integration, Amazon Web Services (AWS), Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits:

Efficient: Elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays. Amazon Elastic MapReduce and Cascading lets Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.
Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms.
Flexible: Hadoop with Cascading is flexible enough to allow "agile" implementation and unit testing of sophisticated algorithms.
Adaptable: Cascading simplifies the integration of Hadoop with external ad systems.
Scalable: AWS infrastructure helps Razorfish reliably store and process huge (Petabytes) data sets.

The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, "With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
Big data, the term for scanning loads of information for possibly profitable patterns, is a growing sector of corporate technology. Mostly people think in terms of online behavior, like mouse clicks, LinkedIn affiliations and Amazon shopping choices. But other big databases in the real world, lying around for years, are there to exploit.

A company called the Climate Corporation was formed in 2006 by two former Google employees who wanted to make use of the vast amount of free data published by the National Weather Service on heat and precipitation patterns around the country. At first they called the company WeatherBill, and used the data to sell insurance to businesses that depended heavily on the weather, from ski resorts and miniature golf courses to house painters and farmers.

It did pretty well, raising more than $50 million from the likes of Google Ventures, Khosla Ventures, and Allen & Company. The problem was, it was hard to sell insurance policies to so many little businesses, even using an online shopping model. People like having their insurance explained. The answer was to get even more data, and focus on the agriculture market through the same sales force that sells federal crop insurance.

"We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture," says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. "We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat."

The product, insurance against things like drought, too much rain at the planting or the harvest, or an early freeze, is sold through 10,000 agents nationwide. The Climate Corporation, which also added Byron Dorgan, the former senator from North Dakota, to its board on Tuesday, will very likely get into insurance for specialty crops like tomatoes and grapes, which do not have federal insurance.

Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product. Mr. Friedberg was an early member of the corporate development team at Google. The co-founder, Siraj Khaliq, worked in distributed computing, which involves apportioning big data computing problems across multiple machines. He works as the Climate Corporation's chief technical officer. Out of the staff of 60 in the company's San Francisco office (another 30 work in the field) about 12 have doctorates, in areas like environmental science and applied mathematics. "They like that this is a real-world problem, not just clicks on a Web site," Mr. Friedberg says.

He figures that the Climate Corporation is one of the world's largest users of MapReduce, an increasingly popular software technique for making sense of very large data systems. The number crunching is performed on Amazon.com's Amazon Web Services computers. The Climate Corporation is working with data intended to judge how different crops will react to certain soils, water and heat. It might be valuable to commodities traders as well, but Mr. Friedberg figures the better business is to expand in farming. Besides the other crops, he is looking at offering the service in Canada and Brazil, or anywhere else that he can get decent long-term data.
It's unlikely he'll get the quality he got from the federal government, for a price anywhere near "free."

The Climate Corporation

Key Takeaways
Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2000+ cores. This results in significantly improved productivity with lower operating costs.

Solution
Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.

All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce into their own dedicated clusters. Depending on the size of data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2000 processor cores are utilized for risk analysis and climate modeling workflows.

Benefits
By utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data.

The Climate Corporation is able to effectively manage its costs by using Amazon Elastic MapReduce and using dedicated cluster resources for each workflow individually. This allows them to utilize the resources only when they are needed, and not have to invest in hardware resources and systems administrators to manage their own private shared cluster where they'd have to optimize their workflows and schedule them to avoid resource contention.

Furthermore, Cascading provides data scientists at The Climate Corporation a common foundation for creating both their batch-oriented machine learning workflows in Cascalog, and Hadoop streaming workflows directly in Cascading. These applications are developed locally on the developers' desktops, and then get instantly deployed onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes the amount of iterative utilization of the cluster resources, thus allowing The Climate Corporation to manage its costs by utilizing the infrastructure for productive data processing only.
In 2009, Etsy acquired Adtuitive, a startup Internet advertising company. Adtuitive's ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive's configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. Amazon Elastic MapReduce runs on a custom domain-specific language that uses the Cascading application programming interface.

Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it's ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis. The combination of Amazon's products and Etsy's syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, "the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware."
To learn more, visit http://www.yelp.com/. To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
"Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we're the Lonely Planet of the digital world."
Kamek is a server created by Wakoopa that makes metrics (such as bounce-rate or pageviews) out of millions of visits and visitors, all in a couple of seconds, all in real-time.
Netflix has more than 25 million streaming members and is growing rapidly. Their end users stream movies and TV shows from smart TVs, laptops, phones, and tablets, resulting in over 50 billion events per day.
Netflix stores all of this data in Amazon S3, approximately 1 Petabyte.
AWS Case Study: Ticketmaster and MarketShare

The Business Challenges
The Pricemaster application is a web-based tool designed to optimize live event ticket pricing, improve yield management and generate incremental revenue. The tool takes a holistic approach to maximizing ticket revenue: it optimizes pre-sale and initial pricing all the way through dynamic pricing post on-sale. However, before development could begin, MarketShare had to find an infrastructure that could support the application's dual challenges: limited upfront capital and managing the fluctuating nature of analytic workloads.

Amazon Web Services
After examining their options, MarketShare decided to power Pricemaster using Amazon Web Services (AWS). The AWS feature stack provides the scalability, usability, and on-demand pricing required to support the application's intricate cluster architecture and complex MATLAB simulations. Pricemaster's AWS environment includes four large and extra large Amazon EC2 instances supporting a variety of nodes. The pricing application's Amazon EC2 instances are connected to a central database within Amazon RDS. In addition, Pricemaster's AWS infrastructure includes Amazon ELB for traffic distribution, Amazon SimpleDB for non-relational data storage, Amazon Elastic MapReduce for large-scale data processing, as well as Amazon SES. The Pricemaster team monitors all of these resources with Amazon CloudWatch.

The Business Benefits
The Pricemaster team credits AWS's ease of use, specifically that of Amazon Elastic MapReduce and Amazon RDS, with reducing its developers' infrastructure management time by three hours per day—valuable hours the developers can now spend expanding the capabilities of the Pricemaster solution. With AWS's on-demand pricing, MarketShare also estimates that it reduces costs by over 80% annually, compared to fixed service costs. As the Pricemaster tool continues to grow, the company anticipates even further savings with Amazon Web Services. MarketShare continues to expand its use of AWS for partners such as Ticketmaster, saving time and money and providing a superior solution that is flexible, secure and scalable.
For example, one of our customers, FourSquare, has built this visualization of customer sign-ups from November 2008 to June 2011. This visualization helps understand global service adoption over time. You can create similar visualizations with packages such as gplot or the R graphics package.
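For consistency with the other examples in these notes, here is a minimal Python/matplotlib sketch of a cumulative sign-up chart of the kind described above; the monthly sign-up counts are made up purely for illustration.

# Minimal sketch of a cumulative sign-up visualization like the one described
# above, using matplotlib with hypothetical monthly sign-up counts.
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 32)                       # roughly Nov 2008 to Jun 2011, monthly
signups = np.random.poisson(lam=5000, size=31)  # hypothetical new sign-ups per month

plt.plot(months, np.cumsum(signups))
plt.xlabel("Months since November 2008")
plt.ylabel("Cumulative sign-ups")
plt.title("Service adoption over time (illustrative data)")
plt.show()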