2. What is Big Data?
When you do a directory listing on a folder and the lights start to flicker
3. What is Big Data?
When your data sets become so large that you have to start innovating how to Collect, Store, Organize, Analyze, and Share them.
It's tough because of Velocity, Volume, and Variety.
6. Unconstrained Data Growth
GB → TB → PB → EB → ZB
Big Data is now moving fast…
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ Tweets/day
10. Data Volume: generated data vs. data available for analysis
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
11. Cloud Computing
Elastic & highly scalable + no capital expense ($0 up front) + pay-per-use + on-demand = remove constraints
14. Case Study: Razorfish
3.5 Billion records, 71 Million unique cookies, 1.7 Million targeted ads per day
Analyzed customer clicks and impressions with Elastic MapReduce
Targeted ad example: user recently purchased a sports movie and is searching for video games
"…no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired."
"Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
20. What is Amazon Elastic MapReduce (EMR)?
EMR is Hadoop in the cloud. Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines.
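The MapReduce model behind Hadoop can be shown with the canonical word-count example. This is a minimal plain-Python sketch of the map and reduce phases; the real framework distributes these steps across the cluster's machines:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data moves fast"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Hadoop runs many mappers and reducers of exactly this shape in parallel, which is why it scales to data sets no single machine could hold.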
21. How does it work?
1. Put the data into S3 (or HDFS).
2. Launch your cluster. Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig, HBase)
3. Get the results.
(Diagram: S3 ↔ EMR cluster)
22. How does it work?
You can easily resize the cluster.
23. How does it work?
Use Spot Instances to save time and money.
24. How does it work?
Launch parallel clusters against the same data source.
25. How does it work?
When the work is complete, you can terminate the cluster (and stop paying).
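The launch-then-terminate workflow above can be sketched with boto3, the AWS SDK for Python. The bucket name, release label, node count, and instance type below are illustrative assumptions, not values from the deck:

```python
def build_cluster_request(name, node_count, node_type, apps):
    """Assemble the arguments for boto3's emr.run_job_flow call.
    LogUri bucket and ReleaseLabel are hypothetical placeholders."""
    return {
        "Name": name,
        "LogUri": "s3://my-log-bucket/emr-logs/",  # hypothetical bucket
        "ReleaseLabel": "emr-5.36.0",              # assumed EMR release
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # Auto-terminate when the steps finish, so billing stops (slide 25)
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("demo-cluster", 10, "m5.xlarge", ["Hive", "Pig"])
# To actually launch (requires AWS credentials):
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**request)
```

Keeping the request in a plain dict makes the choices from step 2 (node count, node type, Hadoop apps) explicit and easy to vary per job.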
26. Cost to run a 100-node Elastic MapReduce cluster: INR 450/hour ($7.5/h)
27. Photos: renee_mcgurk https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/; Calgary Reviews https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/
28. Each day, AWS adds the equivalent server capacity of a global $7B enterprise.
31. EMR makes it easy to use Hive and Pig
Hive:
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)
• Initially developed at Facebook
Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
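For intuition about what Hive does: a HiveQL statement such as `SELECT page, COUNT(*) FROM clicks GROUP BY page` is compiled into MapReduce jobs for you. A minimal plain-Python equivalent of that grouped count (illustrative only, with made-up rows):

```python
def group_count(rows, key):
    # Equivalent of: SELECT key, COUNT(*) FROM rows GROUP BY key
    counts = {}
    for row in rows:
        k = row[key]
        counts[k] = counts.get(k, 0) + 1
    return counts

clicks = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
print(group_count(clicks, "page"))  # {'home': 2, 'cart': 1}
```

Hive's value is that you write the one-line SQL and it plans the distributed execution; Pig covers the same ground with a data-flow script style instead of SQL.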
32. EMR makes it easy to use HBase and Mahout
HBase:
• Column-oriented database
• Runs on top of HDFS
• Ideal for sparse data
• Random read/write access
• Ideal for very large tables (billions of rows, millions of columns)
Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
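Recommendation mining of the kind Mahout supports is, at its simplest, item co-occurrence counting: recommend what is most often seen alongside an item. A toy plain-Python sketch of the idea (not Mahout's actual implementation, and the baskets are made up):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(baskets):
    # Count how often each pair of items appears in the same basket
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(item, pairs):
    # Rank items by how often they co-occurred with `item`
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [i for i, _ in scores.most_common()]

baskets = [["movie", "game"], ["movie", "game", "popcorn"], ["movie", "popcorn"]]
print(recommend("game", co_occurrence(baskets)))  # ['movie', 'popcorn']
```

Mahout's contribution is running this class of computation (and clustering, classification, frequent itemsets) as distributed jobs over data far too large for one machine.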
33. EMR makes it easy to use Spark and Shark
Spark:
• In-memory MapReduce
• Up to 100x faster than Hadoop
• Accesses HDFS, HBase, S3
• Developed at UC Berkeley
• Install via bootstrap action (BA)
Shark:
• Hive on Spark
• Up to 100x faster
• Compatible with Hive
• Used at Yahoo, Airbnb, etc.
• Install via bootstrap action (BA)
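Spark's speedup comes largely from lazy evaluation plus keeping working sets in memory between passes, instead of re-reading from disk as classic Hadoop MapReduce does. This toy class mirrors the RDD method names (map, filter, cache, collect) to show those two ideas; it is a sketch of the concept, not Spark's implementation:

```python
class RDD:
    """Toy stand-in for Spark's Resilient Distributed Dataset:
    transformations are lazy, and cache() keeps results in memory."""
    def __init__(self, compute):
        self._compute = compute  # deferred computation, run only on collect()
        self._cached = None

    def map(self, f):
        return RDD(lambda: [f(x) for x in self.collect()])

    def filter(self, p):
        return RDD(lambda: [x for x in self.collect() if p(x)])

    def cache(self):
        # Materialize once; later passes reuse the in-memory result
        self._cached = self.collect()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

nums = RDD(lambda: list(range(10))).filter(lambda x: x % 2 == 0).cache()
print(nums.map(lambda x: x * x).collect())  # [0, 4, 16, 36, 64]
print(nums.collect())                       # [0, 2, 4, 6, 8], served from cache
```

In real Spark the cached partitions live in memory across a cluster, which is where the "up to 100x" claims for iterative workloads come from.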
34. EMR makes it easy to use other tools and applications
Ganglia:
• Scalable distributed monitoring
• View performance of the cluster and individual nodes
• Open source
R:
• Language and software environment for statistical computing and graphics
• Open source
35. EMR also supports MapR's Hadoop distributions
In addition to the Amazon Hadoop distribution, EMR supports the MapR M3, M5, and M7 Hadoop distributions.
Key features of MapR:
• NFS access
• No NameNode
• JobTracker high availability
• Cluster mirroring (disaster recovery)
• Compression
39. Amazon Redshift: data warehousing the AWS way
• Easy to provision, scale, and operate
• No up-front costs, pay-as-you-go
• Fast: columnar storage, compression, specialized nodes
• 1–100 node clusters, 2 TB – 1.6 PB
• $999 per TB per year
• Transparent backups, restore, failover
• Security in transit, at rest, for backups; VPC support
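The "columnar storage, compression" point can be illustrated with run-length encoding, one encoding that works very well on sorted, low-cardinality columns. This is a simplified sketch of the idea, not Redshift's actual codec:

```python
def rle_encode(column):
    # Collapse runs of repeated values into [value, run_length] pairs
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A low-cardinality column (e.g. country codes) compresses dramatically
country = ["IN"] * 4 + ["US"] * 3 + ["DE"] * 2
print(rle_encode(country))  # [['IN', 4], ['US', 3], ['DE', 2]]
```

Storing data column by column groups identical values together, which is what makes encodings like this effective and lets scans touch only the columns a query needs.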
45. 3 million queries (100 × 30 thousand) are executed per hour on the 100 HANA instances.
46. Each SAP HANA instance is deployed on an Amazon Web Services (AWS) Linux server. Each AWS server has 16 cores, 60 GB of RAM, and 600 million rows (total: 1,776 cores, 6.6 TB RAM, and 60 billion rows).