Big Data in the Cloud
Ganesh Raja
Solutions Architect AWS
graja@amazon.com
©Amazon.com, Inc. and its affiliates. All rights reserved.
When you do a directory listing on a folder and the lights
start to flicker
What is Big Data?
When your data sets become
so large that you have to start innovating how to Collect,
Store, Organize, Analyze and Share it
It's tough because of
Velocity, Volume and Variety
What is Big Data?
Big Data
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The cost of data generation
is falling
GB TB
PB
EB
ZB
Big Data is now moving fast …
• IT/ Application server logs
IT Infrastructure logs, Metering,
Audit logs, Change logs
• Web sites / Mobile Apps/ Ads
Clickstream, User Engagement
• Sensor data
Weather, Smart Grids, Wearables
• Social Media, User Content
450MM+ Tweets/day
The volume of data
is increasing
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Lower cost,
higher throughput
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Lower cost,
higher throughput
Highly
constrained
Data Volume
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Generated data
Available for analysis
Elastic & Highly Scalable
+
No capital expense
+
Pay-per-use
+
On-demand
Cloud Computing
$0
= Remove constraints
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Accelerated
3.5 Billion records, 71 Million unique cookies, 1.7 Million targeted ads per day
Analyzed customer clicks and impressions with Elastic MapReduce
“…no upfront investment in hardware, no hardware procurement delay, and no
additional operations staff was hired.”
“Because of the richness of the algorithm and the flexibility of the platform to support it
at scale, our first client campaign experienced a 500% increase in their return on
ad spend from a similar campaign a year before.”
Targeted Ad
User recently
purchased a sports
movie and is
searching for video
games
Case Study: Razorfish
Results:
500% return on ad spend
From 2 months procurement
time to a minute
How?
Big Data Technology
Technologies and techniques for
working productively with data,
at any scale
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
AWS
Data Pipeline
Amazon S3,
Amazon Glacier,
AWS Storage Gateway,
Amazon DynamoDB,
Amazon Redshift,
Amazon RDS,
Amazon Kinesis
Amazon EC2,
Amazon Elastic
MapReduce
Amazon EC2
Amazon S3,
Amazon Redshift,
Amazon RDS
Amazon
Simple Storage Service
(S3)
EMR is Hadoop in the Cloud
Hadoop is an open-source framework for
parallel processing huge amounts of data
on a cluster of machines
What is Amazon Elastic MapReduce (EMR)?
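The MapReduce model the slide describes can be sketched in plain Python, with no Hadoop cluster at all, as a toy word count: the map phase emits key/value pairs, a sort stands in for the shuffle, and the reduce phase aggregates per key.

```python
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # reduce: sum the counts for each word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
# "shuffle": gather all mapper output and sort it by key
mapped = sorted(kv for line in lines for kv in map_phase(line))
word_counts = reduce_phase(mapped)
print(word_counts["the"])  # → 2
```

Hadoop does exactly this, but with the map and reduce functions running in parallel across a cluster and the shuffle moving data between machines.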
How does it work?
1. Put the data
into S3 (or HDFS)
2. Launch your cluster.
Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig,
HBase)
3. Get the results.
S3 EMR Cluster
How does it work?
S3 EMR Cluster
You can easily
resize the cluster.
How does it work?
S3 EMR Cluster
Use Spot instances
to save time and
money.
How does it work?
S3 EMR Cluster
Launch parallel
clusters against the
same data source
How does it work?
S3 EMR Cluster
When the work is
complete, you can
terminate the
cluster (and stop
paying)
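The launch choices above (Hadoop distribution, node count, node type, applications) map onto a single API request. A minimal sketch of how such a request might be assembled, shaped like the request that boto3's EMR `run_job_flow` call accepts; the cluster name, release label, and role names here are illustrative placeholders, not values from the slides.

```python
def build_cluster_request(name, node_count, node_type, apps):
    """Assemble an EMR launch request from the choices on the slide.

    The dict mirrors the shape of boto3's emr.run_job_flow(**request);
    actually launching it would need AWS credentials and an EMR client.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",          # placeholder distribution label
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # terminate when the steps finish, so you stop paying
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Applications": [{"Name": app} for app in apps],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assuming the default EMR roles
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("demo-cluster", 100, "m1.large", ["Hive", "Pig"])
```

Resizing and terminating are equally small API calls against the same cluster ID, which is what makes the "pay only while it runs" pattern practical.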
Cost to run a 100 node Elastic MapReduce Cluster
INR 450/hour
($7.5/h)
Cost to run a 100 node Elastic MapReduce Cluster
Photos: renee_mcgurk https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/
Calgary Reviews https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/
Each day AWS adds the equivalent
server capacity to a global, $7B
enterprise
Thousands of Customers, 5+ Million Clusters
The Zoo
Apache
Kafka
Amazon
Kinesis
Apache
Flume
Storm
Apache
Spark
Apache
Spark
Streaming
Hadoop/
EMR
Redshift S3
DynamoDB
Hive Pig Shark
HDFS
Impala
?
EMR makes it easy to use Hive and Pig
Pig:
• High-level programming
language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data Warehouse for Hadoop
• SQL-like query language (HiveQL)
• Initially developed at Facebook
HBase:
• Column-oriented database
• Runs on top of HDFS
• Ideal for sparse data
• Random, read/write access
• Ideal for very large tables (billions of
rows, millions of columns)
Mahout:
• Machine learning library
• Supports recommendation
mining, clustering,
classification, and frequent
itemset mining
EMR makes it easy to use HBase and Mahout
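At its core, Mahout-style recommendation mining is co-occurrence counting at scale. A toy, in-memory sketch of the same idea (the baskets and item names are invented for illustration): count how often pairs of items appear together, then recommend the items that co-occur most with a given item.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(baskets):
    # count, for every pair of items, how many baskets contain both
    co = defaultdict(int)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[(a, b)] += 1
    return co

def recommend(co, item, k=2):
    # score every other item by its co-occurrence count with `item`
    scores = defaultdict(int)
    for (a, b), n in co.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [i for i, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
co = cooccurrence(baskets)
print(recommend(co, "milk"))  # → ['bread', 'eggs']
```

Mahout runs the same counting as MapReduce jobs, which is why it fits naturally on an EMR cluster.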
EMR makes it easy to use Spark and Shark
Shark:
• Hive on Spark
• Up to 100x faster
• Compatible with Hive
• Used at Yahoo, Airbnb, etc.
• Download via bootstrap action (BA)
Spark:
• In-memory MapReduce
• Up to 100x faster than Hadoop
• Access HDFS, HBase, S3
• Developed at UC Berkeley
• Download via bootstrap action (BA)
Ganglia
• Scalable distributed monitoring
• View performance of the cluster and
individual nodes
• Open source
R:
• Language and software
environment for statistical
computing and graphics
• Open source
EMR makes it easy to use other tools and
applications
EMR Also Supports MapR’s Hadoop Distributions
In addition to the Amazon Hadoop distribution, EMR supports
three MapR Hadoop distributions:
M3
M5
M7
Key features of MapR
NFS
No NameNode
JobTracker high availability
Cluster mirroring (disaster recovery)
Compression
Integrates with Hadoop Eco-System
EMR
Amazon S3 Amazon EMR
Amazon S3 Amazon EMR Amazon Redshift
Amazon Redshift
• Easy to provision, scale, operate
• No up-front costs, pay-as-you-go
• Fast: Columnar storage, compression,
Specialized nodes
• 1-100 node clusters 2TB - 1.6PB
• $999 per TB per year
• Transparent backups, restore, failover
• Security in transit, at rest, for backups, VPC
Data Warehousing the AWS way
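The quoted $999 per TB per year makes back-of-envelope sizing easy. A tiny calculator using only the figures on the slide (the effective per-TB rate is the slide's advertised price, not a current one):

```python
def redshift_annual_cost(compressed_tb, rate_per_tb_year=999):
    # effective price quoted on the slide: $999 per TB per year
    return compressed_tb * rate_per_tb_year

# the slide's range: a 2 TB entry cluster up to 1.6 PB
print(redshift_annual_cost(2))     # → 1998
print(redshift_annual_cost(1600))  # → 1598400
```

Because storage is columnar and compressed, the billable terabytes can be well below the raw data size.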
Amazon S3 Amazon EMR Amazon Redshift
Amazon S3 Amazon EMR Amazon Redshift
Reporting
+BI
Amazon Redshift
AWS
Marketplace
AWS BI Partners
Amazon Redshift
Reporting
+BI
On-premise
Systems
“3 million (100 * 30 thousand) queries are executed per hour on 100
HANA instances.
Each SAP HANA instance is deployed on an Amazon Web Services (AWS) Linux
server.
Each AWS server has 16 cores, 60 GB of RAM and 600 million rows
(Total: 1,776 cores, 6.6 TB RAM and 60 billion rows).”
Rolls-Royce performance for the cost of a Ford. USD 400 / Hour
Experiment
often
Fail quickly,
at low cost
More
Innovation
Data is the
New Oil
How does Rajinikanth do Big Data?
Thank you!
Ganesh Raja
Solutions Architect AWS
graja@amazon.com
©Amazon.com, Inc. and its affiliates. All rights reserved.
