2. What is Big Data?
When you do a directory listing on a folder and the lights start to flicker
3. What is Big Data?
When your data sets become so large that you have to start innovating how to Collect, Store, Organize, Analyze, and Share them.
It's tough because of Velocity, Volume, and Variety.
6. Unconstrained Data Growth
GB → TB → PB → EB → ZB
Big Data is now moving fast…
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ Tweets/day
10. Data Volume: generated data vs. data available for analysis
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
11. Cloud Computing
Elastic & highly scalable + no capital expense ($0 up front) + pay-per-use + on-demand = remove constraints
14. Case Study: Razorfish
3.5 Billion records, 71 Million unique cookies, 1.7 Million targeted ads per day
Analyzed customer clicks and impressions with Elastic MapReduce
Targeted ad example: user recently purchased a sports movie and is searching for video games
"…no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired."
"Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
20. What is Amazon Elastic MapReduce (EMR)?
EMR is Hadoop in the cloud. Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines.
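The MapReduce model behind Hadoop can be shown with the canonical word-count example. This is a minimal plain-Python sketch of the map and reduce phases; the real framework distributes these steps across the cluster's machines:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data moves fast"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Hadoop runs many mappers and reducers of exactly this shape in parallel, which is why it scales to data sets no single machine could hold.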
21. How does it work?
1. Put the data into S3 (or HDFS).
2. Launch your cluster. Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig, HBase)
3. Get the results.
(Diagram: S3 ↔ EMR cluster)
22. How does it work?
You can easily resize the cluster.
23. How does it work?
Use Spot Instances to save time and money.
24. How does it work?
Launch parallel clusters against the same data source.
25. How does it work?
When the work is complete, you can terminate the cluster (and stop paying).
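The launch-then-terminate workflow above can be sketched with boto3, the AWS SDK for Python. The bucket name, release label, node count, and instance type below are illustrative assumptions, not values from the deck:

```python
def build_cluster_request(name, node_count, node_type, apps):
    """Assemble the arguments for boto3's emr.run_job_flow call.
    LogUri bucket and ReleaseLabel are hypothetical placeholders."""
    return {
        "Name": name,
        "LogUri": "s3://my-log-bucket/emr-logs/",  # hypothetical bucket
        "ReleaseLabel": "emr-5.36.0",              # assumed EMR release
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # Auto-terminate when the steps finish, so billing stops (slide 25)
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("demo-cluster", 10, "m5.xlarge", ["Hive", "Pig"])
# To actually launch (requires AWS credentials):
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**request)
```

Keeping the request in a plain dict makes the choices from step 2 (node count, node type, Hadoop apps) explicit and easy to vary per job.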
26. Cost to run a 100-node Elastic MapReduce cluster: INR 450/hour ($7.5/h)
27. Photos: renee_mcgurk https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/; Calgary Reviews https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/
28. Each day, AWS adds the equivalent server capacity of a global $7B enterprise.
31. EMR makes it easy to use Hive and Pig
Hive:
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)
• Initially developed at Facebook
Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
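For intuition about what Hive does: a HiveQL statement such as `SELECT page, COUNT(*) FROM clicks GROUP BY page` is compiled into MapReduce jobs for you. A minimal plain-Python equivalent of that grouped count (illustrative only, with made-up rows):

```python
def group_count(rows, key):
    # Equivalent of: SELECT key, COUNT(*) FROM rows GROUP BY key
    counts = {}
    for row in rows:
        k = row[key]
        counts[k] = counts.get(k, 0) + 1
    return counts

clicks = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
print(group_count(clicks, "page"))  # {'home': 2, 'cart': 1}
```

Hive's value is that you write the one-line SQL and it plans the distributed execution; Pig covers the same ground with a data-flow script style instead of SQL.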
32. EMR makes it easy to use HBase and Mahout
HBase:
• Column-oriented database
• Runs on top of HDFS
• Ideal for sparse data
• Random read/write access
• Ideal for very large tables (billions of rows, millions of columns)
Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
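Recommendation mining of the kind Mahout supports is, at its simplest, item co-occurrence counting: recommend what is most often seen alongside an item. A toy plain-Python sketch of the idea (not Mahout's actual implementation, and the baskets are made up):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(baskets):
    # Count how often each pair of items appears in the same basket
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(item, pairs):
    # Rank items by how often they co-occurred with `item`
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [i for i, _ in scores.most_common()]

baskets = [["movie", "game"], ["movie", "game", "popcorn"], ["movie", "popcorn"]]
print(recommend("game", co_occurrence(baskets)))  # ['movie', 'popcorn']
```

Mahout's contribution is running this class of computation (and clustering, classification, frequent itemsets) as distributed jobs over data far too large for one machine.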
33. EMR makes it easy to use Spark and Shark
Spark:
• In-memory MapReduce
• Up to 100x faster than Hadoop
• Accesses HDFS, HBase, S3
• Developed at UC Berkeley
• Install via bootstrap action (BA)
Shark:
• Hive on Spark
• Up to 100x faster
• Compatible with Hive
• Used at Yahoo, Airbnb, etc.
• Install via bootstrap action (BA)
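Spark's speedup comes largely from lazy evaluation plus keeping working sets in memory between passes, instead of re-reading from disk as classic Hadoop MapReduce does. This toy class mirrors the RDD method names (map, filter, cache, collect) to show those two ideas; it is a sketch of the concept, not Spark's implementation:

```python
class RDD:
    """Toy stand-in for Spark's Resilient Distributed Dataset:
    transformations are lazy, and cache() keeps results in memory."""
    def __init__(self, compute):
        self._compute = compute  # deferred computation, run only on collect()
        self._cached = None

    def map(self, f):
        return RDD(lambda: [f(x) for x in self.collect()])

    def filter(self, p):
        return RDD(lambda: [x for x in self.collect() if p(x)])

    def cache(self):
        # Materialize once; later passes reuse the in-memory result
        self._cached = self.collect()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

nums = RDD(lambda: list(range(10))).filter(lambda x: x % 2 == 0).cache()
print(nums.map(lambda x: x * x).collect())  # [0, 4, 16, 36, 64]
print(nums.collect())                       # [0, 2, 4, 6, 8], served from cache
```

In real Spark the cached partitions live in memory across a cluster, which is where the "up to 100x" claims for iterative workloads come from.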
34. EMR makes it easy to use other tools and applications
Ganglia:
• Scalable distributed monitoring
• View performance of the cluster and individual nodes
• Open source
R:
• Language and software environment for statistical computing and graphics
• Open source
35. EMR also supports MapR's Hadoop distributions
In addition to the Amazon Hadoop distribution, EMR supports the MapR M3, M5, and M7 Hadoop distributions.
Key features of MapR:
• NFS access
• No NameNode
• JobTracker high availability
• Cluster mirroring (disaster recovery)
• Compression
39. Amazon Redshift: data warehousing the AWS way
• Easy to provision, scale, and operate
• No up-front costs, pay-as-you-go
• Fast: columnar storage, compression, specialized nodes
• 1–100 node clusters, 2 TB – 1.6 PB
• $999 per TB per year
• Transparent backups, restore, failover
• Security in transit, at rest, for backups; VPC support
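The "columnar storage, compression" point can be illustrated with run-length encoding, one encoding that works very well on sorted, low-cardinality columns. This is a simplified sketch of the idea, not Redshift's actual codec:

```python
def rle_encode(column):
    # Collapse runs of repeated values into [value, run_length] pairs
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A low-cardinality column (e.g. country codes) compresses dramatically
country = ["IN"] * 4 + ["US"] * 3 + ["DE"] * 2
print(rle_encode(country))  # [['IN', 4], ['US', 3], ['DE', 2]]
```

Storing data column by column groups identical values together, which is what makes encodings like this effective and lets scans touch only the columns a query needs.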
45. 3 million queries (100 × 30 thousand) are executed per hour on the 100 HANA instances.
46. Each SAP HANA instance is deployed on an Amazon Web Services (AWS) Linux server. Each AWS server has 16 cores, 60 GB of RAM, and 600 million rows (total: 1,776 cores, 6.6 TB RAM, and 60 billion rows).