2. The Challenge of Big Data
Business Analytics Requires a New Approach
[Chart: data volume growing 44x, from 1.2 zettabytes in 2010 to a projected 35.2 zettabytes in 2020]
Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
3. What are the Requirements for Big Data?
Process it quickly
Combine multiple data sources
Expand analysis
4. Big Data in the Cloud
Distributed, scalable computing platform
– Data/compute framework
– Commodity hardware
Pioneered at Google
Commercially available as Hadoop
5. Important Drivers for Hadoop
Data and compute on the same nodes
You don’t need to know what questions to ask beforehand
Simple algorithms on Big Data
Analysis of unstructured data
10. How does it work?
[Diagram: Amazon EMR cluster alongside S3]
You can store the data in S3 and/or on the cluster (HDFS)
You decide which Hadoop distribution to run, how many nodes, and what types of nodes
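To make the slide concrete, here is a minimal sketch of launching such a cluster with boto3, the AWS SDK for Python. The cluster name, instance types, node count, release label, and S3 bucket below are illustrative assumptions, not values from the deck; the input data could equally live in S3 or in HDFS on the cluster.

```python
# Sketch: launch an EMR cluster with boto3 (AWS SDK for Python).
# Names, instance types, and counts below are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="big-data-demo",                      # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                 # you pick the Hadoop distribution/version
    Applications=[{"Name": "Hadoop"}],
    LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",     # you pick the node types...
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                    # ...and how many nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```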
15. Hadoop in the Cloud is a Flexible Infrastructure for Big Data
16. Cloud Example of Scalability
MinuteSort – the amount of data that can be sorted in 60 seconds
– The benchmark is technology-agnostic
Previous overall record was 1.4 TB, set by Microsoft Research using specially designed software on physical hardware
Previous Hadoop MinuteSort record was 578 GB
17. A New MinuteSort World Record
New world record: 1.5 TB sorted in 60 seconds
3x more data processed than the previous Hadoop record
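Sorting is the shuffle phase’s native job in MapReduce, which is why Hadoop is a natural fit for MinuteSort: mappers only need to emit records keyed by the sort key, and the framework sorts them on the way to the reducers. A minimal, hypothetical Hadoop Streaming mapper is sketched below; an identity reducer such as /bin/cat completes the job, and a globally ordered result across many reducers additionally needs a total-order partitioner.

```python
#!/usr/bin/env python3
# mapper.py - identity mapper for a Hadoop Streaming sort.
# Emits each input line as "key<TAB>rest"; Hadoop's shuffle then sorts
# all records by key before they reach the reducers, so no explicit
# sort code is needed on our side.
import sys

for line in sys.stdin:
    key, _, rest = line.rstrip("\n").partition("\t")
    print(f"{key}\t{rest}")
```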
21. Comparing an EDW to Hadoop
Major telecom vendor
Key step in billing pipeline handled by enterprise data warehouse (EDW)
EDW at maximum capacity
Multiple rounds of software optimization already done
Revenue-limiting (= career-limiting) bottleneck
23. Problem Analysis
70% of EDW load is related to call detail record (CDR) normalization
– Less than 10% of total lines of code
– CDR normalization is difficult within the EDW
– Binary extraction and conversion
Data rates are too high for an upstream transform
– Requires high-volume joins
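The deck doesn’t show the normalization code itself, but the shape of the job is easy to sketch. Below is a hypothetical Python normalizer for fixed-width binary CDRs; the 24-byte record layout and field names are invented for illustration, since real CDR formats are vendor-specific.

```python
# Hypothetical CDR normalizer: unpack binary call detail records into
# tab-separated text that Hadoop (or the EDW) can process downstream.
# The record layout below is invented; real CDR formats are proprietary.
import struct
from datetime import datetime, timezone

CDR_FORMAT = "!10sIIHI"  # caller (10 bytes), callee id, start epoch, duration (s), cell id
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 24 bytes

def normalize_cdr(record: bytes) -> str:
    caller, callee, start, duration, cell = struct.unpack(CDR_FORMAT, record)
    started = datetime.fromtimestamp(start, tz=timezone.utc).isoformat()
    return "\t".join([caller.decode("ascii").rstrip("\x00"),
                      str(callee), started, str(duration), str(cell)])

with open("cdrs.bin", "rb") as f:  # placeholder input file
    while (chunk := f.read(CDR_SIZE)) and len(chunk) == CDR_SIZE:
        print(normalize_cdr(chunk))
```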
26. Simplified Analysis
70% of EDW capacity is consumed by ETL processing – offloading it frees that capacity
EDW direct hardware cost is approximately $30 million, vs. a Hadoop cluster at 1/50 the cost (roughly $600,000)
Adding another EDW only increases capacity by 50% due to the poor division of labor
27. The Results
EDW strategy
– 1.5x performance
– $30 million
Hadoop strategy
– 3x faster
– 20x cost/performance advantage for the Hadoop strategy
– With high availability and data protection
29. Combining different feeds on one platform
Hadoop and HBase for storage and processing
[Diagram: real-time data feed from a social network stored in Hadoop, alongside historical purchase information and billing data]
Predictive analytics from historical data, combined with NoSQL querying on real-time social-networking data
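As a rough illustration of the pattern on this slide, here is a sketch using happybase, a Python HBase client over Thrift. The host, table, row-key scheme, and column names are placeholders I have invented, not details from the actual deployment.

```python
# Sketch: one platform serving real-time writes and NoSQL queries.
# Table and column names below are invented placeholders.
import happybase

conn = happybase.Connection("hbase-thrift-host")   # placeholder host
events = conn.table("social_events")               # hypothetical table

# Real-time feed: write the latest social-network event for a user.
events.put(b"user42|2013-04-03T14:27:00", {b"e:mention": b"product-123"})

# NoSQL query: scan that user's recent events to combine with the
# predictive model built offline from historical purchase data.
for key, data in events.scan(row_prefix=b"user42|"):
    print(key, data)
```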
30. Results
New service rolled out in one quarter
Processing time cut from 20 hours per day to 3
Recommendation-engine load time decreased from 8 hours to 3 minutes
Includes data-versioning support for easier development and updating of models
32. Leading Veterinary Equipment Manufacturer
Aggregates data across 6,000 veterinary clinics
Nightly extracts from each clinic
One job runs once a week for a few hours
Expanding applications to include vaccination analysis for 300 million vaccinations
Predictive analytics for disease prevalence and prevention
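The speaker notes describe nightly pipe-delimited extracts that Hadoop aggregates before Sqoop loads the results into Oracle. A hypothetical Hadoop Streaming reducer for that kind of aggregation is sketched below; the field layout is assumed, not taken from IDEXX’s actual files.

```python
#!/usr/bin/env python3
# reducer.py - sums units sold per product across all clinic extracts.
# Assumes a mapper has already emitted "product<TAB>units" lines, which
# Hadoop delivers here grouped and sorted by product.
import sys

current, total = None, 0
for line in sys.stdin:
    product, _, units = line.rstrip("\n").partition("\t")
    if product != current and current is not None:
        print(f"{current}\t{total}")
        total = 0
    current = product
    total += int(units)
if current is not None:
    print(f"{current}\t{total}")
```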
35. Overview and Requirements
Collect and collate information from disparate sources (text files, images, etc.)
Leverage a new data source: spit (saliva samples for DNA analysis)
Machine learning techniques and DNA-matching algorithms
36. The Results
Storage infrastructure for billions of small and large files
Blob store for large images through NoSQL solutions
Multi-tenant capability for data-mining and machine-learning algorithm development
38. Analytic Flexibility
MapReduce-enabled machine learning algorithms
Enhanced search
Real-time event processing
No need to sample the data
Applications: fraud detection, target marketing, consumer behavior analysis, …
39. Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
– Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
40. Advanced Simple Analytics
Fraud detection (see the sketch below):
– Detect small frauds using transaction patterns across the entire portfolio
– Identify a compromise signature to prevent further exploits and provide solid case explanations
Google Flu Trends vs. traditional flu-surveillance systems and modeling
Netflix recommendation engine
– Complex models vs. simply adding IMDB data
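As a toy version of the fraud-detection bullet above, the sketch below flags merchants where many distinct accounts see tiny “test” charges, a classic signature of a compromised card batch. The thresholds and field layout are invented for illustration; the point is the slide’s thesis that a simple rule over the entire portfolio can work where sampled data cannot.

```python
# Toy "simple algorithm on all the data": flag merchants where many
# distinct accounts show tiny charges. Thresholds are invented.
from collections import defaultdict

SMALL_CHARGE = 2.00   # dollars; fraud "test" charges are typically tiny
BURST = 3             # distinct accounts hitting one merchant (toy threshold)

def find_compromised_merchants(transactions):
    """transactions: iterable of (account_id, merchant_id, amount)."""
    accounts = defaultdict(set)
    for account, merchant, amount in transactions:
        if amount <= SMALL_CHARGE:
            accounts[merchant].add(account)
    return {m for m, accts in accounts.items() if len(accts) >= BURST}

sample = [("a1", "m9", 1.00), ("a2", "m9", 0.50),
          ("a3", "m9", 1.25), ("a4", "m2", 45.00)]
print(find_compromised_merchants(sample))  # -> {'m9'}
```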
42. Clickstream Analysis
Big-box retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
Problem: improve Return on Ad Spend (ROAS)
48. Big Data Lessons from the Cloud
Big Data Lessons from the Cloud
1. Big Data requires a new approach
2. Hadoop is a paradigm shift
3. Easy to get started with Hadoop in the Cloud
4. Scale clusters up and down in the Cloud
5. Only pay for what you use
6. Expand data for analysis
7. Combine data sources
8. New application from new data source
9. New analytics
10. Wide variety of applications appropriate for Hadoop
Editor's Notes
MapReduce is a paradigm shift. Google is the poster child. What exactly does Hadoop look like?
There are many drivers for Hadoop adoption…
Let’s start with this chart, to reinforce that you’re in the right room and picked the right session… Hadoop is not only the fastest-growing Big Data technology, it is one of the fastest-growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
This is a Hadoop distribution: it includes a series of open-source packages that are tested, hardened, and combined into a complete suite. With MapR we’ve combined this with our own innovations at the data-platform level to make it highly available, dependable, and easier to access and integrate through industry standards like NFS, ODBC, etc.
How do you benefit? I mentioned a wide variety of use cases… I’ve generalized these into four groups. The first
is expanding data: from sampled transactions to all of the transactions… Netflix recommends five movies to you because they look at everybody’s movie watching and ratings and identify clusters of individuals like you… Risk triangles for insurance companies go from the zip-code level down to the neighborhood street… Trading information goes from the last 3 months to 7 years…
Let’s look at a specific example…
Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis…
The problem with this process is that 70% of the EDW load is related to the CDR normalization process. AI: Why is this the case? CDR normalization is difficult within the EDW; binary extraction and conversion to SQL is difficult.
IDEXX (current client, M3 on EMR). IDEXX is the leader in veterinary equipment and also makes software for clinics, etc. They aggregate data from veterinary clinics that have IDEXX software. They ran a MapR cluster internally with 4-5 servers at the time, using it successfully for a few months. Terry went to the AWS conference in November and learned about EMR. He tried it out and liked the flexibility, especially in their use case where there aren’t jobs running all the time. Example: one job runs once a week for a few hours. 6,000 veterinary practices. Each night they receive a data extract from each one (a pipe-delimited file) that includes all the products sold that day. Hadoop is used for aggregations; Sqoop then loads the results into an Oracle database for the analysts. Now they have another project. It is compiled with Java 7 and uses some Java 7 features (and it’s part of a much larger project that uses Java 7). AI Itay: Send them the exact instructions for using Java 7 with MapR/EMR. It processes data similar to the first project; in this case, they are creating a list of vaccinations for each animal and providing a portal to end users with all the medical details.
The first is “simple algorithms and lots of data trump complex models.” This comes from an IEEE article written by three research directors at Google. The article was titled “The Unreasonable Effectiveness of Data”; it was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” That paper made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. The Google authors talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But they found that in natural language processing, an area notoriously complex that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and, based on a corpus of thousands of pictures, fill in the missing background. This algorithm did rather poorly until the corpus grew to millions of photos, and with that amount of data the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
Okay, interesting graphs, but how does this translate to the real world? Here are some broad examples.
Start with the right platform… power to address your needs and the flexibility to grow with your expansion. ----- Meeting Notes (4/3/13 14:27) ----- Examples of functionality that makes applications better: custom code, integration, time to market, production grade. RSA - security event management - NFS - pull data easily. 1. Why Hadoop is game-changing - paradigm shift. 2. How can you benefit - use-case categories… saved 10 million dollars - predictive analytics. Need money. Who is MapR, and what do we do to make that a reality. End point - what you can do with it to bring value today.