Big Data Paris

© MapR Technologies - Confidential
Expect More from Hadoop

Introducing MapR
MapR offers the technology-leading distribution for Hadoop

The Industry Leaders Choose MapR in the Cloud
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and number of clusters

MapR Supports a Broad Set of Use Cases
 Log analysis
 HBase
 Customer targeting
 Social media analysis
 Customer Revenue Analytics
 ETL Offload
 Advertising exchange analysis and optimization
 Clickstream Analysis
 Quality profiling/field failure analysis
 Customer Sentiment
 Network Analytics
 Monitors and measures behavior of online shoppers
 Fraud Detection
 Channel analytics
 Customer Behavior Analysis
 Brand Monitoring
 Customer targeting
 Viewer Behavioral analytics
 Recommendation Engine
 Family tree connections
 Intrusion detection & prevention
 Forensic analysis
 Global threat analytics
 Virus analysis
 Patient care monitoring
Leading Retailer
 Recommendation Engine
 Fraud detection and Prevention
Leading Bank

Introducing Hadoop
Hadoop is deployed because of
a) big data
b) fast data
c) rapidly changing data

Introducing Change
Changing data implies a need for integration
If you copy, the data will change before you finish.

Controlling Change
Changing data also implies a need for stabilization
Long-running analyses must have stable data

The Story Can Now be Told
Here are three true stories about how Hadoop integration pays off

Story #1
ETL Off-load

The Problem
 Major telecom vendor
 Key step in billing pipeline handled by data warehouse (EDW)
 EDW at maximum capacity
 Multiple rounds of software optimization already done
 Revenue-limiting (= career-limiting) bottleneck

Original Flow
[Figure: flow diagram. Components: CDR billing records, ETL, Data Warehouse, billing reports, customer bills]
Annotations: 70% of total load, <10% of total code; import by bulk load from NFS

With ETL Offload
[Figure: flow diagram. Components: CDR billing records, ETL, Data Warehouse, billing reports, customer billing]
Annotations: import written to MapR via NFS; bulk load via NFS from MapR

Simplified Analysis – EDW Strategy
 70% of EDW consumed by ETL processing
 EDW direct hardware cost is approximately $30 million CAPEX, $12 million OPEX
 An additional EDW increases capacity by only 50% due to poor division of labor

Simplified Analysis – MapR Strategy
 Hardware + MapR cost ~ $1.5 million
 ETL replacement development costs ~ $1.5 million
 Result is 3x performance increase

Price Performance
 EDW strategy
– 1.5 x performance
– $30 million
 MapR Strategy
– 3 x performance
– $3 million
 20x cost/performance advantage for MapR strategy
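
The 20x figure follows directly from the numbers on the two preceding slides. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Figures taken from the slides: performance multiple and cost in $ millions.
edw_performance, edw_cost = 1.5, 30.0
mapr_performance = 3.0
mapr_cost = 1.5 + 1.5  # hardware + MapR, plus ETL replacement development

# Performance gained per million dollars spent.
edw_ratio = edw_performance / edw_cost
mapr_ratio = mapr_performance / mapr_cost

# Ratio of the two price/performance figures.
advantage = mapr_ratio / edw_ratio
print(f"MapR cost/performance advantage: {advantage:.0f}x")
```

The comparison uses only the CAPEX figures quoted on the slides; OPEX is not included.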

Story #2
Search Abuse

The Problem
 Build a high-performance recommendation engine
– Use all kinds of available data
 Deploy it to production
– Must have efficient deployment

Input Data
 User transactions
– user id, merchant id
– SIC code, amount
 Offer transactions
– user id, offer id
– vendor id, merchant id’s,
– offers, views, accepts
Import data via standard interfaces from log files, databases, direct feeds
Find anomalous indicators of behavior

Search-based Recommendations
 Sample document
– Merchant Id
– Field for text description
– Phone
– Address
– Location

 Sample “document”
– Merchant Id
– Phone
– Address
– Location
– Indicator merchant id’s
– Indicator industry (SIC) id’s
– Indicator offers
– Indicator text
– Local top40
 User History (query)
– Current location
– Recent merchant descriptions
– Recent merchant id’s
– Recent SIC codes
– Recent accepted offers
– Local top40
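
The idea on this slide is that each merchant "document" carries indicator fields derived from cooccurrence, and a user's recent history is used as the query against those fields. The following is a rough sketch of that idea in plain Python, standing in for both the cooccurrence step and the search engine. The toy data, the function names `build_indicators` and `recommend`, and the simple count threshold are all assumptions made for illustration; the threshold stands in for whatever anomaly test the real pipeline applies.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy histories: user id -> merchant ids seen in transactions.
histories = {
    "u1": ["cafe", "bakery", "bookshop"],
    "u2": ["cafe", "bakery"],
    "u3": ["bakery", "bookshop"],
    "u4": ["cafe", "garage"],
}

def build_indicators(histories, min_count=2):
    """Map each merchant to the merchants that co-occur with it
    in at least min_count user histories."""
    pair_counts = Counter()
    for merchants in histories.values():
        for a, b in combinations(sorted(set(merchants)), 2):
            pair_counts[(a, b)] += 1
    indicators = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            indicators[a].add(b)
            indicators[b].add(a)
    return indicators

def recommend(recent_merchants, indicators):
    """Rank merchant 'documents' by how many of their indicator
    merchants appear in the user's recent history (the query)."""
    recent = set(recent_merchants)
    scores = {
        merchant: len(indicator_ids & recent)
        for merchant, indicator_ids in indicators.items()
        if merchant not in recent
    }
    hits = [m for m, score in scores.items() if score > 0]
    return sorted(hits, key=lambda m: (-scores[m], m))

indicators = build_indicators(histories)
print(recommend(["cafe"], indicators))
```

In a deployment like the one described in the slides, the indicator sets would be stored as fields of the indexed document and the overlap scoring would be done by the search engine's ordinary text retrieval.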

[Figure: offline indexing pipeline. Components: transactions, web views, email offers, cooccurrence (Mahout), item metadata, Solr indexing (Solr indexers), index shards]
Legacy code runs directly in map-reduce framework

[Figure: online search path. Components: user history, web tier, Solr search (Solr indexers), item metadata, index shards]
SolrCloud runs without change via NFS

Objective Results
 At a very large credit card company
 History is all transactions, all web interaction
 Processing time cut from 20 hours per day to 3
 Recommendation engine load time decreased from 8 hours to 3 minutes

Story #3
Stable Learning

The Theme and Setting
 A humble machine learning expert once lived in a small cubicle
 One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this

The Theme and Setting
 The machine learning expert could say nothing because he could not reproduce the conditions the model was trained under
 The CEO was not pleased

Why?

[Figure: data pipeline. Components: Twitter, website, data logger, web service, web data, Kafka API, Kafka clusters, Storm, Flume, NAS, Hadoop, HDFS data]
Data arrives continuously
It can be delayed arbitrarily
Learning steps can’t be tied to delayed data

The Essence of the Problem
 Coupling data arrival with modeling makes the data chain brittle
– Minor delays in data delivery will break modeling SLAs
 But if data can arrive late and restate the past then we can’t easily replicate a model build
 Existing data chains don’t support full bitemporal queries
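
To make "bitemporal" concrete: each record carries both the time the event happened and the time the record arrived, and a query fixes a cutoff on both. The sketch below is a minimal in-memory illustration, not the mechanism used in the talk; the `Record` type, the `as_of` function, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Record:
    event_id: str
    value: float
    event_time: datetime    # when the event happened
    arrival_time: datetime  # when the record reached the system

def as_of(records, event_cutoff, arrival_cutoff):
    """Return the value known for each event up to event_cutoff,
    using only records that had arrived by arrival_cutoff."""
    visible = [
        r for r in records
        if r.event_time <= event_cutoff and r.arrival_time <= arrival_cutoff
    ]
    known = {}
    # Later arrivals restate earlier ones for the same event.
    for r in sorted(visible, key=lambda r: r.arrival_time):
        known[r.event_id] = r.value
    return known

records = [
    Record("txn-1", 10.0, datetime(2013, 4, 1), datetime(2013, 4, 1)),
    # A late correction that restates the same event two days later.
    Record("txn-1", 12.0, datetime(2013, 4, 1), datetime(2013, 4, 3)),
]

# What a model trained on April 2 would have seen.
training_view = as_of(records, datetime(2013, 4, 2), datetime(2013, 4, 2))
# What the same event range looks like after the late data arrives.
current_view = as_of(records, datetime(2013, 4, 2), datetime(2013, 4, 4))
print(training_view, current_view)
```

Replaying a model build means querying with the original arrival cutoff, so the late restatement is excluded and the training input matches what was seen at the time.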

[Figure: new architecture. Components: Twitter, website, data logger, MapR, Snap, data, modeling, models, mirror, live system]

The New Story
 The machine learning expert could
– Pull out all previously deployed models
– Exactly replicate any training run with any version of software
– Point out that PINK WAFFLES were actually quite stylish
 The CEO was very pleased … he ran off to buy pink waffles

Expect more from
Hadoop

Expect MapR

Contact me!
 tdunning@maprtech.com or tdunning@apache.org
 @ted_dunning
 Come to the MapR booth
