Big Data Paris
2. 2©MapR Technologies - Confidential
Introducing MapR
MapR offers the technology-leading distribution for Hadoop
3. 3©MapR Technologies - Confidential
The Industry Leaders Choose MapR in the Cloud
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and number of clusters
4. 4©MapR Technologies - Confidential
MapR Supports Broad Set of Use Cases
Log analysis
HBase
Customer targeting
Social media analysis
Customer Revenue Analytics
ETL Offload
Advertising exchange analysis and optimization
Clickstream Analysis
Quality profiling/field failure analysis
Customer Sentiment
Network Analytics
Monitors and measures behavior of online shoppers
Fraud Detection
Channel analytics
Customer Behavior Analysis
Brand Monitoring
Customer targeting
Viewer Behavioral analytics
Recommendation Engine
Family tree connections
Intrusion detection & prevention
Forensic analysis
Global threat analytics
Virus analysis
Patient care monitoring
Leading Retailer
Recommendation Engine
Fraud detection and Prevention
Leading Bank
5. 5©MapR Technologies - Confidential
Introducing Hadoop
Hadoop is deployed because
a) big data
b) fast data
c) rapidly changing data
8. 8©MapR Technologies - Confidential
Introducing Change
Changing data implies a need for integration
If you copy, the data will change before you finish.
10. 10©MapR Technologies - Confidential
Controlling Change
Changing data implies a need for stabilization
Long running analyses must have stable data
11. 11©MapR Technologies - Confidential
The Story Can Now be Told
Here are three true stories about how Hadoop integration pays off
13. 13©MapR Technologies - Confidential
The Problem
Major telecom vendor
Key step in billing pipeline handled by data warehouse (EDW)
EDW at maximum capacity
Multiple rounds of software optimization already done
Revenue limiting (= career limiting) bottleneck
15. 15©MapR Technologies - Confidential
Original Flow
[Diagram components: CDR billing records, ETL, Data Warehouse, Billing reports, Customer bills]
70% of total load, <10% of total code
Import by bulk load from NFS
16. 16©MapR Technologies - Confidential
With ETL Offload
[Diagram components: CDR billing records, ETL, Data Warehouse, Billing reports, Customer billing]
Import written to MapR via NFS
Bulk load via NFS from MapR
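The point of the NFS annotations is that an import step can use ordinary file I/O rather than a cluster-specific client. A minimal sketch of that idea follows. The mount point is hypothetical (a temporary directory stands in for it here), the function name and the sample rows are made up for illustration, and the write-then-rename step is a common precaution rather than something the slides prescribe.

```python
import csv
import os
import tempfile


def write_records(mount_dir, filename, rows):
    """Write rows as CSV under mount_dir using plain file I/O.

    The file is written under a temporary name and then renamed, so a
    downstream bulk loader scanning the directory does not pick up a
    half-written file.
    """
    path = os.path.join(mount_dir, filename)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    os.replace(tmp_path, path)
    return path


# Stand-in for a hypothetical NFS mount point of the cluster.
mount_dir = tempfile.mkdtemp()

# Made-up sample rows, not real billing data.
rows = [("caller", "callee", "seconds"), ("555-0100", "555-0199", 42)]
out_path = write_records(mount_dir, "cdr_batch.csv", rows)

# Any tool that can read files can now read the result the same way.
with open(out_path, newline="") as f:
    loaded = list(csv.reader(f))
print(loaded)
```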
17. 17©MapR Technologies - Confidential
Simplified Analysis – EDW Strategy
70% of EDW consumed by ETL processing
EDW direct hardware cost is approximately $30 million CAPEX, $12 million OPEX
Additional EDW only increases capacity by 50% due to poor division of labor
18. 18©MapR Technologies - Confidential
Simplified Analysis – MapR Strategy
Hardware + MapR cost ~ $1.5 million
ETL replacement development costs ~ $1.5 million
Result is 3x performance increase
19. 19©MapR Technologies - Confidential
Price Performance
EDW strategy
– 1.5 x performance
– $30 million
MapR Strategy
– 3 x performance
– $3 million
20x cost/performance advantage for MapR strategy
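The 20x figure follows from dividing each strategy's cost by its performance multiple. The short sketch below reproduces that arithmetic using only the numbers on the slides; the function and variable names are illustrative.

```python
def cost_per_unit_performance(cost_musd, speedup):
    """Cost in $ millions for each 1x of performance delivered."""
    return cost_musd / speedup


# EDW strategy: $30 million for 1.5x performance.
edw = cost_per_unit_performance(30.0, 1.5)

# MapR strategy: ~$1.5 million hardware + MapR plus ~$1.5 million
# ETL replacement development, for 3x performance.
mapr_cost = 1.5 + 1.5
mapr = cost_per_unit_performance(mapr_cost, 3.0)

advantage = edw / mapr
print(f"EDW:  ${edw:.1f}M per 1x")
print(f"MapR: ${mapr:.1f}M per 1x")
print(f"Advantage: {advantage:.0f}x")
```

Note that this comparison uses the CAPEX figure only, as the slide does; the $12 million OPEX is not included.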
21. 21©MapR Technologies - Confidential
The Problem
Build a high performance recommendation
– Use all kinds of available data
Deploy it to production
– Must have efficient deployment
23. 23©MapR Technologies - Confidential
Input Data
User transactions
– user id, merchant id
– SIC code, amount
Offer transactions
– user id, offer id
– vendor id, merchant ids,
– offers, views, accepts
Import data via standard interfaces from log files, databases, direct feeds
Find anomalous indicators of behavior
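The slides do not say which statistic is used to decide that a cooccurrence is anomalous. One common choice for this kind of analysis is the log-likelihood ratio (G²) test on a 2x2 contingency table, and the sketch below assumes that choice. The function names and the example counts are illustrative, not taken from the talk.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 cooccurrence table.

    k11: users with both A and B     k12: users with A but not B
    k21: users with B but not A      k22: users with neither
    Large scores mean A and B cooccur far more (or less) often than
    independence would predict.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


independent = llr(10, 10, 10, 10)   # no association between A and B
associated = llr(10, 0, 0, 10)      # A and B always occur together
print(independent, associated)
```

Under this assumption, pairs with a high score would be kept as indicators and the rest discarded.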
26. 26©MapR Technologies - Confidential
Search-based Recommendations
Sample “document”
– Merchant Id
– Field for text description
– Phone
– Address
– Location
– Indicator merchant ids
– Indicator industry (SIC) ids
– Indicator offers
– Indicator text
– Local top40
User History (query)
– Current location
– Recent merchant descriptions
– Recent merchant ids
– Recent SIC codes
– Recent accepted offers
– Local top40
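The pairing above treats each merchant as a searchable document and the user's recent history as the query. The toy sketch below illustrates only that matching idea: it counts overlaps between history fields and indicator fields. A real search engine such as Solr would apply proper relevance weighting instead; the field names loosely follow the slide, and all ids and values are hypothetical.

```python
# Which user-history field is matched against which document field.
FIELD_MAP = {
    "recent_merchant_ids": "indicator_merchant_ids",
    "recent_sic_codes": "indicator_sic_ids",
    "recent_accepted_offers": "indicator_offers",
}


def score(document, history):
    """Count history items that appear in the document's indicator fields."""
    total = 0
    for history_field, doc_field in FIELD_MAP.items():
        wanted = set(history.get(history_field, ()))
        present = set(document.get(doc_field, ()))
        total += len(wanted & present)
    return total


# Hypothetical merchant "documents".
documents = {
    "m1": {
        "indicator_merchant_ids": ["m7", "m9"],
        "indicator_sic_ids": ["5812"],
        "indicator_offers": ["o3"],
    },
    "m2": {
        "indicator_merchant_ids": ["m4"],
        "indicator_sic_ids": ["5411"],
        "indicator_offers": [],
    },
}

# Hypothetical user history used as the query.
history = {
    "recent_merchant_ids": ["m7", "m9"],
    "recent_sic_codes": ["5812"],
    "recent_accepted_offers": ["o8"],
}

ranked = sorted(documents, key=lambda m: score(documents[m], history), reverse=True)
print(ranked)
```

Because recommendation reduces to a query against an index, deployment can reuse whatever search infrastructure is already in place.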
28. 28©MapR Technologies - Confidential
[Diagram components: Transactions, Web Views, Email offers, Cooccurrence (Mahout), Item metadata, Solr indexing (two Solr Indexer boxes), Index shards]
Legacy code runs directly in map-reduce framework
30. 30©MapR Technologies - Confidential
[Diagram components: Web tier, User history, Item metadata, Solr search (two Solr Indexer boxes), Index shards]
SolrCloud runs without change via NFS
31. 31©MapR Technologies - Confidential
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
34. 34©MapR Technologies - Confidential
The Theme and Setting
A humble machine learning expert once lived in a small cubicle
One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this
The machine learning expert could say nothing because he could not reproduce the conditions that the model was trained under
The CEO was not pleased
37. 37©MapR Technologies - Confidential
[Diagram components: Twitter, Website, Data Logger, Storm, Kafka, Kafka Cluster (three), Kafka API, Web Service, NAS, Web Data, Flume, Hadoop, HDFS Data]
Data arrives continuously
Learning steps can’t be tied to delayed data
It can be delayed arbitrarily
38. 38©MapR Technologies - Confidential
The Essence of the Problem
Coupling data arrival with modeling makes the data chain brittle
– Minor delays in data delivery will break modeling SLAs
But if data can arrive late and restate the past then we can’t easily replicate a model build
Existing data chains don’t support full bitemporal queries
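A bitemporal view tracks two times for every record: when the event happened and when the record arrived. The toy sketch below, with hypothetical names and made-up values, shows why that matters for replicating a model build: a late restatement changes the current view, but the view "as of" the original training time can still be recovered exactly. It illustrates the idea only and is not a description of any particular product's mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    event_time: int     # when the event happened
    arrival_time: int   # when the record reached the system
    value: float


def as_of(records, arrival_cutoff):
    """Return the data as it looked at arrival_cutoff.

    Records that arrived later are ignored; among those that arrived in
    time, the latest arrival for a given (key, event_time) wins.
    """
    view = {}
    for r in sorted(records, key=lambda rec: rec.arrival_time):
        if r.arrival_time <= arrival_cutoff:
            view[(r.key, r.event_time)] = r.value
    return view


log = [
    Record("u1", event_time=1, arrival_time=1, value=10.0),
    Record("u2", event_time=2, arrival_time=2, value=5.0),
    # Late arrival that restates the past: same key and event time as the first record.
    Record("u1", event_time=1, arrival_time=5, value=12.0),
]

training_view = as_of(log, arrival_cutoff=3)   # what the model was trained on
current_view = as_of(log, arrival_cutoff=9)    # what the data says today
print(training_view)
print(current_view)
```

Without the arrival time, only the current view would be available and the earlier training input could not be rebuilt.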
39. 39©MapR Technologies - Confidential
[Diagram components: Twitter, Website, Data Logger, MapR, Data, Snap, Modeling, Model (four), Mirror, Live System]
41. 41©MapR Technologies - Confidential
The New Story
A humble machine learning expert once lived in a small cubicle
One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this
The machine learning expert could
– Pull out all previously deployed models
– Exactly replicate any training run with any version of software
– Point out that PINK WAFFLES were actually quite stylish
The CEO was very pleased … he ran off to buy pink waffles
44. 44©MapR Technologies - Confidential
Contact me!
tdunning@maprtech.com or tdunning@apache.org
@ted_dunning
Come to the MapR booth
Editor's Notes
MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, through its Elastic MapReduce service (EMR), hosted over 2 million clusters in the past year. Amazon selected MapR to complement EMR as the only commercial Hadoop distribution being offered, sold and supported as a service by Amazon to its customers. Google, the pioneer of MapReduce and the company whose white paper on MapReduce inspired the creation of Hadoop, has also selected MapR to make our distribution available on Google Compute Engine.
Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size. MapR has unique features, such as mirroring between sites and multi-tenancy support, that further enhance cloud deployments.
MapR is used today across industries. We have 10 of the Fortune 100 that are using MapR in production. We have leading web 2.0 properties, such as leading digital advertising platforms, using MapR. These customers are using MapR in production for a variety of use cases. Examples include one of the largest credit card issuers in the world, which has standardized on MapR for fraud and consumer targeting applications. Other examples include a major health care group, national cyber security, and one of the largest retailers in the world. These are all provided by MapR’s complete distribution for Apache Hadoop.