Big Data Paris

A talk I gave during the vendor pitch section at Big Data Paris.

Published in: Technology, Education
  • MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, through its Elastic MapReduce service (EMR), hosted over 2 million clusters in the past year. Amazon selected MapR to complement EMR as the only commercial Hadoop distribution being offered, sold and supported as a service by Amazon to its customers. Google, the pioneer of MapReduce and the company whose white paper on MapReduce inspired the creation of Hadoop, has also selected MapR to make our distribution available on Google Compute Engine. Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size. MapR has unique features, such as mirroring between sites and multi-tenancy support, that further enhance cloud deployments.
  • MapR is used today across industries. Ten of the Fortune 100 use MapR in production, as do leading web 2.0 properties such as leading digital advertising platforms. These customers run MapR in production for a variety of use cases. Examples include one of the largest credit card issuers in the world, which has standardized on MapR for fraud and consumer targeting applications. Other examples include a major health care group, national cyber security, and one of the largest retailers in the world. All of these are served by MapR’s complete distribution for Apache Hadoop.
  • Transcript of "Big Data Paris"

    1. Expect More from Hadoop (©MapR Technologies - Confidential)
    2. Introducing MapR
        • MapR offers the technology leading distribution for Hadoop
    3. The Industry-Leaders Choose MapR in the Cloud
        • Google chose MapR to provide Hadoop on Google Compute Engine
        • Amazon EMR is the largest Hadoop provider in revenue and # of clusters
    4. MapR Supports Broad Set of Use Cases
        • Log analysis
        • HBase
        • Customer targeting
        • Social media analysis
        • Customer Revenue Analytics
        • ETL Offload
        • Advertising exchange analysis and optimization
        • Clickstream Analysis
        • Quality profiling/field failure analysis
        • Customer Sentiment
        • Network Analytics
        • Monitors and measures behavior of online shoppers
        • Fraud Detection
        • Channel analytics
        • Customer Behavior Analysis
        • Brand Monitoring
        • Customer targeting
        • Viewer Behavioral analytics
        • Recommendation Engine
        • Family tree connections
        • Intrusion detection & prevention
        • Forensic analysis
        • Global threat analytics
        • Virus analysis
        • Patient care monitoring
        Leading Retailer
        • Recommendation Engine
        • Fraud detection and Prevention
        Leading Bank
    5. Introducing Hadoop
        • Hadoop is deployed because a) big data b) fast data c) rapidly changing data
    6. Introducing Hadoop
        • Hadoop is deployed because a) big data b) fast data c) rapidly changing data
    7. Introducing Change
        • Changing data implies a need for integration
    8. Introducing Change
        • Changing data implies a need for integration
        • If you copy, the data will change before you finish.
    9. Controlling Change
        • Changing data implies a need for stabilization
    10. Controlling Change
        • Changing data implies a need for stabilization
        • Long running analyses must have stable data
    11. The Story Can Now be Told
        • Here are three true stories about how Hadoop integration pays off
    12. Story #1: ETL Off-load
    13. The Problem
        • Major telecom vendor
        • Key step in billing pipeline handled by data warehouse (EDW)
        • EDW at maximum capacity
        • Multiple rounds of software optimization already done
        • Revenue limiting (= career limiting) bottleneck
    14. Original Flow (diagram; labels: ETL, CDR billing records, Billing reports, Data Warehouse, Customer bills)
    15. Original Flow (diagram; labels: ETL, CDR billing records, Billing reports, Data Warehouse, Customer bills)
        • 70% of total load
        • <10% of total code
        • Import by bulk load from NFS
    16. With ETL Offload (diagram; labels: ETL, CDR billing records, Billing reports, Data Warehouse, Customer billing)
        • Import written to MapR via NFS
        • Bulk load via NFS from MapR
    17. Simplified Analysis – EDW Strategy
        • 70% of EDW consumed by ETL processing
        • EDW direct hardware cost is approximately $30 million CAPEX, 12 million OPEX
        • Additional EDW only increases capacity by 50% due to poor division of labor
    18. Simplified Analysis – MapR Strategy
        • Hardware + MapR cost ~ $1.5 million
        • ETL replacement development costs ~ $1.5 million
        • Result is 3x performance increase
    19. Price Performance
        • EDW strategy
            – 1.5 x performance
            – $30 million
        • MapR Strategy
            – 3 x performance
            – $3 million
        • 20x cost/performance advantage for MapR strategy
    20. Story #2: Search Abuse
    21. The Problem
        • Build a high performance recommendation
            – Use all kinds of available data
        • Deploy it to production
            – Must have efficient deployment
    22. Input Data
        • User transactions
            – user id, merchant id
            – SIC code, amount
        • Offer transactions
            – user id, offer id
            – vendor id, merchant id’s,
            – offers, views, accepts
    23. Input Data
        • User transactions
            – user id, merchant id
            – SIC code, amount
        • Offer transactions
            – user id, offer id
            – vendor id, merchant id’s,
            – offers, views, accepts
        • Import data via standard interfaces from log files, databases, direct feeds
        • Find anomalous indicators of behavior
    24. Search-based Recommendations
        • Sample document
            – Merchant Id
            – Field for text description
            – Phone
            – Address
            – Location
    25. Search-based Recommendations
        • Sample “document”
            – Merchant Id
            – Field for text description
            – Phone
            – Address
            – Location
            – Indicator merchant id’s
            – Indicator industry (SIC) id’s
            – Indicator offers
            – Indicator text
            – Local top40
    26. Search-based Recommendations
        • Sample “document”
            – Merchant Id
            – Field for text description
            – Phone
            – Address
            – Location
            – Indicator merchant id’s
            – Indicator industry (SIC) id’s
            – Indicator offers
            – Indicator text
            – Local top40
        • User History (query)
            – Current location
            – Recent merchant descriptions
            – Recent merchant id’s
            – Recent SIC codes
            – Recent accepted offers
            – Local top40
    27. Solr indexing (diagram; labels: SolR Indexer (x2), Cooccurrence (Mahout), Item metadata, Index shards, Transactions, Web Views, Email offers)
    28. Solr indexing (diagram; labels: SolR Indexer (x2), Cooccurrence (Mahout), Item metadata, Index shards, Transactions, Web Views, Email offers)
        • Legacy code runs directly in map-reduce framework
    29. Solr search (diagram; labels: SolR Indexer (x2), Web tier, Item metadata, Index shards, User history)
    30. Solr search (diagram; labels: SolR Indexer (x2), Web tier, Item metadata, Index shards, User history)
        • SolrCloud runs without change via NFS
    31. Objective Results
        • At a very large credit card company
        • History is all transactions, all web interaction
        • Processing time cut from 20 hours per day to 3
        • Recommendation engine load time decreased from 8 hours to 3 minutes
    32. Story #3: Stable Learning
    33. The Theme and Setting
        • A humble machine learning expert once lived in a small cubicle
        • One day the CEO walked in and said
            – Your machine recommended PINK WAFFLES to my wife!!!
            – Tell me why it is suddenly doing this
    34. The Theme and Setting
        • A humble machine learning expert once lived in a small cubicle
        • One day the CEO walked in and said
            – Your machine recommended PINK WAFFLES to my wife!!!
            – Tell me why it is suddenly doing this
        • The machine learning expert could say nothing because he could not reproduce the conditions that model was trained with
        • The CEO was not pleased
    35. Why?
    36. (diagram; labels: Storm, Kafka, Twitter, Data Logger, Kafka Cluster (x3), Kafka API, Web Service, NAS, Web Data, Hadoop, Flume, HDFS, Data, Website)
    37. (diagram; labels: Storm, Kafka, Twitter, Data Logger, Kafka Cluster (x3), Kafka API, Web Service, NAS, Web Data, Hadoop, Flume, HDFS, Data, Website)
        • Data arrives continuously
        • Learning steps can’t be tied to delayed data
        • It can be delayed arbitrarily
    38. The Essence of the Problem
        • Coupling data arrival with modeling makes the data chain brittle
            – Minor delays in data delivery will break modeling SLA’s
        • But if data can arrive late and restate the past then we can’t easily replicate a model build
        • Existing data chains don’t support full bitemporal queries
    39. (diagram; labels: Twitter, MapR, Data Logger, Website, Snap, Data, Modeling, Model (x4), Mirror, Live System)
    40. The New Story
        • A humble machine learning expert once lived in a small cubicle
        • One day the CEO walked in and said
            – Your machine recommended PINK WAFFLES to my wife!!!
            – Tell me why it is suddenly doing this
    41. The New Story
        • A humble machine learning expert once lived in a small cubicle
        • One day the CEO walked in and said
            – Your machine recommended PINK WAFFLES to my wife!!!
            – Tell me why it is suddenly doing this
        • The machine learning expert could
            – Pull out all previously deployed models
            – Could exactly replicate any training run with any version of software
            – Could point out that PINK WAFFLES were actually quite stylish
        • The CEO was very pleased … he ran off to buy pink waffles
    42. Expect more from Hadoop
    43. Expect MapR
    44. Contact me!
        • tdunning@maprtech.com or tdunning@apache.org
        • @ted_dunning
        • Come to the MapR booth
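
The 20x figure on the Price Performance slide (slide 19) can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on slides 17 to 19; the definition of cost/performance as "performance multiple per million dollars" is an assumption, and the OPEX figure from slide 17 is left out because the slide 19 comparison does not appear to include it.

```python
# Figures as quoted on slides 17-19 (in millions of dollars).
# The ratio definition below is an assumption, not taken from the slides.
edw = {"speedup": 1.5, "cost_musd": 30.0}
mapr = {"speedup": 3.0, "cost_musd": 1.5 + 1.5}  # hardware + MapR, plus ETL replacement development


def perf_per_musd(option):
    """Performance multiple bought per million dollars spent."""
    return option["speedup"] / option["cost_musd"]


advantage = perf_per_musd(mapr) / perf_per_musd(edw)
print(f"EDW:  {perf_per_musd(edw):.2f}x per $1M")
print(f"MapR: {perf_per_musd(mapr):.2f}x per $1M")
print(f"Advantage: {advantage:.0f}x")
```

Under that definition the MapR strategy buys 1.0x of performance per million dollars against 0.05x for the EDW strategy, which matches the 20x advantage stated on the slide.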
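
The search-based recommendation idea in Story #2 (slides 24 to 30) is that each merchant "document" carries indicator fields derived from cooccurrence, and the user's recent history is used as the query. The following is a minimal in-memory sketch of that idea. It is not the Mahout or Solr code from the talk: the toy data, the function names `build_indicators` and `recommend`, and the use of raw cooccurrence counts with overlap scoring are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy data: user id -> merchant ids seen in that user's transactions.
histories = {
    "u1": ["cafe", "bakery", "bookshop"],
    "u2": ["cafe", "bakery"],
    "u3": ["bakery", "bookshop"],
    "u4": ["cafe", "garage"],
}


def build_indicators(histories, top_n=2):
    """For each merchant, keep the top_n merchants that most often share a user with it."""
    pair_counts = defaultdict(Counter)
    for merchants in histories.values():
        for a, b in permutations(set(merchants), 2):
            pair_counts[a][b] += 1
    indicators = {}
    for merchant, counts in pair_counts.items():
        # Sort by descending count, then name, so ties resolve deterministically.
        ranked = sorted(counts, key=lambda other: (-counts[other], other))
        indicators[merchant] = ranked[:top_n]
    return indicators


def recommend(recent_merchants, indicators, limit=3):
    """Use recent history as the query: score each merchant by indicator overlap."""
    recent = set(recent_merchants)
    scores = {
        merchant: len(recent & set(ids))
        for merchant, ids in indicators.items()
        if merchant not in recent
    }
    hits = [m for m, s in scores.items() if s > 0]
    return sorted(hits, key=lambda m: (-scores[m], m))[:limit]


indicators = build_indicators(histories)
print(indicators["cafe"])
print(recommend(["cafe", "bakery"], indicators))
```

In a deployment like the one on the slides, the indicator lists would be stored as fields of the indexed merchant document and the overlap scoring would be done by the search engine's ranking rather than by this hand-written loop.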
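
Slide 38 notes that late-arriving data can restate the past, so a model build cannot be replicated unless the training input is pinned to what was known at build time. The sketch below illustrates that point with a toy bitemporal log. It is an assumption-laden stand-in: the `Event` record, the `training_view` function and the day numbers are hypothetical, and an in-memory filter takes the place of the snapshot shown in the slide 39 diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_day: int    # when the event happened
    arrival_day: int  # when the record reached the cluster
    item: str


# Hypothetical log: the second record describes day 1 but only arrives on day 3.
LOG = [
    Event(event_day=1, arrival_day=1, item="waffles"),
    Event(event_day=1, arrival_day=3, item="pink waffles"),
    Event(event_day=2, arrival_day=2, item="coffee"),
]


def training_view(log, through_day, as_of_day):
    """Items for events up to through_day, as they were known on as_of_day."""
    return [
        e.item
        for e in log
        if e.event_day <= through_day and e.arrival_day <= as_of_day
    ]


original = training_view(LOG, through_day=2, as_of_day=2)        # the build done on day 2
rebuilt_naive = training_view(LOG, through_day=2, as_of_day=3)   # rerun on day 3 without pinning
rebuilt_pinned = training_view(LOG, through_day=2, as_of_day=2)  # rerun against the day-2 state

print(original)
print(rebuilt_naive)
print(rebuilt_pinned == original)
```

The naive rerun sees the late record and so trains on different data, while the pinned rerun reproduces the original input exactly. A snapshot plays the role of the `as_of_day` filter here by freezing the state of the data at build time.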