Big Data Paris

A talk I gave during the vendor pitch section at Big Data Paris.

  • MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, through its Elastic MapReduce service (EMR), hosted over 2 million clusters in the past year. Amazon selected MapR to complement EMR as the only commercial Hadoop distribution offered, sold, and supported as a service by Amazon to its customers. Google, the pioneer of MapReduce and the company whose white paper on MapReduce inspired the creation of Hadoop, has also selected MapR to make our distribution available on Google Compute Engine. Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size. MapR has unique features, such as mirroring between sites and multi-tenancy support, that further enhance cloud deployments.
  • MapR is used today across industries. Ten of the Fortune 100 use MapR in production, as do leading Web 2.0 properties such as leading digital advertising platforms. These customers use MapR in production for a variety of use cases. Examples include one of the largest credit card issuers in the world, which has standardized on MapR for fraud and consumer targeting applications. Other examples include a major health care group, national cyber security, and one of the largest retailers in the world. All of these run on MapR’s complete distribution for Apache Hadoop.

Big Data Paris Presentation Transcript

  • Slide 1: Expect More from Hadoop (© MapR Technologies - Confidential)
  • Slide 2: Introducing MapR. MapR offers the technology-leading distribution for Hadoop.
  • Slide 3: The Industry Leaders Choose MapR in the Cloud
      • Google chose MapR to provide Hadoop on Google Compute Engine
      • Amazon EMR is the largest Hadoop provider in revenue and number of clusters
  • Slide 4: MapR Supports Broad Set of Use Cases
      • Log analysis
      • HBase
      • Customer targeting
      • Social media analysis
      • Customer Revenue Analytics
      • ETL Offload
      • Advertising exchange analysis and optimization
      • Clickstream Analysis
      • Quality profiling/field failure analysis
      • Customer Sentiment
      • Network Analytics
      • Monitors and measures behavior of online shoppers
      • Fraud Detection
      • Channel analytics
      • Customer Behavior Analysis
      • Brand Monitoring
      • Customer targeting
      • Viewer Behavioral analytics
      • Recommendation Engine
      • Family tree connections
      • Intrusion detection & prevention
      • Forensic analysis
      • Global threat analytics
      • Virus analysis
      • Patient care monitoring
      • [label] Leading Retailer
      • Recommendation Engine
      • Fraud detection and Prevention
      • [label] Leading Bank
  • Slide 5: Introducing Hadoop. Hadoop is deployed because of a) big data, b) fast data, c) rapidly changing data.
  • Slide 6: Introducing Hadoop (repeats slide 5).
  • Slide 7: Introducing Change. Changing data implies a need for integration.
  • Slide 8: Introducing Change. Changing data implies a need for integration. If you copy, the data will change before you finish.
  • Slide 9: Controlling Change. Changing data implies a need for stabilization.
  • Slide 10: Controlling Change. Changing data implies a need for stabilization. Long-running analyses must have stable data.
  • Slide 11: The Story Can Now Be Told. Here are three true stories about how Hadoop integration pays off.
  • Slide 12: Story #1: ETL Off-load
  • Slide 13: The Problem
      • Major telecom vendor
      • Key step in billing pipeline handled by data warehouse (EDW)
      • EDW at maximum capacity
      • Multiple rounds of software optimization already done
      • Revenue-limiting (= career-limiting) bottleneck
  • Slide 14: [Diagram: Original Flow. Labels: ETL, CDR billing records, Billing reports, Data Warehouse, Customer bills]
  • Slide 15: [Diagram: Original Flow, repeating slide 14 with callouts: "70% of total load", "<10% of total code", "Import by bulk load from NFS"]
  • Slide 16: [Diagram: With ETL Offload. Labels: ETL, CDR billing records, Billing reports, Data Warehouse, Customer billing. Callouts: "Import written to MapR via NFS", "Bulk load via NFS from MapR"]
  • Slide 17: Simplified Analysis – EDW Strategy
      • 70% of EDW consumed by ETL processing
      • EDW direct hardware cost is approximately $30 million CAPEX, $12 million OPEX
      • Additional EDW only increases capacity by 50% due to poor division of labor
  • Slide 18: Simplified Analysis – MapR Strategy
      • Hardware + MapR cost ~ $1.5 million
      • ETL replacement development costs ~ $1.5 million
      • Result is 3x performance increase
  • Slide 19: Price Performance
      • EDW strategy – 1.5x performance – $30 million
      • MapR strategy – 3x performance – $3 million
      • 20x cost/performance advantage for MapR strategy
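The 20x figure on slide 19 can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on slides 17 to 19 and assumes, as slide 19 appears to, that the comparison uses the $30 million CAPEX alone (the OPEX figure is left out) and that the MapR cost is the sum of the two $1.5 million items. The variable and function names are illustrative.

```python
# Figures quoted on the slides; performance is relative to the existing EDW (1.0x).
strategies = {
    "EDW": {"cost_musd": 30.0, "performance": 1.5},   # additional EDW, CAPEX only
    "MapR": {"cost_musd": 3.0, "performance": 3.0},   # $1.5M hardware + MapR, $1.5M ETL rewrite
}


def cost_per_unit_performance(cost_musd, performance):
    """Millions of dollars paid for each 1.0x of performance."""
    return cost_musd / performance


edw = cost_per_unit_performance(**strategies["EDW"])    # 30 / 1.5
mapr = cost_per_unit_performance(**strategies["MapR"])  # 3 / 3
advantage = edw / mapr

print(f"EDW:  ${edw:.0f}M per unit of performance")
print(f"MapR: ${mapr:.0f}M per unit of performance")
print(f"Cost/performance advantage: {advantage:.0f}x")
```

Under these assumptions the EDW strategy costs $20 million per unit of performance and the MapR strategy $1 million, which gives the 20x ratio stated on the slide.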
  • Slide 20: Story #2: Search Abuse
  • Slide 21: The Problem
      • Build a high-performance recommendation engine
          – Use all kinds of available data
      • Deploy it to production
          – Must have efficient deployment
  • Slide 22: Input Data
      • User transactions
          – user id, merchant id
          – SIC code, amount
      • Offer transactions
          – user id, offer id
          – vendor id, merchant ids
          – offers, views, accepts
  • Slide 23: Input Data (repeats slide 22, adding callouts)
      • Import data via standard interfaces from log files, databases, direct feeds
      • Find anomalous indicators of behavior
  • Slide 24: Search-based Recommendations
      • Sample document
          – Merchant Id
          – Field for text description
          – Phone
          – Address
          – Location
  • Slide 25: Search-based Recommendations (repeats slide 24, adding fields to the sample “document”)
          – Indicator merchant ids
          – Indicator industry (SIC) ids
          – Indicator offers
          – Indicator text
          – Local top40
  • Slide 26: Search-based Recommendations (repeats slide 25, adding the query side)
      • User History (query)
          – Current location
          – Recent merchant descriptions
          – Recent merchant ids
          – Recent SIC codes
          – Recent accepted offers
          – Local top40
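The pairing of a merchant "document" that carries indicator fields with a user history used as the query can be illustrated with a small sketch. This is not Solr and not the system described in the talk: plain set overlap stands in for the search engine's relevance scoring, and the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MerchantDoc:
    """One searchable 'document' per merchant, with indicator fields."""
    merchant_id: str
    description: str
    indicator_merchant_ids: set = field(default_factory=set)
    indicator_sic_codes: set = field(default_factory=set)
    indicator_offers: set = field(default_factory=set)


@dataclass
class UserHistory:
    """The query side: what the user did recently."""
    recent_merchant_ids: set
    recent_sic_codes: set
    recent_accepted_offers: set


def score(doc, history):
    """Count indicator terms shared between a document and the user's history."""
    return (
        len(doc.indicator_merchant_ids & history.recent_merchant_ids)
        + len(doc.indicator_sic_codes & history.recent_sic_codes)
        + len(doc.indicator_offers & history.recent_accepted_offers)
    )


def recommend(docs, history, k=3):
    """Return up to k merchant ids with a non-zero score, best first."""
    ranked = sorted(docs, key=lambda d: score(d, history), reverse=True)
    return [d.merchant_id for d in ranked if score(d, history) > 0][:k]


docs = [
    MerchantDoc("m1", "coffee shop", {"m7", "m9"}, {"5812"}, {"o1"}),
    MerchantDoc("m2", "hardware store", {"m3"}, {"5251"}, set()),
    MerchantDoc("m3", "bakery", {"m7"}, {"5812", "5462"}, {"o2"}),
]
history = UserHistory({"m7"}, {"5812"}, {"o1"})
top = recommend(docs, history)
print(top)
```

A real search engine would weight rare terms more heavily and could also match on location and text description; the point here is only that a recommendation becomes an ordinary query against indicator fields.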
  • Slide 27: [Diagram: Solr indexing. Labels: Solr Indexer (×2), Cooccurrence (Mahout), Item metadata, Index shards, Transactions, Web Views, Email offers]
  • Slide 28: [Diagram: Solr indexing, repeating slide 27 with callout: "Legacy code runs directly in MapReduce framework"]
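The cooccurrence step in the indexing diagram is what produces the indicator fields stored in each document. The sketch below is a simplified stand-in rather than Mahout's implementation: it uses raw cooccurrence counts with a hypothetical `min_count` threshold, whereas a production job would normally apply a statistical test to filter out uninteresting pairs. The transaction data is made up.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_indicators(transactions, min_count=2):
    """Map each merchant to the merchants that share at least min_count users with it."""
    # Group the distinct merchants seen for each user.
    by_user = defaultdict(set)
    for user_id, merchant_id in transactions:
        by_user[user_id].add(merchant_id)

    # Count, for every ordered pair, how many users visited both merchants.
    counts = Counter()
    for merchants in by_user.values():
        for a, b in permutations(sorted(merchants), 2):
            counts[(a, b)] += 1

    # Keep only pairs that cooccur often enough to act as indicators.
    indicators = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_count:
            indicators[a].add(b)
    return dict(indicators)


transactions = [
    ("u1", "m1"), ("u1", "m2"),
    ("u2", "m1"), ("u2", "m2"), ("u2", "m3"),
    ("u3", "m1"), ("u3", "m3"),
]
indicators = cooccurrence_indicators(transactions)
print(indicators)
```

Each merchant's indicator set would then be written into its document as a field (for example the "Indicator merchant ids" field) before indexing.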
  • Slide 29: [Diagram: Solr search. Labels: Solr Indexer (×2), Web tier, Item metadata, Index shards, User history]
  • Slide 30: [Diagram: Solr search, repeating slide 29 with callout: "SolrCloud runs without change via NFS"]
  • Slide 31: Objective Results
      • At a very large credit card company
      • History is all transactions, all web interaction
      • Processing time cut from 20 hours per day to 3
      • Recommendation engine load time decreased from 8 hours to 3 minutes
  • Slide 32: Story #3: Stable Learning
  • Slide 33: The Theme and Setting
      • A humble machine learning expert once lived in a small cubicle
      • One day the CEO walked in and said
          – Your machine recommended PINK WAFFLES to my wife!!!
          – Tell me why it is suddenly doing this
  • Slide 34: The Theme and Setting (repeats slide 33, adding)
      • The machine learning expert could say nothing because he could not reproduce the conditions the model was trained under
      • The CEO was not pleased
  • Slide 35: Why?
  • Slide 36: [Diagram. Labels: Storm, Kafka, Twitter, Data Logger, Kafka Cluster (×3), Kafka API, Web Service, NAS, Web Data, Hadoop, Flume, HDFS, Data, Website]
  • Slide 37: [Diagram, repeating slide 36 with callouts: "Data arrives continuously", "It can be delayed arbitrarily", "Learning steps can’t be tied to delayed data"]
  • Slide 38: The Essence of the Problem
      • Coupling data arrival with modeling makes the data chain brittle
          – Minor delays in data delivery will break modeling SLAs
      • But if data can arrive late and restate the past, then we can’t easily replicate a model build
      • Existing data chains don’t support full bitemporal queries
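A bitemporal view keeps two timestamps per record: when the event happened and when the record arrived. The sketch below shows why that matters for replicating a model build; it is an in-memory illustration with hypothetical names (`Record`, `as_of`) and made-up integer timestamps, not a description of any particular storage system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    event_time: int    # when the event happened
    arrival_time: int  # when the system learned about it
    value: float


def as_of(records, knowledge_time):
    """Return the data as it was known at knowledge_time.

    Later arrivals for the same (key, event_time) restate earlier ones,
    but only if they had already arrived by knowledge_time.
    """
    visible = {}
    for r in sorted(records, key=lambda r: r.arrival_time):
        if r.arrival_time <= knowledge_time:
            visible[(r.key, r.event_time)] = r
    return sorted(visible.values(), key=lambda r: (r.key, r.event_time))


records = [
    Record("txn-1", event_time=10, arrival_time=11, value=5.0),
    Record("txn-2", event_time=12, arrival_time=13, value=7.0),
    Record("txn-1", event_time=10, arrival_time=20, value=6.0),  # late restatement
]

training_view = as_of(records, knowledge_time=15)  # what the model was trained on
current_view = as_of(records, knowledge_time=25)   # what the data looks like now

print([r.value for r in training_view])
print([r.value for r in current_view])
```

The two views differ for `txn-1`, so a model rebuilt from the current data would not match the original build unless the training-time view can be recovered.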
  • Slide 39: [Diagram. Labels: Twitter, MapR, Data Logger, Website, Snap, Data, Modeling, Model (×4), Mirror, Live System]
  • Slide 40: The New Story
      • A humble machine learning expert once lived in a small cubicle
      • One day the CEO walked in and said
          – Your machine recommended PINK WAFFLES to my wife!!!
          – Tell me why it is suddenly doing this
  • Slide 41: The New Story (repeats slide 40, adding)
      • The machine learning expert could
          – Pull out all previously deployed models
          – Exactly replicate any training run with any version of software
          – Point out that PINK WAFFLES were actually quite stylish
      • The CEO was very pleased … he ran off to buy pink waffles
  • Slide 42: Expect more from Hadoop
  • Slide 43: Expect MapR
  • Slide 44: Contact me!
      • tdunning@maprtech.com or tdunning@apache.org
      • @ted_dunning
      • Come to the MapR booth