© 2014 MapR Technologies
July 23, 2014
Our Speakers
•  Jin Kim, VP, Marketing, Skytree
•  Nitin Bandugula, Product Marketing, MapR
Agenda
•  Introduction to Hadoop
•  Machine Learning on Hadoop
•  Advanced Machine Learning
•  Customer Case Studies
Big Data is Overwhelming Traditional Systems
[Diagram: Enterprise Data Architecture – enterprise users, operational systems, analytical systems, outside sources]
Production requirements (operational systems):
•  Mission-critical reliability
•  Transaction guarantees
•  Deep security
•  Real-time performance
•  Backup and recovery
Production requirements (analytical systems):
•  Interactive SQL
•  Rich analytics
•  Workload management
•  Data governance
•  Backup and recovery
Hadoop: The Disruptive Technology at the Core of Big Data
[Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]
Hadoop Relieves the Pressure from Enterprise Systems
[Diagram: operational systems, analytical systems, enterprise users]
•  Data staging
•  Archive
•  Data transformation
•  Data exploration
•  Streaming, interactions
Keys for Production Success
1.  Reliability and DR
2.  Interoperability
3.  High performance
4.  Supports operations and analytics
MapR: Best Hadoop Distribution for Customer Success
•  Top Ranked
•  Exponential Growth
•  500+ Customers
•  Premier Investors
•  3X bookings Q1 ‘13 – Q1 ‘14
•  80% of accounts expand 3X
•  90% software licenses
•  <1% lifetime churn
•  >$1B in incremental revenue generated by 1 customer
The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem on the MapR Data Platform]
•  Layers: Execution engines; Data governance and operations; Management; Security
•  Categories: Batch; Streaming; NoSQL & Search; Provisioning & coordination; ML, Graph; Workflow & Data Governance; SQL; Data Integration & Access
•  Components: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue
* Certification/support planned for 2014
Machine Learning Stack
[Diagram: the same Apache Hadoop and OSS ecosystem stack on the MapR Data Platform; the ML, Graph category contains Mahout, MLLib and GraphX]
* Certification/support planned for 2014
Machine Learning Cuts Across All Use Cases
ENTERPRISE DATA HUB
•  Multi-structured data staging & archive
•  ETL / DW optimization
•  Mainframe optimization
•  Data exploration
MARKETING OPTIMIZATION
•  Recommendation engines & targeting
•  Customer 360
•  Click-stream analysis
•  Social media analysis
•  Ad optimization
RISK & SECURITY OPTIMIZATION
•  Network security monitoring
•  Security information & event management
•  Fraudulent behavioral analysis
OPERATIONS INTELLIGENCE
•  Supply chain & logistics
•  System log analysis
•  Manufacturing quality assurance
•  Preventative maintenance
•  Smart meter analysis
How Does Big Data Help Machine Learning?
Big Data => Better Models
•  A machine that has played 1 million games of checkers will be smarter than one that has played just 100 games
•  Improves accuracy of the model, especially for unsupervised learning
•  Less likely to overfit because of the variety of data
[Diagram: Past Data, Model, New Data, Results]
Common Machine Learning Use Cases on Hadoop
•  Linear/Polynomial Regression – fit to an equation – predict prices
•  Logistic Regression – probability of occurrence – classify spam
•  K-means Clustering – group things together – customer segmentation
•  Recommender Systems and Collaborative Filtering – product recommendation
•  Anomaly Detection – credit card fraud
The data scientist decides what works best
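As a minimal sketch of the first use case in the list above ("fit to an equation, predict prices"), the snippet below fits a straight line by ordinary least squares in plain Python. The floor-area and price figures are invented for illustration and do not come from the presentation; on a cluster, a library such as MLlib or Mahout would normally do this work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: floor area (square metres) vs. price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

model = fit_line(areas, prices)
print(predict(model, 100))  # price estimate for a 100 square-metre property
```

The same fit/predict split applies to the other techniques in the list; only the model behind `fit_line` changes.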
Machine Learning on Hadoop
Modeling Process – Constant Iterations / Free to Fail
•  Modeling Data Set + Validation Data Set
•  Constant iterations and plotting
–  Underfit vs. overfit
–  Feature manipulation
–  Adjusting learning rates
–  False positives vs. false negatives – precision levels
–  Measuring error, etc.
•  Legacy applications, libraries, code used to manipulate data
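The modeling loop above can be sketched in a few lines of Python: hold out a validation set, then count false positives and false negatives to get precision and recall. The function names, the 20% validation fraction and the toy labels are assumptions made for this sketch, not anything prescribed in the presentation.

```python
import random


def split(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows and split them into (modeling, validation) sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


def precision_recall(actual, predicted):
    """Compute precision and recall from binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)  # false positives
    fn = sum(1 for a, p in pairs if a and not p)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy validation labels and model outputs, invented for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall(actual, predicted))
```

Re-running `precision_recall` after each feature or learning-rate change is one simple way to track whether an iteration helped.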
Development and Deployment Process
PLAYERS: ACTIVITY
•  Mathematicians: Need newer data sets from production for model building and validation – need complete autonomy for inventions
•  Developers: Develop the final solution based on models, then test and deploy working with Ops – need to coordinate heavily
•  Operations Staff: Need to provide data and deploy apps while ensuring data consistency, data compliance, HA, DR, etc.
Lots of Operational Issues
Volumes and Mirroring
The Conflict:
Experimental, Free to Fail Modeling Process Needs Production Data
Solutions:
1.  Same Cluster: Separate Volumes, Multi-tenancy, Labels, Queues, Data Placement Control, etc.
2.  Different Cluster for R&D purposes: Mirroring – efficient, less network bandwidth, across the globe, easy to deploy and maintain
Snapshots
The Idea: Version control of data as well as models
Data Version Control:
•  How does my model work against new validation sets?
•  How did it change across many validation sets?
Model Version Control:
•  How can I go back and check my new model against old datasets?
•  How do I prove that what I came up with worked for the data we had at the time – replicate scenarios
Read Write NFS Access
•  Existing applications, custom libraries all work out-of-the-box
•  Browsers, modeling languages, scripts work out-of-the-box
•  Data ingestion is easy
–  Quickly move data in and out without having to wait for developers and administrators to build and maintain a Flume cluster
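The "works out-of-the-box" point can be sketched with plain file I/O: if the cluster is NFS-mounted, an existing script reads and writes it like any other directory. The mount point `/mapr/example.cluster` and the file names below are hypothetical placeholders; substitute the path where the cluster is actually mounted.

```python
import csv
from pathlib import Path


def read_rows(csv_path):
    """Read a CSV with ordinary file I/O (local disk or NFS mount alike)."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def write_rows(csv_path, rows, fieldnames):
    """Write rows as CSV, creating the parent directory if needed."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical NFS mount point; not taken from the presentation.
CLUSTER_ROOT = Path("/mapr/example.cluster")

# Example usage once the mount exists (paths are illustrative only):
#   rows = read_rows(CLUSTER_ROOT / "staging" / "transactions.csv")
#   write_rows(CLUSTER_ROOT / "models" / "scores.csv", rows, list(rows[0]))
```

Nothing in the functions is cluster-specific, which is the point of the slide: no ingestion pipeline has to be built before a modeler can start.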
Machine Learning Options
Apache Spark
•  Spark – in-memory processing framework
•  Works well with iterative machine learning algorithms – the matrices can be pulled into memory
•  Up to 100x better performance (in-memory) compared to MapReduce
MLlib
•  Built-in libraries for a variety of algorithms
•  Python and NumPy support
GraphX
•  Libraries to model relationships between entities – social media
Apache Mahout
•  Built-in algorithms for popular techniques such as Recommenders, Classification, Collaborative Filtering, etc.
•  Moving towards running on Spark
Advanced Machine Learning with Skytree
[Diagram: MapR Distribution for Hadoop]
•  Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
•  Data marts and data warehouse: Offload / Re-Load
•  MapR Data Platform: MapR-DB, MapR-FS
•  Batch (MR, Spark, Hive, Pig, …)
•  Interactive (Impala, Drill, …)
•  Streaming (Spark Streaming, Storm, …)
•  Adv. Modeling – Exploration – Analytics
Skytree
Q&A
Engage with us!
1.  Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
2. Download machine learning e-books from Ted Dunning:
http://www.mapr.com/resources/white-papers#e-books
3. Visit Skytree at www.skytree.net
4. Learn best practices for Hadoop ETL: www.mapr.com/EDH
THE MACHINE LEARNING COMPANY ®
SAME DATA.
BETTER RESULTS.
Jin H. Kim
VP of Marketing
jin@skytree.net
Our Vision
THE DATA DRIVEN ENTERPRISE POWERED BY MACHINE LEARNING
Machine learning: the modern science of finding patterns and making predictions from data: multivariate statistics, data mining, pattern recognition, advanced/predictive analytics
Machine Learning has finally arrived
Technology Evolution:
•  1st Wave (50’s–70’s): Artificial Intelligence, Pattern Recognition
•  2nd Wave (80’s–90’s): Neural Networks, Data Mining
•  3rd Wave (Mid 90’s – Today): Machine Learning: Convergence
Application Evolution: Universities; Science; Credit scoring; OCR; Sales / Marketing; Finance; Biotech; Retail; Telco; Government
Now: Machine Learning on Big Data
Skytree: Machine Learning for High-Value, High-Complexity Problems
•  Predictive optimal decision-making
–  High-frequency algorithmic trading
–  Online advertising exchanges
–  Fast customer targeting and churn analysis
•  Predictive monitoring/discovery assistance
–  Point-of-compromise fraud tips/cues
–  Network fault monitoring/diagnosis
–  Predictive maintenance of network of devices
–  Fraud analysis in claims
–  Insider threat/DLP and cyber security
High-Value, High-Complexity Problems: Critical Elements in Common
1.  High accuracy needed (needle-finding)
–  Small number of known examples
–  Identify anomalies with no prior examples
2.  Complex data fusion needed (unified objects)
–  Spatial-temporal behavior/event pattern-finding and tracking
–  Inference of activities, entities/identities, relations
3.  Automation needed (augment human analysts)
–  Value-based attention-focusing, recommendation of relevant content
–  Real-time interactivity without waiting
–  Fast construction of new reports for agility
Use Case Examples
•  Financial Services: Fraud Analysis, Credit Scoring, Pricing, Churn Analysis, SDN/SON
•  Government: Fraud Analysis, Scoring, Anomaly Detection, Fault Analysis, SDN/SON
•  Retail: Segmentation, Recommendation, Churn Analysis, Lead Scoring, Pricing
•  Asset Intensive: Preventative Maintenance, Defect/Fault Detection, Supply Chain Management, Cost Forecasting, Failure Analysis
Global Leaders Select Skytree
WORLD’S LEADING:
•  Logistics & Shipping: Anomaly detection
•  Consumer Electronics: Content recommendation
•  Automobile: On-board destination recommendation
•  Web Portal: Ad targeting
•  Financial Services & Credit Card: Customer lead scoring, fraud, credit risk scoring
Who’s Who of Advanced Analytics
•  “10 Hot Big Data Startups to Watch”
•  “Skytree Looms in Big Data Forest with New Funding”
•  “Skytree Uses Machine Learning To Crunch Big Data”
•  Skytree named “Big Data Analytics Vendor to Watch”
•  “The Ten Coolest Big Data Startups in 2013”
•  “One giant leap for machinekind”
•  Skytree among “10 Emerging Technologies for Big Data”
•  “…could change the face of Big Data”
Insurance: Targeted Auto Policies with Telemetric Data
•  Business challenge
–  Inaccurate policy pricing based on demographics and actuarial data (example: many teens are good drivers but they often incur higher premiums)
–  Availability of new data sources including telemetry data
•  Machine learning solution
–  Use telematics to price insurance based on near-real-time driving habits
–  Base rates on an individual’s actual driving history
–  Data fusion to personalize and increase objectivity and accuracy in pricing and claims processing
•  Business benefit
–  Targeted customer pricing and policies
–  Improved customer retention
–  Higher customer satisfaction and margins
Customers’ Use of Skytree
Targeting – Find New Customers
•  Global 100 Financial Institution
•  Major pain points: speed and accuracy of current approach
•  Current solution: SAS, Hadoop, homegrown
“I want our analysts to create models rather than writing software” - Skytree Customer
Runtime (minutes):
•  CURRENT: 1,200 cores @ 100-node Hadoop cluster; runtime: 100 minutes; accuracy (Gini): 57%
•  SKYTREE SERVER: 12 cores @ 1 node; runtime: 8 minutes; accuracy (Gini): 60%
•  1250x speedup (12.5x shorter runtime on 1/100 of the cores)
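One way to reconcile the 1250x figure with the runtimes quoted on this slide is to compare core-minutes instead of wall-clock time. The slide does not say how the figure was derived, so the per-core reading below is an assumption; the arithmetic uses only the numbers shown.

```python
def runtime_ratio(baseline_minutes, new_minutes):
    """Wall-clock speedup: how many times shorter the new run is."""
    return baseline_minutes / new_minutes


def core_minutes(cores, minutes):
    """Total compute consumed, in core-minutes."""
    return cores * minutes


# Figures quoted on the slide.
baseline = {"cores": 1200, "minutes": 100}
skytree = {"cores": 12, "minutes": 8}

wall_clock = runtime_ratio(baseline["minutes"], skytree["minutes"])
per_core = core_minutes(**baseline) / core_minutes(**skytree)

print(wall_clock)  # 12.5
print(per_core)    # 1250.0
```

Under this reading the job finishes 12.5 times sooner while using one hundredth of the cores, and the product of the two gives 1250.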
Asset Intensive: Predict Parts Failure through Telemetric Data
•  Business challenge
–  Early infant mortality of parts due to rapid aging is not easily detectable during manufacturing and environmental acceptance tests
–  Utilize diagnostic data such as impedance, voltage, temperature (multidimensional data)
•  Machine learning solution
–  Detect transient indicators of rapid aging through telemetric data: the time between Beginning of Life and first transient is random; the time between first transient and End of Life is deterministic
–  Automatic parameter tuning
–  Data fusion
•  Business benefit
–  Efficient parts inventory management
–  Higher customer satisfaction
–  Optimize preventative maintenance scheduling based on predicted Time To Failure (TTF)
Predict Parts Failure through Telemetric Data
[Diagram: data stored on Hadoop cluster – manufacturing data, telemetric data, geo-location data]
1.  Build failure model from manufacturing test data
2.  Blend in data from telemetric and other big data sources
3.  Real-time discovery of transient part behavior patterns to predict Time-To-Failure
Improve Customer Retention with Machine Learning
•  Business challenge
–  Cost of attracting new customers is many times more than retaining customers
–  Greater customer sophistication and competition increase churn levels
•  Machine learning solution
–  Identify events that predict customer needs
–  Isolate best targets and best offers for individual customers: predict what offer or service would prevent a customer from switching
–  Discover purchase patterns and profiles of customers who leave for a deeper understanding
•  Business benefit
–  Reduced churn and increased customer loyalty
–  Increased margins and marketing effectiveness
–  Improved up/cross sell opportunities
Performance Studies by Customers
Next Logical Product – Right Offer to Right Customer
•  Global Fortune 20 Company
•  Major pain points: speed and accuracy of legacy approach
•  Current solution: homegrown
•  1M data points for a “pilot”
Results:
•  Runtime (mins): Legacy 97, Skytree .07
•  Precision@5 (%): Legacy 35%, Skytree 42%
20% increase in recommendation relevance in a fraction of the time.
“We are literally speechless” - Skytree Customer
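For readers unfamiliar with the metric, Precision@5 is commonly defined as the share of the top five recommendations that turn out to be relevant. The sketch below uses that standard definition with invented item names; the presentation does not describe how the customer measured it. It also shows that going from 35% to 42% is a 20% relative increase.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Hypothetical ranked recommendations and the items a customer actually wanted.
recommended = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "c", "x"}

print(precision_at_k(recommended, relevant))  # 0.4
print(relative_increase(35, 42))              # 0.2
```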
Real-Time Fraud Detection
•  Business challenge
–  Growing complexity of fraud patterns
–  Increased frequency of fraud
–  Minimize false positives without compromising fraud accuracy
•  Machine learning solution
–  Leverage diverse big data for better context
–  Real-time update of model parameters
–  Faster and more accurate model for better fraud detection
•  Business benefit
–  More accurate and agile fraud detection system
–  Improved customer satisfaction
–  Improved financial results
Global 2000 Credit Card Network – Before
[Diagram: modeling and real-time detection workflow]
1.  Transaction data transferred from database to Linux server
2.  Modeling: fraud model created to detect fraud; model is exported
3.  Model is re-coded by a new set of engineers for the mainframe
4.  Real-time detection: new model is “loaded” so fraud can be detected in real time
•  Customer wanted a more accurate model
•  Current model in system was designed to be updated on a yearly basis
•  Running a model on a large dataset took over 2 days
•  Skytree’s goal is to move update of the model to daily or real time
Hardware: Linux x86 Server, Mainframe
Software: Internally developed random decision forests
SLA: Fraud scored in real-time. Fraud model updated yearly
Global 2000 Credit Card Network - Now
[Diagram: Modeling & Real-Time Score Environment – data stored on MapR Hadoop cluster; fraud model and unsupervised ML models created; fraud model updated daily / real-time]
•  Customer can use the same environment for modeling and for production
•  Models can be updated on a daily or real-time basis depending on needs
•  More frequent updates lead to a significant increase in lift
Hardware: Linux x86 Server
Software: MapR, Skytree fraud detection models
SLA: Fraud scored in real-time. Fraud model updated daily / real-time
Global 2000 Credit Card Network - Now
“Key to increasing fraud detection accuracy”
•  Use all of the data: sampling can decrease accuracy of results
•  Semi-supervised learning: a combination of supervised and unsupervised learning can improve fraud detection rates
•  Weight transactions based on date: Skytree Server allows each transaction to be weighted differently, so fraud models can weight recent fraud more heavily than older fraud
•  Use the most important variables:
o  Were the last few transactions at an un-manned location?
o  Is the transaction over the credit limit?
o  Which day of the week was the fraud committed?
o  Has the card been reported for fraud before?
o  And more…
•  Weight based on transaction value: we should care more about larger transactions
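The two weighting ideas on this slide (favor recent fraud, favor larger transactions) can be sketched as a per-transaction training weight. The exponential decay, the 90-day half-life and the simple multiplication by amount are assumptions chosen for illustration; the presentation does not describe the weighting scheme Skytree Server actually uses.

```python
from datetime import date, timedelta


def transaction_weight(tx_date, amount, today, half_life_days=90.0):
    """Hypothetical weight: halves every `half_life_days`, scales with amount."""
    age_days = (today - tx_date).days
    recency = 0.5 ** (age_days / half_life_days)
    return recency * amount


today = date(2014, 7, 23)
transactions = [
    {"date": today, "amount": 200.0},
    {"date": today - timedelta(days=90), "amount": 200.0},
    {"date": today - timedelta(days=180), "amount": 1000.0},
]

# Same amount, older transaction: half the weight after one half-life.
weights = [transaction_weight(t["date"], t["amount"], today) for t in transactions]
print(weights)  # [200.0, 100.0, 250.0]
```

Weights like these would typically be passed to a learner as per-sample weights, so that recent and high-value fraud dominates the fit.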
Skytree Maximizes Predictive Accuracy
Advantages and their benefits:
1.  Breadth of Advanced Methods: more powerful/advanced methods and options. Benefit: greater chance of having the best model for your data
2.  Speed & Scalability: use more data, test more parameters. Benefit: improved accuracy in the time available
3.  Automation / Ease of Use: shorter time to most accurate models. Benefit: more productive modelers, more people in the company can use it
Skytree is designed from the ground up for these benefits.
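Sources of Generalization Error
Motivations: Sources of Generalization Error
\[
\underbrace{\mathbb{E}_{\xi}\Big[ f(x_t,\xi) - \inf_{x\in\mathcal{H}^*} f(x,\xi) \Big]}_{\text{Excess Error}}
= \underbrace{\mathbb{E}_{\xi}\Big[ \inf_{x\in\mathcal{H}} f(x,\xi) - \inf_{x\in\mathcal{H}^*} f(x,\xi) \Big]}_{\mathrm{Err}_{\text{Approximation}}}
+ \underbrace{\mathbb{E}_{\xi}\Big[ f(x^*(N),\xi) - \inf_{x\in\mathcal{H}} f(x,\xi) \Big]}_{\mathrm{Err}_{\text{Estimation}}}
+ \underbrace{\mathbb{E}_{\xi}\Big[ f(x_t,\xi) - f(x^*(N),\xi) \Big]}_{\mathrm{Err}_{\text{Expected-Optimization}}}
\]
•  Improper Model: $\mathrm{Err}_{\text{Approximation}}$
•  Finite Samples: $\mathrm{Err}_{\text{Estimation}}$
•  Algorithmic Accuracy: $\mathrm{Err}_{\text{Expected-Optimization}}$
$\xi$: data sample;
$N$: number of data samples;
$\mathcal{H}$: hypothesis space of the model;
$\mathcal{H}^*$: “true” hypothesis space that contains the optimal $x^*$
Source: Hua Ouyang, “Optimal Stochastic & Distributed Algorithms for Machine Learning”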
First Principles: Sources of prediction error
•  Improper model (approximation error) – use the right model: try many
•  Finite samples (estimation error) – use more data: all of it
•  Algorithmic accuracy (expected optimization error) – use the right parameters: try many
Source: Hua Ouyang, “Optimal Stochastic & Distributed Algorithms for Machine Learning”
Skytree and Spark
[Diagram: MapR Data Platform with 1.x, 2.x/YARN and Spark; ZooKeeper; Web Services; Command Line Interface; Data Sources/Targets: OLTP / EDW]
Why Skytree? Why do companies pick us for Big Data analytics?
Built on Solid Foundation
INVESTORS (22M+)
SAME DATA.
BETTER RESULTS.
Thank You.
www.skytree.net
MapR & Skytree:

  • 1.
    ®© 2014 MapRTechnologies 1 ® © 2014 MapR Technologies July 23, 2014
  • 2.
    ®© 2014 MapRTechnologies 2 Our Speakers Jin Kim VP, Marketing Skytree Nitin Bandugula Product Marketing MapR
  • 3.
    ®© 2014 MapRTechnologies 3 Agenda •  Introduction to Hadoop •  Machine Learning on Hadoop •  Advanced Machine Learning •  Customer Case Studies
  • 4.
    ®© 2014 MapRTechnologies 4 Big Data is Overwhelming Traditional Systems •  Mission-critical reliability •  Transaction guarantees •  Deep security •  Real-time performance •  Backup and recovery •  Interactive SQL •  Rich analytics •  Workload management •  Data governance •  Backup and recovery Enterprise Data Architecture ENTERPRISE USERS OPERATIONAL SYSTEMS ANALYTICAL SYSTEMS PRODUCTION REQUIREMENTS PRODUCTION REQUIREMENTS OUTSIDE SOURCES
  • 5.
    ®© 2014 MapRTechnologies 5 Hadoop: The Disruptive Technology at the Core of Big Data JOB TRENDS FROM INDEED.COM Jan ‘06 Jan ‘12 Jan ‘14Jan ‘07 Jan ‘08 Jan ‘09 Jan ‘10 Jan ‘11 Jan ‘13
  • 6.
    ®© 2014 MapRTechnologies 6 OPERATIONAL SYSTEMS ANALYTICAL SYSTEMS ENTERPRISE USERS •  Data staging •  Archive •  Data transformation •  Data exploration •  Streaming, interactions Hadoop Relieves the Pressure from Enterprise Systems 2 Interoperability 1 Reliability and DR 4 Supports operations and analytics 3 High performance Keys for Production Success
  • 7.
    ®© 2014 MapRTechnologies 7 MapR: Best Hadoop Distribution for Customer Success Top Ranked Exponential Growth 500+ Customers Premier Investors 3X bookings Q1 ‘13 – Q1 ‘14 80% of accounts expand 3X 90% software licenses <1% lifetime churn >$1B in incremental revenue generated by 1 customer
  • 8.
    ®© 2014 MapRTechnologies 8 The Power of the Open Source CommunityManagement MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Shark Drill* SQL Sentry* Oozie ZooKeeperSqoop Knox* WhirrFalcon*Flume Data Integration & Access HttpFS Hue *  Cer&fica&on/support  planned  for  2014  
  • 9.
    ®© 2014 MapRTechnologies 9 Machine Learning StackManagement MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Shark Drill* SQL Sentry* Oozie ZooKeeperSqoop Knox* WhirrFalcon*Flume Data Integration & Access HttpFS Hue *  Cer&fica&on/support  planned  for  2014  
  • 10.
    ®© 2014 MapRTechnologies 10 ENTERPRISE DATA HUB MARKETING OPTIMIZATION RISK & SECURITY OPTIMIZATION OPERATIONS INTELLIGENCE • Multi-structured data staging & archive • ETL / DW optimization • Mainframe optimization • Data exploration • Recommendation engines & targeting • Customer 360 • Click-stream analysis • Social media analysis • Ad optimization • Network security monitoring • Security information & event management • Fraudulent behavioral analysis • Supply chain & logistics • System log analysis • Manufacturing quality assurance • Preventative maintenance • Smart meter analysis Machine Learning Cuts Across All Use Cases
  • 11.
    ®© 2014 MapRTechnologies 11 How Does Big Data Help Machine Learning Big Data => Better Models •  A machine that has played 1 million checkers game will be smarter than the one that played just a 100 games •  Improves accuracy of the model esp. for unsupervised learning •  Unlikely to overfit because of the variety of data Past Data Model New Data Results
  • 12.
    ®© 2014 MapRTechnologies 12 Common Machine Learning Use Cases on Hadoop •  Linear/Polynomial Regression – fit to an equation - predict prices •  Logistic Regression – probability of occurrence - classify spam •  K-means Clustering – group things together - customer segmentation •  Recommender Systems and Collaborative Filtering – product recommendation •  Anomaly Detection – credit card fraud The data scientist decides what works best
  • 13.
    ®© 2014 MapRTechnologies 13© 2014 MapR Technologies ® Machine Learning on Hadoop
  • 14.
    ®© 2014 MapRTechnologies 14 Modeling Process – Constant Iterations / Free to Fail •  Modeling Data Set + Validation Data Set •  Constant Iterations and plotting –  Underfit vs. Overfit –  Feature manipulation –  Adjusting learning rates –  False Positive vs. False Negatives – precision levels –  Measuring Error etc •  Legacy applications, libraries, code used to manipulate data
  • 15.
    ®© 2014 MapRTechnologies 15 Development and Deployment Process Need newer data sets from production for model building and validation – need complete autonomy for inventions Develop the final solution based on models and test and deploy working with Ops – need to coordinate heavily Need to provide data and deploy apps while ensuring data consistency, data compliance, HA, DR etc. PLAYERS ACTIVITY Mathematicians Developers Operations Staff Lots of Operational Issues
  • 16.
    ®© 2014 MapRTechnologies 16 Volumes and Mirroring The Conflict: Experimental, Free to Fail Modeling Process Needs Production Data Solutions: 1.  Same Cluster: Separate Volumes, Multi-tenancy, Labels, Queues, Data Placement Control etc.. 2. Different Cluster for R&D purposes: Mirroring – efficient, less network bandwidth, across the globe, easy to deploy and maintain
  • 17.
    ®© 2014 MapRTechnologies 17 Snapshots The Idea: Version control of data as well as models Data Version Control: How does my model work against new validation sets How did it change across many validation sets Model Version Control: How can I go back and check my new model against old datasets How do I prove that what I came up with worked for the data we had at the time – replicate scenarios
  • 18.
    ®© 2014 MapRTechnologies 18 Read Write NFS Access •  Existing applications, custom libraries all work out-of-the-box •  Browsers, modeling languages, scripts work out-of-the-box •  Data ingestion is easy –  Quickly move data in and out without having to wait for developers and administrators to build and maintain flume cluster
  • 19.
    ®© 2014 MapRTechnologies 19© 2014 MapR Technologies ® Machine Learning Options
  • 20.
    ®© 2014 MapRTechnologies 20 Apache Spark •  Spark – In Memory Processing Framework •  Works well with the iterative machine learning algorithms – the matrices can be pulled into memory •  100x better performance (in-memory) compared to MapReduce MLLib •  Inbuilt libraries for a variety of algorithms •  Python and NumPy support GraphX •  Libraries to model relationships between entities – social media
  • 21.
    ®© 2014 MapRTechnologies 21 Apache Mahout •  In-built algorithms for popular techniques such as Recommenders, Classification, Collaborative Filtering etc. •  Moving towards running on Spark
  • 22.
    ®© 2014 MapRTechnologies 22 Advanced Machine Learning with Skytree DATA MARTS DATA WAREHOUSE MapR Data Platform Offload Re-Load MapR-DB MapR-FS Batch (MR, Spark, Hive, Pig, …) Interactive (Impala, Drill, …) Streaming (Spark Streaming, Storm…) MAPR DISTRIBUTION FOR HADOOP Adv. Modeling – Exploration - Analytics Sources RELATIONAL, SAAS, MAINFRAME DOCUMENTS, EMAILS LOG FILES, CLICKSTREAMS SENSORS BLOGS, TWEETS, LINK DATA
  • 23.
    ®© 2014 MapRTechnologies 23© 2014 MapR Technologies ® Skytree
  • 24.
    ®© 2014 MapRTechnologies 24 Q&AEngage with us! 1.  Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox 2. Download machine learning e-books from Ted Dunning: http://www.mapr.com/resources/white-papers#e-books 3. Visit Skytree at www.skytree.net 4. Learn best practices for Hadoop ETL: www.mapr.com/EDH
  • 25.
    THE MACHINE LEARNINGCOMPANY ® SAME DATA. BETTER RESULTS. Jin H. Kim VP of Marketing jin@skytree.net! 1
  • 26.
    THE MACHINE LEARNINGCOMPANY ® Machine learning: ! The modern science of finding patterns and making predictions from data:! ! multivariate statistics, data mining, pattern recognition, advanced/predictive analytics! Our Vision 2 THE DATA DRIVEN ENTERPRISE POWERED BY MACHINE LEARNING
  • 27.
    THE MACHINE LEARNINGCOMPANY ® Machine Learning has finally arrived! 50’s-70s Mid 90’s - Today80’s-90’s 3 1st Wave: Artificial Intelligence Pattern Recognition Universities Technology Evolution! Application Evolution! 2nd Wave: Neural Networks Data Mining Science Credit scoring OCR Now: Machine Learning on Big Data 3rd Wave: Machine Learning: Convergence Sales / Marketing Finance Biotech Retail Telco Government
  • 28.
    THE MACHINE LEARNINGCOMPANY ® Skytree: Machine Learning for High-Value, High-Complexity Problems! •  Predictive optimal decision-making! –  High-frequency algorithmic trading ! –  Online advertising exchanges! –  Fast customer targeting and churn analysis! •  Predictive monitoring/discovery assistance! –  Point-of-compromise fraud tips/cues ! –  Network fault monitoring/diagnosis! –  Predictive maintenance of network of devices! –  Fraud analysis in claims! –  Insider threat/DLP and cyber security! 4
  • 29.
    THE MACHINE LEARNINGCOMPANY ® High-Value, High-Complexity Problems: 
 Critical Elements in Common! 1.  High-accuracy needed (needle- finding)! –  Small number of known examples! –  Identify anomalies with no prior examples! ! 2.  Complex data fusion needed (unified objects)! –  Spatial-temporal behavior/event pattern- finding and tracking! –  Inference of activities, entities/identities, relations! 3.  Automation needed (augment human analysts)! –  Value-based attention-focusing, recommendation of relevant content! –  Real-time interactivity without waiting! –  Fast construction of new reports for agility! 5
  • 30.
    THE MACHINE LEARNINGCOMPANY ® Use Case Examples! 6 Financial Services Fraud Analysis Credit Scoring Pricing Churn Analysis SDN/SON Government Fraud Analysis Scoring Anomaly Detection Fault Analysis SDN/SON Retail Segmentation Recommendation Churn Analysis Lead Scoring Pricing Asset Intensive Preventative Maintenance Defect/Fault Detection Supply Chain Management Cost Forecasting Failure Analysis
  • 31.
    THE MACHINE LEARNINGCOMPANY ® Global Leaders Select Skytree WORLD’S  LEADING:   Anomaly detection Logis3cs  &  Shipping   Content recommendation Consumer  Electronics   On-board destination recommendation Automobile   Web  Portal   Ad targeting Customer lead scoring, fraud, credit risk scoring Financial  Services  &  Credit  Card  
  • 32.
    THE MACHINE LEARNINGCOMPANY ® “10  Hot  Big  Data  Startups  to  Watch”   “Skytree  Looms  in  Big  Data  Forest  with  New  Funding”     “Skytree  Uses  Machine  Learning  To  Crunch  Big  Data”     Skytree  named  “Big  Data  Analy3cs  Vendor  to  Watch”     “The  Ten  Coolest  Big  Data  Startups  in  2013”     “One  giant  leap  for  machinekind”     Skytree  among  “10  Emerging  Technologies  for  Big  Data”     “…could  change  the  face  of  Big  Data”   Who’s  Who  of  Advanced  Analy3cs  
  • 33.
    THE MACHINE LEARNINGCOMPANY ® Insurance: Targeted Auto Policies with Telemetric Data! •  Business challenge! –  Inaccurate policy pricing based on demographics and actuarial data! •  Example: many teens are good drivers but they often incur higher premiums ! –  Availability of new data sources including telemetry data ! •  Machine learning solution! –  Use telematics to price insurance based on near- real-time driving habits ! –  Base rates on an individual’s actual driving history! –  Data fusion to personalize and increase objectivity and accuracy in pricing and claims processing! •  Business benefit! –  Targeted customer pricing and policies! –  Improved customer retention! –  Higher customer satisfaction and margins! 9
  • 34.
    Customers' Use of Skytree
    Targeting – Find New Customers
    •  Global 100 Financial Institution
    •  Major pain points: speed & accuracy of current approach
    •  Current solution: SAS, Hadoop, homegrown
    "I want our analysts to create models rather than writing software"
    - Skytree Customer
    Runtime (minutes):
    •  CURRENT: 1,200 cores @ 100-node Hadoop cluster; runtime: 100 minutes; accuracy (Gini): 57%
    •  SKYTREE SERVER: 12 cores @ 1 node; runtime: 8 minutes; accuracy (Gini): 60%
    •  1250x speedup
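The slide does not say how the 1250x figure was derived. One reading that reproduces it exactly is a per-core comparison: total core-minutes consumed by each run. The short sketch below works through that arithmetic using only the numbers on the slide; the per-core interpretation itself is an assumption, not something the slide states.

```python
# Figures as stated on the slide.
current = {"cores": 1200, "nodes": 100, "runtime_min": 100, "gini": 0.57}
skytree = {"cores": 12, "nodes": 1, "runtime_min": 8, "gini": 0.60}


def core_minutes(run):
    """Total compute consumed by a run: cores multiplied by wall-clock minutes."""
    return run["cores"] * run["runtime_min"]


# Wall-clock speedup alone: 100 minutes vs 8 minutes.
wall_clock_speedup = current["runtime_min"] / skytree["runtime_min"]

# Reduction in hardware: 1,200 cores vs 12 cores.
core_reduction = current["cores"] / skytree["cores"]

# ASSUMPTION: the slide's "1250x" is a core-minute (per-core) comparison.
per_core_speedup = core_minutes(current) / core_minutes(skytree)

# Accuracy change in Gini percentage points.
gini_gain_points = round((skytree["gini"] - current["gini"]) * 100, 1)

print(f"Wall-clock speedup: {wall_clock_speedup}x")
print(f"Core reduction:     {core_reduction}x")
print(f"Per-core speedup:   {per_core_speedup}x")
print(f"Gini gain:          {gini_gain_points} points")
```

Under this reading, the wall-clock improvement is 12.5x and the hardware reduction is 100x, and their product gives the headline number.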
  • 35.
    Asset Intensive: Predict Parts Failure through Telemetric Data
    •  Business challenge
       –  Early infant mortality of parts due to rapid aging is not easily detectable during manufacturing and environmental acceptance tests
       –  Utilize diagnostic data such as impedance, voltage, temperature (multidimensional data)
    •  Machine learning solution
       –  Detect transient indicators of rapid aging through telemetric data
          •  Time between Beginning of Life and first transient is random
          •  Time between first transient and End of Life is deterministic
       –  Automatic parameter tuning
       –  Data fusion
    •  Business benefit
       –  Efficient parts inventory management
       –  Higher customer satisfaction
       –  Optimize preventative maintenance scheduling based on predicted Time To Failure (TTF)
  • 36.
    Predict Parts Failure through Telemetric Data
    Data stored on Hadoop cluster: Manufacturing Data, Telemetric Data, Geo-location Data
    1.  Build failure model from manufacturing test data
    2.  Blend in data from telemetric and other big data sources
    3.  Real-time discovery of transient part behavior patterns to predict Time-To-Failure
  • 37.
    Improve Customer Retention with Machine Learning
    •  Business challenge
       –  Attracting a new customer costs many times more than retaining an existing one
       –  Greater customer sophistication and competition increase churn levels
    •  Machine learning solution
       –  Identify events that predict customer needs
       –  Isolate best targets and best offers for individual customers
          •  Predict what offer or service would prevent a customer from switching
       –  Discover purchase patterns and profiles of customers who leave, for a deeper understanding
    •  Business benefit
       –  Reduced churn and increased customer loyalty
       –  Increased margins and marketing effectiveness
       –  Improved up/cross sell opportunities
  • 38.
    Performance Studies by Customers
    Next Logical Product – Right Offer to Right Customer
    •  Global Fortune 20 Company
    •  Major pain points: speed & accuracy of legacy approach
    •  Current solution: homegrown
    •  1M data points for a "pilot"
    •  35% accurate
    Results: 20% increase in recommendation relevance in a fraction of the time.
    •  Runtime (mins): LEGACY 97; SKYTREE .07
    •  Precision@5 (%): LEGACY 35%; SKYTREE 42%
    "We are literally speechless"
    - Skytree Customer
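For readers unfamiliar with the metric, Precision@5 is the fraction of the top five recommendations that turn out to be relevant. The sketch below shows the conventional calculation on made-up item IDs (the lists are hypothetical, not customer data), then checks that the slide's 35% and 42% figures correspond to a 20% relative increase.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical example: 2 of the top 5 recommendations are relevant.
recommended = ["a", "b", "c", "d", "e", "f"]
relevant = {"b", "e", "z"}
example_precision = precision_at_k(recommended, relevant, k=5)

# Figures as stated on the slide.
legacy_precision = 0.35
skytree_precision = 0.42
relative_gain = (skytree_precision - legacy_precision) / legacy_precision

legacy_runtime_min = 97
skytree_runtime_min = 0.07
runtime_ratio = legacy_runtime_min / skytree_runtime_min

print(f"Example Precision@5: {example_precision:.2f}")
print(f"Relative gain in Precision@5: {relative_gain:.0%}")
print(f"Runtime ratio (legacy / Skytree): about {runtime_ratio:.0f}x")
```

The 20% on the slide is therefore a relative improvement (7 percentage points on a base of 35), not an absolute one.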
  • 39.
    Real-Time Fraud Detection
    •  Business challenge
       –  Growing complexity of fraud patterns
       –  Increased frequency of fraud
       –  Minimize false positives without compromising fraud accuracy
    •  Machine learning solution
       –  Leverage diverse big data for better context
       –  Real-time update of model parameters
       –  Faster and more accurate model for better fraud detection
    •  Business benefit
       –  More accurate and agile fraud detection system
       –  Improved customer satisfaction
       –  Improved financial results
  • 40.
    Global 2000 Credit Card Network – Before
    Workflow:
    •  Transaction data: transferred from database to Linux server
    •  Modeling: fraud model created to detect fraud; model is exported
    •  Real-time detection: model is re-coded by a new set of engineers for the mainframe; the new model is "loaded" and fraud could be detected in real time
    Situation:
    •  Customer wanted a more accurate model
    •  Current model in system was designed to be updated on a yearly basis
    •  Running a model on a large dataset took over 2 days
    •  Skytree's goal is to move update of the model to daily or real time
    Hardware: Linux x86 server, mainframe
    Software: Internally developed random decision forests
    SLA: Fraud scored in real time. Fraud model updated yearly
  • 41.
    Global 2000 Credit Card Network – Now
    Modeling & real-time score environment:
    •  Data stored on MapR Hadoop cluster; fraud model created; fraud model updated daily / real-time
    •  Data stored on MapR Hadoop cluster; unsupervised ML models created; fraud model updated daily / real-time
    Results:
    •  Customer can use the same environment for modeling and for production
    •  Models can be updated on a daily or real-time basis depending on needs
    •  More frequent updates lead to a significant increase in lift
    Hardware: Linux x86 server
    Software: MapR, Skytree fraud detection models
    SLA: Fraud scored in real time. Fraud model updated daily / real-time
  • 42.
    Global 2000 Credit Card Network – Now
    "Key to increasing fraud detection accuracy"
    •  Use all of the data: sampling can decrease accuracy of results
    •  Semi-supervised learning: a combination of supervised and unsupervised learning can improve fraud detection rates
    •  Weight transactions based on date: Skytree Server allows each transaction to be weighted differently, so fraud models can weight recent fraud more heavily than older fraud
    •  Use the most important variables:
       o  Were the last few transactions at an unmanned location?
       o  Is the transaction over the credit limit?
       o  Which day of the week was the fraud committed?
       o  Has the card been reported for fraud before?
       o  And more…
    •  Weight based on transaction value: we should care more about larger transactions
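The two weighting ideas above (by date and by transaction value) can be sketched as per-transaction sample weights. This is an illustration only, not Skytree Server's API: the function names, the exponential half-life decay, the logarithmic value scaling, and every default parameter are assumptions chosen for the example.

```python
import math
from datetime import date, timedelta


def recency_weight(txn_date, today, half_life_days=90.0):
    """ASSUMED scheme: weight halves for every `half_life_days` of age."""
    age_days = max((today - txn_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def value_weight(amount, amount_scale=100.0):
    """ASSUMED scheme: weight grows with transaction value, with diminishing returns."""
    return math.log1p(amount / amount_scale)


def transaction_weight(txn_date, amount, today):
    """Combined sample weight: recent, high-value transactions count the most."""
    return recency_weight(txn_date, today) * value_weight(amount)


# Hypothetical transactions: (date, amount).
today = date(2014, 7, 23)
recent_large = transaction_weight(today - timedelta(days=5), 2000.0, today)
recent_small = transaction_weight(today - timedelta(days=5), 20.0, today)
old_large = transaction_weight(today - timedelta(days=365), 2000.0, today)

print(f"recent, large: {recent_large:.3f}")
print(f"recent, small: {recent_small:.3f}")
print(f"old, large:    {old_large:.3f}")
```

Weights like these would typically be passed to a learner that accepts per-example weights, so that errors on recent or large transactions cost more during training.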
  • 43.
    Skytree Maximizes Predictive Accuracy
    1.  Advantage: Breadth of Advanced Methods: more powerful/advanced methods and options
        Benefit: Greater chance of having the best model for your data
    2.  Advantage: Speed & Scalability: use more data, test more parameters
        Benefit: Improved accuracy in the time available
    3.  Advantage: Automation / Ease of Use: shorter time to most accurate models
        Benefit: More productive modelers, more people in the company can use it
    Skytree is designed from the ground up for these benefits.
  • 44.
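    Sources of Generalization Error
    Motivations: Sources of Generalization Error
    $$
    \underbrace{\mathbb{E}_{\xi}\Big[f(x_t,\xi)-\inf_{x\in\mathcal{H}^{*}}f(x,\xi)\Big]}_{\text{Excess Error}}
    =\underbrace{\mathbb{E}_{\xi}\Big[\inf_{x\in\mathcal{H}}f(x,\xi)-\inf_{x\in\mathcal{H}^{*}}f(x,\xi)\Big]}_{\mathrm{Err}_{\text{Approximation}}}
    +\underbrace{\mathbb{E}_{\xi}\Big[f(x^{*}(N),\xi)-\inf_{x\in\mathcal{H}}f(x,\xi)\Big]}_{\mathrm{Err}_{\text{Estimation}}}
    +\underbrace{\mathbb{E}_{\xi}\Big[f(x_t,\xi)-f(x^{*}(N),\xi)\Big]}_{\mathrm{Err}_{\text{Expected-Optimization}}}
    $$
    •  Improper Model: $\mathrm{Err}_{\text{Approximation}}$
    •  Finite Samples: $\mathrm{Err}_{\text{Estimation}}$
    •  Algorithmic Accuracy: $\mathrm{Err}_{\text{Expected-Optimization}}$
    The $\inf_{x\in\mathcal{H}}f(x,\xi)$ and $f(x^{*}(N),\xi)$ terms cancel across the three components.
    $\xi$: data sample; $N$: number of data samples; $\mathcal{H}$: hypothesis space of the model; $\mathcal{H}^{*}$: "true" hypothesis space that contains the optimal $x^{*}$
    Source: Hua Ouyang, Optimal Stochastic & Distributed Algorithms for Machine Learning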
  • 45.
    First Principles: Sources of prediction error
    Motivations: Sources of Generalization Error. The excess error has three sources:
    •  Improper Model (approximation error): use the right model. Try many.
    •  Finite Samples (estimation error): use more data. All of it.
    •  Algorithmic Accuracy (expected optimization error): use the right parameters. Try many.
    Source: Hua Ouyang, Optimal Stochastic & Distributed Algorithms for Machine Learning
  • 46.
    Skytree and Spark
    Architecture diagram components: MapR Data Platform; 1.x; 2.x / YARN; Spark; ZooKeeper; Web Services; Command Line Interface; Data Sources/Targets: OLTP / EDW
  • 47.
    Why Skytree? Why do companies pick us for Big Data analytics?
    Built on Solid Foundation
    INVESTORS (22M+)
  • 48.
    SAME DATA. BETTER RESULTS. Thank You. www.skytree.net
  • 49.
    Q&A
    Engage with us!
    1.  Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
    2.  Download machine learning e-books from Ted Dunning: http://www.mapr.com/resources/white-papers#e-books
    3.  Visit Skytree at www.skytree.net
    4.  Learn best practices for Hadoop ETL: www.mapr.com/EDH