MapR & Skytree:

Predicting failure in power networks, detecting fraudulent activities in payment card transactions, and identifying next logical products targeted at the right customer at the right time all require machine learning around massive data sets. This form of artificial intelligence requires complex self-learning algorithms, rapid data iteration for advanced analytics and a robust big data architecture that’s up to the task.

Learn how you can quickly exploit your existing IT infrastructure and scale operations in line with your budget to enjoy advanced data modeling, without having to invest in a large data science team.


Usage Rights

© All Rights Reserved

MapR & Skytree: Presentation Transcript

  • © 2014 MapR Technologies. July 23, 2014
  • Our Speakers
      • Jin Kim, VP, Marketing, Skytree
      • Nitin Bandugula, Product Marketing, MapR
  • Agenda
      • Introduction to Hadoop
      • Machine Learning on Hadoop
      • Advanced Machine Learning
      • Customer Case Studies
  • Big Data is Overwhelming Traditional Systems
      • [Diagram: Enterprise Data Architecture, showing enterprise users, outside sources, operational systems, and analytical systems]
      • Production requirements (operational systems): mission-critical reliability, transaction guarantees, deep security, real-time performance, backup and recovery
      • Production requirements (analytical systems): interactive SQL, rich analytics, workload management, data governance, backup and recovery
  • Hadoop: The Disruptive Technology at the Core of Big Data
      • [Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]
  • Hadoop Relieves the Pressure from Enterprise Systems
      • [Diagram: Hadoop alongside operational systems, analytical systems, and enterprise users]
      • Data staging
      • Archive
      • Data transformation
      • Data exploration
      • Streaming, interactions
      • Keys for Production Success:
          1. Reliability and DR
          2. Interoperability
          3. High performance
          4. Supports operations and analytics
  • MapR: Best Hadoop Distribution for Customer Success
      • Top Ranked
      • Exponential Growth
      • 500+ Customers
      • Premier Investors
      • 3X bookings Q1 ‘13 – Q1 ‘14
      • 80% of accounts expand 3X
      • 90% software licenses
      • <1% lifetime churn
      • >$1B in incremental revenue generated by 1 customer
  • The Power of the Open Source Community
      • [Diagram: Apache Hadoop and OSS ecosystem running on the MapR Data Platform, with management, execution engines, and data governance and operations layers]
      • Categories shown: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access; Security
      • Components shown: MapReduce v1 & v2, YARN, Pig, Cascading, Spark, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill*, Juju, Savannah*, Whirr, ZooKeeper, Oozie, Falcon*, Sqoop, Flume, Knox*, HttpFS, Hue, Sentry*
      • * Certification/support planned for 2014
  • Machine Learning Stack
      • [Diagram: the same Apache Hadoop and OSS ecosystem stack as on the previous slide, including Mahout, MLLib, and GraphX under ML, Graph]
      • * Certification/support planned for 2014
  • Machine Learning Cuts Across All Use Cases
      • Enterprise data hub: multi-structured data staging & archive; ETL / DW optimization; mainframe optimization; data exploration
      • Marketing optimization: recommendation engines & targeting; Customer 360; click-stream analysis; social media analysis; ad optimization
      • Risk & security optimization: network security monitoring; security information & event management; fraudulent behavioral analysis
      • Operations intelligence: supply chain & logistics; system log analysis; manufacturing quality assurance; preventative maintenance; smart meter analysis
  • How Does Big Data Help Machine Learning?
      • Big Data => Better Models
      • A machine that has played 1 million checkers games will be smarter than one that has played just 100 games
      • More data improves the accuracy of the model, especially for unsupervised learning
      • The model is less likely to overfit because of the variety of data
      • [Diagram: past data is used to build the model; the model is applied to new data to produce results]
  • Common Machine Learning Use Cases on Hadoop
      • Linear/Polynomial Regression – fit to an equation – predict prices
      • Logistic Regression – probability of occurrence – classify spam
      • K-means Clustering – group things together – customer segmentation
      • Recommender Systems and Collaborative Filtering – product recommendation
      • Anomaly Detection – credit card fraud
      • The data scientist decides what works best
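
To make the "group things together" idea on this slide concrete, here is a minimal sketch of k-means applied to customer segmentation. It is plain Python rather than anything Hadoop-specific, the function name `kmeans_1d` and the spend figures are invented for illustration, and it clusters on a single feature only; a real segmentation would use many features and a library implementation such as Mahout or MLLib.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data; returns the centroids in ascending order."""
    ordered = sorted(values)
    if k == 1:
        centroids = [ordered[0]]
    else:
        # Deterministic start: spread the initial centroids evenly over the sorted data.
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each value joins the cluster of its nearest centroid.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)


# Hypothetical monthly spend per customer (illustrative numbers only).
spend = [12, 15, 14, 90, 95, 100, 410, 400, 390]
segments = kmeans_1d(spend, k=3)
print(segments)  # three centroids: low, medium, and high spenders
```

The three centroids summarize low, medium, and high spenders; each customer would then be assigned to the segment whose centroid is closest.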
  • Machine Learning on Hadoop
  • Modeling Process – Constant Iterations / Free to Fail
      • Modeling Data Set + Validation Data Set
      • Constant iterations and plotting
          – Underfit vs. overfit
          – Feature manipulation
          – Adjusting learning rates
          – False positives vs. false negatives – precision levels
          – Measuring error, etc.
      • Legacy applications, libraries, code used to manipulate data
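
The false positive versus false negative trade-off mentioned above can be sketched with a small validation set. The scores, labels, and thresholds below are made up for illustration; the point is only that raising the decision threshold tends to trade recall for precision, which is one of the things a modeler iterates on.

```python
def confusion(scores, labels, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(scores, labels, threshold):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores on a validation set, with true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

strict = precision_recall(scores, labels, threshold=0.75)   # fewer false positives
lenient = precision_recall(scores, labels, threshold=0.35)  # fewer false negatives
print(strict, lenient)
```

With these numbers the strict threshold yields no false positives but misses one positive, while the lenient threshold catches every positive at the cost of one false alarm.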
  • Development and Deployment Process: Lots of Operational Issues
      • Players and their activity:
          – Mathematicians: need newer data sets from production for model building and validation; need complete autonomy for inventions
          – Developers: develop the final solution based on models, then test and deploy working with Ops; need to coordinate heavily
          – Operations staff: need to provide data and deploy apps while ensuring data consistency, data compliance, HA, DR, etc.
  • Volumes and Mirroring
      • The conflict: the experimental, free-to-fail modeling process needs production data
      • Solutions:
          1. Same cluster: separate volumes, multi-tenancy, labels, queues, data placement control, etc.
          2. Different cluster for R&D purposes: mirroring (efficient, less network bandwidth, across the globe, easy to deploy and maintain)
  • Snapshots
      • The idea: version control of data as well as models
      • Data version control:
          – How does my model work against new validation sets?
          – How did it change across many validation sets?
      • Model version control:
          – How can I go back and check my new model against old datasets?
          – How do I prove that what I came up with worked for the data we had at the time (replicate scenarios)?
  • Read Write NFS Access
      • Existing applications and custom libraries all work out-of-the-box
      • Browsers, modeling languages, and scripts work out-of-the-box
      • Data ingestion is easy
          – Quickly move data in and out without having to wait for developers and administrators to build and maintain a Flume cluster
  • Machine Learning Options
  • Apache Spark
      • Spark – in-memory processing framework
          – Works well with iterative machine learning algorithms, because the matrices can be pulled into memory
          – Up to 100x better performance than MapReduce when working in memory
      • MLLib
          – Built-in libraries for a variety of algorithms
          – Python and NumPy support
      • GraphX
          – Libraries to model relationships between entities, e.g. social media
  • Apache Mahout
      • Built-in algorithms for popular techniques such as recommenders, classification, collaborative filtering, etc.
      • Moving towards running on Spark
  • Advanced Machine Learning with Skytree
      • [Architecture diagram: sources feed the MapR Distribution for Hadoop, which supports advanced modeling, exploration, and analytics, with offload and re-load paths to data marts and the data warehouse]
      • Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
      • MapR Data Platform: MapR-DB, MapR-FS
      • Batch (MR, Spark, Hive, Pig, …); Interactive (Impala, Drill, …); Streaming (Spark Streaming, Storm, …)
  • Skytree
  • Q&A: Engage with us!
      1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
      2. Download machine learning e-books from Ted Dunning: http://www.mapr.com/resources/white-papers#e-books
      3. Visit Skytree at www.skytree.net
      4. Learn best practices for Hadoop ETL: www.mapr.com/EDH
  • THE MACHINE LEARNING COMPANY ® SAME DATA. BETTER RESULTS.
      • Jin H. Kim, VP of Marketing, jin@skytree.net
  • Our Vision: The Data Driven Enterprise Powered by Machine Learning
      • Machine learning: the modern science of finding patterns and making predictions from data
      • Also known as: multivariate statistics, data mining, pattern recognition, advanced/predictive analytics
  • Machine Learning has finally arrived
      • [Timeline: technology evolution and application evolution]
      • 50’s–70’s, 1st wave: artificial intelligence, pattern recognition; applications: universities
      • 80’s–90’s, 2nd wave: neural networks, data mining; applications: science, credit scoring, OCR
      • Mid 90’s – today, 3rd wave: machine learning, convergence; applications: sales / marketing, finance, biotech, retail, telco, government
      • Now: machine learning on big data
  • Skytree: Machine Learning for High-Value, High-Complexity Problems
      • Predictive optimal decision-making
          – High-frequency algorithmic trading
          – Online advertising exchanges
          – Fast customer targeting and churn analysis
      • Predictive monitoring/discovery assistance
          – Point-of-compromise fraud tips/cues
          – Network fault monitoring/diagnosis
          – Predictive maintenance of network of devices
          – Fraud analysis in claims
          – Insider threat/DLP and cyber security
  • High-Value, High-Complexity Problems: Critical Elements in Common
      1. High accuracy needed (needle-finding)
          – Small number of known examples
          – Identify anomalies with no prior examples
      2. Complex data fusion needed (unified objects)
          – Spatial-temporal behavior/event pattern-finding and tracking
          – Inference of activities, entities/identities, relations
      3. Automation needed (augment human analysts)
          – Value-based attention-focusing, recommendation of relevant content
          – Real-time interactivity without waiting
          – Fast construction of new reports for agility
  • Use Case Examples
      • Financial Services: fraud analysis, credit scoring, pricing, churn analysis, SDN/SON
      • Government: fraud analysis, scoring, anomaly detection, fault analysis, SDN/SON
      • Retail: segmentation, recommendation, churn analysis, lead scoring, pricing
      • Asset Intensive: preventative maintenance, defect/fault detection, supply chain management, cost forecasting, failure analysis
  • Global Leaders Select Skytree
      • World’s leading Logistics & Shipping company: anomaly detection
      • World’s leading Consumer Electronics company: content recommendation
      • World’s leading Automobile company: on-board destination recommendation
      • World’s leading Web Portal: ad targeting
      • World’s leading Financial Services & Credit Card company: customer lead scoring, fraud, credit risk scoring
  • Press and analyst coverage
      • “10 Hot Big Data Startups to Watch”
      • “Skytree Looms in Big Data Forest with New Funding”
      • “Skytree Uses Machine Learning To Crunch Big Data”
      • Skytree named “Big Data Analytics Vendor to Watch”
      • “The Ten Coolest Big Data Startups in 2013”
      • “One giant leap for machinekind”
      • Skytree among “10 Emerging Technologies for Big Data”
      • “…could change the face of Big Data”
      • Who’s Who of Advanced Analytics
  • Insurance: Targeted Auto Policies with Telemetric Data
      • Business challenge
          – Inaccurate policy pricing based on demographics and actuarial data (example: many teens are good drivers but they often incur higher premiums)
          – Availability of new data sources including telemetry data
      • Machine learning solution
          – Use telematics to price insurance based on near-real-time driving habits
          – Base rates on an individual’s actual driving history
          – Data fusion to personalize and increase objectivity and accuracy in pricing and claims processing
      • Business benefit
          – Targeted customer pricing and policies
          – Improved customer retention
          – Higher customer satisfaction and margins
  • Customers’ Use of Skytree: Targeting – Find New Customers
      • Global 100 Financial Institution
      • Major pain points: speed and accuracy of current approach
      • Current solution: SAS, Hadoop, homegrown
      • “I want our analysts to create models rather than writing software” – Skytree Customer
      • Current: 1,200 cores @ 100-node Hadoop cluster; runtime: 100 minutes; accuracy (Gini): 57%
      • Skytree Server: 12 cores @ 1 node; runtime: 8 minutes; accuracy (Gini): 60%
      • 1250x speedup
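
The slide does not say how the 1250x figure is derived. The sketch below shows one reading that reproduces it from the numbers on the slide: the wall-clock improvement is 12.5x, the core count drops 100x, and the ratio of core-minutes consumed is 1250x. Treat this interpretation as an assumption, not as the presenters' stated method.

```python
# Figures taken from the slide.
legacy = {"cores": 1200, "runtime_min": 100}
skytree = {"cores": 12, "runtime_min": 8}

# Wall-clock speedup: how much sooner the job finishes.
wall_clock_speedup = legacy["runtime_min"] / skytree["runtime_min"]

# Hardware reduction: how many fewer cores are used.
core_reduction = legacy["cores"] / skytree["cores"]

# Core-minutes ratio: total compute consumed, legacy versus Skytree.
# Assumption: this is the quantity behind the slide's "1250x".
core_minutes_ratio = (legacy["cores"] * legacy["runtime_min"]) / (
    skytree["cores"] * skytree["runtime_min"]
)

print(wall_clock_speedup, core_reduction, core_minutes_ratio)
```

Under this reading the analyst waits 12.5 times less, and the 1250x figure describes compute efficiency rather than elapsed time.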
  • Asset Intensive: Predict Parts Failure through Telemetric Data
      • Business challenge
          – Early infant mortality of parts due to rapid aging is not easily detectable during manufacturing and environmental acceptance tests
          – Utilize diagnostic data such as impedance, voltage, temperature (multidimensional data)
      • Machine learning solution
          – Detect transient indicators of rapid aging through telemetric data
              · Time between Beginning of Life and first transient is random
              · Time between first transient and End of Life is deterministic
          – Automatic parameter tuning
          – Data fusion
      • Business benefit
          – Efficient parts inventory management
          – Higher customer satisfaction
          – Optimize preventative maintenance scheduling based on predicted Time To Failure (TTF)
  • Predict Parts Failure through Telemetric Data
      • [Diagram: manufacturing data, telemetric data, and geo-location data stored on a Hadoop cluster]
      1. Build failure model from manufacturing test data
      2. Blend in data from telemetric and other big data sources
      3. Real-time discovery of transient part behavior patterns to predict Time-To-Failure
  • Improve Customer Retention with Machine Learning
      • Business challenge
          – Cost of attracting new customers is many times more than retaining customers
          – Greater customer sophistication and competition increase churn levels
      • Machine learning solution
          – Identify events that predict customer needs
          – Isolate best targets and best offers for individual customers (predict what offer or service would prevent a customer from switching)
          – Discover purchase patterns and profiles of customers who leave for a deeper understanding
      • Business benefit
          – Reduced churn and increased customer loyalty
          – Increased margins and marketing effectiveness
          – Improved up/cross sell opportunities
  • Performance Studies by Customers: Next Logical Product – Right Offer to Right Customer (Skytree Confidential)
      • Global Fortune 20 Company
      • Major pain points: speed and accuracy of legacy approach
      • Current solution: homegrown
      • 1M data points for a “pilot”
      • Legacy approach: 35% accurate
      • Results: 20% increase in recommendation relevance in a fraction of the time
          – Runtime (mins): Legacy 97; Skytree .07
          – Precision@5 (%): Legacy 35%; Skytree 42%
      • “We are literally speechless” – Skytree Customer
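
For readers unfamiliar with the metric on this slide, Precision@5 is the fraction of the top five recommendations that turn out to be relevant. The sketch below uses invented item names to show the calculation, then checks that moving from 35% to 42% is a 20% relative increase, which matches the slide's headline figure.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical ranked recommendations and the items the customer actually wanted.
recommended = ["A", "B", "C", "D", "E", "F"]
relevant = {"B", "D", "Z"}
p_at_5 = precision_at_k(recommended, relevant, k=5)

# Relative improvement using the percentages reported on the slide.
legacy_precision_pct = 35
skytree_precision_pct = 42
relative_increase = (skytree_precision_pct - legacy_precision_pct) / legacy_precision_pct

print(p_at_5, relative_increase)
```

Note that the 20% is relative to the legacy score; in absolute terms the gain is seven percentage points.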
  • Real-Time Fraud Detection
      • Business challenge
          – Growing complexity of fraud patterns
          – Increased frequency of fraud
          – Minimize false positives without compromising fraud accuracy
      • Machine learning solution
          – Leverage diverse big data for better context
          – Real-time update of model parameters
          – Faster and more accurate model for better fraud detection
      • Business benefit
          – More accurate and agile fraud detection system
          – Improved customer satisfaction
          – Improved financial results
  • Global 2000 Credit Card Network – Before
      • Workflow:
          1. Transaction data transferred from database to Linux server
          2. Modeling: fraud model created to detect fraud; model is exported
          3. Model is re-coded by a new set of engineers for the mainframe
          4. Real-time detection: new model is “loaded”; fraud could be detected in real time
      • Customer wanted a more accurate model
      • Current model in system was designed to be updated on a yearly basis
      • Running a model on a large dataset took over 2 days
      • Skytree’s goal is to move update of the model to daily or real time
      • Hardware: Linux x86 server, mainframe
      • Software: internally developed random decision forests
      • SLA: fraud scored in real time; fraud model updated yearly
  • Global 2000 Credit Card Network – Now
      • Modeling & Real-Time Score Environment
      • [Diagram: data stored on MapR Hadoop cluster; fraud model and unsupervised ML models created; fraud model updated daily / real-time]
      • Customer can use the same environment for modeling and for production
      • Models can be updated on a daily or real-time basis depending on needs
      • More frequent updates lead to a significant increase in lift
      • Hardware: Linux x86 server
      • Software: MapR, Skytree fraud detection models
      • SLA: fraud scored in real time; fraud model updated daily / real-time
  • Global 2000 Credit Card Network – Now: “Key to increasing fraud detection accuracy”
      • Use all of the data: sampling can decrease accuracy of results
      • Semi-supervised learning: a combination of supervised and unsupervised learning can improve fraud detection rates
      • Weight transactions based on date: Skytree Server allows each transaction to be weighted differently and allows fraud models to preferentially weigh recent fraud vs. older fraud
      • Use the most important variables:
          – Were the last few transactions at an un-manned location?
          – Is the transaction over the credit limit?
          – Which day of the week was the fraud committed?
          – Has the card been reported for fraud before?
          – And more…
      • Weight based on transaction value: we should care more about larger transactions
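
The two weighting ideas on this slide, favoring recent fraud and favoring larger transactions, can be illustrated with a per-transaction training weight. This is a hypothetical sketch: the exponential decay, the 90-day half-life, the multiplication by amount, and the function name `transaction_weight` are all assumptions made for illustration and do not describe how Skytree Server computes or accepts weights.

```python
def transaction_weight(age_days, amount, half_life_days=90.0):
    """Training weight that decays with age and grows with transaction value.

    Assumptions (not from the slides): exponential decay with a fixed
    half-life, and a weight directly proportional to the amount.
    """
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
    return recency * amount


# Hypothetical labeled fraud cases: (age in days, amount).
cases = [(0, 100.0), (90, 100.0), (180, 400.0)]
weights = [transaction_weight(age, amount) for age, amount in cases]
print(weights)
```

Weights like these would typically be passed to a learner as per-sample weights, so that a recent or high-value fraud case influences the fitted model more than an old or small one.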
  • Skytree Maximizes Predictive Accuracy
      • Advantages and their benefits:
          1. Breadth of advanced methods (more powerful/advanced methods and options): greater chance of having the best model for your data
          2. Speed & scalability (use more data, test more parameters): improved accuracy in the time available
          3. Automation / ease of use (shorter time to most accurate models): more productive modelers, more people in the company can use it
      • Skytree is designed from the ground up for these benefits.
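  • Sources of Generalization Error
      • Source slide: Hua Ouyang, Optimal Stochastic & Distributed Algorithms for Machine Learning
      • Excess error: $\mathbb{E}_{\xi}\left[ f(x_t, \xi) - \inf_{x \in H^*} f(x, \xi) \right]$, which splits into three terms:
          – Improper model: $\mathrm{Err}_{\text{Approximation}} = \mathbb{E}_{\xi}\left[ \inf_{x \in H} f(x, \xi) - \inf_{x \in H^*} f(x, \xi) \right]$
          – Finite samples: $\mathrm{Err}_{\text{Estimation}} = \mathbb{E}_{\xi}\left[ f(x^*_{(N)}, \xi) - \inf_{x \in H} f(x, \xi) \right]$
          – Algorithmic accuracy: $\mathrm{Err}_{\text{Expected-Optimization}} = \mathbb{E}_{\xi}\left[ f(x_t, \xi) - f(x^*_{(N)}, \xi) \right]$
      • The $\inf_{x \in H} f(x, \xi)$ and $f(x^*_{(N)}, \xi)$ terms (struck through on the slide) cancel when the three errors are summed, leaving the excess error.
      • Notation: $\xi$: data sample; $N$: number of data samples; $H$: hypothesis space of the model; $H^*$: “true” hypothesis space that contains the optimal $x^*$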
  • First Principles: Sources of prediction error
      • [Same error decomposition as on the previous slide, from Hua Ouyang, Optimal Stochastic & Distributed Algorithms for Machine Learning, annotated with remedies]
      • Improper model (approximation error): use the right model; try many
      • Finite samples (estimation error): use more data; all of it
      • Algorithmic accuracy (expected optimization error): use the right parameters; try many
  • Skytree and Spark
      • [Architecture diagram with the following labels: MapR Data Platform; 1.x; 2.x / YARN; Spark; ZooKeeper; Web Services; Command Line Interface; Data Sources/Targets; OLTP / EDW]
  • Why Skytree? Why do companies pick us for Big Data analytics?
      • Built on Solid Foundation
      • Investors (22M+)
  • SAME DATA. BETTER RESULTS. Thank You. www.skytree.net