Knowledge Discovery in Production
André Karpištšenko
Knowledge Discovery Requires Automation
Growth of information and devices per knowledge worker
1. Digital universe x3.8 in size in 2020. Focus on the highest-value subset.*
2. 26.3B devices in 2020, up +61% from 2015 with x2.7 IP traffic increase.**
3. 700M knowledge workers***, automation worth $5.2T to $6.7T****
* IDC, Apr 2014
** Cisco, Jun 2016
*** Teleport.org, Jun 2016
**** McKinsey, Jun 2016
Core Dataflow
Model Engine
Preprocessing Dataflow
System Composition:
Networked Intelligence
Mature
Nascent
Emerging
networked.ai
Infrastructure, Data & IoT Platforms, Advanced Analytics Platforms
Input
Data
Info
Merger
Data Curator Preparer & Explorer
Base Library
Selector Executor
Self-improvement Interpreter
Output Interfaces Core Human Interfaces
Knowledge Manager
Knowledge Manager
Predictive Modeling Flow Example
DashOpt
Feature Engineering
Raw Data
Raw Features
Labels
Feature Integration
Features with Labels
Data Partitioning
Training Data
Validation Data
Testing Data
Model Training
Evaluate for model selection
Compute offline evaluation metrics
Best model
Offline scoring and indexing
Online/offline systems
Online A/B test
Label preparation
Log data
Scoring features
Raw features
Feature integration
Model Performance
Test Results
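The partitioning, training and model-selection steps of the flow above can be sketched in a few lines. This is a minimal illustration, not the DashOpt implementation: the synthetic data, the k-nearest-neighbour candidates and the 60/20/20 split are all assumptions made for the example.

```python
import random

def partition(rows, train_pct=60, valid_pct=20, seed=0):
    """Shuffle rows and split them into training, validation and testing sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * train_pct // 100
    n_valid = len(rows) * valid_pct // 100
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def knn_predict(train_rows, x, k):
    """Majority label among the k training rows whose feature is nearest to x."""
    nearest = sorted(train_rows, key=lambda r: abs(r[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train_rows, eval_rows, k):
    """Share of eval_rows whose label is predicted correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Synthetic "features with labels" (assumed): label is 1 when a noisy feature exceeds 0.5.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.random()
    y = 1 if x + rng.gauss(0, 0.1) > 0.5 else 0
    data.append((x, y))

train_rows, valid_rows, test_rows = partition(data)

# Model selection: candidates differ only in k; pick the best on validation data.
candidates = [1, 5, 15]
valid_scores = {k: accuracy(train_rows, valid_rows, k) for k in candidates}
best_k = max(candidates, key=lambda k: valid_scores[k])

# Offline evaluation metric of the selected model on the held-out testing data.
test_accuracy = accuracy(train_rows, test_rows, best_k)
print(best_k, round(test_accuracy, 3))
```

The testing data is touched only once, after the model has been chosen, so the reported metric is not biased by the selection step.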
Applications in Production
Electronics Manufacturing Biotechnology
Process time reduction
Predictive maintenance Quality improvement
Yield increase
Product Preview
The problem: Preprocessing data for manufacturing analytics is complex and time-consuming.
How do people solve it today: Custom-built preprocessing solutions are used to gather data in electronics manufacturing.
Product Scope
Data-driven electronics manufacturing enabling understanding and prediction
• Heavy machinery
• Automotive
• Consumer Devices & Networks
• Drives
• PLC
Product for Pilot Factories
Product Solution
• Hybrid SaaS factory subscriptions and applications via open marketplace
• Real-time data streams from the field and factories for R&D and production
Electronics Factories
End Products
IoT Platforms Cloud Services
Delivering Business Value
Enabled metrics data
Increased engagement 2x
Enhanced usability of MES
Increased productivity
Test time reduction
270k-290k EUR/plant
Reducing risk through higher quality data and improving business with data preprocessing
Industrial Analytics Example: Bosch Competition, I
4 product lines
52 stations
Every feature has a timestamp
Data rows: parts of mechanical components
• Training data: 1 183 747 rows
• Test data: 1 183 748 rows
Data columns: anonymized features of stations
• Numeric: 970
• Categorical: 2 141
Bosch has to ensure that the recipes for the production of its
advanced mechanical components are of the highest quality
and safety standards. Part of doing so is closely monitoring its
parts as they progress through the manufacturing processes.
https://www.kaggle.com/
Distinct patterns of missing values of all stations
Utilization of stations
Industrial Analytics Example: Bosch Competition, II
Product Families
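Counting distinct missing-value patterns and per-station utilization, as named on this slide, can be sketched as follows. The column names and the four toy rows are hypothetical stand-ins, not the competition data, and each station is represented by a single feature for brevity.

```python
from collections import Counter

def missing_pattern(row, columns):
    """Tuple of booleans: True where the row has no value for a column."""
    return tuple(row.get(col) is None for col in columns)

def pattern_counts(rows, columns):
    """Count how many rows share each distinct missing-value pattern."""
    return Counter(missing_pattern(row, columns) for row in rows)

def station_utilization(rows, columns):
    """Fraction of rows that carry a measurement for each column."""
    return {
        col: sum(row.get(col) is not None for row in rows) / len(rows)
        for col in columns
    }

# Hypothetical columns: one feature per station.
columns = ["S0_F0", "S1_F0", "S2_F0"]
rows = [
    {"S0_F0": 0.1, "S1_F0": 0.4, "S2_F0": None},
    {"S0_F0": 0.2, "S1_F0": 0.5, "S2_F0": None},
    {"S0_F0": None, "S1_F0": None, "S2_F0": 0.9},
    {"S0_F0": 0.3, "S1_F0": None, "S2_F0": None},
]

counts = pattern_counts(rows, columns)
utilization = station_utilization(rows, columns)
print(len(counts), utilization)
```

Rows that share a pattern passed through the same stations, which is one plausible way to group parts into product families without any labels.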
https://sites.google.com/site/iotminingtutorial/
IoT Data Streams Mining
• Continuous data, dynamic models, distributed, few seconds
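A "dynamic model" over continuous data can be as small as a statistic that is updated per event instead of recomputed over a stored batch. The sketch below uses Welford's online algorithm for mean and variance; the class name and the sample readings are assumptions for illustration.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold a single new reading into the statistics in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the readings seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
print(stats.mean, stats.variance)
```

Because the state is three numbers, it can be kept per sensor and per worker, which suits a distributed setting.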
Streams Mining: Actors Model
Data processing pipeline Distributed processing
Kappa Architecture
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
DashOpt: Data Science Intelligence
Real-Time Predictive Flow
ML & Simulation
Platforms
IoT Platforms
Preprocessed Data
IoT Data
Earth Data
Manufacturing
Data
Predictive Models
Decision Tree SVM
Neural Network Random Forest
Data Science Intelligence
Outlier Detection
• Single point anomaly detection: likelihood over distribution
• Finding anomalous groups: divergence estimation
• Methods: percentage change, T-test, Chi-square test, Generalized ESD (Extreme
Studentized Deviate) test, Seasonal Hybrid ESD, etc.
• Goal: move from detection to automated response
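Single-point detection by likelihood under a fitted distribution can be sketched with the standard library alone. The Gaussian assumption, the baseline readings and the 3-sigma threshold are illustrative choices, not values from the talk.

```python
from statistics import NormalDist

def fit_baseline(values):
    """Fit a normal distribution to readings assumed to be free of anomalies."""
    return NormalDist.from_samples(values)

def is_anomaly(dist, x, z_threshold=3.0):
    """Flag x when it lies more than z_threshold standard deviations from the mean."""
    z = (x - dist.mean) / dist.stdev
    return abs(z) > z_threshold

baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9]
dist = fit_baseline(baseline)

new_points = [10.05, 12.5, 9.6]
flags = [is_anomaly(dist, x) for x in new_points]
likelihoods = [dist.pdf(x) for x in new_points]  # density of each point under the baseline
print(flags)
```

A low density under the baseline distribution and a large absolute z-score are two views of the same test when the baseline is Gaussian.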
Outlier Detection in Practice
• Too many detections of too little value
• Use methods for thresholds
• Breakout detection and Concept Drift
• For changing distributions move baselines over time
• Risk of overfitting to known anomalies, not finding unknown anomalies
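Moving the baseline over time can be sketched with a fixed-size window of recent values. The window length, threshold and series below are assumptions chosen to show a level shift being flagged once and then absorbed into the new baseline.

```python
from collections import deque
from statistics import mean, stdev

def rolling_anomalies(series, window=5, z_threshold=3.0):
    """Return indices whose value deviates strongly from a moving baseline."""
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(series):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(x - mu) > z_threshold * sigma:
                flagged.append(i)
        recent.append(x)  # the baseline moves forward with every reading
    return flagged

# A stable level around 10 followed by a breakout to a new level around 20.
series = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0,
          20.0, 20.1, 19.9, 20.2, 19.8, 20.0, 20.1]
flagged = rolling_anomalies(series)
print(flagged)
```

Only the first point after the breakout is flagged; later points are judged against a window that already contains the new level. The trade-off is that the breakout inflates the window's variance for a while, so a real anomaly arriving right after it could be missed.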
Bayesian aka Active Optimization
• Examples: Design of Experiments, hyper-parameters of supervised
learning, algorithms tested with simulations
f is an unknown, expensive black-box function; the goal is to approximately optimize f with as few experiments as possible
• No free lunch theorem
• Other bio-inspired algorithms for optimization (exploitation and exploration): neural networks, genetic algorithms, swarm intelligence, ant colony optimisation, etc.
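The idea of optimizing an expensive f with few experiments can be sketched as a one-dimensional loop: fit a Gaussian-process surrogate to the evaluations so far, then evaluate f where expected improvement is highest. Everything specific here is an assumption for illustration: the stand-in objective, the kernel and its length scale, the candidate grid and the evaluation budget.

```python
import math

def rbf(a, b, length=0.15):
    """Squared-exponential kernel; the length scale is an assumed value."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def posterior(xs, ys, x_new, noise=1e-6):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    offset = sum(ys) / len(ys)  # centre the observations around zero
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    weights = solve(gram, k_new)
    mu = offset + sum(w * (y - offset) for w, y in zip(weights, ys))
    var = max(rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_new)), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best value seen so far (maximization)."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def f(x):
    """Hypothetical stand-in for the expensive black-box function."""
    return -(x - 0.7) ** 2

def optimize(objective, n_iter=8):
    """Run n_iter surrogate-guided experiments after three initial ones."""
    candidates = [i / 100 for i in range(101)]
    xs = [0.0, 0.5, 1.0]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        best = max(ys)
        pool = [c for c in candidates if c not in xs]
        x_next = max(pool, key=lambda c: expected_improvement(*posterior(xs, ys, c), best))
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], len(xs)

best_x, best_y, n_evals = optimize(f)
print(best_x, best_y, n_evals)
```

Expected improvement weighs the surrogate's mean (exploitation) against its uncertainty (exploration), which is the same trade-off named in the last bullet above. The dense linear solve is adequate for a handful of evaluations only.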
Bayesian Optimization in Practice
• SigOpt experience: 20 dimensions, above human capacity.
• Uber ATC experience: when scaling active optimization to high dimensions, the default works reliably for 5-7 dimensions.
• Variables are added during optimization.
• Choose fidelity using heuristics.
DashOpt: Data Science Intelligence
US Patent pending
Current State in Biotech
Already available: Extensive databases of DNA sequences, metabolism of cells and components (enzymes etc.), high-throughput experimental omics methods
Future state: Software environment for in silico ab initio design of cells, and in silico testing (predictive modeling) of the cell designs in manufacturing processes
Thinking about Value from Data Science
