Machine Learning: Past, Present, and Future
Professor Thomas G. Dietterich
Oregon State University
Chief Scientist, BigML
11 May 2017
Machine Learning: Past

Question: How to build smart software?
Answer (in 1980): Interview an expert, encode the knowledge in software.

This worked well for expert systems:
▪ Medical diagnosis of blood diseases (MYCIN)
▪ Interpreting mass spectrograms (DENDRAL)
▪ Configuring computer hardware systems (XCON)
Machine Learning: Past

Question: How to build smart software?
Answer (in 1980): Interview an expert, encode the knowledge in software.

But it did NOT work well for other tasks:
▪ Optical character recognition
▪ Robot control
▪ Natural language processing
Machine Learning: Past

[Figure: example digit images being mapped to their labels, e.g. ➔ "3", ➔ "6", ➔ "2"]
1980-2000: Many New Algorithms Developed

▪ Decision trees
▪ Probabilistic graphical models, including Naïve Bayes
▪ Support vector machines
▪ Ensemble methods: bagging and boosting
From Function Learning to Knowledge Discovery

Learning algorithms succeed because they find patterns in the data, and some algorithms reveal those patterns in easy-to-understand ways. Data mining and knowledge discovery focus on discovering and visualizing these patterns (a concrete example follows the list below):
▪ Decision trees
▪ Probabilistic graphical models, including Naïve Bayes
▪ Association rules (frequent item sets)
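As a small illustration, here is a minimal sketch (assuming scikit-learn and its bundled iris dataset, neither of which is from the talk) of how a decision tree exposes the patterns it has found as readable rules:

```python
# Train a tiny decision tree and print its learned splits as nested
# if/then rules, the "easy-to-understand" pattern view described above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
```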
Machine Learning: Present

1. Automated Decision Making
2. Perceptual Tasks
3. Anomaly Detection
(1) Automated Decision Making

Examples:
▪ Recommendations (Netflix; Amazon)
▪ Advertisement placement (Google; Facebook)
▪ Robot control (self-driving cars)

Challenge: You only learn the outcome of the chosen action; you are not told what the best action would have been ("bandit feedback").

Example: In advertisement placement, I might have 10,000 possible advertisements that I could present to the user. I must choose 5. The user might click on one of these.
Automated Decision Making Requires Experimentation

1. The learner must systematically try all actions in many situations, but eventually focus on the most promising actions.
2. When analyzing historical data (e.g., which advertisements a user clicked on), you must keep track of which advertisements were displayed.
3. The goal is to uncover the causal relationship between the selected action and the observed result.

A minimal sketch of these ideas follows below.
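This sketch is in plain Python; the action names, epsilon value, and reward handling are illustrative assumptions, not from the talk. Note the log: recording which action was displayed, and with what probability, is what makes later causal analysis of the historical data possible.

```python
import random

actions = ["ad_A", "ad_B", "ad_C"]      # illustrative action set
counts = {a: 0 for a in actions}        # times each action has been tried
value = {a: 0.0 for a in actions}       # running mean reward per action
log = []                                # (action, propensity, reward) records
epsilon = 0.1                           # exploration rate (assumed)

def choose():
    greedy = max(actions, key=value.get)
    if random.random() < epsilon:
        a = random.choice(actions)      # explore: try any action
    else:
        a = greedy                      # exploit: most promising action so far
    # Propensity: the probability this action had of being displayed
    p = (1 - epsilon if a == greedy else 0.0) + epsilon / len(actions)
    return a, p

def update(a, p, reward):
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]   # incremental mean
    log.append((a, p, reward))          # keep track of what was displayed
```

Off-policy estimators such as inverse propensity scoring can then reweight this log to evaluate a new policy without re-running the experiment.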
Algorithms: Contextual Bandits

[Figure omitted. Credit: Cohen, et al., PTRS B 2008]
Reinforcement Learning
Reinforcement Learning Examples

1. Self-driving car
   ▪ State of world: location, speed, and acceleration of the car
   ▪ Actions: steering, acceleration, braking
   ▪ Rewards: reaching the destination quickly; not colliding with people, obstacles, or other cars; conserving fuel
2. Playing the game of Go
   ▪ State of world: the state of the Go board
   ▪ Actions: placing a stone
   ▪ Reward: winning the game
3. Operations (logistics, inventory management)

The sketch below illustrates the learning loop these examples share.

[Image: Tesla AutoSteer. Credit: Tesla Motors]
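A minimal tabular Q-learning sketch of the state/action/reward loop; the `env_step` callback is an assumed placeholder for whatever environment (car simulator, Go engine, inventory model) supplies transitions and rewards:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration
Q = defaultdict(float)                    # (state, action) -> value estimate

def q_step(state, actions, env_step):
    # Epsilon-greedy choice: mostly exploit the best-known action
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = env_step(state, action)   # environment responds
    # Move the estimate toward reward + discounted best future value
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state
```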
(2) Deep Learning for Perception

1. Standard machine learning algorithms require that the data be converted into meaningful features.
2. This is easy for typical database information.
3. But it is very difficult for signal-level data (images, speech).
4. Deep learning methods are able to automatically discover meaningful intermediate features.
Progress in Object Recognition

[Chart: top-5 classification error (%) by year, 2010-2014. ImageNet, 1000 object categories: is the right answer in the top 5 predictions?]
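For concreteness, a small sketch (assuming NumPy) of the top-5 error metric used in this benchmark:

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (n, n_classes) model scores; labels: (n,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]        # five highest-scored classes
    hits = [labels[i] in top5[i] for i in range(len(labels))]
    return 1.0 - np.mean(hits)                       # fraction of examples missed
```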
Progress in Speech Recognition

[Chart omitted. Credit: Fernando Pereira and Matthew Firestone (Google)]
Deep Learning = Differentiable Programming

1. To develop a deep learning application, the programmer designs a task-specific deep network.
2. This network must be differentiable so that the network parameters can be adjusted via gradient search.
3. Modern deep-net tools compute the derivatives automatically (see the sketch after this list).
4. DL programmers are busy exploring the space of differentiable programs to understand which patterns and idioms work well:
   ▪ Convolutional blocks
   ▪ LSTM gates
   ▪ Various forms of associative memory
   ▪ Auto-encoders and generative adversarial networks
5. It is still very hard to get a DL application to work.
(3) Anomaly Detection

In many applications, we have a large amount of data describing "normal" behavior, and our goal is to detect "anomalous" behavior:
1. Fraud detection
   ▪ Normal behavior: good customers
   ▪ Abnormal behavior: fraudulent customers
2. Cyber attacks
   ▪ Normal behavior: normal network flows
   ▪ Abnormal behavior: network flows caused by cyber attacks
3. Machine diagnosis
   ▪ Normal behavior: normal sensor readings
   ▪ Abnormal behavior: unusual sensor readings

It is usually not safe to assume that abnormal behavior follows a fixed probability distribution.
Experimental Study of Anomaly Detection Methods

1. 25,685 benchmark datasets
2. Eight algorithms
3. Systematic control of relevant parameters (e.g., anomaly frequency, irrelevant features, problem difficulty)

We found that the Isolation Forest algorithm was the overall best; it is included in BigML's services. A usage sketch follows below.

[Chart: change in metric (logit(AUC), log(LIFT)) with respect to the control dataset, by algorithm: iforest, lof, abod, ocsvm]

[Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
[Emmott, Das, Dietterich, Fern, Wong, 2016; arXiv 1503.01158v2]
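A minimal usage sketch, with scikit-learn's implementation standing in for the Isolation Forest algorithm and synthetic data standing in for the benchmarks:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(0, 1, size=(500, 2))    # stand-in for "normal" behavior
odd = rng.uniform(-6, 6, size=(10, 2))      # a few anomalous points

model = IsolationForest(random_state=0).fit(normal)
# score_samples: lower means easier to isolate, i.e., more anomalous
print(model.score_samples(odd).mean(), model.score_samples(normal).mean())
```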
Machine Learning: Future

1. Detecting and Correcting for Bias
2. Risk-Sensitive Optimization
3. Explanation of Black Box Systems
4. Verification and Validation
5. Integrating ML components into larger software systems
(1) Detecting and Correcting for Bias

Race, sex, and age are legally protected categories for certain decisions:
▪ Granting loans
▪ Renting or buying housing
▪ Employment

What constraints should this place on machine learning algorithms?

Constraint 1: The algorithms should not use these categories (or any features from which these categories can be predicted) to make these decisions.
Constraint 2: The learned predictive model should exhibit the same false positive/false negative tradeoffs (the same ROC curve) for each protected subgroup. A sketch of auditing this constraint follows below.
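A minimal sketch of auditing Constraint 2: compare false positive and false negative rates across protected subgroups. The arrays here are assumed placeholders, and predictions are taken to be binary 0/1.

```python
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """Per-subgroup FPR/FNR for binary labels and binary 0/1 predictions."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        fpr = np.mean(y_pred[m][y_true[m] == 0])       # positives among true negatives
        fnr = np.mean(1 - y_pred[m][y_true[m] == 1])   # negatives among true positives
        rates[g] = (fpr, fnr)
    return rates   # large gaps between subgroups signal a Constraint 2 violation
```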
Sampling Bias

[Figure omitted]
22
1.
(2) Risk-Sensitive Optimization
0.0
0.1
0.2
0.3
0 2 4 6 8
V
P(V)
0.0
0.1
0.2
0.3
0 2 4 6 8
V
P(V)
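The slide contrasts two outcome distributions. One common risk-sensitive criterion (not named on the slide, so this choice is an assumption) is conditional value-at-risk, the mean of the worst outcomes. A minimal sketch with illustrative numbers:

```python
import numpy as np

def cvar(samples, alpha=0.1):
    """Mean of the worst alpha-fraction of outcomes (the lower tail)."""
    cutoff = np.quantile(samples, alpha)
    return samples[samples <= cutoff].mean()

a = np.random.default_rng(0).normal(4.0, 0.5, 10_000)  # narrow outcome distribution
b = np.random.default_rng(1).normal(4.5, 2.0, 10_000)  # higher mean, heavier tails
print(a.mean(), b.mean())    # risk-neutral view: b looks better
print(cvar(a), cvar(b))      # risk-sensitive view: a's worst cases are far milder
```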
(3) Explanations for Black Box ML Systems

Some machine learning methods, particularly ensembles and deep neural networks, do not provide simple visualizations or explanations of their predictions. We need explanations:
1. To debug datasets and ML systems
2. To satisfy regulatory requirements (e.g., the EU "right to explanation")
3. To understand what knowledge the system has discovered
4. To decide whether to trust the system

The general strategy is to compute interpretable approximations of complex ML systems; a small sketch of this strategy follows below.
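One way to realize this strategy: train a shallow, readable tree to mimic a black-box model's predictions. The random forest here is an assumed stand-in for the black box, not a method named in the talk.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(random_state=0).fit(X, y)   # opaque model

# Fit a small, readable surrogate to the black box's own predictions
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))
print(export_text(surrogate))   # human-readable approximation of the black box
```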
(4) Verification and Validation

[Figure omitted]
(5) Integrating ML Components

ML provides new ways of constructing software. We need software engineering processes that ensure successful end-to-end system performance.

Recommended reading: Martin Zinkevich, "Rules of Machine Learning: Best Practices for ML Engineering," based on his experience at Google: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Summary

Machine Learning Past:
1. Machine learning started as a way to construct software from training examples.
   ▪ This is still a major goal.
2. ML methods were extended to support data mining and knowledge discovery.

Machine Learning Present:
1. Automated Decision Making: contextual bandit and reinforcement learning methods
   ▪ Requires the capability to perform experiments
2. Perceptual Tasks: deep learning for computer vision, speech recognition, etc.
   ▪ Can learn its own features from raw signals
3. Anomaly Detection: fraud detection, cyber security, machine diagnosis

Machine Learning Future:
1. Detecting and Correcting for Bias
2. Risk-Sensitive Optimization
3. Explanations of Black Box Systems
4. Verification and Validation
5. Integrating ML Components into Larger Software Systems
27
Jeff Bezos, CEO, Amazon
"Machine learning and A.I. is a horizontal enabling layer. It will empower
and improve every business, every government organization, every
philanthropy. Basically, there's no institution in the world that cannot be
improved with machine learning”
--Inc. Magazine May 9, 2017
A Final Quote