© OCTO 2015
Tél : +41 (0) 21 312 94 15
www.octo.com
Avenue du Théâtre, 7
1005 Lausanne, Switzerland
Data Science & Machine Learning
Alexandre Masselot
amasselot@octo.com
@alex_mass
Catherine Zwahlen
czwahlen@octo.com
2016 is the Year
of Big Data
@OCTO Switzerland
Big Data Romandie
OCTO PUBLICATIONS
OCTO TECHNOLOGY > THERE IS A BETTER WAY 4
WE ARE CONSUMING DATA SCIENCE EVERY DAY!
Facial recognition
Spam detection
Voice recognition
Movie recommendation
DATA SCIENCE, A DOMAIN DRIVEN BY COMPETITION
To solve your business problems!
[Diagram: Problem, Data, Crowd (Knowledge & Tools), Model for Prediction]
OCTO Folks Work Hard, Play Hard
◉ Caisse de dépôts - European patent grant score
◉ Argus - used-vehicle sale price prediction
◉ SNCF - station ridership prediction in Île-de-France
◉ Imperial College London - Loan Default Prediction
◉ Allstate – purchase prediction challenge
◉ Tradeshift – Text classification
◉ Microsoft - Malware classification
OCTO, there is a better way to learn, recruit and have fun!
Rankings, in slide order: 1st, 2&4, 3rd, 6th, 13th, 2nd, 5th
DATA SCIENCE TONIGHT
1. Why the buzz about data science?
2. Demystifying machine learning
3. Visualization
4. Data science in your business
“Data science is an interdisciplinary field about
processes and systems to extract knowledge
or insights from data”
https://en.wikipedia.org/wiki/Data_science
1 Cray 2 = 1 iPhone 4
AGILE DATA SCIENCE
“Machine learning explores the study and
construction of algorithms that can learn
from and make predictions on data”
https://en.wikipedia.org/wiki/Machine_learning
MACHINE LEARNING
Conditions
1. A pattern exists
2. The problem cannot be described analytically by a mathematical formula
3. Data, data, data
Machine learning algorithms have existed for many years
In general, model performance improves with more data
FLIGHT CHARACTERISTICS
Flight #  Dep Airport  Dep Hour  Dep Week Day  Aircraft Model  …
1         SYD          8:10      1             A330
2         SYD          14:15     2             B777
3         MEL          18:10     1             B777
4         PER          6:50      4             A320
5         SYD          9:50      3             A320
6         PER          12:10     1             A320
7         TZN          14:50     1             B777
8         MEL          14:15     4             A320
9         SYD          8:30      3             A320
10        MEL          16:40     1             A320
11        MEL          9:30      3             B747
12        TZN          9:30      1             A320
13        PER          9:50      3             A320
14        SYD          13:10     1             A320
EVENTS
Flight #  Dep Airport  Dep Hour  Dep Week Day  Aircraft Model  …  Actual Delay (min)
1         SYD          8:10      1             A330               0
2         SYD          14:15     2             B777               3
3         MEL          18:10     1             B777               0
4         PER          6:50      4             A320               17
5         SYD          9:50      3             A320               0
6         PER          12:10     1             A320               23
7         TZN          14:50     1             B777               0
8         MEL          14:15     4             A320               0
9         SYD          8:30      3             A320               0
10        MEL          16:40     1             A320               12
11        MEL          9:30      3             B747               32
12        TZN          9:30      1             A320               20
13        PER          9:50      3             A320               0
14        SYD          13:10     1             A320               9
A flight is labeled “delayed”
if actual delay >= 15min
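The labeling rule can be sketched in a few lines of Python. This is only an illustration: the delays are the 14 values of the "Actual Delay" column, while the names `actual_delay_min`, `DELAY_THRESHOLD_MIN` and `label_delayed` are made up for this example.

```python
# Actual delays in minutes for flights 1 to 14, copied from the EVENTS table.
actual_delay_min = [0, 3, 0, 17, 0, 23, 0, 0, 0, 12, 32, 20, 0, 9]

# Labeling rule from the slide: a flight is "delayed" if delay >= 15 min.
DELAY_THRESHOLD_MIN = 15


def label_delayed(delay_min, threshold=DELAY_THRESHOLD_MIN):
    """Return 1 (delayed) when the delay reaches the threshold, else 0 (on time)."""
    return 1 if delay_min >= threshold else 0


# One class per flight, in flight order.
classes = [label_delayed(d) for d in actual_delay_min]
print(classes)
```

Flights 4, 6, 11 and 12 are the only ones with a delay of at least 15 minutes, so they are the only ones labeled 1.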
LABEL
Flight #  Dep Airport  Dep Hour  Dep Week Day  Aircraft Model  …  Actual Delay (min)  Class
1         SYD          8:10      1             A330               0                   0
2         SYD          14:15     2             B777               3                   0
3         MEL          18:10     1             B777               0                   0
4         PER          6:50      4             A320               17                  1
5         SYD          9:50      3             A320               0                   0
6         PER          12:10     1             A320               23                  1
7         TZN          14:50     1             B777               0                   0
8         MEL          14:15     4             A320               0                   0
9         SYD          8:30      3             A320               0                   0
10        MEL          16:40     1             A320               12                  0
11        MEL          9:30      3             B747               32                  1
12        TZN          9:30      1             A320               20                  1
13        PER          9:50      3             A320               0                   0
14        SYD          13:10     1             A320               9                   0
BUILD A MODEL
X: Flight #, Dep Airport, Dep Hour, Dep Week Day, Aircraft Model. Y: Delay.
Flight #  Dep Airport  Dep Hour  Dep Week Day  Aircraft Model  Delay
1         SYD          8:10      1             A330            0
2         SYD          14:15     2             B777            0
3         MEL          18:10     1             B777            0
4         PER          6:50      4             A320            1
5         SYD          9:50      3             A320            0
6         PER          12:10     1             A320            1
7         TZN          14:50     1             B777            0
8         MEL          14:15     4             A320            0
9         SYD          8:30      3             A320            0
10        MEL          16:40     1             A320            0
…         …            …         …             …               …
11        MEL          9:30      3             B747            1
12        TZN          9:30      1             A320            1
13        PER          9:50      3             A320            0
14        SYD          13:10     1             A320            0
Model parameters: θ1, θ2, θ3, …, θn
LOGISTIC REGRESSION
Classification algorithm
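As a rough illustration of what "learning the θ parameters" can mean for logistic regression, here is a minimal sketch in plain Python. It assumes a single numeric feature (a departure hour rescaled to [0, 1]) and invented toy labels; the data, the learning rate and the function names are hypothetical and are not taken from the talk.

```python
import math


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Score in (0, 1): sigmoid of theta[0] + theta[1]*x[0] + ... (theta[0] is the intercept)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit theta by batch gradient descent on the average log-loss."""
    theta = [0.0] * (len(X[0]) + 1)
    m = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: one feature (departure hour scaled to [0, 1]),
# label 1 = delayed. These values are invented for illustration only.
X = [[0.10], [0.20], [0.30], [0.40], [0.60], [0.70], [0.80], [0.90]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

theta = fit_logistic(X, y)
early_score = predict_proba(theta, [0.15])
late_score = predict_proba(theta, [0.85])
print(early_score, late_score)
```

The output of `predict_proba` is a continuous score between 0 and 1, which is what the later slides threshold with a score cutoff.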
DECISION TREE
Classification algorithm
[Diagram: decision tree with nodes DoW > 5, Month > 5, PAX > 35%, AoD = “SYD”; yes/no branches; leaves + or -]
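A decision tree is a cascade of yes/no questions, which maps directly to nested `if` statements. The sketch below reuses the four split conditions named on the slide, but the way they are nested and the leaf reached by each branch cannot be recovered from the extracted text, so that layout is an assumption. The dictionary keys (`dow`, `month`, `pax_load`, `airport`) and the reading of DoW, PAX and AoD as day of week, passenger load and airport of departure are also assumptions.

```python
def predict_delay(flight):
    """Hand-written decision tree returning "+" or "-".

    The split conditions come from the slide; their nesting and the leaf
    values are assumed for illustration.
    """
    if flight["dow"] > 5:                 # DoW > 5
        if flight["pax_load"] > 0.35:     # PAX > 35%
            return "+"
        return "-"
    if flight["month"] > 5:               # Month > 5
        if flight["airport"] == "SYD":    # AoD = "SYD"
            return "+"
        return "-"
    return "-"


# Hypothetical flights, invented to exercise each branch.
weekend_full = {"dow": 6, "month": 2, "pax_load": 0.80, "airport": "MEL"}
weekday_summer_syd = {"dow": 2, "month": 7, "pax_load": 0.20, "airport": "SYD"}
weekday_winter = {"dow": 2, "month": 3, "pax_load": 0.90, "airport": "SYD"}

print(predict_delay(weekend_full), predict_delay(weekday_summer_syd), predict_delay(weekday_winter))
```

In practice the splits are not written by hand: a training algorithm chooses the questions and thresholds from the labeled data.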
RANDOM FOREST
Classification algorithm
TEST CLASSIFIER
Flight #  Dep Airport  Dep Hour  Dep Week Day  Aircraft Model  …
9         SYD          8:30      3             A320
1 = positive (delayed)
0 = negative (on time)
A PERFECT CLASSIFIER
Flight table as in FLIGHT CHARACTERISTICS (flights 1 to 14)
Class column: 1 1 1 1 0 0 0 0 0 0 0 0 0 0
A MORE REALISTIC CLASSIFIER
Flight table as in FLIGHT CHARACTERISTICS (flights 1 to 14)
Class column: 1 1 0 1 0 1 0 0 0 0 0 0 1 0
Wrongly classified
CONFUSION MATRIX
The summary to optimize
                       Actually delayed     Actually on time
Predicted + (delayed)  3 (True Positive)    2 (False Positive)
Predicted - (on time)  1 (False Negative)   8 (True Negative)
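The four cells can be counted from two columns of 0/1 values. The sketch below uses the class columns of the "perfect" and "more realistic" classifier slides; pairing them position by position is an assumption, although it does give back the counts shown in the matrix. The function name `confusion_counts` is made up for this example.

```python
# Class column of the "perfect classifier" slide, taken here as the truth.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Class column of the "more realistic classifier" slide, taken as predictions.
# Assumption: both columns describe the same flights in the same order.
predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]


def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for binary labels (1 = delayed, 0 = on time)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1
        elif p == 1 and a == 0:
            counts["FP"] += 1
        elif p == 0 and a == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


counts = confusion_counts(actual, predicted)
print(counts)
```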
PERFORMANCE INDICATORS
                       Actually delayed   Actually on time
Predicted + (delayed)  3 (TP)             2 (FP)
Predicted - (on time)  1 (FN)             8 (TN)
True Positive Rate (Sensitivity) = TP / (TP + FN)
False Positive Rate (1 – Specificity) = FP / (FP + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
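Applied to the counts of the confusion matrix (TP = 3, FP = 2, FN = 1, TN = 8), the four indicators work out as in this short sketch. The helper `ratio` is only there to avoid a division by zero and is not part of the slide.

```python
# Counts from the confusion matrix on the slide.
TP, FP, FN, TN = 3, 2, 1, 8


def ratio(numerator, denominator):
    """Plain division, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


true_positive_rate = ratio(TP, TP + FN)   # sensitivity, identical to recall
false_positive_rate = ratio(FP, FP + TN)  # 1 - specificity
precision = ratio(TP, TP + FP)
recall = ratio(TP, TP + FN)

print(true_positive_rate, false_positive_rate, precision, recall)
```

With these counts the true positive rate (and recall) is 3/4, the false positive rate is 2/10 and the precision is 3/5.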
CLASSIFIER
Assigning a continuous score of being delayed
Score column: 0.9 0.8 0.8 0.3 0.2 0.1 0.5 0.4 0.5 0.4 0.3 0.7 0.8 0.5
Score axis: 0 (-) to 1 (+)
PREDICTOR SCORE DISTRIBUTION
[Figure: events count versus score for on-time flights and delayed flights, with a perfect score cutoff marked]
PREDICTOR SCORE DISTRIBUTION
Fixing a score cutoff leads to false positives and false negatives
[Figure: events count versus score; regions marked False Positive and False Negative]
ROC CURVES TO COMPARE CLASSIFIERS
Different score cutoffs lead to different false positive and false negative rates
[Figure: ROC curves; x-axis False Positive Rate (0 to 1), y-axis True Positive Rate (0 to 1)]
ROC AND ROLL
A few comments about ROC curves
▪ ROC curves allow different models to be compared
▪ Area Under the Curve (AUC) is only a projection of the overall performance
▪ Significantly different models can have close ROC curves
▪ Other comparison methods exist (and are intimately related to ROC):
> Precision/Recall
> LIFT
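A ROC curve is obtained by sweeping the score cutoff and recording one (false positive rate, true positive rate) point per cutoff; the AUC is the area under those points. The sketch below uses the 14 scores of the CLASSIFIER slide. The labels are an assumption: they reuse the class column of the "perfect classifier" slide, paired with the scores position by position, which the extracted text does not confirm. The function names are made up for this example.

```python
# Scores from the CLASSIFIER slide.
scores = [0.9, 0.8, 0.8, 0.3, 0.2, 0.1, 0.5, 0.4, 0.5, 0.4, 0.3, 0.7, 0.8, 0.5]
# Assumed labels (1 = delayed), paired with the scores by position.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct cutoff, from strict to lenient."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted delayed
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


points = roc_points(scores, labels)
area = auc(points)
print(points)
print(area)
```

A classifier that ranks every delayed flight above every on-time flight would reach an AUC of 1, while random scores give about 0.5.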
MODELS & DATA
Precision score for the TOP 20%
[Bar chart: precision for traditional models; advanced models; advanced models with more data; advanced models with more data and more features]
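"Precision for the top 20%" can be read as: rank the flights by score, keep the 20% with the highest scores, and measure the share of them that were really delayed. Dividing that share by the overall delay rate gives a lift. The sketch below follows this reading, which is an assumption about how the slide's figures were computed; the scores and labels are invented.

```python
def precision_at_top(scores, labels, fraction=0.2):
    """Precision among the `fraction` of items with the highest scores."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return sum(top_labels) / k


def lift_at_top(scores, labels, fraction=0.2):
    """Precision in the top fraction divided by the overall positive rate."""
    base_rate = sum(labels) / len(labels)
    return precision_at_top(scores, labels, fraction) / base_rate


# Hypothetical scores and labels (1 = delayed), invented for illustration.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

top_precision = precision_at_top(toy_scores, toy_labels)
top_lift = lift_at_top(toy_scores, toy_labels)
print(top_precision, top_lift)
```

In this toy data the two best-scored flights are both delayed, while only 3 flights out of 10 are delayed overall.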
FLIGHT DELAY PREDICTION: RESULTS
10% LIFT score
All reasons for delays
→ Overall improvement by a factor of 3
Focus on air traffic
→ Overall improvement by a factor of 6
Delays caused by passengers
→ No improvement
PREDICT NUMBER OF PASSENGERS ON A PLANE
Optimize catering
Timeline: t0 - 4 hours, t0 - 1 hour, t0
X: ~ 50 explanatory variables. y: final number of passengers (to predict).
Flight Number  Booked  Departure port  …  Departure hour  Final number of passengers
0777           152     PER             …  14              164
1116           201     SYD             …  9               186
0961           92      BNE             …  6               125
0538           189     MEL             …  12              189
1078           136     SYD             …  23              87
RESULTS
Passenger difference  No model  Model
< 5                   55%       69%
< 10                  80%       89%
$1-2M per year
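The percentages in the table can be read as the share of flights whose predicted passenger count is within a given distance of the final count. The sketch below computes that share for the five sample flights of the previous slide, assuming that "No model" means using the booked count as the prediction and that the final counts line up with the flights in order; both are assumptions. Five flights are far too few to reproduce the slide's percentages.

```python
# Sample rows from the slide: booked and final passenger counts.
booked = [152, 201, 92, 189, 136]
final = [164, 186, 125, 189, 87]


def share_within(predicted, actual, tolerance):
    """Share of flights whose absolute passenger difference is below `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tolerance)
    return hits / len(actual)


# Assumed baseline ("No model"): predict the final count with the booked count.
baseline_under_5 = share_within(booked, final, 5)
baseline_under_10 = share_within(booked, final, 10)
print(baseline_under_5, baseline_under_10)
```

A trained model would replace `booked` with its own predictions, and the same function would give the "Model" column.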
UNSTRUCTURED DATA
WHAT ARE THE FEATURES?
X: m images for training × n features
Y: labels (…, 6, …)
Gray levels: 5 4 3 2 1 0
8 × 8 pixel grid:
0 0 0 4 1 0 0 0
0 0 4 4 0 0 0 0
0 0 5 1 0 0 0 0
0 1 5 0 0 0 0 0
0 0 5 1 1 0 0 0
0 1 5 5 5 5 2 0
0 0 4 4 1 4 5 0
0 0 1 5 5 4 2 0
WHAT ARE THE FEATURES?
8 × 8 pixel grid (same 64 gray levels as on the previous slide) = 6
X: m images for training × n features
Y: labels (6, 6, 3, …, 0, 7)
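For an image, the features are simply the pixel values: the 8 × 8 grid of gray levels is flattened into one row of n = 64 numbers, and the digit it shows is the label. The sketch below assumes the 64 extracted values run column by column, which makes the grid look like a handwritten 6; the variable names are made up for this example.

```python
# 8 x 8 gray levels (0 = empty, 5 = darkest), rebuilt from the slide under
# the assumption that the extracted values were listed column by column.
grid = [
    [0, 0, 0, 4, 1, 0, 0, 0],
    [0, 0, 4, 4, 0, 0, 0, 0],
    [0, 0, 5, 1, 0, 0, 0, 0],
    [0, 1, 5, 0, 0, 0, 0, 0],
    [0, 0, 5, 1, 1, 0, 0, 0],
    [0, 1, 5, 5, 5, 5, 2, 0],
    [0, 0, 4, 4, 1, 4, 5, 0],
    [0, 0, 1, 5, 5, 4, 2, 0],
]
label = 6  # the digit shown by this image, as stated on the slide

# One row of X: all pixels of the image, read row by row.
features = [pixel for row in grid for pixel in row]

# With m training images, X has m such rows and Y has m labels.
X = [features]
Y = [label]
print(len(features), Y)
```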
NEURAL NETWORK
CAN COMPUTER VISION SPOT DISTRACTED DRIVERS?
→ 24 June 2016 – Julien Krywyk
Classes: Phone right, Safe, Text right, Phone left, Text left, Speaking, Makeup, Behind, Drink, Radio
Build classifier: train on 22K images (X, Y)
Make predictions: test on 80K images (X, predicted classes = ?)
DEEP LEARNING
1. Identify pixels
2. Identify edges and simple shapes
3. Identify complex shapes and objects
4. Identify which shapes to use to define a human face
DEEP LEARNING
Transfer learning
Feature extraction with a pre-trained CNN: n features (X), labels (Y)
VISUALIZATION
Understand
Communicate results & analysis
1880: TEXTILE PRODUCTION IN ENGLAND (OTTO NEURATH, ~1920)
Changing the world by educating people about the world around them
NAPOLEON 1812 CAMPAIGN (CHARLES MINARD, 1869)
HOW TRUMP PUSHED THE ELECTION MAP TO THE RIGHT (NEW YORK TIMES)
VISUALIZATION TO GET ACQUAINTED WITH DATA
EXPLORATION: FLIGHT DELAY PER MONTH AND DAY OF WEEK
DATA VISUALIZATION
Correlation between ‘Departure Hour’ and passenger delta
NOTEBOOKS
Interactive data analysis
VISUALIZATION AS A GAME CHANGER
VALIDATION
https://github.com/genentech/fishtones-js
I WANT A DATA SCIENTIST!
Agile Data science
Feature Team: Operations, Business analyst, Developer, Tech expert, Project manager, Data scientist, Architect
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more
BUILDING A DATALAB
Source systems: existing infrastructure (multiple sources), external sources
Collection, storage and data preparation: ETL (extract, cleanup, transform, load), staging area, technical datamart (collection zone), data warehouse technical layer (referential/operations), datamarts (management, marketing, sales), Datalab
Analysis delivery: user access (reporting, analytics)
Batch
• Analyses
• Indicators
• Statistics
Online
• Dashboards
• Reporting
• Requests
Administration
• Admini
• Validation
DEVOPS – EMBRACING NEW KNOW HOW
And new collaborations…
Data Scientist
• Innovates
• With new technologies
“What!? A unit test on my neural network???”
OPS
• Looks after rationalization
“What!? Your piece of Scala calls a Python library embedding C???”
DEMOCRATIZATION
→ courses
1 million enrollments
BUSINESS & DATA SCIENCE
Business must be aware of opportunities to use algorithms
Data must be easily accessible
Focus on the lowest possible time to market
USE CASE CLASSES AND THEIR BUSINESS VALUE
Axis: business value
The prediction is a support for decision
Analyses support data understanding
The prediction is the decision
???

Afterwork Big Data - Data Science & Machine Learning: explore, understand and predict
