For our third Afterwork on the theme of “Big Data”, we offer an introduction to the practices and benefits of Data Science. While the previous sessions showed how to store and process large volumes of data at low cost, this one addresses a new aspect: how to uncover the treasures of information hidden in your data.
We will present the main principles of Machine Learning and the power of visualization. Drawing on feedback from OCTO projects, we will survey the methods and tools available.
By the end of this presentation, you will have discovered pragmatic approaches for exploring and understanding your data. And perhaps even for predicting your future…
9. DATA SCIENCE, A DOMAIN DRIVEN BY COMPETITION
To solve your business problems!
[Diagram: Problem, Data, Crowd (Knowledge & Tools), Model for Prediction]
10. OCTO Folks Work Hard, Play Hard
◉ Caisse de dépôts - score for the granting of a European patent
◉ Argus - prediction of the sale price of used vehicles
◉ SNCF - prediction of station ridership in Ile de France
◉ Imperial College London - Loan Default Prediction
◉ Allstate – purchase prediction challenge
◉ Tradeshift – Text classification
◉ Microsoft - Malware classification
Rankings shown on the slide: 1st, 2 & 4, 3rd, 6th, 13th, 2nd, 5th
OCTO, there is a better way to learn, recruit and have fun!
11. DATA SCIENCE TONIGHT
OCTO TECHNOLOGY > THERE IS A BETTER WAY 11
1) Why the buzz about data science?
2) Demystifying machine learning
3) Visualization
4) Data science in your business
12.
“Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data”
https://en.wikipedia.org/wiki/Data_science
17. DATA SCIENCE TONIGHT
Part 2: Demystifying machine learning
18.
“Machine learning explores the study and construction of algorithms that can learn from and make predictions on data”
https://en.wikipedia.org/wiki/Machine_learning
19. MACHINE LEARNING
Conditions
1) A pattern exists
2) The problem cannot be described analytically by a mathematical formula
3) Data, data, data
Machine learning algorithms have existed for many years
In general, model performance improves with more data
29. TEST CLASSIFIER
Flight #   Dep Airport   Dep Hour   Dep Week Day   Aircraft Model   …
9          SYD           8:30       3              A320
Classes: 1 positive (delayed), 0 negative (on time)
30. A PERFECT CLASSIFIER
Flight #   Dep Airport   Dep Hour   Dep Week Day   Aircraft Model   Class
1          SYD           8:10       1              A330             1
2          SYD           14:15      2              B777             1
3          MEL           18:10      1              B777             1
4          PER           6:50       4              A320             1
5          SYD           9:50       3              A320             0
6          PER           12:10      1              A320             0
7          TZN           14:50      1              B777             0
8          MEL           14:15      4              A320             0
9          SYD           8:30       3              A320             0
10         MEL           16:40      1              A320             0
11         MEL           9:30       3              B747             0
12         TZN           9:30       1              A320             0
13         PER           9:50       3              A320             0
14         SYD           13:10      1              A320             0
31. A MORE REALISTIC CLASSIFIER
Flight #   Dep Airport   Dep Hour   Dep Week Day   Aircraft Model   Predicted class
1          SYD           8:10       1              A330             1
2          SYD           14:15      2              B777             1
3          MEL           18:10      1              B777             0
4          PER           6:50       4              A320             1
5          SYD           9:50       3              A320             0
6          PER           12:10      1              A320             1
7          TZN           14:50      1              B777             0
8          MEL           14:15      4              A320             0
9          SYD           8:30       3              A320             0
10         MEL           16:40      1              A320             0
11         MEL           9:30       3              B747             0
12         TZN           9:30       1              A320             0
13         PER           9:50       3              A320             1
14         SYD           13:10      1              A320             0
Wrongly classified: flights 3, 6 and 13 (their class differs from the one on slide 30)
32. CONFUSION MATRIX
The summary to optimize
                        Actually delayed      Actually on time
Predicted + (delayed)   3 (True Positive)     2 (False Positive)
Predicted - (on time)   1 (False Negative)    8 (True Negative)
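The four counts of this matrix can be reproduced from the 0/1 columns of slides 30 and 31. The following is a minimal sketch, assuming that the slide 30 column holds the actual outcomes and the slide 31 column the predictions, both read in flight order; the function name `confusion_matrix` is illustrative and not taken from the talk.

```python
# Labels transcribed from the slides: 1 = delayed, 0 = on time.
# Assumption: slide 30 gives the actual outcomes, slide 31 the predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]


def confusion_matrix(actual, predicted):
    """Count TP, FP, FN and TN for binary labels where 1 is the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1
        elif p == 1 and a == 0:
            counts["FP"] += 1
        elif p == 0 and a == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


matrix = confusion_matrix(actual, predicted)
print(matrix)  # {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 8}
```

Under that reading, the result matches the 3 / 2 / 1 / 8 cells shown on the slide.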
33. PERFORMANCE INDICATORS
                        Actually delayed   Actually on time
Predicted + (delayed)   3 (TP)             2 (FP)
Predicted - (on time)   1 (FN)             8 (TN)
True Positive Rate (Sensitivity) = TP / (TP + FN)
False Positive Rate (1 – Specificity) = FP / (FP + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
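As a worked example, the indicators can be evaluated on the counts of the confusion matrix (TP = 3, FP = 2, FN = 1, TN = 8). This is a small sketch; the function name `indicators` and the dictionary keys are illustrative.

```python
def indicators(tp, fp, fn, tn):
    """Compute the four ratios of the slide from confusion-matrix counts."""
    return {
        "true_positive_rate": tp / (tp + fn),   # sensitivity
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),               # same ratio as the true positive rate
    }


result = indicators(tp=3, fp=2, fn=1, tn=8)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
# true_positive_rate: 0.75
# false_positive_rate: 0.20
# precision: 0.60
# recall: 0.75
```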
35. PREDICTOR SCORE DISTRIBUTION
[Figure: events count against score for delayed flights and on-time flights, with a perfect score cutoff between the two groups]
36. PREDICTOR SCORE DISTRIBUTION
Fixing a score cutoff leads to false positives and false negatives
[Figure: events count against score, with the false positive and false negative regions around the cutoff]
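The effect of the cutoff can be sketched in a few lines. The scores and outcomes below are hypothetical (they do not come from the talk), and the convention "score at or above the cutoff means predicted delayed" is an assumption.

```python
# Hypothetical (score, actually_delayed) pairs, for illustration only.
flights = [
    (0.95, 1), (0.80, 1), (0.65, 0), (0.60, 1),
    (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0),
]


def errors_at_cutoff(flights, cutoff):
    """Return (false positives, false negatives) when score >= cutoff predicts a delay."""
    false_positives = sum(1 for score, delayed in flights if score >= cutoff and delayed == 0)
    false_negatives = sum(1 for score, delayed in flights if score < cutoff and delayed == 1)
    return false_positives, false_negatives


for cutoff in (0.3, 0.5, 0.7):
    fp, fn = errors_at_cutoff(flights, cutoff)
    print(f"cutoff={cutoff}: {fp} false positive(s), {fn} false negative(s)")
# cutoff=0.3: 2 false positive(s), 0 false negative(s)
# cutoff=0.5: 1 false positive(s), 1 false negative(s)
# cutoff=0.7: 0 false positive(s), 2 false negative(s)
```

Because the two score distributions overlap in this toy data, no cutoff removes both kinds of error: lowering it trades false negatives for false positives.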
37. ROC CURVES TO COMPARE CLASSIFIERS
Different score cutoffs lead to different false positive and false negative rates
[Figure: ROC curve, True Positive Rate (0 to 1) against False Positive Rate (0 to 1)]
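A ROC curve is obtained by sweeping the cutoff over every observed score and recording one (False Positive Rate, True Positive Rate) point per cutoff. The sketch below uses hypothetical scores and labels, and the helper name `roc_points` is illustrative.

```python
# Hypothetical scores and outcomes (1 = delayed), for illustration only.
scores = [0.95, 0.80, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def roc_points(scores, labels):
    """Return (FPR, TPR) points, from the strictest cutoff to the loosest one."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0) and ends at (1, 1); a classifier whose curve bends closer to the top-left corner separates the two classes better.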
38. ROC AND ROLL
A few comments about ROC curves
ROC curves allow different models to be compared
Area Under the Curve (AUC) is only a projection of the overall performance
Significantly different models can have close ROC curves
Other comparison methods exist (and are intimately related to ROC):
> Precision/Recall
> LIFT
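To illustrate why the AUC is only a summary, the sketch below computes it with the pairwise formulation (the share of positive/negative pairs in which the positive example gets the higher score) for two hypothetical models. The data and the name `auc` are assumptions made for the example.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = delayed) and the scores of two different models.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
model_a = [0.95, 0.80, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
model_b = [0.90, 0.80, 0.60, 0.70, 0.50, 0.15, 0.20, 0.10]

auc_a = auc(model_a, labels)
auc_b = auc(model_b, labels)
print(auc_a, auc_b)  # 0.8125 0.8125
```

Both models reach the same AUC although they rank the flights differently: model B is better on its highest scores but misses one delayed flight almost entirely, which the single number does not show.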
39. MODELS & DATA
Precision score for the TOP 20%
[Figure: precision for four configurations: traditional models; advanced models; advanced models with more data; advanced models with more data and more features]
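One plausible reading of "precision score for the TOP 20%" is the precision measured on the 20% of examples with the highest scores. The sketch below follows that assumption with hypothetical data; the function name `precision_at_top` is illustrative.

```python
def precision_at_top(scores, labels, fraction=0.2):
    """Precision among the `fraction` of examples with the highest scores."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return sum(top_labels) / k


# Hypothetical scores and outcomes (1 = positive), for illustration only.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

top20 = precision_at_top(scores, labels, fraction=0.2)
print(top20)  # 0.5
```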
42. FLIGHT DELAY PREDICTION: RESULTS
10% LIFT score
◉ All reasons for delays: overall improvement by a factor of 3
◉ Focus on air traffic: overall improvement by a factor of 6
◉ Delays caused by passengers: no improvement
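The slide does not define its LIFT score. A common definition, assumed here, is the precision among the top-scored 10% of flights divided by the overall rate of delayed flights, so that a lift of 3 means three times more delays in the selection than in a random sample. The data and the name `lift_at_top` are hypothetical.

```python
def lift_at_top(scores, labels, fraction=0.1):
    """Precision in the top `fraction` of scores divided by the overall positive rate."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    precision_top = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return precision_top / base_rate


# Hypothetical data: 20 flights with scores 1.00, 0.95, ..., 0.05; 4 of them delayed.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

lift = lift_at_top(scores, labels, fraction=0.1)
print(round(lift, 2))  # 2.5
```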
43. PREDICT NUMBER OF PASSENGERS ON A PLANE
Optimize catering
Timeline: t0 - 4 hours, t0 - 1 hour, t0
X: ~ 50 explanatory variables; y: final number of passengers (?)
Flight Number   Booked   Departure port   …   Departure hour   Final number of passengers
0777            152      PER              …   14               164
1116            201      SYD              …   9                186
0961            92       BNE              …   6                125
0538            189      MEL              …   12               189
1078            136      SYD              …   23               87
44. RESULTS
Passenger difference   No model   Model
< 5                    55%        69%
< 10                   80%        89%
$1-2M per year
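The percentages in this table read as the share of flights whose predicted passenger count falls within a given difference of the final count. The sketch below computes that share for the five sample flights of slide 43, assuming that "no model" means using the booked count as the prediction and that booked and final counts pair up in row order; both are assumptions, and `share_within` is an illustrative name.

```python
# Sample values transcribed from slide 43 (assumed to pair up row by row).
booked = [152, 201, 92, 189, 136]
final = [164, 186, 125, 189, 87]


def share_within(predicted, actual, tolerance):
    """Fraction of flights whose absolute prediction error is below `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tolerance)
    return hits / len(actual)


# Assumed "no model" baseline: predict the final count with the booked count.
within_5 = share_within(booked, final, tolerance=5)
within_10 = share_within(booked, final, tolerance=10)
print(within_5, within_10)  # 0.2 0.2
```

Five flights are far too few to reproduce the 55% and 80% of the table; the point is only to show how such a share is computed.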
52. CAN COMPUTER VISION SPOT DISTRACTED DRIVERS?
24 June 2016 – Julien Krywyk
Classes: Phone right, Safe, Text right, Phone left, Text left, Speaking, Makeup, Behind, Drink, Radio
53.
Build classifier: Train, 22K images (X) with known classes (Y)
Make predictions: Test, 80K images (X), predicted classes (Y) = ?
54. DEEP LEARNING
Identify pixels
Identify edges and simple shapes
Identify complex shapes and objects
Identify which shapes to use to define a human face
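The "identify edges" step can be pictured as sliding a small filter over the pixels. The sketch below applies a hand-written vertical-edge filter to a tiny hypothetical image; in a deep network such filters are learned from the data rather than written by hand, so this only illustrates the operation, not the model used in the talk.

```python
# Hypothetical 4x6 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-written filter that responds to a dark-to-bright transition from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def convolve(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation, as in convolutional layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


edges = convolve(image, kernel)
print(edges)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The output is non-zero only where the filter window straddles the boundary between the dark and bright areas, which is what "detecting an edge" means here.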
56. DATA SCIENCE TONIGHT
Part 3: Visualization
58. 1880: TEXTILE PRODUCTION IN ENGLAND (OTTO NEURATH, ~1920)
Changing the world by educating people about the world around them
69. DATA SCIENCE TONIGHT
Part 4: Data science in your business
70. I WANT A DATA SCIENTIST!
73. Agile Data science
Feature Team: Operations, Business analyst, Developer / tech expert, Project Manager, Data scientist, Architect
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more
75. BUILDING A DATALAB
Stages: Source system | Collection, storage and data preparation | Analysis delivery
Sources
• External sources
• Existing infrastructure (multiple sources)
Datalab
ETL: extract, cleanup, transform, load
Staging area
Datawarehouse technical layer (referential/Operation)
Technical datamart (collection zone)
Datamart (management, marketing, sales)
User access (Reporting, Analytics)
Batch
• Analyses
• Indicators
• Statistics
Online
• Dashboards
• Reporting
• Requests
Administration
• Admini
• Validation
76. DEVOPS – EMBRACING NEW KNOW HOW
And new collaborations…
Data Scientist
• Innovates
• With new technologies
“What!? A unit test on my neural network???”
OPS
• Looks after rationalization
“What!? Your piece of Scala calls a Python library embedding C???”
80. BUSINESS & DATA SCIENCE
◉ Business must be aware of opportunities to use algorithms
◉ Data must be easily accessible
◉ Focus on the lowest possible time to market
81. USE CASE CLASSES AND THEIR BUSINESS VALUE
[Figure: use case classes positioned by business value]
◉ Analyses support data understanding
◉ The prediction is a support for decision
◉ The prediction is the decision