The objective of my final project at Metis is to categorize drivers based on their behaviour on the road: their driving style and the types of roads they take.
The challenge is to uniquely identify a driver (and hence his or her own “driving behaviour”) from the GPS log of a mobile phone located inside the car.
My idea is to experiment with topic modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), and to explain the observed trips by the unobserved behaviour of drivers.
2. I am an Automotive Management Professional and a Computer
Science Engineer from France, with extensive experience in managing
complex projects in Supply Chain and IT, as well as in starting, developing
and acquiring businesses in France, Russia, the USA and the Middle East.
I came to Metis to understand, learn and practice how data science is
transforming the Automotive Business. During my projects, I focused on:
● Sentiment Analysis / Topic Modeling
● Predictive Behavior Modeling
● Driver Telematics
Philippe Dagher
3. Final Project @ Metis
Objective:
Categorize drivers based on their behaviour on the road: their driving style
and the types of roads they take.
Challenge:
Uniquely identify a driver (and hence his or her own “driving behaviour”) from
the GPS log of a mobile phone located inside the car.
Idea:
Experiment with topic modeling techniques, especially Latent Semantic
Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), to explain the
observed trips by the unobserved behaviour of drivers.
5. Machine learning approach (1/2)
❖ Preprocess the data using statistical smoothing and compression algorithms
➢ Kalman Filtering
➢ Ramer–Douglas–Peucker
❖ Extract road and driving style features
➢ per Segment: Length, Slip Angle, Convexity, Radius
➢ per Meter: Speed, Accelerations (tangential and normal), Jerk, Yaw, Pauses
❖ Bin the output and generate the Driving Alphabet
➢ ex: d0, d1, d2… v0, v1, v2… a0, a1, a2… etc
❖ Build the Driving Vocabulary - “Driving Slides” per meter
➢ ex: d3L4v2n3y1
➢ for various preprocessing sensitivities or feature combinations (languages)
❖ Translate trips from GPS log into documents
➢ Tokenize, filter, … data is ready!
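The binning and vocabulary steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the bin edges, the three features retained (`v` for speed, `a` for tangential acceleration, `y` for yaw) and the helper names are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges per feature; the slides do not give the real thresholds.
BIN_EDGES = {
    "v": [5.0, 15.0, 25.0],  # speed
    "a": [-1.0, 0.0, 1.0],   # tangential acceleration
    "y": [-0.1, 0.1],        # yaw
}

def to_letter(prefix, value):
    """Map a numeric value to a letter of the alphabet, e.g. 'v2', by locating its bin."""
    return f"{prefix}{bisect_right(BIN_EDGES[prefix], value)}"

def to_word(sample):
    """Concatenate the letters of one per-meter sample into a single word."""
    return "".join(to_letter(prefix, sample[prefix]) for prefix in BIN_EDGES)

def trip_to_document(samples):
    """Translate a trip (a list of per-meter feature dicts) into a list of tokens."""
    return [to_word(sample) for sample in samples]

# Three made-up per-meter samples standing in for one preprocessed trip.
trip = [
    {"v": 3.0, "a": 0.5, "y": 0.0},
    {"v": 12.0, "a": 1.5, "y": 0.2},
    {"v": 30.0, "a": -2.0, "y": -0.3},
]
document = trip_to_document(trip)
print(document)  # ['v0a2y1', 'v1a3y2', 'v3a0y0']
```

Changing the bin edges or the set of features yields a different "language" for the same trip, which is one way to read the "preprocessing sensitivities" mentioned above.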
7. LDA: Bayesian Topic Model
❖ Dirichlet parameter
➢ Corpus: possible “Driving Behaviour” distributions for trips
❖ Per-trip “Driving Behaviour” proportions
➢ for each trip, select a distribution of “Driving Behaviours”
❖ Per-“Driving Slide” “Driving Behaviour” assignment
➢ for each “Driving Slide”, select a “Driving Behaviour”
❖ Observed “Driving Slide”
➢ select the actual “Driving Slide” from the selected “Driving Behaviour”
❖ “Driving Behaviours”
➢ each “Driving Behaviour” is a distribution of “Driving Slides”
❖ “Driving Behaviour” hyperparameter
➢ possible “Driving Slide” distributions for “Driving Behaviours”
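The generative story described on this slide can be simulated in a few lines. The sketch below assumes NumPy; the vocabulary, the number of behaviours and the values of `alpha` and `eta` are arbitrary choices for illustration and do not come from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary of "Driving Slides" and arbitrary hyperparameters.
vocabulary = ["v0a1y1", "v1a2y1", "v2a2y0", "v3a0y2"]
n_behaviours = 2  # number of "Driving Behaviours" (topics)
alpha = 0.5       # Dirichlet parameter for the per-trip proportions
eta = 0.5         # hyperparameter for the per-behaviour slide distributions

# Each "Driving Behaviour" is a distribution over "Driving Slides".
behaviours = rng.dirichlet([eta] * len(vocabulary), size=n_behaviours)

def generate_trip(n_slides):
    """Sample one trip by following the LDA generative process."""
    # Per-trip "Driving Behaviour" proportions.
    proportions = rng.dirichlet([alpha] * n_behaviours)
    # One "Driving Behaviour" assignment per "Driving Slide".
    assignments = rng.choice(n_behaviours, size=n_slides, p=proportions)
    # The observed slide is drawn from the assigned behaviour's distribution.
    slides = [vocabulary[rng.choice(len(vocabulary), p=behaviours[k])]
              for k in assignments]
    return proportions, assignments, slides

proportions, assignments, slides = generate_trip(20)
print(proportions)
print(slides)
```

Only `slides` would be observed in practice; the proportions, the assignments and the behaviours themselves are the hidden quantities that inference has to recover.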
8. Posterior Inference in LDA
❖ The goal is to obtain the posterior over:
➢ how much of “Driving Behaviour” k each trip contains, and
➢ the “Driving Behaviour” assignments z of the “Driving Slides”
❖ Which means that I need to calculate:
❖ GENSIM Library
➢ a Python+NumPy implementation of online LDA for inputs larger than the available RAM
10. Validation on a Kaggle Competition
❖ 2736 drivers
❖ 200 trips/driver
Total: 547,200 CSV files (5.92 GB)
Challenge:
To come up with a "telematic fingerprint" capable of determining whether a trip
was driven by a given driver, knowing that among the 200 trips provided for
each driver, a small number were not driven by him/her.
Submissions are judged on the area under the ROC curve, calculated globally (all predictions
together).
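Because the area under the ROC curve is computed over all predictions together, scores need to be comparable across drivers, not just well ordered within one driver. A small sketch with made-up labels and scores (1 = the trip belongs to the driver) shows the effect; the `roc_auc` helper is written for this illustration and is not the competition's scoring code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count for one half.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up predictions for two drivers, each perfectly ranked on its own.
labels_a, scores_a = [1, 1, 0], [0.9, 0.8, 0.7]
labels_b, scores_b = [1, 1, 0], [0.4, 0.3, 0.2]

auc_a = roc_auc(labels_a, scores_a)
auc_b = roc_auc(labels_b, scores_b)
auc_global = roc_auc(labels_a + labels_b, scores_a + scores_b)

print(auc_a, auc_b)  # 1.0 1.0
print(auc_global)    # 0.75
```

Each driver is ranked perfectly in isolation, yet the pooled score drops to 0.75 because driver A's false trip (0.7) outranks driver B's true trips (0.4 and 0.3).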
11. Machine learning approach (2/2)
❖ Project all trips into the new Driving Behaviours Space
❖ Take each trip from a selected Driver, one by one
❖ Build a prediction model trained on all other trips in the dataset, labeled:
➢ True if they belong to the selected Driver
➢ False if they do not belong to this Driver
❖ Use the trained model to predict whether the selected Trip belongs to the Driver, then ensemble
several predictions using various sensitivities to improve the score...
For performance reasons, I will proceed in batches of 10 or 20 selected trips and compare each
batch to a limited number of randomly selected False trips
Other outlier detection / clustering techniques appear to perform less well
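The batched procedure described on this slide can be sketched as follows. The slides do not name the classifier, so a plain logistic regression fitted by gradient descent stands in for it here. The two-dimensional "Driving Behaviour" vectors, the batch size, the number of False trips and all function names are assumptions made for the example.

```python
import numpy as np

def _add_bias(X):
    """Append a constant column so the model learns an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Fit a logistic regression by gradient descent (stand-in classifier)."""
    Xb = _add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Probability that each row of X belongs to the selected driver."""
    return 1.0 / (1.0 + np.exp(-_add_bias(X) @ w))

def score_driver(driver_trips, other_trips, batch_size=10, n_false=50, seed=0):
    """Score every trip of one driver with a model that never saw that trip."""
    rng = np.random.default_rng(seed)
    n_trips = len(driver_trips)
    scores = np.empty(n_trips)
    for start in range(0, n_trips, batch_size):
        held_out = np.arange(start, min(start + batch_size, n_trips))
        kept = np.setdiff1d(np.arange(n_trips), held_out)
        # Random, limited sample of trips from other drivers, labeled False.
        false_idx = rng.choice(len(other_trips),
                               size=min(n_false, len(other_trips)),
                               replace=False)
        X = np.vstack([driver_trips[kept], other_trips[false_idx]])
        y = np.concatenate([np.ones(len(kept)), np.zeros(len(false_idx))])
        w = fit_logistic(X, y)
        scores[held_out] = predict_proba(w, driver_trips[held_out])
    return scores

# Synthetic data: 36 genuine trips plus 4 planted trips from another driver.
rng = np.random.default_rng(1)
genuine = rng.normal([0.8, 0.2], 0.05, size=(36, 2))
planted = rng.normal([0.2, 0.8], 0.05, size=(4, 2))
driver_trips = np.vstack([genuine, planted])
other_trips = rng.normal([0.2, 0.8], 0.05, size=(200, 2))

scores = score_driver(driver_trips, other_trips)
print(scores[:36].mean(), scores[36:].mean())
```

On this synthetic data the planted trips receive lower scores than the genuine ones, even though some of them are labeled True in the training set of other batches.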
12. Setting the infrastructure
MongoDB to hold the 3.3 million documents generated
Parallel processing setup on 4 DigitalOcean Droplets with 8 CPUs each
Gensim library, which implements three methods:
❖ Latent Semantic Indexing (LSI, or LSA, A for Analysis)
❖ Latent Dirichlet Allocation (LDA)
❖ Random Projections (RP)
It also implements online versions of each technique.
13. Predicting
❖ Achieved an AUC of 0.9 on Kaggle without any ensembling technique,
which confirms the robustness of my approach...