Driver Location Intelligence at Scale using Apache Spark, Delta Lake, and MLflow on Databricks


Sergio Ballesteros, TomTom
Kia Eisinga, TomTom
#UnifiedDataAnalytics #SparkAISummit

Our vision
A safe, connected, autonomous world that is free of
congestion and emissions.

Big data drives our business, but data privacy always comes first

Data
• Anonymous location (GPS) traces

742,000,000 km every day
18,000 x

Data
• Anonymous location (GPS) traces
• Community inputs
• User events
• Journalistic data
• Car sensor data

Data flow
~150 trillion data points
~80 billion data points per day

In-dash systems are outperformed by smartphones
The embedded system is expected to be up-to-date, with no user interaction. And the most visible component of it is a map.
Use case 1: IQ Maps analytics

Drivers do not update their maps
Today’s solutions provide manual updates, often with a necessity to drive to the dealer.
This is way too complex and inefficient.

OEMs require data-efficient solutions
While drivers expect an up-to-date system, the carmakers are usually concerned about the data cost required for map management.

98% of trips are driven within a 150 km radius
99.8% of trips are driven within a 1000 km radius
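A share like the ones above can be estimated from trip data with a great-circle distance check. The sketch below is only an illustration: it assumes each trip is reduced to a destination coordinate and that the radius is measured from a single home location, neither of which the slides specify. The coordinates and the names `haversine_km` and `share_within_radius` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within_radius(home, destinations, radius_km):
    """Fraction of destinations that lie within radius_km of home."""
    if not destinations:
        return 0.0
    inside = sum(
        1 for lat, lon in destinations
        if haversine_km(home[0], home[1], lat, lon) <= radius_km
    )
    return inside / len(destinations)


# Hypothetical example data, not taken from the talk.
home = (52.3676, 4.9041)       # Amsterdam
destinations = [
    (52.3702, 4.8952),         # under 1 km away
    (52.0907, 5.1214),         # Utrecht, roughly 35 km
    (51.9244, 4.4777),         # Rotterdam, roughly 57 km
    (48.8566, 2.3522),         # Paris, roughly 430 km
]

for radius in (10, 150, 1000):
    print(radius, share_within_radius(home, destinations, radius))
```

Sweeping the radius this way is one simple route to a curve of trip coverage against radius, which is the kind of evidence the radius discussion relies on.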

When radius is 0 km
• User drives within 2 regions every weekday
• Radius of 0 km.
• Download and install just home regions
• Cellular data usage kept to a minimum

When radius is 150 km
• User drives within 2 update regions every weekday
• Radius of 150 km.
• Home region: 6 update regions.
• Cellular data usage increased

Real results using 0.5M trips
“This insight has led me to the conclusion that a default radius of 150 km is unnecessary, and a small radius of ~10 km would already satisfy most drivers while keeping cellular data usage low for OEMs.”
- Rolf Dorland, PM at TomTom

Going on holidays
• User goes on holiday (to a less frequently updated region)
• Once the user starts driving, updates for all update regions the route goes through are downloaded and installed.

Opportunity
Past: Rule-based solution
Delta Lake pipelines
Present: Machine Learning

Data
• Original trace data from 1 source: 227K device serials
• Filtering out invalid trips: 143K device serials
• Users with at least 50 trips: 3.6K device serials
• Devices feasible for modelling: 2.5K device serials
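The middle steps of this funnel (drop invalid trips, then keep devices with at least 50 trips) can be sketched in a few lines. This is a plain-Python illustration rather than the Spark pipeline used in the talk; the record layout (`device`, `valid`) and the function name are assumptions, and what makes a trip "invalid" or a device "feasible for modelling" is not stated in the slides.

```python
from collections import Counter

MIN_TRIPS = 50  # threshold quoted on the slide


def devices_with_enough_trips(trips, min_trips=MIN_TRIPS):
    """Return the device serials that have at least min_trips valid trips."""
    # Count only trips flagged as valid (the validity rule itself is assumed).
    counts = Counter(t["device"] for t in trips if t["valid"])
    return {device for device, n in counts.items() if n >= min_trips}


# Hypothetical trip records: only a device serial and a validity flag.
trips = (
    [{"device": "A", "valid": True}] * 60
    + [{"device": "B", "valid": True}] * 10
    + [{"device": "C", "valid": True}] * 45
    + [{"device": "C", "valid": False}] * 20
)

print(devices_with_enough_trips(trips))  # device C has 65 trips, but only 45 valid ones
```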

Features
For each trip, we have the following information:
• Where did the trip start?
• At what speed were you driving when the trip started?
• What was the time of day (morning/afternoon/evening) when the trip started?
• Was it rush hour when the trip started?
• What day of the week was it?
• Was it a weekend day?
• What was the season?
• Which driver profile do you belong to?
Historical information:
• Which destination did you go to on your last trip? And the one before that? And the one before that?
• If it is, let's say, a Monday, where did you go the last Monday you made a trip? (Do this for every weekday.)
To predict: To which destination are you going?
What do we use in the end?
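Several of the features listed above can be derived from the trip start time and the driver's destination history alone. The sketch below shows one possible encoding. The hour boundaries, the rush-hour windows, the season definition, and all function and key names are assumptions for illustration; the talk does not give them. Start location, start speed, and driver profile are left out because they would come straight from the data.

```python
from datetime import datetime


def time_of_day(hour):
    """Three buckets as on the slide; the hour boundaries are assumed."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"


def season(month):
    """Northern-hemisphere meteorological seasons (an assumption)."""
    names = {12: "winter", 1: "winter", 2: "winter",
             3: "spring", 4: "spring", 5: "spring",
             6: "summer", 7: "summer", 8: "summer"}
    return names.get(month, "autumn")


def is_rush_hour(start):
    """Weekday 07:00-09:00 and 16:00-19:00 (assumed windows)."""
    on_weekday = start.weekday() < 5
    return on_weekday and (7 <= start.hour < 9 or 16 <= start.hour < 19)


def trip_features(start, previous_destinations, n_lags=3):
    """Build a feature dict for one trip.

    previous_destinations is the driver's destination history, oldest first.
    """
    features = {
        "time_of_day": time_of_day(start.hour),
        "rush_hour": is_rush_hour(start),
        "day_of_week": start.weekday(),      # Monday = 0
        "weekend": start.weekday() >= 5,
        "season": season(start.month),
    }
    # Lag features: most recent destination first, None when history is short.
    recent = previous_destinations[-n_lags:][::-1]
    for i in range(n_lags):
        features[f"dest_lag_{i + 1}"] = recent[i] if i < len(recent) else None
    return features


# Hypothetical trip: a Monday morning with four earlier destinations.
example = trip_features(datetime(2019, 10, 14, 8, 15), ["home", "work", "gym", "home"])
print(example)
```

The per-weekday history feature ("where did you go the last Monday?") could be built the same way by filtering the history to trips on the same weekday before taking the most recent entry.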

Labels
• We are given the latitude and longitude of the destination of a trip.
• In order to find out which latitudes and longitudes belong to the same destination, we apply a clustering algorithm called DBSCAN.
• DBSCAN clusters together destinations that are within 500 meters of each other. We should have at least 5 trips to a destination in order to call it a cluster.
How do we define where you are going?
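The labelling step can be illustrated with a small, dependency-free DBSCAN. This is a minimal sketch, not the implementation used in the talk (which is not shown); a library implementation such as scikit-learn's would normally be preferred. It reads the 500 meters as the DBSCAN neighbourhood radius and the 5 trips as the minimum number of points (counting the point itself) for a core point. Note that under DBSCAN two points of one cluster can be farther apart than 500 m if they are linked through intermediate points. The coordinates and function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m=500.0, min_samples=5):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force O(n) region query; includes the point itself.
        return [j for j, q in enumerate(points) if haversine_m(points[i], q) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical destinations: two groups of 5 nearby points and one isolated point.
points = (
    [(52.3676 + 0.0005 * k, 4.9041) for k in range(5)]
    + [(52.0907, 5.1214 + 0.0005 * k) for k in range(5)]
    + [(48.8566, 2.3522)]
)
labels = dbscan(points)
print(labels)
```

Trips whose destination is labelled -1 do not belong to any recurring destination under these settings, so they would have no class label for the prediction task.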

Train, validation and test split
Data for 1 driver:
Trip ID    Date           Destination
Trip 1     January 1      Cluster 1
Trip 3     February 3     Cluster 1
Trip 5     March 2        Cluster 1
Trip 8     April 4        Cluster 1
Trip 9     April 16       Cluster 2
Trip 10    May 8          Cluster 1
Train & validation dataset
Test dataset
TIME-SERIES CROSS-VALIDATION
Iterative evaluation of the trips to avoid overfitting
Trip ID    Date           Destination
Trip 3     February 3     ?
Trip 4     February 15    ?
…          …              …
Trip 10    May 8          ?
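One way to realise a chronological split with iterative evaluation is an expanding window: hold out the latest trips as the test set, then, within the remaining trips, predict each trip using only the trips before it. The sketch below uses the six example trips from the table. The year, the choice of two test trips, the minimum training size, and the function names are assumptions; the slides do not say where the split falls.

```python
from datetime import date


def chronological_holdout(trips, test_size):
    """Split trips into (train_val, test) with the latest test_size trips as test."""
    ordered = sorted(trips, key=lambda t: t["date"])
    cut = len(ordered) - test_size
    return ordered[:cut], ordered[cut:]


def expanding_window_splits(trips, min_train=2):
    """Yield (history, next_trip) pairs so each trip is predicted from earlier trips only."""
    ordered = sorted(trips, key=lambda t: t["date"])
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Trips from the table above; the year 2019 is assumed.
trips = [
    {"id": 1, "date": date(2019, 1, 1), "destination": "Cluster 1"},
    {"id": 3, "date": date(2019, 2, 3), "destination": "Cluster 1"},
    {"id": 5, "date": date(2019, 3, 2), "destination": "Cluster 1"},
    {"id": 8, "date": date(2019, 4, 4), "destination": "Cluster 1"},
    {"id": 9, "date": date(2019, 4, 16), "destination": "Cluster 2"},
    {"id": 10, "date": date(2019, 5, 8), "destination": "Cluster 1"},
]

train_val, test = chronological_holdout(trips, test_size=2)
folds = [([t["id"] for t in history], nxt["id"])
         for history, nxt in expanding_window_splits(train_val)]
print(folds)
```

Because every fold trains strictly on the past, no information from later trips leaks into the evaluation, which is the property a random shuffle would break for this kind of data.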

Majority baseline
Distribution of precision on the test set with a majority baseline classifier
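A majority baseline always predicts the destination a driver visited most often in the training data. The sketch below shows that idea for one driver, assuming the first four trips of the earlier example table are training data and the last two are test data. How the talk computed or averaged precision across classes and drivers is not stated, so the per-class precision here is only one possible reading, and the function names are hypothetical.

```python
from collections import Counter


def majority_destination(train_labels):
    """Most frequent destination in the training trips."""
    return Counter(train_labels).most_common(1)[0][0]


def precision_for(label, y_true, y_pred):
    """Of the trips predicted as label, the fraction that really went there."""
    hits = [t for t, p in zip(y_true, y_pred) if p == label]
    if not hits:
        return 0.0
    return sum(1 for t in hits if t == label) / len(hits)


# One driver; which trips are train and which are test is an assumption.
train_labels = ["Cluster 1", "Cluster 1", "Cluster 1", "Cluster 1"]
test_labels = ["Cluster 2", "Cluster 1"]

baseline = majority_destination(train_labels)
predictions = [baseline] * len(test_labels)   # same answer for every test trip
baseline_precision = precision_for(baseline, test_labels, predictions)
print(baseline, baseline_precision)
```

Repeating this per driver gives one precision value per driver, and the spread of those values is the kind of distribution a tuned classifier is then compared against.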

Results
Distribution of precision on the test set with a tuned classifier

Accelerating the Future of Mobility
By embracing Apache Spark, Databricks and the Azure cloud

