Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016

1
Discerning Human Behavior from Mobility Data
Jonathan Lenaghan
VP of Science and Technology
2016

2
Agenda
PLACEIQ AND MOBILE OVERVIEW
DATA GENERATION PROCESS
BESTIARY OF LOCATION FRAUD
DISCERNING HUMAN BEHAVIOR

4
Company Overview
PlaceIQ is building an advanced understanding of
consumer behavior to revolutionize the customer
experience through location analytics.
We create customer-driven audiences & understanding
for activation in digital media and deployment across
enterprise applications.
• Founded 5½ years ago; employs 140 people.
• Headquartered in NYC, with offices in Palo Alto,
Chicago, Detroit, LA, Boulder, and London UK

5
New Model of Consumer Behavior
Mobile is the Key to Understanding the Consumer Journey
MOVEMENT DATA
THIRD PARTY DATA: Age, Income, Auto Purchase, TV
FIRST PARTY DATA: DMP, CRM
TARGET
ANALYZE
MEASURE
MANAGE

6
Significant Statistics
Location-Based POIs
Location points-of-interest: 475 Million
Location Commercial polygons: 1.4+ Million
Location Residential Parcels: 137 Million
Predicted Home Dwells: 90 Million
Current Behavioral Profiles: 4+ Thousand
Unique Devices
Device IDs: 4+ Billion
PIQ IDs/unique users: 130 Million
Infrastructure
Data Storage: ~10 Petabytes
Ad Requests per Second: 250 Thousand
Production Cluster: 8K Nodes

7
Driving Components of the Platform
Work
Home
MOVEMENT DATA
BASEMAP
PIQL
ruleb for Retail {
use time_periods Monday--Sunday 09:00--20:00;
Walmart and K-Mart where count >= 20 in 10 months;
}

8
Mobility Data
• Scale of data is vast (~10 PB over three years)
• Most of the data is very noisy
• Much of the data is fraudulent
• Location analytics from high-scale mobile ad request data is full of
challenging and interesting problems!

How is movement data generated?

10
How Direct Location Data is Obtained
1 App asks OS for location
2 OS gets location from device
3 OS passes best available location to app
4 App passes location to ad exchange
5 Location analytics platform matches to place or audience
On iOS, the app uses the Core Location API.
On Android, the app uses the android.location API.
Operating System location sources
• Cellular Antenna, Avg. Accuracy: 2,000 m
• WIFI Antenna, Avg. Accuracy: 424 m
• GPS Antenna, Avg. Accuracy: 23 m
Location Response
• ( Lat, Long )
• Accuracy
Ad Request
• ( Lat, Long )
• UDID
Ad Exchange
Location Analytics Platform
Key Processes
• Identify and filter spam
• Verify places and map in high detail
• Understand the surrounding context
• Unify a single device’s many hashed IDs

12
Quality of Movement Data Varies Greatly by Partner

13
Programmatically-Generated Movement is Common

14
Misrepresentation May be Nefarious or Not
Spoofing High-Value Locations
Centroid Geocoding

15
Misrepresentation May be Nefarious or Not
Location-Spoofing
A single device is observed in tens of metros across the United States over the course of a few minutes.
Short Distance Jitter
Jitter is typically caused by switching between GPS and cell-tower triangulation.
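One simple way to surface both patterns is to look at the implied speed between consecutive locates of a single device. The sketch below is only an illustration, not a description of how PlaceIQ detects either pattern: the function names, the 1,000 km/h speed ceiling, and the 3 km jitter radius are all hypothetical choices.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify_hops(locates, max_speed_kmh=1000.0, jitter_km=3.0):
    """Label each consecutive pair of locates for one device.

    locates: list of (unix_seconds, lat, lon), sorted by time.
    Returns one label per hop: "ok", "jitter", or "impossible".
    Both thresholds are hypothetical, chosen only for illustration.
    """
    labels = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(locates, locates[1:]):
        dist = haversine_km(la0, lo0, la1, lo1)
        hours = max(t1 - t0, 1) / 3600.0  # guard against zero elapsed time
        speed = dist / hours
        if speed <= max_speed_kmh:
            labels.append("ok")
        elif dist <= jitter_km:
            # Implausibly fast, but over a short distance.
            labels.append("jitter")
        else:
            # Implausibly fast over a long distance, e.g. metro to metro.
            labels.append("impossible")
    return labels


trace = [
    (0, 40.7128, -74.0060),      # New York
    (120, 34.0522, -118.2437),   # Los Angeles, two minutes later
    (125, 34.0722, -118.2437),   # about 2 km away, five seconds later
    (725, 34.0732, -118.2437),   # about 100 m away, ten minutes later
]
print(classify_hops(trace))
```

A device whose trace is dominated by "impossible" hops is a candidate for spoofing, while isolated "jitter" hops are more likely a benign side effect of switching location sources.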

17
Darwin is a Location Fraud Detection and Prevention Product
• “I know that half of my advertising budget is wasted, I just don’t know which half.”
• Darwin removes on average 40% of all ad requests as misrepresenting location
• Without filtering, nearly half of all location-based ads are wasted
• Inferring human behavior from ad request data is impossible without such an enabling technology

18
Measure quality of data and then filter bad data
HyQuP (Hyperlocality Quality Pipeline)
Produces metrics to judge how closely a corpus of movement data reflects
human movement and behavior.
Computes two metrics: Hyperlocality and Clusterability
Darwin Fraud Filters
Detects and filters, on a locate-by-locate basis, those device IDs and locations
that are plagued with misrepresented geocodes.

20
Hyperlocality
Location data should reflect human movement in the real world at high
resolution
• Information-theoretic techniques can be very powerful and, since they
typically employ simple counting, are easy to compute.
• Determines the efficiency of location data as it moves from low to high
resolutions
• How good is our inference of out-of-home behavior?
• Is the data human generated or computer-generated?

21
Distribution of Digits
What would the expected distribution of the individual digits of the coordinate
pairs representing the movement of humans be?
Consider both the distribution of the individual digits after the decimal places as
well as the joint distribution, e.g. for the coordinate pair
(90.123456, 88.981239)
Generate the empirical distribution of digits and compute the Kullback-Leibler
divergence (KLD) between this distribution and the uniform distribution.
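As a rough illustration of the counting involved, the sketch below computes the KLD, in bits, between the empirical distribution of one decimal digit and the uniform distribution over 0-9. It covers only the per-digit (marginal) case, not the joint distribution, and the function names and the choice of six decimal places are assumptions made for the example.

```python
import math
from collections import Counter


def decimal_digits(value, places=6):
    """Return the first `places` digits after the decimal point as a string."""
    text = f"{abs(value):.{places}f}"
    return text.split(".")[1]


def digit_kld_from_uniform(coords, position, places=6):
    """KL divergence (bits) of one decimal digit from the uniform distribution.

    coords: iterable of (lat, lon) pairs.
    position: 0-based index of the digit after the decimal point.
    Latitude and longitude digits are pooled into one empirical distribution.
    Digits that never occur contribute zero to the sum.
    """
    counts = Counter()
    for lat, lon in coords:
        counts[decimal_digits(lat, places)[position]] += 1
        counts[decimal_digits(lon, places)[position]] += 1
    total = sum(counts.values())
    kld = 0.0
    for count in counts.values():
        p = count / total
        kld += p * math.log2(p / 0.1)  # uniform probability of each digit is 1/10
    return kld


# Every digit 0-9 equally likely in the first decimal place: KLD is 0.
spread = [(i / 10 + 0.05, i / 10 + 0.05) for i in range(10)]
# The same digit every time: KLD reaches its maximum, log2(10) bits.
stuck = [(40.1, -74.1)] * 50
print(digit_kld_from_uniform(spread, 0), digit_kld_from_uniform(stuck, 0))
```

A value near zero means the digit looks uniformly distributed; a large value means a few digit values dominate, as happens when coordinates are truncated, rounded, or snapped to a fixed set of points.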

22
Zoom-Stack Efficiency
We apply the notion of information efficiency
and changes in this quantity as we move down a
zoom-stack from 1km to 100m to 10m.
The metric measures how much information is
gained as we add additional digits to the
coordinates.
This is a way to measure the amount of
randomness gained with the addition of each
digit.
10km x 10km
1km x 1km
100m x 100m
10m x 10m

23
Zoom-Stack Efficiency
Given our knowledge of the Nth digits in a
coordinate pair, how much more information do
we gain by knowing the next digit?
In other words, how much randomness is
induced at the next level of the zoom stack?
10km x 10km
1km x 1km
100m x 100m
10m x 10m
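One way to make this concrete is as a conditional entropy: the information gained by the next digit is the entropy of the finer grid cell minus the entropy of the coarser one, divided by the maximum possible gain of log2(100) bits (one extra digit in each of latitude and longitude). This is a sketch under that assumed definition; the slides do not give the exact formula, and the function names are hypothetical.

```python
import math
from collections import Counter


def cell(lat, lon, digits):
    """Grid cell id: both coordinates truncated to `digits` decimal places."""
    def truncate(value):
        whole, frac = f"{value:.6f}".split(".")
        return f"{whole}.{frac[:digits]}"
    return (truncate(lat), truncate(lon))


def entropy_bits(counter):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def zoom_efficiency(coords, digits):
    """Information gained by going from `digits` to `digits + 1` decimals.

    Because each fine cell lies inside exactly one coarse cell, the
    conditional entropy H(fine | coarse) equals H(fine) - H(coarse).
    The result is normalised by log2(100), the gain of a uniformly
    random extra digit in both latitude and longitude.
    """
    coarse = Counter(cell(lat, lon, digits) for lat, lon in coords)
    fine = Counter(cell(lat, lon, digits + 1) for lat, lon in coords)
    return (entropy_bits(fine) - entropy_bits(coarse)) / math.log2(100)


# Points spread evenly over all 100 sub-cells of one roughly 1 km cell.
spread = [(40.7105 + a / 1000, -74.0005 - b / 1000)
          for a in range(10) for b in range(10)]
# Points all reported at exactly the same coordinates.
stacked = [(40.7105, -74.0005)] * 100
print(zoom_efficiency(spread, 2), zoom_efficiency(stacked, 2))
```

With two decimal places corresponding roughly to the 1 km level, three to 100 m, and four to 10 m, an efficiency that collapses toward zero at the finer levels indicates that the extra digits carry little real location information.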

25
Clusterability
Location data shouldn’t be evenly distributed, because humans aren’t
• The clustering of coordinate points captures real-life human behaviors and
habits
• Most devices have a few tight clusters that represent where they live and
work
• They have less dense clusters around usual social venues
• Does the data give clean clusters around homes and businesses?

26
Clusterability: Does Location Data Look like Humans?
[Figure: two example panels, marked ✗ and ✓]
• Do the locates tend to cluster over residential lots and workplaces in a manner consistent with human behavior?
• Locates on a device-by-device basis should be scattered into clusters with a predictable pattern.
• The silhouette of the clusters should also be well defined. It should be neither point-like nor diffuse.
• Clusters computed using DBSCAN
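To show what DBSCAN does with one device's locates, here is a minimal pure-Python version of the algorithm. It is a teaching sketch, not the production pipeline: it assumes locates have already been projected to planar (x, y) coordinates in metres, and the eps and min_pts values in the example are arbitrary.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over planar points.

    points: list of (x, y) in metres (assumed already projected).
    eps: neighbourhood radius in metres.
    min_pts: minimum neighbours (including the point itself) for a core point.
    Returns one label per point: 0, 1, 2, ... for clusters, NOISE for outliers.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


home = [(0, 0), (5, 3), (-4, 6), (8, -2), (2, 9)]
work = [(5000, 5000), (5006, 4998), (4995, 5004), (5003, 5008), (4999, 4993)]
stray = [(20000, 0)]
print(dbscan(home + work + stray, eps=50, min_pts=3))
```

In this toy trace the home and work locates form two tight clusters and the stray locate is labelled as noise, which is the kind of pattern the bullets above describe for human-generated data.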

27
Quality Scores
Clusterability
D measures whether clusters are formed and numerically represents the density of the clustering
R measures the robustness of the clustering of the data set
S measures the tightness of the clustering
Clusterability = D * R * (1+S) / [R + (1+S)/2]
i.e. the product of the density of the clustering and the harmonic mean of the robustness and the normalized silhouette score.
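The formula can be checked numerically. Writing N = (1+S)/2 for the normalized silhouette score, the harmonic mean of R and N is 2RN/(R+N) = R(1+S)/[R + (1+S)/2], which matches the expression above. The sketch below assumes S is a silhouette-style score in [-1, 1] and that D and R are non-negative; the slides do not state these ranges, and the function name is hypothetical.

```python
def clusterability(d, r, s):
    """Clusterability = D * harmonic_mean(R, (1 + S) / 2).

    d: density score of the clustering.
    r: robustness score of the clustering.
    s: tightness (silhouette-style) score, assumed to lie in [-1, 1].
    """
    normalized_s = (1.0 + s) / 2.0
    denominator = r + normalized_s
    if denominator == 0:
        return 0.0  # both terms are zero, so the harmonic mean is taken as 0
    return d * r * (1.0 + s) / denominator


print(clusterability(1.0, 1.0, 1.0))   # all three scores at their maximum
print(clusterability(0.5, 0.8, 0.6))   # R and normalized S both equal 0.8
```

Because the harmonic mean is dominated by the smaller of its two inputs, a data set scores well only if its clusters are both robust and tight, and the density term D scales the whole result.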

28
Conclusions
• Misrepresentation of location data is widespread in the mobile ad ecosystem
• Rely on hyperlocality (information theoretic approach) and clusterability (unsupervised learning)
• Essential to measure and filter devices, applications and locations
• High scale and challenging problems to be solved
