Discerning Human Behavior from Mobility Data: Mobility data encompasses many elements, including location history, latitude and longitude coordinates, anonymized mobile device IDs, and timestamps. Such data are generated, for instance, by automobile navigation applications and by the mobile advertising ecosystem. Typical sources of mobility data contain extensive inaccuracies that arise from a variety of causes, ranging from shortcomings in location services on mobile devices to the intentional misrepresentation of spatial coordinates by bad ecosystem actors. In this talk, we describe a production data pipeline, Darwin, which analyzes the location quality of mobility data to measure how accurately a set of mobility data represents true movement patterns. Darwin uses a number of measures that are ultimately combined into two quality scores: hyper-locality and clusterability. These measures include techniques from information theory, the mean number of spatial clusters, the compactness of the clusters, and the differences between the empirical distribution of digits in the spatial coordinates and reference distributions.
4. 4
Company
Overview
PlaceIQ is building an advanced understanding of
consumer behavior to revolutionize the customer
experience through location analytics.
We create customer-driven audiences & understanding
for activation in digital media and deployment across
enterprise applications.
• Founded 5½ years ago; employs 140 people.
• Headquartered in NYC, with offices in Palo Alto,
Chicago, Detroit, LA, Boulder, and London UK
5. 5
MOVEMENT DATA
New Model of Consumer Behavior
Mobile is the Key to Understanding the Consumer Journey
THIRD PARTY DATA
Age, Income, Auto Purchase, TV
FIRST PARTY DATA
DMP, CRM
TARGET
ANALYZE
MEASURE
MANAGE
6. 6
Significant Statistics
Location-Based POIs
Location points-of-interest: 475 Million
Location Commercial polygons: 1.4+ Million
Location Residential Parcels: 137 Million
Predicted Home Dwells: 90 Million
Current Behavioral Profiles: 4+ Thousand
Unique Devices
Device IDs: 4+ Billion
PIQ IDs/unique users: 130 Million
Infrastructure
Data Storage: ~10 Petabytes
Ad Requests per Second: 250 Thousand
Production Cluster: 8K Nodes
7. 7
Driving Components of the Platform
Work
Home
MOVEMENT DATA
BASE MAP
ruleb for Retail {
use time_periods Monday--Sunday 09:00--20:00;
Walmart and K-Mart where count >= 20 in 10 months;
}
PIQL
8. 8
Mobility Data
• Scale of data is vast (~10 PB over three years)
• Most of the data is very noisy
• Much of the data is fraudulent
• Location analytics from high-scale mobile ad request data is full of
challenging and interesting problems!
10. 10
How Direct Location Data is Obtained
1 App asks OS for location
2 OS gets location from device
3 OS passes best available location to app
4 App passes location to ad exchange
5 Location analytics platform matches to place or audience
On iOS, the app uses the Core Location API.
On Android, the app uses the android.location API.
Cellular Antenna: Avg. Accuracy 2,000 m
WIFI Antenna: Avg. Accuracy 424 m
GPS Antenna: Avg. Accuracy 23 m
Operating System
Location Response
• ( Lat, Long )
• Accuracy
Ad Request
• ( Lat, Long )
• UDID
Ad Exchange
Location Analytics Platform
Key Processes
• Identify and filter spam
• Verify places and map in high detail
• Understand the surrounding context
• Unify a single device’s many hashed IDs
15. 15
Misrepresentation May be Nefarious or Not
Location-Spoofing
A single device is observed in tens of metros across the United States over the course of a few minutes.
Short Distance Jitter
Jitter is typically caused by switching between GPS and cell-tower triangulation.
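The spoofing symptom described on this slide (one device in many metros within minutes) can be illustrated with an implied-speed check. This is only a minimal sketch and not Darwin's actual filter: the function names, the `(unix_seconds, lat, lon)` record layout, and the 1,000 km/h cutoff are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def implausible_hops(locates, max_kmh=1000.0):
    """Return indices i where the hop from locate i-1 to i implies a speed above max_kmh.

    locates: list of (unix_seconds, lat, lon) for ONE device, sorted by time.
    max_kmh: hypothetical cutoff (roughly airliner speed), not a value from the talk.
    """
    flagged = []
    for i in range(1, len(locates)):
        t0, lat0, lon0 = locates[i - 1]
        t1, lat1, lon1 = locates[i]
        hours = max(t1 - t0, 1) / 3600.0  # guard against zero or negative time gaps
        if haversine_km(lat0, lon0, lat1, lon1) / hours > max_kmh:
            flagged.append(i)
    return flagged
```

A check like this catches cross-country hops but not short-distance jitter, where the implied speed stays plausible.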
17. 17
• “I know that half of my advertising budget is wasted, I just don’t know which half.”
• Darwin removes on average 40% of all ad requests as misrepresenting location
• Without filtering, nearly half of all location-based ads are wasted
• Inferring human behavior from ad request data is impossible without such an enabling technology
Darwin is a Location Fraud Detection and Prevention Product
18. 18
Measure quality of data and then filter bad data
HyQuP (Hyperlocality Quality Pipeline)
Produces metrics to judge how closely a corpus of movement data reflects
human movement and behavior.
Computes two metrics: Hyperlocality and Clusterability
Darwin Fraud Filters
Detects and filters, on a locate-by-locate basis, those device IDs and locations
that are plagued with misrepresented geocodes.
20. 20
Hyperlocality
Location data should reflect human movement in the real world at high
resolution
• Information-theoretic techniques can be very powerful and, since they
typically employ simple counting, are easy to compute.
• Determines the efficiency of location data as it moves from low to high
resolutions
• How good is our inference of out-of-home behavior?
• Is the data human generated or computer-generated?
21. 21
Distribution of Digits
What would the expected distribution of the individual digits of the coordinate
pairs representing the movement of humans be?
Consider both the distribution of the individual digits after the decimal places as
well as the joint distribution, e.g. for the coordinate pair
(90.123456, 88.981239)
Generate the empirical distributions of digits and compute the Kullback-Leibler
divergence (KLD) between each of these distributions and the uniform distribution.
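The per-digit comparison against the uniform distribution can be sketched as follows. This is an illustrative sketch only: the function names and the choice of six decimal places are assumptions, and it covers single digit positions rather than the joint distribution.

```python
import math
from collections import Counter

def decimal_digits(value, places=6):
    """Digits after the decimal point of a coordinate, as a fixed-width string."""
    return f"{abs(value):.{places}f}".split(".")[1]

def digit_kld_vs_uniform(coords, position):
    """KL divergence (bits) between the empirical distribution of the digit at
    `position` (0 = first decimal place) and the uniform distribution on 0-9."""
    counts = Counter(decimal_digits(c)[position] for c in coords)
    total = sum(counts.values())
    # KL(P || U) = sum_d p(d) * log2(p(d) / (1/10)); digits never observed
    # contribute nothing to the sum (0 * log 0 is taken as 0).
    return sum((n / total) * math.log2((n / total) * 10) for n in counts.values())
```

A divergence of 0 bits means the digit looks uniform; the maximum, log2(10) ≈ 3.32 bits, occurs when one digit value accounts for every observation, as happens when coordinates are padded or snapped to a fixed value.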
22. 22
Zoom-Stack Efficiency
We apply the notion of information efficiency
and changes in this quantity as we move down a
zoom-stack from 1km to 100m to 10m.
The metric measures how much information is
gained as we add additional digits to the
coordinates.
This is a way to measure the amount of
randomness gained with the addition of each
digit.
10km x 10km
1km x 1km
100m x 100m
10m x 10m
23. 23
Zoom-Stack Efficiency
Given our knowledge of the Nth digits in a
coordinate pair, how much more information do
we gain by knowing the next digit?
In other words, how much randomness is
induced at the next level of the zoom stack?
10km x 10km
1km x 1km
100m x 100m
10m x 10m
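The question on this slide is a conditional entropy, which can be estimated by simple counting. The sketch below is one way to do it, under stated assumptions: tiles are formed by truncating both coordinates to a given number of decimal places, and the function names are hypothetical rather than taken from the pipeline.

```python
import math
from collections import Counter

def tile(lat, lon, digits):
    """Truncate a coordinate pair to `digits` decimal places (a square tile id).

    Uses floating-point arithmetic; values sitting exactly on a tile boundary
    may land in the neighbouring tile."""
    scale = 10 ** digits
    return (math.floor(lat * scale), math.floor(lon * scale))

def entropy_bits(items):
    """Shannon entropy (bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def next_digit_information(points, n):
    """H(tile at n+1 digits | tile at n digits), in bits.

    The coarse tile is a function of the fine tile, so the conditional
    entropy equals H(fine) - H(coarse)."""
    fine = entropy_bits(tile(lat, lon, n + 1) for lat, lon in points)
    coarse = entropy_bits(tile(lat, lon, n) for lat, lon in points)
    return fine - coarse
```

With N=3 this estimates how much is learned from the 4th decimal digit. The result lies between 0 bits (the extra digit adds nothing) and log2(100) ≈ 6.64 bits (all 10 x 10 sub-tiles equally likely).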
25. 25
Clusterability
Location data shouldn’t be evenly distributed, because humans aren’t
• The clustering of coordinate points captures real-life human behaviors and
habits
• Most devices have a few tight clusters that represent where they live and
work
• They have less dense clusters around usual social venues
• Does the data give clean clusters around homes and businesses?
26. 26
[Figure: two example panels, one marked ✗ and one marked ✓]
• Do the locates tend to cluster over
residential lots and workplaces in a
manner consistent with human behavior?
• Locates on a device-by-device basis
should be scattered into clusters with a
predictable pattern.
• The silhouette of the clusters should also
be well defined. It should be neither
point-like nor diffuse.
• Clusters computed using DBSCAN
Clusterability: Does Location Data Look like Humans?
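To make the DBSCAN step concrete, here is a minimal pure-Python sketch of the algorithm applied to one device's locates. It assumes the locates have already been projected to planar (x, y) coordinates in metres, and the `eps` and `min_pts` values used with it are illustrative, not the pipeline's settings. A production system would more likely use a library implementation with a spatial index.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    points: list of (x, y) in metres (assumed already projected to a plane).
    eps: neighbourhood radius; min_pts: neighbours (incl. self) for a core point.
    """
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

For a device with a tight group of locates at home, another at work, and one stray locate, this returns two cluster ids plus one noise label.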
27. 27
Quality Scores
D measures whether clusters are formed and numerically represents the
density of the clustering
R measures the robustness of the clustering of the data set
S measures the tightness of the clustering
Clusterability = D * R * (1+S) / [R + (1+S)/2]
i.e. the product of the density of the clustering and the harmonic mean of the
robustness and the normalized silhouette score.
Clusterability
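The score on this slide can be written directly as code. The sketch below assumes D and R are already computed and non-negative, and that S is a silhouette score in [-1, 1]; how D and R are derived is not specified here, and the zero-denominator convention is an assumption.

```python
def clusterability(d, r, s):
    """D times the harmonic mean of R and the normalized silhouette (1 + S) / 2."""
    s_norm = (1.0 + s) / 2.0          # maps silhouette from [-1, 1] to [0, 1]
    if r + s_norm == 0:
        return 0.0                    # assumed convention when both terms are 0
    # 2 * r * s_norm / (r + s_norm) == R * (1 + S) / [R + (1 + S) / 2]
    return d * (2.0 * r * s_norm) / (r + s_norm)
```

Because a harmonic mean is dominated by its smaller argument, a corpus scores well only when the clustering is both robust and tight.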
28. 28
Quality Scores
• Misrepresentation of location data is widespread in the mobile ad ecosystem
• Rely on hyperlocality (information theoretic approach) and clusterability
(unsupervised learning)
• Essential to measure and filter devices, applications and locations
• High scale and challenging problems to be solved
Conclusions
Editor's Notes
Don’t want to spend much time on PlaceIQ but want to show scale of data and that there are actually very interesting and challenging problems to be solved in ad tech.
Location fraud detection is really just a sliver of what is done at PlaceIQ but it is a very important sliver. It enables everything else that we do.
Three main constructs to help us understand and contextualize human behavior
Many applications beyond marketing
With today’s heightened importance and awareness of the location-based ecosystem, it’s best to level-set and make sure we all have the same understanding of location.
DEVICE
- Hardware - Location understanding starts with the device. Depending on model and manufacturer, there are one or two antennas that pull in signal - there is nothing else contained inside our devices to understand location positioning. (NOTES: shown in side from top to bottom are the cellular antenna and WiFi antenna in the latest iPhone 6 Plus - GPS and mobile data are pulled through the cellular antenna; optional side story reference, CNET reports that iPhone 5 models for Verizon and Sprint don’t support data-use and voice simultaneously while on 4G LTE networks).
- Software - Use of location data is controlled by operating system location APIs. Both Apple’s Core Location and Android Location for Google are the gatekeepers of all location data and any app/SDK/developer that wants access to a user’s location must request it from this same central source. While all of the apps on your phone may be requesting location data, each is able to request a certain level of “desired accuracy.” I’d like to emphasize that the desired accuracy is not a guarantee, rather this is how developers request the types of signals required when passing location data - Apple: AccuracyKilometer would rely on simple cellular signals while AccuracyNearestTenMeters would only work for GPS (AccuracyBest means exactly that, it could be 2km or 20m). Google: ACCURACY_FINE would rely only on GPS while _COARSE would use any signal.
SIGNAL
The second major component to location is signal - while we all might know generalities about these such as GPS is the most accurate, it’s important to know how precise each signal type can be. Interest in location extends beyond our advertising ecosystem, as independent research from the scientific community has found just how reliable and precise each signal can be. These distances represent the size of blue bubble you would expect to see on your navigation or map app if you used each of these signal types.
It’s worth noting that the highest level of accuracy today is found with the inventors of GPS, the US Government, who show that highly sophisticated GPS-dedicated devices can receive signals accurate to 3.5 meters or 11 ½ feet. (NOTES: the ranges shown above are the distances within which 95% of the signals are accurate, meaning 95% of GPS signals to our phones are accurate to 23m indoors; when outdoors GPS precision improves to 10m; WiFi can be more precise depending on network, but consumer scale is limited as comScore reported 42% of smartphones connect to WiFi).
PLACES
Once you understand the influence of hardware, software, and signal, the final component is about place - even if we know this phone/user is located in one spot, where is that one spot located? To help visualize this, we’d like to look at this AT&T Store located in Fresno, CA - you can see the red outline of the store based on a mall map where this store is located. When we compare how Google, MapQuest, and AT&T’s own store locator identify where this store is located, you can see that Google is the only one that gets it right - AT&T identifies a random location in the parking lot over 100m away from their actual storefront.
Therefore, an understanding of location can be summarized into two categories: location data and place data. Knowing this state of technology, here is what we do at PlaceIQ...(NEXT SLIDE)
I know half my ad budget is wasted, but I don’t know which half. With all this fraud it is well more than half, but we remove it.
Need something that says: THIS IS FRAUD DETECTION. If you are not using Darwin, you are wasting your money and you cannot truly understand consumer behavior.
What would we expect the distribution of latN lngN to be in a large data set representing the movement of people? Well, given that this is just identifying a sub-square in the larger square, and the numbers don't really mean anything else, there is no reason one pair of latN lngN should occur more often than another. Hence, this distribution should intuitively be uniform.
Entropy H: a measure of the information, randomness, or uncertainty in a random variable X.
Conditional entropy $H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x)$: the amount of information in Y not contained in X.
How much extra information is there in knowing the 10m resolution, given that we already know the 100m resolution?
Again, since PIQ's standard units are tiles, we set N=3 and want to calculate how much extra information the 4th digit gives, i.e., how much extra information there is in knowing the ~10m by 10m tile a point lies in, given that we already know which 100m by 100m tile it lies in.
People tend to move around the world leaving breadcrumbs as they go.
However, the biggest indication of human behavior is a few repeating clusters of locates representing home, work, and recreation.
Clusters should be not too big and not too small (about 35 meters across and tightly packed).