Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016

Discerning Human Behavior from Mobility Data: Mobility data encompasses many elements, including location history, latitude and longitude coordinates, anonymized mobile device IDs, and timestamps. Such data are generated, for instance, by automobile navigation applications and by the mobile advertising ecosystem. Typical sources of mobility data contain extensive inaccuracies with a variety of causes, ranging from shortcomings in location services on mobile devices to the intentional misrepresentation of spatial coordinates by bad ecosystem actors. In this talk, we describe a production data pipeline, Darwin, which analyzes the location quality of mobility data to measure how accurately a set of mobility data represents true movement patterns. Darwin uses a number of measures that are ultimately combined into two quality scores: hyper-locality and clusterability. These measures include techniques from information theory, the mean number of spatial clusters, the compactness of the clusters, and the differences between the empirical distribution of digits in the spatial coordinates and reference distributions.

  1. Discerning Human Behavior from Mobility Data. Jonathan Lenaghan, VP of Science and Technology. 2016
  3. Overview
  4. Company Overview: PlaceIQ is building an advanced understanding of consumer behavior to revolutionize the customer experience through location analytics. We create customer-driven audiences and understanding for activation in digital media and deployment across enterprise applications.
      • Founded 5½ years ago; employs 140 people.
      • Headquartered in NYC, with offices in Palo Alto, Chicago, Detroit, LA, Boulder, and London, UK.
  5. New Model: Consumer Behavior of Mobile is the Key to Understanding the Consumer Journey. Diagram labels: Movement Data; Third-Party Data (Age, Income, Auto Purchase, TV); First-Party Data (DMP, CRM); Target, Analyze, Measure, Manage.
  6. Significant Statistics
      • Location-Based POIs: Location points-of-interest: 475 million. Location commercial polygons: 1.4+ million. Location residential parcels: 137 million. Predicted home dwells: 90 million. Current behavioral profiles: 4+ thousand.
      • Unique Devices: Device IDs: 4+ billion. PIQ IDs/unique users: 130 million.
      • Infrastructure: Data storage: ~10 petabytes. Ad requests per second: 250 thousand. Production cluster: 8K nodes.
  7. Driving Components of the Platform: Movement Data, Base Map, and PIQL. Diagram labels: Work, Home. Example PIQL rule shown on the slide: ruleb for Retail { use time_periods Monday--Sunday 09:00--20:00; Walmart and K-Mart where count >= 20 in 10 months; }
  8. Mobility Data
      • Scale of data is vast (~10 PB over three years)
      • Most of the data is very noisy
      • Much of the data is fraudulent
      • Location analytics from high-scale mobile ad request data is full of challenging and interesting problems!
  9. How is movement data generated?
  10. How Direct Location Data is Obtained
      • Step 1: App asks OS for location. On iOS, the app uses the Core Location API. On Android, the app uses the android.location API.
      • Step 2: OS gets location from device.
      • Step 3: OS passes best available location to app.
      • Step 4: App passes location to ad exchange.
      • Step 5: Location analytics platform matches to place or audience.
      • Average accuracy by source: Cellular antenna: 2,000 m. WiFi antenna: 424 m. GPS antenna: 23 m.
      • Location response (operating system to app): (Lat, Long), Accuracy.
      • Ad request (via the ad exchange to the location analytics platform): (Lat, Long), UDID.
      • Key processes: Identify and filter spam. Verify places and map in high detail. Understand the surrounding context. Unify a single device’s many hashed IDs.
  11. Bestiary of Location Fraud
  12. Quality of Movement Data Varies Greatly by Partner
  13. Programmatically-Generated Movement is Common
  14. Misrepresentation May be Nefarious or Not. Examples shown: Spoofing High-Value Locations; Centroid Geocoding.
  15. Misrepresentation May be Nefarious or Not
      • Location-Spoofing: A single device is observed in tens of metros across the United States over the course of a few minutes.
      • Short Distance Jitter: Jitter is typically caused by switching between GPS and cell-tower triangulation.
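
The location-spoofing pattern on this slide (one device seen in many metros within minutes) can be illustrated with an implied-speed check between consecutive locates. This is only a sketch under stated assumptions, not Darwin's actual filter: the function names, the `(unix_seconds, lat, lon)` record layout, and the 1,000 km/h threshold are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_hops(locates, max_speed_kmh=1000.0):
    """Flag consecutive locates of one device whose implied speed is too high.

    locates: list of (unix_seconds, lat, lon) tuples (hypothetical layout).
    max_speed_kmh: hypothetical threshold, roughly airliner speed.
    Returns a list of (earlier, later, speed_kmh) tuples.
    """
    ordered = sorted(locates)  # tuples sort by timestamp first
    flagged = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(ordered, ordered[1:]):
        dist_km = haversine_km(lat1, lon1, lat2, lon2)
        hours = (t2 - t1) / 3600.0
        if hours > 0:
            speed = dist_km / hours
        else:
            # Same timestamp: any displacement implies unbounded speed.
            speed = math.inf if dist_km > 0 else 0.0
        if speed > max_speed_kmh:
            flagged.append(((t1, lat1, lon1), (t2, lat2, lon2), speed))
    return flagged


device = [
    (0, 40.7128, -74.0060),     # New York
    (300, 34.0522, -118.2437),  # Los Angeles, five minutes later
    (600, 34.0530, -118.2440),  # a short walk away
]
for earlier, later, speed in implausible_hops(device):
    print(earlier, "->", later, f"{speed:.0f} km/h")
```

A check like this would not catch short-distance jitter, which produces small, plausible speeds; that case needs a different signal.
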
  16. HyQuP and Darwin
  17. Darwin is a Location Fraud Detection and Prevention Product
      • “I know that half of my advertising budget is wasted, I just don’t know which half.”
      • Darwin removes on average 40% of all ad requests as misrepresenting location
      • Not filtering ensures that nearly half of all location-based ads are wasted
      • Inferring human behavior from ad request data is impossible without such an enabling technology
  18. Measure quality of data and then filter bad data
      • HyQuP (Hyperlocality Quality Pipeline): Produces metrics to judge how closely a corpus of movement data reflects human movement and behavior. Computes two metrics: Hyperlocality and Clusterability.
      • Darwin Fraud Filters: Detect and filter, on a locate-by-locate basis, those device IDs and locations that are plagued with misrepresented geocodes.
  19. Hyperlocality
  20. Hyperlocality: Location data should reflect human movement in the real world at high resolution
      • Information-theoretic techniques can be very powerful and, since they typically employ simple counting, are easy to compute.
      • Determines the efficiency of location data as it moves from low to high resolutions
      • How good is our inference of out-of-home behavior?
      • Is the data human-generated or computer-generated?
  21. Distribution of Digits: What would the expected distribution of the individual digits of the coordinate pairs representing the movement of humans be? Consider both the distribution of the individual digits after the decimal point and the joint distribution, e.g. for the coordinate pair (90.123456, 88.981239). Generate the empirical distributions of digits and compute the Kullback-Leibler divergence (KLD) between each of these distributions and the uniform distribution.
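
The digit-distribution measure on this slide can be sketched as follows. The slide does not say which digit positions or how many decimal places are used, so the six-place formatting, the function names, and the choice to pair the latitude and longitude digits at the same position for the joint distribution are assumptions made for illustration.

```python
import math
from collections import Counter


def decimal_digits(value, places=6):
    """Digits after the decimal point as a fixed-width string."""
    return f"{abs(value):.{places}f}".split(".")[1]


def kld_from_uniform(symbols, alphabet_size):
    """KL divergence in bits from the empirical distribution to uniform.

    Symbols that never occur contribute zero, so only observed ones are summed.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) * alphabet_size)
        for n in counts.values()
    )


def digit_klds(points, position, places=6):
    """KLD for the latitude digit, longitude digit, and their joint pair.

    points: list of (lat, lon); position: 0-based index after the decimal point.
    """
    lat_digits = [decimal_digits(lat, places)[position] for lat, _ in points]
    lon_digits = [decimal_digits(lon, places)[position] for _, lon in points]
    return {
        "lat": kld_from_uniform(lat_digits, 10),
        "lon": kld_from_uniform(lon_digits, 10),
        "joint": kld_from_uniform(list(zip(lat_digits, lon_digits)), 100),
    }


# Every locate pinned to one centroid: the last digit never varies.
centroid = [(39.8283, -98.5795)] * 50
print(digit_klds(centroid, position=5))
```

In this sketch a divergence near zero means the digit looks uniformly distributed, while a divergence near log2(10) bits for a single digit means that digit is essentially constant, as with centroid geocoding.
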
  22. Zoom-Stack Efficiency: We apply the notion of information efficiency and changes in this quantity as we move down a zoom-stack from 1 km to 100 m to 10 m. The metric measures how much information is gained as we add additional digits to the coordinates. This is a way to measure the amount of randomness gained with the addition of each digit. Zoom levels shown: 10 km x 10 km, 1 km x 1 km, 100 m x 100 m, 10 m x 10 m.
  23. Zoom-Stack Efficiency: Given our knowledge of the Nth digits in a coordinate pair, how much more information do we gain by knowing the next digit? In other words, how much randomness is induced at the next level of the zoom stack? Zoom levels shown: 10 km x 10 km, 1 km x 1 km, 100 m x 100 m, 10 m x 10 m.
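
The slides do not give a formula for zoom-stack efficiency. One plausible reading, offered here only as an assumption, is the conditional entropy of the grid cell at the next zoom level given the cell at the current level, normalized by the maximum of log2(100) bits (each cell splits into 10 x 10 sub-cells when one decimal digit is added to both coordinates). The function names and the six-decimal input precision are hypothetical.

```python
import math
from collections import Counter

PLACES = 6  # assumed precision of the input coordinates


def entropy_bits(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def cell(lat, lon, digits):
    """Grid cell obtained by keeping `digits` decimal places of each coordinate."""
    step = 10 ** (PLACES - digits)
    # Work in integer micro-degrees to avoid floating-point truncation errors.
    return (round(lat * 10 ** PLACES) // step, round(lon * 10 ** PLACES) // step)


def zoom_efficiency(points, digits):
    """H(cell at digits+1 | cell at digits) / log2(100), a value in [0, 1].

    The fine cell determines the coarse cell, so the conditional entropy
    equals H(fine) - H(coarse).
    """
    coarse = entropy_bits(cell(lat, lon, digits) for lat, lon in points)
    fine = entropy_bits(cell(lat, lon, digits + 1) for lat, lon in points)
    return (fine - coarse) / math.log2(100)


# Locates snapped to a single point gain no information at finer zooms.
snapped = [(40.123456, -74.001234)] * 20
print(zoom_efficiency(snapped, digits=2))
```

Under this reading, data that stays informative as digits are added scores close to 1 at each level, while data whose extra digits are constant or padded scores close to 0.
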
  24. Clusterability
  25. Clusterability: Location data shouldn’t be evenly distributed, because humans aren’t
      • The clustering of coordinate points captures real-life human behaviors and habits
      • Most devices have a few tight clusters that represent where they live and work
      • They have less dense clusters around usual social venues
      • Does the data give clean clusters around homes and businesses?
  26. Clusterability: Does Location Data Look like Humans?
      • Do the locates tend to cluster over residential lots and workplaces in a manner consistent with human behavior?
      • Locates on a device-by-device basis should be scattered into clusters with a predictable pattern.
      • The silhouette of the clusters should also be well defined. It should be neither point-like nor diffuse.
      • Clusters computed using DBSCAN
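
The slide names DBSCAN but gives no parameters or implementation details. The following is a minimal, quadratic-time sketch of the textbook algorithm in pure Python, assuming the locates of one device have already been projected to planar coordinates in metres; `eps` and `min_pts` values are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    points: list of (x, y) in metres (assumed planar projection).
    eps: neighbourhood radius; min_pts: neighbours (including the point
    itself) needed for a point to count as a core point.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


home = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
work = [(x + 1000, y + 1000) for x, y in home]
stray = [(5000, 0)]
print(dbscan(home + work + stray, eps=50, min_pts=3))
```

In this toy input the two tight groups stand in for home and work clusters, and the isolated point is left as noise. A production pipeline at the scale described in the talk would need a spatial index rather than the all-pairs distance scan used here.
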
  27. Quality Scores: Clusterability
      • D measures whether clusters are formed and numerically represents the density of the clustering
      • R measures the robustness of the clustering of the data set
      • S measures the tightness of the clustering
      • Clusterability = D * R * (1 + S) / [R + (1 + S) / 2], i.e. the product of the density of the clustering and the harmonic mean of the robustness and the normalized silhouette score (1 + S) / 2.
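
The slide's formula and its verbal description agree once the silhouette score S, which lies in [-1, 1], is rescaled to N = (1 + S) / 2: the harmonic mean 2RN / (R + N) simplifies to R(1 + S) / [R + (1 + S) / 2]. A small sketch of the computation follows; the function name, the zero-denominator convention, and the example values are assumptions, and the slides do not state the ranges of D and R.

```python
def clusterability(d, r, s):
    """Density times the harmonic mean of robustness and normalized silhouette.

    d: density score D; r: robustness score R; s: silhouette score S in [-1, 1].
    """
    n = (1.0 + s) / 2.0  # normalized silhouette in [0, 1]
    if r + n == 0:
        return 0.0  # assumed convention when both terms vanish
    return d * (2.0 * r * n) / (r + n)


def clusterability_slide_form(d, r, s):
    """The same quantity written exactly as on the slide."""
    return d * r * (1.0 + s) / (r + (1.0 + s) / 2.0)


print(clusterability(0.8, 0.5, 0.0))
print(clusterability_slide_form(0.8, 0.5, 0.0))
```

Using a harmonic mean means a data set scores well only when robustness and silhouette are both reasonably high; a strong value in one cannot compensate for a poor value in the other.
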
  28. Quality Scores: Conclusions
      • Misrepresentation of location data is widespread in the mobile ad ecosystem
      • Rely on hyperlocality (information-theoretic approach) and clusterability (unsupervised learning)
      • Essential to measure and filter devices, applications and locations
      • High scale and challenging problems to be solved