Data Philly Meetup - Big (Geo) Data

Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.

Published in: Technology

  1. 1. Big (Geo) Data Science • Robert Cheetham • cheetham@azavea.com • @rcheetham
  2. 2. Web/Mobile • Geospatial • UI/UX Design • High Performance Computing • R&D
  3. 3. B Corporation: Projects w/ Social Value • Summer of Maps • Pro Bono Program • Donate share of profits. Research-Driven: 10% Research Program • Academic Collaborations • Open Source
  4. 4. Spatial Temporal Forecasting with Philadelphia Crime Data
  5. 5. How Phila PD uses Maps • Customized Map Products • Weekly CompStat Meetings • Web Crime Analysis
  6. 6. INCT & PARS – main database sources: over 5,000 incidents daily, over 2 million annually. [Data-flow diagram; labels: Complainant, 911 Operator, Verizon 911, Radio Dispatcher, CAD, Police Officer, 48 Desk Incident Report Completed by Officer, INCT, PARS, Daily download, District & Geocoding Routines, District X, District Y, District Z, Maps distributed Through Intranet, Printing, CompStat]
  7. 7. The Context • 1,500,000 people • 7,000 police • 1,000 civilian employees • 2,000,000 new incidents / year • 3 crime analysts
  8. 8. What we did • Weekly CompStat • Lots of maps • Automation of map creation • Web-based systems
  9. 9. … but what if we could … • Accelerate the cycle • Proactively notify • Automate the process
  10. 10. Prototype • VB & MapObjects • ArcView • .ini file • Process Documentation • Shapefiles and GRIDs • MS SQL Server Crime Incidents Database
  11. 11. … but there was a problem …
  12. 12. …it was crap …
  13. 13. … sort of.
  14. 14. We needed … 1. Better Statistics 2. Notification 3. Simplicity
  15. 15. • Crime Analysis – What has happened? – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – What is out of the ordinary? – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – What is likely to happen next? – Near Repeat Pattern – Load Forecasting
  16. 16. • Crime Analysis – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – Near Repeat Pattern – Load Forecasting
  17. 17. Crime Analysis
  18. 18. Intelligence Dashboard
  19. 19. Crime Analysis
  20. 20. Early Warning
  21. 21. Early Warning • Geographic Early Warning System – A system to alert staff of an unusual situation in a particular location – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found. [Diagram; labels: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
  22. 22. Early Warning
  23. 23. What is a Hunch? • A proposed hypothesis, saved into the system, and continually tested for validity • Incident Attribute Requirements – Location (x, y) – Time (timestamp) – Classification • Hunch Attributes – Location (area) – Time (recent / historic periods) – Classification • Analyses – Statistical Hunch – Threshold Hunch
  24. 24. Hunch Parameters: Location • Address & Radius • Precinct/County/Country • Custom Drawn Area • Mass Hunch
  25. 25. Hunch Parameters: Time • Statistical Hunch – Recent Past – Historic Past
  26. 26. Hunch Parameters: Classification • Category • Time of Day • Narrative
  27. 27. Hunch Helper
  28. 28. Email Alert
  29. 29. Hunch Details
  30. 30. Risk Forecasting
  31. 31. Predictive Analytics? • Prediction vs. Forecasting
  32. 32. Near Repeat Pattern Analysis
  33. 33. Contagious Crime? • Near repeat pattern analysis • “If one burglary occurs, how does the risk change nearby?”
  34. 34. What Do We Mean By Near Repeat? • Repeat victimization – Incident at the same location at a later time (likely related) • Near repeat victimization – Incident at a nearby location at a later time (likely related) • Incident A (place, time) --> Incident B (place, time)
  35. 35. Near Repeat Pattern Analysis • The goal: – Quantify short term risk due to near-repeat victimization • “If one burglary occurs, how does the risk of burglary for the neighbors change?” • What we know: – Incident A (place, time) --> Incident B (place, time) • Distance between A and B • Timeframe between A and B • What we need to know: – What distances/timeframes are not simply random?
  36. 36. Near Repeat Pattern Analysis • The process – Observe the pattern in historic data – Simulate the pattern in randomized historic data – Compare the observed pattern to the simulated patterns – Apply the non-random pattern to new incidents • An example – 180 days of burglaries in Division 6 of Philadelphia
  37. 37. Near Repeat Pattern Analysis
  38. 38. Near Repeat Pattern Analysis
  39. 39. Near Repeat Pattern Analysis
  40. 40. Near Repeat Pattern Analysis
  41. 41. Near Repeat Pattern Analysis • How can you test your own data? – Near Repeat Calculator • http://www.temple.edu/cj/misc/nr/ • Papers – Near-Repeat Patterns in Philadelphia Shootings (2008) • One city block & two weeks after one shooting – 33% increase in likelihood of a second event. Jerry Ratcliffe, Temple University
  42. 42. Contagious Crime?
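
The near repeat process on slides 36 to 41 (observe, simulate on randomized data, compare) can be sketched as a Knox-style permutation test. This is a minimal sketch under assumptions: the slides do not say which test was used, and the function names, the single distance/time band, and the choice to shuffle timestamps while keeping locations fixed are all illustrative, not taken from the talk.

```python
import random
from math import hypot


def count_close_pairs(points, times, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            near_in_space = hypot(x1 - x2, y1 - y2) <= max_dist
            near_in_time = abs(times[i] - times[j]) <= max_days
            if near_in_space and near_in_time:
                count += 1
    return count


def near_repeat_test(points, times, max_dist, max_days, n_sim=199, seed=0):
    """Compare the observed close-pair count with counts from shuffled data.

    Assumption: randomization keeps every location fixed and shuffles the
    timestamps among incidents, which breaks any space-time link.
    Returns (observed, expected, ratio, p_value).
    """
    rng = random.Random(seed)
    observed = count_close_pairs(points, times, max_dist, max_days)

    shuffled = list(times)
    simulated = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        simulated.append(count_close_pairs(points, shuffled, max_dist, max_days))

    expected = sum(simulated) / n_sim
    ratio = observed / expected if expected else float("inf")
    # Share of runs (simulations plus the observation) at least as extreme.
    p_value = (1 + sum(s >= observed for s in simulated)) / (n_sim + 1)
    return observed, expected, ratio, p_value
```

A ratio well above 1 with a small p-value suggests that pairs within that distance and timeframe occur more often than chance alone would produce. Repeating the test over several distance and time bands would give the kind of band-by-band picture the slides describe.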
  43. 43. Workload Forecasting
  44. 44. Improving CompStat • Workload forecasting • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
  45. 45. What Do We Mean By Load Forecasting? • Workload forecasting • Generating aggregate crime counts for a future timeframe using cyclical time series analysis • Measure cyclical patterns + Identify non-cyclical trend → Forecast expected count • bit.ly/gorrcrimeforecastingpaper
  46. 46. Load Forecasting • Measure cyclical patterns • Take historic incidents (for example: last five years) • Generate multiplicative seasonal indices – For each time cycle: » time of year » day of week » time of day – Count incidents within each time unit (for example: Monday) – Calculate average per time unit if incidents were evenly distributed – Divide counts within each time unit by the calculated average to generate multiplicative indices » Index ~ 1 means at the average » Index > 1 means above average » Index < 1 means below average
  47. 47. Load Forecasting
  48. 48. Load Forecasting
  49. 49. Load Forecasting
  50. 50. Load Forecasting
  51. 51. Load Forecasting • Identify non-cyclical trend • Take recent daily counts (for example: last year daily counts) • Remove cyclical trends by dividing by indices • Run a trending function on the new counts – Simple average » Last X Days – Smoothing function » Exponential smoothing » Holt’s linear exponential smoothing
  52. 52. Load Forecasting • Forecast expected count • Project trend into future timeframe – Always flat » Simple average » Exponential smoothing – Linear trend » Holt’s linear exponential smoothing • Multiply by seasonal indices to reseasonalize the data
  53. 53. Load Forecasting • Measure cyclical patterns + Identify non-cyclical trend → Forecast expected count • bit.ly/gorrcrimeforecastingpaper
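
The three load forecasting steps (measure cyclical patterns, identify the non-cyclical trend, forecast the expected count) can be sketched as follows. This is a minimal sketch, assuming a single cycle (for example day of week), simple exponential smoothing as the trending function, and at least one incident for every cycle key. The function names and the `alpha` parameter are illustrative, not from the talk. The index here is the average count per key divided by the overall average, which matches the slide's formula when every key occurs equally often.

```python
from collections import defaultdict


def seasonal_indices(counts, cycle_keys):
    """Multiplicative index per cycle key: average for the key / overall average."""
    totals = defaultdict(float)
    occurrences = defaultdict(int)
    for count, key in zip(counts, cycle_keys):
        totals[key] += count
        occurrences[key] += 1
    overall_average = sum(counts) / len(counts)
    return {
        key: (totals[key] / occurrences[key]) / overall_average
        for key in totals
    }


def exponential_smoothing(values, alpha=0.3):
    """Return the final smoothed level of a series (a flat forecast)."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def forecast_counts(counts, cycle_keys, future_keys, alpha=0.3):
    """Deseasonalize, estimate the trend level, then reseasonalize."""
    indices = seasonal_indices(counts, cycle_keys)
    # Remove the cyclical pattern by dividing each count by its index.
    deseasonalized = [c / indices[k] for c, k in zip(counts, cycle_keys)]
    level = exponential_smoothing(deseasonalized, alpha)
    # Project the flat level forward and multiply by the seasonal indices.
    return [level * indices[key] for key in future_keys]
```

With daily counts and `cycle_keys` set to each day's weekday number, `forecast_counts(counts, cycle_keys, [0, 5])` would return the expected counts for a future Monday and Saturday. Handling several cycles at once (time of year, day of week, time of day) would mean multiplying one index per cycle, which this sketch leaves out.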
  54. 54. Improving CompStat
  55. 55. How Do We Know It’s Accurate? • Testing • Generated forecasting techniques (examples) – Commonly Used » Average of last 30 days » Average of last 365 days » Last year’s count for the same time period – Advanced Combinations » Different cyclical indices (example: day of year vs. month of year) » Different levels of geographic aggregation for indices » Different trending functions • Scoring methodologies (examples) – Mean absolute percent error (with some enhancements) – Mean percent error – Mean squared error • Run thousands of forecasts through testing framework • Choose the right technique in the right situation
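
The testing idea on this slide (score several techniques on past data, then pick the best one) can be sketched as a small walk-forward backtest. This is a minimal sketch with assumed names. The slide's "enhancements" to mean absolute percent error are not described, so the only adjustment made here is to skip periods whose actual count is zero, and the sign convention for percent error (forecast minus actual) is also an assumption.

```python
def mean_absolute_percent_error(actual, forecast):
    """MAPE in percent; periods with an actual count of zero are skipped."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum(abs(f - a) / abs(a) for a, f in pairs) / len(pairs)


def mean_percent_error(actual, forecast):
    """Signed percent error; positive values mean over-forecasting."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum((f - a) / a for a, f in pairs) / len(pairs)


def mean_squared_error(actual, forecast):
    return sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def average_of_last(n):
    """Technique: forecast the next value as the mean of the last n values."""
    def technique(history):
        window = history[-n:]
        return sum(window) / len(window)
    return technique


def backtest(series, techniques, start, score=mean_absolute_percent_error):
    """Score each technique on one-step-ahead forecasts from `start` onward."""
    scores = {}
    for name, technique in techniques.items():
        # Each forecast only sees the values before the period it predicts.
        forecasts = [technique(series[:t]) for t in range(start, len(series))]
        scores[name] = score(series[start:], forecasts)
    return scores


def best_technique(scores):
    """Name of the technique with the lowest error score."""
    return min(scores, key=scores.get)
```

Running `backtest` per crime type and per geographic area, then calling `best_technique` on each result, is one way to "choose the right technique in the right situation".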
  56. 56. Ongoing Research
  57. 57. Research Topics • Risk Forecasting – Load forecasting enhancements • Weather and special events – Combining short and long term risk forecasts (Temple) • Socioeconomic changes in neighborhoods – Risk Terrain Modeling (Rutgers) • Context of crime at the microplace
  58. 58. Research Topics
  59. 59. Research Topics • Risk Forecasting – Offender Management • Prioritize offenders based upon statistical models using past behaviors • Evaluation – Automate Randomized Controlled Trials
  60. 60. Data Processing for Big (Geo) Data
  61. 61. A Story
  62. 62. Robert’s Rules of Housing • Close to Center City: somewhat important • Walk to Grocery Store: vital • Nearby Restaurants: very important • Library: nice to have • Near a Park: somewhat important • Biking / walking distance from our work: very important • Biking distance to fencing: somewhat important
  63. 63. Your factors might include … • Child Care • Local School Rankings • Farmers Market • Car Share • Public Transit
  64. 64. We stand on the shoulders of giants
  65. 65. Not a new idea … Design with Nature
  66. 66. Not a new idea … Dana Tomlin
  67. 67. Desktop GIS
  68. 68. Weighted Overlay [Figure: four map layers weighted x5, x1, x3 and x2, then added together into one result layer]
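
A weighted overlay of this kind can be sketched as a cell-by-cell weighted sum. This is a minimal sketch, assuming every layer is a grid of the same size with scores already on a common scale. The function name and the tiny example grids are illustrative; only the weights 5, 1, 3 and 2 come from the slide.

```python
def weighted_overlay(layers, weights):
    """Return a grid where each cell is the weighted sum of the input cells."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(weight * layer[r][c] for layer, weight in zip(layers, weights))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 2 x 2 factor layers (1 = factor present in the cell, 0 = absent).
grocery = [[1, 0], [0, 1]]
library = [[1, 1], [0, 0]]
restaurants = [[0, 1], [0, 1]]
park = [[1, 1], [1, 0]]

suitability = weighted_overlay([grocery, library, restaurants, park], [5, 1, 3, 2])
```

Higher values in `suitability` mark cells that satisfy more of the heavily weighted factors. Because each output cell depends only on the same cell in each input layer, this is a local operation in map algebra terms.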
  69. 69. Summary • Geography-driven Decisions • Iterative • Individual • Web [and Mobile] • Growing data sets
  70. 70. Web Challenges
  71. 71. Web is different from the Desktop • Lots of simultaneous users • Stateless environment • HTML+JS+CSS • Users are less skilled • Users are less patient
  72. 72. But wait … there’s a problem • 10 – 60 second calculation time • Multiple simultaneous users … • … that are impatient
  73. 73. Data Challenges
  74. 74. Big Data – Social Media
  75. 75. Big Data – Science
  76. 76. Big Data – Citizen Science
  77. 77. Big Data – Cities
  78. 78. Early Prototype
  79. 79. Specific Optimization Goals • New Raster File Structure • Distributed processing • Binary messaging protocol
  80. 80. Optimization: File Format • Limit data type and range • 1D arrays are fast to read/write • Tiled Pyramids • Azavea Raster Grid (ARG)
  81. 81. Optimization: Distributed Processing • Parallelizable - Local Ops and Focal Ops • Support multiple – Threads – Cores – CPUs – Machines • Considered – Hadoop – Amazon Map Reduce – Beowulf
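
The two optimization ideas above (a raster held as a 1D array, and local and focal operations that parallelize well) can be sketched together. This is a minimal sketch and not the ARG format or the GeoTrellis API: the `Raster` class, the function names and the 3 x 3 mean window are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Raster:
    """A grid stored as one flat, row-major list of cell values."""

    def __init__(self, cols, rows, cells):
        cells = list(cells)
        if len(cells) != cols * rows:
            raise ValueError("cells must hold cols * rows values")
        self.cols, self.rows, self.cells = cols, rows, cells

    def get(self, col, row):
        # Cell (col, row) lives at index row * cols + col.
        return self.cells[row * self.cols + col]


def local_add(a, b):
    """Local op: each output cell depends only on the same cell in a and b."""
    return Raster(a.cols, a.rows, [x + y for x, y in zip(a.cells, b.cells)])


def focal_mean_row(raster, row):
    """Focal op for one row: mean of the 3 x 3 window, clipped at the edges."""
    out = []
    for col in range(raster.cols):
        window = [
            raster.get(c, r)
            for r in range(max(0, row - 1), min(raster.rows, row + 2))
            for c in range(max(0, col - 1), min(raster.cols, col + 2))
        ]
        out.append(sum(window) / len(window))
    return out


def focal_mean(raster, workers=1):
    """Compute every output row independently, then stitch the rows together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda row: focal_mean_row(raster, row),
                             range(raster.rows)))
    return Raster(raster.cols, raster.rows, [v for row in rows for v in row])
```

Each output row only reads the input raster, so rows (or tiles) can be handed to separate workers without coordination. In CPython, threads will not speed up this CPU-bound loop, so the executor here only illustrates the decomposition. The same split could be mapped onto processes or machines.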
  82. 82. Success!! Reduced from 10-60 seconds to <500 milliseconds
  83. 83. Optimizing one process sub-optimizes others • Complex to configure and maintain • Limited to one operation • No interpolation • No mixing – cell sizes – extents – projections etc.
  84. 84. • Broader set of functionality • Both raster and vector • Scala + Akka • Open source
  85. 85. Faster is Different
  86. 86. Regional/State: 84 ms • National: 84 ms • Large Country: 115 ms • Continental: 271 ms • Planet: 1.2 – 2.0 s
  87. 87. Ongoing R&D
  88. 88. GPUs
  89. 89. GPU Results • Re-wrote a few Map Algebra operations: – Local – Neighborhood – Zonal – Viewshed – etc. • 15 – 120x • Large grids • Large kernels
  90. 90. New Spatial Operations Vector Neighborhood/Focal Spatial Statistics Integration
  91. 91. Urban Forest Ecosystem Modeling
  92. 92. Crime Analysis, Early Warning and Forecasting
  93. 93. Open Source Geoprocessing • GDAL • GeoServer • PostGIS • R • GeoDa
  94. 94. Many Thanks! © Photo used with permission from Alphafish, via Flickr.com
  95. 95. Big (Geo) Data Science [We are hiring] • Robert Cheetham • cheetham@azavea.com • @rcheetham
