Data Philly Meetup - Big (Geo) Data
Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.


  1. 1. Big (Geo) Data Science
        Robert Cheetham
        cheetham@azavea.com   @rcheetham
  2. 2. Web/Mobile, Geospatial, UI/UX Design, High Performance Computing, R&D
  3. 3. B Corporation
        • Projects w/ Social Value
        • Summer of Maps
        • Pro Bono Program
        • Donate share of profits
        Research-Driven
        • 10% Research Program
        • Academic Collaborations
        • Open Source
  4. 4. Spatial Temporal Forecasting with Philadelphia Crime Data
  5. 5. How Phila PD uses Maps
        • Customized Map Products
        • Weekly CompStat Meetings
        • Web Crime Analysis
  6. 6. INCT & PARS – main database sources
        Over 5,000 incidents daily, over 2 million annually
        [Data-flow diagram. Labels: Complainant; Verizon 911; 911 Operator; Radio Dispatcher; CAD; Police Officer; 48 Desk; Incident Report Completed by Officer; District X, District Y, District Z; INCT; PARS; Daily download & Geocoding Routines; Maps distributed through Intranet, Printing, CompStat]
  7. 7. The Context
        • 1,500,000 people
        • 7,000 police
        • 1,000 civilian employees
        • 2,000,000 new incidents / year
        • 3 crime analysts
  8. 8. What we did
        • Weekly CompStat
        • Lots of maps
        • Automation of map creation
        • Web-based systems
  9. 9. … but what if we could…
        • Accelerate the cycle
        • Proactively notify
        • Automate the process
  10. 10. Prototype
        [Diagram. Labels: VB & MapObjects; ArcView; .ini file; Process Documentation; Shapefiles and GRIDs; MS SQL Server Crime Incidents Database]
  11. 11. … but there was a problem …
  12. 12. …it was crap …
  13. 13. … sort of.
  14. 14. We needed …
        1. Better Statistics
        2. Notification
        3. Simplicity
  15. 15. Crime Analysis – What has happened?
            – Mapping (spatial / temporal densities)
            – Trending
            – Intelligence Dashboard
        Early Warning – What is out of the ordinary?
            – Statistical & Threshold-based Hunches (data mining)
            – Alerting
        Risk Forecasting – What is likely to happen next?
            – Near Repeat Pattern
            – Load Forecasting
  16. 16. Crime Analysis
            – Mapping (spatial / temporal densities)
            – Trending
            – Intelligence Dashboard
        Early Warning
            – Statistical & Threshold-based Hunches (data mining)
            – Alerting
        Risk Forecasting
            – Near Repeat Pattern
            – Load Forecasting
  17. 17. Crime Analysis
  18. 18. Intelligence Dashboard
  19. 19. Crime Analysis
  20. 20. Early Warning
  21. 21. Early Warning
        • Geographic Early Warning System
            – A system to alert staff of an unusual situation in a particular location
            – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found
        [Diagram. Labels: Operational Databases; HunchLab Database; Geostatistical Engine; Alerting System]
  22. 22. Early Warning
  23. 23. What is a Hunch?
        • A proposed hypothesis, saved into the system, and continually tested for validity
        • Incident Attribute Requirements
            – Location (x, y)
            – Time (timestamp)
            – Classification
        • Hunch Attributes
            – Location (area)
            – Time (recent / historic periods)
            – Classification
        • Analyses
            – Statistical Hunch
            – Threshold Hunch
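
The slide names the two analyses but not their formulas, so the following is only a minimal sketch of how a hunch could be stored and re-tested. The class names, the circular area, and the z-score rule used for the statistical hunch are all assumptions for illustration, not HunchLab's actual method.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot
from statistics import mean, stdev


@dataclass
class Incident:
    x: float
    y: float
    when: datetime
    category: str


@dataclass
class Hunch:
    x: float           # centre of the watched area (hypothetical circular area)
    y: float
    radius: float      # same units as the incident coordinates
    category: str
    recent_days: int   # length of the "recent past" window

    def matches(self, incident):
        near = hypot(incident.x - self.x, incident.y - self.y) <= self.radius
        return near and incident.category == self.category


def count_in_window(hunch, incidents, start, end):
    """Count matching incidents with start <= when < end."""
    return sum(1 for i in incidents if hunch.matches(i) and start <= i.when < end)


def threshold_hunch(hunch, incidents, now, threshold):
    """Fire when the recent count reaches a fixed threshold."""
    start = now - timedelta(days=hunch.recent_days)
    return count_in_window(hunch, incidents, start, now) >= threshold


def statistical_hunch(hunch, incidents, now, historic_windows=4, z_cutoff=2.0):
    """Fire when the recent count is unusually high versus earlier windows.

    Assumption: a z-score against equal-length historic windows
    (historic_windows must be at least 2).
    """
    width = timedelta(days=hunch.recent_days)
    recent = count_in_window(hunch, incidents, now - width, now)
    history = [
        count_in_window(hunch, incidents, now - (k + 2) * width, now - (k + 1) * width)
        for k in range(historic_windows)
    ]
    spread = stdev(history)
    if spread == 0:
        return recent > mean(history)
    return (recent - mean(history)) / spread >= z_cutoff


# Made-up data: five burglaries in the last week, one or two per earlier week.
now = datetime(2013, 2, 19)
days_ago = [1, 2, 3, 4, 5, 10, 15, 16, 22, 29, 30]
incidents = [Incident(0.0, 0.0, now - timedelta(days=d), "burglary") for d in days_ago]
hunch = Hunch(x=0.0, y=0.0, radius=500.0, category="burglary", recent_days=7)
```

With this data, both `threshold_hunch(hunch, incidents, now, 3)` and `statistical_hunch(hunch, incidents, now)` fire, which is the point at which an alert would go out.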
  24. 24. Hunch Parameters: Location
        • Address & Radius
        • Precinct/County/Country
        • Custom Drawn Area
        • Mass Hunch
  25. 25. Hunch Parameters: Time
        • Statistical Hunch
            – Recent Past
            – Historic Past
  26. 26. Hunch Parameters: Classification
        • Category
        • Time of Day
        • Narrative
  27. 27. Hunch Helper
  28. 28. Email Alert
  29. 29. Hunch Details
  30. 30. Risk Forecasting
  31. 31. Predictive Analytics?
        • Prediction vs. Forecasting
  32. 32. Near Repeat Pattern Analysis
  33. 33. Contagious Crime?
        • Near repeat pattern analysis
            • “If one burglary occurs, how does the risk change nearby?”
  34. 34. What Do We Mean By Near Repeat?
        • Repeat victimization
            – Incident at the same location at a later time (likely related)
        • Near repeat victimization
            – Incident at a nearby location at a later time (likely related)
        • Incident A (place, time) --> Incident B (place, time)
  35. 35. Near Repeat Pattern Analysis
        • The goal:
            – Quantify short term risk due to near-repeat victimization
                • “If one burglary occurs, how does the risk of burglary for the neighbors change?”
        • What we know:
            – Incident A (place, time) --> Incident B (place, time)
                • Distance between A and B
                • Timeframe between A and B
        • What we need to know:
            – What distances/timeframes are not simply random?
  36. 36. Near Repeat Pattern Analysis
        • The process
            – Observe the pattern in historic data
            – Simulate the pattern in randomized historic data
            – Compare the observed pattern to the simulated patterns
            – Apply the non-random pattern to new incidents
        • An example
            – 180 days of burglaries in Division 6 of Philadelphia
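
The observe / simulate / compare steps can be sketched as a permutation test. This is a minimal version under stated assumptions: it tests a single distance band and a single time band, and it randomizes the history by shuffling incident dates across locations. The function names and the toy data are hypothetical.

```python
import random
from itertools import combinations
from math import hypot


def count_close_pairs(points, days, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    n = 0
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        if hypot(x1 - x2, y1 - y2) <= max_dist and abs(days[i] - days[j]) <= max_days:
            n += 1
    return n


def near_repeat_test(points, days, max_dist, max_days, runs=199, seed=0):
    """Compare the observed close-pair count with counts from shuffled dates."""
    rng = random.Random(seed)
    observed = count_close_pairs(points, days, max_dist, max_days)
    shuffled = list(days)
    simulated = []
    for _ in range(runs):
        rng.shuffle(shuffled)  # breaks the space-time link, keeps places and dates
        simulated.append(count_close_pairs(points, shuffled, max_dist, max_days))
    at_least = sum(1 for s in simulated if s >= observed)
    p_value = (at_least + 1) / (runs + 1)
    expected = sum(simulated) / runs
    ratio = observed / expected if expected else float("inf")
    return observed, ratio, p_value


# Toy data: ten locations, each hit three times on consecutive days.
points = [(c * 1000.0, 0.0) for c in range(10) for _ in range(3)]
days = [c * 20 + k for c in range(10) for k in range(3)]
observed, ratio, p_value = near_repeat_test(points, days, max_dist=100.0, max_days=7)
```

A ratio well above 1 with a small p-value says that pairs this close in space and time occur more often than chance would produce, which is the "not simply random" band the slide asks about.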
  37. 37. Near Repeat Pattern Analysis
  38. 38. Near Repeat Pattern Analysis
  39. 39. Near Repeat Pattern Analysis
  40. 40. Near Repeat Pattern Analysis
  41. 41. Near Repeat Pattern Analysis
        • How can you test your own data?
            – Near Repeat Calculator
                • http://www.temple.edu/cj/misc/nr/
        • Papers
            – Near-Repeat Patterns in Philadelphia Shootings (2008)
                • One city block & two weeks after one shooting
                    – 33% increase in likelihood of a second event
        Jerry Ratcliffe, Temple University
  42. 42. Contagious Crime?
  43. 43. Workload Forecasting
  44. 44. Improving CompStat
        • Workload forecasting
            • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
  45. 45. What Do We Mean By Load Forecasting?
        • Workload forecasting
            • Generating aggregate crime counts for a future timeframe using cyclical time series analysis
        Measure cyclical patterns + Identify non-cyclical trend
        Forecast expected count
        bit.ly/gorrcrimeforecastingpaper
  46. 46. Load Forecasting
        • Measure cyclical patterns
            • Take historic incidents (for example: last five years)
            • Generate multiplicative seasonal indices
                – For each time cycle:
                    » time of year
                    » day of week
                    » time of day
                – Count incidents within each time unit (for example: Monday)
                – Calculate average per time unit if incidents were evenly distributed
                – Divide counts within each time unit by the calculated average to generate multiplicative indices
                    » Index ~ 1 means at the average
                    » Index > 1 means above average
                    » Index < 1 means below average
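
As a worked example of the index calculation for one cycle (day of week), here is a minimal sketch. The function name and the sample data are made up, and it assumes the history covers whole weeks so that every weekday occurs equally often.

```python
from collections import Counter
from datetime import date, timedelta


def day_of_week_indices(incident_dates):
    """Multiplicative index per weekday (0 = Monday ... 6 = Sunday)."""
    counts = Counter(d.weekday() for d in incident_dates)
    # Average per weekday if incidents were spread evenly across the week.
    average = len(incident_dates) / 7
    return {dow: counts.get(dow, 0) / average for dow in range(7)}


# Four whole weeks starting Monday 2013-01-07:
# one incident on most days, two on Fridays, three on Saturdays.
per_weekday = {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 1}
start = date(2013, 1, 7)
incident_dates = []
for offset in range(28):
    day = start + timedelta(days=offset)
    incident_dates.extend([day] * per_weekday[day.weekday()])

indices = day_of_week_indices(incident_dates)
```

Here Saturday has 12 of the 40 incidents against an even-spread average of 40 / 7, so its index is 2.1 (above average), while Monday's is 0.7 (below average).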
  47. 47. Load Forecasting
  48. 48. Load Forecasting
  49. 49. Load Forecasting
  50. 50. Load Forecasting
  51. 51. Load Forecasting
        • Identify non-cyclical trend
            • Take recent daily counts (for example: last year daily counts)
            • Remove cyclical trends by dividing by indices
            • Run a trending function on the new counts
                – Simple average
                    » Last X Days
                – Smoothing function
                    » Exponential smoothing
                    » Holt’s linear exponential smoothing
  52. 52. Load Forecasting
        • Forecast expected count
            • Project trend into future timeframe
                – Always flat
                    » Simple average
                    » Exponential smoothing
                – Linear trend
                    » Holt’s linear exponential smoothing
            • Multiply by seasonal indices to reseasonalize the data
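
The deseasonalize, trend, project, reseasonalize sequence can be sketched as follows. This is an illustration only: it uses a single day-of-week cycle, and the smoothing constants, the initialization of Holt's method, and all names are assumptions rather than values from the talk.

```python
def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing; returns the final level (flat forecast)."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def holt_linear(values, alpha=0.3, beta=0.1):
    """Holt's linear exponential smoothing; returns (level, trend per step)."""
    level, trend = values[0], values[1] - values[0]
    for v in values[1:]:
        previous = level
        level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (level - previous) + (1 - beta) * trend
    return level, trend


def forecast(daily_counts, weekdays, indices, target_weekdays, use_trend=True):
    """Forecast counts for the days that follow the history."""
    # 1. Remove the cyclical pattern by dividing by the seasonal index.
    deseasonalized = [c / indices[w] for c, w in zip(daily_counts, weekdays)]
    # 2. Estimate the non-cyclical trend.
    if use_trend:
        level, trend = holt_linear(deseasonalized)
    else:
        level, trend = exponential_smoothing(deseasonalized), 0.0
    # 3. Project forward, then multiply by the index to reseasonalize.
    return [(level + (h + 1) * trend) * indices[w]
            for h, w in enumerate(target_weekdays)]


# Made-up indices and two weeks of history with a steadily rising baseline.
indices = {0: 0.8, 1: 0.8, 2: 0.8, 3: 0.8, 4: 0.8, 5: 2.0, 6: 1.0}
weekdays = [t % 7 for t in range(14)]
counts = [(10 + t) * indices[w] for t, w in zip(range(14), weekdays)]
rising = forecast(counts, weekdays, indices, target_weekdays=[0, 1])

flat_counts = [10 * indices[w] for w in weekdays]
flat = forecast(flat_counts, weekdays, indices, target_weekdays=[5], use_trend=False)
```

The `use_trend=False` path corresponds to the "always flat" projections on the slide, and the Holt path to the linear trend.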
  53. 53. Load Forecasting
        Measure cyclical patterns + Identify non-cyclical trend
        Forecast expected count
        bit.ly/gorrcrimeforecastingpaper
  54. 54. Improving CompStat
  55. 55. How Do We Know It’s Accurate?
        • Testing
            • Generated forecasting techniques (examples)
                – Commonly Used
                    » Average of last 30 days
                    » Average of last 365 days
                    » Last year’s count for the same time period
                – Advanced Combinations
                    » Different cyclical indices (example: day of year vs. month of year)
                    » Different levels of geographic aggregation for indices
                    » Different trending functions
            • Scoring methodologies (examples)
                – Mean absolute percent error (with some enhancements)
                – Mean percent error
                – Mean squared error
            • Run thousands of forecasts through testing framework
            • Choose the right technique in the right situation
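
For reference, the three scores in their plain textbook form, plus a tiny "pick the best technique" step. The slide's "enhancements" to MAPE are not described, so none are attempted here; skipping periods with an actual count of zero and measuring error as forecast minus actual are assumptions of this sketch, and the technique names are hypothetical.

```python
def mean_squared_error(actual, forecast):
    return sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mean_percent_error(actual, forecast):
    """Signed bias in percent; positive means over-forecasting."""
    terms = [(f - a) / a for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(terms) / len(terms)


def mean_absolute_percent_error(actual, forecast):
    """Average size of the error in percent, ignoring its direction."""
    terms = [abs(f - a) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(terms) / len(terms)


# Made-up counts for three periods and two candidate techniques.
actual = [10, 20, 40]
candidates = {
    "technique_a": [12, 18, 40],
    "technique_b": [15, 25, 30],
}
scores = {name: mean_absolute_percent_error(actual, f) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

Running many such comparisons per crime type and geography is one way to read "choose the right technique in the right situation".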
  56. 56. Ongoing Research
  57. 57. Research Topics
        • Risk Forecasting
            – Load forecasting enhancements
                • Weather and special events
            – Combining short and long term risk forecasts (Temple)
                • Socioeconomic changes in neighborhoods
            – Risk Terrain Modeling (Rutgers)
                • Context of crime at the microplace
  58. 58. Research Topics
  59. 59. Research Topics
        • Risk Forecasting
            – Offender Management
                • Prioritize offenders based upon statistical models using past behaviors
        • Evaluation
            – Automate Randomized Controlled Trials
  60. 60. Data Processing for Big (Geo) Data
  61. 61. A Story
  62. 62. Robert’s Rules of Housing
        • Close to Center City → somewhat important
        • Walk to Grocery Store → vital
        • Nearby Restaurants → very important
        • Library → nice to have
        • Near a Park → somewhat important
        • Biking / walking distance from our work → very important
        • Biking distance to fencing → somewhat important
  63. 63. Your factors might include…
        • Child Care
        • Local School Rankings
        • Farmers Market
        • Car Share
        • Public Transit
  64. 64. We stand on the shoulders of giants
  65. 65. Not a new idea … Design with Nature
  66. 66. Not a new idea … Dana Tomlin
  67. 67. Desktop GIS
  68. 68. Weighted Overlay
        [Figure: four map layers, weighted x5, x1, x3 and x2, added together to give one result layer]
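
The weighted overlay is a cell-by-cell weighted sum of the factor layers. A minimal sketch using plain nested lists and the four weights from the slide follows; the layer values are made up, and the layers are assumed to share the same size and alignment.

```python
def weighted_overlay(layers, weights):
    """Return the cell-by-cell weighted sum of equally sized raster layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights)) for c in range(cols)]
        for r in range(rows)
    ]


# Four hypothetical 2 x 2 factor layers (1 = factor present, 0 = absent).
layer_a = [[1, 0], [0, 1]]
layer_b = [[1, 1], [0, 0]]
layer_c = [[0, 1], [0, 1]]
layer_d = [[1, 1], [1, 0]]

result = weighted_overlay([layer_a, layer_b, layer_c, layer_d], weights=[5, 1, 3, 2])
```

The top-left cell scores 5 + 1 + 0 + 2 = 8. Changing a weight means recomputing every cell, which is why the calculation time matters once users adjust weights interactively.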
  69. 69. Summary
        • Geography-driven Decisions
        • Iterative
        • Individual
        • Web [and Mobile]
        • Growing data sets
  70. 70. Web Challenges
  71. 71. Web is different from the Desktop
        • Lots of simultaneous users
        • Stateless environment
        • HTML+JS+CSS
        • Users are less skilled
        • Users are less patient
  72. 72. But wait … there’s a problem
        • 10 – 60 second calculation time
        • Multiple simultaneous users …
        • … that are impatient
  73. 73. Data Challenges
  74. 74. Big Data – Social Media
  75. 75. Big Data – Science
  76. 76. Big Data – Citizen Science
  77. 77. Big Data – Cities
  78. 78. Early Prototype
  79. 79. Specific Optimization Goals
        • New Raster File Structure
        • Distributed processing
        • Binary messaging protocol
  80. 80. Optimization: File Format
        • Limit data type and range
        • 1D arrays are fast to read/write
        • Tiled Pyramids
        • Azavea Raster Grid (ARG)
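
To illustrate the first three points, here is a minimal sketch of a raster held as one flat, fixed-type array with row-major indexing and tile extraction. It is not the ARG specification; the class, its method names, and the choice of signed 8-bit cells are assumptions for illustration.

```python
from array import array


class Raster:
    """A raster stored as a flat 1D array of signed 8-bit cells."""

    def __init__(self, cols, rows, fill=0):
        self.cols, self.rows = cols, rows
        # One contiguous buffer; typecode "b" limits cells to -128..127.
        self.cells = array("b", [fill]) * (cols * rows)

    def index(self, col, row):
        """Row-major position of a cell in the flat array."""
        return row * self.cols + col

    def get(self, col, row):
        return self.cells[self.index(col, row)]

    def set(self, col, row, value):
        self.cells[self.index(col, row)] = value

    def tile(self, col0, row0, size):
        """Copy a size x size tile (clipped at the raster edge)."""
        width = min(size, self.cols - col0)
        height = min(size, self.rows - row0)
        out = Raster(width, height)
        for r in range(height):
            start = self.index(col0, row0 + r)
            out.cells[r * width:(r + 1) * width] = self.cells[start:start + width]
        return out


raster = Raster(cols=4, rows=3)
raster.set(2, 1, 7)
tile = raster.tile(col0=2, row0=0, size=2)
```

A flat buffer like this can be written to or read from disk in a single call (for example with `array.tofile` and `array.fromfile`), and tiles let a request touch only the part of the grid it needs.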
  81. 81. Optimization: Distributed Processing
        • Parallelizable - Local Ops and Focal Ops
        • Support multiple
            – Threads
            – Cores
            – CPUs
            – Machines
        • Considered
            – Hadoop
            – Amazon Map Reduce
            – Beowulf
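
Local operations parallelize easily because each output cell depends only on the input cells at the same position, so a raster can be cut into chunks that are processed independently. The sketch below shows that decomposition with a thread pool; the function names are hypothetical, and in CPython threads mainly illustrate the structure, since real speedups for this kind of work would need processes or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def local_add(a, b):
    """A local (cell-by-cell) operation on two equally sized cell runs."""
    return [x + y for x, y in zip(a, b)]


def parallel_local_op(op, a, b, chunk=4, workers=4):
    """Apply a local op to two flat rasters, one chunk per task."""
    spans = [(i, min(i + chunk, len(a))) for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so chunks reassemble correctly.
        parts = pool.map(lambda s: op(a[s[0]:s[1]], b[s[0]:s[1]]), spans)
        return [cell for part in parts for cell in part]


layer_1 = list(range(10))
layer_2 = [10] * 10
combined = parallel_local_op(local_add, layer_1, layer_2)
```

Focal operations need one extra step that this sketch omits: each chunk must also see a border of neighboring cells as wide as the kernel.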
  82. 82. Success!!
        Reduced from 10-60 seconds to <500 milliseconds
  83. 83. Optimizing one process sub-optimizes others
        • Complex to configure and maintain
        • Limited to one operation
        • No interpolation
        • No mixing
            – cell sizes
            – extents
            – projections
            – etc.
  84. 84. • Broader set of functionality
        • Both raster and vector
        • Scala + Akka
        • Open source
  85. 85. Faster is Different
  86. 86. Regional/State:   84 ms
        National:         84 ms
        Large Country:    115 ms
        Continental:      271 ms
        Planet:           1.2 – 2.0 s
  87. 87. Ongoing R&D
  88. 88. GPUs
  89. 89. GPU Results
        • Re-wrote a few Map Algebra operations:
            – Local
            – Neighborhood
            – Zonal
            – Viewshed
            – etc.
        • 15 – 120x
            – Large grids
            – Large kernels
  90. 90. New Spatial Operations Vector Neighborhood/Focal Spatial Statistics Integration
  91. 91. Urban Forest Ecosystem Modeling
  92. 92. Crime Analysis, Early Warning and Forecasting
  93. 93. Open Source Geoprocessing
        • GDAL
        • GeoServer
        • PostGIS
        • R
        • GeoDa
  94. 94. Many Thanks!
        © Photo used with permission from Alphafish, via Flickr.com
  95. 95. Big (Geo) Data Science [We are hiring]
        Robert Cheetham
        cheetham@azavea.com   @rcheetham