Data Philly Meetup - Big (Geo) Data

Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.


Transcript

  • 1. Big (Geo) Data Science • Robert Cheetham • cheetham@azavea.com • @rcheetham
  • 2. Web/Mobile • Geospatial • UI/UX Design • High Performance Computing • R&D
  • 3. B Corporation • Projects w/ Social Value • Summer of Maps • Pro Bono Program • Donate share of profits. Research-Driven • 10% Research Program • Academic Collaborations • Open Source
  • 4. Spatial Temporal Forecasting with Philadelphia Crime Data
  • 5. How Phila PD uses Maps • Customized Map Products • Weekly CompStat Meetings • Web Crime Analysis
  • 6. INCT & PARS – main database sources • Over 5,000 incidents daily, over 2 million annually • [Data-flow diagram; labels: Complainant, 911 Operator, Radio Dispatcher, CAD, Police Officer, 48 Desk, Incident Report Completed by Officer, Verizon, INCT, PARS, Daily download, District & Geocoding Routines, Districts X / Y / Z, Maps distributed through Intranet, Printing, CompStat]
  • 7. The Context • 1,500,000 people • 7,000 police • 1,000 civilian employees • 2,000,000 new incidents / year • 3 crime analysts
  • 8. What we did • Weekly CompStat • Lots of maps • Automation of map creation • Web-based systems
  • 9. … but what if we could … • Accelerate the cycle • Proactively notify • Automate the process
  • 10. Prototype • VB & MapObjects • ArcView • .ini file • Process Documentation • Shapefiles and GRIDs • MS SQL Server Crime Incidents Database
  • 11. … but there was a problem …
  • 12. …it was crap …
  • 13. … sort of.
  • 14. We needed … 1. Better Statistics 2. Notification 3. Simplicity
  • 15. Crime Analysis – What has happened? – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – What is out of the ordinary? – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – What is likely to happen next? – Near Repeat Pattern – Load Forecasting
  • 16. Crime Analysis – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – Near Repeat Pattern – Load Forecasting
  • 17. Crime Analysis
  • 18. Intelligence Dashboard
  • 19. Crime Analysis
  • 20. Early Warning
  • 21. Early Warning • Geographic Early Warning System – A system to alert staff of an unusual situation in a particular location – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found • [Diagram; labels: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
  • 22. Early Warning
  • 23. What is a Hunch? • A proposed hypothesis, saved into the system, and continually tested for validity • Incident Attribute Requirements – Location (x, y) – Time (timestamp) – Classification • Hunch Attributes – Location (area) – Time (recent / historic periods) – Classification • Analyses – Statistical Hunch – Threshold Hunch
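The two analysis types on this slide can be illustrated with a minimal Python sketch. The slides do not say how HunchLab scores a hunch, so the z-score comparison, the `z_cutoff` default, and both function names below are assumptions made for illustration only.

```python
from statistics import mean, stdev


def threshold_hunch(recent_count, threshold):
    """Threshold hunch: fire when the recent count reaches a fixed limit."""
    return recent_count >= threshold


def statistical_hunch(recent_count, historic_counts, z_cutoff=2.0):
    """Statistical hunch (assumed z-score form): fire when the recent count
    is unusually high compared with counts from comparable historic periods."""
    mu = mean(historic_counts)
    sigma = stdev(historic_counts)
    if sigma == 0:
        # No historic variation: any increase over the mean counts as unusual.
        return recent_count > mu
    return (recent_count - mu) / sigma >= z_cutoff


# Hypothetical weekly counts for one area and one incident classification.
historic = [4, 6, 5, 7, 5, 6, 4, 5]
print(statistical_hunch(12, historic))
print(threshold_hunch(12, threshold=10))
```

A hunch defined this way only needs the three incident attributes listed on the slide: a location to select the area, a timestamp to assign each incident to the recent or historic period, and a classification to filter on.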
  • 24. Hunch Parameters: Location • Address & Radius • Precinct/County/Country • Custom Drawn Area • Mass Hunch
  • 25. Hunch Parameters: Time • Statistical Hunch – Recent Past – Historic Past
  • 26. Hunch Parameters: Classification • Category • Time of Day • Narrative
  • 27. Hunch Helper
  • 28. Email Alert
  • 29. Hunch Details
  • 30. Risk Forecasting
  • 31. Predictive Analytics? • Prediction vs. Forecasting
  • 32. Near Repeat Pattern Analysis
  • 33. Contagious Crime? • Near repeat pattern analysis • “If one burglary occurs, how does the risk change nearby?”
  • 34. What Do We Mean By Near Repeat? • Repeat victimization – Incident at the same location at a later time (likely related) • Near repeat victimization – Incident at a nearby location at a later time (likely related) • Incident A (place, time) --> Incident B (place, time)
  • 35. Near Repeat Pattern Analysis • The goal: – Quantify short term risk due to near-repeat victimization • “If one burglary occurs, how does the risk of burglary for the neighbors change?” • What we know: – Incident A (place, time) --> Incident B (place, time) • Distance between A and B • Timeframe between A and B • What we need to know: – What distances/timeframes are not simply random?
  • 36. Near Repeat Pattern Analysis • The process – Observe the pattern in historic data – Simulate the pattern in randomized historic data – Compare the observed pattern to the simulated patterns – Apply the non-random pattern to new incidents • An example – 180 days of burglaries in Division 6 of Philadelphia
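The observe / simulate / compare steps above can be sketched as a small Monte Carlo permutation test in Python. This is an illustration under stated assumptions, not the method behind the slides: a single distance band and a single time band are tested, dates are randomized by shuffling them across locations, and the function names, parameters, and sample data are all hypothetical.

```python
import random
from math import hypot


def count_close_pairs(points, days, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    n = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            near_in_space = hypot(x1 - x2, y1 - y2) <= max_dist
            near_in_time = abs(days[i] - days[j]) <= max_days
            if near_in_space and near_in_time:
                n += 1
    return n


def near_repeat_test(points, days, max_dist, max_days, n_sims=199, seed=0):
    """Compare the observed close-pair count with counts from shuffled dates."""
    rng = random.Random(seed)
    observed = count_close_pairs(points, days, max_dist, max_days)
    shuffled = list(days)
    at_least_as_many = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # break the link between place and time
        simulated = count_close_pairs(points, shuffled, max_dist, max_days)
        if simulated >= observed:
            at_least_as_many += 1
    # Pseudo p-value: share of runs (observed included) at least this extreme.
    p_value = (at_least_as_many + 1) / (n_sims + 1)
    return observed, p_value


# Hypothetical data: five pairs of incidents, each pair close in place and time.
points = [(0, 0), (1, 0), (50, 50), (51, 50), (100, 0),
          (101, 0), (0, 100), (1, 100), (200, 200), (201, 200)]
days = [0, 1, 30, 31, 60, 61, 90, 91, 120, 121]
observed, p_value = near_repeat_test(points, days, max_dist=2, max_days=3)
print(observed, p_value)
```

A small pseudo p-value suggests that the chosen distance and timeframe capture clustering that is unlikely to be random, which is the question posed on the previous slide.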
  • 37. Near Repeat Pattern Analysis
  • 38. Near Repeat Pattern Analysis
  • 39. Near Repeat Pattern Analysis
  • 40. Near Repeat Pattern Analysis
  • 41. Near Repeat Pattern Analysis • How can you test your own data? – Near Repeat Calculator • http://www.temple.edu/cj/misc/nr/ • Papers – Near-Repeat Patterns in Philadelphia Shootings (2008) • One city block & two weeks after one shooting – 33% increase in likelihood of a second event (Jerry Ratcliffe, Temple University)
  • 42. Contagious Crime?
  • 43. Workload Forecasting
  • 44. Improving CompStat • Workload forecasting • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
  • 45. What Do We Mean By Load Forecasting? • Workload forecasting • Generating aggregate crime counts for a future timeframe using cyclical time series analysis • Measure cyclical patterns + Identify non-cyclical trend → Forecast expected count • bit.ly/gorrcrimeforecastingpaper
  • 46. Load Forecasting • Measure cyclical patterns • Take historic incidents (for example: last five years) • Generate multiplicative seasonal indices – For each time cycle: » time of year » day of week » time of day – Count incidents within each time unit (for example: Monday) – Calculate average per time unit if incidents were evenly distributed – Divide counts within each time unit by the calculated average to generate multiplicative indices » Index ~ 1 means at the average » Index > 1 means above average » Index < 1 means below average
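The index calculation described on this slide can be sketched in a few lines of Python. The function name, its parameters, and the sample dates are hypothetical, and the sketch assumes every time unit occurs equally often in the history (for example, whole weeks for a day-of-week cycle).

```python
from collections import Counter
from datetime import date


def seasonal_indices(incident_dates, unit_of, n_units):
    """Multiplicative index per time unit: count / evenly-distributed average."""
    counts = Counter(unit_of(d) for d in incident_dates)
    average = len(incident_dates) / n_units  # count per unit if spread evenly
    return {unit: counts.get(unit, 0) / average for unit in range(n_units)}


# Hypothetical week of incidents: two on Monday, one on every other day.
incidents = [date(2013, 1, 7), date(2013, 1, 7)]
incidents += [date(2013, 1, day) for day in range(8, 14)]

# Day-of-week cycle: date.weekday() returns 0 for Monday through 6 for Sunday.
dow_index = seasonal_indices(incidents, lambda d: d.weekday(), n_units=7)
for unit, index in sorted(dow_index.items()):
    print(unit, round(index, 3))
```

The same function covers the other cycles by swapping `unit_of`, for example `lambda d: d.month - 1` with `n_units=12` for a month-of-year cycle.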
  • 47. Load Forecasting
  • 48. Load Forecasting
  • 49. Load Forecasting
  • 50. Load Forecasting
  • 51. Load Forecasting • Identify non-cyclical trend • Take recent daily counts (for example: last year daily counts) • Remove cyclical trends by dividing by indices • Run a trending function on the new counts – Simple average » Last X Days – Smoothing function » Exponential smoothing » Holt’s linear exponential smoothing
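A minimal Python sketch of this step, assuming a single day-of-week index and simple exponential smoothing as the trending function. The function names, the `alpha` default, and the sample numbers are illustrative assumptions rather than values from the talk.

```python
def deseasonalize(daily_counts, weekdays, dow_index):
    """Remove the day-of-week pattern by dividing each count by its index."""
    return [count / dow_index[day] for count, day in zip(daily_counts, weekdays)]


def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing; returns the final smoothed level."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical indices: weekday 0 runs at double the average, weekday 1 at half.
dow_index = {0: 2.0, 1: 0.5}
counts = [20, 5, 22, 6]
weekdays = [0, 1, 0, 1]

flat = deseasonalize(counts, weekdays, dow_index)
print(flat)
print(exponential_smoothing(flat))
```

Once the cyclical pattern is divided out, the remaining series should vary only with the underlying trend, which is what the smoothing function estimates.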
  • 52. Load Forecasting • Forecast expected count • Project trend into future timeframe – Always flat » Simple average » Exponential smoothing – Linear trend » Holt’s linear exponential smoothing • Multiply by seasonal indices to reseasonalize the data
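The linear-trend option can be sketched in Python with Holt’s linear exponential smoothing followed by reseasonalizing. The initialization (first value as the level, first difference as the trend), the smoothing constants, and the function names are assumptions chosen for brevity.

```python
def holt_linear(values, alpha=0.5, beta=0.3):
    """Holt's linear exponential smoothing; returns the final (level, trend)."""
    level = values[0]
    trend = values[1] - values[0]
    for value in values[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def forecast_count(level, trend, steps_ahead, seasonal_index):
    """Project the trend forward, then multiply by the seasonal index."""
    return (level + steps_ahead * trend) * seasonal_index


# Hypothetical deseasonalized daily counts that rise by two per day.
level, trend = holt_linear([10, 12, 14, 16])
print(forecast_count(level, trend, steps_ahead=1, seasonal_index=1.5))
```

For the "always flat" techniques on the slide, the same projection applies with the trend set to zero, so the forecast is the smoothed level times the seasonal index.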
  • 53. Load Forecasting • Measure cyclical patterns + Identify non-cyclical trend → Forecast expected count • bit.ly/gorrcrimeforecastingpaper
  • 54. Improving CompStat
  • 55. How Do We Know It’s Accurate? • Testing • Generated forecasting techniques (examples) – Commonly Used » Average of last 30 days » Average of last 365 days » Last year’s count for the same time period – Advanced Combinations » Different cyclical indices (example: day of year vs. month of year) » Different levels of geographic aggregation for indices » Different trending functions • Scoring methodologies (examples) – Mean absolute percent error (with some enhancements) – Mean percent error – Mean squared error • Run thousands of forecasts through testing framework • Choose the right technique in the right situation
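The three scoring measures named on this slide have standard textbook definitions, sketched here in Python. The "enhancements" mentioned for mean absolute percent error are not described in the slides; skipping periods with an actual count of zero is an assumption made here so the percentage is defined.

```python
def mean_absolute_percent_error(actual, forecast):
    """MAPE in percent, skipping periods whose actual count is zero (assumption)."""
    errors = [abs(f - a) / a for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(errors) / len(errors)


def mean_percent_error(actual, forecast):
    """MPE in percent; the sign shows over-forecasting (+) or under-forecasting (-)."""
    errors = [(f - a) / a for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(errors) / len(errors)


def mean_squared_error(actual, forecast):
    """MSE; penalizes large misses more heavily than small ones."""
    return sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical actual and forecast counts for two periods.
actual = [10, 20]
forecast = [12, 18]
print(mean_absolute_percent_error(actual, forecast))
print(mean_percent_error(actual, forecast))
print(mean_squared_error(actual, forecast))
```

Running each candidate technique through all three measures over many historic periods is one way to "choose the right technique in the right situation", since the measures reward different behavior.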
  • 56. Ongoing Research
  • 57. Research Topics • Risk Forecasting – Load forecasting enhancements • Weather and special events – Combining short and long term risk forecasts (Temple) • Socioeconomic changes in neighborhoods – Risk Terrain Modeling (Rutgers) • Context of crime at the microplace
  • 58. Research Topics
  • 59. Research Topics • Risk Forecasting – Offender Management • Prioritize offenders based upon statistical models using past behaviors • Evaluation – Automate Randomized Controlled Trials
  • 60. Data Processing for Big (Geo) Data
  • 61. A Story
  • 62. Robert’s Rules of Housing • Close to Center City: somewhat important • Walk to Grocery Store: vital • Nearby Restaurants: very important • Library: nice to have • Near a Park: somewhat important • Biking / walking distance from our work: very important • Biking distance to fencing: somewhat important
  • 63. Your factors might include … • Child Care • Local School Rankings • Farmers Market • Car Share • Public Transit
  • 64. We stand on the shoulders of giants
  • 65. Not a new idea … Design with Nature
  • 66. Not a new idea … Dana Tomlin
  • 67. Desktop GIS
  • 68. Weighted Overlay • [Diagram: four raster layers weighted x5, x1, x3 and x2, then added together into one result layer]
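A weighted overlay multiplies each input layer by its weight and adds the results cell by cell. The Python sketch below uses the weights shown on the slide (5, 1, 3, 2); the layer names and cell values are hypothetical, and plain nested lists stand in for real raster data.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized raster layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2x2 suitability layers (1 = criterion met, 0 = not met).
near_grocery = [[1, 0], [0, 1]]
near_library = [[1, 1], [0, 0]]
near_restaurants = [[0, 1], [0, 1]]
near_park = [[1, 1], [1, 1]]

score = weighted_overlay(
    [near_grocery, near_library, near_restaurants, near_park],
    weights=[5, 1, 3, 2],
)
print(score)
```

Because each output cell depends only on the same cell in each input, this is a local operation in Map Algebra terms, which is what makes it easy to split across workers.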
  • 69. Summary • Geography-driven Decisions • Iterative • Individual • Web [and Mobile] • Growing data sets
  • 70. Web Challenges
  • 71. Web is different from the Desktop • Lots of simultaneous users • Stateless environment • HTML+JS+CSS • Users are less skilled • Users are less patient
  • 72. But wait … there’s a problem • 10 – 60 second calculation time • Multiple simultaneous users … • … that are impatient
  • 73. Data Challenges
  • 74. Big Data – Social Media
  • 75. Big Data – Science
  • 76. Big Data – Citizen Science
  • 77. Big Data – Cities
  • 78. Early Prototype
  • 79. Specific Optimization Goals • New Raster File Structure • Distributed processing • Binary messaging protocol
  • 80. Optimization: File Format • Limit data type and range • 1D arrays are fast to read/write • Tiled Pyramids • Azavea Raster Grid (ARG)
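The idea of a restricted cell type stored as one flat array can be sketched in Python. This does not reproduce the ARG format itself, whose layout is not given in the slides; the `FlatRaster` class, its methods, and the choice of unsigned bytes are assumptions for illustration.

```python
from array import array


class FlatRaster:
    """Raster stored as a flat, row-major array of unsigned bytes (0 to 255)."""

    def __init__(self, cols, rows, fill=0):
        self.cols, self.rows = cols, rows
        self.cells = array("B", [fill]) * (cols * rows)

    def get(self, col, row):
        return self.cells[row * self.cols + col]

    def set(self, col, row, value):
        self.cells[row * self.cols + col] = value

    def tile(self, col0, row0, size):
        """Copy a size x size window into its own raster."""
        out = FlatRaster(size, size)
        for r in range(size):
            start = (row0 + r) * self.cols + col0
            out.cells[r * size:(r + 1) * size] = self.cells[start:start + size]
        return out

    def to_bytes(self):
        """One contiguous block, ready to write to disk or send over a socket."""
        return self.cells.tobytes()


raster = FlatRaster(cols=4, rows=4)
raster.set(2, 1, 200)
print(raster.tile(2, 0, 2).cells.tolist())
print(len(raster.to_bytes()))
```

Limiting cells to one small fixed-width type keeps the whole raster in a single contiguous buffer, so reading, writing, and slicing out tiles become plain block copies.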
  • 81. Optimization: Distributed Processing • Parallelizable - Local Ops and Focal Ops • Support multiple – Threads – Cores – CPUs – Machines • Considered – Hadoop – Amazon Map Reduce – Beowulf
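Local operations parallelize well because each output cell depends only on the matching input cells, so a raster can be cut into chunks that are processed independently. The Python sketch below shows that decomposition with a thread pool; it is not how the system in the talk distributes work, and the function names and chunk size are hypothetical. In CPython, threads illustrate the structure rather than deliver a CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def local_add(chunk_a, chunk_b):
    """Local operation: add two rasters cell by cell."""
    return [a + b for a, b in zip(chunk_a, chunk_b)]


def parallel_local_add(raster_a, raster_b, chunk_size=4, workers=4):
    """Split two flat rasters into chunks, add each chunk on a worker, rejoin."""
    chunks = [
        (raster_a[i:i + chunk_size], raster_b[i:i + chunk_size])
        for i in range(0, len(raster_a), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so chunks rejoin correctly.
        results = list(pool.map(lambda pair: local_add(*pair), chunks))
    return [cell for chunk in results for cell in chunk]


# Hypothetical flat rasters of ten cells each.
a = list(range(10))
b = [10] * 10
print(parallel_local_add(a, b))
```

Focal operations need a little more care: each chunk must also carry a border of neighboring cells as wide as the kernel radius before it can be processed on its own.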
  • 82. Success!! Reduced from 10-60 seconds to <500 milliseconds
  • 83. Optimizing one process sub-optimizes others • Complex to configure and maintain • Limited to one operation • No interpolation • No mixing – cell sizes – extents – projections etc.
  • 84. Broader set of functionality • Both raster and vector • Scala + Akka • Open source
  • 85. Faster is Different
  • 86. Regional/State: 84 ms • National: 84 ms • Large Country: 115 ms • Continental: 271 ms • Planet: 1.2 – 2.0 s
  • 87. Ongoing R&D
  • 88. GPUs
  • 89. GPU Results • Re-wrote a few Map Algebra operations: – Local – Neighborhood – Zonal – Viewshed – etc. • 15 – 120x – Large grids – Large kernels
  • 90. New Spatial Operations • Vector • Neighborhood/Focal • Spatial Statistics Integration
  • 91. Urban Forest Ecosystem Modeling
  • 92. Crime Analysis, Early Warning and Forecasting
  • 93. Open Source Geoprocessing • GDAL • GeoServer • PostGIS • R • GeoDa
  • 94. Many Thanks! © Photo used with permission from Alphafish, via Flickr.com
  • 95. Big (Geo) Data Science [We are hiring] • Robert Cheetham • cheetham@azavea.com • @rcheetham