Data Philly Meetup - Big (Geo) Data

Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.

Usage Rights

© All Rights Reserved

Data Philly Meetup - Big (Geo) Data Presentation Transcript

  • 1. Big (Geo) Data Science: Robert Cheetham, cheetham@azavea.com, @rcheetham
  • 2. Web/Mobile • Geospatial • UI/UX Design • High Performance Computing • R&D
  • 3. B Corporation • Projects w/ Social Value • Summer of Maps • Pro Bono Program • Donate share of profits
      Research-Driven • 10% Research Program • Academic Collaborations • Open Source
  • 4. Spatial Temporal Forecasting with Philadelphia Crime Data
  • 5. How Phila PD uses Maps • Customized Map Products • Weekly CompStat Meetings • Web Crime Analysis
  • 6. INCT & PARS – main database sources: over 5,000 incidents daily, over 2 million annually
      [Diagram of incident data flow. Labels: Complainant, 911 Operator, Radio Dispatcher, CAD, Police Officer, Incident Report Completed by Officer, 48 Desk, District X / Y / Z, Verizon, 911, INCT, PARS, Daily download & Geocoding Routines, Maps distributed Through Intranet, Printing, CompStat]
  • 7. The Context • 1,500,000 people • 7,000 police • 1,000 civilian employees • 2,000,000 new incidents / year • 3 crime analysts
  • 8. What we did • Weekly CompStat • Lots of maps • Automation of map creation • Web-based systems
  • 9. … but what if we could … • Accelerate the cycle • Proactively notify • Automate the process
  • 10. Prototype • VB & MapObjects • ArcView • .ini file • Process Documentation • Shapefiles and GRIDs • MS SQL Server Crime Incidents Database
  • 11. … but there was a problem …
  • 12. …it was crap …
  • 13. … sort of.
  • 14. We needed … 1. Better Statistics 2. Notification 3. Simplicity
  • 15. Crime Analysis – What has happened?
        – Mapping (spatial / temporal densities)
        – Trending
        – Intelligence Dashboard
      Early Warning – What is out of the ordinary?
        – Statistical & Threshold-based Hunches (data mining)
        – Alerting
      Risk Forecasting – What is likely to happen next?
        – Near Repeat Pattern
        – Load Forecasting
  • 16. Crime Analysis
        – Mapping (spatial / temporal densities)
        – Trending
        – Intelligence Dashboard
      Early Warning
        – Statistical & Threshold-based Hunches (data mining)
        – Alerting
      Risk Forecasting
        – Near Repeat Pattern
        – Load Forecasting
  • 17. Crime Analysis
  • 18. Intelligence Dashboard
  • 19. Crime Analysis
  • 20. Early Warning
  • 21. Early Warning
      • Geographic Early Warning System
        – A system to alert staff of an unusual situation in a particular location
        – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found
      [Diagram: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
  • 22. Early Warning
  • 23. What is a Hunch?
      • A proposed hypothesis, saved into the system, and continually tested for validity
      • Incident Attribute Requirements
        – Location (x, y)
        – Time (timestamp)
        – Classification
      • Hunch Attributes
        – Location (area)
        – Time (recent / historic periods)
        – Classification
      • Analyses
        – Statistical Hunch
        – Threshold Hunch
  • 24. Hunch Parameters: Location • Address & Radius • Precinct/County/Country • Custom Drawn Area • Mass Hunch
  • 25. Hunch Parameters: Time
      • Statistical Hunch
        – Recent Past
        – Historic Past
  • 26. Hunch Parameters: Classification • Category • Time of Day • Narrative
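
The hunch attributes on slides 23 to 26 (location, time, classification) can be illustrated with a small sketch of a threshold hunch. Everything here is hypothetical: the names `Incident`, `ThresholdHunch`, `radius`, `recent_window` and `threshold` are not taken from HunchLab, and the rule "trigger when the recent count in the area reaches a threshold" is only an assumed reading of "Threshold Hunch".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot


@dataclass(frozen=True)
class Incident:
    """Minimum incident attributes listed on slide 23."""
    x: float
    y: float
    timestamp: datetime
    classification: str


@dataclass(frozen=True)
class ThresholdHunch:
    """Hypothetical hunch: an address-and-radius area, a recent period, a category."""
    center_x: float
    center_y: float
    radius: float              # same units as the incident coordinates (assumption)
    classification: str
    recent_window: timedelta   # the "recent" period to test
    threshold: int             # count at which staff would be alerted (assumption)

    def matches(self, incident: Incident, now: datetime) -> bool:
        in_area = hypot(incident.x - self.center_x,
                        incident.y - self.center_y) <= self.radius
        in_window = now - self.recent_window <= incident.timestamp <= now
        return in_area and in_window and incident.classification == self.classification

    def count_matches(self, incidents, now: datetime) -> int:
        return sum(1 for incident in incidents if self.matches(incident, now))

    def is_triggered(self, incidents, now: datetime) -> bool:
        return self.count_matches(incidents, now) >= self.threshold
```

A saved hunch would be re-evaluated with `is_triggered` each time new incidents arrive; a statistical hunch would additionally need a comparison against the historic period, which the slides do not spell out.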
  • 27. Hunch Helper
  • 28. Email Alert
  • 29. Hunch Details
  • 30. Risk Forecasting
  • 31. Predictive Analytics? • Prediction vs. Forecasting
  • 32. Near Repeat Pattern Analysis
  • 33. Contagious Crime?
      • Near repeat pattern analysis
        • “If one burglary occurs, how does the risk change nearby?”
  • 34. What Do We Mean By Near Repeat?
      • Repeat victimization
        – Incident at the same location at a later time (likely related)
      • Near repeat victimization
        – Incident at a nearby location at a later time (likely related)
      • Incident A (place, time) --> Incident B (place, time)
  • 35. Near Repeat Pattern Analysis
      • The goal:
        – Quantify short term risk due to near-repeat victimization
          • “If one burglary occurs, how does the risk of burglary for the neighbors change?”
      • What we know:
        – Incident A (place, time) --> Incident B (place, time)
          • Distance between A and B
          • Timeframe between A and B
      • What we need to know:
        – What distances/timeframes are not simply random?
  • 36. Near Repeat Pattern Analysis
      • The process
        – Observe the pattern in historic data
        – Simulate the pattern in randomized historic data
        – Compare the observed pattern to the simulated patterns
        – Apply the non-random pattern to new incidents
      • An example
        – 180 days of burglaries in Division 6 of Philadelphia
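
The observe / simulate / compare steps on slide 36 can be sketched as a permutation test on incident pairs. This is an assumption about how such a test might be built, not the method used in the talk: the single distance/time band, the shuffling of timestamps across locations, and the function names are all illustrative.

```python
import random
from itertools import combinations
from math import hypot


def count_close_pairs(points, days, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    count = 0
    for i, j in combinations(range(len(points)), 2):
        close_in_space = hypot(points[i][0] - points[j][0],
                               points[i][1] - points[j][1]) <= max_dist
        close_in_time = abs(days[i] - days[j]) <= max_days
        if close_in_space and close_in_time:
            count += 1
    return count


def near_repeat_p_value(points, days, max_dist, max_days, n_sim=999, seed=0):
    """Compare the observed pair count with counts from randomized data.

    Randomization here shuffles the incident days across the fixed
    locations, which breaks any space-time link while keeping the
    spatial and temporal distributions unchanged.
    """
    rng = random.Random(seed)
    observed = count_close_pairs(points, days, max_dist, max_days)
    shuffled = list(days)
    as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if count_close_pairs(points, shuffled, max_dist, max_days) >= observed:
            as_extreme += 1
    # Pseudo p-value: share of runs (including the observed one) at least as extreme.
    return observed, (as_extreme + 1) / (n_sim + 1)
```

A small p-value for a given distance and time band suggests that band is "not simply random", which is the question slide 35 asks. The pairwise count is quadratic in the number of incidents, so this form only suits modest samples such as the 180-day example.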
  • 37. Near Repeat Pattern Analysis
  • 38. Near Repeat Pattern Analysis
  • 39. Near Repeat Pattern Analysis
  • 40. Near Repeat Pattern Analysis
  • 41. Near Repeat Pattern Analysis
      • How can you test your own data?
        – Near Repeat Calculator
          • http://www.temple.edu/cj/misc/nr/
      • Papers
        – Near-Repeat Patterns in Philadelphia Shootings (2008)
          • One city block & two weeks after one shooting
            – 33% increase in likelihood of a second event
      Jerry Ratcliffe, Temple University
  • 42. Contagious Crime?
  • 43. Workload Forecasting
  • 44. Improving CompStat
      • Workload forecasting
        • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
  • 45. What Do We Mean By Load Forecasting?
      • Workload forecasting
        • Generating aggregate crime counts for a future timeframe using cyclical time series analysis
      Measure cyclical patterns + Identify non-cyclical trend
      Forecast expected count
      bit.ly/gorrcrimeforecastingpaper
  • 46. Load Forecasting
      • Measure cyclical patterns
        • Take historic incidents (for example: last five years)
        • Generate multiplicative seasonal indices
          – For each time cycle:
            » time of year
            » day of week
            » time of day
          – Count incidents within each time unit (for example: Monday)
          – Calculate average per time unit if incidents were evenly distributed
          – Divide counts within each time unit by the calculated average to generate multiplicative indices
            » Index ~ 1 means at the average
            » Index > 1 means above average
            » Index < 1 means below average
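
The index calculation on slide 46 (count per time unit, divide by the count expected under an even distribution) can be sketched as follows. The function names are illustrative, and the sketch assumes every time unit occurs equally often in the historic period, which is only roughly true for real calendars.

```python
from collections import Counter
from datetime import date


def multiplicative_indices(labels, all_units):
    """Return {time unit: index}, where 1.0 means exactly average."""
    labels = list(labels)
    all_units = list(all_units)
    if not labels:
        raise ValueError("need at least one incident")
    counts = Counter(labels)
    # Average count per unit if incidents were evenly distributed.
    expected = len(labels) / len(all_units)
    return {unit: counts[unit] / expected for unit in all_units}


def day_of_week_indices(incident_dates):
    """Day-of-week cycle: 0 = Monday ... 6 = Sunday (datetime convention)."""
    return multiplicative_indices((d.weekday() for d in incident_dates), range(7))
```

For example, 30 incidents on Mondays, 10 on Tuesdays and 20 on Wednesdays over a three-unit cycle give an average of 20 per unit, so the indices are 1.5, 0.5 and 1.0. The same function applies to the time-of-year and time-of-day cycles by changing the labels.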
  • 47. Load Forecasting
  • 48. Load Forecasting
  • 49. Load Forecasting
  • 50. Load Forecasting
  • 51. Load Forecasting
      • Identify non-cyclical trend
        • Take recent daily counts (for example: last year daily counts)
        • Remove cyclical trends by dividing by indices
        • Run a trending function on the new counts
          – Simple average
            » Last X Days
          – Smoothing function
            » Exponential smoothing
            » Holt’s linear exponential smoothing
  • 52. Load Forecasting
      • Forecast expected count
        • Project trend into future timeframe
          – Always flat
            » Simple average
            » Exponential smoothing
          – Linear trend
            » Holt’s linear exponential smoothing
        • Multiply by seasonal indices to reseasonalize the data
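
Slides 51 and 52 describe deseasonalizing recent counts, fitting a trend, projecting it forward, and reseasonalizing. A minimal sketch of that pipeline is shown below. The smoothing parameters `alpha` and `beta`, the initialization of Holt's method from the first two values, and all function names are assumptions made for illustration; the slides do not say how these were chosen.

```python
def deseasonalize(counts, indices):
    """Remove the cyclical pattern by dividing each count by its index."""
    return [count / index for count, index in zip(counts, indices)]


def exponential_smoothing(values, alpha):
    """Simple exponential smoothing; the resulting forecast is flat."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def holt_linear(values, alpha, beta):
    """Holt's linear exponential smoothing; returns (level, trend)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    level = values[0]
    trend = values[1] - values[0]
    for value in values[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def forecast(level, trend, future_indices):
    """Project the trend forward, then multiply by indices to reseasonalize.

    Pass trend=0.0 for the "always flat" techniques.
    """
    return [(level + trend * (step + 1)) * index
            for step, index in enumerate(future_indices)]
```

With a flat technique, `forecast(exponential_smoothing(values, alpha), 0.0, future_indices)` gives the expected counts; with Holt's method, the level and trend from `holt_linear` are passed in instead.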
  • 53. Load Forecasting
      Measure cyclical patterns + Identify non-cyclical trend
      Forecast expected count
      bit.ly/gorrcrimeforecastingpaper
  • 54. Improving CompStat
  • 55. How Do We Know It’s Accurate?
      • Testing
        • Generated forecasting techniques (examples)
          – Commonly Used
            » Average of last 30 days
            » Average of last 365 days
            » Last year’s count for the same time period
          – Advanced Combinations
            » Different cyclical indices (example: day of year vs. month of year)
            » Different levels of geographic aggregation for indices
            » Different trending functions
        • Scoring methodologies (examples)
          – Mean absolute percent error (with some enhancements)
          – Mean percent error
          – Mean squared error
        • Run thousands of forecasts through testing framework
        • Choose the right technique in the right situation
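
The three scoring methodologies named on slide 55 have standard textbook definitions, sketched below. The "enhancements" mentioned for mean absolute percent error are not described in the slides and are not reproduced here; the sign convention for mean percent error (actual minus forecast) is an assumption. Both percent errors are undefined when an actual count is zero.

```python
def mean_absolute_percent_error(actual, forecast):
    """Average size of the error as a percentage of the actual count."""
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actual, forecast)) / len(actual)


def mean_percent_error(actual, forecast):
    """Signed version: positive means the forecasts ran low on average."""
    return 100.0 * sum((a - f) / a
                       for a, f in zip(actual, forecast)) / len(actual)


def mean_squared_error(actual, forecast):
    """Average squared error; penalizes large misses more heavily."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
```

Running each candidate technique over many historic periods and comparing these scores is one way to "choose the right technique in the right situation".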
  • 56. Ongoing Research
  • 57. Research Topics
      • Risk Forecasting
        – Load forecasting enhancements
          • Weather and special events
        – Combining short and long term risk forecasts (Temple)
          • Socioeconomic changes in neighborhoods
        – Risk Terrain Modeling (Rutgers)
          • Context of crime at the microplace
  • 58. Research Topics
  • 59. Research Topics
      • Risk Forecasting
        – Offender Management
          • Prioritize offenders based upon statistical models using past behaviors
      • Evaluation
        – Automate Randomized Controlled Trials
  • 60. Data Processing for Big (Geo) Data
  • 61. A Story
  • 62. Robert’s Rules of Housing
      • Close to Center City → somewhat important
      • Walk to Grocery Store → vital
      • Nearby Restaurants → very important
      • Library → nice to have
      • Near a Park → somewhat important
      • Biking / walking distance from our work → very important
      • Biking distance to fencing → somewhat important
  • 63. Your factors might include… • Child Care • Local School Rankings • Farmers Market • Car Share • Public Transit
  • 64. We stand on the shoulders of giants
  • 65. Not a new idea … Design with Nature
  • 66. Not a new Idea … Dana Tomlin
  • 67. Desktop GIS
  • 68. Weighted Overlay [Diagram: four layers, weighted x5, x1, x3 and x2, added together into one result]
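
The weighted overlay on slide 68 multiplies each raster layer by a weight and sums the results cell by cell. The sketch below uses plain nested lists and is not GeoTrellis or desktop GIS code; the weights 5, 1, 3 and 2 come from the slide, but which layer carries which weight is not recoverable, so the pairing in any usage is an assumption.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized raster layers."""
    if len(layers) != len(weights):
        raise ValueError("one weight per layer is required")
    rows, cols = len(layers[0]), len(layers[0][0])
    for layer in layers:
        if len(layer) != rows or any(len(row) != cols for row in layer):
            raise ValueError("all layers must share the same dimensions")
    return [
        [sum(weight * layer[r][c] for layer, weight in zip(layers, weights))
         for c in range(cols)]
        for r in range(rows)
    ]
```

In the housing example, each layer would score one factor (for instance distance to a grocery store) and the weights would express how much that factor matters; the highest cells in the result are the best candidate locations.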
  • 69. Summary • Geography-driven Decisions • Iterative • Individual • Web [and Mobile] • Growing data sets
  • 70. Web Challenges
  • 71. Web is different from the Desktop • Lots of simultaneous users • Stateless environment • HTML+JS+CSS • Users are less skilled • Users are less patient
  • 72. But wait … there’s a problem • 10 – 60 second calculation time • Multiple simultaneous users … • … that are impatient
  • 73. Data Challenges
  • 74. Big Data – Social Media
  • 75. Big Data – Science
  • 76. Big Data – Citizen Science
  • 77. Big Data – Cities
  • 78. Early Prototype
  • 79. Specific Optimization Goals • New Raster File Structure • Distributed processing • Binary messaging protocol
  • 80. Optimization: File Format • Limit data type and range • 1D arrays are fast to read/write • Tiled Pyramids • Azavea Raster Grid (ARG)
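
The first two points on slide 80 (a limited data type and a 1D array) can be illustrated with a raster stored as one row-major array of signed 8-bit cells. This is a sketch of the general idea only; the class `FlatRaster` is hypothetical and does not describe the actual Azavea Raster Grid (ARG) layout.

```python
from array import array


class FlatRaster:
    """Raster held as a single row-major 1D array of signed 8-bit cells."""

    def __init__(self, cols, rows, fill=0):
        self.cols = cols
        self.rows = rows
        # Type code "b" limits every cell to the range -128..127.
        self.cells = array("b", [fill]) * (cols * rows)

    def _index(self, col, row):
        if not (0 <= col < self.cols and 0 <= row < self.rows):
            raise IndexError("cell outside raster")
        return row * self.cols + col

    def get(self, col, row):
        return self.cells[self._index(col, row)]

    def set(self, col, row, value):
        self.cells[self._index(col, row)] = value

    def to_bytes(self):
        """One byte per cell, in a single contiguous block."""
        return self.cells.tobytes()

    @classmethod
    def from_bytes(cls, cols, rows, data):
        raster = cls(cols, rows)
        raster.cells = array("b")
        raster.cells.frombytes(data)
        if len(raster.cells) != cols * rows:
            raise ValueError("data length does not match raster size")
        return raster
```

Because the cells sit in one contiguous block, the whole grid can be read or written in a single operation, and the fixed cell size means a cell's position in the file follows directly from its row and column.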
  • 81. Optimization: Distributed Processing
      • Parallelizable - Local Ops and Focal Ops
      • Support multiple
        – Threads
        – Cores
        – CPUs
        – Machines
      • Considered
        – Hadoop
        – Amazon Map Reduce
        – Beowulf
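
Local and focal operations parallelize well because each output cell depends only on the same cell (local) or a small neighborhood (focal) of the input, so rows can be computed independently. The sketch below shows that decomposition with a thread pool. It is illustrative only: the function names are hypothetical, the 3x3 sum is just one example of a focal operation, and in CPython threads demonstrate the structure rather than a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def local_add(a, b):
    """Local op: each output cell uses only the matching input cells."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_sum_row(grid, r):
    """Focal op for one output row: sum of the 3x3 neighborhood, clipped at edges."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for c in range(cols):
        total = 0
        for rr in range(max(0, r - 1), min(rows, r + 2)):
            for cc in range(max(0, c - 1), min(cols, c + 2)):
                total += grid[rr][cc]
        out.append(total)
    return out


def focal_sum(grid, workers=4):
    """Compute every output row as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: focal_sum_row(grid, r), range(len(grid))))
```

The input grid is only read, never written, so the row tasks need no locking; the same split by rows or tiles carries over to processes or machines.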
  • 82. Success!! Reduced from 10-60 seconds to <500 milliseconds
  • 83. Optimizing one process sub-optimizes others
      • Complex to configure and maintain
      • Limited to one operation
      • No interpolation
      • No mixing
        – cell sizes
        – extents
        – projections
        – etc.
  • 84. • Broader set of functionality • Both raster and vector • Scala + Akka • Open source
  • 85. Faster is Different
  • 86. Regional/State: 84 ms • National: 84 ms • Large Country: 115 ms • Continental: 271 ms • Planet: 1.2 – 2.0 s
  • 87. Ongoing R&D
  • 88. GPUs
  • 89. GPU Results
      • Re-wrote a few Map Algebra operations:
        – Local
        – Neighborhood
        – Zonal
        – Viewshed
        – etc.
      • 15 – 120x
      • Large grids
      • Large kernels
  • 90. New Spatial Operations • Vector • Neighborhood/Focal • Spatial Statistics Integration
  • 91. Urban Forest Ecosystem Modeling
  • 92. Crime Analysis, Early Warning and Forecasting
  • 93. Open Source Geoprocessing • GDAL • GeoServer • PostGIS • R • GeoDa
  • 94. Many Thanks! © Photo used with permission from Alphafish, via Flickr.com
  • 95. Big (Geo) Data Science [We are hiring]: Robert Cheetham, cheetham@azavea.com, @rcheetham