Data Philly Meetup - Big (Geo) Data

Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.

    Presentation Transcript

    • Big (Geo) Data Science • Robert Cheetham • cheetham@azavea.com • @rcheetham
    • Web/Mobile • Geospatial • UI/UX Design • High Performance Computing • R&D
    • B Corporation: Projects w/ Social Value • Summer of Maps • Pro Bono Program • Donate share of profits. Research-Driven: 10% Research Program • Academic Collaborations • Open Source
    • Spatial Temporal Forecasting with Philadelphia Crime Data
    • How Phila PD uses Maps: Customized Map Products • Weekly CompStat Meetings • Web Crime Analysis
    • INCT & PARS – main database sources • over 5,000 incidents daily, over 2 million annually • [Diagram labels: Complainant, Verizon, 911, 911 Operator, Radio Dispatcher, CAD, Police Officer, Districts X/Y/Z, 48 Desk, Incident Report Completed by Officer, INCT, PARS, Daily download, Geocoding Routines, Maps distributed through Intranet, Printing, CompStat]
    • The Context: 1,500,000 people • 7,000 police • 1,000 civilian employees • 2,000,000 new incidents / year • 3 crime analysts
    • What we did • Weekly CompStat • Lots of maps • Automation of map creation • Web-based systems
    • … but what if we could … • Accelerate the cycle • Proactively notify • Automate the process
    • Prototype • VB & MapObjects • ArcView • .ini file • Process Documentation • Shapefiles and GRIDs • MS SQL Server Crime Incidents Database
    • … but there was a problem …
    • …it was crap …
    • … sort of.
    • We needed … 1. Better Statistics 2. Notification 3. Simplicity
    • Crime Analysis – What has happened? – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – What is out of the ordinary? – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – What is likely to happen next? – Near Repeat Pattern – Load Forecasting
    • Crime Analysis – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard • Early Warning – Statistical & Threshold-based Hunches (data mining) – Alerting • Risk Forecasting – Near Repeat Pattern – Load Forecasting
    • Crime Analysis
    • Intelligence Dashboard
    • Crime Analysis
    • Early Warning
    • Early Warning • Geographic Early Warning System – A system to alert staff of an unusual situation in a particular location – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found • [Diagram labels: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
    • Early Warning
    • What is a Hunch? • A proposed hypothesis, saved into the system, and continually tested for validity • Incident Attribute Requirements – Location (x, y) – Time (timestamp) – Classification • Hunch Attributes – Location (area) – Time (recent / historic periods) – Classification • Analyses – Statistical Hunch – Threshold Hunch
    • Hunch Parameters: Location • Address & Radius • Precinct/County/Country • Custom Drawn Area • Mass Hunch
    • Hunch Parameters: Time • Statistical Hunch – Recent Past – Historic Past
    • Hunch Parameters: Classification • Category • Time of Day • Narrative
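
The hunch slides above describe filtering incidents by location, time, and classification, then testing the result either against a fixed threshold or against a historic baseline. The following is a minimal sketch of that idea, not HunchLab's implementation. The `Incident` class, the function names, and the Poisson-style z-score used for the statistical test are all assumptions made for illustration; the slides do not say which statistic is used.

```python
import math
from dataclasses import dataclass


@dataclass
class Incident:
    x: float
    y: float
    day: float       # days since an arbitrary epoch (hypothetical time unit)
    category: str


def count_matching(incidents, center, radius, category, day_from, day_to):
    """Count incidents of one category inside a circle and a time window."""
    return sum(
        1
        for i in incidents
        if i.category == category
        and day_from <= i.day < day_to
        and math.dist((i.x, i.y), center) <= radius
    )


def threshold_hunch(incidents, center, radius, category, now, recent_days, threshold):
    """Fire when the recent count reaches a fixed threshold."""
    recent = count_matching(incidents, center, radius, category, now - recent_days, now)
    return recent >= threshold


def statistical_hunch(incidents, center, radius, category, now,
                      recent_days, historic_days, z_cutoff=2.0):
    """Fire when the recent count is unusually high relative to the historic rate.

    Assumption: counts are roughly Poisson, so the standard deviation of the
    expected count is its square root.
    """
    recent_start = now - recent_days
    recent = count_matching(incidents, center, radius, category, recent_start, now)
    historic = count_matching(
        incidents, center, radius, category, recent_start - historic_days, recent_start
    )
    expected = historic * recent_days / historic_days
    if expected == 0:
        return recent > 0
    return (recent - expected) / math.sqrt(expected) >= z_cutoff
```

A saved hunch would be re-evaluated each time new incidents arrive, with an alert sent only when one of these functions returns `True`.
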
    • Hunch Helper
    • Email Alert
    • Hunch Details
    • Risk Forecasting
    • Predictive Analytics? • Prediction vs. Forecasting
    • Near Repeat Pattern Analysis
    • Contagious Crime? • Near repeat pattern analysis • “If one burglary occurs, how does the risk change nearby?”
    • What Do We Mean By Near Repeat? • Repeat victimization – Incident at the same location at a later time (likely related) • Near repeat victimization – Incident at a nearby location at a later time (likely related) • Incident A (place, time) --> Incident B (place, time)
    • Near Repeat Pattern Analysis • The goal: – Quantify short term risk due to near-repeat victimization • “If one burglary occurs, how does the risk of burglary for the neighbors change?” • What we know: – Incident A (place, time) --> Incident B (place, time) • Distance between A and B • Timeframe between A and B • What we need to know: – What distances/timeframes are not simply random?
    • Near Repeat Pattern Analysis • The process – Observe the pattern in historic data – Simulate the pattern in randomized historic data – Compare the observed pattern to the simulated patterns – Apply the non-random pattern to new incidents • An example – 180 days of burglaries in Division 6 of Philadelphia
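
The observe / simulate / compare steps above can be sketched as a small Monte Carlo test. This sketch assumes the randomization is done by shuffling timestamps across the fixed incident locations, which is one common way to break any space-time link; the slides do not state the exact randomization used. The function names and the single distance/time band are hypothetical simplifications.

```python
import math
import random


def close_pair_count(points, times, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            near = math.dist(points[i], points[j]) <= max_dist
            soon = abs(times[i] - times[j]) <= max_days
            if near and soon:
                count += 1
    return count


def near_repeat_test(points, times, max_dist, max_days, runs=999, seed=0):
    """Compare the observed close-pair count with counts from shuffled timestamps."""
    rng = random.Random(seed)
    observed = close_pair_count(points, times, max_dist, max_days)
    shuffled = list(times)
    as_extreme = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # randomize time while keeping locations fixed
        if close_pair_count(points, shuffled, max_dist, max_days) >= observed:
            as_extreme += 1
    # Pseudo p-value: share of runs (plus the observed one) at least as extreme.
    p_value = (as_extreme + 1) / (runs + 1)
    return observed, p_value
```

A small p-value for a given distance and time band suggests that pairs in that band occur more often than chance alone would explain, which is the "not simply random" question posed on the earlier slide.
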
    • Near Repeat Pattern Analysis
    • Near Repeat Pattern Analysis
    • Near Repeat Pattern Analysis
    • Near Repeat Pattern Analysis
    • Near Repeat Pattern Analysis • How can you test your own data? – Near Repeat Calculator • http://www.temple.edu/cj/misc/nr/ • Papers – Near-Repeat Patterns in Philadelphia Shootings (2008) • One city block & two weeks after one shooting – 33% increase in likelihood of a second event • Jerry Ratcliffe, Temple University
    • Contagious Crime?
    • Workload Forecasting
    • Improving CompStat • Workload forecasting • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
    • What Do We Mean By Load Forecasting? • Workload forecasting • Generating aggregate crime counts for a future timeframe using cyclical time series analysis • Measure cyclical patterns + Identify non-cyclical trend • Forecast expected count • bit.ly/gorrcrimeforecastingpaper
    • Load Forecasting • Measure cyclical patterns • Take historic incidents (for example: last five years) • Generate multiplicative seasonal indices – For each time cycle: » time of year » day of week » time of day – Count incidents within each time unit (for example: Monday) – Calculate average per time unit if incidents were evenly distributed – Divide counts within each time unit by the calculated average to generate multiplicative indices » Index ~ 1 means at the average » Index > 1 means above average » Index < 1 means below average
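
The index calculation described above (count per time unit, divided by the count each unit would have under an even distribution) can be sketched as follows. The function name and its `key`/`units` parameters are hypothetical, and the sketch assumes every time unit is equally represented in the history (for example, whole weeks for a day-of-week cycle).

```python
from collections import Counter


def seasonal_indices(incident_times, key, units):
    """Multiplicative index per time unit: observed count / evenly spread count.

    incident_times: one timestamp (or date) per historic incident
    key:            maps a timestamp to its time unit, e.g. day of week
    units:          every possible time unit in the cycle
    """
    units = list(units)
    counts = Counter(key(t) for t in incident_times)
    average = len(incident_times) / len(units)  # count per unit if spread evenly
    return {unit: counts.get(unit, 0) / average for unit in units}
```

For a day-of-week cycle over `datetime.date` values, `key` could be `lambda d: d.weekday()` with `units=range(7)`. An index of 2.0 for Monday would then mean Mondays see twice the average daily count.
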
    • Load Forecasting
    • Load Forecasting
    • Load Forecasting
    • Load Forecasting
    • Load Forecasting • Identify non-cyclical trend • Take recent daily counts (for example: last year daily counts) • Remove cyclical trends by dividing by indices • Run a trending function on the new counts – Simple average » Last X Days – Smoothing function » Exponential smoothing » Holt’s linear exponential smoothing
    • Load Forecasting • Forecast expected count • Project trend into future timeframe – Always flat » Simple average » Exponential smoothing – Linear trend » Holt’s linear exponential smoothing • Multiply by seasonal indices to reseasonalize the data
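
The trend and forecast steps on the two slides above (divide out the seasonal indices, fit a trend, project it, multiply the indices back in) can be sketched with Holt's linear exponential smoothing. The function names, the smoothing parameters, and the initialization from the first two values are assumptions chosen for this sketch, not details from the talk.

```python
def deseasonalize(counts, indices):
    """Remove the cyclical pattern by dividing each count by its seasonal index."""
    return [count / index for count, index in zip(counts, indices)]


def holt_linear(values, alpha=0.5, beta=0.5):
    """Holt's linear exponential smoothing; returns the final (level, trend).

    Needs at least two values. Level starts at the first value and trend at
    the first difference (one simple initialization among several).
    """
    level = values[0]
    trend = values[1] - values[0]
    for y in values[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def forecast(level, trend, future_indices):
    """Project the linear trend forward, then reseasonalize each step."""
    return [
        (level + step * trend) * index
        for step, index in enumerate(future_indices, start=1)
    ]
```

Setting `trend` to zero in `forecast` gives the "always flat" projection that the slide associates with a simple average or plain exponential smoothing.
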
    • Load Forecasting • Measure cyclical patterns + Identify non-cyclical trend • Forecast expected count • bit.ly/gorrcrimeforecastingpaper
    • Improving CompStat
    • How Do We Know It’s Accurate? • Testing • Generated forecasting techniques (examples) – Commonly Used » Average of last 30 days » Average of last 365 days » Last year’s count for the same time period – Advanced Combinations » Different cyclical indices (example: day of year vs. month of year) » Different levels of geographic aggregation for indices » Different trending functions • Scoring methodologies (examples) – Mean absolute percent error (with some enhancements) – Mean percent error – Mean squared error • Run thousands of forecasts through testing framework • Choose the right technique in the right situation
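
The three scoring measures named above have standard textbook definitions, sketched below. The "enhancements" to mean absolute percent error mentioned on the slide are not specified, so this sketch uses the plain form and assumes every actual count is nonzero.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average size of the error as a percentage of the actual count."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mean_percent_error(actual, predicted):
    """Signed version: reveals bias (positive means forecasts run low)."""
    return 100 * sum((a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared error; penalizes large misses more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Scoring each candidate technique on held-out periods with functions like these is one way to "choose the right technique in the right situation".
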
    • Ongoing Research
    • Research Topics • Risk Forecasting – Load forecasting enhancements • Weather and special events – Combining short and long term risk forecasts (Temple) • Socioeconomic changes in neighborhoods – Risk Terrain Modeling (Rutgers) • Context of crime at the microplace
    • Research Topics
    • Research Topics • Risk Forecasting – Offender Management • Prioritize offenders based upon statistical models using past behaviors • Evaluation – Automate Randomized Controlled Trials
    • Data Processing for Big (Geo) Data
    • A Story
    • Robert’s Rules of Housing • Close to Center City: somewhat important • Walk to Grocery Store: vital • Nearby Restaurants: very important • Library: nice to have • Near a Park: somewhat important • Biking / walking distance from our work: very important • Biking distance to fencing: somewhat important
    • Your factors might include… • Child Care • Local School Rankings • Farmers Market • Car Share • Public Transit
    • We stand on the shoulders of giants
    • Not a new idea … Design with Nature
    • Not a new Idea … Dana Tomlin
    • Desktop GIS
    • Weighted Overlay • [Figure: four raster layers multiplied by weights x5, x1, x3, x2 and summed into one result]
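
A weighted overlay multiplies each input raster by a weight and sums the results cell by cell. The sketch below uses plain nested lists; the function name and the assignment of the slide's weights (5, 1, 3, 2) to particular housing factors are hypothetical.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized rasters (lists of rows)."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(weight * layer[r][c] for layer, weight in zip(layers, weights))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 2x2 suitability layers: 1 = factor present in the cell, 0 = absent.
grocery = [[1, 0], [0, 1]]
restaurants = [[1, 1], [0, 0]]
park = [[0, 1], [0, 1]]
library = [[0, 0], [1, 1]]

score = weighted_overlay([grocery, restaurants, park, library], [5, 1, 3, 2])
```

Each output cell depends only on the same cell in every input layer, which is why the operation is easy to split across threads or machines.
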
    • Summary • Geography-driven Decisions • Iterative • Individual • Web [and Mobile] • Growing data sets
    • Web Challenges
    • Web is different from the Desktop • Lots of simultaneous users • Stateless environment • HTML+JS+CSS • Users are less skilled • Users are less patient
    • But wait … there’s a problem • 10 – 60 second calculation time • Multiple simultaneous users … • … that are impatient
    • Data Challenges
    • Big Data – Social Media
    • Big Data – Science
    • Big Data – Citizen Science
    • Big Data – Cities
    • Early Prototype
    • Specific Optimization Goals • New Raster File Structure • Distributed processing • Binary messaging protocol
    • Optimization: File Format • Limit data type and range • 1D arrays are fast to read/write • Tiled Pyramids • Azavea Raster Grid (ARG)
    • Optimization: Distributed Processing • Parallelizable – Local Ops and Focal Ops • Support multiple – Threads – Cores – CPUs – Machines • Considered – Hadoop – Amazon Map Reduce – Beowulf
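
Focal operations parallelize well because each output cell reads only a small neighborhood of the input. The sketch below computes a 3x3 focal sum over a raster stored as a row-major 1D array, splitting the rows into bands that are processed independently. It illustrates the decomposition only; it is not the ARG format or the GeoTrellis implementation, and the function names are hypothetical. In CPython, threads will not speed up this pure-Python loop because of the global interpreter lock, so treat the thread pool as a stand-in for cores or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def focal_sum_rows(data, cols, rows, row_start, row_end):
    """3x3 focal sum for rows [row_start, row_end) of a row-major 1D raster."""
    out = []
    for r in range(row_start, row_end):
        for c in range(cols):
            total = 0
            # Clamp the window at the raster edges.
            for rr in range(max(r - 1, 0), min(r + 2, rows)):
                for cc in range(max(c - 1, 0), min(c + 2, cols)):
                    total += data[rr * cols + cc]
            out.append(total)
    return out


def focal_sum(data, cols, rows, workers=2):
    """Split the raster into row bands and compute each band independently."""
    band = max(1, -(-rows // workers))  # ceiling division, at least one row
    starts = range(0, rows, band)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda s: focal_sum_rows(data, cols, rows, s, min(s + band, rows)),
            starts,
        )
        return [value for part in parts for value in part]
```

Every band reads the shared input but writes only its own output, so no locking is needed and the result does not depend on the number of workers.
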
    • Success!! Reduced from 10-60 seconds to <500 milliseconds
    • Optimizing one process sub-optimizes others • Complex to configure and maintain • Limited to one operation • No interpolation • No mixing – cell sizes – extents – projections etc.
    • Broader set of functionality • Both raster and vector • Scala + Akka • Open source
    • Faster is Different
    • Regional/State: 84 ms • National: 84 ms • Large Country: 115 ms • Continental: 271 ms • Planet: 1.2 – 2.0 s
    • Ongoing R&D
    • GPUs
    • GPU Results • Re-wrote a few Map Algebra operations: Local, Neighborhood, Zonal, Viewshed, etc. • 15 – 120x • Large grids • Large kernels
    • New Spatial Operations • Vector • Neighborhood/Focal • Spatial Statistics Integration
    • Urban Forest Ecosystem Modeling
    • Crime Analysis, Early Warning and Forecasting
    • Open Source Geoprocessing • GDAL • GeoServer • PostGIS • R • GeoDa
    • Many Thanks! © Photo used with permission from Alphafish, via Flickr.com
    • Big (Geo) Data Science [We are hiring] • Robert Cheetham • cheetham@azavea.com • @rcheetham