Criminal Incident Data Association Using OLAP Technology

  1. Criminal Incident Data Association Using OLAP Technology. Donald E. Brown & Song Lin, Department of Systems & Information Engineering, University of Virginia
  2. Summary
       • In this paper, we combine OLAP (Online Analytical Processing) and data mining to associate criminal incidents.
       • The method is tested with a robbery dataset from Richmond, Virginia.
  3. Objectives of Spatial Knowledge Mining
       • Leverage DBMS (records management), OLAP, and GIS
       • Find spatial-temporal patterns and relationships in data
       • Support crime analysis and information sharing
  4. Related Applications - UVa
       • ReCAP
           • Regional Crime Analysis Program
           • Provides support for regional analysis using RDBMS
           • Requires implementation on each client computer
       • CARV
           • Crime Analysis and Reporting in Virginia
           • Runs on Citrix Metaframe, so the number of concurrent users is limited
       • GRASP
           • Geospatial Repository for Analysis and Safety Planning
           • Web interface for a central repository of criminal incident data and geospatial files
  5. Outline
       • Introduction
       • Existing studies on OLAP & data mining
       • Combined approach
       • Application
       • Conclusions
  6. Introduction (crime association)
       • 80-20 rule: 20% of the criminals commit 80% of the crimes
       • How can we link criminal incidents committed by the same criminal?
       • Start by looking at the same crime types
  7. Theories of criminal behavior (criminology)
       • Rational choice (Clarke and Cornish)
           • Criminals evaluate “benefit” and “risk” and make rational decisions to maximize “profit”.
       • Routine activity (Felson)
           • A ready criminal
           • A suitable target
           • Lack of an effective guardian
  8. Theories of criminal behavior (template)
       • “Template” (Brantingham & Brantingham)
           • The environment sends out cues about its characteristics
           • Criminals use these cues to evaluate targets
           • A template is built to associate certain cues with suitable targets
           • The template is self-reinforcing and enduring
           • A criminal does not have many templates
  9. An operational approach to the theories (template)
       • Criminal incidents committed by the same person show
           • Similar patterns in time
           • Similar patterns in space
           • Similar patterns in MO
       • It is possible to associate incidents from the same person by discovering these patterns
  10. Existing Association Methods & Systems
       • AREST (Badiru et al.)
           • Suspect matching
       • ViCAP (FBI)
           • Incident matching
       • COPLINK (U. Arizona)
           • Links search terms with cases (concept space)
  11. Existing Association Methods & Systems
       • TSM (Brown)
           • Total similarity measures
           • Can be used for matching both incidents and suspects
       • SQL
           • Used by analysts in practice
  12. Comments on existing methods
       • Computer technologies are central to criminal incident association
       • For example
           • MIS
           • Databases
           • Information Retrieval
           • GIS
  13. Comments on existing methods
       • Two additional techniques that enable incident association
           • Data Warehousing / OLAP
           • Data Mining
       • We develop a method that seamlessly integrates OLAP and data mining.
  14. Related Work on OLAP and data mining
       • OLAP
           • Ancestor: OLTP (transactional data)
           • OLAP: summary data for analysis
           • Dimensions:
               • OLAP data is multidimensional
               • A dimension is a numeric or categorical attribute
               • Hierarchical structures exist in dimensions
           • Aggregates:
               • Sum, count, average, max, min, …
  15. OLAP and Data Mining
       • Both are powerful tools for supporting the decision-making process, but
           • OLAP focuses on efficiency and uses few quantitative analysis methods
           • Data mining typically targets 2-D datasets (spreadsheets), not multidimensional OLAP data structures
       • Idea: combine them
  16. Existing studies on combining OLAP and data mining
       • Cubegrade Problem (Imielinski)
           • Generalized version of association rules
           • An association rule describes the change in the “count” aggregate when another constraint is imposed or a “drill-down” operation is performed
           • Other aggregates could also be considered
  17. Existing studies on combining OLAP and data mining
       • Constrained Gradient Analysis
           • Retrieve pairs of OLAP cells that are
               • Quite different in aggregates
               • Similar in dimension (parents, children, siblings)
           • More than one aggregate could be considered simultaneously (e.g., sum and mean).
  18. Existing studies on combining OLAP and data mining
       • Data-driven exploration (Sarawagi)
           • Find “exceptions”
           • Mean and standard deviation are calculated for a cell
           • If the aggregate of the cell is outside (μ − 2.5σ, μ + 2.5σ) → exception
           • OLAP version of the “3σ” rule
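The exception rule on this slide can be sketched in a few lines of Python. This is only an illustration of the 2.5-standard-deviation test: the function name `find_exceptions`, the dictionary layout, and the choice to estimate the mean and standard deviation from the listed cell aggregates themselves are assumptions, not details taken from Sarawagi's work.

```python
from statistics import mean, pstdev

def find_exceptions(aggregates, k=2.5):
    """Return the cells whose aggregate lies outside (mu - k*sigma, mu + k*sigma).

    `aggregates` maps a cell label to its aggregate value (e.g. a count).
    Estimating mu and sigma from these same values is an assumption made
    for this sketch.
    """
    values = list(aggregates.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all cells are identical, so nothing stands out
    return [cell for cell, v in aggregates.items() if abs(v - mu) > k * sigma]

# Hypothetical counts: eleven ordinary cells and one unusually large one.
counts = {f"cell{i}": 10 for i in range(11)}
counts["rare"] = 100
exceptions = find_exceptions(counts)
```

With only a handful of cells no value can be 2.5 standard deviations from the mean, so a test like this only becomes meaningful once enough cells are compared.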
  19. Associating records by finding distinctive values or outliers
       • Basic idea
           • If a group of records shares common characteristics, and these “common” characteristics are unusual (“outliers”), we are more confident in asserting that these records come from the same causal mechanism.
       • Look for distinctive characteristics; the best would be DNA
  20. OLAP-outlier-based method to associate records
       • Rationale for distinctive values or outliers
           • Example: weapon used in robberies
           • “gun”: very common, so incidents are hard to associate
           • “Japanese sword”: distinctive, so incidents likely come from the same person
       • We build an outlier score function to measure this “distinctiveness”
           • Higher score → more distinctive → more confidence in the association
           • It is designed for categorical attributes (MO is important in linking criminal incidents)
  21. Definitions
       • Cell, Parent, Neighbor
           • Cell: a vector of values for some attributes.
           • Parent: the cell obtained by replacing one attribute of the cell with the wildcard element “*”.
           • Neighbor: a group of cells having the same parent.
       • Derived from the OLAP field
  22. Illustration: Cell
  23. Illustration: Parent
       • [Figure: grid of cells over attribute values a1 to a5 and b1 to b4]
       • Cell (a5, b3) has two parents: (a5, *) and (*, b3)
  24. Illustration: Neighbor
       • Neighbor is a collection of cells sharing the same parent
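The cell, parent, and neighbor definitions can be made concrete with a short Python sketch. Representing a cell as a tuple with `"*"` as the wildcard, and the helper names `parents` and `neighbors`, are assumptions chosen for illustration.

```python
WILDCARD = "*"

def parents(cell):
    """All parents of a cell: replace exactly one fixed attribute with '*'."""
    return [
        cell[:i] + (WILDCARD,) + cell[i + 1:]
        for i, value in enumerate(cell)
        if value != WILDCARD
    ]

def neighbors(parent, index, domain):
    """Cells sharing `parent`: fill the wildcard at `index` with each domain value."""
    return [parent[:index] + (v,) + parent[index + 1:] for v in domain]

# The example from the illustration: cell (a5, b3).
cell = ("a5", "b3")
cell_parents = parents(cell)
# Neighbors under the parent (a5, *), assuming attribute b takes values b1..b4.
row = neighbors(("a5", WILDCARD), 1, ["b1", "b2", "b3", "b4"])
```

Here `cell_parents` contains `("*", "b3")` and `("a5", "*")`, matching the two parents named on the slide, and `row` lists the four cells that share the parent `(a5, *)`.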
  25. Outlier Score Function
       • We start building this function from one dimension, and then we generalize to higher dimensions.
       • For one dimension, we have the following two observations.
           • Values with small probability (frequency) are more “unusual”
           • Outlier score is high when the uncertainty level is low.
  26. Observation I
       • [Figure: value with P = 0.1 marked as an outlier]
       • For attribute “color”, value “blond” covers 10% of the records. Hence, it should get a higher outlier score.
  27. Observation II
       • Although both of them have frequency = 0.2, the left one is more “unusual”, because the uncertainty level is low.
  28. Observation III
       • “More evidence”
           • More evidence is better than less → higher outlier score
  29. OSF for One Dimension
       • -log(p) comes from information theory, where p is the probability of a value
       • Entropy measures the information in a message (in this case, in a data record)
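The formula itself did not survive in this transcript, so the sketch below is an assumption rather than the authors' definition. It uses one plausible form that matches the two ingredients named on the slide and the earlier observations: the score is -log(p) divided by the entropy of the attribute, so rare values score higher and low-uncertainty attributes score higher. The function name `osf_1d` is hypothetical.

```python
import math
from collections import Counter

def osf_1d(values, value):
    """Assumed one-dimensional outlier score: -log(p) / entropy.

    `values` is the column of observed attribute values and `value` is the
    value being scored. This form is a guess consistent with the slide,
    not a formula quoted from it.
    """
    counts = Counter(values)
    n = len(values)
    p = counts[value] / n
    if p == 0:
        raise ValueError("value does not occur in the data")
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    if entropy == 0:
        return 0.0  # a single-valued attribute carries no distinguishing information
    return -math.log(p) / entropy

# Two hypothetical attributes in which the scored value has frequency 0.2.
skewed = ["x"] * 8 + ["y"] * 2                                 # low uncertainty
uniform = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]   # high uncertainty
score_skewed = osf_1d(skewed, "y")
score_uniform = osf_1d(uniform, "a")
```

Under this assumed form, `score_skewed` is larger than `score_uniform` even though both values have frequency 0.2, which is the behaviour Observation II asks for.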
  30. OSF for Higher Dimensions
       • For any cell, calculate the sum of the OSF of its parent cell and the one-dimensional OSF of the cell conditional on its neighbors (the cells sharing that parent).
       • Do this calculation for all parent cells.
       • Take the maximum as the outlier score for this cell.
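The recursion described on this slide can be sketched as follows. The one-dimensional score used here, -log(p) / entropy, is an assumed form (the slide deck's formula is not preserved in the transcript), and the record layout, the `"*"` wildcard, and all function names are illustrative.

```python
import math
from collections import Counter

WILDCARD = "*"

def matches(record, cell):
    """True if the record falls inside the cell ('*' matches anything)."""
    return all(c == WILDCARD or c == r for r, c in zip(record, cell))

def osf_1d(column, value):
    """Assumed one-dimensional score: -log(p) / entropy of the column."""
    counts = Counter(column)
    n = len(column)
    if counts[value] == 0:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    if entropy == 0:
        return 0.0
    return -math.log(counts[value] / n) / entropy

def osf(records, cell):
    """Score of a cell: max over parents of OSF(parent) + conditional 1-D score."""
    fixed = [i for i, v in enumerate(cell) if v != WILDCARD]
    if not fixed:
        return 0.0  # the all-wildcard cell contains every record
    best = 0.0
    for i in fixed:
        parent = cell[:i] + (WILDCARD,) + cell[i + 1:]
        # Records in the parent cell define the neighborhood of this cell.
        subset = [r for r in records if matches(r, parent)]
        conditional = osf_1d([r[i] for r in subset], cell[i])
        best = max(best, osf(records, parent) + conditional)
    return best

# Hypothetical (weapon, location) records.
records = ([("gun", "street")] * 6
           + [("gun", "store")] * 2
           + [("sword", "store")] * 2)
sword_score = osf(records, ("sword", WILDCARD))
gun_score = osf(records, ("gun", WILDCARD))
```

Because each conditional term is non-negative in this sketch, fixing an additional attribute never lowers the score, which is in the spirit of the “more evidence” observation.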
  31. Association (using this OLAP-outlier method)
       • For a pair of incidents (A, B)
           • If there is a cell that contains both A and B
           • And the outlier score of this cell is large enough (threshold test)
           • Then associate them
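A minimal sketch of this pairwise test is shown below. It assumes incidents are tuples of categorical values, that the cells containing both incidents are exactly those that fix a subset of the attributes on which they agree, and that “large enough” means at least a threshold. The `score` argument stands in for whatever outlier score function is used; the lookup table in the example is made up.

```python
from itertools import combinations

WILDCARD = "*"

def shared_cells(a, b):
    """Yield every cell (with at least one fixed attribute) containing both a and b."""
    agree = [i for i, (x, y) in enumerate(zip(a, b)) if x == y]
    for size in range(1, len(agree) + 1):
        for idx in combinations(agree, size):
            yield tuple(a[i] if i in idx else WILDCARD for i in range(len(a)))

def associated(a, b, score, threshold):
    """Associate a and b if some shared cell has an outlier score >= threshold."""
    return any(score(cell) >= threshold for cell in shared_cells(a, b))

# Hypothetical precomputed scores for (weapon, location) cells.
table = {("sword", "*"): 3.2, ("gun", "*"): 0.4, ("*", "store"): 1.4}
score = lambda cell: table.get(cell, 0.0)

linked = associated(("sword", "store"), ("sword", "street"), score, threshold=2.0)
not_linked = associated(("gun", "store"), ("gun", "street"), score, threshold=2.0)
```

In this toy example the two sword incidents are linked through the distinctive cell `("sword", "*")`, while the two gun incidents are not, because their only shared cell scores below the threshold.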
  32. Application (dataset)
       • Applied to a robbery dataset (Richmond, VA, 1998)
           • Why robbery?
               • For evaluation purposes
               • More multiple offenses than murder
               • More known suspects than B & E (breaking and entering)
  33. Attributes
       • Three groups of attributes
           • Modus Operandi: categorical
           • Census Features: numeric
           • Distance Features: numeric
  34. Feature Selection
       • Redundant features → feature selection
           • Cluster the features (similar features in the same group)
           • Pick a representative feature for each group
           • Method: k-medoid clustering
               • Applicable to a distance matrix
               • Returns “medoids”
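To illustrate how k-medoid clustering picks representative features from a distance matrix, here is a small pure-Python sketch. It uses a simple alternating scheme (assign each item to its nearest medoid, then recompute each cluster's medoid), which is one common variant and not necessarily the algorithm used by the authors. The naive initialisation and the toy distance matrix are assumptions.

```python
def assign(dist, medoids):
    """Group every item index under its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a square distance matrix `dist`."""
    medoids = list(range(k))  # naive initialisation: the first k items
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid of a cluster: the member with the smallest total distance.
        updated = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids), assign(dist, medoids)

# Toy example: six "features" placed on a line, two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, clusters = k_medoids(dist, 2)
```

For feature selection, the items would be features and the distances some dissimilarity between them; the returned medoids are then the representative features, one per group.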
  35. Feature Selection Result
  36. Final Selected Features
       • Medoids
           • HUNT (housing unit density)
           • ENRL3 (public school enrollment) → POP3 (population: 12-17)
               • more meaningful (attacker and victims)
           • TRAN_PC (transportation expense per capita) → MHINC (median income)
  37. Discretize
       • Discretize these numeric features into bins
           • Similar to a histogram
           • Sturges’ number-of-bins rule
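Sturges' rule sets the number of bins to ⌈log2(n)⌉ + 1 for n observations. The sketch below applies it with equal-width bins, which is an assumption suggested by the histogram analogy; the slides do not say how bin edges were chosen, and the function names are illustrative.

```python
import math

def sturges_bins(n):
    """Number of bins by Sturges' rule: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def discretize(values):
    """Map each numeric value to an equal-width bin index (0-based)."""
    k = sturges_bins(len(values))
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

bins_for_170 = sturges_bins(170)
labels = discretize([0, 1, 2, 3, 4, 5, 6, 7])
```

For instance, a feature observed on 170 records would be split into 9 bins under this rule, after which it can be treated like a categorical attribute.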
  38. Evaluation
       • For incidents with known suspects (170)
           • Generate all incident pairs
           • If a pair of incidents has the same criminal suspect, it is a “true association”
           • Compare the results given by the algorithm with the “true result”
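Building the ground truth from known suspects can be sketched as below. The incident and suspect identifiers are made up, and the sketch assumes one suspect per incident.

```python
from itertools import combinations

def true_associations(suspects):
    """Return the set of incident pairs that share the same known suspect.

    `suspects` maps an incident id to its suspect id (one suspect per
    incident is an assumption of this sketch).
    """
    return {
        frozenset((a, b))
        for a, b in combinations(suspects, 2)
        if suspects[a] == suspects[b]
    }

# Hypothetical incidents with known suspects.
suspects = {"i1": "s1", "i2": "s1", "i3": "s2", "i4": "s1"}
truth = true_associations(suspects)

# Number of unordered pairs among 170 incidents.
n_pairs = len(list(combinations(range(170), 2)))
```

With 170 incidents there are 170 × 169 / 2 = 14,365 unordered pairs to check against the algorithm's output.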
  39. Evaluation Criteria
       • Two measures
           • Detected true associations
               • Larger is better
           • Average number of relevant records
               • Similar to search engines like Google
               • Given one record, the system returns a list
               • Take the average of the lengths of all lists
               • Shorter is better.
  40. Evaluation Criteria (cont.)
       • From information retrieval
           • Recall: ability to provide relevant items
           • Precision: ability to provide only relevant items
       • The 1st measure is “recall”; the 2nd is equivalent to “precision”
       • The 2nd also measures the user effort (in further investigation)
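The two measures can be computed from sets of incident pairs as in the sketch below. It assumes that each predicted pair places each incident on the other's returned list, so the average list length is twice the number of predicted pairs divided by the number of incidents. The function name and the toy data are illustrative.

```python
def evaluate(predicted, truth, n_incidents):
    """Return (detected true associations, recall, average list length).

    `predicted` and `truth` are sets of frozenset pairs of incident ids.
    """
    detected = len(predicted & truth)
    recall = detected / len(truth) if truth else 0.0
    # Every predicted pair adds one entry to each of its two incidents' lists.
    avg_list_length = 2 * len(predicted) / n_incidents
    return detected, recall, avg_list_length

# Hypothetical example with four incidents.
truth = {frozenset(p) for p in [("i1", "i2"), ("i1", "i4"), ("i2", "i4")]}
predicted = {frozenset(p) for p in [("i1", "i2"), ("i1", "i3")]}
detected, recall, avg_len = evaluate(predicted, truth, n_incidents=4)
```

Lowering the association threshold typically raises the first measure and lengthens the lists, so sweeping the threshold traces out a trade-off curve between the two measures.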
  41. Result (OLAP-outlier based)
  42. Result of binary association method (calculating similarity score)
  43. Comparison: Outlier vs. Binary
  44. Comparison (cont.)
       • Generally, the curve of our method lies above the other one
           • For the same accuracy level, this method returns fewer records
           • For the same list length, this method is more accurate
       • The other method is better at the tail
           • However, in that region the average number of relevant records is > 100
           • With only 170 incidents in total, no analyst would investigate 100 of them.
       • Generally, the new method is effective.
  45. Comparison (Outlier vs. Simple Combination)
  46. WebCAT Implementation
       • A secure web environment that can read several data formats and translate them into a uniform standard (XML)
       • Uses free, open-source technology
           • ASP, XML, MapServer, SVG, etc.
       • Provides tools to meet spatial and statistical analysis needs, including association
       • Provides utilities for querying and reporting
  47. Conclusions
       • Developed a new data association method for linking criminal incidents that combines
           • Concepts in OLAP (multidimensional)
           • Ideas in data mining (outlier detection)
       • Testing with a robbery dataset shows promise
       • Deployment through WebCAT provides open-source (XML-based) capability for data access and analysis over the web
  48. Questions?
