Highly Relevant Search Result Ranking for Large Law Enforcement Information Sharing Systems - By Ronald Mayer

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Published in: Technology, Business, Education

Transcript of "Highly Relevant Search Result Ranking for Large Law Enforcement Information Sharing Systems - By Ronald Mayer"

  1. Highly Relevant Search Result Ranking for Law Enforcement
      Ronald Mayer, Forensic Logic, Inc., ramayer@forensiclogic.com, 2011-05-26
      Police car photo by davidsonscott15 (Scott Davidson) on Flickr, under a CC BY 2.0 license
  2. What I Will Cover
      Highly Relevant Search Result Ranking for Large Law Enforcement Information Sharing Systems
      Who I am
      • Ron Mayer, CTO at Forensic Logic.
      The challenge / problem
      • Ranking law enforcement documents has interesting challenges.
      Three interesting challenges:
      • Many factors affect relevance for a law-enforcement user
      • A mix of structured, unstructured, and semi-structured data
      • Improving edismax sub-phrase boosting
      Conclusion
      • Solr's flexibility and community are both great.
  3. My Background
      Ron Mayer, CTO of Forensic Logic, Inc.
      • We power crime analysis and cross-agency search tools for the LEAP (Law Enforcement Analysis Portal) project.
      • About 150 state, local, and federal law enforcement agencies use our SaaS software to analyze and share data.
      My background
      • 8 years of delivering software technologies to law enforcement as SaaS solutions.
      • Use some F/OSS, quite a bit of proprietary.
      • Play well with F/OSS projects (contributed code back to PostgreSQL, PostGIS, and a memcached client, plus earlier contributions from school that found their way into various projects).
  4. The Challenge
      Problem I set out to solve
      • We had a good but complex database-based crime analysis package for investigators with good computer skills.
      • We needed an easy "Google-like" interface that any officer could use.
      Considerations
      • Most officers don't want to sit at a desk filling out search forms.
      • They want something like Google: type a guess, and get the most relevant documents on the first page.
      Key hurdles we had to overcome
      • What factors even define "the most relevant" document?
      • Extremely disparate data (some almost totally structured; some totally unstructured; most a mix).
      • How do we implement ranking?
  5. Project background
  6. Project background
      • Started 8 years ago with a desktop crime analysis application; later ported to a web application.
      • Big structured search forms worked well for crime analysts and detectives who can invest time at a desk.
      • Some users wanted a quicker, easier, simple search.
  7. Project background
      Prototyped with Project Blacklight
      • Wonderful F/OSS community.
      • We just added to its facet list in a config file.
      • Constructive feedback from customers within a couple of weeks.
  8. Project background
      Eventually rewrote it with many law-enforcement-centric features.
  9. Search Relevance for Law Enforcement Users
  10. Search Relevance for Law Enforcement Users
      Searches often contain multiple clauses
      • red baseball cap black leather jacket tall male suspect short asian victim
      • These search clauses are often noun clauses with a few adjectives preceding a noun, but the clauses are often independent of each other.
      Fuzzy searches are common
      • Victims give incomplete descriptions.
      • Suspects lie.
      • Close counts.
  11. Search Relevance for Law Enforcement Users
      Geospatial factors
      • Officers are often interested in things near their own city or beat.
        – Solr does this one well for one location of interest in a document:
            bf=... recip(dist(2,primary_latlon,vector(#{lat},#{lon})),1,1,1)^0.5
        – I haven't yet found a great solution for documents with many locations of interest (say, a document regarding a gang importing drugs from Ciudad Juárez, Mexico to Denver, which should be highly relevant to every city touching the southern half of I-25).
      • Often law enforcement officers want to search for documents near a certain type of landmark:
        – "near any elementary school in the school district"
        – "near a particular school"
        – "in a predominantly Hispanic neighborhood"
        – "near a freeway"
      • Sometimes it is more convenient to interact with a map and use Solr's geospatial features. Sometimes it is more convenient to tag the documents with the relevant phrases.
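To make the distance boost on slide 11 concrete, here is a minimal Python sketch of the arithmetic behind that `bf` expression. It assumes Solr's documented function definitions, where `recip(x,m,a,b)` evaluates to `a/(m*x+b)` and `dist(2, ...)` is the Euclidean distance over the raw coordinate values (degrees here, not miles). The helper names and sample coordinates are illustrative and not taken from the talk.

```python
import math

def recip(x, m, a, b):
    """Mirror of Solr's recip(x, m, a, b): a / (m*x + b)."""
    return a / (m * x + b)

def euclidean_dist(lat1, lon1, lat2, lon2):
    """dist(2, ...) is the 2-norm over the raw coordinate values (degrees)."""
    return math.hypot(lat1 - lat2, lon1 - lon2)

def distance_boost(doc_lat, doc_lon, user_lat, user_lon, weight=0.5):
    """Additive boost: weight * recip(dist, 1, 1, 1), as in the bf expression."""
    d = euclidean_dist(doc_lat, doc_lon, user_lat, user_lon)
    return weight * recip(d, 1, 1, 1)

if __name__ == "__main__":
    # A document at the officer's own location gets the full weight;
    # the boost decays smoothly as the document moves away.
    for offset in (0.0, 0.1, 1.0, 10.0):
        print(offset, distance_boost(37.8 + offset, -122.2, 37.8, -122.2))
```

Because the distance is measured in degrees, the constants in `recip` would need tuning if a different distance unit were used.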
  12. Search Relevance for Law Enforcement Users
      Advanced geospatial searches
      • Not having a lot of luck with Solr/Lucene here yet.
      • Often these involve intersecting polygons:
        – Just off I-5
        – Walking distance from a junior high school
      • We do it in a more complex app with PostGIS.
        – Would love to be able to click a school or road on a map, and use that to filter or sort Solr results.
  13. Search Relevance for Law Enforcement
      Temporal factors
      • Absolute time: recent documents are often more interesting than very old documents.
        – Solr handles this well with:
            Dismax's bf="recip(ms(NOW,primary_date),3.16e-11,1,1)^2 ..."
            Edismax's boost=recip(ms(NOW,primary_date),3.16e-11,1,1)&boost=
        – Unless you have expressions that can hit 0, edismax's multiplicative boost seems easier to balance against other boosting factors.
      • Relative time: gang retaliations often happen near each other in time.
        – Can replace "NOW" in the above with some other date of interest.
      • Time of day: certain robbers and burglars like to work at certain times of the day (payday after work; dusk; at Raiders games).
        – Can handle as a range facet, and/or by tagging documents with phrases for text search.
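The constant `3.16e-11` on slide 13 is roughly one divided by the number of milliseconds in a year, so the recency boost falls to about one half for a one-year-old document. The following Python sketch reproduces that arithmetic, assuming Solr's definition `recip(x,m,a,b) = a/(m*x+b)`; the function and constant names are illustrative.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.156e10 milliseconds

def recency_boost(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Mirror of recip(ms(NOW, primary_date), 3.16e-11, 1, 1)."""
    return a / (m * age_ms + b)

if __name__ == "__main__":
    # Brand-new documents score 1.0; the boost roughly halves after one year
    # and keeps shrinking slowly for older documents.
    for years in (0, 1, 5, 20):
        print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```

Since the value never reaches zero, it is safe to use as a multiplicative edismax `boost` as well as an additive dismax `bf`.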
  14. Search Relevance for Law Enforcement
      Some parts of a document are more important than other parts
      • A search for "John Doe" should rank documents where he's the arrestee (or subject, etc.) over those where he's an innocent bystander (or witness or victim, etc.).
      • Handled nicely by Solr's dismax and edismax "qf=important_text^2 less_important_text" feature.
      Important parts of a document can depend a lot on the content of the document itself.
      • For a sexual assault, characteristics of a victim like the victim's age and gender can be very "important", while the make/model of her car will be unimportant. For a vehicle theft, the age and gender of the victim will be less important while the make/model of the car will be more important.
      • Handled reasonably by having logic in the indexer to place some data into different text fields, and by having the app server tweak the boosts in the qf= expression as needed.
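One way an app server might tweak the `qf=` boosts per request is sketched below in Python. This is only an illustration under stated assumptions: the field names `vehicle_text` and `victim_text`, the offense categories, and every boost value are hypothetical, and the talk does not say how the offense type is determined at query time.

```python
# Default boosts, matching the qf example on slide 14.
BASE_BOOSTS = {"important_text": 2.0, "less_important_text": 1.0}

# Hypothetical per-offense adjustments; fields and values are made up.
OFFENSE_BOOSTS = {
    "vehicle_theft": {"vehicle_text": 3.0, "victim_text": 0.5},
    "sexual_assault": {"victim_text": 3.0, "vehicle_text": 0.5},
}

def build_qf(offense_type=None):
    """Return a dismax/edismax qf string such as 'important_text^2 less_important_text^1'."""
    boosts = dict(BASE_BOOSTS)
    boosts.update(OFFENSE_BOOSTS.get(offense_type, {}))
    return " ".join(f"{field}^{boost:g}" for field, boost in boosts.items())

if __name__ == "__main__":
    print(build_qf())
    print(build_qf("vehicle_theft"))
```

Keeping the boosts in a table makes it easy to adjust them without touching the indexer, which only has to route data into the right fields.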
  15. Search Relevance for Law Enforcement
      Some documents are more important than others.
      • An active warrant on a person is more important than an inactive one.
      • An unsolved homicide is more important than a noise complaint that was decided to be unfounded.
      • A document with complete descriptions is more important (well, or at least more actionable) than a very incomplete form that was abandoned.
      Handled with the dismax bf=sqrt(importance) parameter and similar edismax boost= parameters.
  16. Search Relevance for Law Enforcement
      Exact matches with text from the source document are weighted more heavily than speculative guesses from our algorithms.
      • We tag documents with additional terms that weren't necessarily in the source document.
        – Some of this is done by Solr:
            Stemming
            Synonyms
        – Some approximations and guesses are done by our indexers:
            6'4" -> tall
            "lat = 37.799, lon = -122.161" -> "Near Skyline High School"
            8:00pm -> dusk (at certain times of the year); night (at others)
      • But these additional tags carry less weight in ranking than the source document.
      Handled well by Solr's
      • "qf=source_document^10 stemmed_text^1 speculative_guesses^0.1"
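A rough Python sketch of how an indexer could keep such guesses out of the source text follows. It assumes a three-digit height code such as `604` means 6 ft 4 in (which is how the sample data and English rendering on the later slides line up) and uses a hypothetical cutoff for "tall"; neither the cutoff nor the function names come from the talk.

```python
TALL_THRESHOLD_INCHES = 74  # hypothetical cutoff, not from the talk

def height_code_to_inches(code):
    """Interpret a code such as '604' as 6 ft 4 in (assumed convention)."""
    feet, inches = divmod(int(code), 100)
    return feet * 12 + inches

def build_index_fields(narrative, height_code):
    """Keep original text and algorithmic guesses in separate fields."""
    guesses = []
    if height_code_to_inches(height_code) >= TALL_THRESHOLD_INCHES:
        guesses.append("tall")
    return {
        "source_document": narrative,               # boosted heavily in qf
        "speculative_guesses": " ".join(guesses),   # boosted lightly in qf
    }

if __name__ == "__main__":
    print(build_index_fields("Suspect fled on foot.", "604"))
```

Because the guesses live in their own field, an "exact match" search can simply leave that field out of `qf`.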
  17. Search Relevance for Law Enforcement
      Keyword density matters
      • The Lucene SweetSpotSimilarity feature seems to give nicer results than the old default.
      • We're experimenting with our own that may work better with our mixed structured/unstructured content.
  18. Disparate data
  19. Disparate data from many sources
      [Diagram labels: City, County, Law Enforcement]
  20. Mixed structured/semi-structured/unstructured data
      [Diagram labels: City, County, Courts, Law Enforcement]
  21. Mixed structured/semi-structured/unstructured data
      [Diagram labels: City, County, Federal, Jails, Courts, Law Enforcement]
  22. Aren't there standards to deal with that?
      XML, etc.?
  23. Aren't there standards to deal with that?
      Of course! And the best part is there are many to choose from :)
      Many federal efforts
      • GJXDM ("Global Justice XML Data Model") 1.0, 2.0, 3.0.3 (2005)
      • NIEM (outgrowth of GJXDM + DHS(FBI) + ODNI)
        – NIEM 1.0 (2006), NIEM 2.0 (2007), 2.1 (2009)
      • LEXS: extends subsets of NIEM
      • EDXL (DHS, EIC), "Emergency Data Exchange Language"
        – Not really designed for law enforcement, but with data relevant to police, and less US-centric in person names and addresses.
      And many states define their own XML standards (which are often extensions to NIEM subsets, like the Texas Path to NIEM).
  24. Aren't there standards to deal with that?
      But many of our data sources aren't that ready to adopt federal standards.
        – Small cities whose record management system is a folder of Word documents.
        – Old mainframe computers where every developer has retired.
        – Even when agencies are using standardized XML, the most interesting content's not in the structured part.
      "The first suspect is described as a tall, heavyset, light-skinned black male, possibly half Italian, with 2 inch knots or dreads in his hair with a light brown mustache. He was in possession of a small caliber handgun."
  25. Aren't there standards to deal with that?
      But many of our data sources aren't that ready to adopt federal standards. And some never will.
  26. Mix of structured/semi-structured/unstructured data
      Typical searches from our users
      • tall red haired blue eyed teen male with dragon tattoo
      • "Johnnie Doe" dallas
      • Burglar broke rear bedroom window, stole jewelry
      Typical data we get
        <?xml version="1.0" encoding="UTF-8"?>
        <SomeXMLContainer>
          [... hundreds more lines...]
          <Incident>
            <nc:ActivityDate>
              <nc:DateTime>2007-01-01T10:00:00</nc:DateTime>
            </nc:ActivityDate>
          </Incident>
          [... hundreds more lines...]
          <tx:SubjectPerson s:id="Subject_id">
            <nc:PersonBirthDate>
              <nc:Date>1970-01-01</nc:Date>
            </nc:PersonBirthDate>
            <nc:PersonEthnicityCode>N</nc:PersonEthnicityCode>
            <nc:PersonEyeColorCode>BLU</nc:PersonEyeColorCode>
            <nc:PersonHeightMeasure>
              <nc:MeasurePointValue>604</nc:MeasurePointValue>
            </nc:PersonHeightMeasure>
            <nc:PersonName>
              <nc:PersonGivenName>Jonathan</nc:PersonGivenName>
              <nc:PersonMiddleName>William</nc:PersonMiddleName>
              <nc:PersonSurName>Doe</nc:PersonSurName>
              <nc:PersonNameSuffixText>III</nc:PersonNameSuffixText>
            </nc:PersonName>
            <nc:PersonPhysicalFeature>
              <nc:PhysicalFeatureDescriptionText>Green Dragon Tattoo</nc:PhysicalFeatureDescriptionText>
              <nc:PhysicalFeatureLocationText>Arm</nc:PhysicalFeatureLocationText>
            </nc:PersonPhysicalFeature>
            <nc:PersonRaceCode>W</nc:PersonRaceCode>
            <nc:PersonSexCode>M</nc:PersonSexCode>
            <nc:PersonSkinToneCode>RUD</nc:PersonSkinToneCode>
            <nc:PersonHairColorCode>RED</nc:PersonHairColorCode>
            <nc:PersonWeightMeasure>
              <nc:MeasurePointValue>150</nc:MeasurePointValue>
            </nc:PersonWeightMeasure>
            [... dozens more lines of xml about the person ...]
          </tx:SubjectPerson>
          [... hundreds more lines of xml...]
          <tx:Location s:id="Subjects_Home_id">
            <nc:LocationAddress>
              <nc:AddressFullText>1 Main St</nc:AddressFullText>
              <nc:StructuredAddress>
                <nc:LocationCityName>Dallas</nc:LocationCityName>
                <nc:LocationStateName>Texas</nc:LocationStateName>
                <nc:LocationCountryName>USA</nc:LocationCountryName>
                <nc:LocationPostalCode>54321</nc:LocationPostalCode>
                <...
  27. De-structuring structured data
      Typical searches done by users
      • tall blue eyed teen male with dragon tattoo
      • "Johnnie Doe" "red hair" dallas
      Typical data we get
        The same XML as on slide 26, except that the birth date is 1990-01-01 and the excerpt continues through </nc:StructuredAddress> </nc:LocationAddress> ...
      One nice trick for Solr:
      • Convert XML to English.
        – "Jonathan Doe, a tall (6'4") red haired blue eyed teen (17 year old) white male of Dallas TX was arrested at 1 Main St on Jan 1. Possible nicknames, johnny, william, bill, billy ..."
  28. De-structuring structured data
      Typical searches done by users
      • tall blue eyed teen male with dragon tattoo
      • "Johnnie Doe" "red hair" Dallas
      Solution:
      • Convert XML to English.
        – "Jonathan Doe, a tall (6'4") red haired blue eyed teen (17 year old) white male of Dallas TX was arrested at 1 Main St at 0456 Jan 1, 1999 (1999-01-01 04:56.) Possible nicknames, johnny, william, bill, billy ..."
      • A little more subtle than that:
        – Terms generated by our speculative algorithms (possible nicknames, tall, etc.) are put in a separate lower-weighted text field that the users can exclude when doing "exact match" searches.
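The XML-to-English idea can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. This is not the author's tool: the namespace URI is a placeholder, the sample document is a cut-down version of the slide's XML, and the code-to-phrase tables cover only the values shown on the slides.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for illustration only.
NS = {"nc": "urn:example:niem-core"}

SAMPLE = """<SubjectPerson xmlns:nc="urn:example:niem-core">
  <nc:PersonName>
    <nc:PersonGivenName>Jonathan</nc:PersonGivenName>
    <nc:PersonSurName>Doe</nc:PersonSurName>
  </nc:PersonName>
  <nc:PersonEyeColorCode>BLU</nc:PersonEyeColorCode>
  <nc:PersonHairColorCode>RED</nc:PersonHairColorCode>
  <nc:PersonSexCode>M</nc:PersonSexCode>
</SubjectPerson>"""

# Tiny lookup tables covering only the codes in the sample.
HAIR = {"RED": "red haired"}
EYES = {"BLU": "blue eyed"}
SEX = {"M": "male"}

def text_of(elem, path):
    """Return the stripped text at path, or None when the element is missing."""
    node = elem.find(path, NS)
    return node.text.strip() if node is not None and node.text else None

def person_to_english(xml_string):
    """Render a person element as a short searchable English sentence."""
    person = ET.fromstring(xml_string)
    name_parts = [
        text_of(person, "nc:PersonName/nc:PersonGivenName"),
        text_of(person, "nc:PersonName/nc:PersonSurName"),
    ]
    traits = [
        HAIR.get(text_of(person, "nc:PersonHairColorCode")),
        EYES.get(text_of(person, "nc:PersonEyeColorCode")),
        SEX.get(text_of(person, "nc:PersonSexCode")),
    ]
    name = " ".join(p for p in name_parts if p)
    description = " ".join(t for t in traits if t)
    return f"{name}, a {description}."

if __name__ == "__main__":
    print(person_to_english(SAMPLE))
```

Unknown codes are simply skipped, so a partially filled record still yields a usable sentence.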
  29. De-structuring structured data
      We've developed a pretty nice NIEM(*)-to-human-friendly-English-text tool that enables users uncomfortable with databases to search their agency's structured data much as they would google something.
      Side benefit: it is easier to fit one text field on a mobile phone than search forms with many dozen fields.
      * NIEM is a large government XML standard often used for law enforcement information exchange. Much of our data is sent to us in this format or closely related ones; for other data sources we map the data to NIEM as an early part of our import pipeline.
  30. De-structuring structured data
      Another example: vehicle VIN numbers
      • Translate "1N19G9J100001"
      • To "The VIN number suggests the vehicle is a 1979 4-door Chevrolet (Chevy) Caprice" in one of our speculative-content fields.
      • (But only if the document didn't already have this information.)
  31. De-structuring structured data
      Another example: GPS coordinates
      • Translate "37.799,-122.161"
      • To "Near Skyline High School" in one of our speculative-content fields.
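A minimal Python sketch of the coordinate-to-landmark translation is shown below, assuming a small in-memory gazetteer and a great-circle (haversine) distance. The talk does not describe how the lookup is implemented; the second landmark, the 0.5 km radius, and the function names are hypothetical.

```python
import math

# Hypothetical gazetteer; the first entry uses the coordinates from the slide.
LANDMARKS = [
    ("Skyline High School", 37.799, -122.161),
    ("Example Elementary School", 37.750, -122.200),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))

def landmark_phrase(lat, lon, max_km=0.5):
    """Return 'Near <landmark>' for the closest landmark within max_km, else None."""
    name, dist = min(
        ((n, haversine_km(lat, lon, la, lo)) for n, la, lo in LANDMARKS),
        key=lambda pair: pair[1],
    )
    return f"Near {name}" if dist <= max_km else None

if __name__ == "__main__":
    print(landmark_phrase(37.799, -122.161))
```

The resulting phrase would go into a speculative-content field, so that a text search for the landmark name can find the document without outranking exact matches.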
  32. De-structuring structured data
      And (coming soon) also translate "37.799,-122.161" to "Room number XXX in Building YYY at Skyline High".
  33. Improving phrase searches
  34. Improving phrase searches
      Dismax's "pf" (Phrase Fields) and "ps" (Phrase Slop) are very useful.
      • pf: can be used to "boost" the score of documents in cases where all of the terms in the "q" param appear in close proximity.
      • ps: the amount of slop on phrase queries built for "pf" fields (affects boosting).
  35. Improving phrase searches
      Dismax's "pf" (Phrase Fields) and "ps" (Phrase Slop) are very useful.
      • A high-boost "pf" with 0 "ps" is great for ensuring that our very most relevant documents show up at the very top of the search results.
      • A modest-boost "pf" with a largish "ps" (paragraph sized) is great for ensuring that quite relevant documents appear on the first page of results.
      Examples:
      • If an exact phrase matches, it's probably the document the user is looking for.
      • If a single paragraph contains all the words of a user's search, it's probably relevant too.
  36. Improving phrase searches
      Edismax's pf2 and pf3 are even more powerful.
      • A modest "pf2" with a relatively small "ps" (about noun-clause sized) is excellent for searching for adjective/noun clauses.
      Examples:
      • Document text: "The suspect was a tall thin teen male wearing a red baseball cap and black leather jacket"
      • Quite relevant for searches for "black jacket", "tall male", "leather jacket", etc.
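Edismax's `pf2` turns each pair of adjacent query words into its own phrase query (and `pf3` does the same with triples). The short Python sketch below shows which word pairs a query produces; it only illustrates the chunking and does not reproduce Solr's analysis chain (no stemming or stop-word handling). The function name is illustrative.

```python
def word_shingles(query, size=2):
    """Split a query into overlapping word n-grams, as pf2 (size=2) and pf3 (size=3) do."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

if __name__ == "__main__":
    q = "red baseball cap black leather jacket"
    print(word_shingles(q, 2))  # word pairs boosted by pf2
    print(word_shingles(q, 3))  # word triples boosted by pf3
```

With a small slop on each pair, a query pair such as "black jacket" can still match document text like "black leather jacket", which is why a noun-clause-sized slop works well here.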
  37. SOLR-2058 – best of both
      So with some experimentation, for our docs:
      • We want a high pf with a very small (0) ps
      • We want a low pf with a large ps
      • We want a moderate pf2 with a moderate ps
      Solution
      • SOLR-2058
      • ...&pf2=text^10~10&pf=text^100&pf=text~100
      • Your constants may change depending on how much you weigh other boosting factors like document age or distance.
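As a sketch of how a client might assemble these parameters, the Python below builds an edismax query string with per-field slop. It writes the entries in the `field~slop^boost` order used in the working configuration on slide 38; the exact syntax should be checked against the Solr version in use. The field name `text`, the boost values, and the helper names are illustrative, and no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs

def phrase_boost_params(field="text"):
    """Three phrase boosts: exact phrase, loose whole-query phrase, loose word pairs."""
    return {
        "defType": "edismax",
        # High boost with no slop, plus an unboosted entry with paragraph-sized slop.
        "pf": f"{field}^100 {field}~100",
        # Moderate boost for adjacent word pairs with noun-clause-sized slop.
        "pf2": f"{field}~10^10",
    }

def build_query_string(user_query, field="text"):
    """URL-encode the user's query together with the phrase-boost parameters."""
    params = {"q": user_query, "qf": field}
    params.update(phrase_boost_params(field))
    return urlencode(params)

if __name__ == "__main__":
    qs = build_query_string("red baseball cap black leather jacket")
    print(qs)
    print(parse_qs(qs)["pf"])
```

Keeping the boosts in one helper makes it easier to rebalance them against the age and distance boosts discussed earlier.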
  38. SOLR-2058 – best of both
      This worked pretty well for us when we first implemented it:
        "pf"  => "source_doc~1^500 text_stem~1^100 source_doc~50^50 text_stem~20^50",
        "pf3" => "text_unstem~1^250",
        "pf2" => "text_stem^50 text_stem~10^10 text_unstem~10^10",
        "ps"  => 1,
      Scary parsed query:
        [... many dozen lines... ]
        DisjunctionMaxQuery((text_stem:"black leather"~1^50.0)~0.01)
        DisjunctionMaxQuery((text_stem:"leather jacket"~1^50.0)~0.01))
        (DisjunctionMaxQuery((text_stem:"red basebal"~10^10.0)~0.01)
        DisjunctionMaxQuery((text_stem:"basebal cap"~10^10.0)~0.01)
        [... many dozens more lines...]
      But it's fast enough in the end:
        org.apache.solr.handler.component.QueryComponent: time: 658.0
  39. Alternatives that may work even better
      This whole project started by trying to boost adjectives connected to nouns
      • With document text like "Tall white heavyset male suspect with eyes that looked blue or gray and red hair wearing a black and yellow jacket a hat that looked purple and a green dragon tattoo on his right arm using a knife with an orange handle".
      • And a search clause like "white male, orange knife, black jacket" boosting this document appropriately.
      Had an interesting conversation with one of this conference's sponsors about looking at the grammar to see which color goes with which noun.
  40. Wrap Up
      Law enforcement has some pretty interesting challenges for finding the most relevant document.
      Solr's a very nice tool for companies to get started with text search and to tune it for domain-specific needs, thanks to nice projects already using it and a very helpful community.
      Solr's flexibility makes it easy to configure for even quite demanding requirements.
  41. Thanks to the Community
      Extremely helpful community! Thanks to the many people in the Lucene community for their help!
      • Jayendra Patil-2
        – Who experienced a similar issue and pointed me to exactly where in the code they applied a similar patch.
      • Yonik Seeley
        – Proposed a good syntax for the parameters, and politely critiqued my really ugly first implementation.
      • Chris Hostetter
        – Voiced support for the syntax and gave encouraging comments.
      • Erik Hatcher
        – For Blacklight, which introduced us to Solr and powered our initial prototypes.
      • Swapnonil Mukherjee, Nick Hall
        – Expressed interest in and tried the patches. "SOLR-2058 allows for a dramatic increase in search relevance" - Nick
      • Andy Jenkins and team at Ejustice
        – Another Lucene user we're working with who's giving me great advice on how to further improve ranking.
      • Lucid Imagination
        – Thanks much for your free advice during early sales calls.
        – Thanks even more for your free support on mailing lists, IRC, etc.
  42. Sources
      Resource
      • http://leap.nctcog.org
      Links
      • https://issues.apache.org/jira/browse/SOLR-2058
      • https://github.com/ramayer/lucene-solr/tree/solr_2058_edismax_pf2_phrase_slop
      White paper
  43. Contact
      Ron Mayer
      • ramayer@forensiclogic.com