Searching The United States Code with Solr/Lucene
What are the challenges in searching an 85-year-old document? The United States Code was published by the United States Congress in 1926 as a single bound volume containing all of the general and permanent laws of the United States Government. It has been updated every year since and has grown into a 30-volume set of some 40,000 pages divided into 50 titles.

Published in: Technology
Transcript

  • 1. Searching The United States Code with Solr/Lucene
      Paul Nelson (pnelson@searchtechnologies.com) / Ronald Matamoros (rmatamoros@searchtechnologies.com), Search Technologies, 5/25/2011
  • 2. Searching the United States Code
      § Who are we:
          • Paul Nelson, Chief Architect
          • Ronald Matamoros, Lead Engineer
      § Our Mission: Replace Personal Librarian Search
          • A 20-Year-Old Search Engine!
      § Key Challenges
          • How to index this massive, complex, 85-year-old document?
          • How to replicate 20-Year-Old search features?
      § Government Documents are Fun!
  • 3. Search Technologies
      § The largest independent provider of enterprise search expertise and services
      § 80 full-time dedicated search engine experts
      § 200+ customers
      § Technology Neutral
          • (yeah, we know Sphinx too)
      § Offices All Over
          • DC, NY, CA, MD, OH, UK, CR…
  • 4. A Quick Civics Lesson…
      § The United States Code
          • The general & permanent laws of the U.S. Government – All in one place
          • 51 titles
              § Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
          • First Version: 1926
      § The Office of the Law Revision Counsel (OLRC)
          • 20 lawyers who author the U.S. Code
          • They report to the Speaker of the House of Representatives
      § Bonus Question: Which Title is the largest?
  • 5. Major Challenges
      1. Document Parsing
          • A 50 Volume Table Of Contents!
      2. Query Parsing
          • Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
      3. Searching & Highlighting Fields
          • Some fields are embedded in the document
          • These fields must be highlighted in context
  • 6. screenshot
  • 7. screenshot
  • 8. screenshot
  • 9. (no text on this slide)
  • 10. Part The First: Document Processing
  • 11. Document Processing / Indexing
      [pipeline diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr]
  • 12. Field Type 1: Extracted to Index
      Callouts on the slide: Page Numbers, Heading, Title, Source Credit
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head -->
      <h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
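The metadata comments at the top of the sample on slide 12 can be pulled into index fields with a few lines of Python. This is a minimal sketch, not the presenters' parser: the key names are copied from the slide, while `extract_index_fields`, the regular expressions, and the assumption that `currentthrough` and `documentPDFPage` are separated by whitespace are illustrative.

```python
import re

# Metadata keys as they appear in the sample comments on the slide.
KEYS = ("documentid", "usckey", "currentthrough", "documentPDFPage",
        "itempath", "itemsortkey", "expcite")
KEY_RE = re.compile(r"\b(" + "|".join(KEYS) + r"):")
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)

def extract_index_fields(xhtml):
    """Collect key:value pairs from metadata comments (hypothetical helper)."""
    fields = {}
    for body in COMMENT_RE.findall(xhtml):
        hits = list(KEY_RE.finditer(body))
        for i, hit in enumerate(hits):
            # A value runs until the next known key or the end of the comment.
            end = hits[i + 1].start() if i + 1 < len(hits) else len(body)
            fields[hit.group(1)] = body[hit.end():end].strip()
    return fields

SAMPLE = """<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->"""

fields = extract_index_fields(SAMPLE)
# The expcite value looks like a "!@!"-delimited citation path.
citation_path = fields["expcite"].split("!@!")
print(fields["documentid"], fields["documentPDFPage"], citation_path[-1])
```

Comments such as `field-start:head` are ignored here because their keys are not in `KEYS`; they mark in-document fields rather than extracted ones.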
  • 13. Document Processing / Indexing
      [pipeline diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr]
      [granularization tree: Title 14 → ch. 1, ch. 2, ch. 3, … → pt. A, pt. B, pt. C, … → sec. 1, sec. 2, sec. 3, …]
  • 14. Field Type 2: Embedded Refs
      Callouts on the slide: Statute at Large, Public Law, USC Refs, Other
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head -->
      <h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
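One way to picture the "Embed Refs" step is a pass that recognizes citation shapes in the text. The sketch below is only an illustration: the two patterns are guessed from the samples on slide 14 ("63 Stat. 496", "Pub. L. 94–546"), `find_refs` is a hypothetical helper, and real citations take many more forms than these.

```python
import html
import re

# Assumed citation shapes, based only on the samples shown on the slide.
PATTERNS = {
    "statute-at-large": re.compile(r"(\d+)\s+Stat\.\s+(\d+)"),
    "public-law": re.compile(r"Pub\.\s+L\.\s+(\d+)[\u2013-](\d+)"),
}

def find_refs(fragment):
    """Return (kind, first number, second number) tuples in reading order."""
    text = html.unescape(fragment)  # turns &ndash; into an en dash, etc.
    refs = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            refs.append((m.start(), kind, m.group(1), m.group(2)))
    return [ref[1:] for ref in sorted(refs)]

credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),"
print(find_refs(credit))
```

Each recognized reference could then be written back into the XHTML as a link or marker; how the presenters embedded them is not stated in the transcript.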
  • 15. Document Processing / Indexing
      [pipeline diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr]
  • 16. Document Processing / Indexing
      [pipeline diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr]
      Repository layout:
      § /US-Code
          § /2010
              § /title2
                  § /USC-title2-section1532.htm
                  § /USC-title2-node3-rule5.htm
  • 17. Part The Second: Token Processing
  • 18. Token Processing 1: xhtml tag tokenizer
      Input:
          <!-- field-start:amendment-note -->
          <h4 class="note-head">Amendments</h4>
          <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
          <!-- field-end:amendment-note -->
      Output tokens, one per line:
          <!-- field-start:amendment-note -->
          <h4 class="note-head">
          Amendments
          </h4>
          <p class="note-body">
          2002
          Pub
          L
          107
          296
          Substituted
          Department
          of
          <!-- field-end:amendment-note -->
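The tokenizer on slide 18 keeps comments and tags as single tokens and splits the remaining text into words. A rough Python equivalent is sketched below; the production version was presumably a Lucene tokenizer, and `tokenize` plus both regular expressions are assumptions made for illustration.

```python
import html
import re

TAG_RE = re.compile(r"(<!--.*?-->|<[^>]+>)", re.S)  # comments and tags
WORD_RE = re.compile(r"[A-Za-z0-9]+")               # runs of letters or digits

def tokenize(xhtml):
    """Emit each comment/tag as one token and each word as one token."""
    tokens = []
    for piece in TAG_RE.split(xhtml):
        if TAG_RE.fullmatch(piece):
            tokens.append(piece)
        else:
            # Decode entities first so &mdash; and &ndash; act as separators.
            tokens.extend(WORD_RE.findall(html.unescape(piece)))
    return tokens

SAMPLE = (
    '<!-- field-start:amendment-note -->'
    '<h4 class="note-head">Amendments</h4>'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of \u2026'
    '<!-- field-end:amendment-note -->'
)

tokens = tokenize(SAMPLE)
for token in tokens:
    print(token)
```

The sketch preserves the case of the source text, so it yields `substituted` where the slide shows `Substituted`.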
  • 19. Field Type 3: Marked Within Doc
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head -->
      <h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 20. Token Processing 2: Mark Start and End Tags
      Input token → output token:
          <!-- field-start:amendment-note --> → S/amendment
          <h4 class="note-head"> → <h4 class="note-head">
          Amendments → Amendments
          </h4> → </h4>
          <p class="note-body"> → <p class="note-body">
          2002 → 2002
          Pub → Pub
          L → L
          107 → 107
          296 → 296
          Substituted → Substituted
          Department → Department
          of → of
          <!-- field-end:amendment-note --> → E/amendment
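Stage 2 can be sketched as a filter that rewrites field comments into marker tokens and passes everything else through. The slide maps `field-start:amendment-note` to `S/amendment`; dropping a trailing `-note` is my guess at that naming rule, and `mark_fields` is a hypothetical name.

```python
import re

FIELD_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")

def mark_fields(tokens):
    """Replace field-start/field-end comments with S/name and E/name tokens."""
    marked = []
    for token in tokens:
        m = FIELD_RE.fullmatch(token)
        if m is None:
            marked.append(token)  # ordinary tag or word: unchanged
            continue
        name = m.group(2)
        if name.endswith("-note"):
            # Assumption: the "-note" suffix is dropped, as in S/amendment.
            name = name[: -len("-note")]
        marked.append(("S/" if m.group(1) == "start" else "E/") + name)
    return marked

tokens = [
    '<!-- field-start:amendment-note -->', '<h4 class="note-head">',
    'Amendments', '</h4>', '<p class="note-body">', '2002', 'Pub', 'L',
    '107', '296', 'Substituted', 'Department', 'of',
    '<!-- field-end:amendment-note -->',
]
marked = mark_fields(tokens)
print(marked[0], marked[-1])
```

Because the markers are ordinary tokens, they receive positions in the index like any word, which is what later allows a query to ask for terms located between a start marker and its end marker.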
  • 21. Token Processing 3: Remove XHTML Tags
      Input token → output token:
          S/amendment → S/amendment
          <h4 class="note-head"> → (removed)
          Amendments → Amendments
          </h4> → (removed)
          <p class="note-body"> → (removed)
          2002 → 2002
          Pub → Pub
          L → L
          107 → 107
          296 → 296
          Substituted → Substituted
          Department → Department
          of → of
          E/amendment → E/amendment
  • 22. Token Processing 4: Tag Original Case & Lower Case
      Input token → output tokens:
          S/amendment → S/amendment
          Amendments → O/Amendments L/amendments
          2002 → O/2002 L/2002
          Pub → O/Pub L/pub
          L → O/L L/l
          107 → O/107 L/107
          296 → O/296 L/296
          Substituted → O/Substituted L/substituted
          Department → O/Department L/department
          of → O/of L/of
          E/amendment → E/amendment
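Stages 3 and 4 are simple enough to sketch together. Here each stream position is modelled as a list of alternative tokens, which is my stand-in for Lucene tokens stacked at one position; `strip_tags` and `tag_case` are hypothetical names.

```python
def strip_tags(tokens):
    """Stage 3: drop XHTML tags, keep words and S/ E/ markers."""
    return [t for t in tokens if not t.startswith("<")]

def tag_case(tokens):
    """Stage 4: emit O/<original> and L/<lowercase> for every word."""
    positions = []
    for token in tokens:
        if token.startswith(("S/", "E/")):
            positions.append([token])  # field markers pass through
        else:
            positions.append(["O/" + token, "L/" + token.lower()])
    return positions

marked = [
    "S/amendment", '<h4 class="note-head">', "Amendments", "</h4>",
    '<p class="note-body">', "2002", "Pub", "L", "107", "296",
    "Substituted", "Department", "of", "E/amendment",
]
positions = tag_case(strip_tags(marked))
for alternatives in positions:
    print(" ".join(alternatives))
```

Indexing both forms is what makes an exact-case search possible: a case-sensitive query matches the `O/` token, while an ordinary query matches the `L/` token.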
  • 23. Token Processing 5: Lemmatize
      Uses dictionary-based lemmatizer based on GCIDE and WordNet
      Input tokens → output tokens (the lemma is added without a prefix):
          S/amendment → S/amendment
          O/Amendments L/amendments → O/Amendments L/amendments amendment
          O/2002 L/2002 → O/2002 L/2002 2002
          O/Pub L/pub → O/Pub L/pub pub
          O/L L/l → O/L L/l l
          O/107 L/107 → O/107 L/107 107
          O/296 L/296 → O/296 L/296 296
          O/Substituted L/substituted → O/Substituted L/substituted substitute
          O/Department L/department → O/Department L/department department
          O/of L/of → O/of L/of of
          E/amendment → E/amendment
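The lemmatization stage adds a third, unprefixed token at each position. The sketch below uses a two-entry toy dictionary in place of the GCIDE/WordNet-based lemmatizer named on the slide; `TOY_LEMMAS` and `add_lemmas` are hypothetical.

```python
# Toy stand-in for a dictionary-based lemmatizer (not GCIDE or WordNet).
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}

def add_lemmas(positions):
    """Append the lemma of the L/ token to every word position."""
    result = []
    for alternatives in positions:
        lower = next((t[2:] for t in alternatives if t.startswith("L/")), None)
        if lower is None:
            result.append(list(alternatives))  # S/ and E/ markers
        else:
            # Words missing from the dictionary are their own lemma.
            result.append(list(alternatives) + [TOY_LEMMAS.get(lower, lower)])
    return result

positions = [
    ["S/amendment"],
    ["O/Amendments", "L/amendments"],
    ["O/2002", "L/2002"],
    ["O/Substituted", "L/substituted"],
    ["E/amendment"],
]
lemmatized = add_lemmas(positions)
for alternatives in lemmatized:
    print(" ".join(alternatives))
```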
  • 24. Part The Third: Query Processing
  • 25. Query Processing (not all stages shown)
      [pipeline diagram labels: searchString, parse, mark exact:, mark phrases, lemmatize, query template, buildQuery, lucene query]
      § Communicates via generic QNode Class
          • Simpler to manipulate than Lucene operators
      § Can produce FAST FQL as well
          • (cue the derisive catcalls)
      § But most importantly:
          • It is a Query Processing Pipeline
              § Mix and match query processing modules
  • 26. Query Processing
      Query: exact:FOIA top secret amendment:RECORDS
      [pipeline diagram labels: searchString, parse, mark original, mark lowercase, lemmatize, query template, buildQuery, lucene query]
      Query tree: and( exact:( |FOIA| ), phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 27. Query Processing
      Query: exact:FOIA top secret amendment:RECORDS
      [same pipeline diagram]
      Query tree: and( O/FOIA, phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 28. Query Processing
      Query: exact:FOIA top secret amendment:RECORDS
      [same pipeline diagram]
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |records| ) )
  • 29. Query Processing
      Query: exact:FOIA top secret amendment:RECORDS
      [same pipeline diagram]
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |record| ) )
  • 30. Query Processing
      Query: exact:FOIA top secret amendment:RECORDS
      [same pipeline diagram]
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), between( S/amendment, |record|, E/amendment ) )
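The walkthrough in slides 26 to 30 can be sketched as a chain of small tree rewrites over a generic node class. Everything below is an assumption for illustration: the slides name a `QNode` class but do not show it, so the fields, the stage functions, and the toy lemma table are mine, and the stage order is inferred from the sequence of trees.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QNode:
    op: str                      # "and", "phrase", "exact", "field", "term", "between"
    children: List["QNode"] = field(default_factory=list)
    value: str = ""              # term text, or the field name for "field"/"between"

def term(text):
    return QNode("term", value=text)

def rewrite(node, stage):
    """Rebuild the tree bottom-up, letting the stage replace each node."""
    rebuilt = QNode(node.op, [rewrite(c, stage) for c in node.children], node.value)
    return stage(rebuilt)

def mark_original(node):
    # exact:X becomes the case-preserving token O/X (single operand assumed).
    if node.op == "exact" and len(node.children) == 1:
        return term("O/" + node.children[0].value)
    return node

def mark_lowercase(node):
    if node.op == "term" and not node.value.startswith("O/"):
        return term(node.value.lower())
    if node.op == "phrase":
        return QNode("phrase", [term("L/" + c.value) for c in node.children])
    return node

TOY_LEMMAS = {"records": "record"}  # stand-in for the real lemmatizer

def lemmatize(node):
    if node.op == "term" and "/" not in node.value:
        return term(TOY_LEMMAS.get(node.value, node.value))
    return node

def apply_template(node):
    # field:(...) becomes between(S/field, E/field, ...).
    if node.op == "field":
        return QNode("between", node.children, node.value)
    return node

def render(node):
    if node.op == "term":
        return node.value
    inner = ", ".join(render(c) for c in node.children)
    if node.op == "between":
        return f"between(S/{node.value}, E/{node.value}, {inner})"
    if node.op == "field":
        return f"{node.value}:({inner})"
    return f"{node.op}({inner})"

STAGES = [mark_original, mark_lowercase, lemmatize, apply_template]

def run(tree, stages=STAGES):
    for stage in stages:
        tree = rewrite(tree, stage)
    return tree

parsed = QNode("and", [
    QNode("exact", [term("FOIA")]),
    QNode("phrase", [term("top"), term("secret")]),
    QNode("field", [term("RECORDS")], "amendment"),
])
final = render(run(parsed))
print(final)
```

Because every stage has the same signature, stages can be reordered or dropped by editing the `STAGES` list, which is the "mix and match" property the slides emphasize.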
  • 31. The between() Operator
      § between(start-tag, end-tag, pos-clause, neg-clause)
      § start-tag → Starting tag, e.g. S/amendment
      § end-tag → Ending tag, e.g. E/amendment
      § pos-clause → words which must occur between start and end
          • Note: Requires a nested ScanAnd() operator
      § neg-clause → words which must not occur between start and end
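To make the semantics of `between()` concrete, here is a brute-force scan over an in-memory token stream. It is only a model of the behaviour described on slide 31, not the Lucene operator the presenters built: the real one works on index positions, and the nested `ScanAnd()` mentioned on the slide is not modelled. The stream layout (a list of alternative tokens per position) is an assumption.

```python
def between(positions, start_tag, end_tag, pos_clause, neg_clause=()):
    """Return (start, end) index pairs of fields that satisfy both clauses."""
    matches = []
    start = None
    for i, alternatives in enumerate(positions):
        if start_tag in alternatives:
            start = i
        elif end_tag in alternatives and start is not None:
            # Every token that occurs strictly inside the field.
            inside = {t for alts in positions[start + 1:i] for t in alts}
            has_all = all(word in inside for word in pos_clause)
            has_none = not any(word in inside for word in neg_clause)
            if has_all and has_none:
                matches.append((start, i))
            start = None
    return matches

stream = [
    ["S/amendment"],
    ["O/Amendments", "L/amendments", "amendment"],
    ["O/2002", "L/2002", "2002"],
    ["O/Pub", "L/pub", "pub"],
    ["O/Substituted", "L/substituted", "substitute"],
    ["E/amendment"],
]
print(between(stream, "S/amendment", "E/amendment", ["substitute"]))
print(between(stream, "S/amendment", "E/amendment", ["substitute"], ["pub"]))
```

The first call finds the field spanning positions 0 to 5; the second finds nothing because `pub` is excluded by the negative clause.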
  • 32. Part the Fourth: Hierarchical Navigation
  • 33. screenshot
  • 34. Hierarchies: Requirements
      § Any number of levels
          § Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
      § Levels vary across titles
          § Title 1: 3 levels
          § Title 26: 8 levels
      § Multiple views:
          § Children
          § Ancestors
          § Ancestor's Siblings
      § Multiple search scopes:
          § Only children, all descendants, everything
  • 35. Hierarchies: Ancestor-Siblings
      § US-Code
          • Title 1
          • Title 2
              § Chapter 1
              § Chapter 2
                  – Part 1
                  – Part 2
                      • Section 2.1
                      • Section 2.2
                  – Part 3
                  – Part 4
              § Chapter 3
              § Chapter 4
          • Title 3
  • 36. Hierarchies: Fields
      § ancestors
          • Searching
              § USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
      § encodedAncestors – for display only
          • Where the node exists within the hierarchy
              § id;heading;subjectTitle//id;heading;subjectTitle//...
              § USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
      § parentId – ID of the parent node
              § USC-title2-chapter25-subchapter2
      § treesort – Hierarchical sort field, e.g. 13/000/0/00882
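The three ancestor-related fields on slide 36 can all be derived from one list of ancestor nodes. A minimal sketch, assuming each ancestor is an `(id, heading, subjectTitle)` tuple listed outermost first; `hierarchy_fields` is a hypothetical helper. The example uses only the two ancestors whose headings appear on the slide, so its `ancestors` value is shorter than the slide's, which also lists `USC` and `USC-title2`.

```python
def hierarchy_fields(ancestor_nodes):
    """Build the hierarchy fields for one document from its ancestors."""
    ids = [node_id for node_id, _heading, _subject in ancestor_nodes]
    return {
        # Space-separated IDs, searchable one token per ancestor.
        "ancestors": " ".join(ids),
        # Display-only breadcrumb: id;heading;subjectTitle joined by "//".
        "encodedAncestors": "//".join(";".join(node) for node in ancestor_nodes),
        # The innermost ancestor is the direct parent.
        "parentId": ids[-1],
    }

doc_fields = hierarchy_fields([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
for name, value in doc_fields.items():
    print(name, "=", value)
```

Storing the breadcrumb in a display-only field means the result page can render the full path without a second lookup per ancestor.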
  • 37. Hierarchies: Tree Sort
      § Sorting In Print Order
          • Front Matter → Titles → Tables → etc.
          • Everything padded to fixed-length
      § Example key: 01/011/1/0203201
          • Segment labels on the slide: USC Title, Sequence # in file
          • 011 = Title 11
          • 1 = An Appendix
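Fixed-width padding is what lets a plain string sort reproduce print order. The sketch below is a guess at the idea rather than the presenters' format: the segment names are hypothetical, and the widths (2/3/1/5) are taken from the `13/000/0/00882` example on slide 36, which differs from the longer last segment in the example on this slide.

```python
def make_treesort(group, title, is_appendix, sequence):
    """Build a zero-padded sort key; segment names and widths are assumed."""
    return f"{group:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"

keys = [
    make_treesort(13, 0, False, 882),
    make_treesort(1, 11, True, 20),
    make_treesort(1, 2, False, 300),
]
# Lexicographic order now agrees with numeric order, segment by segment.
for key in sorted(keys):
    print(key)
```

Without the padding, Title 11 would sort before Title 2 as a string, which is the problem the fixed-length rule avoids.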
  • 38. Hierarchies: Sample Searches
      § Assuming Node = USC-title2-chapter25
      § Search Children
          • parentId:USC-title2-chapter25
      § Search All Descendants
          • ancestors:USC-title2-chapter25
      § Ancestor Siblings
          • (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
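The three sample searches follow mechanically from a node ID. The helpers below are hypothetical and assume that IDs are built by appending `-segment` at each level, as in the slide's examples; they emit plain query strings and do not handle escaping of special characters.

```python
def children_query(node_id):
    """Documents whose direct parent is node_id."""
    return f"parentId:{node_id}"

def descendants_query(node_id):
    """Documents anywhere below node_id."""
    return f"ancestors:{node_id}"

def ancestor_siblings_query(node_id):
    """Children of node_id and of each of its ancestors."""
    parts = node_id.split("-")
    # "USC-title2-chapter25" -> "USC", "USC-title2", "USC-title2-chapter25"
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"

node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(node))
```

The last query returns every node needed to draw the expanded tree from slide 35: the siblings at each level along the path from the root down to the current node, plus the current node's children.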
  • 39. Contact
      § Paul Nelson
          • pnelson@searchtechnologies.com
      § Ronald Matamoros
          • rmatamoros@searchtechnologies.com
      § Search Technologies
          • http://searchtechnologies.com
