Searching The United States Code with Solr/Lucene - By Ronald Matamoros

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Published in: Technology, Sports

  1. Searching The United States Code with Solr/Lucene
      Paul Nelson / Ronald Matamoros, Search Technologies
      pnelson@searchtechnologies.com, 5/25/2011 [email_address]
  2. Searching the United States Code
      • Who are we:
          • Paul Nelson, Chief Architect
          • Ronald Matamoros, Lead Engineer
      • Our Mission: Replace Personal Librarian Search
          • A 20-Year-Old Search Engine!
      • Key Challenges
          • How to index this massive, complex, 85-year-old document?
          • How to replicate 20-Year-Old search features?
      • Government Documents are Fun!
  3. Search Technologies
      • The largest independent provider of enterprise search expertise and services
      • 80 full-time dedicated search engine experts
      • 200+ customers
      • Technology Neutral
          • (yeah, we know Sphinx too)
      • Offices All Over
          • DC, NY, CA, MD, OH, UK, CR…
  4. A Quick Civics Lesson…
      • The United States Code
          • The general & permanent laws of the U.S. Government – All in one place
          • 51 titles
              • Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
          • First Version: 1926
      • The Office of the Law Revision Counsel (OLRC)
          • 20 lawyers who author the U.S. Code
          • They report to the Speaker of the House of Representatives
      • Bonus Question: Which Title is the largest?
  5. Major Challenges
      • Document Parsing
          • A 50 Volume Table Of Contents!
      • Query Parsing
          • Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
      • Searching & Highlighting Fields
          • Some fields are embedded in the document
          • These fields must be highlighted in context
  6. screenshot
  7. screenshot
  8. screenshot
  9.
  10. Part The First: Document Processing
  11. Document Processing / Indexing
      (pipeline diagram; box labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs)
  12. Field Type 1: Extracted to Index
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
      (slide callouts: Page Numbers Title Heading Source Credit)
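The sample above carries two kinds of extractable data: metadata in leading comments and field bodies bracketed by `field-start:` / `field-end:` comments. The following is a minimal Python sketch of how such fields could be pulled out, not the presenters' implementation. The function names, the regular expressions, and the rule for telling a multi-pair comment from a single pair whose value contains spaces are all assumptions made for illustration, and the sample is an abbreviated copy of the slide.

```python
import re

# Abbreviated copy of the slide's sample; truncations ("…") are kept as-is.
SAMPLE = (
    '<!-- documentid:14_1 usckey:140000000000100000000000000000000 '
    'currentthrough:20080108 documentPDFPage:3 -->\n'
    '<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->\n'
    '<!-- itemsortkey:140AAAD -->\n'
    '<!-- field-start:head --><h3 class="section-head">&sect;1. '
    'Establishment of Coast Guard</h3> <!-- field-end:head -->\n'
    '<!-- field-start:sourcecredit --> <p class="source-credit">(Aug. 4, 1949, '
    'ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),… '
    '<!-- field-end:sourcecredit -->\n'
)

COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
FIELD = re.compile(
    r"<!--\s*field-start:([\w-]+)\s*-->(.*?)<!--\s*field-end:\1\s*-->", re.S
)
TAG = re.compile(r"<[^>]+>")


def extract_metadata(xhtml):
    """Collect key:value pairs from the non-field comments."""
    meta = {}
    for body in COMMENT.findall(xhtml):
        if body.startswith("field-"):
            continue  # field markers are handled by extract_fields
        tokens = body.split()
        # Assumed rule: if every token has a colon, the comment holds several
        # pairs; otherwise it is one pair whose value contains spaces.
        pairs = tokens if all(":" in t for t in tokens) else [body]
        for pair in pairs:
            key, _, value = pair.partition(":")
            meta[key] = value
    return meta


def extract_fields(xhtml):
    """Return the tag-stripped text of each (non-nested) marked field."""
    return {name: TAG.sub("", body).strip() for name, body in FIELD.findall(xhtml)}


meta = extract_metadata(SAMPLE)
fields = extract_fields(SAMPLE)
print(meta["documentPDFPage"], "|", fields["head"])
```

Nested fields (such as the notes block) would need a real parser; this sketch handles only flat fields.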
  13. Document Processing / Indexing
      (tree diagram labels: Title 14; ch. 1, ch. 2, ch. 3; pt. A, pt. B, pt. C; sec. 1, sec. 2, sec. 3; … … …)
      (pipeline diagram; box labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs)
  14. Field Type 2: Embedded Refs
      (same sample document as slide 12; slide callouts: Public Law Other USC Refs Statute at Large Public Law Public Law)
  15. Document Processing / Indexing
      (pipeline diagram; box labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs)
  16. Document Processing / Indexing
      (pipeline diagram; box labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs)
      • /US-Code
          • /2010
              • /title2
                  • /USC-title2-section1532.htm
                  • /USC-title2-node3-rule5.htm
  17. Part The Second: Token Processing
  18. Token Processing 1
      • xhtml tag tokenizer
      Before: <!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of … <!-- field-end:amendment-note -->
      After: <!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
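A tag tokenizer of the kind shown on this slide keeps each comment or tag as a single token and splits the text between them into words. The sketch below illustrates the idea in plain Python; it is an assumption-laden stand-in, not the Lucene tokenizer used in the project. The names `MARKUP`, `WORD`, and `tokenize` are hypothetical, and decoding entities before splitting words is a guess at how `&mdash;` and `&ndash;` end up as token breaks.

```python
import html
import re

# A comment or a tag is one token; anything else is ordinary text.
MARKUP = re.compile(r"(<!--.*?-->|<[^>]+>)", re.S)
WORD = re.compile(r"\w+")


def tokenize(xhtml):
    """Split XHTML into markup tokens and word tokens, in document order."""
    tokens = []
    # re.split with a capturing group alternates text (even index)
    # and markup (odd index).
    for i, piece in enumerate(MARKUP.split(xhtml)):
        if i % 2:
            tokens.append(piece)
        else:
            # Decode entities so dashes and quotes act as word breaks.
            tokens.extend(WORD.findall(html.unescape(piece)))
    return tokens


SAMPLE = (
    '<!-- field-start:amendment-note --> '
    '<h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    '&ldquo;Department of … '
    '<!-- field-end:amendment-note -->'
)

tokens = tokenize(SAMPLE)
print(tokens)
```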
  19. Field Type 3: Marked Within Doc
      (same sample document as slide 12)
  20. Token Processing 2
      • Mark Start and End Tags
      After: S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
      Before: <!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
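The marking step can be pictured as a filter that rewrites field comments into `S/` and `E/` tokens and passes everything else through. The Python sketch below is illustrative only. The function name is hypothetical, and dropping the `-note` suffix is an assumption read off the slide, where `amendment-note` becomes `S/amendment`.

```python
import re

FIELD_COMMENT = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_fields(tokens):
    """Replace field-start/field-end comment tokens with S/name and E/name."""
    marked = []
    for tok in tokens:
        m = FIELD_COMMENT.fullmatch(tok)
        if m is None:
            marked.append(tok)
            continue
        prefix = "S/" if m.group(1) == "start" else "E/"
        name = m.group(2)
        # Assumption: the "-note" suffix is dropped, as the slide suggests.
        if name.endswith("-note"):
            name = name[: -len("-note")]
        marked.append(prefix + name)
    return marked


marked = mark_fields([
    "<!-- field-start:amendment-note -->",
    '<h4 class="note-head">', "Amendments", "</h4>",
    "<!-- field-end:amendment-note -->",
])
print(marked)
```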
  21. Token Processing 3
      • Remove XHTML Tags
      After: S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
      Before: S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
  22. Token Processing 4
      • Tag Original Case & Lower Case
      After: S/amendment O/Amendments L/amendments O/2002 L/2002 O/Pub L/pub O/L L/l O/107 L/107 O/296 L/296 O/Substituted L/substituted O/Department L/department O/of L/of E/amendment
      Before: S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
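Slides 21 and 22 describe two small filters: drop the remaining XHTML tag tokens, then emit each word twice, once in original case (`O/`) and once lowercased (`L/`). The sketch below is a hedged Python illustration. Grouping the two variants into one list per position is an assumption (the slides do not say whether they share a position, as stacked tokens in Lucene can), and the function names are hypothetical.

```python
def strip_markup(tokens):
    """Drop tokens that are XHTML tags, keeping words and S/ E/ markers."""
    return [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]


def tag_case(tokens):
    """Return one list per position: [S/ or E/ marker] or [O/word, L/word]."""
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append([tok])
        else:
            positions.append(["O/" + tok, "L/" + tok.lower()])
    return positions


stream = ["S/amendment", '<h4 class="note-head">', "Amendments", "</h4>",
          '<p class="note-body">', "2002", "Pub", "L", "E/amendment"]
words = strip_markup(stream)
positions = tag_case(words)
print(positions)
```

Keeping both variants lets an exact-case query match `O/` tokens while ordinary queries match the lowercase ones.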
  23. Token Processing 5
      • Lemmatize
      • Uses dictionary-based lemmatizer based on GCIDE and WordNet
      After: S/amendment O/Amendments L/amendments amendment O/2002 L/2002 2002 O/Pub L/pub pub O/L L/l l O/107 L/107 107 O/296 L/296 296 O/Substituted L/substituted substitute O/Department L/department department O/of L/of of E/amendment
      Before: S/amendment O/Amendments L/amendments O/2002 L/2002 O/Pub L/pub O/L L/l O/107 L/107 O/296 L/296 O/Substituted L/substituted O/Department L/department O/of L/of E/amendment
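The lemmatizing step adds a third, unprefixed token per word: its dictionary form. The Python sketch below shows the shape of that step with a tiny hand-written lookup table standing in for the GCIDE/WordNet dictionary. The table, the function name, and the list-per-position representation are assumptions for illustration.

```python
# Toy stand-in for the dictionary-based lemmatizer; entries are illustrative.
LEMMAS = {
    "amendments": "amendment",
    "substituted": "substitute",
}


def add_lemmas(positions):
    """Append the lemma of each L/ token to its position; markers pass through."""
    out = []
    for stack in positions:
        new_stack = list(stack)
        for tok in stack:
            if tok.startswith("L/"):
                word = tok[2:]
                # Unknown words are assumed to be their own lemma.
                new_stack.append(LEMMAS.get(word, word))
        out.append(new_stack)
    return out


lemmatized = add_lemmas([
    ["S/amendment"],
    ["O/Amendments", "L/amendments"],
    ["O/Substituted", "L/substituted"],
    ["O/of", "L/of"],
    ["E/amendment"],
])
print(lemmatized)
```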
  24. Part The Third: Query Processing
  25. Query Processing
      (pipeline diagram; labels: parse, mark phrases, lemmatize, query template, build lucene query, mark exact:, Query String, search; not all stages shown)
      • Communicates via generic QNode Class
          • Simpler to manipulate than Lucene operators
      • Can produce FAST FQL as well
          • (cue the derisive catcalls)
      • But most importantly:
          • It is a Query Processing Pipeline
              • Mix and match query processing modules
  26. Query Processing
      (pipeline diagram; labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search)
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
      • and
          • exact: |FOIA|
          • phrase |top| |secret|
          • amendment: |RECORDS|
  27. Query Processing
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
      • and
          • O/FOIA
          • phrase |top| |secret|
          • amendment: |RECORDS|
  28. Query Processing
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
      • and
          • O/FOIA
          • phrase |L/top| |L/secret|
          • amendment: |records|
  29. Query Processing
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
      • and
          • O/FOIA
          • phrase |L/top| |L/secret|
          • amendment: |record|
  30. Query Processing
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
      • and
          • O/FOIA
          • phrase |L/top| |L/secret|
          • between S/amendment E/amendment |record|
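Slides 26 to 30 can be read as a chain of tree rewrites applied to the parsed query. The Python sketch below reproduces that chain for the example query. It is a rough illustration only: the `Node` class is a hypothetical stand-in for the QNode class mentioned on slide 25 (whose real API is not shown), the stage functions and their order are inferred from the slides, and the one-entry lemma table is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                      # "and", "phrase", "exact", "field", "between", "term"
    value: str = ""              # term text, or the field name for "field"
    children: list = field(default_factory=list)


LEMMAS = {"records": "record"}   # placeholder dictionary


def term(value):
    return Node("term", value)


def rewrite(node, stage):
    """Apply one stage bottom-up and return the rewritten tree."""
    kids = [rewrite(c, stage) for c in node.children]
    return stage(Node(node.op, node.value, kids))


def mark_original(n):            # exact:FOIA -> O/FOIA
    return term("O/" + n.children[0].value) if n.op == "exact" else n


def mark_lowercase(n):           # phrase words -> L/word; other words lowercased
    if n.op == "phrase":
        return Node("phrase", "", [term("L/" + c.value.lower()) for c in n.children])
    if n.op == "term" and not n.value.startswith(("O/", "L/")):
        return term(n.value.lower())
    return n


def lemmatize(n):                # unmarked words -> dictionary form
    if n.op == "term" and not n.value.startswith(("O/", "L/")):
        return term(LEMMAS.get(n.value, n.value))
    return n


def apply_template(n):           # amendment:x -> between(S/amendment, E/amendment, x)
    if n.op == "field":
        tags = [term("S/" + n.value), term("E/" + n.value)]
        return Node("between", "", tags + n.children)
    return n


def run(tree):
    for stage in (mark_original, mark_lowercase, lemmatize, apply_template):
        tree = rewrite(tree, stage)
    return tree


def show(n):
    if n.op == "term":
        return n.value
    return n.op + "(" + ", ".join(show(c) for c in n.children) + ")"


query = Node("and", "", [
    Node("exact", "", [term("FOIA")]),
    Node("phrase", "", [term("top"), term("secret")]),
    Node("field", "amendment", [term("RECORDS")]),
])
result = show(run(query))
print(result)
```

Because every stage has the same signature, stages can be reordered or swapped, which is the "mix and match" property slide 25 highlights.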
  31. The between() Operator
      • between(start-tag, end-tag, pos-clause, neg-clause)
      • start-tag → Starting tag, e.g. “S/amendment”
      • end-tag → Ending tag, e.g. “E/amendment”
      • pos-clause → words which must occur between start and end
          • Note: Requires a nested ScanAnd() operator
      • neg-clause → words which must not occur between start and end
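The semantics of between() can be illustrated over a flat token list: find each span from a start tag to its end tag, and accept it when every positive word occurs inside and no negative word does. The Python sketch below shows only that matching rule. It is not the custom Lucene operator from the talk, which presumably works on index positions, and treating pos-clause as a simple all-of word list is an assumption.

```python
def between(tokens, start_tag, end_tag, pos, neg=()):
    """Return (start, end) index pairs of spans that satisfy both clauses."""
    hits = []
    start = None
    for i, tok in enumerate(tokens):
        if tok == start_tag:
            start = i
        elif tok == end_tag and start is not None:
            inside = set(tokens[start + 1 : i])
            has_all_pos = all(w in inside for w in pos)
            has_any_neg = any(w in inside for w in neg)
            if has_all_pos and not has_any_neg:
                hits.append((start, i))
            start = None
    return hits


doc = ["S/amendment", "record", "of", "E/amendment",
       "S/statute", "record", "E/statute"]

print(between(doc, "S/amendment", "E/amendment", ["record"]))
print(between(doc, "S/amendment", "E/amendment", ["record"], neg=["of"]))
```

The first call matches the amendment span; the second rejects it because the negative word occurs inside the span.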
  32. Part the Fourth: Hierarchical Navigation
  33. screenshot
  34. Hierarchies: Requirements
      • Any number of levels
          • Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
      • Levels vary across titles
          • Title 1: 3 levels
          • Title 26: 8 levels
      • Multiple views:
          • Children
          • Ancestors
          • Ancestor’s Siblings
      • Multiple search scopes:
          • Only children, all descendants, everything
  35. Hierarchies: Ancestor-Siblings
      • US-Code
          • Title 1
          • Title 2
              • Chapter 1
              • Chapter 2
                  • Part 1
                  • Part 2
                      • Section 2.1
                      • Section 2.2
                  • Part 3
                  • Part 4
              • Chapter 3
              • Chapter 4
          • Title 3
  36. Hierarchies: Fields
      • ancestors
          • Searching
              • USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
      • encodedAncestors – for display only
          • Where the node exists within the hierarchy
              • id;heading;subjectTitle//id;heading;subjectTitle//...
              • USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform// USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
      • parentId – ID of the parent node
          • USC-title2-chapter25-subchapter2
      • treesort – Hierarchical sort field, e.g. “13/000/0/00882”
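The three ancestor-related fields can all be derived from one list of a node's ancestors. The Python sketch below shows one way to build them, using only the two ancestors spelled out on the slide. The function name and the input shape (a root-first list of id, heading, subjectTitle tuples) are assumptions, and a full record would also list the ancestors above the chapter.

```python
def hierarchy_fields(ancestors):
    """Build index fields from (id, heading, subjectTitle) tuples, root first."""
    return {
        # Space-separated ids, searchable as individual tokens.
        "ancestors": " ".join(a[0] for a in ancestors),
        # Display-only encoding: id;heading;subjectTitle joined by "//".
        "encodedAncestors": "//".join(";".join(a) for a in ancestors),
        # The last ancestor is the direct parent.
        "parentId": ancestors[-1][0] if ancestors else "",
    }


fields = hierarchy_fields([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
print(fields["parentId"])
print(fields["encodedAncestors"])
```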
  37. Hierarchies: Tree Sort
      • Sorting In Print Order
          • Front Matter → Titles → Tables → etc.
          • Everything padded to fixed-length
      Example: 01/011/1/02032
          • 01 = USC Title
          • 011 = Title 11
          • 1 = An Appendix
          • 02032 = Sequence # in file
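Padding every component to a fixed width is what lets a plain string sort reproduce print order. The Python sketch below builds keys in the slide's shape. The widths (2, 3, 1, 5) are read off the two examples in the deck, and the function and parameter names are hypothetical.

```python
def treesort_key(kind, title, appendix, seq):
    """Format a fixed-width sort key: kind/title/appendix-flag/sequence."""
    return f"{kind:02d}/{title:03d}/{int(appendix)}/{seq:05d}"


key = treesort_key(kind=1, title=11, appendix=True, seq=2032)
print(key)

# With zero padding, lexicographic order matches numeric order,
# so Title 2 sorts before Title 11.
keys = [treesort_key(1, 11, False, 5), treesort_key(1, 2, False, 900)]
print(sorted(keys))
```

Without the padding, "11" would sort before "2" as text.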
  38. Hierarchies: Sample Searches
      • Assuming Node = “USC-title2-chapter25”
      • Search Children
          • parentId:USC-title2-chapter25
      • Search All Descendants
          • ancestors:USC-title2-chapter25
      • Ancestor Siblings
          • (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
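The three sample searches follow mechanically from a node id. The Python sketch below generates the same query strings. The function names are hypothetical, and deriving ancestor ids by splitting the node id on hyphens is an assumption that holds only while no path segment itself contains a hyphen; reading ids from the stored ancestors field would avoid that.

```python
def children_query(node_id):
    """Match only the direct children of the node."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """Match everything below the node, at any depth."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """Match the children of the node and of each of its ancestors."""
    parts = node_id.split("-")
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"


node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(node))
```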
  39. Contact
      • Paul Nelson
          • [email_address]
      • Ronald Matamoros
          • [email_address]
      • Search Technologies
          • http://searchtechnologies.com
