Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011
Searching the United States Code
§  Who are we:
  •  Paul Nelson, Chief Architect
  •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
  •  A 20-Year-Old Search Engine!
§  Key Challenges
  •  How to index this massive, complex, 85-year-old
     document?
  •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!

Search Technologies
§  The largest independent provider of enterprise
    search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD, OH, UK, CR…


A Quick Civics Lesson…
§  The United States Code
  •  The general & permanent laws of the U.S.
     Government – All in one place
  •  51 titles
     §  Agriculture, Armed Forces, Conservation, The President,
         Food and Drugs, Postal Service, Public Health…
  •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
  •  20 lawyers who author the U.S. Code
  •  They report to the Speaker of the House of
     Representatives
§  Bonus Question: Which Title is the largest?
Major Challenges
1.  Document Parsing
  •  A 50 Volume Table Of Contents!


2.  Query Parsing
  •  Custom Features (exact case, exact suffix,
     proximity, query templates, lemmatization, lots
     of fields…)


3.  Searching & Highlighting Fields
  •  Some fields are embedded in the document
  •  These fields must be highlighted in context

Part The First:
Document Processing



Document Processing / Indexing

USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr
Field Type 1: Extracted to Index
Fields called out on the slide: Page Numbers, Heading, Title, Source Credit

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
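
A minimal sketch of how the key:value pairs in the leading comments could be pulled out as index fields. The slides do not show the actual parser, so the regular expressions, the `extract_fields` name, and the plain-dict result are assumptions for illustration only.

```python
import re

# One comment body, e.g. "itempath:/140/PART I/CHAPTER 1/Sec. 1".
COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
# One key:value pair; the value ends before the next "key:" or at the end
# of the comment body, so values may contain spaces.
PAIR = re.compile(r"(\w+):(.*?)(?=\s+\w+:|\s*$)", re.S)


def extract_fields(xhtml: str) -> dict:
    """Collect key:value metadata from comments (hypothetical helper).

    Comments that only delimit an embedded field (field-start / field-end)
    are skipped here; they are not metadata.
    """
    fields = {}
    for body in COMMENT.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue
        for key, value in PAIR.findall(body):
            fields[key] = value.strip()
    return fields
```

Run against the sample above, this would yield entries such as `documentid`, `documentPDFPage`, `itempath` and `itemsortkey`, each of which could then be sent to Solr as its own field.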
Document Processing / Indexing

Granularized nodes for one title:

                   Title 14

          ch. 1     ch. 2      ch. 3    …
  pt. A   pt. B     pt. C      …
          sec. 1   sec. 2      sec. 3   …
Field Type 2: Embedded Refs
Reference types called out on the slide: Statute at Large, Public Law, USC Refs, Other

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
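
As a rough illustration of the first half of the "Embed Refs" step, the sketch below locates two of the reference types in a fragment of the sample. The patterns and the `find_refs` name are hypothetical and far simpler than a real citation grammar; the slides do not describe how the references are actually recognized or embedded.

```python
import html
import re

# Hypothetical, deliberately narrow citation patterns.
PATTERNS = {
    "statute-at-large": re.compile(r"\b(\d+) Stat\. (\d+)"),
    # After entity decoding, &ndash; becomes U+2013; a plain hyphen is accepted too.
    "public-law": re.compile(r"\bPub\. L\. (\d+)[\u2013-](\d+)"),
}


def find_refs(xhtml_fragment: str) -> list:
    """Return (kind, matched text) pairs in order of appearance."""
    text = html.unescape(xhtml_fragment)
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), kind, m.group(0)))
    return [(kind, matched) for _, kind, matched in sorted(hits)]
```

Each hit could then be wrapped in a link or marker before the XHTML is constructed; that part is left out because the slides give no detail about it.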
Document Processing / Indexing

Repository layout:

§  /US-Code
    §  /2010
        §  /title2
            §  /USC-title2-section1532.htm
            §  /USC-title2-node3-rule5.htm
Part The Second:
Token Processing



Token Processing 1
xhtml tag tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->

Tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
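
The tokenizer's behaviour can be approximated in a few lines. This is only a sketch under stated assumptions: the production tokenizer is presumably a Lucene analysis component, whereas the `tokenize` function and both regular expressions here are hypothetical.

```python
import html
import re

# Comments and tags are kept whole as single tokens.
MARKUP = re.compile(r"(<!--.*?-->|<[^>]+>)", re.S)
# Text between markup is split into runs of letters and digits.
WORD = re.compile(r"\w+")


def tokenize(xhtml: str) -> list:
    """Split XHTML into markup tokens and word tokens, in document order."""
    tokens = []
    for piece in MARKUP.split(xhtml):
        if MARKUP.fullmatch(piece):
            tokens.append(piece)
        else:
            # Decode entities first so &mdash; and &ndash; act as separators.
            tokens.extend(WORD.findall(html.unescape(piece)))
    return tokens
```

Keeping the comments in the token stream is what lets later stages turn the field-start / field-end markers into searchable tokens.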
Field Type 3: Marked Within Doc
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
Token Processing 2
Mark Start and End Tags
<!-- field-start:amendment-note -->   S/amendment
<h4 class="note-head">                <h4 class="note-head">
Amendments                            Amendments
</h4>                                 </h4>
<p class="note-body">                 <p class="note-body">
2002                                  2002
Pub                                   Pub
L                                     L
107                                   107
296                                   296
Substituted                           Substituted
Department                            Department
of                                    of
<!-- field-end:amendment-note -->     E/amendment
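
A small sketch of this marking step. The slide maps `field-start:amendment-note` to `S/amendment`, so the code assumes a trailing `-note` is dropped from the field name; that rule, the regular expression, and the `mark_fields` name are guesses made for illustration.

```python
import re

FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_fields(tokens: list) -> list:
    """Replace field-start / field-end comment tokens with S/ and E/ markers."""
    out = []
    for tok in tokens:
        m = FIELD_MARK.fullmatch(tok)
        if m is None:
            out.append(tok)  # ordinary word or XHTML tag: unchanged
            continue
        name = m.group(2)
        if name.endswith("-note"):  # assumption inferred from the slide output
            name = name[: -len("-note")]
        out.append(("S/" if m.group(1) == "start" else "E/") + name)
    return out
```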


Token Processing 3
Remove XHTML Tags
S/amendment                  S/amendment
<h4 class="note-head">
Amendments                   Amendments
</h4>
<p class="note-body">
2002                         2002
Pub                          Pub
L                            L
107                          107
296                          296
Substituted                  Substituted
Department                   Department
of                           of
E/amendment                  E/amendment


Token Processing 4
Tag Original Case & Lower Case

S/amendment                S/amendment
Amendments                 O/Amendments    L/amendments
2002                       O/2002          L/2002
Pub                        O/Pub           L/pub
L                          O/L             L/l
107                        O/107           L/107
296                        O/296           L/296
Substituted                O/Substituted   L/substituted
Department                 O/Department    L/department
of                         O/of            L/of
E/amendment                E/amendment
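
The two steps just shown (dropping XHTML tags, then emitting an original-case and a lower-case form for every word) might look like this. The sketch assumes both forms share one token position, which is why each position is modelled as a small list; `tag_case` is a hypothetical name.

```python
def tag_case(tokens: list) -> list:
    """Return one list of token forms per position.

    S/ and E/ markers pass through alone; words become [O/word, L/word].
    Any leftover XHTML tag (a token starting with "<") is dropped.
    """
    positions = []
    for tok in tokens:
        if tok.startswith("<"):
            continue
        if tok.startswith(("S/", "E/")):
            positions.append([tok])
        else:
            positions.append(["O/" + tok, "L/" + tok.lower()])
    return positions
```

With both forms indexed, an exact-case query can match on the `O/` token while an ordinary query matches on the `L/` token.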




Token Processing 5
 Lemmatize
         Uses dictionary-based lemmatizer based on GCIDE and WordNet

S/amendment                         S/amendment
O/Amendments    L/amendments        O/Amendments    L/amendments    amendment
O/2002          L/2002              O/2002          L/2002          2002
O/Pub           L/pub               O/Pub           L/pub           pub
O/L             L/l                 O/L             L/l             l
O/107           L/107               O/107           L/107           107
O/296           L/296               O/296           L/296           296
O/Substituted   L/substituted       O/Substituted   L/substituted   substitute
O/Department    L/department        O/Department    L/department    department
O/of            L/of                O/of            L/of            of
E/amendment                         E/amendment
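
A sketch of the lemma step, with a two-entry toy dictionary standing in for the GCIDE/WordNet-based lemmatizer named on the slide. The `LEMMAS` table and `add_lemmas` function are hypothetical; only the shape of the output follows the slide.

```python
# Toy stand-in for the dictionary-based lemmatizer.
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def add_lemmas(positions: list) -> list:
    """Append an untagged lemma to every position that has an L/ form."""
    out = []
    for forms in positions:
        lower = next((f[2:] for f in forms if f.startswith("L/")), None)
        if lower is None:
            out.append(list(forms))  # S/ or E/ marker: nothing to add
        else:
            # Words missing from the dictionary are used as their own lemma.
            out.append(list(forms) + [LEMMAS.get(lower, lower)])
    return out
```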




Part The Third:
Query Processing



Query Processing
                           (not all stages shown)

Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search


   §  Communicates via generic QNode Class
         •  Simpler to manipulate than Lucene operators
   §  Can produce FAST FQL as well
         •  (cue the derisive catcalls)
   §  But most importantly:
         •  It is a Query Processing Pipeline
            §  Mix and match query processing modules


Query Processing
                 exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
    exact:
        |FOIA|
    phrase
        |top|
        |secret|
    amendment:
        |RECORDS|
Query Processing
                 exact:FOIA top secret amendment:RECORDS

and
    O/FOIA
    phrase
        |top|
        |secret|
    amendment:
        |RECORDS|
Query Processing
                 exact:FOIA top secret amendment:RECORDS

and
    O/FOIA
    phrase
        |L/top|
        |L/secret|
    amendment:
        |records|
Query Processing
                 exact:FOIA top secret amendment:RECORDS

and
    O/FOIA
    phrase
        |L/top|
        |L/secret|
    amendment:
        |record|
Query Processing
                 exact:FOIA top secret amendment:RECORDS

and
    O/FOIA
    phrase
        |L/top|
        |L/secret|
    between
        S/amendment
        |record|
        E/amendment
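
The last rewrite in this walk-through, where the `amendment:` field operator becomes a `between` node bounded by `S/amendment` and `E/amendment`, can be sketched as a recursive tree transformation. The slides show only the before and after trees, so the rule "any operator ending in a colon names an embedded field" and the tuple representation are assumptions.

```python
def apply_field_template(node):
    """Rewrite field operators such as ("amendment:", [...]) into between nodes.

    A node is either a term (a plain string) or a tuple (op, [child nodes]).
    """
    if isinstance(node, str):
        return node
    op, args = node
    args = [apply_field_template(a) for a in args]
    if op.endswith(":"):
        name = op[:-1]
        # Start marker, end marker, then the clause that must occur between them.
        return ("between", ["S/" + name, "E/" + name] + args)
    return (op, args)


query = ("and", ["O/FOIA",
                 ("phrase", ["L/top", "L/secret"]),
                 ("amendment:", ["record"])])
rewritten = apply_field_template(query)
```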
The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)

§  start-tag → Starting tag, e.g. S/amendment
§  end-tag → Ending tag, e.g. E/amendment

§  pos-clause → words which must occur between
    start and end
   •  Note: Requires a nested ScanAnd() operator
§  neg-clause → words which must not occur between
    start and end
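
The semantics of the operator can be illustrated over an in-memory token stream. This is not the Lucene implementation, which the slides do not show; it is a sketch in which pos-clause is simplified to a list of words that must all occur, and each position holds the token forms indexed there.

```python
def between(tokens, start_tag, end_tag, pos_clause, neg_clause=()):
    """Return True if some start_tag..end_tag span contains every word in
    pos_clause and none of the words in neg_clause.

    tokens is a list of positions; each position is a list of token forms.
    """
    required, forbidden = set(pos_clause), set(neg_clause)
    span = None  # forms seen since the last start_tag, or None when outside
    for forms in tokens:
        if start_tag in forms:
            span = set()
        elif end_tag in forms and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None
        elif span is not None:
            span |= set(forms)
    return False
```

For example, searching for `record` between `S/amendment` and `E/amendment` only matches when the word occurs inside an amendment note, not when it occurs in the statute text of the same document.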

Part The Fourth:
Hierarchical Navigation



Hierarchies: Requirements
§  Any number of levels
      §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part,
          Section
§  Levels vary across titles
      §  Title 1: 3 levels
      §  Title 26: 8 levels
§  Multiple views:
      §  Children
      §  Ancestors
      §  Ancestor's Siblings
§  Multiple search scopes:
      §  Only children, all descendants, everything

Hierarchies: Ancestor-Siblings
§  US-Code
  •  Title 1
  •  Title 2
     §  Chapter 1
     §  Chapter 2
         –  Part 1
         –  Part 2
              •  Section 2.1
              •  Section 2.2
         –  Part 3
         –  Part 4
     §  Chapter 3
     §  Chapter 4
  •  Title 3

Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§  parentId – ID of the parent node
      §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. 13/000/0/00882
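
A sketch of how the `ancestors` and `encodedAncestors` values could be derived. It assumes, as the example IDs suggest, that each level appends one `-segment` to its parent's ID; the function names are hypothetical.

```python
def ancestor_ids(parent_id: str) -> list:
    """Every prefix of a hyphen-built ID, from the root down to parent_id."""
    parts = parent_id.split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]


def encode_ancestors(ancestors) -> str:
    """Join (id, heading, subjectTitle) triples as id;heading;subjectTitle//..."""
    return "//".join(";".join(triple) for triple in ancestors)


parent_id = "USC-title2-chapter25-subchapter2"
ancestors_field = " ".join(ancestor_ids(parent_id))
encoded = encode_ancestors([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
```

Storing every ancestor ID in one space-separated field means a single term query on `ancestors` reaches all descendants of a node, at any depth.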

Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

                    01/011/1/02032

01 = USC Title
011 = Title 11
1 = An Appendix
02032 = Sequence # in file
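
Because every component is zero-padded to a fixed width, a plain string sort on the key reproduces print order. A minimal sketch follows; the parameter names are hypothetical and the widths (2/3/1/5) are read off the example above.

```python
def treesort_key(group: int, title: int, appendix: bool, sequence: int) -> str:
    """Build a fixed-width sort key such as 01/011/1/02032."""
    return f"{group:02d}/{title:03d}/{int(appendix)}/{sequence:05d}"


keys = [
    treesort_key(13, 0, False, 882),
    treesort_key(1, 11, True, 2032),
    treesort_key(1, 11, False, 15),
]
in_print_order = sorted(keys)  # lexicographic order equals numeric order here
```

Without the padding, "2" would sort after "11" as a string, which is why the fixed length matters.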




Hierarchies: Sample Searches
§  Assuming Node = USC-title2-chapter25
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR
      parentId:USC-title2-chapter25)
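
These three query shapes can be generated from a node ID alone, assuming IDs are built by appending one `-segment` per level as in the examples. The helper names below are hypothetical; the output strings match the queries shown above.

```python
def children_query(node_id: str) -> str:
    return f"parentId:{node_id}"


def descendants_query(node_id: str) -> str:
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id: str) -> str:
    """Children of the node and of each of its ancestors, as one OR query."""
    parts = node_id.split("-")
    ids = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{i}" for i in ids) + ")"
```

The resulting strings could be passed as the `q` or `fq` parameter of a Solr request, optionally sorted on `treesort` to display the results in print order.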




Contact
§  Paul Nelson
   •  pnelson@searchtechnologies.com
§  Ronald Matamoros
   •  rmatamoros@searchtechnologies.com
§  Search Technologies
   •  http://searchtechnologies.com




                                          40

More Related Content

Viewers also liked

Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Lucidworks (Archived)
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationLucidworks (Archived)
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル彰 村地
 
Hellosong
HellosongHellosong
Hellosongtanica
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrellaguest986e5ae
 
Discover the new techniques about search application
Discover the new techniques about search applicationDiscover the new techniques about search application
Discover the new techniques about search applicationLucidworks (Archived)
 
Using Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceUsing Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceLucidworks (Archived)
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemLucidworks (Archived)
 
Zombie
ZombieZombie
Zombietanica
 
Civil War
Civil WarCivil War
Civil Wartanica
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Tv ролики
Tv роликиTv ролики
Tv роликиtarodnova
 

Viewers also liked (18)

Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"
 
All Data Big and Small
All Data Big and SmallAll Data Big and Small
All Data Big and Small
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to Information
 
What’s New in Apache Lucene 3.0
What’s New in Apache Lucene 3.0What’s New in Apache Lucene 3.0
What’s New in Apache Lucene 3.0
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
 
Hellosong
HellosongHellosong
Hellosong
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrella
 
Discover the new techniques about search application
Discover the new techniques about search applicationDiscover the new techniques about search application
Discover the new techniques about search application
 
Using Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceUsing Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User Experience
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search Problem
 
What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
Zombie
ZombieZombie
Zombie
 
Civil War
Civil WarCivil War
Civil War
 
Portades
PortadesPortades
Portades
 
Linked In Introduction
Linked In IntroductionLinked In Introduction
Linked In Introduction
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Tv ролики
Tv роликиTv ролики
Tv ролики
 

Similar to Searching the United States Code with Solr/Lucene

Searching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald MatamorosSearching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald Matamoroslucenerevolution
 
Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Alison Hitchens
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs HadoopFujio Turner
 
Paper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresPaper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresHyo jeong Lee
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in databasegafurov_x
 
Apache solr tech doc
Apache solr tech docApache solr tech doc
Apache solr tech docBarot Sagar
 
Ugif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUgif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUGIF
 
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPOpen Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPPieter De Leenheer
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
SPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumSPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumThomas Francart
 
Oracle10g New Features I
Oracle10g New Features IOracle10g New Features I
Oracle10g New Features IDenish Patel
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 

Similar to Searching the United States Code with Solr/Lucene (13)

Searching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald MatamorosSearching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald Matamoros
 
Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Paper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresPaper_Scalable database logging for multicores
Paper_Scalable database logging for multicores
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Apache solr tech doc
Apache solr tech docApache solr tech doc
Apache solr tech doc
 
Ugif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUgif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugif
 
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPOpen Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
SPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumSPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseum
 
Oracle10g New Features I
Oracle10g New Features IOracle10g New Features I
Oracle10g New Features I
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 

Searching the United States Code with Solr/Lucene

  • 1. Searching the United States Code with Solr/Lucene. Paul Nelson (pnelson@searchtechnologies.com) and Ronald Matamoros (rmatamoros@searchtechnologies.com), Search Technologies, 5/25/2011
  • 2. Searching the United States Code
      §  Who are we:
         •  Paul Nelson, Chief Architect
         •  Ronald Matamoros, Lead Engineer
      §  Our Mission: Replace Personal Librarian Search
         •  A 20-Year-Old Search Engine!
      §  Key Challenges
         •  How to index this massive, complex, 85-year-old document?
         •  How to replicate 20-Year-Old search features?
      §  Government Documents are Fun!
  • 3. Search Technologies
      §  The largest independent provider of enterprise search expertise and services
      §  80 full-time dedicated search engine experts
      §  200+ customers
      §  Technology Neutral
         •  (yeah, we know Sphinx too)
      §  Offices All Over
         •  DC, NY, CA, MD, OH, UK, CR…
  • 4. A Quick Civics Lesson…
      §  The United States Code
         •  The general & permanent laws of the U.S. Government – All in one place
         •  51 titles
            §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
         •  First Version: 1926
      §  The Office of the Law Revision Counsel (OLRC)
         •  20 lawyers who author the U.S. Code
         •  They report to the Speaker of the House of Representatives
      §  Bonus Question: Which Title is the largest?
  • 5. Major Challenges
      1.  Document Parsing
         •  A 50 Volume Table Of Contents!
      2.  Query Parsing
         •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
      3.  Searching & Highlighting Fields
         •  Some fields are embedded in the document
         •  These fields must be highlighted in context
  • 10. Part The First: Document Processing
  • 11. Document Processing / Indexing (pipeline diagram; box labels: USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Xform & Store, Index, Solr, Repository)
  • 12. Field Type 1: Extracted to Index
      Slide callouts: Page Numbers, Heading, Title, Source Credit
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
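The slide shows document-level fields carried as `key:value` pairs inside XHTML comments. A minimal sketch of how such a header could be pulled apart before indexing is shown here; it is not the authors' parser. The function name `extract_metadata` and the rule that a key is a word followed by a colon at the start of the comment or after whitespace are assumptions made for illustration.

```python
import re

# One XHTML comment, with surrounding whitespace trimmed from its body.
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
# Assumption: a key is a word followed by ':' at the start of the body or after whitespace.
KEY_RE = re.compile(r"(?:^|\s+)([A-Za-z]\w*):")


def extract_metadata(xhtml: str) -> dict:
    """Collect key:value pairs from header comments, skipping field markers."""
    fields = {}
    for body in COMMENT_RE.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue  # in-document field markers are handled at token level
        parts = KEY_RE.split(body)
        # parts looks like ['', key1, value1, key2, value2, ...]
        for key, value in zip(parts[1::2], parts[2::2]):
            fields[key] = value.strip()
    return fields


sample = (
    '<!-- documentid:14_1 usckey:140000000000100000000000000000000 '
    'currentthrough:20080108 documentPDFPage:3 -->\n'
    '<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->\n'
    '<!-- itemsortkey:140AAAD -->\n'
    '<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>\n'
    '<!-- field-end:head -->\n'
)
metadata = extract_metadata(sample)
```

Values that contain spaces, such as `itempath`, survive because only a word immediately followed by a colon starts a new key.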
  • 13. Document Processing / Indexing (pipeline diagram; box labels: USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Xform & Store, Index, Solr, Repository)
      Granularization example: Title 14 → ch. 1, ch. 2, ch. 3, … → pt. A, pt. B, pt. C, … → sec. 1, sec. 2, sec. 3, …
  • 14. Field Type 2: Embedded Refs
      Slide callouts: Statute at Large, Public Law, USC Refs, Other
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
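The callouts on this slide point at citations (Statutes at Large, Public Laws) that sit inside the running text. How the real pipeline recognizes and embeds them is not shown; the sketch below only illustrates the recognition step with two regular expressions. The patterns, the function name `find_refs`, and the `publiclaw` / `statute` labels are assumptions, not names from the talk.

```python
import html
import re

# Hypothetical patterns for two citation styles visible in the sample text.
PUBLIC_LAW_RE = re.compile(r"Pub\.\s*L\.\s*(\d+)[\u2013-](\d+)")
STATUTE_RE = re.compile(r"(\d+)\s+Stat\.\s+(\d+)")


def find_refs(fragment: str) -> list:
    """Return (kind, normalized citation) pairs found in an XHTML text fragment."""
    text = html.unescape(fragment)  # turns &ndash; into an en dash, &sect; into a section sign
    refs = [("publiclaw", f"{congress}-{number}")
            for congress, number in PUBLIC_LAW_RE.findall(text)]
    refs += [("statute", f"{volume} Stat. {page}")
             for volume, page in STATUTE_RE.findall(text)]
    return refs


source_credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),"
refs = find_refs(source_credit)
```

Decoding entities first means one pattern covers both the `&ndash;` form used in the XHTML and a plain hyphen.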
  • 15. Document Processing / Indexing (pipeline diagram; box labels: USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Xform & Store, Index, Solr, Repository)
  • 16. Document Processing / Indexing (pipeline diagram; box labels: USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Xform & Store, Index, Solr, Repository)
      Repository layout:
      §  /US-Code
         §  /2010
            §  /title2
               §  /USC-title2-section1532.htm
               §  /USC-title2-node3-rule5.htm
  • 17. Part The Second: Token Processing
  • 18. Token Processing 1: xhtml tag tokenizer
      Input:
        <!-- field-start:amendment-note -->
        <h4 class="note-head">Amendments</h4>
        <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
        <!-- field-end:amendment-note -->
      Output tokens:
        <!-- field-start:amendment-note --> | <h4 class="note-head"> | Amendments | </h4> | <p class="note-body"> | 2002 | Pub | L | 107 | 296 | Substituted | Department | of | <!-- field-end:amendment-note -->
  • 19. Field Type 3: Marked Within Doc
      <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
      <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
      <!-- itemsortkey:140AAAD -->
      <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
      <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
      <!-- field-end:head -->
      <!-- field-start:statute -->
      <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
      <!-- field-end:statute -->
      <!-- field-start:sourcecredit -->
      <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
      <!-- field-end:sourcecredit -->
      <!-- field-start:notes -->
      <!-- field-start:historicalandrevision-note -->
      <h4 class="note-head">Historical and Revision Notes</h4>
      <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
      <!-- field-end:historicalandrevision-note -->
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
      <!-- field-start:effectivedate-amendment-note -->
      <h4 class="note-head">Effective Date of 2002 Amendment</h4>
      <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 20. Token Processing 2: Mark Start and End Tags
      Before                                   After
      <!-- field-start:amendment-note -->      S/amendment
      <h4 class="note-head">                   <h4 class="note-head">
      Amendments                               Amendments
      </h4>                                    </h4>
      <p class="note-body">                    <p class="note-body">
      2002                                     2002
      Pub                                      Pub
      L                                        L
      107                                      107
      296                                      296
      Substituted                              Substituted
      Department                               Department
      of                                       of
      <!-- field-end:amendment-note -->        E/amendment
  • 21. Token Processing 3: Remove XHTML Tags
      Before                     After
      S/amendment                S/amendment
      <h4 class="note-head">     (removed)
      Amendments                 Amendments
      </h4>                      (removed)
      <p class="note-body">      (removed)
      2002                       2002
      Pub                        Pub
      L                          L
      107                        107
      296                        296
      Substituted                Substituted
      Department                 Department
      of                         of
      E/amendment                E/amendment
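The first three token-processing steps (tokenize XHTML, turn field comments into start and end markers, drop the remaining tags) can be sketched in plain Python as below. The production system presumably does this inside Lucene analysis components; these function names, the regular expressions, and the rule that a `-note` suffix is dropped from the field name (to match the `S/amendment` marker on the slides) are assumptions.

```python
import html
import re

# Comments first, then tags, then runs of non-tag text.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|[^<\s]+", re.DOTALL)
FIELD_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")
WORD_RE = re.compile(r"\w+")


def tokenize(xhtml: str) -> list:
    """Step 1: split into comments, tags, and word tokens."""
    tokens = []
    for piece in TOKEN_RE.findall(xhtml):
        if piece.startswith("<"):
            tokens.append(piece)
        else:
            tokens.extend(WORD_RE.findall(html.unescape(piece)))
    return tokens


def mark_fields(tokens: list) -> list:
    """Step 2: replace field comments with S/<name> and E/<name> markers."""
    marked = []
    for tok in tokens:
        match = FIELD_RE.fullmatch(tok)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            # Assumption: the "-note" suffix is not part of the marker name.
            marked.append(prefix + match.group(2).removesuffix("-note"))
        else:
            marked.append(tok)
    return marked


def strip_tags(tokens: list) -> list:
    """Step 3: drop every remaining tag or comment token."""
    return [tok for tok in tokens if not tok.startswith("<")]


fragment = (
    '<!-- field-start:amendment-note -->'
    '<h4 class="note-head">Amendments</h4>'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of'
    '<!-- field-end:amendment-note -->'
)
tokens = strip_tags(mark_fields(tokenize(fragment)))
```

Because the markers stay in the token stream at their original positions, a later query can ask for words that fall between a start marker and its end marker.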
  • 22. Token Processing 4: Tag Original Case & Lower Case
      Before          After
      S/amendment     S/amendment
      Amendments      O/Amendments L/amendments
      2002            O/2002 L/2002
      Pub             O/Pub L/pub
      L               O/L L/l
      107             O/107 L/107
      296             O/296 L/296
      Substituted     O/Substituted L/substituted
      Department      O/Department L/department
      of              O/of L/of
      E/amendment     E/amendment
  • 23. Token Processing 5: Lemmatize
      Uses a dictionary-based lemmatizer built on GCIDE and WordNet
      Before                          After
      S/amendment                     S/amendment
      O/Amendments L/amendments       O/Amendments L/amendments amendment
      O/2002 L/2002                   O/2002 L/2002 2002
      O/Pub L/pub                     O/Pub L/pub pub
      O/L L/l                         O/L L/l l
      O/107 L/107                     O/107 L/107 107
      O/296 L/296                     O/296 L/296 296
      O/Substituted L/substituted     O/Substituted L/substituted substitute
      O/Department L/department       O/Department L/department department
      O/of L/of                       O/of L/of of
      E/amendment                     E/amendment
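Steps 4 and 5 expand each word into three index terms: the original-case form (`O/`), the lowercase form (`L/`), and a bare lemma. The sketch below models each token position as a list of terms. The tiny `TOY_LEMMAS` dictionary is a stand-in for the GCIDE/WordNet-based lemmatizer named on the slide, and the function name `expand` is hypothetical.

```python
# Stand-in for the dictionary-based lemmatizer; real coverage would come from GCIDE and WordNet.
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def expand(tokens: list, lemmas: dict = TOY_LEMMAS) -> list:
    """Return one list of index terms per token position."""
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append([tok])  # field markers pass through unchanged
            continue
        lower = tok.lower()
        # O/ keeps the original case, L/ is the lowercase form, and the bare term is the lemma.
        positions.append(["O/" + tok, "L/" + lower, lemmas.get(lower, lower)])
    return positions


positions = expand(["S/amendment", "Amendments", "Pub", "Substituted", "E/amendment"])
```

Keeping all three forms at the same position is what lets one index answer exact-case, case-insensitive, and lemmatized queries.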
  • 24. Part The Third: Query Processing
  • 25. Query Processing (not all stages shown)
      Pipeline diagram: Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search
      §  Communicates via generic QNode Class
         •  Simpler to manipulate than Lucene operators
      §  Can produce FAST FQL as well
         •  (cue the derisive catcalls)
      §  But most importantly:
         •  It is a Query Processing Pipeline
            §  Mix and match query processing modules
  • 26. Query Processing: exact:FOIA top secret amendment:RECORDS
      Pipeline diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search
      Query tree: and( exact:( |FOIA| ), phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 27. Query Processing: exact:FOIA top secret amendment:RECORDS
      Query tree: and( O/FOIA, phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 28. Query Processing: exact:FOIA top secret amendment:RECORDS
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |records| ) )
  • 29. Query Processing: exact:FOIA top secret amendment:RECORDS
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |record| ) )
  • 30. Query Processing: exact:FOIA top secret amendment:RECORDS
      Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), between( S/amendment, |record|, E/amendment ) )
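The slide sequence above walks one query through the pipeline, with each stage rewriting a tree of generic nodes. A rough sketch of that idea follows. Only the name QNode comes from the slides; its fields, the stage functions, the `TOY_LEMMAS` dictionary, and the hand-built starting tree (standing in for the parse stage) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

TOY_LEMMAS = {"records": "record"}  # stand-in for the real lemmatizer


@dataclass
class QNode:
    op: str                      # "and", "phrase", "exact", "field", "between" or "term"
    value: str = ""              # term text, or the field name when op == "field"
    children: list = field(default_factory=list)


def transform(node, stage):
    """Rebuild the tree bottom-up, letting the stage replace each node."""
    rebuilt = QNode(node.op, node.value, [transform(c, stage) for c in node.children])
    return stage(rebuilt)


def is_marked(text):
    return text[:2] in ("O/", "L/", "S/", "E/")


def mark_original(n):
    # exact:TERM becomes an original-case index term.
    return QNode("term", "O/" + n.children[0].value) if n.op == "exact" else n


def mark_lowercase(n):
    if n.op == "phrase":  # phrase words match the lowercase form, not the lemma
        return QNode("phrase", children=[QNode("term", "L/" + c.value) for c in n.children])
    if n.op == "term" and not is_marked(n.value):
        return QNode("term", n.value.lower())
    return n


def lemmatize(n):
    if n.op == "term" and not is_marked(n.value):
        return QNode("term", TOY_LEMMAS.get(n.value, n.value))
    return n


def apply_template(n):
    if n.op == "field":  # field:TERM becomes a between(S/field, ..., E/field) scan
        start, end = QNode("term", "S/" + n.value), QNode("term", "E/" + n.value)
        return QNode("between", children=[start, *n.children, end])
    return n


def render(n):
    if n.op == "term":
        return n.value
    return f"{n.op}({', '.join(render(c) for c in n.children)})"


tree = QNode("and", children=[
    QNode("exact", children=[QNode("term", "FOIA")]),
    QNode("phrase", children=[QNode("term", "top"), QNode("term", "secret")]),
    QNode("field", "amendment", [QNode("term", "RECORDS")]),
])
for stage in (mark_original, mark_lowercase, lemmatize, apply_template):
    tree = transform(tree, stage)
result = render(tree)
```

Since every stage has the same shape (node in, node out), stages can be reordered, dropped, or swapped, which is the mix-and-match property the talk emphasizes.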
  • 31. The between() Operator
      §  between(start-tag, end-tag, pos-clause, neg-clause)
      §  start-tag → Starting tag, e.g. S/amendment
      §  end-tag → Ending tag, e.g. E/amendment
      §  pos-clause → words which must occur between start and end
         •  Note: Requires a nested ScanAnd() operator
      §  neg-clause → words which must not occur between start and end
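The semantics of between() can be illustrated on a flat token list. This is only a model of the matching rule described on the slide, not the Lucene operator itself (which, per the slide, needs a nested ScanAnd()); the Python signature and the choice to treat the clauses as simple word lists are assumptions.

```python
def between(tokens, start_tag, end_tag, pos_words, neg_words=()):
    """True if some start_tag..end_tag span holds every pos word and no neg word."""
    required, forbidden = set(pos_words), set(neg_words)
    span = None  # words seen since the most recent start tag, or None when outside a field
    for tok in tokens:
        if tok == start_tag:
            span = set()
        elif tok == end_tag and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None
        elif span is not None:
            span.add(tok)
    return False


doc = ["S/head", "coast", "guard", "E/head",
       "S/amendment", "record", "department", "E/amendment"]
hit = between(doc, "S/amendment", "E/amendment", ["record"])
miss_wrong_field = between(doc, "S/amendment", "E/amendment", ["coast"])
miss_negated = between(doc, "S/amendment", "E/amendment", ["record"], ["department"])
```

A word that appears only in another field, such as "coast" in the heading here, does not satisfy an amendment-scoped search.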
  • 34. Hierarchies: Requirements
      §  Any number of levels
         §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
      §  Levels vary across titles
         §  Title 1: 3 levels
         §  Title 26: 8 levels
      §  Multiple views:
         §  Children
         §  Ancestors
         §  Ancestor's Siblings
      §  Multiple search scopes:
         §  Only children, all descendants, everything
  • 35. Hierarchies: Ancestor-Siblings
      §  US-Code
         •  Title 1
         •  Title 2
            §  Chapter 1
            §  Chapter 2
               –  Part 1
               –  Part 2
                  •  Section 2.1
                  •  Section 2.2
               –  Part 3
               –  Part 4
            §  Chapter 3
            §  Chapter 4
         •  Title 3
  • 36. Hierarchies: Fields
      §  ancestors
         •  Searching
            §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
      §  encodedAncestors – for display only
         •  Where the node exists within the hierarchy
            §  id;heading;subjectTitle//id;heading;subjectTitle//...
            §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
      §  parentId – ID of the parent node
         §  USC-title2-chapter25-subchapter2
      §  treesort – Hierarchical sort field, e.g. 13/000/0/00882
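As a rough illustration of how the three hierarchy fields relate, the sketch below derives them from the chain of nodes above a document. The function name `hierarchy_fields` and the tuple layout are assumptions, and the sample chain includes only the two levels whose headings appear on the slide.

```python
def hierarchy_fields(chain):
    """chain: (id, heading, subjectTitle) tuples from the top of the tree down to the parent."""
    ids = [node_id for node_id, _heading, _subject in chain]
    return {
        # Space-separated IDs: searchable as individual terms.
        "ancestors": " ".join(ids),
        # id;heading;subjectTitle groups joined by '//': used for display only.
        "encodedAncestors": "//".join(";".join(node) for node in chain),
        # The nearest ancestor is the parent.
        "parentId": ids[-1] if ids else "",
    }


chain = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II", "Regulatory Accountability and Reform"),
]
fields = hierarchy_fields(chain)
```

Storing the whole ancestor chain on every document trades index size for the ability to scope a search to any subtree with a single term query.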
  • 37. Hierarchies: Tree Sort
      §  Sorting In Print Order
         •  Front Matter → Titles → Tables → etc.
         •  Everything padded to fixed-length
      Example: 01/011/1/02032
         01 = USC Title Sequence # in file
         011 = Title 11
         1 = An Appendix
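The point of fixed-length padding is that a plain string sort then reproduces print order. A small sketch under stated assumptions: the slide explains only the first three components of the key, so treating the fourth as a position within the title is a guess, and the function name `treesort_key` and the component widths are inferred from the single example.

```python
def treesort_key(sequence: int, title: int, is_appendix: bool, position: int) -> str:
    """Build a zero-padded key so that lexicographic order equals numeric order."""
    # Assumption: the last component is the node's position within the title.
    return f"{sequence:02d}/{title:03d}/{int(is_appendix)}/{position:05d}"


example = treesort_key(1, 11, True, 2032)

keys = [
    treesort_key(1, 11, True, 2032),
    treesort_key(1, 2, False, 5),
    treesort_key(1, 11, False, 1),
]
in_print_order = sorted(keys)
```

Without the padding, "11" would sort before "2" as a string; with it, Title 2 comes first, and within Title 11 the appendix follows the main text.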
  • 38. Hierarchies: Sample Searches
      §  Assuming Node = USC-title2-chapter25
      §  Search Children
         •  parentId:USC-title2-chapter25
      §  Search All Descendants
         •  ancestors:USC-title2-chapter25
      §  Ancestor Siblings
         •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
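The three sample searches can be generated from a node ID alone, assuming IDs are hyphen-joined paths with no hyphen inside a level (true of the IDs shown on the slides, but an assumption in general). The helper names below are hypothetical.

```python
def ancestor_path(node_id: str) -> list:
    """USC-title2-chapter25 -> [USC, USC-title2, USC-title2-chapter25]."""
    parts = node_id.split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]


def children_query(node_id: str) -> str:
    return f"parentId:{node_id}"


def descendants_query(node_id: str) -> str:
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id: str) -> str:
    # Children of every node on the path from the root down to this node.
    clauses = " OR ".join(f"parentId:{a}" for a in ancestor_path(node_id))
    return f"({clauses})"


node = "USC-title2-chapter25"
queries = (children_query(node), descendants_query(node), ancestor_siblings_query(node))
```

The ancestor-siblings query returns exactly the nodes needed to draw an expanded tree view down to the current node.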
  • 39. Contact
      §  Paul Nelson
         •  pnelson@searchtechnologies.com
      §  Ronald Matamoros
         •  rmatamoros@searchtechnologies.com
      §  Search Technologies
         •  http://searchtechnologies.com