
MTAS Henny Brugman


CLARIAH-dag 2016

Published in: Science

  1. Multi Tier Annotation Search (MTAS)
     Matthijs Brouwer, Meertens Institute
     December 8, 2015
  2. Outline
     1 Introduction
     2 Lucene
     3 MTAS
     4 Tokenizer FoLiA
     5 Search using CQL
     6 Results
  3. Introduction: Text and Metadata
     Provide search on a combination of text and metadata.
     Example data:
       Author: Eduard Douwes Dekker
       Place of birth: Amsterdam
       Date of birth: 1820, March 2
       Pseudonym: Multatuli
       Title: Max Havelaar
       Published: 1860
       Text: Ik ben makelaar in koffie en woon op de Lauriergracht no 37 ...
  4. Introduction: Solution based on Apache Solr
     Reverse index:
       Apache Solr (based on Apache Lucene)
       Index on both text and metadata
     Advantages:
       Search
       Facets
       Scalable
       Custom plugin (join)
       Actively developed
  5. Introduction: Search Text
     'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
     We can search for:
       "Makelaar"
       "Makelaar in koffie"
       "Makel.* in koffie"
  6. Introduction: Annotations
     'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
       text      lemma     pos/features
       Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
       ben       zijn      WW(pv,tgw,ev)
       makelaar  makelaar  N(soort,ev,basis,zijd,stan)
       in        in        VZ(init)
       koffie    koffie    N(soort,ev,basis,zijd,stan)
       ,         ,         LET()
       ...       ...       ...
  7. Introduction: FoLiA
     <text xml:id="untitled.text">
       <p xml:id="untitled.p.1">
         <s xml:id="untitled.p.1.s.1">
           <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
             <t>Ik</t>
             <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
               <feat class="pers" subset="vwtype"/>
               <feat class="pron" subset="pdtype"/>
               <feat class="nomin" subset="naamval"/>
               <feat class="vol" subset="status"/>
               <feat class="1" subset="persoon"/>
               <feat class="ev" subset="getal"/>
             </pos>
             <morphology>
               <morpheme>
                 <t offset="0">ik</t>
               </morpheme>
             </morphology>
             <lemma class="ik"/>
           </w>
           ...
  8. Introduction: Required functionality
     Extend the current Solr solution:
       Search on annotations such as pos, lemma, features, ...
       Search on sentences, paragraphs, chapters, ...
       Search on entities and chunks
       Search on dependencies
       Statistics, grouping, facets, ...
     Important:
       Maintain functionality and scalability
       Upgradeable to new releases of Solr/Lucene
  9. Lucene: Tokenization
     Something about Lucene internals, with a focus on text.
     Tokenization: text is split up into tokens, each with a
       value, e.g. "koffie"
       position, e.g. 4
       offset, e.g. 19-24
       payload, e.g. 1.000
     'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
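
The position and offset values on this slide can be reproduced with a few lines of Python. This is only a sketch of the idea, not Lucene's tokenizer: the regular expression, the `Token` type and the choice to give punctuation its own position are assumptions made for illustration (they happen to yield position 4 and offset 19-24 for "koffie"). Payloads are left out.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    value: str
    position: int      # index of the token in the token stream
    start_offset: int  # character offset of the first character
    end_offset: int    # character offset of the last character (inclusive)


def tokenize(text: str) -> List[Token]:
    # Assumption: every word and every punctuation mark takes one position.
    matches = re.finditer(r"\w+|[^\w\s]", text)
    return [Token(m.group(), i, m.start(), m.end() - 1) for i, m in enumerate(matches)]


TEXT = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."
tokens = tokenize(TEXT)
print(tokens[4])  # Token(value='koffie', position=4, start_offset=19, end_offset=24)
```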
  10. Lucene: Reverse Index
      The token stream is used to construct the reverse index:
        text      document  position  offset  payload
        ben       0         1         3-5     0.500
        de        0         9         38-39   0.200
        en        0         6         27-28   0.250
        in        0         3         16-17   0.350
        koffie    0         4         19-24   0.900
        makelaar  0         2         7-14    0.800
        ...       ...       ...       ...     ...
      This enables fast search, since the locations of matching terms can be found very quickly.
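
A reverse index of this shape can be sketched as a dictionary from term to postings. The function name, the lowercasing of terms and the tuple layout `(document, position, start offset, end offset)` are illustrative assumptions; payloads are omitted, and Lucene's actual on-disk structures are considerably more involved.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int, int, int]  # (document, position, start offset, end offset)


def build_reverse_index(docs: List[str]) -> Dict[str, List[Posting]]:
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: words and punctuation marks each take one position.
        for position, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
            index[m.group().lower()].append((doc_id, position, m.start(), m.end() - 1))
    return dict(index)


docs = ["Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."]
index = build_reverse_index(docs)
print(index["koffie"])  # [(0, 4, 19, 24)]
```

Looking up a term is then a single dictionary access, which is the reason the slide calls this kind of search fast.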
  11. Lucene: Limitations
      Limitations of this approach:
        Heavily based on grouping by document
          Collecting statistics
          Grouping results
        Not possible to include
          Structural information: sentences, paragraphs, ...
          Annotations: pos, lemmas, ...
          Relations: dependencies, chunking, ...
        No real forward index
          Finding all tokens for a given position
  12. Lucene: Alternatives
      Alternative solutions:
        Graph database
          Experiments with Neo4j: problems with scalability and performance
          Too general, doesn't use the sequential nature of textual data
        BlackLab
          Based on Lucene, no integration with Solr
          Different fields for each annotation layer
  13. MTAS: General
      Extension provided by MTAS:
        Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations
        Use the payload to encode additional information on each token
        Construct forward indexes by extending the Lucene Codec
      Implementation:
        Extension based on the Lucene library
        Provide query handlers for the extended data structures
        Provide a Solr plugin using the MTAS extension
  14. MTAS: Prefixes
      Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations.
        text        document  position
        lemma:de    0         9
        lemma:zijn  0         1
        ...         ...       ...
        pos:LID     0         9
        pos:WW      0         1
        ...         ...       ...
        t:ben       0         1
        t:de        0         9
        ...         ...       ...
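
The prefix idea can be illustrated with a small sketch in which every annotation layer of a word becomes its own term at the same position. The input dictionary and the function name are hypothetical; only the `t:`, `lemma:` and `pos:` prefixes and the example values come from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: position -> {layer prefix: value}, using the slide's values.
annotated = {
    1: {"t": "ben", "lemma": "zijn", "pos": "WW"},
    9: {"t": "de", "lemma": "de", "pos": "LID"},
}


def prefixed_postings(doc_id: int, words: Dict[int, Dict[str, str]]) -> Dict[str, List[Tuple[int, int]]]:
    index = defaultdict(list)
    for position, layers in words.items():
        for prefix, value in layers.items():
            # All layers of one word share the same position.
            index[f"{prefix}:{value}"].append((doc_id, position))
    return dict(sorted(index.items()))


postings = prefixed_postings(0, annotated)
for term, hits in postings.items():
    print(term, hits)
```

Because the terms are sorted, all terms of one layer end up next to each other, as in the table on the slide.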
  15. MTAS: Payload
      Use the payload to encode additional information on each token:
        mtas id      integer identifying the token within a document
        position     type of position: single, range or set; additional information for range or set
        offset       start and end offset
        real offset  start and end real offset
        parent       reference to another token by its mtas id
        payload      original payload
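
As a rough mental model, the fields listed on this slide can be written down as a record. The class below is a hypothetical in-memory sketch, not the binary payload encoding MTAS uses; the field names follow the slide, while the types, the string values for the position type and the example objects are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class MtasToken:
    mtas_id: int                       # identifies the token within a document
    value: str                         # prefixed value, e.g. "t:koffie"
    position_type: str                 # "single", "range" or "set"
    position_info: Tuple[int, ...]     # (p,), (start, end) or the explicit positions
    offset: Tuple[int, int]            # start and end offset
    real_offset: Tuple[int, int]       # start and end real offset, as named on the slide
    parent: Optional[int] = None       # mtas id of another token
    payload: Optional[float] = None    # original payload

    def positions(self) -> Set[int]:
        """Expand the position description into the set of covered positions."""
        if self.position_type == "single":
            return {self.position_info[0]}
        if self.position_type == "range":
            start, end = self.position_info
            return set(range(start, end + 1))
        if self.position_type == "set":
            return set(self.position_info)
        raise ValueError(f"unknown position type: {self.position_type}")


# Illustrative objects: a sentence spanning positions 0-13 and one of its words.
sentence = MtasToken(0, "s", "range", (0, 13), (0, 60), (0, 60))
word = MtasToken(1, "t:koffie", "single", (4,), (19, 24), (19, 24), parent=0, payload=0.9)
```

Range and set positions are what allow a single token to describe a sentence, a chunk or a discontinuous entity rather than one word.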
  16. MTAS: Forward Indexes
      Construct forward indexes by extending the Lucene Codec:
        Position         Given the position within the document, return references to all objects on that position.
        Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent.
        Object Id        Given the id, return a reference to the object.
        Prefix/Position  Given prefix and position, return the value.
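
The four lookups can be mimicked with plain dictionaries. This is a sketch under stated assumptions: the object tuples, the parent links from words to a sentence and the one-value-per-prefix simplification are invented for illustration, and the real forward indexes live inside a Lucene codec rather than in Python dictionaries.

```python
from collections import defaultdict

# Hypothetical objects: (mtas id, prefix, value, positions, parent mtas id)
objects = [
    (0, "s", "", range(0, 14), None),
    (1, "t", "ben", [1], 0),
    (2, "lemma", "zijn", [1], 0),
    (3, "pos", "WW", [1], 0),
    (4, "t", "de", [9], 0),
    (5, "pos", "LID", [9], 0),
]

by_object_id = {}               # Object Id       -> object
by_position = defaultdict(list) # Position        -> mtas ids on that position
by_parent = defaultdict(list)   # Parent Id       -> mtas ids referring to it
by_prefix_position = {}         # Prefix/Position -> value (one value per prefix assumed)

for obj in objects:
    mtas_id, prefix, value, positions, parent = obj
    by_object_id[mtas_id] = obj
    if parent is not None:
        by_parent[parent].append(mtas_id)
    for position in positions:
        by_position[position].append(mtas_id)
        by_prefix_position[(prefix, position)] = value

print(by_position[1])                   # [0, 1, 2, 3]
print(by_prefix_position[("pos", 9)])   # LID
```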
  17. MTAS: Usage of the new structure
      The additions make it possible to quickly retrieve the information required for queries and results based on the annotated text. To take advantage of these additions to the Lucene structure, we need:
        A tokenizer mapping the original annotated data (FoLiA) onto the new structure
        Query handlers and a query language: CQL
  18. Tokenizer FoLiA: FoLiA
      <text xml:id="untitled.text">
        <p xml:id="untitled.p.1">
          <s xml:id="untitled.p.1.s.1">
            <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
              <t>Ik</t>
              <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
                <feat class="pers" subset="vwtype"/>
                <feat class="pron" subset="pdtype"/>
                <feat class="nomin" subset="naamval"/>
                <feat class="vol" subset="status"/>
                <feat class="1" subset="persoon"/>
                <feat class="ev" subset="getal"/>
              </pos>
              <morphology>
                <morpheme>
                  <t offset="0">ik</t>
                </morpheme>
              </morphology>
              <lemma class="ik"/>
            </w>
            ...
  19. Tokenizer FoLiA
      Several elements can be distinguished:
        Words: <w/>
        Annotations on words: <pos/>, <t/>, <lemma/>
        Groups of words: <p/>, <s/>, <div/>
        Annotations on groups: <lang/>
        References: <wref/>
        Relations: <entity/>
      The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
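
To make the mapping concrete, the sketch below reads a trimmed, namespace-free FoLiA-like fragment with Python's standard `xml.etree.ElementTree` and emits prefixed tokens per word position. It is not the MTAS tokenizer or its configuration format: the fragment is simplified (no `xml:id`, features or morphology), the second word is taken from the annotation table earlier in the deck, and the hard-coded choice of `t`, `pos` (head) and `lemma` is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Simplified fragment; real FoLiA documents carry a namespace and more attributes.
FOLIA = """<text>
  <p>
    <s>
      <w class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
        <lemma class="ik"/>
      </w>
      <w class="WORD">
        <t>ben</t>
        <pos class="WW(pv,tgw,ev)" head="WW"/>
        <lemma class="zijn"/>
      </w>
    </s>
  </p>
</text>"""


def folia_to_tokens(xml_text: str) -> List[Tuple[str, int]]:
    root = ET.fromstring(xml_text)
    tokens = []
    # Each <w/> element defines one position; its annotations share that position.
    for position, w in enumerate(root.iter("w")):
        t, pos, lemma = w.find("t"), w.find("pos"), w.find("lemma")
        if t is not None and t.text:
            tokens.append((f"t:{t.text}", position))
        if pos is not None:
            tokens.append((f"pos:{pos.get('head')}", position))
        if lemma is not None:
            tokens.append((f"lemma:{lemma.get('class')}", position))
    return tokens


folia_tokens = folia_to_tokens(FOLIA)
for token in folia_tokens:
    print(token)
```

Using `find` rather than `iter` for `<t/>` only considers direct children of `<w/>`, so a nested `<t/>` inside `<morphology/>` would not be mistaken for the word text.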
  20. Search using CQL
      For the new MTAS data structure:
        Query handlers are provided
        Support for the Corpus Query Language (CQL)
        Makes it possible to define conditions on annotations
        Confusion about the exact interpretation and implementation
  21. Search using CQL
        the  big  green  shiny  apple
        LID  ADJ  ADJ    ADJ    N
      Ambiguities illustrated by examples:
        (1) [pos = "LID" | word = "the"]
        (2) [word = "b.*" | word = ".*g"]
        (3) [pos = "ADJ"]{2}
        (4) [pos = "ADJ"]? [pos = "N"]
  22. Search using CQL: Within MTAS
      Results are considered equal if and only if their positions match exactly.
        This differs from the default query interpretation of Lucene and from the CQL interpretation used in other applications
        There are no options to refer to parts of the matched pattern, e.g. to sort, group or collect statistics
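
The effect of the "equal if and only if the positions match" rule can be tried out on the example sentence "the big green shiny apple". The sketch below is a toy span matcher, not the MTAS query handler: the helper functions are hypothetical, and counting one raw hit per satisfied condition is just one conceivable alternative interpretation, used here to show where counts can diverge.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start position, end position), both inclusive

SENTENCE = [("the", "LID"), ("big", "ADJ"), ("green", "ADJ"), ("shiny", "ADJ"), ("apple", "N")]


def single(conditions: List[Tuple[str, str]]) -> List[Span]:
    """One raw span per (token, condition) pair that matches."""
    spans = []
    for i, (word, pos) in enumerate(SENTENCE):
        token = {"word": word, "pos": pos}
        for layer, pattern in conditions:
            if re.fullmatch(pattern, token[layer]):
                spans.append((i, i))
    return spans


def sequence(first: List[Span], second: List[Span], first_optional: bool = False) -> List[Span]:
    """Spans where a `first` span is immediately followed by a `second` span."""
    spans = [(s1, e2) for (s1, e1) in first for (s2, e2) in second if e1 + 1 == s2]
    if first_optional:
        spans += second
    return spans


def distinct(spans: List[Span]) -> List[Span]:
    """Rule from the slide: results with identical positions are the same result."""
    return sorted(set(spans))


adj, noun = single([("pos", "ADJ")]), single([("pos", "N")])
q1 = single([("pos", "LID"), ("word", "the")])   # query (1)
q2 = single([("word", "b.*"), ("word", ".*g")])  # query (2)
q3 = sequence(adj, adj)                          # query (3)
q4 = sequence(adj, noun, first_optional=True)    # query (4)

print(len(q1), distinct(q1))  # 2 [(0, 0)]
print(distinct(q3))           # [(1, 2), (2, 3)]
print(distinct(q4))           # [(3, 4), (4, 4)]
```

For queries (1) and (2), "the" and "big" each satisfy both alternatives, so per-condition counting yields two raw hits while position-based equality yields one result.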
  23. Results: Size indexes
      Per collection:
        DBNL T:  9,465 FoLiA, 29GB zipped, 198GB index, 677,476,310 positions
        DBNL DT: 131,177 FoLiA, 95GB, 395,530,191 positions
        SONAR:   2,063,880 FoLiA, 22GB zipped, 127GB index, 504,393,711 positions
      Search on combined indexes using Solr sharding:
        # FoLiA: 2,204,522
        # Positions: 1,577,400,212
        # Sentences: 92,584,655
      There are approximately 10 tokens on each position.
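
The combined totals follow directly from the per-collection figures, which a short calculation confirms. The variable names are illustrative; the numbers are those on the slide, and the positions-per-sentence ratio is a derived value that the slide itself does not state.

```python
# (number of FoLiA documents, number of positions) per collection, from the slide
collections = {
    "DBNL T": (9_465, 677_476_310),
    "DBNL DT": (131_177, 395_530_191),
    "SONAR": (2_063_880, 504_393_711),
}
sentences = 92_584_655

total_docs = sum(docs for docs, _ in collections.values())
total_positions = sum(positions for _, positions in collections.values())
positions_per_sentence = total_positions / sentences

print(total_docs)                        # 2204522
print(total_positions)                   # 1577400212
print(round(positions_per_sentence, 1))  # 17.0
```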
  24. Results: Performance
      Virtual machine, Ubuntu, 8 cores, 48GB (40GB Solr).
      Computing stats (sum, mean, median, standard deviation, etc.) on the full set of 2,204,522 documents and 1,577,400,212 positions.
        CQL                          Time        Hits         Docs
        [t = "de"]                   3,023 ms    57,531,353   1,801,583
        [t = "de" & pos = "LID"]     7,877 ms    56,704,921   1,799,499
        [t = "de" & !pos = "LID"]    3,105 ms    826,432      132,722
        <s> [t = "De"]               11,568 ms   6,085,643    1,090,127
        [pos = "N"]                  6,200 ms    259,942,340  2,189,750
        [pos = "ADJ"] [pos = "N"]    42,977 ms   45,366,603   1,821,716
        [pos = "ADJ"]? [pos = "N"]   207,795 ms  305,308,943  2,189,750
  25. Results: TODO
      Group results
      Facets
      Performance
      ...
  26. The end
