MTAS Henny Brugman

Introduction
Lucene
MTAS
Tokenizer FoLiA
Search using CQL
Results
Multi Tier Annotation Search
MTAS
Matthijs Brouwer
Meertens Institute
December 8, 2015
1 Introduction
2 Lucene
3 MTAS
4 Tokenizer FoLiA
5 Search using CQL
6 Results

Introduction
Text and Metadata
Annotated Text
Requirements
Provide Search on Combination of Text and Metadata
Example data
Author          Eduard Douwes Dekker
Place of birth  Amsterdam
Date of birth   1820, March 2
Pseudonym       Multatuli
Title           Max Havelaar
Published       1860
Text            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 . . .

Solution based on Apache Solr
Reverse Index
Apache Solr (based on Apache Lucene)
Index on both Text and Metadata
Advantages
Search
Facets
Scalable
Custom plugin (join)
Actively developed

Search Text
"Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."
We can search for
"Makelaar"
"Makelaar in koffie"
"Makel.* in koffie"
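The three searches can be sketched at token level. This is a minimal illustration, assuming case-insensitive matching and that each quoted word is a regular expression applied to one whole token. The helpers `tokenize` and `find_phrase` are hypothetical and are not how Lucene or Solr evaluate queries.

```python
import re

TEXT = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."

def tokenize(text):
    # Words and punctuation marks become separate lowercase tokens.
    return [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]

def find_phrase(tokens, phrase):
    # Each word of the phrase is a regular expression for one whole token.
    patterns = [re.compile(p, re.IGNORECASE) for p in phrase.split()]
    hits = []
    for start in range(len(tokens) - len(patterns) + 1):
        window = tokens[start:start + len(patterns)]
        if all(p.fullmatch(t) for p, t in zip(patterns, window)):
            hits.append(start)  # position of the first matching token
    return hits

tokens = tokenize(TEXT)
for query in ["Makelaar", "Makelaar in koffie", "Makel.* in koffie"]:
    print(query, "->", find_phrase(tokens, query))
```

All three queries find the same starting position, the token "makelaar".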

Annotations
text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
. . .     . . .     . . .

FoLiA
<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      . . .

Required functionality
Extend current Solr solution
Search on annotations like pos, lemma, features, . . .
Search on sentences, paragraphs, chapters, . . .
Search on entities and chunks
Search on dependencies
Statistics, grouping, facets, . . .
Important
Maintaining functionality and scalability
Upgradeable to new releases of Solr/Lucene

Lucene
Tokenization
Reverse Index
Limitations
Alternatives
Tokenization
Something about Lucene internals
Focus on text
Tokenization
Text is split up into tokens
value, e.g. "koffie"
position, e.g. 4
offset, e.g. 19-24
payload, e.g. 1.000
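The four token properties can be sketched as follows. The `Token` class and the regular-expression tokenizer are illustrative assumptions, not Lucene's analyzer API. End offsets are inclusive, as in the example above, and the payload is a constant placeholder.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    value: str      # term text, lowercased here
    position: int   # index of the token in the token stream
    start: int      # offset of the first character
    end: int        # offset of the last character (inclusive)
    payload: float  # extra data per token; a constant placeholder here

def tokenize(text):
    tokens = []
    # Words and punctuation marks each take one position.
    for position, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append(Token(m.group().lower(), position, m.start(), m.end() - 1, 1.0))
    return tokens

tokens = tokenize("Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.")
print(tokens[4])
```

With this tokenization the fifth token is "koffie" at position 4 with offset 19-24.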

Reverse Index
Tokenstream used to construct Reverse Index
text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
. . .     . . .     . . .     . . .   . . .
This enables fast search, since the locations of matching terms can
be found very quickly.
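A minimal sketch of such an index is a dictionary from term to its locations. The function `build_index` is hypothetical and leaves out payloads and everything Lucene does for compression and storage.

```python
import re
from collections import defaultdict

def build_index(documents):
    # term -> list of (document id, position, start offset, end offset)
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
            index[m.group().lower()].append((doc_id, position, m.start(), m.end() - 1))
    return index

index = build_index(["Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."])
# A lookup goes straight to the locations of the term, without scanning the text.
print(index["makelaar"])
print(index["de"])
```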

Limitations
Limitations of this approach
Heavily based on grouping by document
Collecting statistics
Grouping results
Not possible to include
Structural information: sentences, paragraphs, . . .
Annotations: pos, lemmas, . . .
Relations: dependencies, chunking, . . .
No real forward index
Finding all tokens for a given position

Alternatives
Alternative solutions
Graph Database
Experiments with Neo4j: problems with scalability and performance
Too general, doesn't use the sequential nature of textual data
BlackLab
Based on Lucene, no integration with Solr
Different fields for each annotation layer

MTAS
General
Prefixes
Payload
Forward Indexes
Additional requirements
Extension provided by MTAS
Store multiple tokens on the same position, and use prefixes
to distinguish between different layers of annotations
Use the payload to encode additional information on each
token
Construct forward indexes by extending the Lucene Codec
Implementation
Extension based on the Lucene Library
Provide query handlers for extended data structures
Provide Solr Plugin using the MTAS extension

Prefixes
Store multiple tokens on the same position, and use prefixes to
distinguish between different layers of annotations
text        document  position
lemma:de    0         9
lemma:zijn  0         1
. . .       . . .     . . .
pos:LID     0         9
pos:WW      0         1
. . .       . . .     . . .
t:ben       0         1
t:de        0         9
. . .       . . .     . . .
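The prefix scheme can be sketched as one index in which every annotation layer contributes its own prefixed term at the same position. The input tuples and the function `build_prefixed_index` are assumptions made for illustration.

```python
from collections import defaultdict

# (position, word, lemma, pos) for the first tokens of the example sentence
ANNOTATED = [
    (0, "ik", "ik", "VNW"),
    (1, "ben", "zijn", "WW"),
    (2, "makelaar", "makelaar", "N"),
]

def build_prefixed_index(doc_id, annotated):
    # prefixed term -> list of (document id, position)
    index = defaultdict(list)
    for position, word, lemma, pos in annotated:
        for prefix, value in (("t", word), ("lemma", lemma), ("pos", pos)):
            index[f"{prefix}:{value}"].append((doc_id, position))
    return index

index = build_prefixed_index(0, ANNOTATED)
# Combine two layers: positions that are a verb (pos:WW) with lemma "zijn".
both = sorted(set(index["pos:WW"]) & set(index["lemma:zijn"]))
print(both)
```

Because all layers share the same positions, a condition across layers reduces to intersecting position lists.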

Payload
Use the payload to encode additional information on each token
mtas id      integer identifying token within a document
position     type of position: single, range or set;
             additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
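The listed fields can be pictured as a record per token. This is only a sketch: the class `MtasPayload`, its field types and the example values are assumptions, and it does not show the binary encoding actually stored in the payload.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class MtasPayload:
    mtas_id: int                  # identifies the token within a document
    position_type: str            # "single", "range" or "set"
    positions: Tuple[int, ...]    # one position, (start, end) of a range, or the members of a set
    offset: Tuple[int, int]       # start and end offset
    real_offset: Tuple[int, int]  # start and end real offset
    parent: Optional[int] = None  # mtas id of another token
    payload: Optional[float] = None  # original payload

    def covered_positions(self):
        # Expand a range into its individual positions.
        if self.position_type == "range":
            start, end = self.positions
            return tuple(range(start, end + 1))
        return self.positions

# Hypothetical example: a word token whose parent is a sentence spanning 14 positions.
sentence = MtasPayload(100, "range", (0, 13), (0, 60), (0, 60))
word = MtasPayload(4, "single", (4,), (19, 24), (19, 24), parent=100, payload=0.9)
print(word.covered_positions(), len(sentence.covered_positions()))
```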

Forward Indexes
Construct forward indexes by extending the Lucene Codec
Position         Given the position within the document,
                 return references to all objects on that position
Parent Id        Given the mtas id, return references
                 to all objects referring to this mtas id as parent
Object Id        Given the id, return a reference to the object
Prefix/Position  Given prefix and position, return the value
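The four lookups can be sketched with plain dictionaries. The object tuples below are hypothetical, and the sketch assumes a single value per prefix and position. The real forward indexes live inside an extended Lucene codec, which is not shown here.

```python
from collections import defaultdict

# Hypothetical objects: (mtas id, prefix, value, covered positions, parent id)
OBJECTS = [
    (0, "s", "", (0, 1, 2), None),
    (1, "t", "ik", (0,), 0),
    (2, "lemma", "ik", (0,), 1),
    (3, "t", "ben", (1,), 0),
    (4, "lemma", "zijn", (1,), 3),
]

by_id = {}                        # Object Id: id -> object
by_position = defaultdict(list)   # Position: position -> ids of all objects there
by_parent = defaultdict(list)     # Parent Id: id -> ids of objects naming it as parent
by_prefix_position = {}           # Prefix/Position: (prefix, position) -> value

for obj in OBJECTS:
    mtas_id, prefix, value, positions, parent = obj
    by_id[mtas_id] = obj
    for position in positions:
        by_position[position].append(mtas_id)
        by_prefix_position[(prefix, position)] = value
    if parent is not None:
        by_parent[parent].append(mtas_id)

print(by_position[1])                     # everything on position 1
print(by_parent[0])                       # children of the sentence
print(by_prefix_position[("lemma", 1)])   # lemma at position 1
```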

Using the new structure
The additions make it possible to quickly retrieve the required
information for queries and results based on the annotated text.
To take advantage of these additions to the Lucene structure, we
need
A tokenizer mapping the original annotated data (FoLiA) onto the
new structure
Query handlers and a query language: CQL


Tokenizer FoLiA
Several elements can be distinguished:
Words : <w/>
Annotations on Words : <pos/>, <t/>, <lemma/>
Groups of Words : <p/>, <s/>, <div/>
Annotations on Groups : <lang/>
References : <wref/>
Relations : <entity/>
The configurable FoLiA tokenizer makes it possible to define these
items and map them onto the new index structure.
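A rough sketch of such a mapping, using Python's standard XML parser on a simplified FoLiA fragment without namespace. The configuration names `WORD_ANNOTATIONS` and `GROUPS` and the function `tokenize` are hypothetical and do not reflect the actual MTAS tokenizer configuration.

```python
import xml.etree.ElementTree as ET

FOLIA = """
<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
        <lemma class="ik"/>
      </w>
      <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
        <t>ben</t>
        <pos class="WW(pv,tgw,ev)" head="WW"/>
        <lemma class="zijn"/>
      </w>
    </s>
  </p>
</text>
"""

# Assumed configuration: how to read a value from each word annotation,
# and which elements count as groups of words.
WORD_ANNOTATIONS = {
    "t": lambda e: e.text.lower(),
    "lemma": lambda e: e.get("class"),
    "pos": lambda e: e.get("head"),
}
GROUPS = {"s", "p"}

def tokenize(xml_text):
    tokens = []   # (prefixed value, first position, last position)
    position = 0

    def visit(element):
        nonlocal position
        if element.tag == "w":
            # Every annotation on the word lands on the same position.
            for child in element:
                if child.tag in WORD_ANNOTATIONS:
                    value = WORD_ANNOTATIONS[child.tag](child)
                    tokens.append((f"{child.tag}:{value}", position, position))
            position += 1
            return
        start = position
        for child in element:
            visit(child)
        if element.tag in GROUPS and position > start:
            # A group covers the range of positions of the words inside it.
            tokens.append((f"{element.tag}:", start, position - 1))

    visit(ET.fromstring(xml_text))
    return tokens

tokens = tokenize(FOLIA)
for token in tokens:
    print(token)
```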

Search using CQL
For new MTAS data structure
Query handlers provided
Support Corpus Query Language (CQL)
Makes it possible to define conditions on annotations
Confusion about the exact interpretation and implementation

the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N
Ambiguities illustrated by examples
[pos = "LID" | word = "the"]    (1)
[word = "b.*" | word = ".*g"]   (2)
[pos = "ADJ"]{2}                (3)
[pos = "ADJ"]? [pos = "N"]      (4)
Within MTAS
Results should be considered as equal if and only if the
positions of both results exactly match.
Differs from the default query interpretation of Lucene and
the CQL interpretation as used in other applications
No option to refer to parts of the matched pattern, e.g. to
sort, group or collect statistics
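The position-based notion of equality can be illustrated on the five-word example "the big green shiny apple" with its four numbered queries. The sketch below is a toy matcher written for this illustration, not the MTAS query implementation. Results are collected in a set of (start, end) positions, so two ways of matching the same positions count as one result.

```python
import re

SENTENCE = [
    {"word": "the", "pos": "LID"},
    {"word": "big", "pos": "ADJ"},
    {"word": "green", "pos": "ADJ"},
    {"word": "shiny", "pos": "ADJ"},
    {"word": "apple", "pos": "N"},
]

def attr(name, pattern):
    return lambda token: re.fullmatch(pattern, token[name]) is not None

def either(*conditions):
    return lambda token: any(c(token) for c in conditions)

def matches(tokens, query):
    """Return the set of (start, end) positions matching the query.

    query is a list of (condition, min_count, max_count) elements.
    """
    results = set()

    def extend(start, position, remaining):
        if not remaining:
            if position > start:              # ignore empty matches
                results.add((start, position - 1))
            return
        condition, low, high = remaining[0]
        count, end = 0, position
        while True:
            if count >= low:
                extend(start, end, remaining[1:])
            if count == high or end >= len(tokens) or not condition(tokens[end]):
                break
            end += 1
            count += 1

    for start in range(len(tokens)):
        extend(start, start, query)
    return results

QUERIES = {
    1: [(either(attr("pos", "LID"), attr("word", "the")), 1, 1)],
    2: [(either(attr("word", "b.*"), attr("word", ".*g")), 1, 1)],
    3: [(attr("pos", "ADJ"), 2, 2)],
    4: [(attr("pos", "ADJ"), 0, 1), (attr("pos", "N"), 1, 1)],
}

for number, query in QUERIES.items():
    print(number, sorted(matches(SENTENCE, query)))
```

Under this interpretation, queries (1) and (2) each give a single result although both alternatives match the same word, query (3) gives two overlapping results, and query (4) gives both "shiny apple" and "apple".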

Results
Size indexes
Performance
TODO
Size indexes
Collection  # FoLiA    Zipped  Size index  Positions
DBNL T      9,465      29GB    198GB       677,476,310
DBNL DT     131,177            95GB        395,530,191
SONAR       2,063,880  22GB    127GB       504,393,711
Search on combined indexes using Solr sharding
# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655
There are approximately 10 tokens on each position.
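As a quick arithmetic check, the combined totals are the sums over the three collections, and the counts also give the average sentence and document length in positions. The variable names are chosen for this sketch only.

```python
# (number of FoLiA documents, number of positions) per collection
collections = {
    "DBNL T": (9_465, 677_476_310),
    "DBNL DT": (131_177, 395_530_191),
    "SONAR": (2_063_880, 504_393_711),
}
sentences = 92_584_655

documents = sum(d for d, _ in collections.values())
positions = sum(p for _, p in collections.values())

print(documents, positions)                # totals of the combined index
print(round(positions / sentences, 1))     # average positions per sentence
print(round(positions / documents))        # average positions per document
```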

Performance
Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)
Computing stats (sum, mean, median, standard deviation, etc.) on the
full set of 2,204,522 documents and 1,577,400,212 positions.
CQL                          Time        Hits         Docs
[t = "de"]                   3,023 ms    57,531,353   1,801,583
[t = "de" & pos = "LID"]     7,877 ms    56,704,921   1,799,499
[t = "de" & !pos = "LID"]    3,105 ms    826,432      132,722
<s> [t = "De"]               11,568 ms   6,085,643    1,090,127
[pos = "N"]                  6,200 ms    259,942,340  2,189,750
[pos = "ADJ"] [pos = "N"]    42,977 ms   45,366,603   1,821,716
[pos = "ADJ"]? [pos = "N"]   207,795 ms  305,308,943  2,189,750
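The hit counts in the table are consistent with counting results by position. The two complementary conditions on "de" add up to the unrestricted query, and the optional-adjective query equals the noun query plus the adjective-noun query. A small check, with the numbers copied from the table:

```python
hits = {
    '[t = "de"]': 57_531_353,
    '[t = "de" & pos = "LID"]': 56_704_921,
    '[t = "de" & !pos = "LID"]': 826_432,
    '[pos = "N"]': 259_942_340,
    '[pos = "ADJ"] [pos = "N"]': 45_366_603,
    '[pos = "ADJ"]? [pos = "N"]': 305_308_943,
}

# "de" tagged LID plus "de" not tagged LID covers every "de".
de_split = hits['[t = "de" & pos = "LID"]'] + hits['[t = "de" & !pos = "LID"]']
print(de_split == hits['[t = "de"]'])

# An optional adjective yields the bare noun and the adjective-noun pair as separate results.
optional = hits['[pos = "N"]'] + hits['[pos = "ADJ"] [pos = "N"]']
print(optional == hits['[pos = "ADJ"]? [pos = "N"]'])
```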

TODO
Group results
Facets
Performance
. . .

The end
