Anserini SIGIR 2017 Poster
Anserini: Enabling the Use of Lucene for Information Retrieval Research
Peilin Yang and Hui Fang
Electrical and Computer Engineering Department, University of Delaware
Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and
the U.S. National Science Foundation under IIS-1423002 and CNS-1405688.
Motivation
Information retrieval research relies on open-source toolkits to evaluate retrieval models.
Effort is generally directed toward better ranking; less attention is given to scalability and other operational considerations.
Lucene has become the de facto platform in industry for building search applications, BUT it lacks systematic support for evaluation over standard test collections.
http://anserini.io/
Dispelling the Myth of Lucene
"Lucene is slow (because of Java)." [FALSE]: see the evaluation results below.
"Lucene doesn't offer modern ranking functions." [FALSE]: BM25, Dirichlet LM, axiomatic models, and divergence from randomness models are all available (see the sketch after this list).
"Lucene is difficult to use." [PARTIALLY TRUE]: poor documentation and unintuitive APIs.
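The ranking functions listed above correspond to Similarity implementations that ship with Lucene. Below is a minimal sketch, assuming a plain Lucene index at a placeholder path, of how a retrieval run can switch among them; this is illustrative code, not Anserini's own.

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.LMDirichletSimilarity;
    import org.apache.lucene.store.FSDirectory;

    public class SimilarityDemo {
      public static void main(String[] args) throws Exception {
        // "lucene-index" is a placeholder path to an existing Lucene index.
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("lucene-index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // BM25 with k1 = 0.9 and b = 0.4 (illustrative parameter values).
        searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));

        // Query likelihood with Dirichlet smoothing (mu = 1000) is a one-line swap:
        // searcher.setSimilarity(new LMDirichletSimilarity(1000f));
        // The same package also provides axiomatic (e.g., AxiomaticF2EXP) and
        // divergence-from-randomness (DFRSimilarity) models.

        reader.close();
      }
    }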
Main Components
Multi-threaded indexing (see the sketch after this list)
Streamlined evaluation for the TREC Adhoc and Web tracks
Relevance feedback
Many more to come…
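At the Lucene level, multi-threaded indexing works because IndexWriter is safe to share across threads. The sketch below, whose field names, paths, and thread count are placeholders rather than Anserini's actual indexer code, feeds one writer from a small thread pool.

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ThreadedIndexer {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new EnglishAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("lucene-index")), config)) {
          ExecutorService pool = Executors.newFixedThreadPool(8);
          for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
              try {
                // In a real indexer each task would parse a batch of collection files.
                Document doc = new Document();
                doc.add(new StringField("docid", "doc-" + id, Field.Store.YES));
                doc.add(new TextField("contents", "placeholder text " + id, Field.Store.NO));
                writer.addDocument(doc);  // IndexWriter handles concurrent adds
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.MINUTES);
          writer.commit();
        }
      }
    }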
Evaluation

Retrieval Efficiency (Terabyte 06 efficiency queries; lower latency and higher throughput are better)
            Latency (ms/query)   Throughput (queries/sec)
Indri       2403                 0.42
Anserini     382                 2.61

Retrieval Effectiveness (Mean Average Precision; I = Indri, A = Anserini)
Model       Disk12   ROBUST04   WT2G     WT10G    GOV2
BM25(I)     0.2040   0.2478     0.3152   0.1955   0.2970
BM25(A)     0.2267   0.2500     0.3015   0.1981   0.3030
LM(I)       0.2269   0.2516     0.3116   0.1915   0.2995
LM(A)       0.2232   0.2465     0.2922   0.2015   0.2951
Modules
Lucene core plus pluggable Anserini extensions: word embeddings, relevance feedback, learning to rank.
Indexing pipeline: Doc Parser (ClueWeb, TREC) → Tokenizer/Stemmer (whitespace, lowercase, Porter) → Index Options (counts, positions).
Searching pipeline: Query Reader (Web XML, TREC) → Ranking Model (BM25, LM) → Evaluation (trec_eval, AIR); a minimal searching sketch follows.
(Architecture diagram distinguishing Lucene modules, static components, pluggable modules, and Anserini extensions.)
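To make the searching pipeline concrete, here is a minimal sketch, with placeholder paths, field names, topic id, and query text, that ranks one topic with BM25 and writes a TREC-format run file; it stands in for, but is not, Anserini's actual search code.

    import java.io.PrintWriter;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.FSDirectory;

    public class SearchToRunFile {
      public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("lucene-index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));

        Query query = new QueryParser("contents", new EnglishAnalyzer())
            .parse("international organized crime");  // placeholder topic text
        ScoreDoc[] hits = searcher.search(query, 1000).scoreDocs;

        // One line per retrieved document: topic Q0 docid rank score runtag
        try (PrintWriter out = new PrintWriter("run.bm25.txt")) {
          for (int rank = 0; rank < hits.length; rank++) {
            String docid = searcher.doc(hits[rank].doc).get("docid");
            out.printf("301 Q0 %s %d %f anserini-sketch%n", docid, rank + 1, hits[rank].score);
          }
        }
        reader.close();
      }
    }

A run file in this format, together with the track's qrels, is what trec_eval scores to produce the MAP numbers in the Evaluation tables above.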
Indexing Time
(Bar chart: indexing times for Disk12, Disk45, AQUAINT, and WT2g under Indri, A(doc), A(pos), and A(cnt); for GOV2, CW09b, CW09, CW12b13, and CW12, A(pos) and A(cnt) only.)
Dual AMD Opteron 6128 processors (2.0GHz, 8 cores) with 40GB RAM, running CentOS 6.8
Dual Intel Xeon E5-2699 v4 processors (2.2GHz, 22 cores) with 1TB RAM, running Ubuntu 16.04
Index Size
Collection   A(cnt)   A(pos)
CW09b        28GB     75GB
CW09         254GB    649GB
CW12b13      29GB     76GB
CW12         376GB    1.1TB
Roughly 6 times faster than Indri (2403 ms/query vs. 382 ms/query, about a 6.3× speedup), with similar and reliable retrieval effectiveness.