WHAT IS IN A LUCENE INDEX
Adrien Grand
@jpountz

Software engineer at Elasticsearch
About me
• Lucene/Solr committer
• Software engineer at Elasticsearch
• I like changing the index file formats!
– stored fields
– term vectors
– doc values
– ...
Why should I learn about Lucene internals?
• Know the cost of the APIs
– to build blazing fast search applications
– don’t commit all the time
– when to use stored fields vs. doc values
– maybe Lucene is not the right tool
• Understand index size
– oh, term vectors are 1/2 of the index size!
– I removed 20% of my documents and index size hasn’t changed
• This is a lot of fun!
Indexing
• Make data fast to search
– duplicate data if it helps
– decide on how to index based on the queries
• Trade update speed for search speed
– Grep vs full-text indexing
– Prefix queries vs edge n-grams
– Phrase queries vs shingles
• Indexing is fast
– 220 GB/hour for 4K docs!
– http://people.apache.org/~mikemccand/lucenebench/indexing.html
Let’s create an index
• Tree structure
– sorted for range queries
– O(log(n)) search
[Figure: tree index over the terms sql, index, data, term and Lucene, pointing to the documents "Lucene in action" and "Databases"]
Lucene doesn’t work this way
Another index
• Store terms and documents in arrays
– binary search

Segment

Terms dict and postings lists:
term ordinal | term | postings list
0 | data | 0,1
1 | index | 0,1
2 | Lucene | 0
3 | term | 0
4 | sql | 1

Documents:
doc id | document
0 | Lucene in action
1 | Databases
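The array layout above can be sketched in a few lines. This is a toy model, not Lucene code: the names `build_segment` and `lookup` are made up for illustration, terms are lowercased whitespace tokens, and the two sample documents are assumptions chosen so that the postings are easy to check by hand.

```python
from bisect import bisect_left

def build_segment(docs):
    """Build a toy segment: a sorted terms array, a parallel array of
    postings lists, and the documents addressed by doc id."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    terms = sorted(postings)                      # position in this array = term ordinal
    return terms, [postings[t] for t in terms], list(docs)

def lookup(segment, term):
    """Binary search the terms array, then return the postings list (doc ids)."""
    terms, postings, _ = segment
    ord_ = bisect_left(terms, term)
    if ord_ < len(terms) and terms[ord_] == term:
        return postings[ord_]
    return []

segment = build_segment(["Lucene in action", "Databases in action"])
print(lookup(segment, "action"))   # doc ids that contain "action"
```

Because documents are added in doc id order, every postings list comes out sorted without an extra sort step.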
Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)

Segment 1 (doc 0 = Lucene in action):
term ordinal | term | postings list
0 | data | 0
1 | index | 0
2 | Lucene | 0
3 | term | 0

Segment 2 (doc 0 = Databases, which becomes doc 1 in the merged segment):
term ordinal | term | postings list
0 | data | 0
1 | index | 0
2 | sql | 0

Merged segment (doc 0 = Lucene in action, doc 1 = Databases):
term ordinal | term | postings list
0 | data | 0,1
1 | index | 0,1
2 | Lucene | 0
3 | term | 0
4 | sql | 1
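The merge step can be sketched as a merge sort over two sorted terms arrays. This is a simplified illustration under stated assumptions: a segment is a hypothetical `(terms, postings, docs)` tuple, terms are lowercased so that plain string order applies, and deletions are ignored.

```python
def merge_segments(seg_a, seg_b):
    """Merge two toy segments (terms, postings, docs).
    Docs are concatenated, so doc ids of seg_b are shifted by len(docs of seg_a)."""
    terms_a, post_a, docs_a = seg_a
    terms_b, post_b, docs_b = seg_b
    base = len(docs_a)
    terms, postings = [], []
    i = j = 0
    while i < len(terms_a) or j < len(terms_b):
        ta = terms_a[i] if i < len(terms_a) else None
        tb = terms_b[j] if j < len(terms_b) else None
        if tb is None or (ta is not None and ta < tb):
            terms.append(ta)                      # term only in seg_a
            postings.append(list(post_a[i]))
            i += 1
        elif ta is None or tb < ta:
            terms.append(tb)                      # term only in seg_b
            postings.append([d + base for d in post_b[j]])
            j += 1
        else:
            terms.append(ta)                      # term in both: concatenate postings
            postings.append(list(post_a[i]) + [d + base for d in post_b[j]])
            i += 1
            j += 1
    return terms, postings, docs_a + docs_b

seg1 = (["data", "index", "lucene", "term"], [[0], [0], [0], [0]], ["Lucene in action"])
seg2 = (["data", "index", "sql"], [[0], [0], [0]], ["Databases"])
merged = merge_segments(seg1, seg2)
```

Neither input segment is modified; the merge writes a new one, which matches the immutability the slides rely on.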
Deletions?
• Deletion = turn a bit off
• Ignore deleted documents when searching and merging (merging reclaims space)
• Merge policies favor segments with many deletions

term ordinal | term | postings list
0 | data | 0,1
1 | index | 0,1
2 | Lucene | 0
3 | term | 0
4 | sql | 1

doc id | document | live docs (1 = live, 0 = deleted)
0 | Lucene in action | 1
1 | Databases | 0
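A minimal sketch of the live docs idea, assuming a plain list of booleans stands in for the bit set and `live_hits` is a hypothetical helper: the postings list is left untouched and deleted documents are filtered out at search time.

```python
def live_hits(postings, live_docs):
    """Keep only the doc ids whose live bit is still set."""
    return [doc_id for doc_id in postings if live_docs[doc_id]]

live_docs = [True, True]       # one flag per doc id: True = live
postings_data = [0, 1]         # postings list of the term "data"

before = live_hits(postings_data, live_docs)
live_docs[1] = False           # delete doc 1 ("Databases") by turning its bit off
after = live_hits(postings_data, live_docs)
```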
Pros/cons
• Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)
Lucene can use several indexes
Many databases can’t
Index intersection
red: 1, 2, 10, 11, 20, 30, 50, 100
shoe: 2, 20, 21, 22, 30, 40, 100
[Figure: arrows numbered 1 to 9 show the order in which the two postings lists are advanced]
Lucene’s postings lists support skipping that can be used to “leap-frog”
Many databases just pick the most selective index and ignore the other ones
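A sketch of the leap-frog intersection on the two lists from the slide. The `advance` helper is an assumption: it uses binary search as a stand-in for the skip data that real postings lists carry.

```python
from bisect import bisect_left

def advance(postings, pos, target):
    """Index of the first entry >= target at or after pos (stand-in for skipping)."""
    return bisect_left(postings, target, pos)

def leapfrog(a, b):
    """Intersect two sorted doc id lists, each side jumping to the other's candidate."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return out

red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
print(leapfrog(red, shoe))   # [2, 20, 30, 100]
```

Both lists take part in the work, so the cost is driven by how often the lists can skip rather than by the length of a single list.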
What else?
• We just covered search
• Lucene does more
– term vectors
– norms
– numeric doc values
– binary doc values
– sorted doc values
– sorted set doc values
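Term vectors
• Per-document inverted index
• Useful for more-like-this
• Sometimes used for highlighting

Inverted index of the segment:
term ordinal | term | postings list
0 | data | 0,1
1 | index | 0,1
2 | Lucene | 0
3 | term | 0
4 | sql | 1

Term vectors of doc 0 (Lucene in action):
term ordinal | term | postings list
0 | data | 0
1 | index | 0
2 | Lucene | 0
3 | term | 0

Term vectors of doc 1 (Databases):
term ordinal | term | postings list
0 | data | 0
1 | index | 0
2 | sql | 0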
Numeric/binary doc values
• Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values

doc id | document | field_a | field_b
0 | Lucene in action | 42 | afc
1 | Databases | 1 | gce
2 | Solr in action | 3 | ppy
3 | Java | 10 | ccn
Sorted (set) doc values
• Ordinal-enabled per-doc and per-field values
– sorted: single-valued, useful for sorting
– sorted set: multi-valued, useful for faceting

doc id | document | ordinals
0 | Lucene in action | 1,2
1 | Databases | 0
2 | Solr in action | 0,1,2
3 | Java | 1

Terms dictionary for this dv field:
ordinal | term
0 | distributed
1 | Java
2 | search
Faceting
• Compute value counts for docs that match a query
– e.g. category counts on an ecommerce website
• Naive solution
– hash table: value to count
– O(#docs) ordinal lookups
– O(#docs) value lookups
• 2nd solution
– hash table: ord to count
– resolve values in the end
– O(#docs) ordinal lookups
– O(#values) value lookups
Since ordinals are dense, this can be a simple array
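The second solution can be sketched as follows, reusing the ordinals and terms dictionary from the sorted set doc values slide. `facet_counts` is a hypothetical helper, not a Lucene API; the per-doc ordinals are held in a plain list of lists.

```python
def facet_counts(matching_docs, doc_ords, terms):
    """Count per ordinal in a dense array, then resolve ordinals to values once at the end."""
    counts = [0] * len(terms)                 # ordinals are dense: an array is enough
    for doc in matching_docs:
        for ord_ in doc_ords[doc]:            # one ordinal lookup per matching doc
            counts[ord_] += 1
    # one value lookup per distinct value, not per doc
    return {terms[ord_]: n for ord_, n in enumerate(counts) if n > 0}

terms = ["distributed", "Java", "search"]     # terms dictionary of the dv field
doc_ords = [[1, 2], [0], [0, 1, 2], [1]]      # per-doc ordinals, indexed by doc id

print(facet_counts([0, 2, 3], doc_ords, terms))
```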
How can I use these APIs?
• These are the low-level Lucene APIs, everything is built on top of these APIs: searching, faceting, scoring, highlighting, etc.

API | Useful for | Method
Inverted index | Term -> doc ids, positions, offsets | AtomicReader.fields
Stored fields | Summaries of search results | IndexReader.document
Live docs | Ignoring deleted docs | AtomicReader.liveDocs
Term vectors | More like this | IndexReader.termVectors
Doc values / Norms | Sorting/faceting/scoring | AtomicReader.get*Values
Wrap up
• Data duplicated up to 4 times
– not a waste of space!
– easy to manage thanks to immutability
• Stored fields vs doc values
– Optimized for different access patterns
– get many field values for a few docs: stored fields
– get a few field values for many docs: doc values

Stored fields (doc,field pairs in storage order): 0,A 0,B 0,C 1,A 1,B 1,C 2,A 2,B 2,C
– At most 1 seek per doc
Doc values (doc,field pairs in storage order): 0,A 1,A 2,A 0,B 1,B 2,B 0,C 1,C 2,C
– At most 1 seek per doc per field
– BUT more disk / file-system cache-friendly
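The two access patterns can be illustrated with in-memory stand-ins: a list of rows for stored fields and one array per field for doc values. The data comes from the doc values slide; the `title` field name and the `top_hits` helper are assumptions made for this sketch.

```python
# Row-oriented: everything about one doc sits together (stored fields).
stored_fields = [
    {"title": "Lucene in action", "field_a": 42, "field_b": "afc"},
    {"title": "Databases", "field_a": 1, "field_b": "gce"},
    {"title": "Solr in action", "field_a": 3, "field_b": "ppy"},
    {"title": "Java", "field_a": 10, "field_b": "ccn"},
]

# Column-oriented: one array per field, indexed by doc id (doc values).
doc_values = {
    "field_a": [42, 1, 3, 10],
    "field_b": ["afc", "gce", "ppy", "ccn"],
}

def top_hits(matching_docs, sort_field, n):
    """Sort many docs by reading a single column, then load full rows for the top n only."""
    column = doc_values[sort_field]
    ranked = sorted(matching_docs, key=lambda doc: column[doc])
    return [stored_fields[doc] for doc in ranked[:n]]

hits = top_hits([0, 1, 2, 3], "field_a", 2)
print([hit["title"] for hit in hits])
```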
File formats
Important rules
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly
Codecs
• File formats are codec-dependent
• Default codec tries to get the best speed for little memory
– To trade memory for speed, don’t use RAMDirectory; use MemoryPostingsFormat, MemoryDocValuesFormat, etc.
• Detailed file formats available in javadocs
– http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
Compression techniques
• Bit packing / vInt encoding
– postings lists
– numeric doc values
• LZ4
– code.google.com/p/lz4
– lightweight compression algorithm
– stored fields, term vectors
• FSTs
– conceptually a Map<String, ?>
– keys share prefixes and suffixes
– terms index
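As an illustration of vInt encoding, here is a small sketch of a variable-length integer codec: 7 payload bits per byte, low-order bits first, with the high bit set when more bytes follow. The function names `write_vint` and `read_vint` are chosen for this example and are not Lucene's API.

```python
def write_vint(value):
    """Encode a non-negative int using 7 bits per byte; high bit = more bytes follow."""
    if value < 0:
        raise ValueError("vInt sketch only handles non-negative values")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(data, pos=0):
    """Decode one vInt starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7

encoded = write_vint(300)
print(encoded.hex(), read_vint(encoded))
```

Small values, which dominate delta-encoded postings, fit in a single byte.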
What happens when I run a TermQuery?
1. Terms index
• Look up the term in the terms index
– In-memory FST storing term prefixes
– Gives the offset to look at in the terms dictionary
– Can fast-fail if no terms have this prefix
[Figure: FST whose arcs carry a label and an optional output (b/2, r, a/1, c, l/4, u, y/3, r); summing the outputs along a path gives br = 2, brac = 3, luc = 4, lyr = 7]
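The way outputs add up along a path can be sketched with a hand-built automaton that mirrors the figure. This is not Lucene's FST class: the state numbers, the `ARCS` table and the choice of final states are assumptions made so that the four prefixes yield the outputs shown on the slide.

```python
# state -> {label: (output, next_state)}; arcs without an output add 0
ARCS = {
    0: {"b": (2, 1), "l": (4, 4)},
    1: {"r": (0, 2)},
    2: {"a": (1, 3)},
    3: {"c": (0, 6)},            # shared by "brac" and "luc" (suffix sharing)
    4: {"u": (0, 3), "y": (3, 5)},
    5: {"r": (0, 6)},
    6: {},
}
FINAL = {2, 6}                   # states where an indexed prefix ends

def longest_prefix(term):
    """Walk the automaton along the term; return (prefix, summed output) for the
    longest indexed prefix, or None to fast-fail when no prefix matches."""
    state, total, best = 0, 0, None
    for i, ch in enumerate(term):
        arc = ARCS[state].get(ch)
        if arc is None:
            break
        output, state = arc
        total += output
        if state in FINAL:
            best = (term[:i + 1], total)
    return best

print(longest_prefix("lucene"))   # ('luc', 4)
```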
2. Terms dictionary
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the “BlockTree terms dict”
• Read sequentially until the term is found

[prefix=luc] (jump here)
a, freq=1, offset=101 (not found)
as, freq=1, offset=149 (not found)
ene, freq=9, offset=205 (found)
ky, freq=7, offset=260
rative, freq=5, offset=323
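The sequential scan inside one block can be sketched like this, using the entries from the slide. The in-memory `BLOCK` tuple and the `find_in_block` helper are illustrative assumptions; the real on-disk block layout is more involved.

```python
# One block: a shared prefix plus sorted (suffix, freq, postings offset) entries.
BLOCK = ("luc", [
    ("a", 1, 101),
    ("as", 1, 149),
    ("ene", 9, 205),
    ("ky", 7, 260),
    ("rative", 5, 323),
])

def find_in_block(block, term):
    """Scan the block in order; stop early once the suffixes sort past the target."""
    prefix, entries = block
    if not term.startswith(prefix):
        return None
    suffix = term[len(prefix):]
    for entry_suffix, freq, offset in entries:
        if entry_suffix == suffix:
            return freq, offset
        if entry_suffix > suffix:
            break                 # entries are sorted, the term cannot appear later
    return None

print(find_in_block(BLOCK, "lucene"))   # (9, 205)
```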
3. Postings lists
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into blocks of N=128 values
– 3. bit packing per block
– 4. if remaining docs, encode with vInt

Example with N=4
doc ids: 1,3,4,6,8,20,22,26,30,31
deltas: 1,2,1,2,2,12,2,4,4,1
blocks: [1,2,1,2] (2 bits per value) [2,12,2,4] (4 bits per value) 4, 1 (vInt-encoded)
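The four steps can be sketched as follows. This is a simplified model and not Lucene's on-disk format: each full block is packed into one Python integer with the block's bit width, and the leftover deltas are returned as a plain list where the real format would vInt-encode them. All function names are made up for the example.

```python
def pack(values, bits):
    """Pack values into one integer, `bits` bits each, first value in the low bits."""
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * bits)
    return packed

def unpack(packed, bits, count):
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

def for_delta_encode(doc_ids, block_size=128):
    """Return ([(bits, packed_block), ...], leftover_deltas)."""
    deltas, previous = [], 0
    for doc_id in doc_ids:                       # 1. delta-encode
        deltas.append(doc_id - previous)
        previous = doc_id
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):     # 2. split into full blocks
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()           # 3. width of the largest delta
        blocks.append((bits, pack(block, bits)))
    return blocks, deltas[full:]                 # 4. remainder (vInt in the real format)

def for_delta_decode(blocks, tail, block_size=128):
    deltas = []
    for bits, packed in blocks:
        deltas.extend(unpack(packed, bits, block_size))
    doc_ids, previous = [], 0
    for delta in deltas + tail:
        previous += delta
        doc_ids.append(previous)
    return doc_ids

doc_ids = [1, 3, 4, 6, 8, 20, 22, 26, 30, 31]
blocks, tail = for_delta_encode(doc_ids, block_size=4)
print([bits for bits, _ in blocks], tail)        # [2, 4] [4, 1]
```

One large delta only widens its own block, which is why the second block needs 4 bits per value while the first needs 2.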
4. Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+KB blocks

In-memory index: docId=0 -> offset=42, docId=3 -> offset=127, docId=4 -> offset=199
Blocks on disk: [docs 0, 1, 2] (16KB) [doc 3] (16KB) [docs 4, 5, 6] (16KB)
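Finding the block that holds a document can be sketched as a binary search over the first doc id of each block, using the index entries from the figure. The two parallel lists and the `block_offset` helper are assumptions for illustration; decompressing the block and locating the document inside it are left out.

```python
from bisect import bisect_right

# In-memory index: first doc id of each compressed block and the block's file offset.
INDEX_DOC_IDS = [0, 3, 4]
INDEX_OFFSETS = [42, 127, 199]

def block_offset(doc_id):
    """Return the offset of the block whose first doc id is the largest one <= doc_id."""
    slot = bisect_right(INDEX_DOC_IDS, doc_id) - 1
    if slot < 0:
        raise KeyError(doc_id)
    return INDEX_OFFSETS[slot]

print(block_offset(2), block_offset(3), block_offset(6))   # 42 127 199
```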
Query execution
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fit into the file-system cache
• “Pulse” optimization
– For unique terms (freq=1), postings are inlined in the terms dict
– Only 1 disk seek
– Will always be used for your primary keys
Quiz
What is happening here?
[Figure: qps plotted against #docs in the index, with two points marked 1 and 2]
1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
2. Terms dict/Postings lists not fully in the cache
Thank you!

High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 

Recently uploaded (20)

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

What is in a Lucene index?

  • 2. WHAT IS IN A LUCENE INDEX Adrien Grand @jpountz Software engineer at Elasticsearch
  • 3. About me • Lucene/Solr committer • Software engineer at Elasticsearch • I like changing the index file formats! – stored fields – term vectors – doc values – ...
  • 4. Why should I learn about Lucene internals?
  • 5. Why should I learn about Lucene internals? • Know the cost of the APIs – to build blazing fast search applications – don’t commit all the time – when to use stored fields vs. doc values – maybe Lucene is not the right tool • Understand index size – oh, term vectors are 1/2 of the index size! – I removed 20% of my documents and index size hasn’t changed • This is a lot of fun!
  • 6. Indexing • Make data fast to search – duplicate data if it helps – decide on how to index based on the queries • Trade update speed for search speed – Grep vs full-text indexing – Prefix queries vs edge n-grams – Phrase queries vs shingles • Indexing is fast – 220 GB/hour for 4K docs! – http://people.apache.org/~mikemccand/lucenebench/indexing.html
  • 7. Let’s create an index • Tree structure – sorted for range queries – O(log(n)) search [Diagram: a tree over the terms sql, index, data, term and Lucene, pointing to the documents “Lucene in action” and “Databases”]
  • 9. Another index • Store terms and documents in arrays – binary search [Diagram: terms array, as ordinal / term / doc ids: 0 data 0,1; 1 index 0,1; 2 Lucene 0; 3 term 0; 4 sql 1. Documents array, as doc id / document: 0 Lucene in action; 1 Databases]
  • 10. Another index • Store terms and documents in arrays – binary search [Diagram: one segment. Terms dict with postings lists, as term ordinal / term / postings list: 0 data 0,1; 1 index 0,1; 2 Lucene 0; 3 term 0; 4 sql 1. Documents, as doc id / document: 0 Lucene in action; 1 Databases]
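The array layout on this slide can be sketched in a few lines of Python. This is a toy model, not Lucene code: `build_segment` and `lookup` are hypothetical names, tokenization is a plain whitespace split, and the sample documents are made up for illustration.

```python
from bisect import bisect_left

def build_segment(docs):
    """Build a toy segment: a sorted terms dict plus one postings list per term."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    terms = sorted(postings)                     # position in this array = term ordinal
    return terms, [postings[t] for t in terms]   # postings list, indexed by ordinal

def lookup(segment, term):
    """Binary-search the terms dict and return the doc ids for the term."""
    terms, postings = segment
    ord_ = bisect_left(terms, term)
    if ord_ < len(terms) and terms[ord_] == term:
        return postings[ord_]
    return []

segment = build_segment(["lucene index data term", "sql index data"])
print(lookup(segment, "index"))  # doc ids of the documents containing "index"
```

Because the terms array is sorted, a lookup costs O(log(n)) comparisons, and the position found doubles as the term ordinal.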
  • 11. Insertions? • Insertion = write a new segment • Merge segments when there are too many of them – concatenate docs, merge terms dicts and postings lists (merge sort!) [Diagram: first segment, doc 0 Lucene in action, with terms data 0; index 0; Lucene 0; term 0. Second segment, doc 0 Databases, with terms data 0; index 0; sql 0. Merged segment, docs 0 Lucene in action and 1 Databases, with terms data 0,1; index 0,1; Lucene 0; term 0; sql 1]
  • 12. Insertions? • Insertion = write a new segment • Merge segments when there are too many of them – concatenate docs, merge terms dicts and postings lists (merge sort!) [Diagram: first segment, doc 0 Lucene in action, with terms data 0; index 0; Lucene 0; term 0. Second segment with its doc id remapped to 1, doc 1 Databases, with terms data 1; index 1; sql 1. Merged segment, docs 0 Lucene in action and 1 Databases, with terms data 0,1; index 0,1; Lucene 0; term 0; sql 1]
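The merge described on these two slides can be sketched as a two-way merge over sorted term arrays. This is an illustrative Python sketch under simplifying assumptions: a segment is a `(terms, postings)` pair of parallel lists, `merge_segments` is a hypothetical name, and doc ids of the second segment are remapped by adding the first segment's document count.

```python
def merge_segments(seg_a, seg_b, num_docs_a):
    """Merge two toy segments; doc ids from seg_b are shifted by num_docs_a."""
    terms_a, postings_a = seg_a
    terms_b, postings_b = seg_b
    terms, postings = [], []
    i = j = 0
    while i < len(terms_a) or j < len(terms_b):
        ta = terms_a[i] if i < len(terms_a) else None
        tb = terms_b[j] if j < len(terms_b) else None
        # Pick the smaller of the two current terms (merge-sort step).
        term = ta if tb is None or (ta is not None and ta <= tb) else tb
        docs = []
        if ta == term:
            docs += postings_a[i]
            i += 1
        if tb == term:
            docs += [d + num_docs_a for d in postings_b[j]]  # remap doc ids
            j += 1
        terms.append(term)
        postings.append(docs)
    return terms, postings

first = (["data", "index", "lucene", "term"], [[0], [0], [0], [0]])
second = (["data", "index", "sql"], [[0], [0], [0]])
print(merge_segments(first, second, num_docs_a=1))
```

Since both inputs are already sorted, the merge is a single sequential pass over each segment, which is why it maps well onto sequential disk I/O.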
  • 13. Deletions? • Deletion = turn a bit off • Ignore deleted documents when searching and merging (reclaims space) • Merge policies favor segments with many deletions [Diagram: the segment from the previous slides (data 0,1; index 0,1; Lucene 0; term 0; sql 1) plus a live docs bit per document, 1 = live, 0 = deleted: doc 0 Lucene in action 1; doc 1 Databases 0]
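A minimal Python sketch of the live docs idea, assuming one flag per document held in a `bytearray` (a real implementation would pack bits). The names `delete_doc` and `live_matches` are hypothetical and only illustrate that postings stay untouched while deleted docs are filtered at read time.

```python
def delete_doc(live_docs, doc_id):
    """Deleting a document only flips its flag; postings are left as they are."""
    live_docs[doc_id] = 0

def live_matches(postings_list, live_docs):
    """Skip deleted documents while iterating over a postings list."""
    return [doc for doc in postings_list if live_docs[doc]]

live_docs = bytearray([1, 1])      # 1 = live, 0 = deleted
delete_doc(live_docs, 1)           # delete doc 1
print(live_matches([0, 1], live_docs))
```

The space held by the deleted document is only reclaimed when a later merge copies the live documents into a new segment.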
  • 14. Pros/cons • Updates require writing a new segment – single-doc updates are costly, bulk updates preferred – writes are sequential • Segments are never modified in place – filesystem-cache-friendly – lock-free! • Terms are deduplicated – saves space for high-freq terms • Docs are uniquely identified by an ord – useful for cross-API communication – Lucene can use several indexes in a single query • Terms are uniquely identified by an ord – important for sorting: compare longs, not strings – important for faceting (more on this later)
  • 15. Lucene can use several indexes. Many databases can’t.
  • 16. Index intersection [Diagram: postings lists for “red” and “shoe”: 1, 2, 10, 11, 20, 30, 50, 100 and 2, 20, 21, 22, 30, 40, 100, with steps numbered 1 to 9 alternating between the two lists] • Lucene’s postings lists support skipping that can be used to “leap-frog” • Many databases just pick the most selective index and ignore the other ones
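The leap-frog intersection can be sketched as follows, using the two postings lists from the slide. This is a simplified illustration: `advance` uses a binary search from the current position as a stand-in for Lucene's skip data, and both function names are hypothetical.

```python
from bisect import bisect_left

def advance(postings, start, target):
    """Index of the first entry >= target at or after position start."""
    return bisect_left(postings, target, start)

def intersect(a, b):
    """Leap-frog: whichever list is behind jumps forward to the other's doc id."""
    i = j = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return matches

red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
print(intersect(red, shoe))
```

Both lists contribute to pruning: each jump skips every entry that cannot match, instead of scanning one list and probing the other for each doc id.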
  • 17. What else? • We just covered search • Lucene does more – term vectors – norms – numeric doc values – binary doc values – sorted doc values – sorted set doc values
  • 18. Term vectors • Per-document inverted index • Useful for more-like-this • Sometimes used for highlighting [Diagram: the segment-wide inverted index (data 0,1; index 0,1; Lucene 0; term 0; sql 1) next to one small inverted index per document: doc 0 Lucene in action with data, index, Lucene, term; doc 1 Databases with data, index, sql]
  • 19. Numeric/binary doc values • Per-doc and per-field single numeric values, stored in a column-stride fashion • Useful for sorting and custom scoring • Norms are numeric doc values [Table, as doc id / document / field_a / field_b: 0 Lucene in action 42 afc; 1 Databases 1 gce; 2 Solr in action 3 ppy; 3 Java 10 ccn]
  • 20. Sorted (set) doc values • Ordinal-enabled per-doc and per-field values – sorted: single-valued, useful for sorting – sorted set: multi-valued, useful for faceting [Diagram: ordinals per document, as doc id / document / ordinals: 0 Lucene in action 1,2; 1 Databases 0; 2 Solr in action 0,1,2; 3 Java 1. Terms dictionary for this dv field, as ordinal / term: 0 distributed; 1 Java; 2 search]
  • 21. Faceting • Compute value counts for docs that match a query – e.g. category counts on an ecommerce website • Naive solution – hash table: value to count – O(#docs) ordinal lookups – O(#docs) value lookups • 2nd solution – hash table: ord to count (since ordinals are dense, this can be a simple array) – resolve values in the end – O(#docs) ordinal lookups – O(#values) value lookups
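The second faceting solution can be sketched in Python with the sorted set doc values from slide 20. The function name `facet_counts` and the plain-list representation of per-document ordinals are assumptions made for illustration, not Lucene's API.

```python
def facet_counts(matching_docs, doc_ords, terms):
    """Count by ordinal in a dense array, resolve ordinals to values once at the end."""
    counts = [0] * len(terms)            # ordinals are dense: an array replaces a hash table
    for doc in matching_docs:            # O(#docs) ordinal lookups
        for ord_ in doc_ords[doc]:
            counts[ord_] += 1
    # O(#values) value lookups, done once after counting
    return {terms[ord_]: n for ord_, n in enumerate(counts) if n}

terms = ["distributed", "Java", "search"]        # terms dictionary of the dv field
doc_ords = [[1, 2], [0], [0, 1, 2], [1]]         # ordinals per doc id
print(facet_counts([0, 2, 3], doc_ords, terms))
```

The inner loop only touches integers; strings are looked up once per distinct value rather than once per matching document.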
  • 22. How can I use these APIs? • These are the low-level Lucene APIs; everything is built on top of them: searching, faceting, scoring, highlighting, etc.
      API | Useful for | Method
      Inverted index | Term -> doc ids, positions, offsets | AtomicReader.fields
      Stored fields | Summaries of search results | IndexReader.document
      Live docs | Ignoring deleted docs | AtomicReader.liveDocs
      Term vectors | More like this | IndexReader.termVectors
      Doc values / Norms | Sorting/faceting/scoring | AtomicReader.get*Values
  • 23. Wrap up • Data duplicated up to 4 times – not a waste of space! – easy to manage thanks to immutability • Stored fields vs doc values – optimized for different access patterns – get many field values for a few docs: stored fields – get a few field values for many docs: doc values [Diagram, entries as doc,field. Stored fields: 0,A 0,B 0,C 1,A 1,B 1,C 2,A 2,B 2,C; at most 1 seek per doc. Doc values: 0,A 1,A 2,A 0,B 1,B 2,B 0,C 1,C 2,C; at most 1 seek per doc per field, BUT more disk / file-system cache-friendly]
  • 25. Important rules • Save file handles – don’t use one file per field or per doc • Avoid disk seeks whenever possible – disk seek on spinning disk is ~10 ms • BUT don’t ignore the filesystem cache – random access in small files is fine • Light compression helps – less I/O – smaller indexes – filesystem-cache-friendly
  • 26. Codecs • File formats are codec-dependent • Default codec tries to get the best speed for little memory – to trade memory for speed, don’t use RAMDirectory: use MemoryPostingsFormat, MemoryDocValuesFormat, etc. • Detailed file formats available in javadocs – http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
  • 27. Compression techniques • Bit packing / vInt encoding – postings lists – numeric doc values • LZ4 – code.google.com/p/lz4 – lightweight compression algorithm – stored fields, term vectors • FSTs – conceptually a Map<String, ?> – keys share prefixes and suffixes – terms index
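As an illustration of vInt encoding, here is a minimal Python sketch of a variable-length integer codec: seven payload bits per byte, low-order bits first, with the high bit set when more bytes follow. It assumes non-negative inputs, and `encode_vint` / `decode_vint` are illustrative names rather than Lucene methods.

```python
def encode_vint(n):
    """Encode a non-negative int, 7 bits per byte, low-order bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)   # high bit set: more bytes follow
        n >>= 7
    out.append(n)                        # last byte has the high bit clear
    return bytes(out)

def decode_vint(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

encoded = encode_vint(300)
print(len(encoded), decode_vint(encoded))
```

Small values, which are the common case for delta-encoded doc ids, fit in a single byte.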
  • 28. What happens when I run a TermQuery?
  • 29. 1. Terms index • Look up the term in the terms index – in-memory FST storing terms prefixes – gives the offset to look at in the terms dictionary – can fast-fail if no terms have this prefix [Diagram: FST whose arc weights sum along each path: b/2, r, a/1, c and l/4, then u, c or y/3, r, giving br = 2, brac = 3, luc = 4, lyr = 7]
  • 30. 2. Terms dictionary • Jump to the given offset in the terms dictionary – compressed based on shared prefixes, similarly to a burst trie – called the “BlockTree terms dict” • Read sequentially until the term is found [Diagram: block with prefix=luc, scanned from the jump target: a, freq=1, offset=101 (not found); as, freq=1, offset=149 (not found); ene, freq=9, offset=205 (found); ky, freq=7, offset=260; rative, freq=5, offset=323]
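The sequential scan inside one prefix-compressed block can be sketched in Python with the entries from the slide. The tuple layout `(suffix, freq, offset)` and the name `find_in_block` are assumptions for illustration; the real BlockTree on-disk format is more involved.

```python
def find_in_block(prefix, entries, term):
    """Scan a block of (suffix, freq, offset) entries sharing one prefix."""
    if not term.startswith(prefix):
        return None
    suffix = term[len(prefix):]
    for entry_suffix, freq, offset in entries:   # entries are sorted by suffix
        if entry_suffix == suffix:
            return freq, offset                  # found: postings start at offset
        if entry_suffix > suffix:
            return None                          # passed the slot where it would be
    return None

block = [("a", 1, 101), ("as", 1, 149), ("ene", 9, 205), ("ky", 7, 260), ("rative", 5, 323)]
print(find_in_block("luc", block, "lucene"))
```

Storing the shared prefix once per block is what saves space; the price is a short linear scan inside the block.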
  • 31. 3. Postings lists • Jump to the given offset in the postings lists • Encoded using modified FOR (Frame of Reference) delta – 1. delta-encode – 2. split into blocks of N=128 values – 3. bit packing per block – 4. if remaining docs, encode with vInt • Example with N=4: doc ids 1,3,4,6,8,20,22,26,30,31 – deltas: 1,2,1,2,2,12,2,4,4,1 – [1,2,1,2]: 2 bits per value – [2,12,2,4]: 4 bits per value – 4, 1: vInt-encoded
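The N=4 example from this slide can be reproduced with a short Python sketch. It only computes the deltas, the blocks and the bit width each block would need; it does not actually pack bits, and `delta_encode` / `for_blocks` are hypothetical names.

```python
def delta_encode(doc_ids):
    """Replace each doc id by its gap to the previous one."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def for_blocks(doc_ids, block_size=4):
    """Split deltas into full blocks with their bit width, plus a vInt tail."""
    deltas = delta_encode(doc_ids)
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()   # width of the largest delta in the block
        blocks.append((bits, block))
    return blocks, deltas[full:]         # the tail would be vInt-encoded

blocks, tail = for_blocks([1, 3, 4, 6, 8, 20, 22, 26, 30, 31])
print(blocks, tail)
```

A single large gap, such as the 12 in the second block, raises the width for the whole block, which is why block size is a trade-off.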
  • 32. 4. Stored fields • In-memory index for a subset of the doc ids – memory-efficient thanks to monotonic compression – searched using binary search • Stored fields – stored sequentially – compressed (LZ4) in 16+KB blocks [Diagram: docs 0 to 6 stored in three 16KB blocks; index entries docId=0 offset=42, docId=3 offset=127, docId=4 offset=199]
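The in-memory stored fields index can be sketched as a binary search over the first doc id of each compressed block, using the three index entries from the slide. This is a simplified Python illustration: `block_offset` is a hypothetical name, and the monotonic compression of the index itself is left out.

```python
from bisect import bisect_right

def block_offset(first_doc_ids, offsets, doc_id):
    """Find the file offset of the block that contains doc_id."""
    # Last block whose first doc id is <= doc_id (assumes doc_id >= first_doc_ids[0]).
    block = bisect_right(first_doc_ids, doc_id) - 1
    return offsets[block]

first_doc_ids = [0, 3, 4]       # first doc id stored in each block
offsets = [42, 127, 199]        # where each block starts in the file
print(block_offset(first_doc_ids, offsets, 2))
```

Once the offset is known, a reader would seek there, decompress the block and walk forward to the requested document.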
  • 33. Query execution • 2 disk seeks per field for search • 1 disk seek per doc for stored fields • It is common that the terms dict / postings lists fit into the file-system cache • “Pulse” optimization – for unique terms (freq=1), postings are inlined in the terms dict – only 1 disk seek – will always be used for your primary keys
  • 34. Quiz
  • 35. What is happening here? [Plot: qps (y-axis) against #docs in the index (x-axis), with two points marked 1 and 2]
  • 36. What is happening here? [Plot: qps against #docs in the index, points 1 and 2] – 1: Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
  • 37. What is happening here? [Plot: qps against #docs in the index, points 1 and 2] – 1: Index grows larger than the filesystem cache: stored fields not fully in the cache anymore – 2: Terms dict/Postings lists not fully in the cache