Mikhail khludnev: approaching-join index for lucene

Approaching Join Index
yet another one join algorithm
Mikhail Khludnev
principal engineer

PRIVILEGED AND CONFIDENTIAL
• Grid Dynamics is a Silicon Valley-based leading provider of scalable, next-generation
commerce technology solutions
• Record of outperformance with Tier 1 retail clients
• Fortune 1000 client relationships

About Me
● principal engineer at Grid Dynamics
● spoke at few last LuceneRevolutions
● contributed BlockJoin query parser for Solr - {!parent}
● blogged about it at http://blog.griddynamics.com/
● tried to fix threads at DataImportHandler
http://google.com/+MikhailKhludnev

You are expected to know
● how Lucene searches/filters
● how it counts facets
● that there are segments
● what is DocValues
● why to join
● RDBMS joins: nested loop join, sort-merge join and hash join.

I’m expected to know
● query-time join
● index-time join
● yet another one join

Lucene/Solr/Elastic Is Strong
● searching
○ filtering
● analytics
○ facets
○ pivots
○ stats

SKU_ID: 13
PROD_ID: 1
TYPE: SKU
SIZE: 8
COLOR:Blue
SKU_ID: 12
PROD_ID: 1
TYPE: SKU
SIZE: 8
COLOR:Green
There is a weakness
● robust joins
○ multiple entities
○ relations
PROD_ID: 1
TYPE: PROD
BRAND: Nike
NAME: Shoes
PRICE: $50
SKU_ID: 11
PROD_ID: 1
TYPE: SKU
SIZE: 7
COLOR: Black

children
Joins in General
1:M
parents

Join in General
parents ∩ join-relation ∩ children

JoinUtil
q
“25”
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
FK[doc#]

JoinUtil
q
FK[doc#]
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
“25”
“25”
“17”
...

JoinUtil
q
PK
“1” →△
…
“17”→△
“25”→△
...
“25”
“17”
...
“25”
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
FK[doc#]

JoinUtil
q
“1” →△
…
“17”→△
“25”→△
...
“25”
“17”
...
fq
“25”
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
FK[doc#]PK

Block Join
doc#1
0
0
1
0
0
1
0
0
1
0

Block Join
doc#1
0
0
1
0
0
1
0
0
1
0
q

Block Join
doc#1
0
0
1
0
0
1
0
0
1
0
q
fq

Comparison
JoinUtil BlockJoin
searching slow fast
reindexing

Comparison
JoinUtil BlockJoin
searching slow fast
reindexing fast slow

Comparison
doc#
JoinUtil BlockJoin
searching slow fast
reindexing fast slow

Comparison
JoinUtil BlockJoin
searching slow < ? < fast
reindexing fast > ? > slow

Join Index
q
doc#[doc#]
3
6
0
3
10
0
6
3

Join Index
q
doc#[doc#]
fq 3
6
0
3
10
0
6
3

Join Index
q
fq
doc#[doc#]
2
4
1
6
5 10
9
8

Join Index
q
doc#[doc#]
fq
2
4
1
6
5 10
9
8

GlobalOrdinalsQuery
“25”
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
SortedDocValues

GlobalOrdinalsQuery
“25”
“17”
“17”
“25”
“25”
“56”
“56”
“56”
“25”
“4”
“61”
SortedDocValues
0 “17”
1 “25”
2 “4”
3 “56”
4 “61”
Ordinals
0
0
1
1
3
3
3
1
2
4
1

GlobalOrdinalsQuery
q
0
0
1
1
3
3
3
1
2
4
1
1
1
0
0
0

fq
GlobalOrdinalsQuery
q
0
0
1
1
3
3
3
1
2
4
1
1
1
0
0
0

Benchmarking 2.9 M docs
https://github.com/m-khl/lucene-solr/tree/dvjoin-benchmark-5-1
Latency, ms
the bigger the worse
7
28
14
BlockJoin
(i-time)
Join Index
JoinUtil
(q-time)
GloblOrdinals

●
JoinUtil JoinIndex
Global
Ordinals
Block
Join
searching slow fast fast
faster
anyway
reindexing fast uber slow fast slow
<

doc#[doc#]
Indexing is still a problem
2
4
3
6
1
0
3
10
0
6
3
5 10
9
8

Further Plans 2.0
● incremental join-index update
● perhaps just calculate and cache it
● or put to dedicated index
● join in both directions
● calculate optimal execution plan of segments enumeration
● edge case for benchmark

Summary
● Joins in General
● JoinUtil vs Block-join vs GlobalOrdinals
● updatable DocValues
● opportunities for improving query-time joins:
○ eliminate term enum
○ choose lower cardinality side for enumeration
○ GlobalOrdinalsJoin

References
● Searching relational content with Lucene's BlockJoinQuery
http://blog.mikemccandless.com/
● Solr Experience: search parent-child relations. Part I
Solr block-join support
http://blog.griddynamics.com/
● https://wiki.apache.org/solr/Join
● http://www.slideshare.net/martijnvg/document-relations
● SOLR-6234 - {!scorejoin }
● LUCENE-6352
● Updatable DocValues Under the Hood
http://shaierera.blogspot.com/
● Subject: How to openIfChanged the most recent merge?
at: java-dev@lucene.apache.org

Thanks for Joining us!
https://goo.gl/hjsYZW

Joins’ Zoo in Lucene
True Joins
● query-time join
○ JoinUtil
○ {!join }
○ {!scorejoin } - SOLR-6234
● index-time join aka block-join {!
parent}

Joins’ Zoo in Lucene
Workarounds
● term positions/SpanQueries
● FieldCollapsing/Grouping
● term decoration
○ spatial
● multivalue fields
True Joins
● query-time join
○ JoinUtil
○ {!join },
○ {!scorejoin } - SOLR-6234
● index-time join aka block-join {!
parent}

Two phase update problem
Subject: How to openIfChanged the most recent merge?
at: java-dev@lucene.apache.org

JoinUtil
● query-time
● indexing is fast
● searching is slow, why?
○ expensive term enum
○ single enumeration order
BlockJoin
● index-time
● reindexing whole block is as expensive as mandatory
● searching is darn fast, however
○ can’t reorder child docs

store ref = segment#, doc#
put ref to previous and current segment in DV
when add new segment, join IDs with previous segments
- for parent, just ref to all children docnums
- for children, add plain field refsToSeg:seg#
-
when score parents on some segment
- buffer them with the link refs, then
- intersect buffered link refs with children query on previous segments
- search all segments for refsToSeg:seg#, intersect with children query, obtain
perent ref from DV intersect with buffered

Mikhail khludnev: approaching-join index for lucene

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Mikhail khludnev: approaching-join index for lucene

Similar to Mikhail khludnev: approaching-join index for lucene (20)

More from Grid Dynamics

More from Grid Dynamics (20)

Recently uploaded

Recently uploaded (20)

Mikhail khludnev: approaching-join index for lucene