My slides on Heliosearch/Solr, covering native code performance optimizations, off-heap data structures to prevent garbage collection issues, and the new JSON Facet API.
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
1. Native Code, Off-Heap Data &
JSON Facet API for Solr
Yonik Seeley
ApacheCon EU 2014
Budapest, Hungary
2. My Background
• Creator of Solr
• Heliosearch Founder
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
3. Heliosearch Project
• The Next Evolution of Solr
• Forked from Solr, developed on GitHub
– Started Jan 2014
– Well aligned community
– Open Source, Apache licensed
• Bring back to Apache in the future?
• Currently drop-in replacement for Solr at the HTTP-API level
– A super-set… we continually merge in upstream changes
– Latest version of Heliosearch includes latest Solr
• Current features: off-heap filters, off-heap FieldCache, facet-by-function, sub-facets, native code performance enhancements
5. Garbage Collection Basics
[Diagram: heap regions: Eden Space, Survivor Space 1, Survivor Space 2, Tenured Space, Permanent Space; a thread acting as a GC root]
New objects allocated in Eden
Find live objects by tracing from GC “roots” (threads, stack locals, etc.)
Make a copy of live objects, leaving “garbage” behind
Eden + Survivor Space copied together to the other Survivor space
Tenured from Survivor when old enough
“Stop-the-world” needed when GC can’t keep up
Out of memory when too much time is spent in GC
6. Java Memory Waste
- Need to size for worst case scenario
- OS needs free memory to cache index files
- JVMs aren’t good at “sharing” with rest of the system
- mmap allocations managed by OS, can be immediately reused on free
[Diagram: OS real memory shared by two JVMs (each with a max heap split into unused heap and heap in use), two mmap allocations, and two C processes (each with unused heap and C heap in use). The remaining “free” memory includes the buffer cache, which is important for caching index files.]
7. GC Impact
GC Reduces Throughput
Time to copy all that memory around could be spent
better!
Stop-the-world pauses
Seconds to Minutes long
Pause time proportional to heap size
Still exists in all HotSpot GCs… CMS, G1GC, etc
Breaks Application SLAs (request timeouts, etc)
Can cause SolrCloud Zookeeper session timeouts
Reducing max pause size normally means reduced
throughput
Non-graceful degradation
if you don't size your heap big enough… BOOM!
9. GC Reduction
Reuse objects – cause less garbage
Move certain things off-heap (invisible to GC)
Option1: Direct ByteBuffers
Limited to “int” (2GB)
No way to directly “free” – still relies on GC
Option2: sun.misc.Unsafe
malloc() + free() + direct memory access
Supported on all major JVMs
Widely used: Java (nio, concurrent), JSR166, Google Guava, Objenesis (which is used in Kryo, which is used in Twitter Storm), Apache DirectMemory, Lightning, Hazelcast, snappy, gson, …
Being considered for Java 9
11. Off-Heap Filters Test
Observed max process sizes
Solr : 3.8GB – 4.3GB
Heliosearch: 3.6GB – 3.7GB
12. Off-Heap FieldCache
Normal (on-heap) FieldCache
Typically the largest data structures kept on the heap
Used for sorting, function query values, single-valued faceting,
grouping
Uses weak references
Heliosearch nCache (n is for “native”)
Allocated off-heap
First-class managed Solr cache
Configure size, warming policies
View statistics
Per-segment (NRT friendly)
No weak references
13.
14. nCache admin stats
item_id:{ "field":"id", "uses":8, "class":"StrTopValues",
"refcount":2, "numSegments":7, "carriedOver":6, "size":612}
item_popularity:{ "field":"popularity", "uses":5,
"class":"IntTopValues", "refcount":2, "numSegments":7,
"carriedOver":6, "size":106}
item_price:{
"field":"price",
"uses":0, -- the number of top-level uses for searcher
"class":"FloatTopValues",
"refcount":2,
"numSegments":5, -- number of segments populated
"carriedOver":5, -- number of segments carried over from last searcher
"size":272 -- size in bytes for all populated segments
}
15. Off-Heap Integer Field
50M document index
Sorting on 6 different integer fields (10,100,1000,10000,1M unique values)
4 request threads
Results
42% faster sorting
73% faster functions
16. String Field Sorting
10M document index
10 different string fields, each field 80% populated
Median latency
17. String Field Sorting Throughput
Concurrent throughput sorting on random fields in random order (asc/desc)
~50% performance gain
19. Native Code
The Idea: create native accelerators for CPU hotspots
Faceting anyone?
But…. JNI Sucks! (and it’s GC’s fault again)
jint *buf = (*env)->GetIntArrayElements(env, arr, 0);
for (i=0; i<len; i++) {
  sum += buf[i];
}
GetIntArrayElements() – makes a *copy* of the array!
GetPrimitiveArrayCritical() – blocks garbage collection!
Tons of other restrictions… it’s a “critical section”
Defeats the purpose of going to native code in the first place
But… our data is already off-heap, we’re good!
20. Native Single Valued String Faceting
Top-Level off-heap String cache
Improves Sorting and Faceting speed
Eliminates FieldCache “insanity”
Native Code
Written in C++, compiled with GCC 4.7, 4.8
Currently supports 64 bit Windows, OS-X, Linux (x86)
static compilation avoids JVM hotspot warmup period,
mis-compilation bugs, and variations between runs
25. Facet Module Goals
Replace the aging “SimpleFacets”
First class JSON support
Easier programmatic construction of complex nested facet
commands
Canonical response format that is easier for clients to
parse
First class analytics support
Cleaner distributed search support
Fully pluggable
Better base for integration of other search features
Heliosearch is a Solr super-set, so you can still choose to
use the old faceting or mix-n-match.
26. API Comparison
Old style:
&facet=true
&facet.range={!key=age_ranges}age
&f.age_ranges.facet.range.start=0
&f.age_ranges.facet.range.end=100
&f.age_ranges.facet.range.gap=10
&facet.range={!key=price_ranges}price
&f.price_ranges.facet.range.start=0
&f.price_ranges.facet.range.end=1000
&f.price_ranges.facet.range.gap=50
New JSON API:
{
age_ranges: { // facet name
range: { // facet type
field : age, // facet params
start : 0,
end : 100,
gap : 10
}
},
price_ranges: {
range: {
field : price,
start : 0,
end : 1000,
gap : 50
}
}
}
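The mapping between the two styles can be sketched in a few lines of Python. This is an illustration only: `old_style_params` is a hypothetical helper, not part of Solr or Heliosearch, and the field names are emitted as quoted strings because Python's `json` module writes strict JSON.

```python
import json
from urllib.parse import urlencode

# The nested facet request from the slide, as a plain Python dict.
facets = {
    "age_ranges":   {"range": {"field": "age",   "start": 0, "end": 100,  "gap": 10}},
    "price_ranges": {"range": {"field": "price", "start": 0, "end": 1000, "gap": 50}},
}

def old_style_params(facets):
    """Hypothetical helper: flatten range facets into old-style parameters."""
    params = [("facet", "true")]
    for name, spec in facets.items():
        r = spec["range"]
        params.append(("facet.range", "{!key=%s}%s" % (name, r["field"])))
        for opt in ("start", "end", "gap"):
            params.append(("f.%s.facet.range.%s" % (name, opt), str(r[opt])))
    return params

# New style: one JSON document, sent as the json.facet parameter.
json_facet = json.dumps(facets)
print(urlencode({"json.facet": json_facet}))

# Old style: one flat parameter per setting.
for key, value in old_style_params(facets):
    print("&%s=%s" % (key, value))
```

The nested form keeps each facet's settings together, which is what makes programmatic construction easier than string-building the flat parameter names.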
27. Facet Functions
Sort/Report by things other than “count”
Aggregation Functions / Stats:
count
sum(function)
avg(function)
sumsq(function)
min(function)
max(function)
unique(string_field)
any “function query” that yields a
numeric value!
Example:
sum(mul(num_units, unit_price))
Stats are calculated “per bucket”
Buckets created by Query, Range, or Terms (field) facets
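What "per bucket" means can be shown with a small Python emulation. This is a sketch of the concept, not the Heliosearch implementation: the documents and the `category` field are made up, and `facet_stats` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up documents; num_units and unit_price mirror the slide's example.
docs = [
    {"category": "shoes", "num_units": 2, "unit_price": 50.0},
    {"category": "shoes", "num_units": 1, "unit_price": 80.0},
    {"category": "hats",  "num_units": 3, "unit_price": 10.0},
]

def facet_stats(docs, bucket_field, func):
    """Group docs into buckets by bucket_field, then aggregate func per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(func(doc))
    return {
        val: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals),
              "min": min(vals), "max": max(vals)}
        for val, vals in buckets.items()
    }

# Rough equivalent of sum(mul(num_units, unit_price)) and friends, per bucket.
stats = facet_stats(docs, "category", lambda d: d["num_units"] * d["unit_price"])
print(stats["shoes"]["sum"])   # 180.0
print(stats["hats"]["avg"])    # 30.0
```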
28. Simple Request + Response
$ curl http://localhost:8983/solr/query -d 'q=widgets&
json.facet=
{ // Comments can help with clarity
/* traditional C-style comments are also supported */
x : "avg(price)" , // Simple strings can occur unquoted
y : 'unique(brand)' // Strings can also use single quotes
}
'
[…]
"facets" : {
"count" : 314,
"x" : 102.5,
"y" : 28
}
(“count” is the number of documents in the facet bucket.)
30. Sub-Facets
Any facet that produces buckets can have sub-facets
(terms/field, range, query)
Sub-facets can have facet functions (stats) or their
own sub-facets (no limit to nesting).
A subfacet can be any type (field, range, query)
Multiple subfacets can be added to any given facet
Subfacets are first-class facets - can be configured
independently like any other facet.
Different offsets, limits, stats, sorts, etc
31. Sub-Facet Example
json.facet={
shoes:{
terms:{
field: shoe_style,
sort: {x : desc},
facet:{
x : "avg(price)",
y : "unique(brand)",
colors :{terms:color}
}
}
}
}
"facets": {
"count" : 472,
"shoes": {
"buckets" : [
{
"val" : "Hiking",
"count" : 34,
"x" : 135.25,
"y" : 17,
"colors" : {
"buckets" : [
{ "val" : "brown",
"count" : 12 },
{ "val" : "black",
"count" : 10
}, […]
]
} // end of colors sub-facet
}, // end of Hiking bucket
{
"val" : "Running",
"count" : 45,
"x" : 110.75,
"y" : 24,
"colors" : {
"buckets" : […]
(colors : {terms : color} is the short form for a terms facet: it simply specifies the field and sorts buckets by count descending.)
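Because every bucket-producing facet uses the same "buckets" layout, a client can walk the response recursively. The Python sketch below assumes a response shaped like the one on this slide (values copied from the slide, elided parts left out); `walk` is a hypothetical helper name.

```python
# Abbreviated response in the shape shown on the slide.
response = {
    "count": 472,
    "shoes": {"buckets": [
        {"val": "Hiking", "count": 34, "x": 135.25, "y": 17,
         "colors": {"buckets": [{"val": "brown", "count": 12},
                                {"val": "black", "count": 10}]}},
    ]},
}

def walk(facet, path=()):
    """Yield (path, bucket) for every bucket, descending into sub-facets."""
    for name, value in facet.items():
        # Anything that is a dict with a "buckets" list is treated as a facet.
        if isinstance(value, dict) and "buckets" in value:
            for bucket in value["buckets"]:
                here = path + ((name, bucket["val"]),)
                yield here, bucket
                yield from walk(bucket, here)

for path, bucket in walk(response):
    print(" > ".join("%s=%s" % p for p in path), bucket["count"])
```

Stats such as x and y are plain numbers inside the bucket, so they are skipped by the walk and can be read directly from each yielded bucket.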
32. Terms Facet
Terms facet creates buckets of docs with the same value in a field
- field – The field name to facet over.
- offset – Used for paging, this skips the first N buckets. Defaults to 0.
- limit – Limits the number of buckets returned. Defaults to 10.
- mincount – Only return buckets with a count of at least this number. Defaults to 1.
- sort – Specifies how to sort the buckets produced. “count” specifies document count,
“index” sorts by the index (natural) order of the bucket value. One can also sort by any
facet function / statistic that occurs in the bucket. The default is “count desc”. This
parameter may also be specified in JSON like sort:{count:desc}. The sort order may
either be “asc” or “desc”
- missing – A boolean that specifies if a special “missing” bucket should be returned that is
defined by documents without a value in the field. Defaults to false.
- numBuckets – A boolean. If true, adds “numBuckets” to the response, an integer
representing the number of buckets for the facet (as opposed to the number of buckets
returned). Defaults to false.
- allBuckets – A boolean. If true, adds an “allBuckets” bucket to the response, representing
the union of all of the buckets. For multi-valued fields, this is different than a bucket for all
of the documents in the domain since a single document can belong to multiple buckets.
Defaults to false.
- prefix – Only produce buckets for terms starting with the specified prefix.
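How offset, limit, mincount, sort and prefix interact can be emulated in Python for a single-valued field. This is a sketch of the documented semantics, not the actual implementation; `terms_facet` is a hypothetical function, sorting by a facet function is left out, and the tie-break on value is an assumption made here to keep the output deterministic.

```python
from collections import Counter

def terms_facet(values, offset=0, limit=10, mincount=1, sort="count desc", prefix=None):
    """Emulate bucket selection for a terms facet over a single-valued field."""
    counts = Counter(v for v in values if v is not None)
    buckets = [{"val": v, "count": c} for v, c in counts.items()
               if c >= mincount and (prefix is None or v.startswith(prefix))]
    key, direction = sort.split()
    if key == "count":
        # Assumption of this sketch: ties are broken by value.
        buckets.sort(key=lambda b: b["val"])
        buckets.sort(key=lambda b: b["count"], reverse=(direction == "desc"))
    else:
        # "index": natural order of the bucket value.
        buckets.sort(key=lambda b: b["val"], reverse=(direction == "desc"))
    # Paging is applied last, after filtering and sorting.
    return buckets[offset:offset + limit]

colors = ["red", "blue", "red", "green", "blue", "red", None]
print(terms_facet(colors, limit=2))
print(terms_facet(colors, sort="index asc", mincount=2))
```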
33. Query Facet
Query facet creates a single bucket of documents matching the
query.
{ // simple example
highpop:{ query:{ q:"inStock:true AND popularity:[8 TO 10]" } }
}
{ // example with multiple sub-facets
highpop:{ query:{
q : "inStock:true AND popularity:[8 TO 10]",
facet : {
average_price : "avg(price)",
available_colors : { terms : color },
price_ranges : { range : {
field:price, start:0, end:200, gap:10
}}
}}
}
34. Range Facet
Creates buckets over ranges on a numeric or date field
Parameter names/values "in sync" with Solr range parameters:
field – The numeric field or date field to produce range buckets from
start – Lower bound of the ranges
end – Upper bound of the ranges
gap – Size of each range bucket produced
hardend – A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false,
the last bucket will be “gap” wide, which may extend past “end”.
other – This param indicates that in addition to the counts for each range constraint between facet.range.start and
facet.range.end, counts should also be computed for…
– "before" all records with field values lower than the lower bound of the first range
– "after" all records with field values greater than the upper bound of the last range
– "between" all records with field values between the start and end bounds of all ranges
– "none" compute none of this information
– "all" shortcut for before, between, and after
include – By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are
inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is
inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. This behavior can
be modified by the facet.range.include param, which can be any combination of the following options…
– "lower" all gap based ranges include their lower bound
– "upper" all gap based ranges include their upper bound
– "edge" the first and last gap ranges include their edge bounds (ie: lower for the first one, upper for the last one)
even if the corresponding upper/lower option is not specified
– "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already
include those boundaries.
– "all" shorthand for lower, upper, edge, outer
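The effect of hardend and of the default include=lower behavior can be sketched in Python. This is an illustration for numeric fields only, under the assumption that "after" starts at the upper bound of the last bucket actually produced; `range_buckets` and `count_ranges` are hypothetical names.

```python
def range_buckets(start, end, gap, hardend=False):
    """Return (low, high) pairs; each range includes low and excludes high."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    buckets = []
    low = start
    while low < end:
        high = low + gap
        if hardend and high > end:
            high = end  # last bucket is cut off at "end"
        buckets.append((low, high))
        low = high
    return buckets

def count_ranges(values, start, end, gap, hardend=False):
    """Count values per bucket (include=lower), plus the other=all counts."""
    buckets = range_buckets(start, end, gap, hardend)
    top = buckets[-1][1] if buckets else start
    counts = [sum(1 for v in values if lo <= v < hi) for lo, hi in buckets]
    return {
        "buckets": list(zip(buckets, counts)),
        "before": sum(1 for v in values if v < start),   # exclusive of start
        "after": sum(1 for v in values if v >= top),     # inclusive of the top bound
        "between": sum(counts),
    }

values = [-5, 0, 9, 10, 24, 27, 30]
print(range_buckets(0, 25, 10))                 # last bucket extends past "end"
print(range_buckets(0, 25, 10, hardend=True))   # last bucket stops at "end"
print(count_ranges(values, 0, 25, 10))
```

With these settings every value lands in exactly one of "before", a gap bucket, or "after", which is the no-double-counting property the default is meant to give.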
36. Fantasy ($1045)
Top Authors
$423 George R.R. Martin
$347 Brandon Sanderson
$155 JK Rowling
Top Books
$252 A Game of Thrones
$113 Emperor of Thorns
$101 Nine Princes in Amber
$82 Steel Heart
Sci-Fi ($898)
Top Authors
$321 Iain M Banks
$218 Neal Asher
$155 Neal Stephenson
Top Books
$113 Gridlinked
$101 Use of Weapons
$93 Snow Crash
$82 The Skinner
Mystery ($645)
Top Authors
$191 James Patterson
$145 Patricia Cornwell
$126 John Grisham
Top Books
$85 One for the Money
$77 Angels & Daemons
$64 Shutter Island
$35 The Firm
Filter By
State
$852 NJ (14 stores)
$658 NY (11 stores)
$421 CT (8 stores)
Chain
$984 Amazoon (14 stores)
$734 Houses&Royalty (9 stores)
$387 Books-r-us (7 stores)
Store
$108 Amazoon Branchburg
$93 Books-r-us Bridgewater
$87 H&R NYC
Number of Books
Chain
201K Houses&Royalty
183K Amazoon
98K Books-r-us
Store
193K H&R NYC
77K Books-r-us Bridgewater
68K Amazoon Branchburg
37. Implementation
date_breakout : { range: {
field: sale_date,
start : ...,
end : ...,
gap : "+1MONTH",
facet : {
top_genre : { terms : {
field : genre,
sort : "revenue desc",
limit : 4,
facet : {
revenue : "sum(sales)"
}
}},
by_chain: { terms : {
field : chain,
facet : {
revenue : "sum(sales)"
}
}}
[…]
Creates a series of facet buckets based on date.
For each date bucket, facet by genre, taking the top 4 by revenue.
For each genre bucket, report revenue.
38. Implementation (genre panel from slide 36: top authors and top books per genre, by revenue)
top_genres:{ terms:{
field: genre,
facet : {
rev : "sum(sales)",
top_authors:{ terms:{
field : author,
sort :"rev desc",
limit : 3,
facet : {
rev : "sum(sales)"
}
}},
top_books:{ terms:{
field : title,
sort : "rev desc",
limit : 4,
facet : {
rev : "sum(sales)"
}
}}
[…]
41. Parameter Substitution
Parameters / macros substituted across whole request
Happens before any parsing, so usable in any context
q=price:[ ${low} TO ${high} ]
&low=100
&high=200
Default values
q=price:[ ${low:0} TO ${high:100} ]
Nested
q=${price_query}
&price_query=${price_field}:[ ${low} TO ${high} ] AND inStock:true
&price_field=specialPrice
&low=50
&high=100
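The substitution rules above (plain macros, defaults, nesting) can be approximated with a short Python sketch. It is not the Heliosearch implementation: `expand` is a hypothetical function, unknown macros without a default are left untouched by choice, and nesting is handled by simply repeating the pass.

```python
import re

# ${name} or ${name:default}
MACRO = re.compile(r"\$\{([^${}:]+)(?::([^${}]*))?\}")

def expand(text, params, max_depth=10):
    """Replace ${name} / ${name:default} using params, repeating for nested macros."""
    def lookup(match):
        name, default = match.group(1), match.group(2)
        if name in params:
            return params[name]
        if default is not None:
            return default
        return match.group(0)  # unknown macro: left as-is (a choice of this sketch)
    for _ in range(max_depth):
        expanded = MACRO.sub(lookup, text)
        if expanded == text:
            break
        text = expanded
    return text

params = {
    "price_query": "${price_field}:[ ${low} TO ${high} ] AND inStock:true",
    "price_field": "specialPrice",
    "low": "50",
    "high": "100",
}
print(expand("${price_query}", params))
print(expand("price:[ ${low:0} TO ${high:100} ]", {}))
```

Since expansion works on the raw text before any parsing, the same function would apply to any request parameter, not just q.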
42. New Query Parser Features
Filters in queries - just like “fq” parameters, but may appear
anywhere in a query
q=(text:elephant -(filter(*:* -price:[ 0 TO 100 ]) OR
filter(date:[0 TO 2013])))
Constant Score Queries
q=color:(blue OR green)^=1 text:shoes
Comments in Queries (can nest)
q=+text:elephant /* the main query */ /* boosting part – WIP
{!func}mul(pop,rank)^10 */
43. Thank You
Help Develop the Next Generation of Solr!
Resources:
http://heliosearch.org
https://github.com/Heliosearch/heliosearch
https://groups.google.com/forum/#!forum/heliosearch
https://groups.google.com/forum/#!forum/heliosearch-dev
twitter.com/lucene_solr