Solr 6 Feature Preview

© Cloudera, Inc. All rights reserved.
Yonik Seeley
3/09/2016

My Background
• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford

Solr 6
• Happy Birthday Solr!
• 10 Years at the Apache Software Foundation as of 1/2016
• Release branch has been cut
• ETA before April
• Java 8+ only

Streaming Expressions

Solr Streaming Expressions
• Generic platform for distributed computation
• The basis for implementing distributed SQL
• Works across entire result sets (or subsets)
• normal search operations are designed for fast top-N operations
• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism

Tuple Streams
• A streaming expression compiles/parses to a tuple stream
• direct mapping from a streaming expression function->tuple_stream
• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems
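The source/decorator split above can be pictured with plain Python generators. This is only an analogy, not Solr code: `list_source` and `rename_field` are hypothetical names, and a tuple is modelled as a simple dict.

```python
from typing import Dict, Iterable, Iterator

TupleMap = Dict[str, object]  # hypothetical alias: one tuple is a field -> value map


def list_source(docs: Iterable[TupleMap]) -> Iterator[TupleMap]:
    """Stream source: produces tuples (here from an in-memory list)."""
    for doc in docs:
        yield dict(doc)


def rename_field(stream: Iterator[TupleMap], old: str, new: str) -> Iterator[TupleMap]:
    """Stream decorator: wraps another stream and transforms each tuple."""
    for tup in stream:
        out = dict(tup)
        if old in out:
            out[new] = out.pop(old)
        yield out


# Decorators nest around sources, just as expressions nest around search().
docs = [{"id": "0579B002", "price": 179.99}, {"id": "VDBDB1A16"}]
renamed = list(rename_field(list_source(docs), "id", "myid"))
```

Because each stage pulls tuples one at a time, nothing is buffered beyond the tuple currently being processed.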

search() expression
$ curl http://localhost:8983/solr/techproducts/stream -d \
  'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'
{"result-set":{"docs":[
{"score":1.0,"id":"0579B002","price":179.99},
{"score":1.0,"id":"100-435805","price":649.99},
{"score":1.0,"id":"3007WFP","price":2199.0},
{"score":1.0,"id":"VDBDB1A16"},
{"score":1.0,"id":"VS1GB400C3","price":74.99},
{"EOF":true,"RESPONSE_TIME":6}]}}
resulting tuple stream
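A client for the request above might look roughly like the following sketch. It only builds the POST target and body and reads tuples from a response string; the helper names `build_stream_request` and `read_tuples` are hypothetical, and no network call is made.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_stream_request(base_url, collection, expr):
    """Return the URL and form-encoded body for a /stream request."""
    url = f"{base_url}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return url, body


def read_tuples(response_text):
    """Yield tuples from a response until the EOF marker tuple is reached."""
    for doc in json.loads(response_text)["result-set"]["docs"]:
        if doc.get("EOF"):
            return
        yield doc


expr = 'search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'
url, body = build_stream_request("http://localhost:8983/solr", "techproducts", expr)

# Shortened copy of the response shown on the slide.
sample = ('{"result-set":{"docs":['
          '{"score":1.0,"id":"0579B002","price":179.99},'
          '{"score":1.0,"id":"VDBDB1A16"},'
          '{"EOF":true,"RESPONSE_TIME":6}]}}')
tuples = list(read_tuples(sample))
```

The final tuple carries `EOF` rather than document fields, so a reader has to stop there instead of treating it as a result.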

Search Tuple Stream
[Diagram: shard replicas (Shard 1, Shard 2, Shard 3) each send a tuple stream to a /stream worker executing the "search" expression, which outputs a single tuple stream.]
• search() is a stream source
• SolrCloud aware (CloudSolrStream java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node

search expression args
search( // parses to CloudSolrStream java class
  techproducts, // name of the collection to search
  zkHost="localhost:9983", // (opt) zookeeper address of collection to search
  qt="/select", // (opt) the request handler to use (/export is also available)
  rows=1000000, // (opt) number of rows to retrieve
  q="*:*", // query to match returned documents
  fl="id,price,score", // which fields to return
  sort="id asc, price desc", // how to sort the results
  aliases="id=myid,price=myprice" // (opt) renames output fields
)

reduce() streaming expression
• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains list of tuples
• "by" parameter must match up with the "sort" parameter
• Any partitioning should be done on the same group field
reduce(
  search(collection1, qt="/export",
    q="*:*",
    fl="id,manu,price",
    sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100) // stream operation
)
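The reason "by" has to line up with "sort" becomes clearer in a small simulation: grouping happens over consecutive runs of equal values, so it only works on pre-sorted input. This is a sketch in plain Python, not the Solr implementation; `reduce_by` and the `tuples` output field are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by(sorted_tuples, by, sort_field, n):
    """Emit one group head per run of equal `by` values.

    The input must already be sorted on `by`; groupby only merges
    neighbouring tuples, just like a single streaming pass would.
    """
    for _, run in groupby(sorted_tuples, key=itemgetter(by)):
        # Mirror group(sort="<field> desc", n=...): order the group, keep n.
        members = sorted(run, key=itemgetter(sort_field), reverse=True)[:n]
        head = dict(members[0])
        head["tuples"] = members  # hypothetical field holding the group
        yield head


# Hypothetical data, sorted "manu asc, price desc" as in the expression.
rows = [
    {"id": "a", "manu": "belkin", "price": 19.95},
    {"id": "b", "manu": "belkin", "price": 11.50},
    {"id": "c", "manu": "canon", "price": 329.95},
]
heads = list(reduce_by(rows, "manu", "price", 1))
```

If the input were not sorted on `manu`, the same manufacturer could show up as several separate groups.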

rollup() expression
• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting
rollup(
  search(collection1, qt="/export",
    q="*:*",
    fl="id,manu,price",
    sort="manu asc"),
  over="manu",
  count(*), // metrics
  max(price)
)
{"manu":"apple","count(*)":1.0},
{"manu":"asus","count(*)":1.0},
{"manu":"ati","count(*)":1.0},
{"manu":"belkin","count(*)":2.0},
{"manu":"canon","count(*)":2.0},
{"manu":"corsair","count(*)":3.0},
[...]
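A rollup can be computed in one pass over a stream sorted on the "over" field, keeping only the current bucket in memory. The sketch below illustrates that idea in plain Python with count(*) and max of one field; `rollup` here is a hypothetical function, not Solr's RollupStream.

```python
def rollup(sorted_tuples, over, metric_field):
    """Yield one tuple per distinct `over` value with count(*) and max()."""
    max_key = f"max({metric_field})"
    current = None
    for tup in sorted_tuples:
        if current is None or tup[over] != current[over]:
            if current is not None:
                yield current  # bucket finished: the sort guarantees no more members
            current = {over: tup[over], "count(*)": 0, max_key: None}
        current["count(*)"] += 1
        value = tup.get(metric_field)
        if value is not None and (current[max_key] is None or value > current[max_key]):
            current[max_key] = value
    if current is not None:
        yield current


# Hypothetical input, sorted "manu asc".
rows = [
    {"id": "a", "manu": "belkin", "price": 19.95},
    {"id": "b", "manu": "belkin", "price": 11.50},
    {"id": "c", "manu": "canon", "price": 329.95},
]
buckets = list(rollup(rows, "manu", "price"))
```

Unlike reduce(), only the aggregated values are emitted, which is why this is the closest streaming equivalent to faceting.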

facet() expression
• Like search+rollup, but pushes down computation to JSON Facet API
facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  avg(price),
  max(popularity)
)
{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},
{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},
{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},
{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},
{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},
{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},
{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},
[...]
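For orientation, a JSON Facet API request computing similar buckets could be assembled as below. This is a hand-written sketch of a terms facet with nested metrics; it is an assumption that the facet() expression produces something of this general shape, and the labels `m0`, `m1` are hypothetical.

```python
import json


def facet_request(field, limit, metrics):
    """Build a JSON request body with one terms facet and nested metrics."""
    sub_facets = {f"m{i}": metric for i, metric in enumerate(metrics)}
    return {
        "query": "*:*",
        "limit": 0,  # no documents needed, only facet buckets
        "facet": {
            field: {
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": "count desc",
                "facet": sub_facets,
            }
        },
    }


body = facet_request("manu", 1000, ["avg(price)", "max(popularity)"])
payload = json.dumps(body)
```

The aggregation runs inside each shard's facet module, so only bucket summaries travel over the network rather than every matching tuple.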

Parallel Tuple Stream
[Diagram: shard replicas (Shard 1, Shard 2, Shard 3) feed two workers (Partition 1 and Partition 2), whose outputs flow to a final worker that emits the tuple stream.]

Streaming Expressions – parallel
• Wraps a stream and sends it to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
• usually makes sense to partition by what you are grouping on
• inner and outer sorts should match
parallel(collection1,
  rollup(
    search(techproducts,
      q="*:*",
      fl="id,manu,price",
      sort="manu asc",
      partitionKeys="manu"),
    over="manu"),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc")
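The points about partitionKeys and matching sorts can be illustrated with a small simulation: tuples are hashed on the partition key so that every group lands on exactly one worker, each worker aggregates its share, and the sorted worker outputs are merged. This is a sketch only; `crc32` stands in for whatever hashing Solr actually uses, and all function names are hypothetical.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    parts = [[] for _ in range(workers)]
    for tup in tuples:
        slot = zlib.crc32(str(tup[key]).encode("utf-8")) % workers
        parts[slot].append(tup)  # input order (and so sort order) is preserved
    return parts


def count_by(sorted_tuples, key):
    """Per-worker rollup: count tuples in each run of equal keys."""
    for value, run in groupby(sorted_tuples, key=itemgetter(key)):
        yield {key: value, "count(*)": sum(1 for _ in run)}


def parallel_count(sorted_tuples, key, workers):
    per_worker = [count_by(part, key) for part in partition(sorted_tuples, key, workers)]
    # The outer merge relies on every worker emitting in the same sort order.
    return list(heapq.merge(*per_worker, key=itemgetter(key)))


rows = [{"manu": m} for m in
        ["apple", "belkin", "belkin", "canon", "canon", "corsair", "corsair", "corsair"]]
result = parallel_count(rows, "manu", 2)
```

Partitioning on a field other than the grouping field would split a group across workers and produce partial counts for it.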

Joins!
innerJoin(
  search(people, q="*:*", fl="personId,name", sort="personId asc"),
  search(pets, q="type:cat", fl="personId,petName", sort="personId asc"),
  on="personId"
)
Also available: leftOuterJoin, hashJoin, outerHashJoin
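Both inputs in the innerJoin example are sorted on the join field, which allows a merge join that advances through the two streams in step. The following is a sketch of that technique on in-memory lists, with hypothetical sample data; it is not Solr's implementation.

```python
def inner_merge_join(left, right, on):
    """Inner join of two lists that are both sorted ascending on `on`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][on], right[j][on]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-side tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == left_key:
                j_end += 1
            # Pair every left tuple with that key against the whole run.
            while i < len(left) and left[i][on] == left_key:
                for match in right[j:j_end]:
                    out.append({**left[i], **match})
                i += 1
            j = j_end
    return out


people = [{"personId": 1, "name": "Ann"}, {"personId": 2, "name": "Bob"},
          {"personId": 3, "name": "Cy"}]
pets = [{"personId": 1, "petName": "Tom"}, {"personId": 3, "petName": "Felix"},
        {"personId": 3, "petName": "Kit"}]
joined = inner_merge_join(people, pets, "personId")
```

A hash join trades this sort requirement for memory: one side is loaded into a hash table keyed on the join field, which suits cases where one input is small or unsorted.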

More decorators
• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
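The semantics of a few of these decorators can be shown on small in-memory lists. The functions below are simplified stand-ins with hypothetical signatures: they use sets and a heap for brevity, whereas the streaming versions work on sorted streams.

```python
import heapq
from operator import itemgetter


def complement(a, b, on):
    """Tuples from A whose `on` value does not appear in B."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] not in keys]


def intersect(a, b, on):
    """Tuples from A whose `on` value does appear in B."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] in keys]


def unique(sorted_tuples, over):
    """Only the first tuple of each run of equal `over` values."""
    out = []
    for tup in sorted_tuples:
        if not out or out[-1][over] != tup[over]:
            out.append(tup)
    return out


def top(tuples, n, field):
    """The n tuples with the largest `field` value, largest first."""
    return heapq.nlargest(n, tuples, key=itemgetter(field))


a = [{"id": 1, "price": 5.0}, {"id": 2, "price": 9.0}, {"id": 2, "price": 7.0},
     {"id": 3, "price": 1.0}]
b = [{"id": 2}, {"id": 4}]
```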

Interesting streams
• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
• checkpoints are persisted in a Solr collection
• resubmit to get new stuff
• combine with daemon stream to automatically get continuous updates over time
• further combine with update stream to push all matches to another collection
topic(checkpointCollection, dataCollection, id="topicA",
  q="solr rocks", checkpointEvery="1000")
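The "resubmit to get new stuff" behaviour comes down to remembering how far a subscriber has read. A toy model is sketched below, assuming each document carries a monotonically increasing `version` number (a hypothetical field) and keeping the checkpoint in memory rather than in a Solr collection.

```python
class TopicSketch:
    """Toy publish/subscribe reader that remembers the last version it delivered."""

    def __init__(self):
        self.checkpoint = 0  # highest version already handed to the subscriber

    def poll(self, matching_docs):
        """Return matching documents newer than the checkpoint, oldest first."""
        fresh = sorted(
            (d for d in matching_docs if d["version"] > self.checkpoint),
            key=lambda d: d["version"],
        )
        if fresh:
            self.checkpoint = fresh[-1]["version"]
        return fresh


index = [{"id": "a", "version": 1}, {"id": "b", "version": 2}]
topic = TopicSketch()
first = topic.poll(index)    # everything is new on the first call
second = topic.poll(index)   # nothing new yet
index.append({"id": "c", "version": 3})
third = topic.poll(index)    # only the document added since the last poll
```

Wrapping such a reader in a loop is the role the daemon stream plays, and forwarding each batch to another collection is what the update stream adds.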

jdbc() expression stream
join with other data sources!
innerJoin( // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
    personId_i as personId, rating_f as rating ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
      sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
      sort="ID asc", get_column_name=true),
    ID as personId, NAME as personName, COUNTRY_NAME as country ),
  on="personId"
)

Parallel SQL

/sql Handler
• /sql handler is there by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
• May switch to Apache Calcite?

Simplest SQL Example
$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"
{"id":"EN7800GTX/2DHTV/256M"},
{"id":"100-435805"},
{"id":"UTF8TEST"},
{"id":"SOLR1000"},
{"id":"9885A004"},
[...]
tables map to collections

SQL handler HTTP parameters
curl http://localhost:8983/solr/techproducts/sql -d '
stmt=<sql_statement>
&numWorkers=4 // currently used by GROUP BY and DISTINCT (via parallel stream)
&workerCollection=collection1 // where to create intermediate workers
&workerZkhost=localhost:9983 // cluster (zookeeper ensemble) address
&aggregationMode=map_reduce | facet'
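In a real request the annotated parameters above are sent as ordinary form fields. A minimal sketch of assembling them follows; `build_sql_request` is a hypothetical helper and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode


def build_sql_request(base_url, collection, stmt, **options):
    """Return the URL and form-encoded body for a /sql request."""
    params = {"stmt": stmt}
    params.update(options)  # e.g. numWorkers, workerCollection, aggregationMode
    return f"{base_url}/{collection}/sql", urlencode(params).encode("utf-8")


url, body = build_sql_request(
    "http://localhost:8983/solr",
    "techproducts",
    "select manu, count(*) from techproducts group by manu",
    numWorkers=4,
    workerCollection="collection1",
    workerZkhost="localhost:9983",
    aggregationMode="map_reduce",
)
decoded = parse_qs(body.decode("utf-8"))
```

Encoding the statement matters because SQL text routinely contains spaces, commas, and quotes that would otherwise be misread as parameter separators.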

The WHERE clause
• WHERE clauses are all pushed down to the search layer
select id
where popularity=10 // simple match on numeric field "popularity"
where popularity='[5 TO 10]' // solr range query (note the quotes)
where name='hard drive' // phrase query on the "name" field
where name='((memory retail) AND popularity:[5 TO 10])' // arbitrary solr query
where name='(memory retail)' AND popularity='[5 TO 10]' // boolean logic

Ordering and Limiting
select id,score from techproducts
where text='(memory hard drive)'
ORDER BY popularity desc // default order is score desc for limited queries
LIMIT 100
• Limited queries use /select handler
• Unlimited queries use /export handler
• fields selected need to be docValues
• fields in "order by" need to be docValues
• no "score" field allowed
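The handler rules in this list amount to a small decision function. The sketch below encodes just those rules; `choose_handler` is a hypothetical name, and the docValues requirement is noted but not checked because it depends on the schema.

```python
def choose_handler(fields, limit=None):
    """Pick the request handler a SQL query would be routed to.

    Limited queries go to /select; unlimited ones go to /export, where
    "score" is unavailable (and selected fields must be docValues,
    which this sketch does not verify).
    """
    if limit is not None:
        return "/select"
    if "score" in fields:
        raise ValueError('"score" cannot be selected without a LIMIT')
    return "/export"


limited = choose_handler(["id", "score"], limit=100)
unlimited = choose_handler(["id"])
```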

More SQL examples
select distinct fieldA as fa, fieldB as fb from tableA order by fa desc, fb desc
// simple stats
select count(fieldA) as count, sum(fieldB) as sum from tableA where fieldC = 'Hello'
select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY) from tableA
where fieldC = 'term1 term2'
group by fieldA, fieldB
having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
order by sum(fieldC) asc

Solr JDBC Driver

Solr JDBC driver works with Zeppelin

More Solr6 Features

Graph Query
• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth
• This example query matches “Philip J. Fry” and all of his ancestors:
fq={!graph from=parent_id to=id}id:"Philip J. Fry"

Scoring changes
• docCount (used for idf) in scoring is now the number of documents that have the field, rather than the number of documents in the whole index (maxDoc)
• can add documents of a different type and not disturb/skew scoring
• BM25 scoring by default
• tweakable on a per-fieldType basis ("k1" and "b" factors)
• classic tf-idf still available
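The role of "k1", "b", and docCount is easiest to see in the per-term formula. Below is a sketch of the common BM25 form with the usual defaults k1=1.2 and b=0.75; it ignores implementation details such as how document lengths are stored, and the function names are hypothetical.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """idf based on the number of documents that have the field (docCount)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    # b controls how strongly long documents are penalised;
    # k1 controls how quickly repeated terms stop adding to the score.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


# Same term statistics, but idf computed against 10 documents that have
# the field versus a 1000-document index of mixed document types.
per_field = bm25_idf(doc_count=10, doc_freq=1)
whole_index = bm25_idf(doc_count=1000, doc_freq=1)
score = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_count=10, doc_freq=1)
```

Counting only documents that have the field keeps idf stable when unrelated document types are added to the same index, which is the point of the docCount change.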

Cross DC Replication

Thank you
yonik@cloudera.com
