Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick Erickson

OCTOBER 13-16, 2016 • AUSTIN, TX

Streaming Aggregation, New Horizons for Search
Erick Erickson
Workplace Partners, LLC.

Who am I?
• Erick Erickson
• Lucene/Solr committer
• PMC member
• Independent Consultant (Workplace
Partners, LLC)
• Not the Red State Guy
• XKCD fan

My favorite XKCD cartoon
http://xkcd.com/722/

Agenda
• High-level introduction to why you should care
about Streaming Aggregation (SA hereafter)
• High-level view of Parallel SQL processing built
on SA
• High-level view of Streaming Expressions
• Samples from a mortgage database
• Joel Bernstein will do a deep-dive right after this
presentation
• Assuming you are familiar with Solr concepts

Why SA?
• Solr has always had “issues” when
dealing with very large result sets
• Data returned had to be read from disk
and decompressed
• “Deep paging” paid this price too
• Entire result set returned at once == lots
of memory

Quick Overview of SA
• Built on the “export” capabilities introduced in
Solr 4.10
• Exports “tuples” which must be populated from
docValues fields
• Only exports primitive types, e.g. numeric,
string etc.
• Work can be distributed in parallel to worker
nodes
• Can scale to limits of hardware, 10s of millions of
rows a second with ParallelStreams (we think)

DocValues
• DocValues are basic to SA: they are the only fields
that can be specified in the “fl” list of a
Streaming Aggregation query
• Only Solr “primitive” types (int/tint, long/tlong,
string) are allowed in DocValues fields
• Defined per-field in schema.xml
• Specifically, they cannot be solr.TextField-derived
• The Solr doc may contain any field types at all, the
DocValues restriction is only on the fields that
may be exported in “tuples” for SA

We can do SQL in Solr!
select
  agency_code, count(*), sum(loan_amount),
  avg(loan_amount), min(loan_amount),
  max(loan_amount), avg(applicant_income)
from hmda
where phonetic_name='(eric)'
group by agency_code
having (avg(applicant_income) > 50)
order by agency_code asc
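
To make the group by / having semantics of the query above concrete, here is a rough in-memory sketch in plain Java. It is not how Solr executes the SQL (Solr streams sorted tuples rather than holding groups in a map), the class and method names are made up for illustration, and for brevity the "having" filter is applied to avg(loan_amount) rather than avg(applicant_income).

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "group by agency_code ... having avg(...) > N".
class GroupByHavingSketch {

    // Each row is {agency_code (String), loan_amount (Long)}.
    static Map<String, LongSummaryStatistics> aggregate(List<Object[]> rows, double minAvg) {
        // TreeMap keeps keys sorted, which mirrors "order by agency_code asc".
        Map<String, LongSummaryStatistics> groups = new TreeMap<>();
        for (Object[] row : rows) {
            String agency = (String) row[0];
            long amount = (Long) row[1];
            // One statistics object per group tracks count, sum, min, max and average.
            groups.computeIfAbsent(agency, k -> new LongSummaryStatistics()).accept(amount);
        }
        // "having": drop every group whose average is not above the threshold.
        groups.values().removeIf(stats -> stats.getAverage() <= minAvg);
        return groups;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[] {"FDIC", 5L},
                new Object[] {"FDIC", 972L},
                new Object[] {"FRS", 1L});
        aggregate(rows, 50.0).forEach((agency, stats) ->
                System.out.println(agency + " count=" + stats.getCount()
                        + " sum=" + stats.getSum() + " avg=" + stats.getAverage()));
    }
}
```

With the three sample rows, the FRS group (average 1) is filtered out and only FDIC remains.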

And that’s not all!
• We can program arbitrary operations on complete
result sets
• We can parallelize processing across Solr nodes
• We can process very large result sets in limited
memory
• Design processing rate is 400K rows/node/second

Streaming Aggregation == glue
• Solr is built for returning the top N documents
•  Top N is usually small, e.g. 20 docs
•  Decompress to return fields (fl list)
•  Solr commonly deals with billions of documents
• Analytics:
•  Often memory intensive, especially in distributed
mode, if they can be done at all
•  Are becoming more important to this thing we call
“search”
•  Increasingly important in the era of “big data”

Use the Right Tool
• Three “modes”
• Streaming Aggregation to do arbitrary
operations on large result sets – SolrJ
• Streaming Expressions for a non-Java way to
access Streaming Aggregations – HTTP and SolrJ
• Parallel SQL to do selected SQL operations on
large result sets - SolrJ
• SA’s sweet spot: batch operations
• Complements Solr’s capabilities, applies to
different problems

Why not use an RDBMS?
•  Well, if it’s the best tool, you should
•  RDBMSs are not good search engines though
•  Find the average mortgage value for all
users with a name that sounds like “erick”
•  erik, erich, eric, aerick, erick, arik
•  Critical point: The “tuples” processed can be
those that satisfy any arbitrary Solr query

Why not use Spark?
•  Well, if it’s the best tool, you should
•  I’m still trying to understand when one is
preferable to the other
•  SA only needs Solr, no other infrastructure

Why not just use Solr?
• Well, if it’s the best tool, you should
• What I’d do: exhaust Solr’s capabilities then apply
SA to those kinds of problems that OOB Solr isn’t
satisfactory for, especially those that require
processing very large result sets

How does SA work?
• Simple example of how to get a bunch of rows
back and “do something” with them from a Solr
collection
• You can process multiple streams from entirely
different collections if you choose!
• It’s usually a good idea to sort return sets
• Process all of one kind of thing then move on
• Could write the results to file, connector, etc.

Sample Data
• Data set of approx 200M mortgages. Selected
fields:
• Year
• Loan amount (thousands)
• Agency (FDIC, FRS, HUD)
• Reason for loan
• Reason for denial
• No personal data, I added randomly generated
names to illustrate search

Use SA through SolrJ
•  The basic pattern is:
•  Create a Solr query
•  Feed it to the appropriate stream
•  Process the “tuples”
•  Right, what’s a “tuple”? A wrapper for a map:
•  keys are the Solr field names
•  values the contents of those fields: must be docValues
•  Why this restriction? Because getting stored fields is
expensive

Code example
• Here’s a bit of code that
• Accesses a 2-shard SolrCloud collection
• Computes the average mortgage by “agency”,
e.g. HUD, OTS, OCC, OFS, FDIC, NCUA
• For a 217M dataset, 335K results (untuned) took
2.1 seconds

Code example
String zkHost = "169.254.80.84:2181";
Map<String, String> params = new HashMap<>();
params.put("q", "phonetic_name:eric");
params.put("fl", "loan_amount,agency_code"); // fields to export; must be docValues
params.put("sort", "agency_code asc");       // sorted so each agency arrives as one run
params.put("qt", "/export");
….
CloudSolrStream stream = new CloudSolrStream(zkHost, "hmda", params);
stream.open();

More code
while (true) {
    Tuple tuple = stream.read();
    if (tuple.EOF) {
        break;
    }
    // next slide in here
}

Last Code
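// currentAgency holds the agency of the previous tuple (declared outside the loop)
String thisAgency = tuple.getString("agency_code");
long loanAmount = tuple.getLong("loan_amount");
if (thisAgency.equals(currentAgency)) {
    // same agency as before: add_to_current_counters
} else {
    // agency changed: log(average for the previous agency),
    // then reset_for_next_agency
}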
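
Putting the three code slides together, here is a minimal, self-contained sketch of the "process all of one kind of thing, then move on" loop. It does not use SolrJ: a plain Iterator of maps stands in for the tuple stream (an assumption for illustration), and the class and method names are hypothetical. It relies on the input being sorted by agency_code, which is what the sort parameter in the query asks for.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the SolrJ loop: averages loan_amount per agency_code
// over tuples that arrive already sorted by agency_code.
class SortedRollupSketch {

    // Builds one fake "tuple" (a map keyed by field name), for illustration only.
    static Map<String, Object> tuple(String agency, long loanAmount) {
        Map<String, Object> t = new HashMap<>();
        t.put("agency_code", agency);
        t.put("loan_amount", loanAmount);
        return t;
    }

    // Only the counters for the current agency are kept while scanning.
    static Map<String, Double> averageByAgency(Iterator<Map<String, Object>> sortedTuples) {
        Map<String, Double> averages = new LinkedHashMap<>();
        String currentAgency = null;
        long sum = 0;
        long count = 0;
        while (sortedTuples.hasNext()) {
            Map<String, Object> t = sortedTuples.next();
            String thisAgency = (String) t.get("agency_code");
            long loanAmount = (Long) t.get("loan_amount");
            if (!thisAgency.equals(currentAgency)) {
                if (currentAgency != null) {
                    averages.put(currentAgency, (double) sum / count); // close previous agency
                }
                currentAgency = thisAgency; // reset for the next agency
                sum = 0;
                count = 0;
            }
            sum += loanAmount;
            count++;
        }
        if (currentAgency != null) {
            averages.put(currentAgency, (double) sum / count); // close the last agency
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> tuples = Arrays.asList(
                tuple("FDIC", 100), tuple("FDIC", 200), tuple("HUD", 50));
        System.out.println(averageByAgency(tuples.iterator())); // prints {FDIC=150.0, HUD=50.0}
    }
}
```

The sketch collects the averages in a map only so they can be printed; with very many agencies you would log or write each average as its run ends, as the slide does.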

More interestingly
•  Using SA, you can:
•  Join across completely different collections
•  Manipulate data in arbitrary ways to suit your use-case
•  Distribute this load across the solr nodes in a
collection
•  Unlike standard search, SA can use cycles on all the
replicas of a shard
•  Process zillions of buckets without blowing up
memory
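
Why does sorting make joins across collections cheap? When both streams are sorted on the join key, a merge join only ever looks at the current row of each side. The sketch below shows that idea in plain Java under simplifying assumptions: two iterators stand in for the two collection streams, rows are {key, value} string pairs, and keys are unique on each side. All names and sample values are hypothetical, and this is not SolrJ code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: inner join of two streams that are both sorted ascending
// by a unique String key. Each row is {key, value}.
class MergeJoinSketch {

    static List<String> join(Iterator<String[]> left, Iterator<String[]> right) {
        List<String> joined = new ArrayList<>();
        String[] l = left.hasNext() ? left.next() : null;
        String[] r = right.hasNext() ? right.next() : null;
        while (l != null && r != null) {
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {
                // Keys match: emit the joined row and advance both sides.
                joined.add(l[0] + ":" + l[1] + "," + r[1]);
                l = left.hasNext() ? left.next() : null;
                r = right.hasNext() ? right.next() : null;
            } else if (cmp < 0) {
                l = left.hasNext() ? left.next() : null;   // left key has no match
            } else {
                r = right.hasNext() ? right.next() : null; // right key has no match
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> loans = Arrays.asList(
                new String[] {"a1", "250"}, new String[] {"a2", "90"}, new String[] {"a4", "410"});
        List<String[]> applicants = Arrays.asList(
                new String[] {"a2", "erik"}, new String[] {"a3", "arik"}, new String[] {"a4", "erich"});
        System.out.println(join(loans.iterator(), applicants.iterator()));
        // prints [a2:90,erik, a4:410,erich]
    }
}
```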

Parallel SQL
• Use from SolrJ
• The work can be distributed across multiple
“worker” nodes
• Operations can be combined into complex
statements
• Let’s do our previous example with ParallelSQL
• Currently trunk/6.0 only due to the Java 8
requirement for the SQL parser. No plan to put it in 5.x

Parallel SQL
•  SQL “select” is mapped to Solr Search
•  Order by, Group by and Having are all supported
•  Certain aggregations are supported
•  count, sum, avg, min, max
•  You can get crazy here:
•  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
•  Following query with numWorkers=2, 612K rows
•  383ms

Sample SQL
select
max(loan_amount)
from hmda
where phonetic_name='(erich)'

Sample SQL
select
max(loan_amount)
from hmda <- collection name

Sample SQL
select
max(loan_amount)
from hmda
where phonetic_name='(eric)' <- Solr search

Sample SQL
select
max(loan_amount)
from hmda
group by agency_code <- Solr field
order by agency_code asc <- Solr field

Parallel Sql in SolrJ
params.put(CommonParams.QT, "/sql");
params.put("numWorkers", "2");
params.put("sql", "select agency_code, count(*), sum(loan_amount), avg(loan_amount), " +
    "min(loan_amount), max(loan_amount), avg(applicant_income) " +
    "from hmda where phonetic_name='eric' " +
    "group by agency_code " +
    "having (avg(applicant_income) > 50) " +
    "order by agency_code asc");
SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);

SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
try {
    stream.open();
    while (true) {
        Tuple tuple = stream.read(); // the final tuple is the EOF marker
        if (tuple.EOF) {
            break;
        }
        dumpTuple(tuple);
    }
} finally {
    if (stream != null) stream.close();
}

Sample tuples returned
agency_code=FDIC
max(loan_amount)=972.0
sum(loan_amount)=53307.0
count(*)=224.0
avg(loan_amount)=237.97767857142858
min(loan_amount)=5.0

Sample tuples returned
agency_code=FRS
max(loan_amount)=3000.0
sum(loan_amount)=179702.0
count(*)=834.0
avg(loan_amount)=215.47002398081534
min(loan_amount)=1.0

Current Gotchas
• All field names must be lower case (possibly with
underscores)
• Trunk (6.0) only. A 5.x release (5.4?) was considered
but is not planned. (Calcite)
• Requires solrconfig entries
• Only nodes hosting collections can act as worker
nodes (But not necessarily the queried collection)
• Be prepared to dig, documentation is also
evolving

Streaming expressions
• Provide a simple query language for SolrCloud
that merges search with parallel computing
without Java programming
• Operations can be nested

Streaming Expressions
• Can access at least two ways:
• HTTP
• SolrJ

Streaming Expressions
• Operations:
• search
• merge – can be used with separate collections
• group
• unique
• top
• parallel

Example Code
curl --data-urlencode 'stream=group(
    search(hmda,q="*:*",
        fl="id,agency_code",
        sort="agency_code asc"),
    by="agency_code asc")' \
  http://169.254.80.84:8981/solr/hmda/stream
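
The same request body can be built from Java without SolrJ. This is a minimal sketch that assumes only what the curl example shows, namely that the handler accepts a form-encoded "stream" parameter. The class and method names are made up for illustration, and actually sending the body (for example with HttpURLConnection) is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the form body that curl --data-urlencode would send.
class StreamRequestSketch {

    static String formBody(String expression) {
        try {
            // URL-encode the expression so quotes, commas and parentheses survive the POST.
            return "stream=" + URLEncoder.encode(expression, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expression = "group("
                + "search(hmda,q=\"*:*\",fl=\"id,agency_code\",sort=\"agency_code asc\"),"
                + "by=\"agency_code asc\")";
        // POST this body with Content-Type: application/x-www-form-urlencoded
        System.out.println(formBody(expression));
    }
}
```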

Response
{"result-set":{"docs":[
{"agency_code":"FDIC","_MAPS_":[
{"agency_code":"FDIC","id":"2004_CD1.CSV_3955"}
…]
{"agency_code":"NCUA","_MAPS_":[
{"agency_code":"NCUA","id":"2004_CD1.CSV_2816"}
…]
{"EOF":true,"RESPONSE_TIME":4}]}}

Future Enhancements
• This capability is quite new, Solr 5.2 with
significant enhancements every release
• Some is still “baking” in trunk/6.0
• A JDBC Driver so any Java application can treat
Solr like a SQL database, e.g. for visualization
• More user-friendly interfaces (widgets?)
• More docs, how to’s, etc.
• “Select Into”

No time for (some)
•  Oh My. Subclasses of TupleStream:
•  MetricStream
•  RollupStream (for high cardinality faceting)
•  UniqueStream
•  FilterStream (Set operations)
•  MergeStream
•  ReducerStream
•  SolrStream for non-SolrCloud

No time for (cont)
• Parallel execution details
• Distributing SA across “Worker nodes”
• All of the Parallel SQL composition
possibilities
• All of the Streaming Expression
operations

Resources
• Ref guide for streaming expressions: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• Solr user’s list: http://lucene.apache.org/solr/resources.html
• Joel Bernstein’s blogs: http://joelsolr.blogspot.com/2015/04/in-line-streaming-aggregation.html
• Parallel SQL Solr JIRA: https://issues.apache.org/jira/browse/SOLR-7560

Resources (cont)
• Streaming expressions JIRA: https://issues.apache.org/jira/browse/SOLR-7377
• Background for SA. http://heliosearch.org/streaming-aggregation-for-solrcloud/
• Background for Parallel SQL. http://heliosearch.org/heliosearch-sql-sub-project/
• Getting the code, compiling, etc. https://wiki.apache.org/solr/HowToContribute
