Parallel SQL for SolrCloud

October 13-16, 2016 • Austin, TX

Parallel SQL
Joel Bernstein
Search Engineer, Alfresco
jbernste@apache.org

Introduction
• Joel Bernstein
• Lucene/Solr Committer
• Search Engineer at Alfresco
• Live and work in NYC

Alfresco
• Open source ECM (Enterprise Content Management)
• Alfresco is a system of record for documents
• Uses Solr for search
• 1800+ customers
• 11 million active user accounts
• Alfresco Solr: Document level access control,
eventually consistent, transactional,
multi-master, distributed search and faceting
(coming in Alfresco 5.1)

Agenda
1. SQL Unleashed (What can it do?)
2. SQL Under the Hood (How does it work?)

SQL Unleashed
(In Solr 6.0)

Why SQL?
• Solr has many awesome features.
• But all of these features create complexity.
• Which faceting API to use? When to
stream? Which parameters to use for
optimal performance?
• The complexity level increases
dramatically when distributed joins come
into play
• With SQL we can provide an optimizer to
choose the best query plan.

The SQL Interface at a Glance
• SQL over Map/Reduce: supports high
cardinality aggregations and
distributed joins.
• SQL over Facets: high performance on
moderate cardinality aggregations.
• SQL with Solr Search Predicates
• SQL is fully integrated with SolrCloud

SQL Syntax: Limited and Unlimited SELECT
• select colA, colB from tableB
• select colA, colB from tableB limit 100
• Unlimited selects return the entire result
set. Return fields must be DocValues.
• Limited selects can sort by score and
retrieve any stored field.
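
As a rough illustration of how a client might submit the two kinds of SELECT over HTTP, the sketch below builds request URLs for a collection's /sql handler. The `stmt` and `aggregationMode` parameter names are assumed to match the Solr 6 /sql handler; the host, port, and the helper name `build_sql_url` are placeholders invented for this example.

```python
from urllib.parse import urlencode

def build_sql_url(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Hypothetical helper: build a GET URL for a collection's /sql handler."""
    # urlencode escapes spaces, commas and parentheses in the SQL statement.
    params = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return f"{base_url}/{collection}/sql?{params}"

# Placeholder Solr address; adjust to the actual cluster.
SOLR = "http://localhost:8983/solr"

# Unlimited select: the entire result set is returned.
unlimited = build_sql_url(SOLR, "tableB", "select colA, colB from tableB")

# Limited select: only the first 100 rows are returned.
limited = build_sql_url(SOLR, "tableB", "select colA, colB from tableB limit 100")

print(unlimited)
print(limited)
```

The resulting URL can be fetched with any HTTP client; the response format is not shown here.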

SQL Syntax: ORDER BY
• select a, b from tableB order by a desc, b
desc
• Unlimited selects sort the entire result
set

The Predicate: Phrase Searching
• select a, b from tableB where
c = 'hello world'
• Searches for the phrase ‘hello world’ in
field c.

The Predicate: Boolean searching
• select a, b from tableB where c = '(hello world)'
• Adding parens searches for (hello OR world).
• Supports Solr query syntax inside the parens.

The Predicate: Range query
• select a, b from tableB where c = '[0 TO 100]'

The Predicate: Arbitrary Boolean clauses
• select a, b from tableB where (c = 'hello
world' AND d = '[0 TO 100]')
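
The predicate rules on the preceding slides (a quoted value is a phrase, parentheses pass Solr query syntax through, brackets denote a range) can be sketched as a small translation function. This is only an illustration of those rules, not the translator Solr actually uses; the helper names `to_solr_clause` and `and_clauses` are hypothetical.

```python
def to_solr_clause(field, value):
    """Hypothetical helper: map one SQL equality predicate to a Solr query clause."""
    if value.startswith("(") and value.endswith(")"):
        # Parenthesized value: Solr query syntax is passed through as-is.
        return f"{field}:{value}"
    if value.startswith("[") and value.endswith("]"):
        # Bracketed value: treated as a range query.
        return f"{field}:{value}"
    if " " in value:
        # Multi-word value: searched as a phrase.
        return f'{field}:"{value}"'
    # Single term.
    return f"{field}:{value}"

def and_clauses(*clauses):
    """Combine clauses the way a SQL AND would."""
    return "(" + " AND ".join(clauses) + ")"

print(to_solr_clause("c", "hello world"))
print(to_solr_clause("c", "(hello world)"))
print(to_solr_clause("c", "[0 TO 100]"))
print(and_clauses(to_solr_clause("c", "hello world"),
                  to_solr_clause("d", "[0 TO 100]")))
```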

SQL Syntax: Select Distinct
• select distinct a, b from tableB
• Map/Reduce Implementation: Tuples
are shuffled to worker nodes where
the distinct operation is performed.
• JSON Facet Implementation: distinct
operation is pushed down into the
search engine
• Map/Reduce for high cardinality
• Facet for high QPS

Shuffle vs Push Down
• Shuffling: high cardinality and
parallel relational algebra
(Distributed Joins)
• Pushdown (Facet): blazing fast, high
QPS, moderate cardinality
• The aggregationMode flag is available with
the JDBC driver and the HTTP interface
[map_reduce or facet]

Aggregations: Stats
• select count(*), sum(a) from tableA
• Uses the StatsComponent under the covers
• Initial release supports count, sum, avg, min,
max
• Aggregation logic is always
pushed down into the search engine.

Aggregations: GROUP BY
• select a, b, count(*), sum(c) from tableB group by a,
b having count(*) > 50 order by sum(c) desc
• Supports complex having clause: having (count(*)
> 50 AND sum(b) < 1000)
• Has Map/Reduce implementation (shuffle)
• And JSON Facet implementation (push down)
• Map/Reduce can handle high cardinality multi-
dimension aggregations.
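
To make the HAVING and ORDER BY steps concrete, the sketch below applies them to a handful of already-aggregated buckets. The bucket values are made up for illustration, and the code only mirrors the semantics of the example query; it says nothing about how either implementation computes the buckets.

```python
# Made-up aggregation output for: group by a, b with count(*) and sum(c).
buckets = [
    {"a": "x", "b": "p", "count(*)": 120, "sum(c)": 900.0},
    {"a": "x", "b": "q", "count(*)": 40, "sum(c)": 5000.0},
    {"a": "y", "b": "p", "count(*)": 75, "sum(c)": 1500.0},
]

def having(bucket):
    """having count(*) > 50"""
    return bucket["count(*)"] > 50

# order by sum(c) desc, applied after the having filter.
result = sorted(filter(having, buckets), key=lambda b: b["sum(c)"], reverse=True)

for bucket in result:
    print(bucket)
```

The bucket with the largest sum(c) is dropped because its count does not pass the HAVING clause, which is why the filter has to run on aggregated values rather than on raw rows.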

JDBC Driver
• Ships with SolrJ
• Poolable Connection and Statement
• SolrCloud Aware Load Balancing
• Connection has aggregationMode
switch [map_reduce or facet]

SQL Parsing
• Presto SQL Parser handles the parsing
• SQL Statements are compiled to
TupleStream objects
• The TupleStream is the base interface of the
Streaming API
• The Streaming API is a general purpose
parallel computing API for SolrCloud

Parallel Computing Framework
• Shuffling
• Worker Collections
• Streaming API
• Streaming Expressions
• Parallel SQL

Shuffling (sorting & partitioning)
• Shuffling is pushed down into the search engine
• Sorting: /export handler “stream sorts”
entire result sets.
• Partitioning: HashQParserPlugin, hash
partitioning filter. Partitions results on
arbitrary fields.
• Tuples (search results) begin streaming
instantly to worker nodes. Shuffling
never requires a spill to disk.
• All replicas shuffle in parallel for the same
query. Allows for massive throughput.
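
The partitioning idea can be sketched in a few lines: hash the partition key of each tuple and use the hash to pick a worker, so that all tuples with the same key reach the same worker while each worker's stream stays sorted. In Solr this filtering happens inside the search engine; the in-memory simulation below, the choice of crc32 as the hash, and the function names are assumptions made for illustration.

```python
import zlib

def worker_for(tup, partition_keys, num_workers):
    """Pick a worker by hashing the partition key values (crc32 is an arbitrary choice)."""
    key = "|".join(str(tup[k]) for k in partition_keys)
    return zlib.crc32(key.encode("utf-8")) % num_workers

def shuffle(sorted_tuples, partition_keys, num_workers):
    """Split an already sorted stream into one sorted partition per worker."""
    partitions = [[] for _ in range(num_workers)]
    for tup in sorted_tuples:
        partitions[worker_for(tup, partition_keys, num_workers)].append(tup)
    return partitions

# Simulated search results, sorted on the partition key as the /export handler would.
tuples = sorted(
    ({"str_s": s, "field_i": i} for i, s in enumerate("badcabdc")),
    key=lambda t: t["str_s"],
)

partitions = shuffle(tuples, ["str_s"], num_workers=2)
for worker_id, part in enumerate(partitions):
    print(worker_id, [t["str_s"] for t in part])
```

Because the input is sorted and each key maps to exactly one worker, a worker can aggregate or join its partition in a single pass without seeing the other partitions.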

Shuffling (sorting & partitioning)
[Diagram: a Client, Worker 1 and Worker 2, and two shards (Shard 1 and Shard 2) with two replicas each (Replica 1 and Replica 2).]
• Tuples are sorted and partitioned on keys
• Each worker is shuffled ½ the result set

Worker Collections
• Are Generic SolrCloud Collections
• Can hold data, or just perform work
• Search results are shuffled to the
workers
• Configured with the /stream handler

Streaming API
• Java Programming API for the parallel
computing framework
• Real-time Map/Reduce and Parallel
Relational Algebra
• Abstracts search results as Streams of
tuples (TupleStream)
• Streams are transformed in parallel by
pluggable Decorator streams.
• Parallel transformations include: group by,
rollup, union, intersect, complement and join

Streaming Expressions
• Contributed by Dennis Gove (Bloomberg)
• String Query Language and Serialization
format for the Streaming API
• Streaming Expressions compile to
TupleStreams
• TupleStreams serialize to
Streaming Expressions

Parallel SQL
• Compiles SQL to a TupleStream
• The TupleStream is serialized to a
Streaming Expression and sent to
worker nodes.
• Worker nodes translate the Streaming
Expression back into TupleStream
• Worker nodes open() and read() the
TupleStream in parallel. Tuples are
returned from each worker

From SQL to Streaming Expression
SQL:
select str_s, count(*), sum(field_i), min(field_i), max(field_i),
  avg(field_i) from collection1 where text='XXXX' group by str_s
Streaming Expression:
rollup(
  search(collection1,
    q="(text:XXXX)",
    qt="/export",
    fl="str_s, field_i",
    partitionKeys=str_s,
    sort="str_s asc",
    zkHost="127.0.0.1:64149/solr"),
  over=str_s,
  count(*),
  sum(field_i),
  min(field_i),
  max(field_i),
  avg(field_i))
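
What a rollup over a sorted stream computes can be sketched as follows. Because the search sorts on str_s, all tuples of one group arrive consecutively, so each group can be aggregated and emitted before the next begins. This is a simplified in-memory illustration with made-up tuples, not the Solr implementation; the function name and its signature are assumptions.

```python
from itertools import groupby

def rollup(sorted_tuples, over, field):
    """Aggregate consecutive tuples sharing the same `over` value."""
    for key, group in groupby(sorted_tuples, key=lambda t: t[over]):
        # Only one group's values are held in memory at a time.
        values = [t[field] for t in group]
        yield {
            over: key,
            "count(*)": len(values),
            f"sum({field})": sum(values),
            f"min({field})": min(values),
            f"max({field})": max(values),
            f"avg({field})": sum(values) / len(values),
        }

# Made-up tuples, already sorted by str_s as the search's sort parameter requires.
tuples = [
    {"str_s": "a", "field_i": 1},
    {"str_s": "a", "field_i": 3},
    {"str_s": "b", "field_i": 10},
]

rows = list(rollup(tuples, over="str_s", field="field_i"))
for row in rows:
    print(row)
```

With partitionKeys=str_s, each worker would run the same rollup over its own partition, and no group is ever split across workers.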

Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)
[Diagram: a Client, the /sql handler, Workers 1 through 5, and Shards 1 through 5 with three replicas each (Replica 1, Replica 2, Replica 3).]

Jira Tickets
• SOLR-7560: Parallel SQL Support
• SOLR-7377: Solr Streaming Expressions
• SOLR-7082: Streaming Aggregation for SolrCloud
• SOLR-7441: Improve overall robustness of the
Streaming stack: Streaming API,
Streaming Expressions, Parallel SQL

Getting Involved
• SQL is in Trunk
• Releasing with Solr 6
• Streaming API and Streaming Expressions
are located in the SolrJ libraries
(solrj.io)
• Patches welcome
• Testers and feedback needed
