OCTOBER 13-16, 2016 • AUSTIN, TX
Parallel SQL
Joel Bernstein
Search Engineer, Alfresco
jbernste@apache.org
Introduction
• Joel Bernstein
• Lucene/Solr Committer
• Search Engineer at Alfresco
• Live and work in NYC
Alfresco
• Open source ECM (Enterprise Content Management)
• Alfresco is a system of record for documents
• Uses Solr for search
• 1800+ customers
• 11 million active user accounts
• Alfresco Solr: Document level access control,
eventually consistent, transactional,
multi-master, distributed search and faceting
(coming in Alfresco 5.1)
Agenda
1. SQL Unleashed (What can it do?)
2. SQL Under the Hood (How does it work?)
SQL Unleashed
(In Solr 6.0)
Why SQL?
• Solr has many awesome features.
• But all of these features create complexity.
• Which faceting API to use? When to
Stream? Which parameters to use for
optimal performance?
• The complexity level increases
dramatically when distributed joins come
into play
• With SQL we can provide an optimizer to
choose the best query plan.
The SQL Interface at a Glance
• SQL over Map/Reduce: supports high
cardinality aggregations and
distributed joins.
• SQL over Facets: high performance on
moderate cardinality aggregations.
• SQL with Solr Search Predicates
• SQL is fully integrated with SolrCloud
SQL Syntax: Limited and Unlimited SELECT
• select colA, colB from tableB
• select colA, colB from tableB limit 100
• Unlimited selects return the entire result
set. Return fields must be DocValues.
• Limited selects can sort by score and
retrieve any stored field.
SQL Syntax: ORDER BY
• select a, b from tableB order by a desc, b desc
• Unlimited selects sort the entire result
set
The Predicate: Phrase Searching
• select a, b from tableB where c = 'hello world'
• Searches for the phrase ‘hello world’ in
field c.
The Predicate: Boolean searching
• select a, b from tableB where c = '(hello world)'
• Adding parens searches for (hello OR world).
• Supports Solr query syntax inside the parens.
The Predicate: Range query
• select a, b from tableB where c = '[0 TO 100]'
The Predicate: Arbitrary Boolean clauses
• select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
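The predicate rules on these slides (plain value = phrase, parentheses = Solr query syntax, brackets = range) can be summarized in a small sketch. This is a hypothetical illustration only: `to_solr_clause` and `and_clauses` are invented names, and the query string Solr actually generates may differ.

```python
def to_solr_clause(field: str, value: str) -> str:
    """Map a SQL equality predicate onto a Solr-style query clause.

    Hypothetical helper that mirrors the rules described on the slides;
    it is not the translation code Solr runs.
    """
    if value.startswith("(") or value.startswith("["):
        # Parenthesized or bracketed values pass through as Solr syntax:
        # (hello world) is a boolean query, [0 TO 100] is a range query.
        return f"{field}:{value}"
    # Anything else is treated as a phrase search.
    return f'{field}:"{value}"'


def and_clauses(*clauses: str) -> str:
    """Combine clauses the way a SQL `(x AND y)` predicate would."""
    return "(" + " AND ".join(clauses) + ")"


phrase = to_solr_clause("c", "hello world")
boolean = to_solr_clause("c", "(hello world)")
ranged = to_solr_clause("d", "[0 TO 100]")
combined = and_clauses(phrase, ranged)
```

The design point is that the SQL string literal is the carrier for Solr syntax, so one `=` operator covers phrase, boolean, and range searches.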
SQL Syntax: Select Distinct
• select distinct a, b from tableB
• Map/Reduce Implementation: Tuples
are shuffled to worker nodes where
the distinct operation is performed.
• JSON Facet Implementation: distinct
operation is pushed down into the
search engine
• Map/Reduce for high cardinality
• Facet for high QPS
Shuffle vs Push Down
• Shuffling: high cardinality and
parallel relational algebra
(Distributed Joins)
• Pushdown (Facet): blazing fast, high
QPS, moderate cardinality
• aggregationMode flag is available with
the JDBC driver and http interface
[map_reduce or facet]
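As a rough sketch of the HTTP interface mentioned above, the snippet below builds (but does not send) a request that carries a SQL statement and the aggregationMode flag. The `/sql` path, the `stmt` parameter name, the host, and the function name `build_sql_request` are assumptions made for illustration; check them against your Solr version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Build a POST request for a SQL statement without sending it.

    Assumptions: the handler lives at /sql and the statement travels in a
    `stmt` form parameter next to `aggregationMode`.
    """
    if aggregation_mode not in ("map_reduce", "facet"):
        raise ValueError("aggregationMode must be map_reduce or facet")
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    url = f"{base_url.rstrip('/')}/{collection}/sql"
    return Request(url, data=body.encode("utf-8"), method="POST")


req = build_sql_request(
    "http://localhost:8983/solr",  # hypothetical Solr base URL
    "tableB",
    "select distinct a, b from tableB",
    aggregation_mode="facet",  # push the distinct down into the search engine
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; switching `aggregation_mode` to `map_reduce` selects the shuffling implementation for the same statement.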
Aggregations: Stats
• select count(*), sum(a) from tableA
• Uses the StatsComponent under the covers
• Initial release supports count, sum, avg, min,
max
• Aggregation logic is always
pushed down into the search engine.
Aggregations: GROUP BY
• select a, b, count(*), sum(c) from tableB group by a, b having count(*) > 50 order by sum(c) desc
• Supports complex having clause: having (count(*)
> 50 AND sum(b) < 1000)
• Has Map/Reduce implementation (shuffle)
• And JSON Facet implementation (push down)
• Map/Reduce can handle high cardinality multi-dimension aggregations.
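To make the semantics of the GROUP BY statement concrete, here is a minimal in-memory sketch of what it computes. It shows only the result the statement describes, not how either the Map/Reduce or the JSON Facet implementation executes it; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def group_by_a_b(rows, min_count):
    """Evaluate: select a, b, count(*), sum(c) from rows group by a, b
    having count(*) > min_count order by sum(c) desc."""
    totals = defaultdict(lambda: [0, 0])  # (a, b) -> [count, sum of c]
    for row in rows:
        bucket = totals[(row["a"], row["b"])]
        bucket[0] += 1
        bucket[1] += row["c"]
    kept = [
        {"a": a, "b": b, "count(*)": n, "sum(c)": s}
        for (a, b), (n, s) in totals.items()
        if n > min_count  # the HAVING clause
    ]
    # The ORDER BY clause: largest sum(c) first.
    return sorted(kept, key=lambda r: r["sum(c)"], reverse=True)


rows = [
    {"a": "x", "b": "p", "c": 5},
    {"a": "x", "b": "p", "c": 7},
    {"a": "y", "b": "q", "c": 20},
    {"a": "y", "b": "q", "c": 1},
    {"a": "z", "b": "r", "c": 100},
]
result = group_by_a_b(rows, min_count=1)
```

The dictionary holds one entry per distinct (a, b) pair, which is why high cardinality, multi-dimension groupings become expensive on a single node and benefit from being spread across workers.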
JDBC Driver
• Ships with SolrJ
• Poolable Connection and Statement
• SolrCloud Aware Load Balancing
• Connection has aggregationMode
switch [map_reduce or facet]
SQL Under the Hood
SQL Parsing
• Presto SQL Parser handles the parsing
• SQL Statements are compiled to
TupleStream objects
• The TupleStream is the base interface of the
Streaming API
• The Streaming API is a general purpose
parallel computing API for SolrCloud
Parallel Computing Framework
• Shuffling
• Worker Collections
• Streaming API
• Streaming Expressions
• Parallel SQL
Shuffling (sorting & partitioning)
• Shuffling is pushed down into the search engine
• Sorting: /export handler “stream sorts”
entire result sets.
• Partitioning: HashQParserPlugin, hash
partitioning filter. Partitions results on
arbitrary fields.
• Tuples (search results) begin streaming
instantly to worker nodes. Shuffling
never requires a spill to disk.
• All replicas shuffle in parallel for the same
query. Allows for massive throughput.
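The sort-and-partition idea can be simulated in a few lines. This is a toy model under stated assumptions, not Solr's implementation: `worker_for` and `shuffle` are invented names, crc32 merely stands in for whatever hash the partitioning filter uses, and Python lists stand in for network streams.

```python
import heapq
import zlib


def worker_for(key: str, num_workers: int) -> int:
    """Pick a worker from a stable hash of the partition key (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shuffle(shards, key, num_workers):
    """Sort each shard's tuples on `key`, partition them, and merge per worker."""
    per_worker = [[] for _ in range(num_workers)]
    for shard in shards:
        # Stands in for the shard sorting its entire result set.
        ordered = sorted(shard, key=lambda t: t[key])
        buckets = [[] for _ in range(num_workers)]
        for tup in ordered:
            buckets[worker_for(tup[key], num_workers)].append(tup)
        for w in range(num_workers):
            per_worker[w].append(buckets[w])
    # Each worker merges the already-sorted streams it receives from every shard.
    return [list(heapq.merge(*streams, key=lambda t: t[key])) for streams in per_worker]


shards = [
    [{"a": "x", "n": 1}, {"a": "z", "n": 2}, {"a": "y", "n": 3}],
    [{"a": "y", "n": 4}, {"a": "x", "n": 5}],
]
workers = shuffle(shards, "a", num_workers=2)
```

Because partitioning depends only on the key, all tuples with the same key meet on one worker, and because every incoming stream is already sorted, the worker can merge them without buffering whole result sets.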
Shuffling (sorting & partitioning)
[Diagram: a Client, Worker 1 and Worker 2, and Shard 1 and Shard 2 with two replicas each]
• Each worker is shuffled ½ the result set
• Tuples are sorted and partitioned on keys
Worker Collections
• Are Generic SolrCloud Collections
• Can hold data, or just perform work
• Search results are shuffled to the
workers
• Configured with the /stream handler
Streaming API
• Java Programming API for the parallel
computing framework
• Real-time Map/Reduce and Parallel
Relational Algebra
• Abstracts search results as Streams of
tuples (TupleStream)
• Streams are transformed in parallel by
pluggable Decorator streams.
• Parallel transformations include: group by,
rollup, union, intersect, complement and join
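The decorator idea, where one stream wraps another and transforms its tuples, can be sketched with Python generators. The function names echo operations listed above, but the implementations are simplified stand-ins written for this illustration and are not the Java Streaming API.

```python
def search(docs, sort_key):
    """Source stream: yields tuples sorted on sort_key (stands in for a search)."""
    yield from sorted(docs, key=lambda t: t[sort_key])


def unique(stream, over):
    """Decorator: emit the first tuple for each value of `over` (input sorted on it)."""
    last = object()  # sentinel that never equals a real field value
    for tup in stream:
        if tup[over] != last:
            last = tup[over]
            yield tup


def intersect(left, right, on):
    """Decorator: emit left tuples whose `on` value also appears in right (both sorted)."""
    right_iter = iter(right)
    current = next(right_iter, None)
    for tup in left:
        # Advance the right stream until it catches up with the left tuple.
        while current is not None and current[on] < tup[on]:
            current = next(right_iter, None)
        if current is not None and current[on] == tup[on]:
            yield tup


a = [{"id": 3}, {"id": 1}, {"id": 2}, {"id": 2}]
b = [{"id": 2}, {"id": 5}, {"id": 3}]
result = list(unique(intersect(search(a, "id"), search(b, "id"), on="id"), over="id"))
```

Each decorator reads one tuple at a time from the stream it wraps, so the transformations compose by nesting and rely on sorted input rather than on holding the data in memory.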
Streaming Expressions
• Contributed by Dennis Gove (Bloomberg)
• String Query Language and Serialization
format for the Streaming API
• Streaming Expressions compile to
TupleStreams
• TupleStreams serialize to
Streaming Expressions
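The round trip between a string expression and an object tree can be illustrated with a toy parser and serializer. `Expr`, `split_top_level`, and `parse` are hypothetical names for this sketch; Solr's own classes are richer, and named parameters such as `q="..."` are kept here as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Expr:
    """A function call such as rollup(...) with positional operands."""
    name: str
    operands: List[Union["Expr", str]] = field(default_factory=list)

    def serialize(self) -> str:
        parts = [o.serialize() if isinstance(o, Expr) else o for o in self.operands]
        return f"{self.name}({', '.join(parts)})"


def split_top_level(text: str) -> List[str]:
    """Split on commas that sit outside parentheses and double quotes."""
    parts, depth, quoted, start = [], 0, False, 0
    for i, ch in enumerate(text):
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "(":
            depth += 1
        elif not quoted and ch == ")":
            depth -= 1
        elif not quoted and depth == 0 and ch == ",":
            parts.append(text[start:i].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        parts.append(tail)
    return parts


def parse(text: str) -> Union[Expr, str]:
    """Parse name(arg, ...) into an Expr; anything else stays a plain string."""
    text = text.strip()
    open_at = text.find("(")
    if open_at <= 0 or not text.endswith(")") or "=" in text[:open_at]:
        return text
    inner = text[open_at + 1:-1]
    return Expr(text[:open_at], [parse(part) for part in split_top_level(inner)])


source = (
    'rollup(search(collection1, q="(text:XXXX)", qt="/export", '
    'fl="str_s, field_i", partitionKeys=str_s, sort="str_s asc"), '
    "over=str_s, count(*), sum(field_i))"
)
tree = parse(source)
```

Calling `tree.serialize()` yields the original string again, which is the property that lets an expression be shipped to another node as text and rebuilt there.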
Parallel SQL
• Compiles SQL to a TupleStream
• The TupleStream is serialized to a
Streaming Expression and sent to
worker nodes.
• Worker nodes translate the Streaming
Expression back into a TupleStream
• Worker nodes open() and read() the
TupleStream in parallel. Tuples are
returned from each worker
From SQL to Streaming Expression
SQL:
select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
from collection1 where text='XXXX' group by str_s
Streaming Expression:
rollup(
   search(collection1,
          q="(text:XXXX)",
          qt="/export",
          fl="str_s, field_i",
          partitionKeys=str_s,
          sort="str_s asc",
          zkHost="127.0.0.1:64149/solr"),
   over=str_s,
   count(*),
   sum(field_i),
   min(field_i),
   max(field_i),
   avg(field_i))
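The expression above sorts and partitions on str_s so that each worker sees every tuple of a group next to each other. A minimal sketch of what a rollup over such a sorted stream computes follows; the `rollup` function here is a simplified stand-in written for illustration, and the sample tuples are invented.

```python
from itertools import groupby


def rollup(stream, over, field):
    """Aggregate a stream that is already sorted on `over`, one group at a time."""
    for key, group in groupby(stream, key=lambda t: t[over]):
        # For brevity the group's values are collected in a list; running
        # accumulators would avoid holding a whole group in memory.
        values = [t[field] for t in group]
        yield {
            over: key,
            "count(*)": len(values),
            f"sum({field})": sum(values),
            f"min({field})": min(values),
            f"max({field})": max(values),
            f"avg({field})": sum(values) / len(values),
        }


tuples = [
    {"str_s": "a", "field_i": 2},
    {"str_s": "a", "field_i": 4},
    {"str_s": "b", "field_i": 10},
]
rows = list(rollup(tuples, over="str_s", field="field_i"))
```

Because the input is sorted, a group is complete as soon as the key changes, so each aggregate row can be emitted immediately instead of waiting for the end of the stream.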
Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)
[Diagram: a Client, the /SQL handler, Workers 1 through 5, and Shards 1 through 5 with three replicas each]
Jira Tickets
• SOLR-7560: Parallel SQL Support
• SOLR-7377: Solr Streaming Expressions
• SOLR-7082: Streaming Aggregation for SolrCloud
• SOLR-7441: Improve overall robustness of the
Streaming stack: Streaming API,
Streaming Expressions, Parallel SQL
Getting Involved
• SQL is in Trunk
• Releasing with Solr 6
• Streaming API and Streaming Expressions
are located in the SolrJ libraries
(solrj.io)
• Patches welcome
• Testers and feedback needed
Questions
Thanks!

Parallel SQL for SolrCloud

  • 1. OCTOBER 13-16, 2016 • AUSTIN, TX
  • 2. Parallel SQL Joel Bernstein Search Engineer, Alfresco jbernste@apache.org
  • 3. Introduction • Joel Bernstein • Lucene/Solr Committer • Search Engineer at Alfresco • Live and work in NYC
  • 4. Alfresco • Open source ECM (Enterprise Content Management) • Alfresco is a system of record for documents • Uses Solr for search • 1800+ customers • 11 million active user accounts • Alfresco Solr: Document level access control, eventually consistent, transactional, multi-master, distributed search and faceting (coming in Alfresco 5.1)
  • 5. Agenda 1. SQL Unleashed (What can it do?) 2. SQL Under the Hood (How does it work?)
  • 7. Why SQL? • Solr has many awesome features. • But all of these features create complexity. • Which faceting API to use? When to stream? Which parameters to use for optimal performance? • The complexity level increases dramatically when distributed joins come into play. • With SQL we can provide an optimizer to choose the best query plan.
  • 8. The SQL Interface at a Glance • SQL over Map/Reduce: supports high cardinality aggregations and distributed joins. • SQL over Facets: high performance on moderate cardinality aggregations. • SQL with Solr Search Predicates • SQL is fully integrated with SolrCloud
  • 9. SQL Syntax: Limited and Unlimited SELECT • select colA, colB from tableB • select colA, colB from tableB limit 100 • Unlimited selects return the entire result set. Return fields must be DocValues. • Limited selects can sort by score and retrieve any stored field.
  • 10. SQL Syntax: ORDER BY • select a, b from tableB order by a desc, b desc • Unlimited selects sort the entire result set
  • 11. The Predicate: Phrase Searching • select a, b from tableB where c = 'hello world' • Searches for the phrase 'hello world' in field c.
  • 12. The Predicate: Boolean searching • select a, b from tableB where c = '(hello world)' • Adding parens searches for (hello OR world). • Supports Solr query syntax inside the parens.
  • 13. The Predicate: Range query • select a, b from tableB where c = '[0 TO 100]'
  • 14. The Predicate: Arbitrary Boolean clauses • select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
  • 15. SQL Syntax: Select Distinct • select distinct a, b from tableB • Map/Reduce Implementation: Tuples are shuffled to worker nodes where the distinct operation is performed. • JSON Facet Implementation: distinct operation is pushed down into the search engine • Map/Reduce for high cardinality • Facet for high QPS
  • 16. Shuffle vs Push Down • Shuffling: high cardinality and parallel relational algebra (Distributed Joins) • Pushdown (Facet): blazing fast, high QPS, moderate cardinality • aggregationMode flag is available with the JDBC driver and http interface [map_reduce or facet]
  • 17. Aggregations: Stats • select count(*), sum(a) from tableA • Uses the StatsComponent under the covers • Initial release supports count, sum, avg, min, max • Aggregation logic is always pushed down into the search engine.
  • 18. Aggregations: GROUP BY • select a, b, count(*), sum(c) from tableB group by a, b having count(*) > 50 order by sum(c) desc • Supports complex having clauses: having (count(*) > 50 AND sum(b) < 1000) • Has Map/Reduce implementation (shuffle) • And JSON Facet implementation (push down) • Map/Reduce can handle high cardinality multi-dimension aggregations.
  • 19. JDBC Driver • Ships with Solrj • Poolable Connection and Statement • SolrCloud Aware Load Balancing • Connection has aggregationMode switch [map_reduce or facet]
  • 21. SQL Parsing • Presto SQL Parser handles the parsing • SQL Statements are compiled to TupleStream objects • The TupleStream is the base interface of the Streaming API • The Streaming API is a general purpose parallel computing API for SolrCloud
  • 22. Parallel Computing Framework • Shuffling • Worker Collections • Streaming API • Streaming Expressions • Parallel SQL
  • 23. Shuffling (sorting & partitioning) • Shuffling is pushed down into the search engine • Sorting: /export handler “stream sorts” entire result sets. • Partitioning: HashQParserPlugin, hash partitioning filter. Partitions results on arbitrary fields. • Tuples (search results) begin streaming instantly to worker nodes. Shuffling never requires a spill to disk. • All replicas shuffle in parallel for the same query. Allows for massive throughput.
  • 24. Shuffling (sorting & partitioning) [Diagram: Client, Worker 1, Worker 2, Shards 1-2 with Replicas 1-2 each] • Each worker is shuffled ½ the result set • Tuples are sorted and partitioned on keys
  • 25. Worker Collections • Are Generic SolrCloud Collections • Can hold data, or just perform work • Search results are shuffled to the workers • Configured with the /stream handler
  • 26. Streaming API • Java Programming API for the parallel computing framework • Real-time Map/Reduce and Parallel Relational Algebra • Abstracts search results as Streams of tuples (TupleStream) • Streams are transformed in parallel by pluggable Decorator streams. • Parallel transformations include: group by, rollup, union, intersect, complement and join
  • 27. Streaming Expressions • Contributed by Dennis Gove (Bloomberg) • String Query Language and Serialization format for the Streaming API • Streaming Expressions compile to TupleStreams • TupleStreams serialize to Streaming Expressions
  • 28. Parallel SQL • Compiles SQL to a TupleStream • The TupleStream is serialized to a Streaming Expression and sent to worker nodes. • Worker nodes translate the Streaming Expression back into a TupleStream • Worker nodes open() and read() the TupleStream in parallel. Tuples are returned from each worker
  • 29. From SQL to Streaming Expression • SQL: select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX' group by str_s • Streaming Expression: rollup(search(collection1, q="(text:XXXX)", qt="/export", fl="str_s, field_i", partitionKeys=str_s, sort="str_s asc", zkHost="127.0.0.1:64149/solr"), over=str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
  • 30. Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce) [Diagram: Client, /SQL handler, Workers 1-5, Shards 1-5 with Replicas 1-3 each]
  • 31. Jira Tickets • SOLR-7560: Parallel SQL Support • SOLR-7377: Solr Streaming Expressions • SOLR-7082: Streaming Aggregation for SolrCloud • SOLR-7441: Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL
  • 32. Getting Involved • SQL is in Trunk • Releasing with Solr 6 • Streaming API and Streaming Expressions are located in the Solrj libraries (solrj.io) • Patches welcome • Testers and feedback needed
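
The rollup on slide 29 relies on the stream arriving sorted on the group-by key, so a worker only needs to hold one group's running totals at a time. The following is a minimal Python sketch of that idea, not Solr's implementation: the `rollup` function, its parameters, and the sample tuples are hypothetical, and only the field names `str_s` and `field_i` come from the slide.

```python
from itertools import groupby
from operator import itemgetter


def rollup(sorted_tuples, over, field):
    """Aggregate a stream already sorted on `over`, keeping one group's running totals at a time."""
    for key, group in groupby(sorted_tuples, key=itemgetter(over)):
        count, total, lo, hi = 0, 0, None, None
        for t in group:
            value = t[field]
            count += 1
            total += value
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
        yield {
            over: key,
            "count(*)": count,
            f"sum({field})": total,
            f"min({field})": lo,
            f"max({field})": hi,
            f"avg({field})": total / count,
        }


# Hypothetical search results, already sorted on str_s (the sort the slide requests with sort="str_s asc").
tuples = [
    {"str_s": "a", "field_i": 1},
    {"str_s": "a", "field_i": 3},
    {"str_s": "b", "field_i": 10},
]
results = list(rollup(tuples, over="str_s", field="field_i"))
```

Because the input is sorted, a group is complete as soon as the key changes, which is why this style of aggregation can emit results while tuples are still streaming in.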

Editor's Notes

  1. This slide shows a network diagram of a parallel SQL shuffle. The steps are: A client sends a SQL query to the SQL handler running on a Solr node. The SQL handler compiles the SQL query to a Streaming Expression (see slide 27). The Streaming Expression is sent across the wire to the worker nodes. Each worker node compiles the Streaming Expression to a Streaming API TupleStream. The workers open the TupleStream and iterate the Tuples. The requests from the worker nodes to the shards occur when the TupleStream is opened and iterated. Each worker node contacts each shard and is shuffled 1/5 of the search results. Requests to the shards are spread evenly across all replicas. All replicas shuffle in parallel to satisfy the query.
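
The shuffle described in this note can be simulated in a few lines. This Python sketch is an illustration under stated assumptions, not Solr code: `zlib.crc32` stands in for Solr's hash partitioning filter, the function names and the sample index are hypothetical, and replica selection is left out. Each shard returns only the tuples whose key hashes to the requesting worker, sorted on the key, and the worker merge-sorts the per-shard streams.

```python
import heapq
import zlib

NUM_WORKERS = 5
NUM_SHARDS = 5


def partition(key, num_workers):
    """Stable hash of the partition key (an assumed stand-in for Solr's hash filter)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shard_export(shard_docs, worker_id, num_workers):
    """One shard's reply to one worker: that worker's partition only, sorted on the key."""
    mine = [d for d in shard_docs if partition(d["str_s"], num_workers) == worker_id]
    return sorted(mine, key=lambda d: d["str_s"])


def worker_stream(shards, worker_id, num_workers):
    """A worker contacts every shard and merges the sorted partial streams into one."""
    streams = [shard_export(docs, worker_id, num_workers) for docs in shards]
    return list(heapq.merge(*streams, key=lambda d: d["str_s"]))


# Hypothetical index: 5 shards of 20 documents each, 10 distinct keys overall.
shards = [
    [{"str_s": f"key{(s * 20 + i) % 10}", "field_i": i} for i in range(20)]
    for s in range(NUM_SHARDS)
]

per_worker = [worker_stream(shards, w, NUM_WORKERS) for w in range(NUM_WORKERS)]
```

Every tuple lands on exactly one worker, and all tuples for a given key land on the same worker, so each worker can aggregate or join its partition independently. How evenly the work divides depends on how the keys hash, so the 1/5 share per worker is an average rather than a guarantee.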