October 13-16, 2016 • Austin, TX
Parallel SQL
Joel Bernstein
Search Engineer, Alfresco
jbernste@apache.org
Introduction
• Joel Bernstein
• Lucene/Solr Committer
• Search Engineer at Alfresco
• Live and work in NYC
Alfresco
• Open source ECM (Enterprise Content Management)
• Alfresco is a system of record for documents
• Uses Solr for search
• 1800+ customers
• 11 million active user accounts
• Alfresco Solr: Document level access control,
eventually consistent, transactional,
multi-master, distributed search and faceting
(coming in Alfresco 5.1)
Agenda
1. SQL Unleashed (What can it do?)
2. SQL Under the Hood (How does it work?)
SQL Unleashed
(In Solr 6.0)
Why SQL?
• Solr has many awesome features.
• But all of these features create complexity.
• Which faceting API to use? When to
Stream? Which parameters to use for
optimal performance?
• The complexity level increases
dramatically when distributed joins come
into play
• With SQL we can provide an optimizer to
choose the best query plan.
The SQL Interface at a Glance
• SQL over Map/Reduce: supports high
cardinality aggregations and
distributed joins.
• SQL over Facets: high performance on
moderate cardinality aggregations.
• SQL with Solr Search Predicates
• SQL is fully integrated with SolrCloud
SQL Syntax: Limited and Unlimited SELECT
• select colA, colB from tableB
• select colA, colB from tableB limit 100
• Unlimited selects return the entire result
set. Return fields must be DocValues.
• Limited selects can sort by score and
retrieve any stored field.
SQL Syntax: ORDER BY
• select a, b from tableB order by a desc, b desc
• Unlimited selects sort the entire result
set
The Predicate: Phrase Searching
• select a, b from tableB where c = 'hello world'
• Searches for the phrase 'hello world' in field c.
The Predicate: Boolean searching
• select a, b from tableB where c = '(hello world)'
• Adding parens searches for (hello OR world).
• Supports Solr query syntax inside the parens.
The Predicate: Range query
• select a, b from tableB where c = '[0 TO 100]'
The Predicate: Arbitrary Boolean clauses
• select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
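The predicate slides above can be read as a mapping from a SQL equality predicate onto a Solr query clause. The sketch below is illustrative Python only, not Solr's actual translator; the function names `to_solr_clause` and `and_clauses` and the exact output strings are assumptions inferred from the slide examples.

```python
def to_solr_clause(field: str, value: str) -> str:
    """Map a SQL predicate `field = 'value'` to a Solr-style query clause."""
    value = value.strip()
    # Parenthesized values are passed through as Solr query syntax.
    if value.startswith("(") and value.endswith(")"):
        return f"{field}:{value}"
    # Range expressions such as [0 TO 100] are passed through as well.
    if value.startswith(("[", "{")) and value.endswith(("]", "}")):
        return f"{field}:{value}"
    # A multi-word value without parens is treated as a phrase search.
    if " " in value:
        return f'{field}:"{value}"'
    return f"{field}:{value}"


def and_clauses(*clauses: str) -> str:
    """Combine clauses the way `(c = ... AND d = ...)` reads in SQL."""
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    print(to_solr_clause("c", "hello world"))
    print(to_solr_clause("c", "(hello world)"))
    print(and_clauses(to_solr_clause("c", "hello world"),
                      to_solr_clause("d", "[0 TO 100]")))
```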
SQL Syntax: Select Distinct
• select distinct a, b from tableB
• Map/Reduce Implementation: Tuples
are shuffled to worker nodes where
the distinct operation is performed.
• JSON Facet Implementation: distinct
operation is pushed down into the
search engine
• Map/Reduce for high cardinality
• Facet for high QPS
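As a rough picture of the Map/Reduce distinct described above: once tuples arrive at a worker sorted on the distinct fields, the worker only has to compare each tuple with the previous one. The Python sketch below assumes that sorted input; the name `unique` and the dict-shaped tuples are illustrative choices, not Solr code.

```python
def unique(sorted_tuples, fields):
    """Yield the first tuple of each run of equal key fields.

    Assumes the input is already sorted on `fields`, so only the previous
    key has to be remembered and memory use stays constant.
    """
    previous = object()  # sentinel that never equals a real key
    for t in sorted_tuples:
        key = tuple(t[f] for f in fields)
        if key != previous:
            yield t
            previous = key


if __name__ == "__main__":
    rows = [
        {"a": 1, "b": "x"},
        {"a": 1, "b": "x"},
        {"a": 1, "b": "z"},
        {"a": 2, "b": "y"},
        {"a": 2, "b": "y"},
    ]
    for row in unique(rows, ["a", "b"]):
        print(row)
```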
Shuffle vs Push Down
• Shuffling: high cardinality and
parallel relational algebra
(Distributed Joins)
• Pushdown (Facet): blazing fast, high
QPS, moderate cardinality
• aggregationMode flag is available with
the JDBC driver and HTTP interface
[map_reduce or facet]
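A minimal sketch of what switching modes over the HTTP interface might look like. It only builds the request and does not send it. The `/sql` path, the `stmt` form parameter, and the host, port, and collection name are assumptions that do not come from the slides; only `aggregationMode` and its two values do.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_MODES = ("map_reduce", "facet")


def build_sql_request(solr_url, collection, stmt, aggregation_mode="map_reduce"):
    """Build (but do not send) a POST request for a SQL statement."""
    if aggregation_mode not in VALID_MODES:
        raise ValueError(f"aggregationMode must be one of {VALID_MODES}")
    # Assumed layout: <solr_url>/<collection>/sql?aggregationMode=<mode>
    query = urlencode({"aggregationMode": aggregation_mode})
    url = f"{solr_url.rstrip('/')}/{collection}/sql?{query}"
    # The statement travels form-encoded in the request body.
    body = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    req = build_sql_request(
        "http://localhost:8983/solr",  # hypothetical host and port
        "collection1",
        "select distinct a, b from tableB",
        aggregation_mode="facet",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Choosing `facet` here favors throughput on moderate cardinality; `map_reduce` is the mode to reach for when cardinality is high.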
Aggregations: Stats
• select count(*), sum(a) from tableA
• Uses the StatsComponent under the covers
• Initial release supports count, sum, avg, min,
max
• Aggregation logic is always
pushed down into the search engine.
Aggregations: GROUP BY
• select a, b, count(*), sum(c) from tableB group by a, b
having count(*) > 50 order by sum(c) desc
• Supports complex having clause: having (count(*)
> 50 AND sum(b) < 1000)
• Has Map/Reduce implementation (shuffle)
• And JSON Facet implementation (push down)
• Map/Reduce can handle high cardinality multi-
dimension aggregations.
JDBC Driver
• Ships with SolrJ
• Poolable Connection and Statement
• SolrCloud Aware Load Balancing
• Connection has aggregationMode
switch [map_reduce or facet]
SQL Under the Hood
SQL Parsing
• Presto SQL Parser handles the parsing
• SQL Statements are compiled to
TupleStream objects
• The TupleStream is the base interface of the
Streaming API
• The Streaming API is a general purpose
parallel computing API for SolrCloud
Parallel Computing Framework
• Shuffling
• Worker Collections
• Streaming API
• Streaming Expressions
• Parallel SQL
Shuffling (sorting & partitioning)
• Shuffling is pushed down into the search engine
• Sorting: /export handler “stream sorts”
entire result sets.
• Partitioning: HashQParserPlugin, hash
partitioning filter. Partitions results on
arbitrary fields.
• Tuples (search results) begin streaming
instantly to worker nodes. Shuffling
never requires a spill to disk.
• All replicas shuffle in parallel for the same
query. Allows for massive throughput.
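The sorting and partitioning steps above can be sketched as follows: each replica sorts its results, and each worker receives only the tuples whose partition key hashes to that worker. This is a Python illustration under stated assumptions; CRC32 stands in for whatever hash the HashQParserPlugin actually uses, and `belongs_to_worker` and `shard_stream` are hypothetical names.

```python
import zlib


def belongs_to_worker(doc, partition_keys, num_workers, worker_id):
    """Hash partitioning filter: True if `doc` falls in this worker's partition."""
    key = "|".join(str(doc[k]) for k in partition_keys)
    # CRC32 is a stand-in hash, chosen because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_workers == worker_id


def shard_stream(shard_docs, sort_key, partition_keys, num_workers, worker_id):
    """What one replica streams to one worker: sorted docs from its partition."""
    for doc in sorted(shard_docs, key=lambda d: d[sort_key]):
        if belongs_to_worker(doc, partition_keys, num_workers, worker_id):
            yield doc


if __name__ == "__main__":
    docs = [{"id": i, "str_s": s} for i, s in enumerate("abcabd")]
    for worker in range(2):
        stream = shard_stream(docs, "str_s", ["str_s"], 2, worker)
        print(worker, [d["str_s"] for d in stream])
```

Because equal keys always hash to the same worker, every worker can aggregate or join its partition without talking to the others.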
Shuffling (sorting & partitioning)
[Diagram: Client, Worker 1, Worker 2, and two shards (Shard 1, Shard 2) with two replicas each. Tuples are sorted and partitioned on keys; each worker is shuffled ½ the result set.]
Worker Collections
• Are Generic SolrCloud Collections
• Can hold data, or just perform work
• Search results are shuffled to the
workers
• Configured with the /stream handler
Streaming API
• Java Programming API for the parallel
computing framework
• Real-time Map/Reduce and Parallel
Relational Algebra
• Abstracts search results as Streams of
tuples (TupleStream)
• Streams are transformed in parallel by
pluggable Decorator streams.
• Parallel transformations include: group by,
rollup, union, intersect, complement and join
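To make the decorator idea above concrete, here is a small Python stand-in: a base stream with `open()`, `read()`, and `close()`, wrapped by a decorator that implements an intersect over two sorted streams. The class names, the `EOF` marker, and the `read_all` helper are assumptions for illustration; they are not the SolrJ classes.

```python
EOF = {"EOF": True}  # marker tuple signalling the end of a stream


class ListStream:
    """Stand-in for a search: replays tuples the caller has already sorted."""

    def __init__(self, tuples):
        self._tuples = tuples
        self._it = None

    def open(self):
        self._it = iter(self._tuples)

    def read(self):
        return next(self._it, EOF)

    def close(self):
        self._it = None


class IntersectStream:
    """Decorator: emits left tuples whose key also appears in the right stream.

    Both inputs must be sorted ascending on `on`.
    """

    def __init__(self, left, right, on):
        self.left, self.right, self.on = left, right, on
        self._right_tuple = EOF

    def open(self):
        self.left.open()
        self.right.open()
        self._right_tuple = self.right.read()

    def read(self):
        while True:
            left_tuple = self.left.read()
            if left_tuple is EOF:
                return EOF
            # Advance the right side until it catches up with the left key.
            while (self._right_tuple is not EOF
                   and self._right_tuple[self.on] < left_tuple[self.on]):
                self._right_tuple = self.right.read()
            if self._right_tuple is EOF:
                return EOF
            if self._right_tuple[self.on] == left_tuple[self.on]:
                return left_tuple

    def close(self):
        self.left.close()
        self.right.close()


def read_all(stream):
    """Open a stream, drain it up to EOF, and close it."""
    stream.open()
    try:
        out = []
        while True:
            t = stream.read()
            if t is EOF:
                return out
            out.append(t)
    finally:
        stream.close()
```

Since a decorator exposes the same three methods as the stream it wraps, decorators can be nested to build larger pipelines.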
Streaming Expressions
• Contributed by Dennis Gove (Bloomberg)
• String Query Language and Serialization
format for the Streaming API
• Streaming Expressions compile to
TupleStreams
• TupleStreams serialize to
Streaming Expressions
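One direction of that round trip, turning an in-memory stream description into its expression string, might be sketched like this in Python. The `Expr` class and its parameter encoding are hypothetical and far simpler than the real serialization; parsing in the other direction is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    """A function-style expression: name(param, param, ...).

    Each parameter is a nested Expr, a (name, value) pair for a named
    parameter, or a plain value emitted as-is.
    """
    name: str
    params: list = field(default_factory=list)

    def to_string(self) -> str:
        parts = []
        for p in self.params:
            if isinstance(p, Expr):
                parts.append(p.to_string())
            elif isinstance(p, tuple):
                parts.append(f"{p[0]}={p[1]}")
            else:
                parts.append(str(p))
        return f"{self.name}({', '.join(parts)})"


if __name__ == "__main__":
    search = Expr("search", ["collection1",
                             ("q", '"(text:XXXX)"'),
                             ("sort", '"str_s asc"')])
    rollup = Expr("rollup", [search, ("over", "str_s"), Expr("count", ["*"])])
    print(rollup.to_string())
```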
Parallel SQL
• Compiles SQL to a TupleStream
• The TupleStream is serialized to a
Streaming Expression and sent to
worker nodes.
• Worker nodes translate the Streaming
Expression back into TupleStream
• Worker nodes open() and read() the
TupleStream in parallel. Tuples are
returned from each worker
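The last step above, reading from several workers at once and combining what they return, can be pictured with the Python sketch below. It assumes each worker returns tuples sorted on the same key and that the caller merges them by that key; the slides only say tuples are returned from each worker, so the merge, the thread pool, and the names `run_worker` and `parallel_read` are assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def run_worker(partition, sort_key):
    """Stand-in for a worker: returns its partition sorted on `sort_key`."""
    return sorted(partition, key=lambda t: t[sort_key])


def parallel_read(partitions, sort_key):
    """Run one worker per partition, then merge the sorted outputs."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        outputs = list(pool.map(lambda p: run_worker(p, sort_key), partitions))
    # heapq.merge keeps the combined output sorted without re-sorting it.
    return list(heapq.merge(*outputs, key=lambda t: t[sort_key]))


if __name__ == "__main__":
    partitions = [
        [{"str_s": "c"}, {"str_s": "a"}],
        [{"str_s": "b"}, {"str_s": "d"}],
    ]
    print([t["str_s"] for t in parallel_read(partitions, "str_s")])
```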
From SQL to Streaming Expression
select str_s, count(*), sum(field_i), min(field_i), max(field_i),
       avg(field_i) from collection1 where text='XXXX' group by str_s

rollup(
   search(collection1,
          q="(text:XXXX)",
          qt="/export",
          fl="str_s, field_i",
          partitionKeys=str_s,
          sort="str_s asc",
          zkHost="127.0.0.1:64149/solr"),
   over=str_s,
   count(*),
   sum(field_i),
   min(field_i),
   max(field_i),
   avg(field_i))
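What the rollup in this expression computes can be approximated in a few lines of Python: because the search is sorted on `str_s`, each group arrives as one contiguous run, and the metrics can be computed one run at a time. This is a sketch under that assumption; the output column names are a guess at a reasonable labeling, not Solr's documented output.

```python
from itertools import groupby


def rollup(sorted_tuples, over, field):
    """Aggregate one numeric field per run of equal `over` values.

    Assumes the input is sorted on `over`. For brevity this sketch collects
    each group's values in a list; a streaming version would keep running
    totals instead.
    """
    for key, group in groupby(sorted_tuples, key=lambda t: t[over]):
        values = [t[field] for t in group]
        yield {
            over: key,
            "count(*)": len(values),
            f"sum({field})": sum(values),
            f"min({field})": min(values),
            f"max({field})": max(values),
            f"avg({field})": sum(values) / len(values),
        }


if __name__ == "__main__":
    tuples = [
        {"str_s": "a", "field_i": 1},
        {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 10},
    ]
    for row in rollup(tuples, over="str_s", field="field_i"):
        print(row)
```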
Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)
[Diagram: Client, /SQL handler, Workers 1-5, and Shards 1-5 with three replicas each.]
Jira Tickets
• SOLR-7560: Parallel SQL Support
• SOLR-7377: Solr Streaming Expressions
• SOLR-7082: Streaming Aggregation for SolrCloud
• SOLR-7441: Improve overall robustness of the
Streaming stack: Streaming API,
Streaming Expressions, Parallel SQL
Getting Involved
• SQL is in Trunk
• Releasing with Solr 6
• Streaming API and Streaming Expressions
are located in the SolrJ libraries
(solrj.io)
• Patches welcome
• Testers and feedback needed
Questions
Thanks!
Parallel SQL for SolrCloud

  • 1. October 13-16, 2016 • Austin, TX
  • 2. Parallel SQL Joel Bernstein Search Engineer, Alfresco jbernste@apache.org
  • 3. Introduction • Joel Bernstein • Lucene/Solr Committer • Search Engineer at Alfresco • Live and work in NYC
  • 4. Alfresco • Open source ECM (Enterprise Content Management) • Alfresco is a system of record for documents • Uses Solr for search • 1800+ customers • 11 million active user accounts • Alfresco Solr: Document level access control, eventually consistent, transactional, multi-master, distributed search and faceting (coming in Alfresco 5.1)
  • 5. Agenda 1. SQL Unleashed (What can it do?) 2. SQL Under the Hood (How does it work?)
  • 7. Why SQL? • Solr has many awesome features. • But all of these features create complexity. • Which faceting API to use? When to stream? Which parameters to use for optimal performance? • The complexity level increases dramatically when distributed joins come into play. • With SQL we can provide an optimizer to choose the best query plan.
  • 8. The SQL Interface at a Glance • SQL over Map/Reduce: supports high cardinality aggregations and distributed joins. • SQL over Facets: high performance on moderate cardinality aggregations. • SQL with Solr Search Predicates • SQL is fully integrated with SolrCloud
  • 9. SQL Syntax: Limited and Unlimited SELECT • select colA, colB from tableB • select colA, colB from tableB limit 100 • Unlimited selects return the entire result set. Return fields must be DocValues. • Limited selects can sort by score and retrieve any stored field.
  • 10. SQL Syntax: ORDER BY • select a, b from tableB order by a desc, b desc • Unlimited selects sort the entire result set
  • 11. The Predicate: Phrase Searching • select a, b from tableB where c = 'hello world' • Searches for the phrase ‘hello world’ in field c.
  • 12. The Predicate: Boolean searching • select a, b from tableB where c = '(hello world)' • Adding parens searches for (hello OR world). • Supports Solr query syntax inside the parens.
  • 13. The Predicate: Range query • select a, b from tableB where c = '[0 TO 100]'
  • 14. The Predicate: Arbitrary Boolean clauses • select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
  • 15. SQL Syntax: Select Distinct • select distinct a, b from tableB • Map/Reduce Implementation: Tuples are shuffled to worker nodes where the distinct operation is performed. • JSON Facet Implementation: distinct operation is pushed down into the search engine • Map/Reduce for high cardinality • Facet for high QPS
  • 16. Shuffle vs Push Down • Shuffling: high cardinality and parallel relational algebra (Distributed Joins) • Pushdown (Facet): blazing fast, high QPS, moderate cardinality • aggregationMode flag is available with the JDBC driver and http interface [map_reduce or facet]
  • 17. Aggregations: Stats • select count(*), sum(a) from tableA • Uses the StatsComponent under the covers • Initial release supports count, sum, avg, min, max • Aggregation logic is always pushed down into the search engine.
  • 18. Aggregations: GROUP BY • select a, b, count(*), sum(c) from tableB group by a, b having count(*) > 50 order by sum(c) desc • Supports complex having clause: having (count(*) > 50 AND sum(b) < 1000) • Has Map/Reduce implementation (shuffle) • And JSON Facet implementation (push down) • Map/Reduce can handle high cardinality multi-dimension aggregations.
  • 19. JDBC Driver • Ships with Solrj • Poolable Connection and Statement • SolrCloud Aware Load Balancing • Connection has aggregationMode switch [map_reduce or facet]
  • 21. SQL Parsing • Presto SQL Parser handles the parsing • SQL Statements are compiled to TupleStream objects • The TupleStream is the base interface of the Streaming API • The Streaming API is a general purpose parallel computing API for SolrCloud
  • 22. Parallel Computing Framework • Shuffling • Worker Collections • Streaming API • Streaming Expressions • Parallel SQL
  • 23. Shuffling (sorting & partitioning) • Shuffling is pushed down into the search engine • Sorting: /export handler “stream sorts” entire result sets. • Partitioning: HashQParserPlugin, hash partitioning filter. Partitions results on arbitrary fields. • Tuples (search results) begin streaming instantly to worker nodes. Shuffling never requires a spill to disk. • All replicas shuffle in parallel for the same query. Allows for massive throughput.
  • 24. Shuffling (sorting & partitioning) [Diagram: a Client, Worker 1 and Worker 2, and Shard 1 and Shard 2 with two replicas each.] • Each worker is shuffled ½ the result set • Tuples are sorted and partitioned on keys
  • 25. Worker Collections • Are Generic SolrCloud Collections • Can hold data, or just perform work • Search results are shuffled to the workers • Configured with the /stream handler
  • 26. Streaming API • Java Programming API for the parallel computing framework • Real-time Map/Reduce and Parallel Relational Algebra • Abstracts search results as Streams of tuples (TupleStream) • Streams are transformed in parallel by pluggable Decorator streams. • Parallel transformations include: group by, rollup, union, intersect, complement and join
  • 27. Streaming Expressions • Contributed by Dennis Gove (Bloomberg) • String Query Language and Serialization format for the Streaming API • Streaming Expressions compile to TupleStreams • TupleStreams serialize to Streaming Expressions
  • 28. Parallel SQL • Compiles SQL to a TupleStream • The TupleStream is serialized to a Streaming Expression and sent to worker nodes. • Worker nodes translate the Streaming Expression back into a TupleStream • Worker nodes open() and read() the TupleStream in parallel. Tuples are returned from each worker.
  • 29. From SQL to Streaming Expression • SQL: select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX' group by str_s • Streaming Expression: rollup(search(collection1, q="(text:XXXX)", qt="/export", fl="str_s, field_i", partitionKeys=str_s, sort="str_s asc", zkHost="127.0.0.1:64149/solr"), over=str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
  • 30. Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce) [Diagram: a Client, the /SQL handler, Workers 1-5, and Shards 1-5 with three replicas each.]
  • 31. Jira Tickets • SOLR-7560: Parallel SQL Support • SOLR-7377: Solr Streaming Expressions • SOLR-7082: Streaming Aggregation for SolrCloud • SOLR-7441: Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL
  • 32. Getting Involved • SQL is in Trunk • Releasing with Solr 6 • Streaming API and Streaming Expressions are located in the Solrj libraries (solrj.io) • Patches welcome • Testers and feedback needed
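
Slides 16 and 19 mention an aggregationMode flag on the HTTP interface. The following is a minimal sketch of how such a request might be assembled in Python. The `/sql` handler path and the `stmt` parameter name are assumptions taken from the Solr 6 documentation rather than from these slides. The base URL, the collection name, and the helper `build_sql_request` are hypothetical. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Modes named on the slides: shuffle (map_reduce) or push down (facet).
AGGREGATION_MODES = ("map_reduce", "facet")


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Return (url, form_body) for a POST to a collection's SQL handler.

    Assumption: the handler lives at /<collection>/sql and reads the
    statement from a form parameter called `stmt`.
    """
    if aggregation_mode not in AGGREGATION_MODES:
        raise ValueError(f"aggregation_mode must be one of {AGGREGATION_MODES}")
    query = urlencode({"aggregationMode": aggregation_mode})
    url = f"{base_url.rstrip('/')}/{collection}/sql?{query}"
    body = urlencode({"stmt": stmt})  # form-encoded POST body
    return url, body


# Hypothetical host and collection, using the GROUP BY query from slide 18.
url, body = build_sql_request(
    "http://localhost:8983/solr",
    "tableB",
    "select a, b, count(*), sum(c) from tableB group by a, b",
    aggregation_mode="facet",
)
```

Switching `aggregation_mode` to `"map_reduce"` would, per slide 16, trade the facet push down for a shuffle that copes with higher cardinality.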

Editor's Notes

  1. This slide shows a network diagram of a parallel SQL shuffle. Below are the steps: A client sends a SQL query to the SQL handler running on a Solr node. The SQL handler compiles the SQL query to a Streaming Expression (see slide 27). The Streaming Expression is sent across the wire to the worker nodes. Each worker node compiles the Streaming Expression to a Streaming API TupleStream. The workers open the TupleStream and iterate the Tuples. The requests from the worker nodes to the shards occur when the TupleStream is opened and iterated. Each worker node contacts each shard and is shuffled 1/5 of the search results. Requests to the shards are spread evenly across all replicas. All replicas shuffle in parallel to satisfy the query.
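
The shuffle described in this note can be mimicked in a few lines of plain Python. This is only an in-memory illustration, not Solr code: lists of dicts stand in for shards, `zlib.crc32` stands in for the hash partitioning filter, and all function names are hypothetical. Each worker asks every shard for its partition of the keys, receives it sorted, merges the sorted streams, and rolls up counts and sums per key.

```python
import heapq
import zlib
from itertools import groupby


def partition(key, num_workers):
    """Assign a key to a worker. crc32 is a stand-in for Solr's hash filter."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shard_export(shard_docs, worker_id, num_workers, key):
    """What one shard streams to one worker: that worker's keys, sorted."""
    mine = [d for d in shard_docs if partition(d[key], num_workers) == worker_id]
    return sorted(mine, key=lambda d: d[key])


def worker_rollup(shards, worker_id, num_workers, key, field):
    """Merge the sorted streams from every shard and aggregate per key."""
    streams = [shard_export(s, worker_id, num_workers, key) for s in shards]
    merged = heapq.merge(*streams, key=lambda d: d[key])
    rows = []
    for k, group in groupby(merged, key=lambda d: d[key]):
        values = [d[field] for d in group]
        rows.append({key: k, "count(*)": len(values), f"sum({field})": sum(values)})
    return rows


# Two toy shards; field names follow the example on slide 29.
shards = [
    [{"str_s": "a", "field_i": 1}, {"str_s": "b", "field_i": 2}],
    [{"str_s": "a", "field_i": 3}, {"str_s": "c", "field_i": 4}],
]
num_workers = 2
results = [
    row
    for worker_id in range(num_workers)
    for row in worker_rollup(shards, worker_id, num_workers, "str_s", "field_i")
]
```

Because every tuple with the same key lands on the same worker, each worker can aggregate its own keys independently, and the client only has to concatenate the per-worker rows.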