© Cloudera, Inc. All rights reserved.
Solr 6 Feature Preview
Yonik Seeley
3/09/2016
My Background
• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
Solr 6
• Happy Birthday Solr!
• 10 Years at the Apache Software Foundation as of 1/2016
• Release branch has been cut
• ETA before April
• Java 8+ only
Streaming Expressions
Solr Streaming Expressions
• Generic platform for distributed computation
• The basis for implementing distributed SQL
• Works across entire result sets (or subsets)
• normal search operations are designed for fast top-N operations
• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism
Tuple Streams
• A streaming expression compiles/parses to a tuple stream
• direct mapping from a streaming expression function->tuple_stream
• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems
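The source/decorator split can be pictured with Python generators. This is only an illustration, not Solr's implementation: `list_source` and `select_fields` are hypothetical names standing in for a stream source and a stream decorator, and a tuple is modelled as a plain dict.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]  # a tuple is modelled as a dict of field name -> value


def list_source(docs: List[Row]) -> Iterator[Row]:
    """Stream source (hypothetical): produces tuples one at a time."""
    for doc in docs:
        yield dict(doc)


def select_fields(stream: Iterable[Row], fields: List[str]) -> Iterator[Row]:
    """Stream decorator (hypothetical): wraps a stream and keeps only the named fields."""
    for row in stream:
        yield {name: row[name] for name in fields if name in row}
```

Because a decorator only needs something it can iterate over, the wrapped stream could just as well come from a non-Solr system.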
search() expression
$ curl http://localhost:8983/solr/techproducts/stream -d \
    'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'
{"result-set":{"docs":[
{"score":1.0,"id":"0579B002","price":179.99},
{"score":1.0,"id":"100-435805","price":649.99},
{"score":1.0,"id":"3007WFP","price":2199.0},
{"score":1.0,"id":"VDBDB1A16"},
{"score":1.0,"id":"VS1GB400C3","price":74.99},
{"EOF":true,"RESPONSE_TIME":6}]}}
resulting tuple stream
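A client reading this response has to stop at the trailing EOF tuple, which carries no document. A minimal Python sketch of that step follows. The function name `iter_tuples` is hypothetical, and the sketch parses the whole body at once rather than streaming it, purely to stay short.

```python
import json
from typing import Dict, Iterator


def iter_tuples(response_text: str) -> Iterator[Dict[str, object]]:
    """Yield document tuples from a /stream response body, stopping at the EOF tuple."""
    docs = json.loads(response_text)["result-set"]["docs"]
    for doc in docs:
        if doc.get("EOF"):
            return  # the EOF tuple only carries bookkeeping such as RESPONSE_TIME
        yield doc


# Shortened copy of the response shown on the slide.
SAMPLE = """{"result-set":{"docs":[
{"score":1.0,"id":"0579B002","price":179.99},
{"score":1.0,"id":"VDBDB1A16"},
{"EOF":true,"RESPONSE_TIME":6}]}}"""
```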
Search Tuple Stream
[Diagram: replicas of Shard 1, Shard 2, and Shard 3 each send a tuple stream to a single worker; the /stream worker is executing the "search" expression.]
• search() is a stream source
• SolrCloud aware (CloudSolrStream Java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node
search expression args
search(                            // parses to the CloudSolrStream Java class
  techproducts,                    // name of the collection to search
  zkHost="localhost:9983",         // (opt) ZooKeeper address of the collection to search
  qt="/select",                    // (opt) the request handler to use (/export is also available)
  rows=1000000,                    // (opt) number of rows to retrieve
  q=*:*,                           // query to match returned documents
  fl="id,price,score",             // which fields to return
  sort="id asc, price desc",       // how to sort the results
  aliases="id=myid,price=myprice"  // (opt) renames output fields
)
reduce() streaming expression
• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains list of tuples
• "by" parameter must match up with the "sort" parameter
• Any partitioning should be done on the same group field
reduce(
  search(collection1, qt="/export",
         q="*:*",
         fl="id,manu,price",
         sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100)   // stream operation
)
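The grouping behaviour described above can be sketched in Python with `itertools.groupby`, which, like reduce(), only works if the input is already sorted on the grouping field. The function name `reduce_group` and the `"group"` field name on the group-head are assumptions of this sketch, not taken from the slides.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator


def reduce_group(stream: Iterable[Dict], by: str, sort_field: str, n: int) -> Iterator[Dict]:
    """Emit one group-head per run of equal `by` values.

    Assumes `stream` is already sorted by `by`; otherwise one value
    could produce several groups.
    """
    for _, members in groupby(stream, key=lambda t: t[by]):
        # Mirrors group(sort="<sort_field> desc", n=<n>): keep the top n of the group.
        ranked = sorted(members, key=lambda t: t[sort_field], reverse=True)[:n]
        head = dict(ranked[0])    # the group-head is a copy of the first-ranked tuple
        head["group"] = ranked    # assumed field name for the list of member tuples
        yield head
```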
rollup() expression
• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting
rollup(
  search(collection1, qt="/export",
         q="*:*",
         fl="id,manu,price",
         sort="manu asc"),
  over="manu",
  count(*),     // metrics
  max(price)    // metrics
)
{"result-set":{"docs":[
{"manu":"apple","count(*)":1.0},
{"manu":"asus","count(*)":1.0},
{"manu":"ati","count(*)":1.0},
{"manu":"belkin","count(*)":2.0},
{"manu":"canon","count(*)":2.0},
{"manu":"corsair","count(*)":3.0},
[...]
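A rollup over a sorted stream needs only one pass and constant memory per group. The Python sketch below illustrates this for the two metrics used above; the function name `rollup` and its parameters are hypothetical, and counts are kept as integers here rather than the floating-point values in the sample output.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator


def rollup(stream: Iterable[Dict], over: str, max_field: str) -> Iterator[Dict]:
    """Emit one tuple per `over` value with count(*) and max(<max_field>).

    Assumes `stream` is sorted by `over`, as the search sort above guarantees.
    """
    for key, members in groupby(stream, key=lambda t: t[over]):
        count = 0
        best = None
        for row in members:
            count += 1
            value = row.get(max_field)
            if value is not None and (best is None or value > best):
                best = value
        out = {over: key, "count(*)": count}
        if best is not None:  # skip the metric when no tuple in the group has the field
            out[f"max({max_field})"] = best
        yield out
```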
facet() expression
• Like search+rollup, but pushes down computation to JSON Facet API
facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  avg(price),
  max(popularity)
)
{"result-set":{"docs":[
{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},
{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},
{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},
{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},
{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},
{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},
{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},
[...]
Parallel Tuple Stream
[Diagram: shard replicas (Shards 1 to 3) feed two workers, one per partition (Partition 1, Partition 2); a further worker emits the final tuple stream.]
Streaming Expressions – parallel
• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
• usually makes sense to partition by what you are grouping on
• inner and outer sorts should match
parallel(collection1,
  rollup(
    search(techproducts,
           q="*:*",
           fl="id,manu,price",
           sort="manu asc",
           partitionKeys="manu"),
    over="manu"),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc")
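The reason for partitioning on the grouping field, and for matching inner and outer sorts, can be seen in a small Python simulation. Everything here is an assumption made for illustration: the function names are hypothetical, the workers run sequentially in one process, and CRC32 stands in for whatever hash Solr actually uses.

```python
import heapq
import zlib
from itertools import groupby
from typing import Dict, Iterator, List


def partition(tuples: List[Dict], key: str, workers: int) -> List[List[Dict]]:
    """Hash-partition tuples so that equal key values land on the same worker."""
    parts: List[List[Dict]] = [[] for _ in range(workers)]
    for row in tuples:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % workers
        parts[index].append(row)
    return parts


def count_by(sorted_tuples: List[Dict], key: str) -> Iterator[Dict]:
    """Per-worker rollup: count tuples in each run of equal key values."""
    for value, members in groupby(sorted_tuples, key=lambda t: t[key]):
        yield {key: value, "count(*)": sum(1 for _ in members)}


def parallel_count(tuples: List[Dict], key: str, workers: int) -> List[Dict]:
    """Run count_by on each partition, then merge the sorted worker outputs."""
    partial = []
    for part in partition(tuples, key, workers):
        part.sort(key=lambda t: t[key])      # inner sort
        partial.append(count_by(part, key))
    # The outer merge relies on every worker emitting in the same order.
    return list(heapq.merge(*partial, key=lambda t: t[key]))
```

Since each key value is handled by exactly one worker, no group is split, and the merged output needs no second aggregation step.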
Joins!
innerJoin(
search(people, q=*:*, fl="personId,name", sort="personId asc"),
search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
on="personId"
)
Also available: leftOuterJoin, hashJoin, outerHashJoin
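Both searches in the innerJoin example sort on the join key, which is what lets the join proceed as a sort-merge join. A minimal Python sketch of that technique follows. The function name `inner_join` is hypothetical, and it takes in-memory lists instead of streams to keep the code short.

```python
from typing import Dict, Iterator, List


def inner_join(left: List[Dict], right: List[Dict], on: str) -> Iterator[Dict]:
    """Sort-merge inner join. Both inputs must be sorted ascending by `on`."""
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][on], right[j][on]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-hand tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == left_key:
                j_end += 1
            # Pair every left tuple with that key against the whole run.
            while i < len(left) and left[i][on] == left_key:
                for match in right[j:j_end]:
                    yield {**left[i], **match}
                i += 1
            j = j_end
```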
More decorators
• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge – combines two or more sorted streams into one, preserving the sort order
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
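The semantics of a few of these decorators can be written down in a handful of lines of Python. The functions below are hypothetical stand-ins: they follow the one-line descriptions above, compare tuples on a single field, and hold the B side in memory, which a real streaming implementation would avoid.

```python
import heapq
from itertools import groupby
from typing import Dict, Iterable, Iterator, List


def unique(stream: Iterable[Dict], over: str) -> Iterator[Dict]:
    """Emit only the first tuple for each `over` value (stream sorted by `over`)."""
    for _, members in groupby(stream, key=lambda t: t[over]):
        yield next(members)


def top(stream: Iterable[Dict], n: int, field: str) -> List[Dict]:
    """Reorder the stream by `field`, descending, and return the top n tuples."""
    return heapq.nlargest(n, stream, key=lambda t: t[field])


def intersect(a: Iterable[Dict], b: Iterable[Dict], on: str) -> List[Dict]:
    """Tuples from A whose `on` value also occurs in B."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] in keys]


def complement(a: Iterable[Dict], b: Iterable[Dict], on: str) -> List[Dict]:
    """Tuples from A whose `on` value does not occur in B."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] not in keys]
```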
Interesting streams
• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
• checkpoints are persisted in a Solr collection
• resubmit to get new stuff
• combine with daemon stream to automatically get continuous updates over time
• further combine with update stream to push all matches to another collection
topic(checkpointCollection, dataCollection, id="topicA",
      q="solr rocks", checkpointEvery="1000")
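The "resubmit to get new stuff" behaviour comes down to remembering how far the previous call got. The Python sketch below shows that idea with an in-memory checkpoint. It is a simplification under stated assumptions: the class `Topic`, the integer `version` field, and the single checkpoint value are all hypothetical, whereas the slides say Solr persists checkpoints in a collection.

```python
from typing import Callable, Dict, List


class Topic:
    """Hypothetical in-memory topic: each poll returns only matches not seen before."""

    def __init__(self, matches: Callable[[Dict], bool]) -> None:
        self.matches = matches
        self.checkpoint = 0  # highest version delivered so far

    def poll(self, docs: List[Dict]) -> List[Dict]:
        """Return matching documents newer than the checkpoint, then advance it."""
        new = sorted(
            (d for d in docs if d["version"] > self.checkpoint and self.matches(d)),
            key=lambda d: d["version"],
        )
        if new:
            self.checkpoint = new[-1]["version"]
        return new
```

Calling `poll` in a loop plays the role the daemon stream has in the combination described above.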
jdbc() expression stream
join with other data sources!
innerJoin( // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*",
           sort="personId_i asc"),
    personId_i as personId, rating_f as rating ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
         sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
         sort="ID asc", get_column_name=true),
    ID as personId, NAME as personName, COUNTRY_NAME as country ),
  on="personId"
)
Parallel SQL
/sql Handler
• /sql handler is there by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
• May switch to Apache Calcite?
Simplest SQL Example
$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"
{"result-set":{"docs":[
{"id":"EN7800GTX/2DHTV/256M"},
{"id":"100-435805"},
{"id":"UTF8TEST"},
{"id":"SOLR1000"},
{"id":"9885A004"},
[...]
tables map to collections
SQL handler HTTP parameters
$ curl http://localhost:8983/solr/techproducts/sql \
    -d 'stmt=<sql_statement>' \
    -d 'numWorkers=4' \
    -d 'workerCollection=collection1' \
    -d 'workerZkhost=localhost:9983' \
    -d 'aggregationMode=map_reduce'
• numWorkers – currently used by GROUP BY and DISTINCT (via parallel stream)
• workerCollection – where to create intermediate workers
• workerZkhost – cluster (ZooKeeper ensemble) address
• aggregationMode – map_reduce or facet
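When the statement contains spaces or quotes, the request body has to be URL-encoded. A small Python sketch of assembling the body from the parameters listed above follows. The helper name `sql_request_body` and its keyword arguments are hypothetical; only the HTTP parameter names come from the slide.

```python
from typing import Optional
from urllib.parse import urlencode


def sql_request_body(
    stmt: str,
    num_workers: Optional[int] = None,
    worker_collection: Optional[str] = None,
    worker_zkhost: Optional[str] = None,
    aggregation_mode: Optional[str] = None,
) -> str:
    """Build a form-encoded body for the /sql handler, omitting unset options."""
    params = {"stmt": stmt}
    optional = {
        "numWorkers": num_workers,
        "workerCollection": worker_collection,
        "workerZkhost": worker_zkhost,
        "aggregationMode": aggregation_mode,
    }
    params.update({name: value for name, value in optional.items() if value is not None})
    return urlencode(params)  # spaces become "+", reserved characters are percent-encoded
```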
The WHERE clause
• WHERE clauses are all pushed down to the search layer
select id
where popularity=10                                      // simple match on numeric field "popularity"
where popularity='[5 TO 10]'                             // Solr range query (note the quotes)
where name='hard drive'                                  // phrase query on the "name" field
where name='((memory retail) AND popularity:[5 TO 10])'  // arbitrary Solr query
where name='(memory retail)' AND popularity='[5 TO 10]'  // boolean logic
Ordering and Limiting
select id,score from techproducts
where text='(memory hard drive)'
ORDER BY popularity desc // default order is score desc for limited queries
LIMIT 100
• Limited queries use /select handler
• Unlimited queries use /export handler
• fields selected need to be docValues
• fields in "order by" need to be docValues
• no "score" field allowed
More SQL examples
select distinct fieldA as fa, fieldB as fb from tableA order by fa desc, fb desc
// simple stats
select count(fieldA) as count, sum(fieldB) as sum from tableA where fieldC = 'Hello'
select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY) from tableA
where fieldC = 'term1 term2'
group by fieldA, fieldB
having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
order by sum(fieldC) asc
Solr JDBC Driver
Solr JDBC driver works with Zeppelin
More Solr6 Features
Graph Query
• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth
• This example query matches “Philip J. Fry” and all of his ancestors:
fq={!graph from=parent_id to=id}id:"Philip J. Fry"
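The traversal rule, "take the `from` values of the current documents and find documents whose `to` field matches", is a breadth-first search. The Python sketch below applies it to an in-memory list. The function name `graph_query`, the multi-valued `parent_id` lists, and the exact `max_depth` semantics (number of hops from the root) are assumptions of this sketch rather than details from the slides.

```python
from collections import deque
from typing import Dict, List, Optional


def graph_query(
    docs: List[Dict],
    roots: List[Dict],
    from_field: str = "parent_id",
    to_field: str = "id",
    max_depth: Optional[int] = None,
) -> List[Dict]:
    """Return the roots plus every document reachable by following from -> to."""
    by_to: Dict[object, List[Dict]] = {}
    for doc in docs:
        by_to.setdefault(doc.get(to_field), []).append(doc)

    seen = set()
    result = []
    queue = deque((root, 0) for root in roots)
    while queue:
        doc, depth = queue.popleft()
        if doc["id"] in seen:
            continue  # already visited; also protects against cycles
        seen.add(doc["id"])
        result.append(doc)
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for value in doc.get(from_field, []):
            for neighbour in by_to.get(value, []):
                queue.append((neighbour, depth + 1))
    return result
```

With `from=parent_id to=id`, each hop moves from a person to that person's parents, which is why the query above collects ancestors.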
Scoring changes
• For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
• can add documents of a different type and not disturb/skew scoring
• BM25 scoring by default
• tweakable on a per-fieldType basis ("k1" and "b" factors)
• classic tf-idf still available
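For reference, the role of docCount, "k1" and "b" can be seen in the usual BM25 formula, sketched here in Python. This is the textbook form for a single term, not Lucene's code: function names are hypothetical, document lengths are exact rather than compressed, and the defaults k1=1.2 and b=0.75 are the commonly used values.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF where doc_count is the number of documents that have the field."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(
    freq: float,
    doc_len: float,
    avg_doc_len: float,
    doc_count: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Single-term BM25: k1 controls tf saturation, b controls length normalization."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * freq * (k1 + 1) / (freq + norm)
```

Using the per-field document count for `doc_count` means that adding documents without the field leaves the IDF, and so the score, unchanged.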
Cross DC Replication
Thank you
yonik@cloudera.com

Solr 6 Feature Preview

  • 1. 1© Cloudera, Inc. All rights reserved. Solr 6 Feature Preview Yonik Seeley 3/09/2016
  • 2. 2© Cloudera, Inc. All rights reserved. My Background • Creator of Solr • Cloudera Engineer • LucidWorks Co-Founder • Lucene/Solr committer, PMC member • Apache Software Foundation member • M.S. in Computer Science, Stanford
  • 3. 3© Cloudera, Inc. All rights reserved. Solr 6 • Happy Birthday Solr! • 10 Years at the Apache Software Foundation as of 1/2016 • Release branch as been cut • ETA before April • Java 8+ only
  • 4. 4© Cloudera, Inc. All rights reserved. Streaming Expressions
  • 5. 5© Cloudera, Inc. All rights reserved. Solr Streaming Expressions • Generic platform for distributed computation • The basis for implementing distributed SQL • Works across entire result sets (or subsets) • normal search operations are designed for fast top-N operations • Map-reduce like "shuffle" partitions result sets for greater scalability • Worker nodes can be allocated from a collection for parallelism
  • 6. 6© Cloudera, Inc. All rights reserved. Tuple Streams • A streaming expression compiles/parses to a tuple stream • direct mapping from a streaming expression function->tuple_stream • Stream Sources – produce a tuple stream • Stream Decorators – operate on tuple streams • Designed to include streams from non-Solr systems
  • 7. 7© Cloudera, Inc. All rights reserved. search() expression $ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")' {"result-set":{"docs":[ {"score":1.0,"id":"0579B002","price":179.99}, {"score":1.0,"id":"100-435805","price":649.99}, {"score":1.0,"id":"3007WFP","price":2199.0}, {"score":1.0,"id":"VDBDB1A16"}, {"score":1.0,"id":"VS1GB400C3","price":74.99}, {"EOF":true,"RESPONSE_TIME":6}]}} resulting tuple stream
  • 8. 8© Cloudera, Inc. All rights reserved. Search Tuple Stream Shard 1 Replica 2 Shard 1 Replica 1 Shard 1 Replica 2 Shard 2 Replica 1 Shard 1 Replica 2 Shard 3 Replica 1 Worker Tuple Stream Tuple Stream /stream worker executing the "search" expression • search() is a stream source • SolrCloud aware (CloudSolrStream java class) • Fully streaming (no big buffers) • Worker node doesn't need to be a Solr node
  • 9. 9© Cloudera, Inc. All rights reserved. search expression args search( // parses to CloudSolrStream java class techproducts, // name of the collection to search zkHost="localhost:9983", // (opt) zookeeper address of collection to search qt="/select", // (opt) the request handler to use (/export is also available) rows=1000000, // (opt) number of rows to retrieve q=*:*, // query to match returned documents fl="id,price,score", // which fields to return sort="id asc, price desc", // how to sort the results aliases="id=myid,price=myprice" // (opt) renames output fields )
  • 10. 10© Cloudera, Inc. All rights reserved. reduce() streaming expression • Groups tuples by common field values • Emits one group-head per group • Each group-head contains list of tuples • "by" parameter must match up with "sort" parameter • Any partitioning should be done on same group field. reduce( search(collection1, qt="/export" q="*:*", fl="id,manu,price", sort="manu asc, price desc"), by="manu"), group(sort="price desc",n=100) ) stream operation
  • 11. rollup() expression
      • Groups tuples by common field values
      • Emits rollup value along with metrics
      • Closest equivalent to faceting
      rollup(
        search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"),
        over="manu",
        count(*), max(price)    // metrics
      )
      {"result-set":{"docs":[
        {"manu":"apple","count(*)":1.0},
        {"manu":"asus","count(*)":1.0},
        {"manu":"ati","count(*)":1.0},
        {"manu":"belkin","count(*)":2.0},
        {"manu":"canon","count(*)":2.0},
        {"manu":"corsair","count(*)":3.0},
        [...]
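A minimal Python sketch of what a rollup computes, again over in-memory dictionaries instead of a Solr stream. The function name `rollup`, its single `max_field` argument, and the sample documents are assumptions made for illustration. As with reduce, the input must already be sorted on the "over" field.

```python
from itertools import groupby


def rollup(tuples, over, max_field):
    """Emit one tuple per run of equal `over` values with count(*) and max()."""
    for value, run in groupby(tuples, key=lambda t: t[over]):
        run = list(run)
        yield {
            over: value,
            "count(*)": len(run),
            f"max({max_field})": max(t[max_field] for t in run),
        }


# Made-up sample documents, pre-sorted on "manu".
docs = [
    {"id": "a1", "manu": "apple", "price": 399.0},
    {"id": "b1", "manu": "belkin", "price": 19.95},
    {"id": "b2", "manu": "belkin", "price": 11.5},
]
rolled = list(rollup(docs, over="manu", max_field="price"))
```

Unlike reduce, only the bucket value and its metrics are emitted, so the individual tuples are discarded once they have been counted.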
  • 12. facet() expression
      • Like search+rollup, but pushes down computation to JSON Facet API
      facet(
        techproducts,
        q="*:*",
        buckets="manu",
        bucketSorts="count(*) desc",
        bucketSizeLimit=1000,
        count(*), avg(price), max(popularity)
      )
      {"result-set":{"docs":[
        {"avg(price)":129.99,"max(popularity)":7.0,"manu":"corsair","count(*)":3},
        {"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},
        {"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},
        {"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},
        {"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},
        {"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},
        {"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},
        [...]
  • 13. Parallel Tuple Stream
      [Diagram: replicas of Shards 1–3, Worker Partition 1 and Worker Partition 2, and a final Worker, connected by tuple streams]
  • 14. Streaming Expressions – parallel
      • Wraps a stream and sends it to N worker nodes
      • The first parameter is the collection to use for the intermediate worker nodes
      • partitionKeys must be provided to underlying workers
          • usually makes sense to partition by what you are grouping on
      • inner and outer sorts should match
      parallel(collection1,
        rollup(
          search(techproducts, q="*:*", fl="id,manu,price", sort="manu asc", partitionKeys="manu"),
          over="manu"),
        workers=2,
        zkHost="localhost:9983",
        sort="manu asc")
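Why the partition key should match the grouping field can be seen in a small simulation. The Python sketch below is an assumption-laden illustration: the helpers `partition` and `count_by` are hypothetical, CRC32 stands in for whatever hash Solr actually uses, and the documents are made up. Each worker rolls up only its own partition, and the outer step merges the sorted worker outputs.

```python
import heapq
import zlib
from itertools import groupby


def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    parts = [[] for _ in range(workers)]
    for t in tuples:
        parts[zlib.crc32(t[key].encode("utf-8")) % workers].append(t)
    return parts


def count_by(tuples, over):
    """Per-worker rollup: count the tuples in each run of equal `over` values."""
    return [{over: k, "count(*)": len(list(run))}
            for k, run in groupby(tuples, key=lambda t: t[over])]


manus = ["corsair", "belkin", "apple", "corsair", "canon", "belkin", "corsair", "canon"]
docs = sorted(({"id": i, "manu": m} for i, m in enumerate(manus)),
              key=lambda t: t["manu"])           # inner sort: manu asc

# Every tuple with the same "manu" lands on the same worker, so no group is split.
worker_outputs = [count_by(p, "manu") for p in partition(docs, "manu", workers=2)]

# Outer sort matches the inner sort, so a streaming merge is enough.
merged = list(heapq.merge(*worker_outputs, key=lambda t: t["manu"]))
```

If the tuples were partitioned on a different field, one manufacturer could appear in both workers' outputs and the merged stream would contain two partial counts for it.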
  • 15. Joins!
      innerJoin(
        search(people, q=*:*, fl="personId,name", sort="personId asc"),
        search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
        on="personId"
      )
      Also: leftOuterJoin, hashJoin, outerHashJoin
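Both inputs to the join on this slide are sorted on the join key, which allows a merge-style join. The following Python sketch shows that technique on in-memory lists; the function name `inner_join`, the rule that right-side fields overwrite left-side ones, and the sample data are assumptions, not Solr's actual behaviour.

```python
def inner_join(left, right, on):
    """Sort-merge inner join of two lists already sorted ascending on `on`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][on], right[j][on]
        if lk < rk:
            i += 1                      # left key has no partner yet
        elif lk > rk:
            j += 1                      # right key has no partner
        else:
            # Find the run of right-side tuples sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][on] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Made-up sample data, each list sorted on personId.
people = [{"personId": 1, "name": "Ann"},
          {"personId": 2, "name": "Bob"},
          {"personId": 3, "name": "Cy"}]
cats = [{"personId": 1, "petName": "Tom"},
        {"personId": 3, "petName": "Felix"},
        {"personId": 3, "petName": "Salem"}]
joined = inner_join(people, cats, on="personId")
```

Only the run of tuples for the current key is held at once, which is the reason a merge join needs sorted inputs while a hash join instead has to hold one whole side in memory.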
  • 16. More decorators
      • complement – emits tuples from A which do not exist in B
      • intersect – emits tuples from A which do exist in B
      • merge
      • top – reorders the stream and returns the top N tuples
      • unique – emits only the first tuple for each value
      • select – select, rename, or give default values to fields in a tuple
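Three of these decorators are easy to sketch in Python. The functions below are illustrative stand-ins with assumed names and signatures. For brevity `intersect` and `complement` collect the keys of B into a set, whereas a streaming implementation would walk two sorted streams side by side. The sample tuples are made up.

```python
def unique(stream, over):
    """Emit only the first tuple for each value of `over` (input sorted on it)."""
    previous = object()              # sentinel that equals no field value
    for t in stream:
        if t[over] != previous:
            yield t
            previous = t[over]


def intersect(a, b, on):
    """Emit tuples from `a` whose `on` value also appears in `b`."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] in keys]


def complement(a, b, on):
    """Emit tuples from `a` whose `on` value does not appear in `b`."""
    keys = {t[on] for t in b}
    return [t for t in a if t[on] not in keys]


a = list(unique([{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}], over="id"))
b = [{"id": 2}, {"id": 4}]
in_both = intersect(a, b, on="id")
only_a = complement(a, b, on="id")
```

Together `in_both` and `only_a` split A into the tuples that have a match in B and those that do not.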
  • 17. Interesting streams
      • update stream – indexes input into another SolrCloud collection!
      • daemon stream – blocks until more data is available from underlying stream
      • topic stream – a publish/subscribe messaging service
          • checkpoints are persisted in a Solr collection
          • resubmit to get new stuff
          • combine with daemon stream to automatically get continuous updates over time
          • further combine with update stream to push all matches to another collection
      topic(checkpointCollection, dataCollection, id="topicA", q="solr rocks", checkpointEvery="1000")
  • 18. jdbc() expression stream
      join with other data sources!
      innerJoin(    // example from JDBCStreamTest
        select(
          search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
          personId_i as personId,
          rating_f as rating
        ),
        select(
          jdbc(connection="jdbc:hsqldb:mem:.",
               sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
               sort="ID asc",
               get_column_name=true),
          ID as personId,
          NAME as personName,
          COUNTRY_NAME as country
        ),
        on="personId"
      )
  • 19. Parallel SQL
  • 20. /sql Handler
      • /sql handler is there by default on all Solr nodes
      • Translates SQL -> parallel streaming expressions
      • SQL tables map to SolrCloud collections
      • Query planner / optimizer
      • Currently uses Presto parser
      • May switch to Apache Calcite?
  • 22. Simplest SQL Example
      $ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"
      {"result-set":{"docs":[
        {"id":"EN7800GTX/2DHTV/256M"},
        {"id":"100-435805"},
        {"id":"UTF8TEST"},
        {"id":"SOLR1000"},
        {"id":"9885A004"},
        [...]
      • tables map to collections
  • 23. SQL handler HTTP parameters
      curl http://localhost:8983/solr/techproducts/sql -d '
        &stmt=<sql_statement>
        &numWorkers=4                    // currently used by GROUP BY and DISTINCT (via parallel stream)
        &workerCollection=collection1    // where to create intermediate workers
        &workerZkhost=localhost:9983     // cluster (ZooKeeper ensemble) address
        &aggregationMode=map_reduce | facet
      '
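The parameters on this slide are ordinary form fields, so a request can be assembled with Python's standard library. This is a minimal sketch assuming a Solr node at localhost:8983; the helper name `build_sql_request` and the example statement are hypothetical, while the parameter names are the ones listed on the slide.

```python
import urllib.parse
import urllib.request


def build_sql_request(base_url, collection, stmt, **params):
    """Build a POST request for the /sql handler (hypothetical helper)."""
    body = urllib.parse.urlencode({"stmt": stmt, **params}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/{collection}/sql", data=body)


sql_request = build_sql_request(
    "http://localhost:8983/solr",
    "techproducts",
    "select manu, count(*) from techproducts group by manu",  # illustrative statement
    numWorkers=4,
    workerCollection="collection1",
    workerZkhost="localhost:9983",
    aggregationMode="map_reduce",      # or "facet"
)

# Sending the request needs a running Solr node (an assumption of this sketch):
# with urllib.request.urlopen(sql_request) as resp:
#     print(resp.read().decode("utf-8"))
```

Form-encoding the body avoids having to escape spaces and parentheses in the SQL statement by hand.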
  • 24. The WHERE clause
      • WHERE clauses are all pushed down to the search layer
      select id
        where popularity=10                                            // simple match on numeric field "popularity"
        where popularity='[5 TO 10]'                                   // Solr range query (note the quotes)
        where name='hard drive'                                        // phrase query on the "name" field
        where name='((memory retail) AND popularity:[5 TO 10])'        // arbitrary Solr query
        where name='(memory retail)' AND popularity='[5 TO 10]'        // boolean logic
  • 25. Ordering and Limiting
      select id,score from techproducts
        where text='(memory hard drive)'
        ORDER BY popularity desc    // default order is score desc for limited queries
        LIMIT 100
      • Limited queries use /select handler
      • Unlimited queries use /export handler
          • fields selected need to be docValues
          • fields in "order by" need to be docValues
          • no "score" field allowed
  • 26. More SQL examples
      select distinct fieldA as fa, fieldB as fb
        from tableA
        order by fa desc, fb desc

      // simple stats
      select count(fieldA) as count, sum(fieldB) as sum
        from tableA
        where fieldC = 'Hello'

      select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY)
        from tableA
        where fieldC = 'term1 term2'
        group by fieldA, fieldB
        having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
        order by sum(fieldC) asc
  • 27. Solr JDBC Driver
  • 28. Solr JDBC driver works with Zeppelin
  • 29. More Solr 6 Features
  • 30. Graph Query
      • Basic (non-distributed) graph traversal query
      • Follows nodes to edges, optionally filtering during traversal
      • Currently only a "filter" query (produces a set of documents)
      • Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth
      • This example query matches “Philip J. Fry” and all of his ancestors:
      fq={!graph from=parent_id to=id}id:"Philip J. Fry"
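The traversal in the example query can be sketched as a breadth-first walk in Python. This is an illustration under assumptions: the function name `graph_query`, single-valued `parent_id` fields, treating a negative `max_depth` as unlimited, and the family data are all made up for the sketch. Starting from the root matches, each step reads the "from" field of the current documents and collects the documents whose "to" field has that value.

```python
def graph_query(docs, root_ids, from_field="parent_id", to_field="id", max_depth=-1):
    """Return the sorted ids of the roots plus everything reachable from them."""
    # Index documents by the field that edges point to.
    index = {}
    for doc in docs:
        index.setdefault(doc[to_field], []).append(doc)

    matched = {d["id"]: d for d in docs if d["id"] in root_ids}
    frontier = list(matched.values())
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = []
        for doc in frontier:
            # Follow this document's outgoing edge, if it has one.
            for target in index.get(doc.get(from_field), []):
                if target["id"] not in matched:
                    matched[target["id"]] = target
                    next_frontier.append(target)
        frontier = next_frontier
        depth += 1
    return sorted(matched)


# Made-up family data for illustration.
family = [
    {"id": "Philip J. Fry", "parent_id": "Yancy Fry, Sr."},
    {"id": "Yancy Fry, Sr.", "parent_id": "Mildred Fry"},
    {"id": "Mildred Fry"},
    {"id": "Turanga Leela", "parent_id": "Turanga Morris"},
    {"id": "Turanga Morris"},
]
ancestors = graph_query(family, {"Philip J. Fry"})
```

Tracking the already matched ids keeps the walk from looping if the data contains a cycle.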
  • 31. Scoring changes
      • For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
          • can add documents of a different type and not disturb/skew scoring
      • BM25 scoring by default
          • tweakable on a per-fieldType basis ("k1" and "b" factors)
          • classic tf-idf still available
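The effect of the docCount change can be seen in a small Python sketch of the commonly published BM25 formula. Treat it as an approximation under assumptions: the function names are hypothetical, exact document lengths are used where Lucene stores compressed length norms, and the defaults k1=1.2 and b=0.75 are the usual BM25 defaults.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """Inverse document frequency; doc_count is the number of documents
    that have the field, not the size of the whole index."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(freq, doc_len, avg_doc_len, doc_freq, doc_count, k1=1.2, b=0.75):
    """Score contribution of one term in one document."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)   # length normalisation
    return bm25_idf(doc_freq, doc_count) * freq * (k1 + 1) / (freq + norm)


# A term found in 10 of the 1,000 documents that have the field.
idf_field = bm25_idf(doc_freq=10, doc_count=1000)

# If 9,000 documents of another type (without the field) are added and the
# whole-index count were used instead, the same term would look rarer.
idf_whole_index = bm25_idf(doc_freq=10, doc_count=10000)
```

With the per-field count, `idf_field` stays the same no matter how many unrelated documents are indexed, which is the skew the slide says the change avoids.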
  • 32. Cross DC Replication
  • 33. Thank you yonik@cloudera.com