6. • Big data systems have grown too complex trying to satisfy a variety of access patterns
• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)
• Low-latency ranked retrieval (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)
• Graph traversal (Graphx, Giraph, Neo4j, …)
• De-normalization can be inconvenient as related data sets can change at different velocities
(movies vs. movie ratings)
• Leverage progress made by the Solr community to support big data in Solr using horizontal
scalability (shards & replicas)
• Don’t forget about speed ~ Search engines in general and Solr in particular are extremely fast!
Why Solr needs Parallel Computation
7. Lucidworks Fusion Is Search-Driven Everything
•Drive next generation relevance
via Content, Collaboration and
Context
•Harness best in class Open
Source: Apache Solr + Spark
•Simplify application
development and reduce
ongoing maintenance
CATALOG
DYNAMIC NAVIGATION
AND LANDING PAGES
INSTANT INSIGHTS AND
ANALYTICS
PERSONALIZED
SHOPPING EXPERIENCE
PROMOTIONS USER HISTORY
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations &
Alerts
Analytics & InsightsExtreme Relevancy
Access data from
anywhere to build
intelligent, data-
driven applications.
10. • SQL is ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer,
Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve
iteratively
SQL is natural extension for Solr’s parallel computing engine
11. Give me the top 5 action movies with rating of 4 or better
Mental Warm-up
/select?q=*:*
&fq=genre_ss:action
&fq=rating_i:[4 TO *]
&facet=true
&facet.limit=5
&facet.mincount=1
&facet.field=title_s
SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action'
AND rating_i='[4 TO *]’
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5
{
...
"facet_counts":{
"facet_fields":{
"title_s":[
"Star
Wars
(1977)",501,
"Return
of
the
Jedi
(1983)",379,
"Godfather,
The
(1972)",351,
"Raiders
of
the
Lost
Ark
(1981)",348,
"Empire
Strikes
Back,
The
(1980)",293]},
...}}
{"result-‐set":{"docs":[
{"title_s":"Star
Wars
(1977)”,"cnt":501},
{"title_s":"Return
of
the
Jedi
(1983)","cnt":379},
{"title_s":"Godfather,
The
(1972)","cnt":351},
{"title_s":"Raiders
of
the
Lost
Ark
(1981)","cnt":348},
{"title_s":"Empire
Strikes
Back,
The
(1980)","cnt":293},
{"EOF":true,"RESPONSE_TIME":42}]}}
12.
SELECT
gender_s,
COUNT(*)
as
num_ratings,
avg(rating_i)
as
avg_rating
FROM
movielens
WHERE
genre_ss='romance'
AND
age_i='[30
TO
*]'
GROUP
BY
gender_s
ORDER
BY
num_ratings
desc
SQL Examples
SELECT
title_s,
genre_s,
COUNT(*)
as
num_ratings,
avg(rating_i)
as
avg_rating
FROM
movielens
GROUP
BY
title_s,
genre_s
HAVING
num_ratings
>=
100
ORDER
BY
avg_rating
desc
LIMIT
5
SELECT
DISTINCT(user_id_i)
as
user_id
FROM
movielens
WHERE
genre_ss='documentary'
ORDER
BY
user_id
desc
Give me the avg rating for men
and women over 30 for
romance movies
Give me the top 5 rated movies
with at least 100 ratings
Give me the set of unique users
that have rated documentaries
14. • Relies on docValues (column-oriented data
structure) and /export handler
• Extreme read performance (8-10x faster than
queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes”
independently of the data collection)
• Data tier (Solr collection)
Streaming API: Nuts and Bolts
15. parallel(workers,
hashJoin(
search(movielens,
q=*:*,
fl="user_id_i,movie_id_i,rating_i",
sort="movie_id_i
asc",
partitionKeys="movie_id_i"),
hashed=search(movielens_movies,
q=*:*,
fl="movie_id_i,title_s,genre_s",
sort="movie_id_i
asc",
partitionKeys="movie_id_i"),
on="movie_id_i"
),
workers="4",
sort="movie_id_i
asc"
)
Streaming Expression Example: hashJoin
The small “right” side of the join
gets loaded into memory on
each worker node
Each shard queried by N
workers, so 4 workers x 4 shards
means 16 queries (usually all
replicas per shard are hit)
Workers collection isolates parallel
computation nodes from data nodes
16. Aggregation Modes
• Map/Reduce aggregationMode — for high cardinality aggregations and distributed joins
(requires a shuffle phase to move keys to correct worker)
curl
-‐-‐data-‐urlencode
"stmt=SELECT
user_id_i,
avg(rating_i)
as
avg_rating
FROM
movielens
GROUP
BY
user_id_i"
“http://host:port/solr/movielens/sql?aggregationMode=map_reduce”
• Facet aggregationMode — Uses JSON facet engine for high performance on low-to-moderate
cardinality fields (e.g. movies)
curl
-‐-‐data-‐urlencode
"stmt=SELECT
movie_id_i,
avg(rating_i)
as
avg_rating
FROM
movielens
GROUP
BY
movie_id_i"
“http://host:port/solr/movielens/sql?aggregationMode=facet”
17. • spark-solr project uses streaming API to pull data
from Solr into Spark jobs if docValues enabled,
see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g clicks, to
compute boosts and recommendations using
Spark
• Custom Scala script jobs to perform complex
analysis on data in Solr, e.g. sessionize request
logs
• Power rich data visualizations using Spark SQL
over Solr streaming aggregations
How we use Solr streaming API in Fusion
19. • Anomaly detection and fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Examples:
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
Graph Use Cases
20. • Some data is much more naturally represented as a graph structure
• Traditionally hard to deal with in search’s inverted index
• Solr 6.0 introduces the Graph Query Parser
• Solr 6.1 brings Graph Streaming expressions
Graph Basics
21. • Query-time, cyclic aware graph traversal is able to rank
documents based on relationships
• Provides controls for depth, filtering of results and
inclusion of root and/or leaves
• Limitations: single node/shard only
• Examples:
• http://localhost:8983/solr/graph/query?
fl=id,score&q={!graph+from=in_edge+to=out_edge}
id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!
graph+from=in_edge+to=out_edge
+traversalFilter='foo:[*+TO+15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!
graph+from=in_edge+to=out_edge+maxDepth=1}foo:
[*+TO+10]
Graph Query Parser
22. •Part of Solr’s broader Streaming Expressions capability
•Implements a powerful, breadth-first traversal
•Works across shards AND collections
•Supports aggregations
•Cycle aware
Graph Streaming Expressions (Solr 6.1)
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d ‘expr=…’ "http://localhost:
18984/solr/movielens/stream"
23. All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
24. All the Movies that viewers of Movie 161 watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")
Movie 161: “The Air Up There”
27. Comparing Graph Choices
Solr Elastic Graph Neo4J
Spark
GraphX
Best Use Case
QParser: predef.
relationships as filters
Expressions: fast,
query-based, dist.
graph ops
Term relationship
exploration
Graph ops and
querying that fit on a
single node
Large-scale, iterative
graph ops
Common Graph
Algorithms (e.g.
Pregel, Traversal)
Partial No Yes Yes
Scaling
QParser: no
Expressions: yes
Yes Master/Replica Yes
Commercial
License Required
No Yes GPLv3 No
Visualizations
GraphML support
(Gephi)
Kibana Neo4j browser 3rd party
28. Comparing Big Data SQL Choices
Solr Hive Drill SparkSQL
Secret Sauce
Push complex query
constructs into engine
(full text, spatial,
functions, etc)
Mature SQL solution
for Hadoop stack
Execute SQL over
NoSQL data sources
Spark core (optimized
shuffle, in-memory,
etc), integration of
other APIs: ML,
Streaming, GraphX
SQL Features Evolving Mature Maturing Maturing
Scaling
Linear (shards and
replicas) backed by
inverted index
Limited by Hadoop
infrastructure (table
scans)
Good, but need to
benchmark
Memory intensive;
Scale out using Spark
cluster, backed by
RDDs
Integration w/ external
systems
JDBC stream source
external tables /
plugin API
many drivers
available
DataSource API,
many systems
supported
30. • Alternate graph traversal approaches, e.g. depth-first
• Possible support for Gremlin (Graph Traversal Language from Tinker Pop)
• Additional graph algorithms (e.g. strongly conn. components, page rank)
Future Work
31. • No support for pushing >, >=, <, <= operators in
WHERE clause down into Solr as range queries;
use range syntax [4 TO *] for now
• Using Solr function queries in WHERE clause, e.g.
WHERE
location_p='{!geofilt
d=90
pt=37.773972,-‐122.431297
sfield=location_p}’
• SQL Joins (SOLR-8593)
• Port SQL layer to use Apache Calcite vs. Presto
SQL: Current Limitations and Future Plans