Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera

© Cloudera, Inc. All rights reserved.

Parallel SQL and Analytics with Solr

Yonik Seeley
Cloudera

My Background

• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford

What is Apache Solr

• Search server
  • like a database, but different indexing technology (Apache Lucene)
  • optimized for interactive results
• Columns (aka docValues) for fast scans
• Highlighting
• Faceting (category counts)
• Spatial search
• Powers search for the leading Hadoop Big Data vendors

Partial Solr Architecture

[Diagram: architecture blocks for Lucene, Solr Request Framework, Distributed Search, Facets & Statistics, JSON Facet API, Streaming Expressions, and Parallel SQL. Green blocks are newer additions.]

Different ways to calculate things in Solr

• Faceted search v1 / stats module

    facet=true&facet.field=color&facet.limit=5

• JSON Facet API (faceted search v2)

    {colors:{type:terms, field:color, limit:5}}

• Streaming expressions

    rollup(search(techproducts, q="*:*", fl="id,color", sort="color asc"),
           over="color", count(*))

• Parallel SQL

    select count(*) from techproducts where _text_='(*:*)' group by color

JSON Facet API

Faceted Search

• Breaks search results into buckets
• Generally provides bucket counts
• Allows user to filter / "drill into" results

Facet Module Goals

New Facet Module (JSON Facet API):

• Integration
• Performance
• Ease of use

[Diagram: the new facet module shown alongside other Solr features: Faceting, Search, Statistics, Joins, Grouping, Field Collapsing, Highlighting, Nested Documents, Geosearch.]

Simple JSON Facet request and response

    curl http://localhost:8983/solr/query -d '
    q=widgets&
    json.facet=
    {
      x : "avg(price)",
      y : "unique(brand)"
    }
    '

    […]
    "facets" : {
      "count" : 314,
      "x" : 102.5,
      "y" : 28
    }

Callouts: the root domain is defined by the docs matching the query; "count" is the count of docs in the bucket.
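
A minimal Python sketch of what the two stat functions above compute over the root domain. It assumes a hypothetical in-memory list of documents in place of a Solr index; the sample docs and the json_facet_stats helper are illustrative only, not Solr APIs.

```python
# Hypothetical documents standing in for the docs that match q=widgets.
docs = [
    {"id": "w1", "price": 100.0, "brand": "acme"},
    {"id": "w2", "price": 150.0, "brand": "acme"},
    {"id": "w3", "price": 50.0, "brand": "globex"},
]


def json_facet_stats(domain):
    """Mimic {x:"avg(price)", y:"unique(brand)"} over a domain (a list of docs)."""
    prices = [d["price"] for d in domain if "price" in d]
    brands = {d["brand"] for d in domain if "brand" in d}
    return {
        "count": len(domain),                                # docs in the bucket
        "x": sum(prices) / len(prices) if prices else None,  # avg(price)
        "y": len(brands),                                    # unique(brand)
    }


facets = json_facet_stats(docs)
print(facets)
```

The response shape mirrors the "facets" block above: a count for the domain plus one entry per requested stat.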

Terms facet example

    json.facet={
      shoes : {
        type : terms,
        field : shoe_style,
        sort : {x : desc},
        facet : {
          x : "avg(price)",
          y : "unique(brand)"
        }
      }
    }

    "facets": {
      "count" : 472,
      "shoes": {
        "buckets" : [
          {
            "val" : "Hiking",
            "count" : 34,
            "x" : 135.25,
            "y" : 17
          },
          {
            "val" : "Running",
            "count" : 45,
            "x" : 110.75,
            "y" : 24
          },

Callouts: x and y are calculated per bucket. Sort by any stat!

Sub-facet example

    json.facet={
      shoes:{
        type : terms,
        field : shoe_style,
        sort : {x : desc},
        facet : {
          x : "avg(price)",
          y : "unique(brand)",
          colors : {
            type : terms,
            field : color
          }
        }
      }
    }

    "facets": {
      "count" : 472,
      "shoes": {
        "buckets" : [
          {
            "val" : "Hiking",
            "count" : 34,
            "x" : 135.25,
            "y" : 17,
            "colors" : {
              "buckets" : [
                { "val" : "brown", "count" : 12 },
                { "val" : "black", "count" : 10 },
                […]
              ]
            } // end of colors sub-facet
          }, // end of Hiking bucket
          {
            "val" : "Running",
            "count" : 45,
            "x" : 110.75,
            "y" : 24,
            "colors" : {
              "buckets" : […]
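
The nesting above can be sketched in Python as a recursive bucketing step: each terms facet splits its domain into buckets, and every bucket becomes the domain for its sub-facets. This is an illustration under assumptions, not how Solr implements it; the sample docs, terms_facet, and shoe_sub_facets are hypothetical names.

```python
from collections import defaultdict

# Hypothetical documents; field names follow the example above.
docs = [
    {"shoe_style": "Hiking", "price": 140.0, "brand": "a", "color": "brown"},
    {"shoe_style": "Hiking", "price": 130.0, "brand": "b", "color": "brown"},
    {"shoe_style": "Hiking", "price": 120.0, "brand": "a", "color": "black"},
    {"shoe_style": "Running", "price": 100.0, "brand": "c", "color": "black"},
    {"shoe_style": "Running", "price": 110.0, "brand": "c", "color": "white"},
]


def terms_facet(domain, field, sub=None, sort_key="count", limit=10):
    """Bucket a domain by the values of `field`; `sub` computes per-bucket stats."""
    groups = defaultdict(list)
    for doc in domain:
        if field in doc:
            groups[doc[field]].append(doc)
    buckets = []
    for val, bucket_docs in groups.items():
        bucket = {"val": val, "count": len(bucket_docs)}
        if sub:
            bucket.update(sub(bucket_docs))  # each bucket is a new domain
        buckets.append(bucket)
    # Sort descending by the chosen stat; "val" breaks ties deterministically.
    buckets.sort(key=lambda b: (-b[sort_key], b["val"]))
    return {"buckets": buckets[:limit]}


def shoe_sub_facets(bucket_docs):
    """x = avg(price), y = unique(brand), plus a nested terms facet on color."""
    prices = [d["price"] for d in bucket_docs]
    return {
        "x": sum(prices) / len(prices),
        "y": len({d["brand"] for d in bucket_docs}),
        "colors": terms_facet(bucket_docs, "color"),
    }


result = {
    "count": len(docs),
    "shoes": terms_facet(docs, "shoe_style", sub=shoe_sub_facets, sort_key="x"),
}
print(result)
```

Sorting on "x" corresponds to sort : {x : desc} in the request: any stat computed per bucket can serve as the sort key.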

Facet Types

• Terms Facet
  • Creates new domains (facet buckets) based on values in a field
• Range Facet
  • Creates multiple buckets based on date ranges or numeric ranges
• Query Facet
  • Creates a single bucket of documents that match any given query
• Unlimited nesting: Any facet types may have any number of sub-facets
• Multi-select faceting (filter exclusion)
• Nested documents (block join)
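
A rough Python sketch of how range and query facets carve up a domain, assuming a hypothetical in-memory list of docs. The range_facet and query_facet helpers are illustrative names; the start/end/gap arguments follow the usual range-facet parameters, and a plain Python predicate stands in for a Solr query.

```python
def range_facet(domain, field, start, end, gap):
    """Create fixed-width buckets [lo, lo + gap) from start up to end."""
    buckets = []
    lo = start
    while lo < end:
        hi = lo + gap
        count = sum(1 for d in domain if field in d and lo <= d[field] < hi)
        buckets.append({"val": lo, "count": count})
        lo = hi
    return {"buckets": buckets}


def query_facet(domain, predicate):
    """Create a single bucket of the docs matching an arbitrary predicate."""
    return {"count": sum(1 for d in domain if predicate(d))}


# Hypothetical documents.
docs = [{"price": 5}, {"price": 12}, {"price": 18}, {"price": 25}, {"price": 40}]

prices = range_facet(docs, "price", start=0, end=30, gap=10)
cheap = query_facet(docs, lambda d: d["price"] < 20)
print(prices, cheap)
```

A range facet always yields several buckets, while a query facet yields exactly one; either result could then serve as the domain for further sub-facets.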

Streaming Expressions

Solr Streaming Expressions

• Generic platform for distributed computation
• The basis for implementing distributed parallel SQL
• Relational operations on streams
• Works across entire result sets (or subsets)
• Normal search operations are designed for fast top-N operations
• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism
• Incorporates streams from non-Solr systems

search() expression

    $ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'

Resulting tuple stream:

    {"result-set":{"docs":[
    {"score":1.0,"id":"0579B002","price":179.99},
    {"score":1.0,"id":"100-435805","price":649.99},
    {"score":1.0,"id":"3007WFP","price":2199.0},
    {"score":1.0,"id":"VDBDB1A16"},
    {"score":1.0,"id":"VS1GB400C3","price":74.99},
    {"EOF":true,"RESPONSE_TIME":6}]}}
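
A small Python sketch of consuming a tuple stream like the one above: read tuples from the "docs" array until the EOF marker tuple. The read_tuples helper is a hypothetical name, and the response text is simply the sample output from this slide rather than a live request.

```python
import json

# Sample response copied from the slide above (no network call is made).
response_text = """{"result-set":{"docs":[
{"score":1.0,"id":"0579B002","price":179.99},
{"score":1.0,"id":"100-435805","price":649.99},
{"score":1.0,"id":"3007WFP","price":2199.0},
{"score":1.0,"id":"VDBDB1A16"},
{"score":1.0,"id":"VS1GB400C3","price":74.99},
{"EOF":true,"RESPONSE_TIME":6}]}}"""


def read_tuples(text):
    """Yield tuples from a /stream response, stopping at the EOF tuple."""
    for doc in json.loads(text)["result-set"]["docs"]:
        if doc.get("EOF"):
            break
        yield doc


tuples = list(read_tuples(response_text))
for t in tuples:
    print(t["id"], t.get("price"))  # a tuple may lack a field, e.g. price
```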

Search Tuple Stream

[Diagram: shard replicas (Shard 1, Shard 2, Shard 3) send tuple streams to a /stream worker executing the "search" expression, which emits a tuple stream.]

• search() is a stream source
• Fully SolrCloud aware (knows cluster layout)
• Fully streaming (no big buffers)

search expression args

    search(                              // parses to CloudSolrStream Java class
      techproducts,                      // name of the collection to search
      zkHost="localhost:9983",           // (opt) zookeeper address of collection to search
      qt="/select",                      // (opt) the request handler to use (/export is also available)
      rows=1000000,                      // (opt) number of rows to retrieve
      q=*:*,                             // query to match returned documents
      fl="id,price,score",               // which fields to return
      sort="id asc, price desc",         // how to sort the results
      aliases="id=myid,price=myprice"    // (opt) renames output fields
    )
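
One way to assemble such an expression programmatically, sketched in Python. The search_expr helper is a hypothetical convenience, not part of any Solr client library; it only builds the expression string and the URL-encoded body that curl -d would send.

```python
from urllib.parse import urlencode


def search_expr(collection, **params):
    """Build a search(...) streaming expression from keyword arguments."""
    args = [collection] + ['{}="{}"'.format(k, v) for k, v in params.items()]
    return "search(" + ", ".join(args) + ")"


expr = search_expr("techproducts", q="*:*", fl="id,price,score", sort="id asc")
body = urlencode({"expr": expr})  # form-encoded POST body for the /stream handler

print(expr)
print(body)
```

Keyword arguments keep their call order, so the generated expression reads the same way as the hand-written one.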

rollup() expression

• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting

    rollup(
      search(collection1, qt="/export",
             q="*:*",
             fl="id,manu,price",
             sort="manu asc"),
      over="manu",
      count(*),
      max(price)
    )

Callout: count(*) and max(price) are the metrics.

    {"manu":"apple","count(*)":1.0},
    {"manu":"asus","count(*)":1.0},
    {"manu":"ati","count(*)":1.0},
    {"manu":"belkin","count(*)":2.0},
    {"manu":"canon","count(*)":2.0},
    {"manu":"corsair","count(*)":3.0},
    [...]
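
A rollup over a sorted tuple stream can be sketched in Python with itertools.groupby, which, like rollup(), relies on the input already being sorted on the grouping field. The sample tuples and the rollup helper below are hypothetical, and this is only an illustration of the idea, not Solr's implementation.

```python
from itertools import groupby


def rollup(stream, over, metrics):
    """Group a stream sorted on `over` and emit one tuple per group with metrics."""
    for key, group in groupby(stream, key=lambda t: t[over]):
        tuples = list(group)
        out = {over: key}
        for name, fn in metrics.items():
            out[name] = fn(tuples)
        yield out


# Hypothetical tuples, sorted by "manu" as sort="manu asc" would do.
tuples = sorted(
    [
        {"id": "1", "manu": "belkin", "price": 19.95},
        {"id": "2", "manu": "apple", "price": 399.0},
        {"id": "3", "manu": "belkin", "price": 11.5},
        {"id": "4", "manu": "canon", "price": 179.99},
    ],
    key=lambda t: t["manu"],
)

metrics = {
    "count(*)": lambda ts: len(ts),
    "max(price)": lambda ts: max(t["price"] for t in ts),
}

rolled = list(rollup(tuples, "manu", metrics))
for row in rolled:
    print(row)
```

Because groups arrive contiguously, only one group has to be held in memory at a time, which is what keeps the approach streaming.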

Parallel Tuple Stream

[Diagram: shard replicas (Shard 1, Shard 2, Shard 3) send partitioned tuples to two workers (Partition 1 and Partition 2), which feed a final worker that emits the tuple stream.]

Streaming Expressions – parallel

• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
  • usually makes sense to partition by what you are grouping on
• inner and outer sorts should match

    parallel(collection1,
      rollup(
        search(techproducts,
               q="*:*",
               fl="id,manu,price",
               sort="manu asc",
               partitionKeys="manu"),
        over="manu asc"),
      workers=2,
      zkHost="localhost:9983",
      sort="manu asc")
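
The points above can be sketched in Python: hash-partition the sorted stream on the grouping key, let each worker roll up its own partition, then merge the sorted worker outputs. This is a single-process illustration under assumptions; partition, rollup_count, and parallel_rollup are hypothetical names, and zlib.crc32 merely stands in for whatever hash Solr uses.

```python
import heapq
import zlib
from itertools import groupby


def partition(stream, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    parts = [[] for _ in range(workers)]
    for t in stream:
        idx = zlib.crc32(str(t[key]).encode("utf-8")) % workers
        parts[idx].append(t)  # input order (and so the sort) is preserved
    return parts


def rollup_count(stream, over):
    """Per-worker rollup: count tuples for each value of `over`."""
    for val, group in groupby(stream, key=lambda t: t[over]):
        yield {over: val, "count(*)": sum(1 for _ in group)}


def parallel_rollup(stream, over, workers):
    parts = partition(stream, over, workers)
    worker_outputs = [list(rollup_count(p, over)) for p in parts]
    # The outer sort matches the inner sort, so a streaming merge is enough.
    return list(heapq.merge(*worker_outputs, key=lambda t: t[over]))


# Hypothetical source tuples, sorted by "manu".
manus = ["apple", "belkin", "belkin", "canon", "canon", "corsair", "corsair", "corsair"]
source = [{"id": str(i), "manu": m} for i, m in enumerate(manus)]

merged = parallel_rollup(source, "manu", workers=2)
print(merged)
```

Partitioning on the grouping key guarantees that all tuples for one group reach the same worker, so per-worker counts are already final and never need to be combined.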

Distributed Joins!

    innerJoin(
      search(people, q=*:*, fl="personId,name", sort="personId asc"),
      search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
      on="personId"
    )

Also: leftOuterJoin, hashJoin, outerHashJoin
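
Both inputs above are sorted on the join key, which is what makes a streaming merge join possible. A minimal Python sketch of that technique follows; the inner_join helper and the sample tuples are hypothetical, and nothing here claims to match Solr's internal implementation.

```python
def inner_join(left, right, on):
    """Merge join of two lists of dicts, both sorted ascending by `on`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][on], right[j][on]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][on] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical tuples, each list sorted by personId.
people = [
    {"personId": 1, "name": "Ann"},
    {"personId": 2, "name": "Bob"},
    {"personId": 3, "name": "Cy"},
]
cats = [
    {"personId": 1, "petName": "Tom"},
    {"personId": 3, "petName": "Felix"},
    {"personId": 3, "petName": "Salem"},
]

joined = inner_join(people, cats, "personId")
for row in joined:
    print(row)
```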

More stream decorators

• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• reduce
• sort
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
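
The semantics of a few of these decorators, sketched in Python over in-memory lists. This is a simplification under assumptions: the helper names mirror the decorators but are hypothetical functions, and they use sets for membership instead of relying on sorted streams.

```python
def complement(a, b, on):
    """Tuples from a whose `on` value does not appear in b."""
    keys_b = {t[on] for t in b}
    return [t for t in a if t[on] not in keys_b]


def intersect(a, b, on):
    """Tuples from a whose `on` value does appear in b."""
    keys_b = {t[on] for t in b}
    return [t for t in a if t[on] in keys_b]


def unique(stream, over):
    """Only the first tuple seen for each value of `over`."""
    seen = set()
    out = []
    for t in stream:
        if t[over] not in seen:
            seen.add(t[over])
            out.append(t)
    return out


def top(stream, n, sort_field):
    """Reorder by `sort_field` descending and keep the first n tuples."""
    return sorted(stream, key=lambda t: t[sort_field], reverse=True)[:n]


# Hypothetical tuples.
a = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
    {"id": 4, "color": "green"},
]
b = [{"id": 2}, {"id": 4}]

print(complement(a, b, "id"))
print(intersect(a, b, "id"))
print(unique(a, "color"))
print(top(a, 2, "id"))
```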

jdbc() expression stream

Join with other data sources!

    innerJoin(
      select(
        search(collection1, [...]),
        personId_i as personId,
        rating_f as rating
      ),
      select(
        jdbc(connection="jdbc:hsqldb:mem:.",
             sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
             sort="ID asc",
             get_column_name=true),
        ID as personId,
        NAME as personName,
        COUNTRY_NAME as country
      ),
      on="personId"
    )

Parallel SQL

/sql Handler

Why SQL?
  • External integrations
  • Higher level language – says what we want, not how to get it
  • SQL has made a comeback along with big data, more ubiquitous than ever

• /sql REST endpoint by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Currently uses Presto SQL parser
  • Switch to Apache Calcite parser in the works

Simplest SQL Example

    $ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"

    {"id":"EN7800GTX/2DHTV/256M"},
    {"id":"100-435805"},
    {"id":"UTF8TEST"},
    {"id":"SOLR1000"},
    {"id":"9885A004"},
    [...]

Callout: tables map to collections.

SQL handler HTTP parameters

    curl http://localhost:8983/solr/techproducts/sql -d '
    &stmt=<sql_statement>
    &numWorkers=4                    // currently used by GROUP BY and DISTINCT (via parallel stream)
    &workerCollection=collection1    // where to create intermediate workers
    &workerZkhost=localhost:9983     // cluster (zookeeper ensemble) address
    &aggregationMode=map_reduce | facet
    '
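
A short Python sketch that assembles these parameters into a form-encoded request body. The build_sql_request helper and the example statement are hypothetical; only the parameter names come from the slide, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode


def build_sql_request(stmt, num_workers=None, worker_collection=None,
                      worker_zkhost=None, aggregation_mode=None):
    """Return a URL-encoded body for the /sql handler; optional params are skipped."""
    params = {"stmt": stmt}
    optional = {
        "numWorkers": num_workers,
        "workerCollection": worker_collection,
        "workerZkhost": worker_zkhost,
        "aggregationMode": aggregation_mode,  # "map_reduce" or "facet"
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    return urlencode(params)


body = build_sql_request(
    "select manu, count(*) from techproducts group by manu",
    num_workers=4,
    worker_collection="collection1",
    worker_zkhost="localhost:9983",
    aggregation_mode="map_reduce",
)
print(body)

# Round-trip check: decoding the body recovers the original values.
decoded = parse_qs(body)
print(decoded["stmt"][0])
```

Encoding the statement matters because SQL contains spaces, commas, and parentheses that are not safe in a raw form body.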

The WHERE clause

• WHERE clauses are all pushed down to the search layer

    select id
    where popularity=10                                        // simple match on numeric field "popularity"
    where popularity='[5 TO 10]'                               // solr range query (note the quotes)
    where name='hard drive'                                    // phrase query on the "name" field
    where name='((memory retail) AND popularity:[5 TO 10])'    // arbitrary solr query
    where name='(memory retail)' AND popularity='[5 TO 10]'    // boolean logic

Ordering and Limiting

    select id,score from techproducts
    where text='(memory hard drive)'
    ORDER BY popularity desc    // default order is score desc for limited queries
    LIMIT 100

• Limited queries use /select handler
• Unlimited queries use /export handler
  • fields selected need to be docValues
  • fields in "order by" need to be docValues
  • no "score" field allowed

More SQL examples

    select distinct fieldA as fa, fieldB as fb from tableA order by fa desc, fb desc

    // simple stats
    select count(fieldA) as count, sum(fieldB) as sum from tableA where fieldC = 'Hello'

    select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY) from tableA
    where fieldC = 'term1 term2'
    group by fieldA, fieldB
    having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
    order by sum(fieldC) asc

Solr JDBC Driver

Solr JDBC driver works with Apache Zeppelin

Graph Traversal

Graph Filter

• Follows ad hoc edges
• Not distributed!
  • still usable on partitioned data
• Can filter on each hop
• Can specify max depth
• Cycle detection

    fq={!graph from=parents to=id}id:"Philip J. Fry"

[Diagram: documents linked through their "parents" field]

    id : "Philip J. Fry"           parents:["Yancy Fry, Sr.","Mrs. Fry"]
    id : "Yancy Fry"               parents:["Yancy Fry, Sr.","Mrs. Fry"]
    id : "Yancy Fry, Sr."          parents:["Mildred","Philip J. Fry"]
    id : "Mrs. Fry"                parents:["Mr. Gleisner","Mrs. Gleisner"]
    id : "Mildred"
    id : "Hubert J. Farnsworth"
    id : "Philip J. Fry"           parents:["Yancy Fry, Sr.","Mrs. Fry"]    Cycle!
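
A breadth-first traversal with cycle detection and an optional depth limit, sketched in Python over the documents from the diagram. It assumes, as the diagram suggests, that each hop goes from a document's "parents" values to the documents whose id matches; graph_filter and its parameters are hypothetical names, not the Solr query parser's API.

```python
from collections import deque

# Documents from the diagram: id -> parents.
docs = {
    "Philip J. Fry": ["Yancy Fry, Sr.", "Mrs. Fry"],
    "Yancy Fry": ["Yancy Fry, Sr.", "Mrs. Fry"],
    "Yancy Fry, Sr.": ["Mildred", "Philip J. Fry"],
    "Mrs. Fry": ["Mr. Gleisner", "Mrs. Gleisner"],
    "Mildred": [],
    "Hubert J. Farnsworth": [],
}


def graph_filter(docs, roots, max_depth=-1, hop_filter=None):
    """Return ids reachable from roots via parents -> id edges, in BFS order."""
    visited = set()
    order = []
    queue = deque((r, 0) for r in roots if r in docs)
    while queue:
        doc_id, depth = queue.popleft()
        if doc_id in visited:
            continue  # cycle detection: never expand a document twice
        visited.add(doc_id)
        order.append(doc_id)
        if max_depth >= 0 and depth >= max_depth:
            continue  # depth limit reached, do not follow further edges
        for parent in docs[doc_id]:
            # Only ids that exist as documents can be matched.
            if parent in docs and parent not in visited:
                if hop_filter is None or hop_filter(parent):
                    queue.append((parent, depth + 1))
    return order


ancestors = graph_filter(docs, ["Philip J. Fry"])
one_hop = graph_filter(docs, ["Philip J. Fry"], max_depth=1)
print(ancestors)
print(one_hop)
```

The edge from "Yancy Fry, Sr." back to "Philip J. Fry" is the cycle; the visited set keeps the traversal from looping on it.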

Graph streaming expressions

• Breadth-first graph traversals
• Fully integrated with streaming, fully distributed
• Traverse across collections as well as shards
• Compute aggregations

    curl http://localhost:8983/solr/emails/stream -d '
    expr=gatherNodes(emails,
                     walk="johndoe@apache.org->from",
                     gather="to")
    '

Graph streaming expressions example

• Index some books in one collection

    curl http://localhost:8983/solr/books/update -H 'Content-type:text/csv' -d '
    id,cat,pubyear_i,title,author,series_s,sequence_i
    book1,fantasy,2000,A Storm of Swords,George R.R. Martin,A Song of Ice and Fire,3
    [...]Feast for Crows,George R.R. Martin,A Song of Ice and Fire,4
    [...]Dance with Dragons,George R.R. Martin,A Song of Ice and Fire,5
    book4,sci-fi,1987,Consider Phlebas,Iain M. Banks,The Culture,1
    book5,sci-fi,1988,The Player of Games,Iain M. Banks,The Culture,2
    book6,sci-fi,1990,Use of Weapons,Iain M. Banks,The Culture,3
    book7,fantasy,1984,Shadows Linger,Glen Cook,The Black Company,2
    book8,fantasy,1984,The White Rose,Glen Cook,The Black Company,3
    book9,fantasy,1989,Shadow Games,Glen Cook,The Black Company,4
    book10,sci-fi,2001,Gridlinked,Neal Asher,Ian Cormac,1
    book11,sci-fi,2003,The Line of Polity,Neal Asher,Ian Cormac,2
    book12,sci-fi,2005,Brass Man,Neal Asher,Ian Cormac,3
    '

Graph streaming expressions example

• Index some book reviews into another collection

    curl http://localhost:8983/solr/reviews/update -H 'Content-type:text/csv' -d '
    id,book_s,user_s,rating_i,review_t
    book1_r1,book1,Yonik,5,awesome book!
    book1_r2,book1,Aarav,2,too bloody
    book1_r3,book1,Haruka,5,awesome world building
    book2_r1,book2,Yonik,5,another great one
    book2_r2,book2,Maria,5,wow!
    book4_r1,book4,Yonik,2,i am lying... actually liked it
    book4_r2,book4,Aarav,5,Loved it
    book7_r1,book7,Yonik,4,read back in college but it was good
    book10_r1,book10,Maria,5,I want a gridlink!
    book11_r1,book11,Maria,1,Blech
    book11_r2,book11,Aarav,4,is this the first book?
    book12_r1,book12,Yonik,5,Mr. Crane is scary...
    '

1. Find books I like
2. Find who else rated those books highly
3. Find other books they rated highly
4. Profit!

1. Search expression to find my high ratings

    URL="http://localhost:8983/solr/reviews/stream"

    # Use search expression to find reviews where I gave the book a "5"
    curl $URL -d 'expr=search(reviews, q="user_s:Yonik AND rating_i:5",
                              fl="id,book_s,user_s,rating_i", sort="user_s asc")'

    {"rating_i":5,"id":"book2_r1","user_s":"Yonik","book_s":"book2"},




2. gatherNodes expression to find users

    curl $URL -d 'expr=gatherNodes(reviews,
        search(reviews, q="user_s:Yonik AND rating_i:5",
               fl="book_s,user_s,rating_i", sort="user_s asc"),
        walk="book_s->book_s",
        gather="user_s",
        fq="rating_i:[4 TO *] -user_s:Yonik",
        trackTraversal=true )'

    {"node":"Haruka","collection":"reviews","field":"user_s","ancestors":["book1"],"level":1},
    {"node":"Maria","collection":"reviews","field":"user_s","ancestors":["book2"],"level":1},

Callout on the output: "gather" values.

3. gatherNodes to find high ratings by those users

    curl $URL -d 'expr=gatherNodes(reviews,
        gatherNodes(reviews,
            search(reviews, q="user_s:Yonik AND rating_i:5",
                   fl="id,book_s,user_s,rating_i", sort="user_s asc"),
            walk="book_s->book_s",
            gather="user_s", fq="rating_i:[4 TO *] -user_s:Yonik"),
        walk="node->user_s", gather="book_s", fq="rating_i:[4 TO *]",
        avg(rating_i),
        trackTraversal=true)'

    {"node":"book10","avg(rating_i)":5.0,"field":"book_s","level":2,"collection":"reviews","ancestors":["Maria"]},
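
The three steps above can be mimicked in plain Python over the review rows indexed earlier. This is an illustrative sketch, not Solr code: the recommend helper is a hypothetical name, and it assumes that books already visited at level 0 are not gathered again at level 2, which is consistent with the output shown.

```python
# (id, book_s, user_s, rating_i) rows from the reviews collection above.
reviews = [
    ("book1_r1", "book1", "Yonik", 5),
    ("book1_r2", "book1", "Aarav", 2),
    ("book1_r3", "book1", "Haruka", 5),
    ("book2_r1", "book2", "Yonik", 5),
    ("book2_r2", "book2", "Maria", 5),
    ("book4_r1", "book4", "Yonik", 2),
    ("book4_r2", "book4", "Aarav", 5),
    ("book7_r1", "book7", "Yonik", 4),
    ("book10_r1", "book10", "Maria", 5),
    ("book11_r1", "book11", "Maria", 1),
    ("book11_r2", "book11", "Aarav", 4),
    ("book12_r1", "book12", "Yonik", 5),
]


def recommend(reviews, me, min_rating=4):
    # Level 0: books I rated 5 (the inner search expression).
    liked = {b for _, b, u, r in reviews if u == me and r == 5}
    # Level 1: other users who rated any of those books highly.
    peers = {u for _, b, u, r in reviews
             if b in liked and r >= min_rating and u != me}
    # Level 2: books those users rated highly, skipping books already visited.
    ratings = {}
    for _, b, u, r in reviews:
        if u in peers and r >= min_rating and b not in liked:
            ratings.setdefault(b, []).append(r)
    averages = {b: sum(rs) / len(rs) for b, rs in ratings.items()}
    return liked, peers, averages


liked, peers, recommendations = recommend(reviews, "Yonik")
print(sorted(liked), sorted(peers), recommendations)
```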


Retrieving complete traversal

    curl $URL -d [...], scatter="branches,leaves")'

    {"node":"book12","collection":"reviews","field":"book_s","level":0},

    {"node":"Haruka","collection":"reviews","field":"user_s","level":1},
    {"node":"Maria","collection":"reviews","field":"user_s","level":1},
    {"node":"book10","avg(rating_i)":5.0,"field":"book_s","level":2,
     "collection":"reviews","ancestors":["Maria"]},


More graph expressions

• shortestPath
  • Finds the shortest path between "from" and "to"
• scoreNodes: tf-idf inspired scoring
  • wraps a gatherNodes expression that finds the co-occurrence count
  • tf factor – the co-occurrence count
  • idf factor – boosts nodes that are rarer overall
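
The scoring idea can be illustrated with a textbook tf * log(N / df) weighting in Python. The exact formula scoreNodes uses may differ; the score_nodes helper, the sample counts, and the collection size are all hypothetical.

```python
import math


def score_nodes(cooccurrence, doc_freq, num_docs):
    """Rank nodes by tf * idf, where tf is the co-occurrence count."""
    scored = []
    for node, tf in cooccurrence.items():
        idf = math.log(num_docs / doc_freq[node])  # rarer nodes get a larger boost
        scored.append((node, tf * idf))
    # Highest score first; the node name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical counts: "bread" co-occurs more often but is common overall.
cooccurrence = {"bread": 10, "caviar": 4}
doc_freq = {"bread": 900, "caviar": 20}

scored = score_nodes(cooccurrence, doc_freq, num_docs=1000)
for node, score in scored:
    print(node, round(score, 3))
```

With these numbers the rarer node outranks the one with the higher raw co-occurrence count, which is the effect the idf factor is meant to have.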

Network analysis and visualization

    curl http://localhost:8983/solr/reviews/graph -d [...],
        scatter="branches,leaves")'

    <?xml version="1.0" encoding="UTF-8"?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
    <graph id="G" edgedefault="directed">
      <node id="book12">
        <data key="field">book_s</data>
        <data key="level">0</data>
      </node>
      <node id="book1">
        <data key="field">book_s</data>
    [...]

JSON Facet API

• More focused on web-scale interactive responses
• Tighter integration
  • Utilizes existing distributed search framework / just another search component
  • single request-response top-N, grouping, highlighting, faceting, etc.
  • block join / nested document support
• More expressive?

Streaming Expressions

• More general purpose, larger scope
  • wrap streams within streams to do pretty much anything
  • not tied to documents (analytics across joins w/ external DBs)
  • update streams, machine learning streams, etc.
• Exact results (e.g. cardinality)
• Distributed joins, graph
• Increasingly will use JSON Facet API to push work to leaves
