SOLR SEARCH WITH SPARK FOR BIG DATA ANALYTICS IN ACTION
Romain Rigaux
GOALS

Build a Web app
Quickly explore data
… with Solr
Make Solr / Hadoop easier to use
+
ARCHITECTURE

“Just a view” on top of the standard Solr API
REST
HISTORY

V1 USER
HISTORY

V1 ADMIN
ARCHITECTURE

NEXT!
Lots of learning, UX boost needed
Simple: no need to know it is Solr
HISTORY

V2 USER
HISTORY

V2 ADMIN
HISTORY

V2 BETTER UX
ARCHITECTURE
/select
/admin/collections
/get
/luke...
/add_widget
/zoom_in
/select_facet
/select_range...
REST AJAX
Templates + JS Model
www….
ARCHITECTURE

UI FOR FACETS
Layout: All the 2D positioning (cell ids), visual, drag & drop
Collection: Dashboard, fields, template, widgets (ids)
Query: Search terms, selected facets (q, fqs)
ADDING A WIDGET

LIFECYCLE
Load the initial page
Edit mode and Drag&Drop
/solr/zookeeper/clusterstate.json
/solr/admin/luke…
/get_collection
ADDING A WIDGET

LIFECYCLE
/solr/select?stats=true /new_facet
Select the field
Guess ranges (number or dates)
Rounding (number or dates)
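The guessing and rounding steps can be sketched for a numeric field as below. This is only an illustration, not Hue's actual code: `guess_range` and its `slots` parameter are hypothetical, and the minimum and maximum are assumed to come from the `stats=true` response. Dates would need their own rounding units.

```python
import math

def guess_range(stats_min, stats_max, slots=10):
    """Pick a rounded start, end and gap for a numeric range facet.

    stats_min / stats_max are assumed to come from a stats=true response;
    slots is a hypothetical target number of buckets.
    """
    if stats_max <= stats_min:
        return {"start": stats_min, "end": stats_min + 1, "gap": 1}
    raw_gap = (stats_max - stats_min) / slots
    # Round the gap up to 1, 2, 5 or 10 times a power of ten.
    magnitude = 10 ** math.floor(math.log10(raw_gap))
    gap = next(m * magnitude for m in (1, 2, 5, 10) if m * magnitude >= raw_gap)
    # Snap both ends outward to multiples of the gap.
    start = math.floor(stats_min / gap) * gap
    end = math.ceil(stats_max / gap) * gap
    return {"start": start, "end": end, "gap": gap}
```

The rounded values can then be sent as the `facet.range.start`, `facet.range.end` and `facet.range.gap` parameters of the new widget.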
ADDING A WIDGET

LIFECYCLE
Query part 1
Query part 2
Augment Solr response
facet.range={!ex=bytes}bytes&f.bytes.facet.range.start=0&f.bytes.facet.range.end=9000000&f.bytes.facet.range.gap=900000&f.bytes.facet.mincount=0&f.bytes.facet.limit=10
q=Chrome&fq={!tag=bytes}bytes:[900000+TO+1800000]
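A minimal sketch of how the two query parts could be assembled. The helper names are hypothetical and this is not Hue's implementation. The point it shows is the pairing of `{!tag=...}` on the filter with `{!ex=...}` on the facet, so the widget still shows all buckets after one is selected. `facet=true` is added because Solr needs it to enable faceting.

```python
from urllib.parse import urlencode

def range_facet_params(field, start, end, gap, limit=10):
    """Parameters for one range facet that ignores its own filter."""
    prefix = "f.%s.facet." % field
    return [
        ("facet", "true"),
        # {!ex=...} excludes the filter tagged with the same name.
        ("facet.range", "{!ex=%s}%s" % (field, field)),
        (prefix + "range.start", start),
        (prefix + "range.end", end),
        (prefix + "range.gap", gap),
        (prefix + "mincount", 0),
        (prefix + "limit", limit),
    ]

def selection_params(q, field, low, high):
    """Search terms plus a tagged filter for the selected bucket."""
    fq = "{!tag=%s}%s:[%s TO %s]" % (field, field, low, high)
    return [("q", q), ("fq", fq)]

params = range_facet_params("bytes", 0, 9000000, 900000) + selection_params(
    "Chrome", "bytes", 900000, 1800000)
query_string = urlencode(params)  # percent-encoded, ready to append to /select?
```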
{
  'facet_counts': {
    'facet_ranges': {
      'bytes': {
        'start': 10000,
        'counts': [
          '900000',
          3423,
          '1800000',
          339,
          ...
        ]
      }
    }
  }
}
{
  ...,
  'normalized_facets': [
    {
      'extraSeries': [],
      'label': 'bytes',
      'field': 'bytes',
      'counts': [
        {
          'from': '900000',
          'to': '1800000',
          'selected': True,
          'value': 3423,
          'field': 'bytes',
          'exclude': False
        }
      ], ...
    }
  ]
}
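The augmentation step can be sketched as follows. It is an illustration only: `normalize_range_facet` and its arguments are hypothetical, and the output keys simply mirror the normalized structure shown on the slide. Solr returns range counts as a flat list alternating bucket start and count, so the function pairs them up.

```python
def normalize_range_facet(field, solr_facet, gap, selected_from=None):
    """Turn Solr's flat [value, count, value, count, ...] list into bucket dicts."""
    flat = solr_facet["counts"]
    buckets = []
    # Pair each bucket start (even index) with its count (odd index).
    for value, count in zip(flat[0::2], flat[1::2]):
        buckets.append({
            "from": value,
            "to": str(int(value) + gap),
            "selected": value == selected_from,
            "value": count,
            "field": field,
            "exclude": False,
        })
    return {"extraSeries": [], "label": field, "field": field, "counts": buckets}

response = {"facet_counts": {"facet_ranges": {
    "bytes": {"counts": ["900000", 3423, "1800000", 339]}}}}
facet = normalize_range_facet(
    "bytes", response["facet_counts"]["facet_ranges"]["bytes"],
    gap=900000, selected_from="900000")
```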
JSON TO WIDGET

{
  "field": "rate_code",
  "counts": [
    {
      "count": 97797,
      "exclude": true,
      "selected": false,
      "value": "1",
      "cat": "rate_code"
    } ...
{
  "field": "medallion",
  "counts": [
    {
      "count": 159,
      "exclude": true,
      "selected": false,
      "value": "6CA28FC49A4C49A9A96",
      "cat": "medallion"
    } ...
{
  "extraSeries": [],
  "label": "trip_time_in_secs",
  "field": "trip_time_in_secs",
  "counts": [
    {
      "from": "0",
      "to": "10",
      "selected": false,
      "value": 527,
      "field": "trip_time_in_secs",
      "exclude": true
    } ...
{
  "field": "passenger_count",
  "counts": [
    {
      "count": 74766,
      "exclude": true,
      "selected": false,
      "value": "1",
      "cat": "passenger_count"
    } ...
REPEAT

UNTIL…
GAME CHANGER!
Possibilities
5.1 / 5.2
Analytic Facets
FACET

FUNCTIONS
Count
Sum
Avg
Percentile
Max
...
Count(id)
Sum(bytes)
Avg(mul(price, quantity))
Percentile(salary, 50, 90)
Max(temperature)
...
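A sketch of how such functions could be sent through Solr's `json.facet` parameter. The statistic names on the left are hypothetical. Solr spells the function names in lowercase, and it returns the number of matching documents on its own, so no count entry is requested here.

```python
import json

# Hypothetical statistic names mapped to Solr aggregation function strings.
facet_functions = {
    "total_bytes": "sum(bytes)",
    "avg_order": "avg(mul(price,quantity))",
    "salary_pct": "percentile(salary,50,90)",
    "max_temp": "max(temperature)",
}

def facet_request(q, functions):
    """Form parameters for a request computing the statistics over all matches."""
    # rows=0 because only the aggregated values are wanted, not documents.
    return {"q": q, "rows": 0, "json.facet": json.dumps(functions, sort_keys=True)}

params = facet_request("*:*", facet_functions)
```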
FACET

FUNCTIONS
SUB “NESTED”

FACETS
top_os: {
  type: terms,
  field: os,
  limit: 5
}
top_os: {
  type: terms,
  field: os,
  limit: 5,
  facet: {
    by_country: {
      type: terms,
      field: country
    }
  }
}
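Nesting is just a matter of placing one facet definition under the `facet` key of another. A small sketch, with a hypothetical `terms_facet` helper, that produces the nested `top_os` request as a `json.facet` payload:

```python
import json

def terms_facet(field, limit=None, sub_facets=None):
    """Build one terms facet; sub_facets are computed inside each bucket."""
    # The JSON Facet API spells this facet type "terms".
    facet = {"type": "terms", "field": field}
    if limit is not None:
        facet["limit"] = limit
    if sub_facets:
        facet["facet"] = sub_facets
    return facet

json_facet = {
    "top_os": terms_facet("os", limit=5, sub_facets={
        "by_country": terms_facet("country"),
    })
}
payload = json.dumps(json_facet)  # value for the json.facet request parameter
```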
FUNCTION + NESTED =

ANALYTICS
states: {
  type: terms,
  field: state,
  facet: {
    by_month: {
      type: range,
      field: time,
      start: "TODAY-6MONTHS",
      end: "TODAY",
      gap: "MONTH",
      facet: {
        avg_sal: "avg(salary)"
      }
    }
  }
}
states: {
  type: terms,
  field: state,
  facet: {
    avg_sal: "avg(salary)"
  }
}
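On the way back, nested buckets have to be flattened before a widget can chart them. A minimal sketch, assuming the `buckets` / `val` layout of the JSON Facet API response. The sample values are made up for illustration and `flatten_buckets` is a hypothetical helper.

```python
def flatten_buckets(facets, names, stat):
    """Return one row per innermost bucket: (val_1, ..., val_n, stat value)."""
    def walk(node, remaining, path):
        if not remaining:
            yield tuple(path) + (node.get(stat),)
            return
        for bucket in node.get(remaining[0], {}).get("buckets", []):
            yield from walk(bucket, remaining[1:], path + [bucket["val"]])
    return list(walk(facets, list(names), []))

# Hypothetical "facets" section of a response, with invented numbers.
sample = {"count": 4, "states": {"buckets": [
    {"val": "CA", "count": 3, "by_month": {"buckets": [
        {"val": "2015-05-01T00:00:00Z", "count": 2, "avg_sal": 5100.0},
        {"val": "2015-06-01T00:00:00Z", "count": 1, "avg_sal": 4800.0}]}},
    {"val": "NY", "count": 1, "by_month": {"buckets": [
        {"val": "2015-06-01T00:00:00Z", "count": 1, "avg_sal": 6000.0}]}},
]}}
rows = flatten_buckets(sample, ["states", "by_month"], "avg_sal")
```

Each row carries one value per nesting level plus the statistic, which is the "nD functions" shape: state by month by average salary.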
OPERATIONS ON

BUCKETS OF DATA
Counts	
  →	
  Functions
OPERATIONS ON

BUCKETS OF DATA
Nested	
  →	
  nD	
  functions
SEARCH AS ONLY

APP IN HUE
gethue.com/solr-search-ui-only/
• Spark in your browser
• Notebooks
• New REST Server
SPARK

INDEXING
WHAT
• Open source REST for Spark Shell
• Runs locally or inside YARN
• Spark Scala, PySpark and jar/py submission
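To give an idea of what "REST for Spark Shell" looks like from a client, here is a sketch that only builds the HTTP requests, without sending them. The endpoints, the `kind` field and port 8998 follow today's Apache Livy REST API and are assumptions here. The server bundled with Hue at the time of this talk may have differed.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumed host and port

def create_session_request(kind="pyspark"):
    """Build (without sending) the request that starts an interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(LIVY_URL + "/sessions", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

def run_statement_request(session_id, code):
    """Build (without sending) the request that runs a snippet in a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    url = "%s/sessions/%d/statements" % (LIVY_URL, session_id)
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

session_req = create_session_request()
statement_req = run_statement_request(0, "sc.parallelize(range(100)).count()")
```

Passing either object to `urllib.request.urlopen` would send it to a running server.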
SPARK

INDEXING
WHAT
https://github.com/cloudera/hue/tree/master/apps/spark/java
LIVY ARCH
LOCAL: Livy Server, Livy REPL, Spark Contexts, Spark Worker
YARN: Livy Server, YARN Master, YARN Node (Livy REPL, Spark Context / PySpark), YARN Node (Spark Worker), YARN Node (Spark Worker)
Diagram steps: 1, 2, 3, 4
SPARK STREAMING
Real time!
Spark Solr
• Python
• Scala
• Charts
NOTEBOOKS / SHELL
WHAT
DEMO
TIME

• Analyze Bay Area bike share
• Visualize one year of data
• Know your users, predict behavior
MISSED

SOMETHING?
demo.gethue.com
• Full Analytics
• Easier indexing
• Geo
• Export/Share results
• Solr Joins, Solr SQL
• Spark, SQL... integration, Hue 4
WHAT’S NEXT
NEW FEATURES
TWITTER
@gethue
USER GROUP
hue-user@
WEBSITE
http://gethue.com
LEARN
http://learn.gethue.com
THANKS!


20150627 bigdatala