SlideShare a Scribd company logo
1 of 27
Download to read offline
Building Analytics Applications
with Streaming Expressions in
Apache Solr
Amrit Sarkar
Search Engineer, Lucidworks Inc
@sarkaramrit2
#Activate18 #ActivateSearch
Agenda
• Parallel Computing Framework
Introduction to
• Streaming API
• Streaming Expressions
• Types of Expressions
• Shuffling
• Workers
• Real-Life Use Cases
• Demo Application
• Performance Analysis
Challenges building applications on real-time data
• Searching and filtering the data
before performing analytics.
• Executing complex operations &
co-relations on unstructured and
non-preprocessed data is time
consuming.
• Dependencies on multiple tools
leading to higher maintenance cost.
Data visualizer
Client Client
Database
Real-time
Updates
• Streaming API
• Streaming Expressions
• Shuffling
• Worker collections
• Parallel SQL
Parallel Computing
Framework
Available in SolrCloud mode
Streaming API
• Java API for parallel computation
• Real-time Mapreduce and Parallel Relational Algebra
• Results are streams of tuples (key/value) (TupleStream)
• org.apache.solr.client.solrj.io.*
ParallelStream pstream =
(ParallelStream) streamFactory.constructStream("parallel(collectionName, ……....)");
pstream.open();
Streaming Expressions
curl --data-urlencode ‘expr=
search(gettingstarted,
zkHost=”localhost:9983",
qt=”/export”,
q=”hatchbacks”,
fq=”year:2014”,
fl=”id, model_name”,
sort=”id asc”))’
http://localhost:8983/solr/
gettingstarted/stream
Use case: perform full index
search and retrieve specific
fields sorted
• String Query Language and Serialisation
format for the Streaming API
• Streaming expressions compile to
TupleStream; TupleStream serialise to
Streaming Expressions
• Can be used directly via HTTP to SolrJ
• Expressions can be executed against Solr
API: /solr/<collection-name>/stream
Streaming Expressions
curl --data-urlencode ‘expr=
search(gettingstarted,
zkHost=”localhost:9983",
qt=”/export”,
q=”hatchbacks”,
fq=”year:2014”,
fl=”id, model_name”,
sort=”id asc”))’
http://localhost:8983/solr/
gettingstarted/stream
Use case: perform full index
search and retrieve specific
fields sorted
Streaming Expressions
• Stream Sources
The origin of a TupleStream
search, facet, jdbc, stats, topic, timeseries, train and more..
• Stream Decorators
Wrap other stream functions and perform operations on the stream, row wise
complement, hashJoin, innerJoin, merge, intersect, top, unique and more..
• Stream evaluators
evaluate (calculate) new values based on other values in a tuple, column wise
add, eq, div, mul, sub, length, asin, acos, abs, if:then and more..
Streaming Expressions - Use cases
Use case: Destinations reachable with single stop from ‘New York’
(graphical traversal)
nodes(distances,
nodes(distances,
walk="New York->source_s",
gather="destination_s"),
walk="node->source_s",
gather="destination_s",
trackTraversal="true",
scatter="branches,leaves")
Solr indexes are stored in ‘token’ to ‘document-ids’ format, ‘nodes’ perform BFS on field tokens.
Streaming Expressions - Use cases
Use case: Determine most relevant terms on dynamic data set
significantTerms(
enron-emails,
q="To:Tim Belden",
field="content",
limit="2",
minDocFreq="10",
maxDocFreq=".20",
minTermLength="5"
)
Solr indexes are stored in ‘token’ to ‘document-ids’ format, ‘significantTerms’ aggregates over tokens.
Streaming Expressions - Use cases
Use case: Calculate useful metrics on data fetched from various sources.
• conversion ratio (conversions to clicks)
• CTR (clicks to impressions)
• cost ratio (conversions to currency cost)
campaign_id_s org_id_s conversions_i impressions_i clicks_i
cmp-01 org-01 4 134 48
cmp-02 org-02 2 174 26
cmp-03 org-01 6 152 49
cmp-01 org-01 5 154 27
cmp-02 org-01 9 176 38
cmp-03 org-01 5 137 83
cmp-01 org-01 3 154 36
cmp-02 org-02 1 178 35
cmp-03 org-01 7 124 49
……... ……... ……... ……... ……...
campaign_id_s currency_cost_i
cmp-01 6600
cmp-02 5840
cmp-03 8400
Events captured in solr collection
‘weekly_data’
Campaign costs stored in mysql table
‘cost’
Streaming Expressions - Use cases
facet(weekly_data, q="org_id_s:org-01", buckets="campaign_id_s",
bucketSorts="campaign_id_s asc", bucketSizeLimit=100,
sum(conversations_i), sum(impressions_i), sum(clicks_i)),
Use case: Join cost data with aggregated conversions, clicks and impressions per campaign
for organisation ‘org-01’
select(
campaign_id_s as campaign_id_s, sum(conversations_i) as aggr_conv,
sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks),
innerJoin(
jdbc(connection="jdbc:mysql://localhost/cost_db?user=root&password=root",
sql="SELECT campaign_id_s,currency_cost_i FROM cost",
sort="campaign_id_s asc", driver="com.mysql.jdbc.Driver"),
on="campaign_id_s")
Streaming Expressions - Use cases
Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for
organisation ‘org-01’
innerJoin(
select(
facet(weekly_data,
q="org_id_s:org-01",
buckets="campaign_id_s",
bucketSorts="campaign_id_s asc",
bucketSizeLimit=100,
sum(conversations_i),
sum(impressions_i),
sum(clicks_i)),
campaign_id_s as campaign_id_s,
sum(conversations_i) as aggr_conv,sum(impressions_i)
as aggr_impr, sum(clicks_i) as aggr_clicks),
jdbc(connection="jdbc:mysql://localhost/cost_db?
user=root&password=root",
sql="SELECT campaign_id_s,currency_cost_i FROM cost",
sort="campaign_id_s asc",
driver="com.mysql.jdbc.Driver"),
on="campaign_id_s")
Streaming Expressions - Use cases
Use case: Calculate useful metrics on data fetched from various sources for ‘org-01’:
● conversion ratio (conversions to clicks)
● CTR (clicks to impressions)
● cost ratio (conversions to currency cost)
select(
innerJoin(
……..
on="campaign_id_s"),
div( aggr_conv, aggr_clicks )
as conversion_ratio,
div( aggr_clicks , aggr_impr )
as ctr,
div( currency_cost_i, aggr_conv)
as campaign_cost_ratio)
Streaming Expressions - Use Cases
Use case: Calculate useful metrics on data fetched from different sources for organisations and campaigns:
● Ratios
○ conversion ratio (conversions to clicks)
○ CTR (clicks to impressions)
○ cost ratio (conversions to currency cost)
● Time-series ratios
● Rankings: multi-faceted
Apache Zeppelin Notebook
Plot analytics dashboards on Apache Zeppelin
using Solr Interpreter
(Kiran Chitturi, Lucidworks Inc)
Streaming Expressions - Demo Application
Analytics application powered by Streaming Expression available at:
www.streamsolr.tk
Streaming Expressions - Use cases
Use case: Create a view from result-set of previously discussed use-case:
calculate metrics (index data to new collection)
update(
report, batchSize=500,
select( ……..
campaign_id_s as campaign)
)
complexity - O(N)
N - total rows processed
Streaming Expressions - Shuffle
Client
/stream
handler
/stream
handler
/stream
handler
Worker 1 Worker 2 Worker 3 Worker 4 Worker 5
Shard 1
Replica 1
Shard 2
Replica 1
Shard 3
Replica 1
Shard 4
Replica 1
Shard 5
Replica 1
Shard 1
Replica 2
Shard 2
Replica 2
Shard 3
Replica 2
Shard 4
Replica 2
Shard 5
Replica 2
Streaming Expressions - Shuffle
Client
/stream
handler
/stream
handler
/stream
handler
Worker 1 Worker 2 Worker 3 Worker 4 Worker 5
Shard 1
Replica 1
Shard 2
Replica 1
Shard 3
Replica 1
Shard 4
Replica 1
Shard 5
Replica 1
Shard 1
Replica 2
Shard 2
Replica 2
Shard 3
Replica 2
Shard 4
Replica 2
Shard 5
Replica 2
“Controlled”[subset S1]
Streaming Expressions
Worker
Collections
● Regular SolrCloud collections
● Perform streaming aggregations using the
Streaming API
● Receive shuffled streams from replicas
● May be empty or created just-in-time or
have regular data
● The goal is to separate processing from
data if necessary
Streaming Expressions - Use cases
Use case: Indexing the result-set of discussed use-case (calculate metrics for
organisation) to new collection ‘report’ parallely utilising ‘n’ workers
parallel(worker,
update(report,batchSize=10,
select(
innerJoin(
select(
facet(weekly_data, q="org_id_s:org-01", buckets="campaign_id_s", bucketSorts="campaign_id_s asc", bucketSizeLimit=100,
sum(conversations_i), sum(impressions_i), sum(clicks_i), partitionKeys="campaign_id_s"),
campaign_id_s as campaign_id_s, sum(conversations_i) as aggr_conv, sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks),
search(cost, zkHost="localhost:9983", qt="/export",q="*:*", fl="campaign_id_s,org_id_s,currency_cost_i",
partitionKeys="campaign_id_s", sort="campaign_id_s asc"),
on="campaign_id_s"),
div( aggr_conv, aggr_clicks ) as conversion_ratio, div( aggr_clicks , aggr_impr ) as ctr, div( currency_cost_i, aggr_conv)
as campaign_cost_ratio, campaign_id_s as campaign)),
workers=3,
zkHost="localhost:9983",
sort="campaign asc")
Streaming Expressions - Use cases
Use case: Indexing the result-set of discussed use-case (calculate metrics for
organisation) to new collection ‘report’ parallely utilising ‘n’ workers.
complexity - Z(W) + O(N x W)/W ~ O(N)/W
N - total rows processed Z - aggregation W - number of workers utilised N >>W
Streaming Expressions - Cautious with ‘Parallel’
• Rows from each shard are requested once in ‘non-parallel’ expression execution.
total rows processed: N
• Each worker filters tuples through hashing, and further process them
individually asynchronously.
• If N is relatively low OR not many nested decorators involved e.g.
parallel(search(.....)), ‘parallel’ introduces unnecessary performance complexity.
• Carefully and thoughtfully use ‘parallel’.
(N - tuples at encapsulated stream) (W - number of workers)
• While each worker specified under ‘parallel’ request all rows from each shard.
total rows processed: N x W
Statistical Programming
• Solr’s powerful data retrieval capabilities can be combined with in-depth
statistical analysis.
• SQL, timeseries aggregation, KNNs, Linear regressions and more..
• Syntax can be used to create arrays from the data so it can be manipulated,
transformed and analyzed
• Statistical function library:
• Correlation, Covariances, Percentiles, Euclidean distance and more..
• backed by Apache Common Maths Library
Streaming Expressions & DeepLearning4j
• Eclipse Deeplearning4j is first commercial-grade, open-source, distributed
deep-learning library written for Java and Scala.
• DataSetIterator handles traversing through a dataset and preparing data for a
neural network.
• TupleStreamDataSetIterator is introduced in 1.0.0-beta2 by Christine
Poerschke, Committer PMC Apache Solr.
• Fetches data via Streaming Expressions, sources like Solr Collections, JDBC etc.
References & Knowledge Base
• Use cases and examples available on Github: /sarkaramrit2/stream-solr
• Streaming expression official documentation in Apache Solr.
• Statistical Programming official documentation in Apache Solr.
• Joel Bernstein’s blog.
• Presentation links:
• The Evolution of Streaming Expressions
• Streaming Aggregation, New Horizons for Search
• Analytics and Graph Traversal with Solr
• Creating New Streaming Expressions
Thank you!
Amrit Sarkar
Search Engineer, Lucidworks Inc
@sarkaramrit2
#Activate18 #ActivateSearch

More Related Content

What's hot

ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcasesAndrii Gakhov
 
MongoDB Aggregation
MongoDB Aggregation MongoDB Aggregation
MongoDB Aggregation Amit Ghosh
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesBig Data Colombia
 
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...CodiLime
 
PistonHead's use of MongoDB for Analytics
PistonHead's use of MongoDB for AnalyticsPistonHead's use of MongoDB for Analytics
PistonHead's use of MongoDB for AnalyticsAndrew Morgan
 
NoSQL meets Microservices - Michael Hackstein
NoSQL meets Microservices - Michael HacksteinNoSQL meets Microservices - Michael Hackstein
NoSQL meets Microservices - Michael Hacksteindistributed matters
 
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015NoSQLmatters
 
Nu program language on Shibuya.lisp#5 LT
Nu program language on  Shibuya.lisp#5 LTNu program language on  Shibuya.lisp#5 LT
Nu program language on Shibuya.lisp#5 LTYuumi Yoshida
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
Aerospike User Group: Exploring Data Modeling
Aerospike User Group: Exploring Data ModelingAerospike User Group: Exploring Data Modeling
Aerospike User Group: Exploring Data ModelingBrillix
 
Living in eventually consistent reality
Living in eventually consistent realityLiving in eventually consistent reality
Living in eventually consistent realityBartosz Sypytkowski
 
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: TutorialMongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: TutorialMongoDB
 
Wakanda#4
Wakanda#4Wakanda#4
Wakanda#4kmiyako
 
Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Natasha Wilson
 
MongoDB Stich Overview
MongoDB Stich OverviewMongoDB Stich Overview
MongoDB Stich OverviewMongoDB
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovDatabricks
 

What's hot (20)

ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
MongoDB Aggregation
MongoDB Aggregation MongoDB Aggregation
MongoDB Aggregation
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
 
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...
CodiLime Tech Talk - Katarzyna Ziomek-Zdanowicz: RxJS main concepts and real ...
 
PistonHead's use of MongoDB for Analytics
PistonHead's use of MongoDB for AnalyticsPistonHead's use of MongoDB for Analytics
PistonHead's use of MongoDB for Analytics
 
Scala.io
Scala.ioScala.io
Scala.io
 
NoSQL meets Microservices - Michael Hackstein
NoSQL meets Microservices - Michael HacksteinNoSQL meets Microservices - Michael Hackstein
NoSQL meets Microservices - Michael Hackstein
 
CAVE Overview
CAVE OverviewCAVE Overview
CAVE Overview
 
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
 
Nu program language on Shibuya.lisp#5 LT
Nu program language on  Shibuya.lisp#5 LTNu program language on  Shibuya.lisp#5 LT
Nu program language on Shibuya.lisp#5 LT
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Aerospike User Group: Exploring Data Modeling
Aerospike User Group: Exploring Data ModelingAerospike User Group: Exploring Data Modeling
Aerospike User Group: Exploring Data Modeling
 
Living in eventually consistent reality
Living in eventually consistent realityLiving in eventually consistent reality
Living in eventually consistent reality
 
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: TutorialMongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial
 
Wakanda#4
Wakanda#4Wakanda#4
Wakanda#4
 
Rxjs kyivjs 2015
Rxjs kyivjs 2015Rxjs kyivjs 2015
Rxjs kyivjs 2015
 
Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop
 
MongoDB Stich Overview
MongoDB Stich OverviewMongoDB Stich Overview
MongoDB Stich Overview
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Supstat nyc subway
Supstat nyc subwaySupstat nyc subway
Supstat nyc subway
 

Similar to Streaming Solr - Activate 2018 talk

Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineMyung Ho Yun
 
WSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needsWSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needsSriskandarajah Suhothayan
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming AppsWSO2
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Stratio
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...Noriaki Tatsumi
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics PlatformSrinath Perera
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
 
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2
 
PredictionIO – A Machine Learning Server in Scala – SF Scala
PredictionIO – A Machine Learning Server in Scala – SF ScalaPredictionIO – A Machine Learning Server in Scala – SF Scala
PredictionIO – A Machine Learning Server in Scala – SF Scalapredictionio
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital EnterpriseWSO2
 
A head start on cloud native event driven applications - bigdatadays
A head start on cloud native event driven applications - bigdatadaysA head start on cloud native event driven applications - bigdatadays
A head start on cloud native event driven applications - bigdatadaysSriskandarajah Suhothayan
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessGiuseppe Gaviani
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessyalisassoon
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaLucidworks
 
Scaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabScaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabRoman
 
Serverless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsServerless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsAmazon Web Services
 
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Amazon Web Services
 
Small pieces loosely joined
Small pieces loosely joinedSmall pieces loosely joined
Small pieces loosely joinedennui2342
 

Similar to Streaming Solr - Activate 2018 talk (20)

Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
 
WSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needsWSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needs
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
 
Tsar tech talk
Tsar tech talkTsar tech talk
Tsar tech talk
 
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
 
PredictionIO – A Machine Learning Server in Scala – SF Scala
PredictionIO – A Machine Learning Server in Scala – SF ScalaPredictionIO – A Machine Learning Server in Scala – SF Scala
PredictionIO – A Machine Learning Server in Scala – SF Scala
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
 
A head start on cloud native event driven applications - bigdatadays
A head start on cloud native event driven applications - bigdatadaysA head start on cloud native event driven applications - bigdatadays
A head start on cloud native event driven applications - bigdatadays
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your business
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Scaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabScaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at Grab
 
Serverless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsServerless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis Analytics
 
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
 
Small pieces loosely joined
Small pieces loosely joinedSmall pieces loosely joined
Small pieces loosely joined
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Streaming Solr - Activate 2018 talk

  • 1. Building Analytics Applications with Streaming Expressions in Apache Solr Amrit Sarkar Search Engineer, Lucidworks Inc @sarkaramrit2 #Activate18 #ActivateSearch
  • 2. Agenda • Parallel Computing Framework Introduction to • Streaming API • Streaming Expressions • Types of Expressions • Shuffling • Workers • Real-Life Use Cases • Demo Application • Performance Analysis
  • 3. Challenges building applications on real-time data • Searching and filtering the data before performing analytics. • Executing complex operations & co-relations on unstructured and non-preprocessed data is time consuming. • Dependencies on multiple tools leading to higher maintenance cost. Data visualizer Client Client Database Real-time Updates
  • 4. • Streaming API • Streaming Expressions • Shuffling • Worker collections • Parallel SQL Parallel Computing Framework Available in SolrCloud mode
  • 5. Streaming API • Java API for parallel computation • Real-time Mapreduce and Parallel Relational Algebra • Results are streams of tuples (key/value) (TupleStream) • org.apache.solr.client.solrj.io.* ParallelStream pstream = (ParallelStream) streamFactory.constructStream("parallel(collectionName, ……....)"); pstream.open();
  • 6. Streaming Expressions curl --data-urlencode ‘expr= search(gettingstarted, zkHost=”localhost:9983", qt=”/export”, q=”hatchbacks”, fq=”year:2014”, fl=”id, model_name”, sort=”id asc”))’ http://localhost:8983/solr/ gettingstarted/stream Use case: perform full index search and retrieve specific fields sorted • String Query Language and Serialisation format for the Streaming API • Streaming expressions compile to TupleStream; TupleStream serialise to Streaming Expressions • Can be used directly via HTTP to SolrJ • Expressions can be executed against Solr API: /solr/<collection-name>/stream
  • 7. Streaming Expressions curl --data-urlencode ‘expr= search(gettingstarted, zkHost=”localhost:9983", qt=”/export”, q=”hatchbacks”, fq=”year:2014”, fl=”id, model_name”, sort=”id asc”))’ http://localhost:8983/solr/ gettingstarted/stream Use case: perform full index search and retrieve specific fields sorted
  • 8. Streaming Expressions • Stream Sources The origin of a TupleStream search, facet, jdbc, stats, topic, timeseries, train and more.. • Stream Decorators Wrap other stream functions and perform operations on the stream, row wise complement, hashJoin, innerJoin, merge, intersect, top, unique and more.. • Stream evaluators evaluate (calculate) new values based on other values in a tuple, column wise add, eq, div, mul, sub, length, asin, acos, abs, if:then and more..
  • 9. Streaming Expressions - Use cases Use case: Destinations reachable with single stop from ‘New York’ (graphical traversal) nodes(distances, nodes(distances, walk="New York->source_s", gather="destination_s"), walk="node->source_s", gather="destination_s", trackTraversal="true", scatter="branches,leaves") Solr indexes are stored in ‘token’ to ‘document-ids’ format, ‘nodes’ perform BFS on field tokens.
  • 10. Streaming Expressions - Use cases Use case: Determine most relevant terms on dynamic data set significantTerms( enron-emails, q="To:Tim Belden", field="content", limit="2", minDocFreq="10", maxDocFreq=".20", minTermLength="5" ) Solr indexes are stored in ‘token’ to ‘document-ids’ format, ‘significantTerms’ aggregates over tokens.
  • 11. Streaming Expressions - Use cases Use case: Calculate useful metrics on data fetched from various sources. • conversion ratio (conversions to clicks) • CTR (clicks to impressions) • cost ratio (conversions to currency cost) campaign_id_s org_id_s conversions_i impressions_i clicks_i cmp-01 org-01 4 134 48 cmp-02 org-02 2 174 26 cmp-03 org-01 6 152 49 cmp-01 org-01 5 154 27 cmp-02 org-01 9 176 38 cmp-03 org-01 5 137 83 cmp-01 org-01 3 154 36 cmp-02 org-02 1 178 35 cmp-03 org-01 7 124 49 ……... ……... ……... ……... ……... campaign_id_s currency_cost_i cmp-01 6600 cmp-02 5840 cmp-03 8400 Events captured in solr collection ‘weekly_data’ Campaign costs stored in mysql table ‘cost’
  • 12. Streaming Expressions - Use cases facet(weekly_data, q="org_id_s:org-01", buckets="campaign_id_s", bucketSorts="campaign_id_s asc", bucketSizeLimit=100, sum(conversations_i), sum(impressions_i), sum(clicks_i)), Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation ‘org-01’ select( campaign_id_s as campaign_id_s, sum(conversations_i) as aggr_conv, sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks), innerJoin( jdbc(connection="jdbc:mysql://localhost/cost_db?user=root&password=root", sql="SELECT campaign_id_s,currency_cost_i FROM cost", sort="campaign_id_s asc", driver="com.mysql.jdbc.Driver"), on="campaign_id_s")
  • 13. Streaming Expressions - Use cases Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation ‘org-01’ innerJoin( select( facet(weekly_data, q="org_id_s:org-01", buckets="campaign_id_s", bucketSorts="campaign_id_s asc", bucketSizeLimit=100, sum(conversations_i), sum(impressions_i), sum(clicks_i)), campaign_id_s as campaign_id_s, sum(conversations_i) as aggr_conv,sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks), jdbc(connection="jdbc:mysql://localhost/cost_db? user=root&password=root", sql="SELECT campaign_id_s,currency_cost_i FROM cost", sort="campaign_id_s asc", driver="com.mysql.jdbc.Driver"), on="campaign_id_s")
  • 14. Streaming Expressions - Use cases Use case: Calculate useful metrics on data fetched from various sources for ‘org-01’: ● conversion ratio (conversions to clicks) ● CTR (clicks to impressions) ● cost ratio (conversions to currency cost) select( innerJoin( …….. on="campaign_id_s"), div( aggr_conv, aggr_clicks ) as conversion_ratio, div( aggr_clicks , aggr_impr ) as ctr, div( currency_cost_i, aggr_conv) as campaign_cost_ratio)
  • 15. Streaming Expressions - Use Cases Use case: Calculate useful metrics on data fetched from different sources for organisations and campaigns: ● Ratios ○ conversion ratio (conversions to clicks) ○ CTR (clicks to impressions) ○ cost ratio (conversions to currency cost) ● Time-series ratios ● Rankings: multi-faceted Apache Zeppelin Notebook Plot analytics dashboards on Apache Zeppelin using Solr Interpreter (Kiran Chitturi, Lucidworks Inc)
  • 16. Streaming Expressions - Demo Application Analytics application powered by Streaming Expression available at: www.streamsolr.tk
  • 17. Streaming Expressions - Use cases Use case: Create a view from result-set of previously discussed use-case: calculate metrics (index data to new collection) update( report, batchSize=500, select( …….. campaign_id_s as campaign) ) complexity - O(N) N - total rows processed
  • 18. Streaming Expressions - Shuffle Client /stream handler /stream handler /stream handler Worker 1 Worker 2 Worker 3 Worker 4 Worker 5 Shard 1 Replica 1 Shard 2 Replica 1 Shard 3 Replica 1 Shard 4 Replica 1 Shard 5 Replica 1 Shard 1 Replica 2 Shard 2 Replica 2 Shard 3 Replica 2 Shard 4 Replica 2 Shard 5 Replica 2
  • 19. Streaming Expressions - Shuffle Client /stream handler /stream handler /stream handler Worker 1 Worker 2 Worker 3 Worker 4 Worker 5 Shard 1 Replica 1 Shard 2 Replica 1 Shard 3 Replica 1 Shard 4 Replica 1 Shard 5 Replica 1 Shard 1 Replica 2 Shard 2 Replica 2 Shard 3 Replica 2 Shard 4 Replica 2 Shard 5 Replica 2 “Controlled”[subset S1]
  • 20. Streaming Expressions Worker Collections ● Regular SolrCloud collections ● Perform streaming aggregations using the Streaming API ● Receive shuffled streams from replicas ● May be empty or created just-in-time or have regular data ● The goal is to separate processing from data if necessary
  • 21. Streaming Expressions - Use cases Use case: Indexing the result-set of discussed use-case (calculate metrics for organisation) to new collection ‘report’ parallely utilising ‘n’ workers parallel(worker, update(report,batchSize=10, select( innerJoin( select( facet(weekly_data, q="org_id_s:org-01", buckets="campaign_id_s", bucketSorts="campaign_id_s asc", bucketSizeLimit=100, sum(conversations_i), sum(impressions_i), sum(clicks_i), partitionKeys="campaign_id_s"), campaign_id_s as campaign_id_s, sum(conversations_i) as aggr_conv, sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks), search(cost, zkHost="localhost:9983", qt="/export",q="*:*", fl="campaign_id_s,org_id_s,currency_cost_i", partitionKeys="campaign_id_s", sort="campaign_id_s asc"), on="campaign_id_s"), div( aggr_conv, aggr_clicks ) as conversion_ratio, div( aggr_clicks , aggr_impr ) as ctr, div( currency_cost_i, aggr_conv) as campaign_cost_ratio, campaign_id_s as campaign)), workers=3, zkHost="localhost:9983", sort="campaign asc")
  • 22. Streaming Expressions - Use cases Use case: Indexing the result-set of discussed use-case (calculate metrics for organisation) to new collection ‘report’ parallely utilising ‘n’ workers. complexity - Z(W) + O(N x W)/W ~ O(N)/W N - total rows processed Z - aggregation W - number of workers utilised N >>W
  • 23. Streaming Expressions - Cautious with ‘Parallel’ • Rows from each shard are requested once in ‘non-parallel’ expression execution. total rows processed: N • Each worker filters tuples through hashing, and further process them individually asynchronously. • If N is relatively low OR not many nested decorators involved e.g. parallel(search(.....)), ‘parallel’ introduces unnecessary performance complexity. • Carefully and thoughtfully use ‘parallel’. (N - tuples at encapsulated stream) (W - number of workers) • While each worker specified under ‘parallel’ request all rows from each shard. total rows processed: N x W
  • 24. Statistical Programming • Solr’s powerful data retrieval capabilities can be combined with in-depth statistical analysis. • SQL, timeseries aggregation, KNNs, Linear regressions and more.. • Syntax can be used to create arrays from the data so it can be manipulated, transformed and analyzed • Statistical function library: • Correlation, Covariances, Percentiles, Euclidean distance and more.. • backed by Apache Common Maths Library
  • 25. Streaming Expressions & DeepLearning4j • Eclipse Deeplearning4j is first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. • DataSetIterator handles traversing through a dataset and preparing data for a neural network. • TupleStreamDataSetIterator is introduced in 1.0.0-beta2 by Christine Poerschke, Committer PMC Apache Solr. • Fetches data via Streaming Expressions, sources like Solr Collections, JDBC etc.
  • 26. References & Knowledge Base • Use cases and examples available on Github: /sarkaramrit2/stream-solr • Streaming expression official documentation in Apache Solr. • Statistical Programming official documentation in Apache Solr. • Joel Bernstein’s blog. • Presentation links: • The Evolution of Streaming Expressions • Streaming Aggregation, New Horizons for Search • Analytics and Graph Traversal with Solr • Creating New Streaming Expressions
  • 27. Thank you! Amrit Sarkar Search Engineer, Lucidworks Inc @sarkaramrit2 #Activate18 #ActivateSearch