SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers

SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
@thelabdude
#Activate18 #ActivateSearch

An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications,
faster!
• Trying out new ideas
• Making better decisions with self-service
analytics
• Random one-off jobs for this and that
• Use AI everywhere!

The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!

Data Ingest = Project Friction
• Bespoke, search-driven applications >
general purpose dashboard tools
• Getting data in continues to be a hassle
/ friction when getting started
• Need something nimble but also fast /
scalable
• For every connector, there’s probably
20 SQL / NoSQL data silos

Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complement Fusion’s connectors for those dirty
ETL jobs that cause friction in every project
• High performance parallel reads from structured
data sources, including Cassandra, Elastic, HBase,
JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML Model predictions as UDF
• Direct to Solr for optimal speed or send to index-
pipelines for optimal flexibility

A foundation built on SparkSQL
• Expose structured data as a DataFrame:
RDD + schema
• 100’s of data sources + formats
• spark-solr translates Solr query results
to a DataFrame
• Highly optimized parallel reads, with
predicate pushdown across a Spark
cluster
• Spark optimizes the SQL query plan
• 100’s of built-in functions

Demo: Parallel Bulk Loader
Parallel Bulk Loader

Read parquet
from S3
Write to a Fusion
Index Pipeline
Advanced transforms
with Scala
Transform with SQL
Add job dependencies
On-the-fly

User Feedback to Improve Relevancy
• MRR is sub-optimal for many queries?
• Want to boost some docs based on user
click behavior (per query)
• Older clicks should age out over time
• Some user actions are more important
than others: click < cart add < purchase
• Sometimes you need to join signals with
other tables, e.g. item metadata
• Hide complex business logic behind UDF
/ UDAF (pluggable)
• Designed for change!

SQL Aggregation

Join with other
tables
Custom UDAF
Final output to
Solr

Window Functions
WITH sessions AS (
SELECT *, sum(IF(diff_secs > 30, 1, 0))
OVER (PARTITION BY clientip ORDER BY ts) session_id
FROM (
SELECT *, unix_timestamp(ts) - lag(unix_timestamp(ts))
OVER (PARTITION BY clientip ORDER BY ts) as diff_secs
FROM ${inputCollection}
WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
AND verb IS NOT NULL AND response IS NOT NULL
)
) SELECT concat_ws('||', clientip,session_id) as id,
first(clientip) as clientip,
min(ts) as session_start,
max(ts) as session_end,
timediff(max(ts), min(ts), "MILLISECONDS") as session_len_ms_l,
sum(bytes) as total_bytes_l,
count(*) as total_requests_l
FROM sessions
GROUP BY clientip, session_id
Lag window
function

SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups
(query / doc_id)
• ~18 mins on 3 node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr

Why Self-service Analytics?
• Powerful connectors, relevance, speed,
and massive scalability = more mission-
critical datasets finding their way into
Fusion
• Don’t be another data silo!
• Let users ask questions of this data
using their tool of choice w/o adding
work for the IT group!
• Aggregations over full-text ranked
results
• But it has to be fast else you’re right
back to data warehousing problems

Self-service Analytics
• Fusion SQL is a JDBC service that
supports SQL
• Fusion SQL plugs into Apache
Spark’s query planner to translate
SQL into optimized Solr queries
(streaming expressions and JSON
facets)
• Integrate with popular BI tools like
Tableau, PowerBI, and Spotfire +
Notebooks like Apache Zeppelin

Self-Service Analytics

Self-service Analytics Performance
• Blog performed a comparison of their SQL engine against
common DBs using a count distinct query typical for
dashboards
• 14M logs, 1200 distinct dashboards, 1700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on Ec2 (m1.xlarge),
single instance of Solr
Fusion: ~900ms
28M rows: ~1.3secs
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html

Self-service Analytics Performance
SELECT m.title as title, agg.aggCount as aggCount
FROM movies m
INNER JOIN (
SELECT movie_id, COUNT(*) as aggCount
FROM ratings
WHERE rating >= 4 GROUP BY movie_id
ORDER BY aggCount desc LIMIT 10) as agg
ON agg.movie_id = m.id
ORDER BY aggCount DESC
20M rows
Fusion SQL : ~1.1 secs
MySQL: 17 secs (w/ index on movie_id)
Movielens data: Aggregate 20M ratings
https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/

Experiments
• Run live experiments to try out
new ideas and compare
outcomes between variants
• Built-in metrics: MRR, avg|min|
max response time, CTR …
and you guessed it! SQL
• Bayesian Bandits to
explore/exploit the best
performing variant

Experiment Metrics

Recap
• How to build powerful SQL aggregations with
joins, custom UDF/ UDAF, and window functions to
power boosting and recommendations
• Ingesting data from data sources using SQL for
ETL, ML
• Self-service analytics from popular BI visualization
tools
• Measure outcomes between variants in an
experiment using SQL
https://github.com/lucidworks/fusion-spark-bootcamp

Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost
weights and generate recommendations
2. Ingest & ETL from 100’s of data sources using SparkSQL
3. Use ML models to generate predictions and Lucene text
analysis using UDF functions
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and PowerBI
6. Hide complex business logic behind UDF / UDAF
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL
views

Thank you!
Timothy Potter
Manager Smart Data, Lucidworks
@thelabdude
#Activate18 #ActivateSearch

SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers

Similar to SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers (20)

More from Lucidworks

More from Lucidworks (20)

Recently uploaded

Recently uploaded (20)

SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers

Editor's Notes