Transcript: SQL Analytics for Search Engineers - Timothy Potter, Lucidworks
1. SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
@thelabdude
#Activate18 #ActivateSearch
2. An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications, faster!
• Trying out new ideas
• Making better decisions with self-service analytics
• Random one-off jobs for this and that
• Use AI everywhere!
3. The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!
4. Data Ingest = Project Friction
• Bespoke, search-driven applications > general-purpose dashboard tools
• Getting data in continues to be a hassle / source of friction when getting started
• Need something nimble but also fast / scalable
• For every connector, there are probably 20 SQL / NoSQL data silos
5. Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complement Fusion’s connectors for those dirty ETL jobs that cause friction in every project
• High-performance parallel reads from structured data sources, including Cassandra, Elastic, HBase, JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML model predictions as UDFs
• Direct to Solr for optimal speed, or send to index pipelines for optimal flexibility (see the write sketch below)
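As a rough illustration of the "direct to Solr" path, here is how a Spark job might write a DataFrame with the spark-solr library. This is a minimal sketch, not the Parallel Bulk Loader’s actual configuration; the zkhost and collection values are placeholders:

import org.apache.spark.sql.{DataFrame, SaveMode}

def writeToSolr(df: DataFrame): Unit = {
  df.write
    .format("solr")                    // spark-solr data source
    .options(Map(
      "zkhost" -> "zk1:2181/solr",     // placeholder ZooKeeper connect string
      "collection" -> "products",      // placeholder target collection
      "commit_within" -> "5000"))      // flush documents within 5 seconds
    .mode(SaveMode.Overwrite)
    .save()
}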
6. A foundation built on SparkSQL
• Expose structured data as a DataFrame: RDD + schema
• 100’s of data sources + formats
• spark-solr translates Solr query results to a DataFrame (see the read sketch below)
• Highly optimized parallel reads with predicate pushdown, across a Spark cluster
• Spark optimizes the SQL query plan
• 100’s of built-in functions
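The read side is the mirror image: a hedged sketch of pulling Solr query results into a DataFrame with spark-solr, where the collection, query, and field names are placeholders:

// spark is an existing SparkSession
val signals = spark.read
  .format("solr")
  .options(Map(
    "zkhost" -> "zk1:2181/solr",           // placeholder ZooKeeper connect string
    "collection" -> "signals",             // placeholder collection
    "query" -> "type_s:click",             // Solr query; matching docs become rows
    "fields" -> "id,query_s,doc_id_s,ts")) // projection pushed down to Solr
  .load()
signals.createOrReplaceTempView("signals") // now queryable from Spark SQL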
8. Demo: a Parallel Bulk Loader job
• Read parquet from S3
• Write to a Fusion index pipeline
• Advanced transforms with Scala
• Transform with SQL
• Add job dependencies on-the-fly
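A minimal sketch of the read-and-transform steps of such a job; the s3a path and column names are hypothetical, and the final write to a Fusion index pipeline (a Fusion REST endpoint) is omitted here:

// Read structured data from S3 (path and schema are hypothetical).
val raw = spark.read.parquet("s3a://my-bucket/events/2018/")
raw.createOrReplaceTempView("raw_events")

// Basic ETL with SQL: filter bad rows and derive index-friendly fields.
val cleaned = spark.sql("""
  SELECT id,
         lower(trim(title)) AS title_s,
         to_date(ts)        AS date_dt
  FROM raw_events
  WHERE title IS NOT NULL
""")

// Advanced transforms can drop down into plain Scala on the Dataset API.
val enriched = cleaned.filter(_.getAs[String]("title_s").nonEmpty)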
9. User Feedback to Improve Relevancy
• MRR is sub-optimal for many queries
• Want to boost some docs based on user click behavior (per query)
• Older clicks should age out over time (see the time-decay sketch below)
• Some user actions are more important than others: click < cart add < purchase
• Sometimes you need to join signals with other tables, e.g. item metadata
• Hide complex business logic behind UDFs / UDAFs (pluggable)
• Designed for change!
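This is not Fusion’s built-in click-signal aggregation job, just a hedged sketch of the idea in Spark SQL. The signals view (registered above), its column names, the action weights, and the 30-day half-life are all assumptions for illustration:

// Weight actions (click < cart add < purchase), decay each signal by age
// with a 30-day half-life, and aggregate one boost weight per (query, doc).
val boosts = spark.sql("""
  SELECT query_s, doc_id_s,
         SUM(CASE type_s WHEN 'click'    THEN 1.0
                         WHEN 'cart_add' THEN 5.0
                         WHEN 'purchase' THEN 10.0
                         ELSE 0.0 END
             * POW(0.5, DATEDIFF(CURRENT_DATE, ts) / 30.0)) AS boost_d
  FROM signals
  GROUP BY query_s, doc_id_s
""")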
13. Window Functions

WITH sessions AS (
  SELECT *, sum(IF(diff_secs > 30, 1, 0))
    OVER (PARTITION BY clientip ORDER BY ts) AS session_id
  FROM (
    -- lag window function: seconds since this client's previous request
    SELECT *, unix_timestamp(ts) - lag(unix_timestamp(ts))
      OVER (PARTITION BY clientip ORDER BY ts) AS diff_secs
    FROM ${inputCollection}
    WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
      AND verb IS NOT NULL AND response IS NOT NULL
  )
)
SELECT concat_ws('||', clientip, session_id) AS id,
       first(clientip) AS clientip,
       min(ts) AS session_start,
       max(ts) AS session_end,
       timediff(max(ts), min(ts), "MILLISECONDS") AS session_len_ms_l,
       sum(bytes) AS total_bytes_l,
       count(*) AS total_requests_l
FROM sessions
GROUP BY clientip, session_id
14. SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups
(query / doc_id)
• ~18 mins on a 3-node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr
15. Why Self-service Analytics?
• Powerful connectors, relevance, speed, and massive scalability = more mission-critical datasets finding their way into Fusion
• Don’t be another data silo!
• Let users ask questions of this data using their tool of choice w/o adding work for the IT group!
• Aggregations over full-text ranked results
• But it has to be fast, or else you’re right back to data-warehousing problems
16. Self-service Analytics
• Fusion SQL is a JDBC-accessible service for standard SQL (connection sketch below)
• Fusion SQL plugs into Apache Spark’s query planner to translate SQL into optimized Solr queries (streaming expressions and JSON facets)
• Integrates with popular BI tools like Tableau, PowerBI, and Spotfire, plus notebooks like Apache Zeppelin
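Because the Fusion SQL service runs as a Spark Thriftserver (see the notes below), any HiveServer2-compatible JDBC client can talk to it. A minimal sketch, assuming the Hive JDBC driver is on the classpath; the host, port, and credentials are placeholders:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")  // Thriftserver speaks HiveServer2
val conn = DriverManager.getConnection(
  "jdbc:hive2://fusion-host:8768/default",        // placeholder host:port
  "fusion-user", "fusion-pass")                   // placeholder credentials
val rs = conn.createStatement().executeQuery(
  "SELECT movie_id, COUNT(*) AS cnt FROM ratings GROUP BY movie_id LIMIT 10")
while (rs.next()) println(s"${rs.getString("movie_id")} -> ${rs.getLong("cnt")}")
conn.close()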
20. Self-service Analytics Performance
• A Periscope Data blog post compared their SQL engine against common DBs using a count-distinct query typical for dashboards (query shape sketched below)
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge), single instance of Solr
• Fusion: ~900 ms; at 28M rows: ~1.3 secs
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
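The benchmark query has this general shape (table and column names are illustrative, following the blog’s description of counting distinct users per dashboard):

SELECT dashboard_id,
       COUNT(DISTINCT user_id) AS distinct_users
FROM time_on_site_logs
GROUP BY dashboard_id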
21. Self-service Analytics Performance
SELECT m.title AS title, agg.aggCount AS aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) AS aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10) AS agg
ON agg.movie_id = m.id
ORDER BY aggCount DESC

MovieLens data: aggregate 20M ratings
Fusion SQL: ~1.1 secs
MySQL: ~17 secs (w/ index on movie_id)
https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
22. Experiments
• Run live experiments to try out new ideas and compare outcomes between variants
• Built-in metrics: MRR, avg/min/max response time, CTR … and, you guessed it, SQL
• Bayesian Bandits to explore/exploit the best-performing variant
24. Recap
• How to build powerful SQL aggregations with joins, custom UDFs/UDAFs, and window functions to power boosting and recommendations
• Ingesting data from data sources using SQL for ETL and ML
• Self-service analytics from popular BI visualization tools
• Measuring outcomes between variants in an experiment using SQL
https://github.com/lucidworks/fusion-spark-bootcamp
25. Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost weights and generate recommendations
2. Ingest & ETL from 100’s of data sources using SparkSQL
3. Use ML models for predictions and Lucene text analysis via UDFs
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and PowerBI
6. Hide complex business logic behind UDFs / UDAFs
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL views
How are you going to get all this done?
In Fusion, we chose SQL as the foundational technology to solve many of these issues.
So I think we’re all pretty clear on the scope of the problem, but what might the ideal solution look like?
Audience poll:
- How many know SQL and have used it in some fashion in the last year?
- How many have integrated search with some sort of SQL database today?
One of the amazing things about App Studio is that you can rapidly build bespoke search applications w/o creating another data silo!
Getting data indexed is not the end goal of a project; it’s an impediment on most projects that adds friction and distracts us from the important stuff (queries / visualization).
Organizations are really good at provisioning data silos
To let people ask new questions from your data, they need access across many data sources
SQL and NoSQL databases are everywhere! Need something nimble to go grab data from multiple places and move it quickly.
Connectors are great for complex business apps like SharePoint and Box, but for every SharePoint there are 100 SQL / NoSQL databases in a modern org.
SQL lets Spark create an optimized query plan, which sometimes we know how to optimize further for Solr
Typically built by experts
NoSQL: Cassandra, HBase, Hive, Mongo
S3, HDFS, parquet
Search: Solr, Elastic
RDBMS: JDBC, Redshift, Hive
Azure, Google Analytics
Ingest data from S3
Invoke an ML model to do NLP stuff
Do some basic ETL with SQL
Just a placeholder slide for what is shown in the demo
Spark function reference: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/index.html
Fusion’s built-in click signal SQL aggregation job
Time-decay function
Custom UDF (price bucketing)
Custom SQL job: sessionization of logs with window functions
Just a placeholder slide for what is shown in the demo
Just a placeholder to show another example of a SQL agg job, this time with a window function to find sessions.
The traditional problems with self-service analytics are speed, flexibility, and scalability.
A whole that’s greater than the sum of its parts
Pushdown the computation of an aggregated query into Solr for maximum performance
Or, pull rows into Spark from Solr to perform most any analytics task
At step 1, a Fusion data analyst is authenticated by the JDBC/ODBC client application (e.g. SpotFire or Tableau) using Kerberos. Once authenticated, the user’s SQL query is sent to the Fusion SQL Thriftserver over HTTP (step 2 in the diagram). The SQL Thriftserver uses the service principal keytab to validate the incoming user identity using Kerberos (step 3).
The Fusion SQL Thriftserver is a Spark application with a specific number of CPU cores and memory allocated from the pool of Spark resources. You can scale out the number of Spark worker nodes to increase available memory and CPU resources to the SQL service. The Thriftserver sends the query to Spark to be parsed into a Logical query plan (step 4). During the query planning stage, Spark sends the logical plan to Fusion’s pushdown strategy component (step 5). The pushdown strategy analyzes the query plan to determine if there is an optimal Solr query / streaming expression that can “push-down” aggregations into Solr to improve performance and scalability. For instance, the following SQL query can be translated into a Solr facet query by the Fusion pushdown strategy:
select count(1) as the_count, movie_id from ratings group by movie_id
The basic idea behind Fusion’s pushdown strategy is that it is much faster to let Solr facets perform basic aggregations than it is to export raw documents from Solr and have Spark perform the aggregation. If an optimal pushdown query is not possible, then Spark pulls raw documents from Solr and performs any joins / aggregations needed in Spark. Put simply, the Fusion SQL service tries to translate SQL queries into optimized Solr queries; failing that, it simply reads all matching docs for a query into Spark and performs the SQL execution logic across the Spark cluster.
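As a rough illustration (not Fusion’s exact translation), a group-by count like the one above maps naturally onto a Solr facet streaming expression along these lines, where the bucketSizeLimit value is an arbitrary placeholder:

facet(ratings,
      q="*:*",
      buckets="movie_id",
      bucketSorts="count(*) desc",
      bucketSizeLimit=100000,
      count(*))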
During pushdown analysis, Fusion calls out to the registered AuthzFilterProvider implementation to get a filter query to perform row-level filtering for the Kerberos authenticated user (step 6). By default there is no row-level security provider but users can install their own implementation using the Fusion SQL service API.
Lastly, a distributed Solr query gets executed by Spark to return documents that satisfy the SQL query criteria and row-level security filter (step 7). To leverage the distributed nature of Spark and Solr, Fusion SQL sends a query to all replicas for each shard in a Solr collection. Consequently, you can scale out SQL query performance by adding more Spark and/or Solr resources to your cluster.
Show connecting to Fusion SQL from Tableau (or maybe Apache Superset)
Build a simple data visualization on-the-fly
Just a placeholder slide for what is shown in the demo
Avg. time on site / # of interactions per variant
Show results in App Insights