SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
@thelabdude
#Activate18 #ActivateSearch
An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications,
faster!
• Trying out new ideas
• Making better decisions with self-service
analytics
• Random one-off jobs for this and that
• Use AI everywhere!
The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!
Data Ingest = Project Friction
• Bespoke, search-driven applications >
general purpose dashboard tools
• Getting data in continues to be a hassle
/ friction when getting started
• Need something nimble but also fast /
scalable
• For every connector, there are probably
20 SQL / NoSQL data silos
Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complement Fusion’s connectors for those dirty
ETL jobs that cause friction in every project
• High performance parallel reads from structured
data sources, including Cassandra, Elastic, HBase,
JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML model predictions as UDFs
• Direct to Solr for optimal speed or send to index-
pipelines for optimal flexibility
A foundation built on SparkSQL
• Expose structured data as a DataFrame:
RDD + schema
• 100s of data sources + formats
• spark-solr translates Solr query results
to a DataFrame
• Highly optimized parallel reads, with
predicate pushdown across a Spark
cluster
• Spark optimizes the SQL query plan
• 100s of built-in functions
Demo: Parallel Bulk Loader
Parallel Bulk Loader
Read parquet
from S3
Write to a Fusion
Index Pipeline
Advanced transforms
with Scala
Transform with SQL
Add job dependencies
On-the-fly
User Feedback to Improve Relevancy
• MRR (mean reciprocal rank) is sub-optimal for many queries?
• Want to boost some docs based on user
click behavior (per query)
• Older clicks should age out over time
• Some user actions are more important
than others: click < cart add < purchase
• Sometimes you need to join signals with
other tables, e.g. item metadata
• Hide complex business logic behind UDF
/ UDAF (pluggable)
• Designed for change!
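The requirements above (per-query boosts, clicks that age out, heavier weights for stronger actions) can be illustrated with a small Python sketch. This is not Fusion's aggregation logic: the action weights, the 30-day half-life, and the function names are all assumptions chosen for illustration. In Fusion this logic would sit behind a SQL aggregation with a UDF / UDAF.

```python
from collections import defaultdict

# Hypothetical weights encoding click < cart add < purchase.
ACTION_WEIGHTS = {"click": 1.0, "cart_add": 5.0, "purchase": 10.0}
# Assumed half-life: a signal loses half of its weight every 30 days.
HALF_LIFE_DAYS = 30.0


def decayed_weight(action, age_days):
    """Weight of one signal: action weight times exponential time decay."""
    return ACTION_WEIGHTS[action] * 0.5 ** (age_days / HALF_LIFE_DAYS)


def aggregate_signals(signals):
    """Sum decayed weights per (query, doc_id) group.

    `signals` is an iterable of (query, doc_id, action, age_days) tuples.
    """
    totals = defaultdict(float)
    for query, doc_id, action, age_days in signals:
        totals[(query, doc_id)] += decayed_weight(action, age_days)
    return dict(totals)


signals = [
    ("ipad", "doc1", "click", 0.0),      # fresh click
    ("ipad", "doc1", "purchase", 30.0),  # purchase, one half-life old
    ("ipad", "doc2", "click", 60.0),     # click, two half-lives old
]
boosts = aggregate_signals(signals)
```

Here a month-old purchase still outweighs a fresh click, while a two-month-old click has nearly aged out. Changing the weights or the half-life changes the business logic without touching the grouping.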
Signal Data Flow in Fusion
Demo: Signal Aggregation
SQL Aggregation
Join with other
tables
Custom UDAF
Final output to
Solr
Window Functions
WITH sessions AS (
  -- Running sum of "gap > 30 secs" flags gives each client an incrementing session_id
  SELECT *, sum(IF(diff_secs > 30, 1, 0))
    OVER (PARTITION BY clientip ORDER BY ts) session_id
  FROM (
    -- Seconds since the previous request from the same client (NULL for the first request)
    SELECT *, unix_timestamp(ts) - lag(unix_timestamp(ts))
      OVER (PARTITION BY clientip ORDER BY ts) as diff_secs
    FROM ${inputCollection}
    WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
      AND verb IS NOT NULL AND response IS NOT NULL
  )
)
-- One output row per (clientip, session_id) with session-level totals
SELECT concat_ws('||', clientip, session_id) as id,
  first(clientip) as clientip,
  min(ts) as session_start,
  max(ts) as session_end,
  timediff(max(ts), min(ts), "MILLISECONDS") as session_len_ms_l,
  sum(bytes) as total_bytes_l,
  count(*) as total_requests_l
FROM sessions
GROUP BY clientip, session_id
Lag window
function
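To make the sessionization logic easier to follow, here is a plain-Python sketch of what the query above computes. It is an illustration, not how Spark executes the query. It assumes timestamps are already integer seconds and that each row is a (clientip, ts_secs, bytes) tuple; the function and field names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECS = 30  # same threshold as the diff_secs > 30 test in the query


def sessionize(requests):
    """Group (clientip, ts_secs, bytes) rows into per-client sessions.

    A new session starts whenever the gap to the previous request from the
    same client exceeds SESSION_GAP_SECS, mirroring lag() plus a running sum().
    """
    sessions = {}
    ordered = sorted(requests, key=itemgetter(0, 1))  # PARTITION BY clientip ORDER BY ts
    for clientip, rows in groupby(ordered, key=itemgetter(0)):
        session_id, prev_ts = 0, None
        for _, ts, nbytes in rows:
            if prev_ts is not None and ts - prev_ts > SESSION_GAP_SECS:
                session_id += 1  # gap too large: start a new session
            prev_ts = ts
            key = f"{clientip}||{session_id}"
            session = sessions.setdefault(key, {
                "clientip": clientip,
                "session_start": ts,
                "session_end": ts,
                "total_bytes": 0,
                "total_requests": 0,
            })
            session["session_end"] = ts
            session["total_bytes"] += nbytes
            session["total_requests"] += 1
    return sessions


requests = [
    ("10.0.0.1", 0, 100),
    ("10.0.0.1", 10, 200),
    ("10.0.0.1", 100, 50),  # 90 secs after the previous request: new session
    ("10.0.0.2", 5, 10),
]
sessions = sessionize(requests)
```

The first request from each client has no predecessor, so it never opens a new session. This matches the SQL, where a NULL diff_secs contributes 0 to the running sum.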
SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups
(query / doc_id)
• ~18 mins on a 3-node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr
Why Self-service Analytics?
• Powerful connectors, relevance, speed,
and massive scalability = more mission-
critical datasets finding their way into
Fusion
• Don’t be another data silo!
• Let users ask questions of this data
using their tool of choice w/o adding
work for the IT group!
• Aggregations over full-text ranked
results
• But it has to be fast, or else you’re right
back to data warehousing problems
Self-service Analytics
• Fusion SQL is a JDBC service that
supports SQL
• Fusion SQL plugs into Apache
Spark’s query planner to translate
SQL into optimized Solr queries
(streaming expressions and JSON
facets)
• Integrate with popular BI tools like
Tableau, PowerBI, and Spotfire +
Notebooks like Apache Zeppelin
Demo: Self-Service Analytics
Self-service Analytics Performance
• The Periscope Data blog (linked below) compared their SQL
engine against common DBs using a count distinct query
typical for dashboards
• 14M logs, 1200 distinct dashboards, 1700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single instance of Solr
Fusion (14M rows): ~900 ms
Fusion (28M rows): ~1.3 secs
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
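For readers unfamiliar with the benchmark, the shape of a dashboard-style count distinct query can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in engine, not Fusion SQL. The logs table and its user_id / dashboard_id columns are assumed names modeled on the pairs mentioned above, not the blog's actual schema.

```python
import sqlite3

# In-memory stand-in database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id INTEGER, dashboard_id INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 10), (3, 20), (3, 20)],
)

# Distinct users per dashboard: repeat visits by the same user count once.
rows = conn.execute(
    """
    SELECT dashboard_id, COUNT(DISTINCT user_id) AS users
    FROM logs
    GROUP BY dashboard_id
    ORDER BY users DESC, dashboard_id
    """
).fetchall()
conn.close()
```

Dashboard 10 has three log rows but only two distinct users. That deduplication step is what makes this kind of query expensive at scale.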
Self-service Analytics Performance
SELECT m.title as title, agg.aggCount as aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) as aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10
) as agg
ON agg.movie_id = m.id
ORDER BY aggCount DESC
MovieLens data: aggregate 20M ratings (20M rows)
Fusion SQL: ~1.1 secs
MySQL: 17 secs (w/ index on movie_id)
https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
Experiments
• Run live experiments to try out
new ideas and compare
outcomes between variants
• Built-in metrics: MRR, avg|min|max
response time, CTR … and, you guessed
it, SQL!
• Bayesian Bandits to
explore/exploit the best
performing variant
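The slides do not say which bandit algorithm Fusion uses. As one common example of a Bayesian bandit, the sketch below implements Thompson sampling with a Beta posterior over each variant's click-through rate. The class name, variant names, and simulated click rates are all assumptions made for illustration.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a set of experiment variants."""

    def __init__(self, variants, seed=0):
        self._rng = random.Random(seed)
        # Per variant: [successes, failures]; Beta(1, 1) is the uniform prior.
        self.stats = {v: [0, 0] for v in variants}

    def choose(self):
        """Draw a plausible CTR for each variant and pick the largest draw."""
        draws = {
            v: self._rng.betavariate(s + 1, f + 1)
            for v, (s, f) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, variant, success):
        """Record one observed outcome (click or no click) for a variant."""
        self.stats[variant][0 if success else 1] += 1


def simulate(true_ctr, rounds=5000, seed=42):
    """Route simulated traffic and count how often each variant is served."""
    env = random.Random(seed)  # simulated users, separate from the sampler's RNG
    sampler = ThompsonSampler(true_ctr, seed=seed + 1)
    pulls = {v: 0 for v in true_ctr}
    for _ in range(rounds):
        variant = sampler.choose()
        pulls[variant] += 1
        sampler.update(variant, env.random() < true_ctr[variant])
    return pulls


pulls = simulate({"control": 0.05, "variant_b": 0.20})
```

Early on both variants get traffic (explore). As evidence accumulates, most requests go to the better-performing variant (exploit).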
Demo: Experiment Metrics
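Two of the built-in metrics named on the Experiments slide, MRR and CTR, can be sketched with their standard textbook definitions. This is not Fusion's implementation. The input shape (the 1-based rank of the first clicked result per query, or None when nothing was clicked) and the variant names are assumptions.

```python
def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank over queries.

    Each entry is the 1-based rank of the first clicked result, or None
    when nothing was clicked (such queries contribute 0).
    """
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_click_ranks if rank is not None)
    return total / len(first_click_ranks)


def click_through_rate(clicks, impressions):
    """Fraction of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0


# Hypothetical per-variant observations from an experiment.
ranks_by_variant = {
    "control": [1, 2, None, 4],
    "variant_b": [1, 1, 2, None],
}
mrr_by_variant = {
    variant: mean_reciprocal_rank(ranks)
    for variant, ranks in ranks_by_variant.items()
}
ctr_control = click_through_rate(3, 4)
```

With these numbers variant_b has the higher MRR, because its clicks land nearer the top of the result list.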
Recap
• How to build powerful SQL aggregations with
joins, custom UDF / UDAF, and window functions to
power boosting and recommendations
• Ingesting data from data sources using SQL for
ETL, ML
• Self-service analytics from popular BI visualization
tools
• Measure outcomes between variants in an
experiment using SQL
https://github.com/lucidworks/fusion-spark-bootcamp
Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost
weights and generate recommendations
2. Ingest & ETL from 100s of data sources using SparkSQL
3. Use ML models to generate predictions and run Lucene text
analysis via UDFs
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and PowerBI
6. Hide complex business logic behind UDF / UDAF
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL
views
Thank you!
Timothy Potter
Manager of Smart Data, Lucidworks
@thelabdude
#Activate18 #ActivateSearch

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

SQL Analytics for Search Engineers - Timothy Potter, Lucidworks

  • 1. SQL Analytics for Search Engineers Timothy Potter Manager of Smart Data @ Lucidworks / Apache Solr Committer @thelabdude #Activate18 #ActivateSearch
  • 2. An ever-expanding list of needs from search engineers • Better relevancy, less manual tuning • Bigger scale, less downtime, fixed resources • Higher QPS, more complex query pipelines • More bespoke, search-driven applications, faster! • Trying out new ideas • Making better decisions with self-service analytics • Random one-off jobs for this and that • Use AI everywhere!
  • 3. The ideal solution … • Easy to explain to your boss how it works • Tooling available • Résumé friendly • Extensible / customizable / flexible • Scalable • People want to feel productive SQL in Fusion!
  • 4. Data Ingest = Project Friction • Bespoke, search-driven applications > general purpose dashboard tools • Getting data in continues to be a hassle / friction when getting started • Need something nimble but also fast / scalable • For every connector, there’s probably 20 SQL / NoSQL data silos
  • 5. Fusion’s Parallel Bulk Loader • Get to the fun stuff faster! • Complement Fusion’s connectors for those dirty ETL jobs that cause friction in every project • High performance parallel reads from structured data sources, including Cassandra, Elastic, HBase, JDBC, Hadoop, … • Basic ETL tasks with SQL and/or custom Scala • ML Model predictions as UDF • Direct to Solr for optimal speed or send to index pipelines for optimal flexibility
  • 6. A foundation built on SparkSQL • Expose structured data as a DataFrame: RDD + schema • 100’s of data sources + formats • spark-solr translates Solr query results to a DataFrame • Highly optimized parallel reads, with predicate pushdown across a Spark cluster • Spark optimizes the SQL query plan • 100’s of built-in functions
  • 7. Demo: Parallel Bulk Loader Parallel Bulk Loader
  • 8. Read parquet from S3 Write to a Fusion Index Pipeline Advanced transforms with Scala Transform with SQL Add job dependencies On-the-fly
  • 9. User Feedback to Improve Relevancy • MRR is sub-optimal for many queries? • Want to boost some docs based on user click behavior (per query) • Older clicks should age out over time • Some user actions are more important than others: click < cart add < purchase • Sometimes you need to join signals with other tables, e.g. item metadata • Hide complex business logic behind UDF / UDAF (pluggable) • Designed for change!
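A minimal Python sketch of the aggregation idea on this slide: weight each signal by its type and let older signals age out, then sum per query / doc pair. The weights, the 30-day half-life, and the function names are illustrative assumptions, not Fusion's built-in job; the slide only states the ordering click < cart add < purchase.

```python
from collections import defaultdict

# Hypothetical weights; the slide only gives the ordering click < cart add < purchase.
SIGNAL_WEIGHTS = {"click": 1.0, "cart_add": 2.0, "purchase": 4.0}
# Assumed half-life: a signal counts half as much after this many days.
HALF_LIFE_DAYS = 30.0


def decayed_weight(signal_type, age_days, half_life_days=HALF_LIFE_DAYS):
    """Weight of one signal after exponential time decay."""
    return SIGNAL_WEIGHTS[signal_type] * 0.5 ** (age_days / half_life_days)


def aggregate_signals(signals, half_life_days=HALF_LIFE_DAYS):
    """Sum decayed weights per (query, doc_id).

    signals: iterable of (query, doc_id, signal_type, age_days) tuples.
    Returns a dict mapping (query, doc_id) to a boost weight.
    """
    totals = defaultdict(float)
    for query, doc_id, signal_type, age_days in signals:
        totals[(query, doc_id)] += decayed_weight(signal_type, age_days, half_life_days)
    return dict(totals)
```

In a real job the same logic would more likely live in a SQL aggregation with a UDF / UDAF, as the slide suggests; the Python version only makes the arithmetic explicit.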
  • 10. Signal Data Flow in Fusion
  • 11. Demo: Parallel Bulk Loader SQL Aggregation
  • 12. Join with other tables Custom UDAF Final output to Solr
  • 13. Window Functions (lag window function)
      WITH sessions AS (
        SELECT *,
               sum(IF(diff_secs > 30, 1, 0))
                 OVER (PARTITION BY clientip ORDER BY ts) session_id
        FROM (
          SELECT *,
                 unix_timestamp(ts) - lag(unix_timestamp(ts))
                   OVER (PARTITION BY clientip ORDER BY ts) as diff_secs
          FROM ${inputCollection}
          WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
            AND verb IS NOT NULL AND response IS NOT NULL
        )
      )
      SELECT concat_ws('||', clientip, session_id) as id,
             first(clientip) as clientip,
             min(ts) as session_start,
             max(ts) as session_end,
             timediff(max(ts), min(ts), "MILLISECONDS") as session_len_ms_l,
             sum(bytes) as total_bytes_l,
             count(*) as total_requests_l
      FROM sessions
      GROUP BY clientip, session_id
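The sessionization logic in this query can be hard to read at first, so here is a rough pure-Python equivalent. It assumes timestamps are already numeric seconds and that events are (clientip, ts, bytes) tuples; the function name and the output keys are illustrative. As in the SQL, a gap of more than 30 seconds since the client's previous request starts a new session, and the first request of each client lands in session 0.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_secs=30):
    """Group events into per-client sessions split on inactivity gaps.

    events: iterable of (clientip, ts_secs, nbytes) tuples.
    Returns one dict per session, ordered by clientip then time.
    """
    sessions = []
    # Equivalent of PARTITION BY clientip ORDER BY ts.
    ordered = sorted(events, key=itemgetter(0, 1))
    for clientip, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        prev_ts = None
        current = None
        for _, ts, nbytes in rows:
            # Equivalent of the running sum over IF(diff_secs > 30, 1, 0).
            if prev_ts is not None and ts - prev_ts > gap_secs:
                session_id += 1
                current = None
            if current is None:
                current = {
                    "id": f"{clientip}||{session_id}",
                    "clientip": clientip,
                    "session_start": ts,
                    "session_end": ts,
                    "total_bytes": 0,
                    "total_requests": 0,
                }
                sessions.append(current)
            current["session_end"] = ts
            current["total_bytes"] += nbytes
            current["total_requests"] += 1
            prev_ts = ts
    return sessions
```

The SQL version scales out across a Spark cluster; this loop is only meant to show what the lag and running-sum windows compute.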
  • 14. SQL Aggregations Scalability • Aggregate 42M signals into 11M groups (query / doc_id) • ~18 mins on 3 node EC2 cluster (r3.xlarge) • Mostly I/O from/to Solr
  • 15. Why Self-service Analytics? • Powerful connectors, relevance, speed, and massive scalability = more mission-critical datasets finding their way into Fusion • Don’t be another data silo! • Let users ask questions of this data using their tool of choice w/o adding work for the IT group! • Aggregations over full-text ranked results • But it has to be fast else you’re right back to data warehousing problems
  • 16. Self-service Analytics • Fusion SQL is a JDBC service that supports SQL • Fusion SQL plugs into Apache Spark’s query planner to translate SQL into optimized Solr queries (streaming expressions and JSON facets) • Integrate with popular BI tools like Tableau, PowerBI, and Spotfire + Notebooks like Apache Zeppelin
  • 17.
  • 18. Demo: Parallel Bulk Loader Self-Service Analytics
  • 19.
  • 20. Self-service Analytics Performance • The Periscope Data blog compared its SQL engine against common DBs using a count distinct query typical for dashboards • 14M logs, 1200 distinct dashboards, 1700 distinct user_id/dashboard_id pairs • Replicated the experiment with Fusion on EC2 (m1.xlarge), single instance of Solr • Fusion: ~900 ms • 28M rows: ~1.3 secs https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
  • 21. Self-service Analytics Performance
      SELECT m.title as title, agg.aggCount as aggCount
      FROM movies m
      INNER JOIN (
        SELECT movie_id, COUNT(*) as aggCount
        FROM ratings
        WHERE rating >= 4
        GROUP BY movie_id
        ORDER BY aggCount desc
        LIMIT 10) as agg
      ON agg.movie_id = m.id
      ORDER BY aggCount DESC
    Movielens data: Aggregate 20M ratings • 20M rows • Fusion SQL: ~1.1 secs • MySQL: 17 secs (w/ index on movie_id) https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
  • 22. Experiments • Run live experiments to try out new ideas and compare outcomes between variants • Built-in metrics: MRR, avg|min|max response time, CTR … and you guessed it! SQL • Bayesian Bandits to explore/exploit the best performing variant
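To make the explore/exploit idea concrete, here is a small Python sketch of one common Bayesian bandit technique, Beta-Bernoulli Thompson sampling, where each variant's click-through rate gets a Beta posterior. This is a generic textbook formulation with made-up CTRs and function names; the slides do not say which bandit algorithm Fusion uses.

```python
import random


def thompson_select(successes, failures, rng):
    """Pick a variant by sampling each variant's Beta posterior once.

    Uses a uniform Beta(1, 1) prior, so the posterior for a variant is
    Beta(successes + 1, failures + 1).
    """
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_experiment(true_ctrs, rounds, seed=42):
    """Simulate traffic routing; returns how many requests each variant received.

    true_ctrs are hypothetical click-through rates, unknown to the bandit.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_ctrs)
    failures = [0] * len(true_ctrs)
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        # Simulated user feedback: a click happens with the variant's true CTR.
        if rng.random() < true_ctrs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]
```

Early on the variants receive similar traffic (exploration); as evidence accumulates, most requests go to the variant with the best observed outcome (exploitation).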
  • 23. Demo: Parallel Bulk Loader Experiment Metrics
  • 24. Recap • How to build powerful SQL aggregations with joins, custom UDF/UDAF, and window functions to power boosting and recommendations • Ingesting data from data sources using SQL for ETL, ML • Self-service analytics from popular BI visualization tools • Measure outcomes between variants in an experiment using SQL https://github.com/lucidworks/fusion-spark-bootcamp
  • 25. Top 10 Things you can do with SQL in Fusion 1. Aggregate signals by query / doc / user to compute boost weights and generate recommendations 2. Ingest & ETL from 100’s of data sources using SparkSQL 3. Use ML models to generate predictions and Lucene text analysis using UDF functions 4. Join data from multiple Solr collections and data sources 5. Self-service analytics with BI tools like Tableau and PowerBI 6. Hide complex business logic behind UDF / UDAF 7. Use window functions for tasks like sessionization 8. Grouping sets and cubes for advanced analytic reporting 9. Compute KPIs across variants in an experiment 10. Expose complex Solr streaming expressions as simple SQL views
  • 26. Thank you! Timothy Potter Manager Smart Data, Lucidworks @thelabdude #Activate18 #ActivateSearch

Editor's Notes

  1. How are you going to get all this done? In Fusion, we chose SQL as the foundational technology to solve many of these issues.
  2. So I think we’re all pretty clear on the scope of the problem, but what might the ideal solution look like? Audience poll: - How many know SQL and have used it in some fashion in the last year - How many have integrated with some sort of SQL database with search today
  3. One of the amazing things about App Studio is that you can rapidly build bespoke search applications w/o creating another data silo! Getting data indexed is not the end goal of a project; it is an impediment on most projects that adds friction and distracts us from the important stuff (queries / visualization). Organizations are really good at provisioning data silos. To let people ask new questions of your data, they need access across many data sources. SQL and NoSQL databases are everywhere! Need something nimble to go grab data from multiple places. Connectors are great for complex business apps like SharePoint and Box, but for every SharePoint there are 100 SQL / NoSQL databases in a modern org.
  4. SQL lets Spark create an optimized query plan, which sometimes we know how to optimize further for Solr. Typically built by experts. NoSQL: Cassandra, HBase, Hive, Mongo; S3, HDFS, parquet; Search: Solr, Elastic; RDBMS: JDBC, Redshift, Hive; Azure, Google Analytics.
  5. Ingest data from S3. Invoke an ML model to do NLP stuff. Do some basic ETL with SQL.
  6. Just a placeholder slide for what is shown in the demo
  7. Spark function reference: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/index.html
  8. See: https://doc.lucidworks.com/fusion-ai/4.0/user-guide/signals/signals.html
  9. Fusion’s built-in click signal SQL aggregation job; time-decay function; custom UDF (price bucketing); custom SQL job: sessionization of logs with window functions.
  10. Just a placeholder slide for what is shown in the demo
  11. Just a placeholder to show another example of a SQL agg job, this time with a window function to find sessions.
  12. The traditional problem with self-service analytics is speed, flexibility, scalability. A whole that’s greater than the sum of its parts.
  13. Push down the computation of an aggregated query into Solr for maximum performance. Or, pull rows into Spark from Solr to perform most any analytics task.
  14. At step 1, a Fusion data analyst is authenticated by the JDBC/ODBC client application (e.g. Spotfire or Tableau) using Kerberos. Once authenticated, the user’s SQL query is sent to the Fusion SQL Thriftserver over HTTP (step 2 in the diagram). The SQL Thriftserver uses the service principal keytab to validate the incoming user identity using Kerberos (step 3).

  The Fusion SQL Thriftserver is a Spark application with a specific number of CPU cores and memory allocated from the pool of Spark resources. You can scale out the number of Spark worker nodes to increase available memory and CPU resources to the SQL service.

  The Thriftserver sends the query to Spark to be parsed into a logical query plan (step 4). During the query planning stage, Spark sends the logical plan to Fusion’s pushdown strategy component (step 5). The pushdown strategy analyzes the query plan to determine if there is an optimal Solr query / streaming expression that can “push down” aggregations into Solr to improve performance and scalability. For instance, the following SQL query can be translated into a Solr facet query by the Fusion pushdown strategy: select count(1) as the_count, movie_id from ratings group by movie_id

  The basic idea behind Fusion’s pushdown strategy is that it is much faster to let Solr facets perform basic aggregations than it is to export raw documents from Solr and have Spark perform the aggregation. If an optimal pushdown query is not possible, then Spark will pull raw documents from Solr and then perform any joins / aggregations needed in Spark. Put simply, the Fusion SQL service tries to translate SQL queries into optimized Solr queries but, failing that, the service simply reads all matching docs for a query into Spark and performs the SQL execution logic across the Spark cluster.

  During pushdown analysis, Fusion calls out to the registered AuthzFilterProvider implementation to get a filter query to perform row-level filtering for the Kerberos authenticated user (step 6). By default there is no row-level security provider, but users can install their own implementation using the Fusion SQL service API.

  Lastly, a distributed Solr query gets executed by Spark to return documents that satisfy the SQL query criteria and row-level security filter (step 7). To leverage the distributed nature of Spark and Solr, Fusion SQL sends a query to all replicas for each shard in a Solr collection. Consequently, you can scale out SQL query performance by adding more Spark and/or Solr resources to your cluster.
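  As a rough illustration of the pushdown described in this note, the Python sketch below builds a Solr JSON Facet API request that could stand in for the group-by query on movie_id, and maps the facet buckets back to SQL-style rows. The helper names and the row-level filter value are hypothetical, and the request shape is only an assumption about what such a translation might look like; it is not Fusion's actual pushdown output.

```python
def facet_pushdown_request(group_field, row_filter=None):
    """Build a Solr JSON request body that counts documents per group_field value.

    Stands in for: select count(1) as the_count, <group_field> from ... group by <group_field>
    row_filter is an optional Solr filter query, e.g. from a row-level security provider.
    """
    request = {
        "query": "*:*",
        "limit": 0,  # no raw documents needed, only facet buckets
        "facet": {
            group_field: {"type": "terms", "field": group_field, "limit": -1}
        },
    }
    if row_filter:
        request["filter"] = [row_filter]
    return request


def rows_from_facet_response(response, group_field):
    """Convert terms-facet buckets from a Solr response into SQL-style rows."""
    buckets = response["facets"][group_field]["buckets"]
    return [{group_field: b["val"], "the_count": b["count"]} for b in buckets]
```

  The point of the sketch is the trade-off the note describes: the facet request returns one small bucket per group, whereas the fallback path has to export every matching document to Spark before aggregating.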
  15. Show connecting to Fusion SQL from Tableau (or maybe Apache Superset). Build a simple data visualization on-the-fly.
  16. Just a placeholder slide for what is shown in the demo
  17. Avg. time on site / # of interactions per variant. Show results in App Insights.