AI from your data lake
Using Solr for analytics
Who we are
Cassandra Targett
Lucene/Solr Committer & PMC
Director of Engineering at
Lucidworks
Solr and HDP Search
Development
Marcelline Saunders
Director of Global Partner
Enablement at Lucidworks
Lucidworks is the primary sponsor
of the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
Based in San Francisco
Offices in Bangalore, Bangkok,
New York City, London, Raleigh
Over 400 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open source user
conference for Lucene/Solr (now also AI!)
Visit activate-conf.com for more information & to register
About Solr
Solr is the most popular search engine available today
Built on Lucene
Open source
Scalable
Distributed
Flexible
Extensible
Search Features:
● Admin UI
● Facets
● Hit highlights
● Multiple languages
● Spell check, autocomplete
What is HDP Search?
Developed by Lucidworks
Built & Distributed by
Hortonworks
Add-on package for HDP, which
includes:
● Apache Solr
● HDFS, Hive and Pig Connectors
● Ambari MPack for Solr
● Banana
● Documentation
HDP Search (diagram labels: SerDe, Job Jar, Data Files)
AI Features in Solr
● Streaming Expressions
○ Math programming syntax
○ Train regression models
○ Classify results of a search
○ Parallel processing
○ Graph Traversal
○ Parallel SQL
● Learning-to-Rank
● Analytics Component
Streaming Expressions
Powerful stream processing language for Solr
● Suite of functions to query,
transform, and aggregate your
data
● Functions can be nested to
perform multiple tasks in one
request
● Work across your entire
dataset
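As a rough illustration of nesting, the sketch below composes a rollup around a search as a plain string and builds a request URL for Solr's /stream request handler. The host, port and the techproducts collection are assumptions for a local test install, and the two helper functions are hypothetical. The code only builds the URL and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed local Solr install and collection; adjust for your cluster.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "techproducts"


def call(name, *args):
    """Render a streaming-expression function call such as search(a,b="c")."""
    return f"{name}({','.join(args)})"


def param(key, value):
    """Render a named parameter with a quoted value, e.g. q="*:*"."""
    return f'{key}="{value}"'


# Inner function: a stream source that emits tuples sorted by color.
inner = call(
    "search",
    COLLECTION,
    param("q", "*:*"),
    param("fl", "id,color"),
    param("sort", "color asc"),
)

# Outer function: a decorator that wraps the source and counts per color.
expr = call("rollup", inner, param("over", "color"), "count(*)")

# The /stream handler takes the whole expression in the "expr" parameter.
url = f"{SOLR_BASE}/{COLLECTION}/stream?{urlencode({'expr': expr})}"

# Decoding the query string recovers the expression unchanged.
decoded = parse_qs(urlsplit(url).query)["expr"][0]

print(expr)
print(url)
```

Both tasks (the search and the aggregation) travel in one request because the inner function is simply an argument of the outer one.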
What Can You Do?
● Request/response stream processing
● Batch stream processing
● Fast interactive MapReduce
● Aggregations (pushed down faceted and shuffling MapReduce)
● Parallel relational algebra (distributed joins, intersections, unions, complements)
● Publish/subscribe messaging
● Distributed graph traversal
● Machine learning and parallel iterative model training
● Anomaly detection
● Recommendation systems
● Retrieve and rank services
● Text classification and feature extraction
● Streaming NLP
● Build your own!
Stream Sources
output -> tuples
Stream Sources originate streams (of tuples).
● search
● jdbc
● echo
● facet
● features
● nodes
● knn
● model
● random
● significantTerms
● shortestPath
● shuffle
● stats
● timeseries
● train
● topic
● tuple
Stream Decorators
input -> tuples
output -> tuples
● cartesianProduct
● classify
● commit
● complement
● daemon
● eval
● executor
● fetch
● having
● leftOuterJoin
● hashJoin
● innerJoin
● intersect
● merge
● null
● outerHashJoin
● parallel
● priority
● reduce
● rollup
● scoreNodes
● select
● sort
● top
● unique
● update
Stream Decorators wrap other stream functions or
perform operations on a stream (of tuples).
Stream Evaluators
input -> parameter
(possibly from a field in a
tuple)
output -> parameter
(possibly from a field in a
tuple)
● analyze
● abs
● add
● div
● log
● mult
● sub
● pow
● mod
● ceil
● floor
● sin
● asin
● sinh
● cos
● acos
● atan
● round
● sqrt
● cbrt
● and
● eq
● eor
● gteq
● gt
● if
● lteq
● lt
● not
● or
● raw
● sample
Stream Evaluators are functions that evaluate
parameters and return a result. These can be used
to transform values inside the tuples in a streaming
expression, or can be used independently.
● regress
● predict
● standardize
● distance
● kmeans
● timeseries
● monteCarlo
● cumulativeProbability
● betaDistribution
● termVectors
● matrix
● rowCount
● mean
● describe
● percentile
● cov
...and many MORE
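To make the three roles concrete, here is a minimal in-memory Python emulation of a source, two decorators and two evaluators. The functions are simplified stand-ins that borrow names from the lists above. They do not reproduce Solr's exact semantics, and the documents and field names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, float]  # one tuple: field name -> value


def source(rows: Iterable[Row]) -> Iterator[Row]:
    """Stream source: originates tuples (here from an in-memory list)."""
    for row in rows:
        yield dict(row)  # copy so the input documents stay unchanged


def add(a: str, b: str) -> Callable[[Row], float]:
    """Evaluator: reads two fields from a tuple and returns their sum."""
    return lambda row: row[a] + row[b]


def mult(a: str, factor: float) -> Callable[[Row], float]:
    """Evaluator: multiplies a field by a constant."""
    return lambda row: row[a] * factor


def select(stream: Iterator[Row], **outputs: Callable[[Row], float]) -> Iterator[Row]:
    """Decorator: wraps a stream and adds one field per evaluator."""
    for row in stream:
        for name, evaluator in outputs.items():
            row[name] = evaluator(row)
        yield row


def having(stream: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Decorator: passes through only tuples that satisfy the predicate."""
    return (row for row in stream if predicate(row))


docs = [
    {"price": 10.0, "shipping": 2.0},
    {"price": 40.0, "shipping": 5.0},
    {"price": 3.0, "shipping": 1.0},
]

# Source -> select (with evaluators) -> having, nested like an expression.
pipeline = having(
    select(source(docs), total=add("price", "shipping"), discounted=mult("price", 0.5)),
    lambda row: row["total"] > 10.0,
)
result = list(pipeline)
print(result)
```

The source takes no stream as input, each decorator takes a stream and returns a stream, and each evaluator works on values inside a single tuple.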
Streaming Expression Examples
● Parallel Batch Processing
● Train a Logistic Regression Model
● Distributed Joins
● Pull Results from External Database
● Classify Search Results
● Rapid Export of all Search Results
Sources:
https://lucene.apache.org/solr/guide/streaming-expressions.html
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
Parallel SQL
● SQL interface for writing streaming expressions
● Statements are parsed to proper streaming expression syntax
● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY,
LIMIT, etc.
rollup(
search(techproducts,q="*:*",fl="id,color",sort="color asc"),
over="color", count(*))
SELECT count(*) from techproducts
WHERE _text_='(*:*)' GROUP BY color
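The rollup in the example depends on the sort order of the inner search: it aggregates one run of equal values at a time, so the stream must be sorted on the field named in over. The Python sketch below emulates that behaviour in memory. The documents are hypothetical and the two functions are stand-ins, not Solr code.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical documents standing in for the techproducts collection.
docs = [
    {"id": "1", "color": "red"},
    {"id": "2", "color": "blue"},
    {"id": "3", "color": "red"},
    {"id": "4", "color": "green"},
    {"id": "5", "color": "blue"},
    {"id": "6", "color": "red"},
]


def search(rows, sort):
    """Stand-in for search(...): emit tuples ordered by the sort field."""
    return iter(sorted(rows, key=itemgetter(sort)))


def rollup(stream, over):
    """Stand-in for rollup(..., over=..., count(*)).

    Counts one run of equal values at a time, so it only yields one row
    per value when the incoming stream is already sorted on `over`.
    """
    for value, run in groupby(stream, key=itemgetter(over)):
        yield {over: value, "count(*)": sum(1 for _ in run)}


sorted_counts = list(rollup(search(docs, sort="color"), over="color"))
unsorted_counts = list(rollup(iter(docs), over="color"))

print(sorted_counts)         # one row per color
print(len(unsorted_counts))  # more rows than colors: runs were split
```

With the sorted stream the output has one row per color, which is what GROUP BY color asks for. Without the sort, the same color shows up in several partial rows.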
Graph Traversal
● Part of Solr’s broader Streaming
Expressions capability
● Implements a powerful, breadth-first
traversal
● Works across shards AND collections
● Supports aggregations
● Cycle aware
● Ability to both traverse AND score
nodes within the graph
Graph Traversal - Syntax
All movies that user "trey" watched
gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i")
All movies that viewers of a specific movie watched
gatherNodes(movielens,
gatherNodes(movielens,walk="123->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
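The two expressions above can be read as one-hop and two-hop breadth-first walks. The Python sketch below emulates them over a handful of hypothetical in-memory documents. The gather_nodes helper is a stand-in written for this example, and treating the root movie as already visited is an assumption about how cycle awareness applies here.

```python
# Hypothetical "watched" documents, one per (user, movie) pair.
docs = [
    {"user_id_i": 1, "user_name_s": "trey", "movie_id_i": 123},
    {"user_id_i": 1, "user_name_s": "trey", "movie_id_i": 456},
    {"user_id_i": 2, "user_name_s": "ana", "movie_id_i": 123},
    {"user_id_i": 2, "user_name_s": "ana", "movie_id_i": 789},
    {"user_id_i": 3, "user_name_s": "li", "movie_id_i": 999},
]


def gather_nodes(rows, roots, walk, gather, visited):
    """One breadth-first hop.

    Finds rows whose `walk` field matches a root and collects their
    `gather` field, skipping nodes already seen earlier in the traversal.
    """
    wanted = set(roots)
    found = []
    for row in rows:
        key = (gather, row[gather])
        if row[walk] in wanted and key not in visited:
            visited.add(key)
            found.append(row[gather])
    return found


# All movies that user "trey" watched (one hop).
treys_movies = gather_nodes(docs, ["trey"], "user_name_s", "movie_id_i", set())

# All movies that viewers of movie 123 watched (two hops).
visited = {("movie_id_i", 123)}  # assumption: the root counts as visited
viewers = gather_nodes(docs, [123], "movie_id_i", "user_id_i", visited)
also_watched = gather_nodes(docs, viewers, "user_id_i", "movie_id_i", visited)

print(treys_movies)
print(viewers)
print(also_watched)
```

The inner hop produces the nodes that the outer hop walks from, which mirrors how the nested gatherNodes call feeds node values into walk="node->user_id_i".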
Graph Traversal - Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find 3-star hotels in NYC my friends stayed in
last year
Learning to Rank (LTR)
Rank query results based on trained models
Traditional relevance ranking uses algorithms that score how well the
user's query terms match the terms in a document (TF/IDF, BM25)
LTR allows you to rank results for user queries according to trained
models stored in Solr (trained outside Solr)
Factors for training data:
● Implicit: clicks, time spent on page, historical sales, previously
viewed documents
● Explicit: human judgement
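A minimal sketch of the idea, assuming a linear model: results arrive in their traditional relevance order, and a trained model re-scores only the top N using per-document features. The weights, feature names and documents below are hypothetical, and this is an in-memory emulation rather than Solr's LTR code.

```python
# Hypothetical feature weights, as if exported from a model trained outside Solr.
WEIGHTS = {"bm25": 1.0, "clicks": 0.5, "recency": 2.0}

# Hypothetical first-pass results, already ordered by BM25 score.
results = [
    {"id": "a", "bm25": 9.0, "clicks": 0.0, "recency": 0.1},
    {"id": "b", "bm25": 8.0, "clicks": 6.0, "recency": 0.5},
    {"id": "c", "bm25": 7.0, "clicks": 1.0, "recency": 0.9},
    {"id": "d", "bm25": 1.0, "clicks": 50.0, "recency": 1.0},
]


def model_score(doc):
    """Linear model: weighted sum of the document's feature values."""
    return sum(weight * doc[name] for name, weight in WEIGHTS.items())


def rerank(docs, top_n):
    """Re-order only the first top_n documents by model score."""
    head = sorted(docs[:top_n], key=model_score, reverse=True)
    return head + docs[top_n:]


reranked = [doc["id"] for doc in rerank(results, top_n=3)]
print(reranked)
```

Document "d" has by far the most clicks but stays last, because it falls outside the top 3 that the model is asked to re-score.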
Analytics Component
Calculate complex statistical aggregations over result sets.
Expressions, functions and groupings of data from your documents:
● Expressions: calculations to perform over the result set to
return a single value
● Functions: variables re-used in expressions or groupings
● Groupings: facets, which can include functions or expressions
neg, round, ceil, if, gt, lt, add, sub, div, sum, count,
unique, percentile, date, concat, log, pow, mean, min, max
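The relationship between expressions and groupings can be sketched in a few lines of Python: an expression reduces a set of documents to a single value, and a grouping evaluates the same expressions once per facet bucket. The documents, field names and helper functions are hypothetical. This emulates the concept only and is not the Analytics Component's request syntax.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result set.
docs = [
    {"category": "memory", "price": 10.0},
    {"category": "memory", "price": 30.0},
    {"category": "disk", "price": 100.0},
    {"category": "disk", "price": 50.0},
    {"category": "disk", "price": 60.0},
]

# Expressions: each reduces a list of documents to a single value.
EXPRESSIONS = {
    "count": len,
    "mean_price": lambda rows: mean(r["price"] for r in rows),
    "max_price": lambda rows: max(r["price"] for r in rows),
}


def evaluate(rows):
    """Apply every expression to one set of documents."""
    return {name: fn(rows) for name, fn in EXPRESSIONS.items()}


def grouping(rows, field):
    """Bucket documents by a field value, then evaluate per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[field]].append(row)
    return {value: evaluate(bucket) for value, bucket in buckets.items()}


overall = evaluate(docs)
by_category = grouping(docs, "category")

print(overall)
print(by_category)
```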
Tools for Analytics & Visualization
Search Driven Analytics
Motivation
- Go beyond full text search
- Self-service exploration of data
- Provide tools for analysts to mine data without having to
understand query languages
- Create views of data for users
Why SQL with Search?
● Known query language
● Eliminates re-training users on proprietary tools and query
languages
● Third party BI tools use JDBC/ODBC
● Leverage powerful full text search
● Join Solr collections
● Join Solr collections with other data sources
Analytics Visualization tools
Banana (available with HDP Search)
Solr 6.0+ (Solr SQL)
- Apache Zeppelin
Lucidworks Fusion (Spark SQL - Solr SQL)
- Tableau
- Apache Zeppelin
- Jupyter
- Any third party product that supports JDBC/ODBC
Lucidworks Fusion App Insights
Banana Dashboards
Provided with HDP
Search
Easily create
dashboards for a Solr
collection
Based on facet queries
Requires basic
knowledge of Solr
Banana Dashboards
Zeppelin Integration
FusionSQL - Using Spark and Solr together
Tableau: Solr Collections look like tables
Join across Solr Collections
Fusion - Tableau: Self Service BI/Analytics
Tableau Analytics
SQL Benefits with Solr/Fusion
• Leverage existing BI tools like Tableau and Zeppelin
• Add full-text search and advanced Solr AI features to your SQL query
• Ranking by relevance
• Joins across collections
• Fast and responsive queries at scale
• Ask interesting questions of your data
Fusion App Insights
• Customizable dashboards to visualize
Query Analytics.
• Built-in analytics reports based on
Fusion AI Smart jobs for analyzing query
performance.
• Experiment analysis to give you
feedback on how search variants are
performing.
• Thorough analytics on users, sessions,
and all interactions (signals)
Resources
Solr Reference Guide:
● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html
● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients
● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin
Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau
https://lucidworks.com/2017/02/01/sql-in-fusion-3/
Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/
Questions?

AI from your data lake: Using Solr for analytics

  • 1. AI from your data lake Using Solr for analytics
  • 2. Who we are Cassandra Targett Lucene/Solr Committer & PMC Director of Engineering at Lucidworks Solr and HDP Search Development Marcelline Saunders Director of Global Partner Enablement at Lucidworks
  • 3. Lucidworks is the primary sponsor of the Apache Solr project Employs over 40% of the active committers on the Solr project Contributes over 70% of Solr's open source codebase 40% 70% Based in San Francisco Offices in Bangalore, Bangkok, New York City, London, Raleigh Over 400 customers across the Fortune 1000 Fusion, a Solr-powered platform for search-driven apps Consulting and support for organizations using Solr Produces the world’s largest open source user conference for Lucene/Solr (now also AI!)
  • 4. Visit activate-conf.com for more information & to register
  • 5. About Solr Solr is the most popular search engine available today Built on Lucene Open source Scalable Distributed Flexible Extensible Search Features: ● Admin UI ● Facets ● Hit highlights ● Multiple languages ● Spell check, auto- complete
  • 6. What is HDP Search? Developed by Lucidworks Built & Distributed by Hortonworks Add-on package for HDP, which includes: ● Apache Solr ● HDFS, Hive and Pig Connectors ● Ambari MPack for Solr ● Banana ● Documentation
  • 8. AI Features in Solr ● Streaming Expressions ○ Math programming syntax ○ Train regression models ○ Classify results of a search ○ Parallel processing ○ Graph Traversal ○ Parallel SQL ● Learning-to-Rank ● Analytics Component
  • 9. Streaming Expressions Powerful stream processing language for Solr ● Suite of functions to query, transform, and aggregate your data ● Functions can be nested to perform multiple tasks in one request ● Work across your entire dataset
  • 10. ● Request/response stream processing ● Batch stream processing ● Fast interactive MapReduce ● Aggregations (pushed down faceted and shuffling MapReduce) ● Parallel relational algebra (distributed joins, intersections, unions, complements) ● Publish/subscribe messaging ● Distributed graph traversal ● Machine learning and parallel iterative model training ● Anomaly detection ● Recommendation systems ● Retrieve and rank services ● Text classification and feature extraction ● Streaming NLP ● Build your own! What Can You Do?
  • 11. Stream Sources output -> tuples Streaming Sources originate streams (of tuples). ● search ● jdbc ● echo ● facet ● features ● nodes ● knn ● model ● random ● significantTerms ● shortestPath ● shuffle ● stats ● timeseries ● train ● topic ● tuple
  • 12. Stream Decorators input -> tuples output -> tuples ● cartesianProduct ● classify ● commit ● complement ● daemon ● eval ● executor ● fetch ● having ● leftOuterJoin ● hashJoin ● innerJoin ● intersect ● merge ● null ● outerHashJoin ● parallel ● priority ● reduce ● rollup ● scoreNodes ● select ● sort ● top ● unique ● update Stream Decorators wrap other stream functions or perform operations on a stream (of tuples).
  • 13. Stream Evaluators input -> parameter (possibly from a field in a tuple) output -> parameter (possibly from a field in a tuple) ● analyze ● abs ● add ● div ● log ● mult ● sub ● pow ● mod ● ceil ● floor ● sin ● asin ● sinh ● cos ● acos ● atan ● round ● sqrt ● cbrt ● and ● eq ● eor ● gteq ● gt ● if ● lteq ● lt ● not ● or ● raw ● sample Stream Evaluators are functions that evaluate parameters and return a result. These can be used to transform values inside the tuples in a streaming expression, or can be used independently. ● regress ● predict ● standardize ● distance ● kmeans ● timeseries ● monteCarlo ● cumulativeProbablity ● betaDistribution ● termVectors ● matrix ● rowCount ● mean ● describe ● percentile ● cov ...and many MORE
  • 14. Parallel Batch Processing Train a Logistic Regression Model Distributed Joins Pull Results from External Database Sources: https://lucene.apache.org/solr/guide/streaming-expressions.html http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html Classify Search Results Rapid Export of all Search Results Streaming Expression Examples
  • 15. Parallel SQL ● SQL interface for writing streaming expressions ● Statements are parsed to proper streaming expression syntax ● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY, LIMIT, etc. rollup (search (techproducts,q=”*:*”,fl=”id,color”,sort=”color asc”), over=”color”, count(*)) SELECT count(*) from techproducts WHERE _text_=’(*:*)’ GROUP BY color
  • 16. Graph Traversal ● Part of Solr’s broader Streaming Expressions capability ● Implements a powerful, breadth-first traversal ● Works across shards AND collections ● Supports aggregations ● Cycle aware ● Ability to both traverse AND score nodes within the graph
  • 17. Graph Traversal - Syntax All movies that user "trey" watched gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i") All movies that viewers of a specific movie watched gatherNodes(movielens, gatherNodes(movielens,walk="123->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true" )
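To make the nested walk concrete, here is a small in-memory sketch of the same two-hop idea: first gather the users who watched movie 123, then gather every movie those users watched. It is only a conceptual model with made-up data and a hypothetical `gather_nodes` helper; it does not reproduce Solr's distributed execution, cycle detection, or scoring.

```python
def gather_nodes(docs, roots, walk, gather):
    """One breadth-first hop: match `roots` against the `walk` field
    and collect the distinct values of the `gather` field."""
    roots = set(roots)
    return {doc[gather] for doc in docs if doc[walk] in roots}

# Hypothetical "movielens"-style documents: one per (user, movie) view.
docs = [
    {"user_id_i": 1, "movie_id_i": 123},
    {"user_id_i": 1, "movie_id_i": 200},
    {"user_id_i": 2, "movie_id_i": 123},
    {"user_id_i": 2, "movie_id_i": 300},
    {"user_id_i": 3, "movie_id_i": 400},
]

# Inner hop: viewers of movie 123.
viewers = gather_nodes(docs, [123], walk="movie_id_i", gather="user_id_i")
# Outer hop: every movie those viewers watched.
movies = gather_nodes(docs, viewers, walk="user_id_i", gather="movie_id_i")
```

In this naive version the starting movie appears in the result again, which is the kind of revisit that cycle awareness is meant to handle.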
  • 18. Graph Traversal - Use Cases • Anomaly detection / fraud detection • Recommenders • Social network analysis • Graph Search • Access Control • Relationship discovery / scoring Examples o Find all draft blog posts about “Parallel SQL” written by a developer o Find all tweets mentioning “Solr” by me or people I follow o Find 3-star hotels in NYC my friends stayed in last year
  • 19. Learning to Rank (LTR) Rank query results based on trained models Traditional relevance ranking uses algorithms that score how well the user's query terms match the terms in a document (TF/IDF, BM25) LTR lets you rank results for user queries according to trained models stored in Solr (the models are trained outside Solr) Factors for training data: ● Implicit: clicks, time spent on page, historical sales, previously viewed documents ● Explicit: human judgement
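The re-ranking idea can be sketched in a few lines, assuming a simple linear model: the top N results from the original query are re-scored as a weighted sum of features and re-sorted, while the rest keep their order. The feature names, weights, and documents below are invented for illustration and are not taken from Solr's LTR module.

```python
def rerank(docs, weights, top_n):
    """Re-score the first `top_n` docs with a linear model and re-sort them;
    docs beyond `top_n` keep their original order."""
    def score(doc):
        return sum(w * doc["features"].get(name, 0.0) for name, w in weights.items())
    head = sorted(docs[:top_n], key=score, reverse=True)
    return head + docs[top_n:]

# Hypothetical model weights, as if learned offline from click data.
weights = {"bm25": 1.0, "clicks": 0.5}

# Results in their original (text relevance) order.
results = [
    {"id": "a", "features": {"bm25": 3.0, "clicks": 0.0}},
    {"id": "b", "features": {"bm25": 2.0, "clicks": 4.0}},
    {"id": "c", "features": {"bm25": 1.0, "clicks": 10.0}},
]

reranked_ids = [doc["id"] for doc in rerank(results, weights, top_n=2)]
```

Document "c" would score highest under the model but stays last because it falls outside the re-ranked window, which shows the trade-off in choosing N.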
  • 20. Analytics Component Calculate complex statistical aggregations over result sets. Expressions, functions and groupings of data from your documents: ● Expressions: calculations to perform over the result set to return a single value ● Functions: variables re-used in expressions or groupings ● Groupings: facets, which can include functions or expressions neg, round, ceil, if, gt, lt, add, sub, div, sum, count, unique, percentile, date, concat, log, pow, mean, min, max
  • 21. Tools for Analytics & Visualization
  • 22. Search Driven Analytics Motivation - Go beyond full text search - Self-service exploration of data - Provide tools for analysts to mine data without having to understand query languages - Create views of data for users
  • 23. Why SQL with Search? ● Known query language ● Eliminates re-training users on proprietary tools and query languages ● Third party BI tools use JDBC/ODBC ● Leverage powerful full text search ● Join Solr collections ● Join Solr collections with other data sources
  • 24. Analytics Visualization tools Banana (available with HDP Search) Solr 6.0+ (Solr SQL) - Apache Zeppelin Lucidworks Fusion (Spark SQL - Solr SQL) - Tableau - Apache Zeppelin - Jupyter - Any third party product that supports JDBC/ODBC Lucidworks Fusion App Insights
  • 25. Banana Dashboards Provided with HDP Search Easily create dashboards for a Solr collection Based on facet queries Requires basic knowledge of Solr
  • 29. FusionSQL - Using Spark and Solr together
  • 30. Tableau: Solr Collections look like tables
  • 31. Join across Solr Collections
  • 32. Fusion - Tableau: Self Service BI/Analytics
  • 34. SQL Benefits with Solr/Fusion • Leverage existing BI tools like Tableau and Zeppelin • Add full-text search and advanced Solr AI features to your SQL query • Ranking by relevance • Joins across collections • Fast and responsive queries at scale • Ask interesting questions of your data
  • 35. Fusion App Insights • Customizable dashboards to visualize Query Analytics. • Built-in Analytics reports based on Fusion AI Smart jobs for analyzing query performance. • Experiment analysis to give you feedback on how search variants are performing. • Thorough analytics on users, sessions, and all interactions (signals)
  • 37. Resources Solr Reference Guide: ● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html ● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients ● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau: https://lucidworks.com/2017/02/01/sql-in-fusion-3/ Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/