AI from your data lake
Using Solr for analytics
Who we are
Cassandra Targett
Lucene/Solr Committer & PMC
Director of Engineering at
Lucidworks
Solr and HDP Search
Development
Marcelline Saunders
Director of Global Partner
Enablement at Lucidworks
Lucidworks is the primary sponsor
of the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
Based in San Francisco
Offices in Bangalore, Bangkok,
New York City, London, Raleigh
Over 400 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open source user
conference for Lucene/Solr (now also AI!)
Visit activate-conf.com for more information & to register
About Solr
Solr is the most popular search engine available today
Built on Lucene
Open source
Scalable
Distributed
Flexible
Extensible
Search Features:
● Admin UI
● Facets
● Hit highlights
● Multiple languages
● Spell check, autocomplete
What is HDP Search?
Developed by Lucidworks
Built & Distributed by
Hortonworks
Add-on package for HDP, which
includes:
● Apache Solr
● HDFS, Hive and Pig Connectors
● Ambari MPack for Solr
● Banana
● Documentation
[Diagram: HDP Search (labels: SerDe, Job Jar, Data Files)]
AI Features in
Solr
● Streaming Expressions
○ Math programming syntax
○ Train regression models
○ Classify results of a search
○ Parallel processing
○ Graph Traversal
○ Parallel SQL
● Learning-to-Rank
● Analytics Component
Streaming Expressions
Powerful stream processing language for Solr
● Suite of functions to query,
transform, and aggregate your
data
● Functions can be nested to
perform multiple tasks in one
request
● Work across your entire
dataset
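The nesting described above can be sketched in Python as plain string composition. This is a minimal sketch, not a Solr client: the helper names `call`, `param`, and `stream_url` are hypothetical, and the host, port, and `techproducts` collection are assumptions. It only builds the expression and the URL for the collection's `/stream` handler; it sends nothing.

```python
from urllib.parse import urlencode


def param(key, value):
    """Render a named parameter such as q="*:*" (hypothetical helper)."""
    return f'{key}="{value}"'


def call(name, *operands):
    """Render one streaming-expression function call (hypothetical helper)."""
    return f"{name}({', '.join(operands)})"


def stream_url(base_url, collection, expression):
    """Build the URL for the /stream handler; base_url is an assumption."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expression})}"


# Inner function: a stream source that queries the collection.
inner = call(
    "search",
    "techproducts",
    param("q", "*:*"),
    param("fl", "id,color"),
    param("sort", "color asc"),
)

# Outer function: a decorator that wraps the source and aggregates it.
nested = call("rollup", inner, param("over", "color"), "count(*)")

url = stream_url("http://localhost:8983/solr", "techproducts", nested)
print(nested)
print(url)
```

Because each function call is rendered to a string, a source can be wrapped by any number of decorators before the single request is built.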
What Can You Do?
● Request/response stream processing
● Batch stream processing
● Fast interactive MapReduce
● Aggregations (pushed down faceted and shuffling MapReduce)
● Parallel relational algebra (distributed joins, intersections, unions, complements)
● Publish/subscribe messaging
● Distributed graph traversal
● Machine learning and parallel iterative model training
● Anomaly detection
● Recommendation systems
● Retrieve and rank services
● Text classification and feature extraction
● Streaming NLP
● Build your own!
Stream Sources
output -> tuples
Stream Sources originate streams (of
tuples).
● search
● jdbc
● echo
● facet
● features
● nodes
● knn
● model
● random
● significantTerms
● shortestPath
● shuffle
● stats
● timeseries
● train
● topic
● tuple
Stream
Decorators
input -> tuples
output -> tuples
● cartesianProduct
● classify
● commit
● complement
● daemon
● eval
● executor
● fetch
● having
● leftOuterJoin
● hashJoin
● innerJoin
● intersect
● merge
● null
● outerHashJoin
● parallel
● priority
● reduce
● rollup
● scoreNodes
● select
● sort
● top
● unique
● update
Stream Decorators wrap other stream functions or
perform operations on a stream (of tuples).
Stream
Evaluators
input -> parameter
(possibly from a field in a
tuple)
output -> parameter
(possibly from a field in a
tuple)
● analyze
● abs
● add
● div
● log
● mult
● sub
● pow
● mod
● ceil
● floor
● sin
● asin
● sinh
● cos
● acos
● atan
● round
● sqrt
● cbrt
● and
● eq
● eor
● gteq
● gt
● if
● lteq
● lt
● not
● or
● raw
● sample
Stream Evaluators are functions that evaluate
parameters and return a result. These can be used
to transform values inside the tuples in a streaming
expression, or can be used independently.
● regress
● predict
● standardize
● distance
● kmeans
● timeseries
● monteCarlo
● cumulativeProbability
● betaDistribution
● termVectors
● matrix
● rowCount
● mean
● describe
● percentile
● cov
...and many MORE
Streaming Expression Examples
Parallel Batch Processing
Train a Logistic Regression Model
Distributed Joins
Pull Results from External Database
Classify Search Results
Rapid Export of all Search Results
Sources: https://lucene.apache.org/solr/guide/streaming-expressions.html http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
Parallel SQL
● SQL interface for writing streaming expressions
● Statements are parsed to proper streaming expression syntax
● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY,
LIMIT, etc.
rollup(
  search(techproducts, q="*:*", fl="id,color", sort="color asc"),
  over="color", count(*))
SELECT count(*) FROM techproducts
WHERE _text_='(*:*)' GROUP BY color
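A SQL statement like the one above is sent to the collection's `/sql` request handler in the `stmt` parameter. The sketch below only constructs the HTTP request with the Python standard library and does not send it. The host, port, collection, and the helper name `sql_request` are assumptions, and the statement selects `color` alongside the count on the assumption that the grouped field is wanted in the output.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sql_request(base_url, collection, statement, aggregation_mode="facet"):
    """Build (but do not send) a POST to the /sql handler.

    base_url and collection are assumptions for illustration.
    """
    query = urlencode({"aggregationMode": aggregation_mode})
    body = urlencode({"stmt": statement}).encode("utf-8")
    return Request(f"{base_url}/{collection}/sql?{query}", data=body, method="POST")


req = sql_request(
    "http://localhost:8983/solr",
    "techproducts",
    "SELECT color, count(*) FROM techproducts GROUP BY color",
)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would execute it against a running Solr instance; that step is left out so the sketch runs without a server.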
Graph Traversal
● Part of Solr’s broader Streaming
Expressions capability
● Implements a powerful, breadth-first
traversal
● Works across shards AND collections
● Supports aggregations
● Cycle aware
● Ability to both traverse AND score
nodes within the graph
Graph Traversal - Syntax
All movies that user "trey" watched
gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i")
All movies that viewers of a specific movie watched
gatherNodes(movielens,
gatherNodes(movielens,walk="123->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
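To make the two-hop traversal above concrete, here is an in-memory Python analogue of what each `gatherNodes` step does: match documents on the `walk` field, then collect the values of the `gather` field. The viewing records and the helper name `gather` are hypothetical. Unlike Solr, this sketch applies no cycle detection, so the starting movie shows up again in the second hop.

```python
def gather(docs, roots, walk_field, gather_field):
    """One hop: find docs whose walk_field is in roots, collect gather_field."""
    return {doc[gather_field] for doc in docs if doc[walk_field] in roots}


# Hypothetical viewing records, one document per (user, movie) pair.
views = [
    {"user_id_i": 1, "movie_id_i": 123},
    {"user_id_i": 2, "movie_id_i": 123},
    {"user_id_i": 1, "movie_id_i": 456},
    {"user_id_i": 2, "movie_id_i": 789},
    {"user_id_i": 3, "movie_id_i": 999},
]

# Hop 1: users who watched movie 123.
viewers = gather(views, {123}, "movie_id_i", "user_id_i")

# Hop 2: every movie those users watched (breadth-first, one level at a time).
also_watched = gather(views, viewers, "user_id_i", "movie_id_i")

print(sorted(viewers))       # [1, 2]
print(sorted(also_watched))  # [123, 456, 789]
```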
Graph Traversal - Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find 3-star hotels in NYC my friends stayed in
last year
Learning to Rank (LTR)
Rank query results based on trained models
Traditional relevance ranking uses algorithms that score how well the
user's query terms match terms in the document (TF/IDF, BM25)
LTR allows you to rank results for user queries according to models
that are trained outside Solr and then stored in Solr
Factors for training data:
● Implicit: clicks, time spent on page, historical sales, previously
viewed documents
● Explicit: human judgement
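The idea of re-ranking with a trained model can be illustrated with a small in-memory sketch. It assumes a linear model (a weighted sum of features) and re-scores only the top N results of the original ranking. The feature names, weights, and documents are made up for illustration, and this is not Solr's LTR implementation.

```python
def linear_score(weights, features):
    """Weighted sum of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rerank(docs, weights, top_n):
    """Re-order the first top_n docs by model score, keep the rest as ranked."""
    head = sorted(
        docs[:top_n],
        key=lambda doc: linear_score(weights, doc["features"]),
        reverse=True,
    )
    return head + docs[top_n:]


# Hypothetical model weights, e.g. learned offline from click data.
weights = {"bm25": 1.0, "clicks": 0.5}

# Documents in their original relevance order.
results = [
    {"id": "a", "features": {"bm25": 3.0, "clicks": 0.0}},
    {"id": "b", "features": {"bm25": 2.5, "clicks": 4.0}},
    {"id": "c", "features": {"bm25": 2.0, "clicks": 1.0}},
    {"id": "d", "features": {"bm25": 1.0, "clicks": 10.0}},
]

reranked = rerank(results, weights, top_n=3)
print([doc["id"] for doc in reranked])  # ['b', 'a', 'c', 'd']
```

Document `d` has the highest model score but stays last because it falls outside the re-ranked window, which is the usual trade-off when only the top results are re-scored.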
Analytics Component
Calculate complex statistical aggregations over result sets.
Expressions, functions and groupings of data from your documents:
● Expressions: calculations to perform over the result set to
return a single value
● Functions: variables re-used in expressions or groupings
● Groupings: facets, which can include functions or expressions
neg, round, ceil, if, gt, lt, add, sub, div, sum, count,
unique, percentile, date, concat, log, pow, mean, min, max
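The difference between an expression and a grouping can be shown with an in-memory Python analogue. This is not the Analytics Component's request syntax; the documents and field names are hypothetical. It shows one expression (`mean`) returning a single value over the whole result set, and the same expression evaluated once per facet value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result set.
docs = [
    {"category": "memory", "price": 10.0},
    {"category": "memory", "price": 30.0},
    {"category": "drive", "price": 50.0},
]

# Expression: one value computed over the entire result set.
overall_mean = mean(doc["price"] for doc in docs)

# Grouping: the same expression computed per facet value.
prices_by_category = defaultdict(list)
for doc in docs:
    prices_by_category[doc["category"]].append(doc["price"])
grouped_mean = {cat: mean(prices) for cat, prices in prices_by_category.items()}

print(overall_mean)  # 30.0
print(grouped_mean)  # {'memory': 20.0, 'drive': 50.0}
```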
Tools for Analytics & Visualization
Search Driven Analytics
Motivation
- Go beyond full text search
- Self-service exploration of data
- Provide tools for analysts to mine data without having to
understand query languages
- Create views of data for users
Why SQL with Search?
● Known query language
● Eliminates re-training users on proprietary tools and query
languages
● Third party BI tools use JDBC/ODBC
● Leverage powerful full text search
● Join Solr collections
● Join Solr collections with other data sources
Analytics Visualization tools
Banana (available with HDP Search)
Solr 6.0+ (Solr SQL)
- Apache Zeppelin
Lucidworks Fusion (Spark SQL - Solr SQL)
- Tableau
- Apache Zeppelin
- Jupyter
- Any third party product that supports JDBC/ODBC
Lucidworks Fusion App Insights
Banana Dashboards
Provided with HDP
Search
Easily create
dashboards for a Solr
collection
Based on facet queries
Requires basic
knowledge of Solr
Banana Dashboards
Zeppelin Integration
FusionSQL - Using Spark and Solr together
Tableau: Solr Collections look like tables
Join across Solr Collections
Fusion - Tableau: Self Service BI/Analytics
Tableau Analytics
SQL Benefits with Solr/Fusion
• Leverage existing BI tools like Tableau and Zeppelin
• Add full-text search and advanced Solr AI features to your SQL query
• Ranking by relevance
• Joins across collections
• Fast and responsive queries at scale
• Ask interesting questions of your data
Fusion App Insights
• Customizable dashboards to visualize query analytics
• Built-in analytics reports, based on Fusion AI Smart jobs, for analyzing query performance
• Experiment analysis to give you feedback on how search variants are performing
• Thorough analytics on users, sessions, and all interactions (signals)
Resources
Solr Reference Guide:
● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html
● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients
● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin
Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau: https://lucidworks.com/2017/02/01/sql-in-fusion-3/
Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/
Questions?

AI from your data lake: Using Solr for analytics

  • 1. AI from your data lake Using Solr for analytics
  • 2. Who we are Cassandra Targett Lucene/Solr Committer & PMC Director of Engineering at Lucidworks Solr and HDP Search Development Marcelline Saunders Director of Global Partner Enablement at Lucidworks
  • 3. Lucidworks is the primary sponsor of the Apache Solr project Employs over 40% of the active committers on the Solr project Contributes over 70% of Solr's open source codebase 40% 70% Based in San Francisco Offices in Bangalore, Bangkok, New York City, London, Raleigh Over 400 customers across the Fortune 1000 Fusion, a Solr-powered platform for search-driven apps Consulting and support for organizations using Solr Produces the world’s largest open source user conference for Lucene/Solr (now also AI!)
  • 4. Visit activate-conf.com for more information & to register
  • 5. About Solr Solr is the most popular search engine available today Built on Lucene Open source Scalable Distributed Flexible Extensible Search Features: ● Admin UI ● Facets ● Hit highlights ● Multiple languages ● Spell check, auto- complete
  • 6. What is HDP Search? Developed by Lucidworks Built & Distributed by Hortonworks Add-on package for HDP, which includes: ● Apache Solr ● HDFS, Hive and Pig Connectors ● Ambari MPack for Solr ● Banana ● Documentation
  • 8. AI Features in Solr ● Streaming Expressions ○ Math programming syntax ○ Train regression models ○ Classify results of a search ○ Parallel processing ○ Graph Traversal ○ Parallel SQL ● Learning-to-Rank ● Analytics Component
  • 9. Streaming Expressions Powerful stream processing language for Solr ● Suite of functions to query, transform, and aggregate your data ● Functions can be nested to perform multiple tasks in one request ● Work across your entire dataset
  • 10. ● Request/response stream processing ● Batch stream processing ● Fast interactive MapReduce ● Aggregations (pushed down faceted and shuffling MapReduce) ● Parallel relational algebra (distributed joins, intersections, unions, complements) ● Publish/subscribe messaging ● Distributed graph traversal ● Machine learning and parallel iterative model training ● Anomaly detection ● Recommendation systems ● Retrieve and rank services ● Text classification and feature extraction ● Streaming NLP ● Build your own! What Can You Do?
  • 11. Stream Sources output -> tuples Streaming Sources originate streams (of tuples). ● search ● jdbc ● echo ● facet ● features ● nodes ● knn ● model ● random ● significantTerms ● shortestPath ● shuffle ● stats ● timeseries ● train ● topic ● tuple
  • 12. Stream Decorators input -> tuples output -> tuples ● cartesianProduct ● classify ● commit ● complement ● daemon ● eval ● executor ● fetch ● having ● leftOuterJoin ● hashJoin ● innerJoin ● intersect ● merge ● null ● outerHashJoin ● parallel ● priority ● reduce ● rollup ● scoreNodes ● select ● sort ● top ● unique ● update Stream Decorators wrap other stream functions or perform operations on a stream (of tuples).
  • 13. Stream Evaluators input -> parameter (possibly from a field in a tuple) output -> parameter (possibly from a field in a tuple) ● analyze ● abs ● add ● div ● log ● mult ● sub ● pow ● mod ● ceil ● floor ● sin ● asin ● sinh ● cos ● acos ● atan ● round ● sqrt ● cbrt ● and ● eq ● eor ● gteq ● gt ● if ● lteq ● lt ● not ● or ● raw ● sample Stream Evaluators are functions that evaluate parameters and return a result. These can be used to transform values inside the tuples in a streaming expression, or can be used independently. ● regress ● predict ● standardize ● distance ● kmeans ● timeseries ● monteCarlo ● cumulativeProbablity ● betaDistribution ● termVectors ● matrix ● rowCount ● mean ● describe ● percentile ● cov ...and many MORE
  • 14. Parallel Batch Processing Train a Logistic Regression Model Distributed Joins Pull Results from External Database Sources: https://lucene.apache.org/solr/guide/streaming-expressions.html http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html Classify Search Results Rapid Export of all Search Results Streaming Expression Examples
  • 15. Parallel SQL ● SQL interface for writing streaming expressions ● Statements are parsed to proper streaming expression syntax ● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY, LIMIT, etc. rollup (search (techproducts,q=”*:*”,fl=”id,color”,sort=”color asc”), over=”color”, count(*)) SELECT count(*) from techproducts WHERE _text_=’(*:*)’ GROUP BY color
  • 16. Graph Traversal ● Part of Solr’s broader Streaming Expressions capability ● Implements a powerful, breadth-first traversal ● Works across shards AND collections ● Supports aggregations ● Cycle aware ● Ability to both traverse AND score nodes within the graph
  • 17. Graph Traversal - Syntax All movies that user "trey" watched gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i") All movies that viewers of a specific movie watched gatherNodes(movielens, gatherNodes(movielens,walk="123->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true" )
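The nesting in the second expression is easier to see when the expression is composed programmatically. This is a rough sketch: `gather_nodes` is a hypothetical helper (not part of Solr or any client library) that only builds the expression string shown on the slide.

```python
def gather_nodes(collection, source, walk, gather, **options):
    """Compose a gatherNodes(...) streaming expression as a string.

    `source` is a nested expression string, or None when the walk starts
    from a literal root value such as "123". Hypothetical helper.
    """
    parts = [collection]
    if source is not None:
        parts.append(source)  # inner traversal feeds the outer one
    parts.append(f'walk="{walk}"')
    parts.append(f'gather="{gather}"')
    # Extra named parameters, e.g. trackTraversal="true".
    parts.extend(f'{name}="{value}"' for name, value in options.items())
    return "gatherNodes(" + ",".join(parts) + ")"


# Step 1: users who watched movie 123.
viewers = gather_nodes(
    "movielens", None, walk="123->movie_id_i", gather="user_id_i"
)

# Step 2: all movies those users watched.
also_watched = gather_nodes(
    "movielens",
    viewers,
    walk="node->user_id_i",
    gather="movie_id_i",
    trackTraversal="true",
)
print(also_watched)
```

The resulting string would be sent as the `expr` parameter to the collection's `/stream` handler.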
  • 18. Graph Traversal - Use Cases • Anomaly detection / fraud detection • Recommenders • Social network analysis • Graph Search • Access Control • Relationship discovery / scoring Examples o Find all draft blog posts about “Parallel SQL” written by a developer o Find all tweets mentioning “Solr” by me or people I follow o Find 3-star hotels in NYC my friends stayed in last year
  • 19. Learning to Rank (LTR) Rank query results based on trained models Traditional relevance ranking uses algorithms that score how well user query terms match terms in the document (TF/IDF, BM25) LTR allows you to rank results for user queries according to trained models stored in Solr (trained outside Solr) Factors for training data: ● Implicit: clicks, time spent on page, historical sales, previously viewed documents ● Explicit: human judgement
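Once a model has been uploaded, it is applied at query time by re-ranking the top results. A minimal sketch of the query parameters, assuming a model has already been stored under the name `myModel` (the model name, the example query, and the helper `ltr_query_params` are assumptions for illustration):

```python
from urllib.parse import urlencode


def ltr_query_params(user_query, model, rerank_docs=100):
    """Build /select parameters that re-rank the top hits with an LTR model.

    Hypothetical helper; the model must already exist in Solr's model store.
    """
    return {
        "q": user_query,
        # rq applies the ltr query parser to the top `rerank_docs` results.
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
        "fl": "id,score",
    }


ltr_params = ltr_query_params("solr analytics", "myModel")
print(urlencode(ltr_params))
```

Only the top `reRankDocs` documents from the original query are re-scored by the model; the rest keep their original order.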
  • 20. Analytics Component Calculate complex statistical aggregations over result sets. Expressions, functions and groupings of data from your documents: ● Expressions: calculations to perform over the result set to return a single value ● Functions: variables re-used in expressions or groupings ● Groupings: facets, which can include functions or expressions neg, round, ceil, if, gt, lt, add, sub, div, sum, count, unique, percentile, date, concat, log, pow, mean, min, max
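An Analytics Component request is expressed as JSON passed in the `analytics` parameter of a normal query. The sketch below assumes a collection with `price` and `color` fields; the field names, the result labels, and the helper `analytics_params` are illustrative assumptions.

```python
import json


def analytics_params(query="*:*"):
    """Build /select parameters carrying an Analytics Component request.

    Hypothetical helper; field names price and color are assumed.
    """
    request = {
        # Expressions: each reduces the whole result set to a single value.
        "expressions": {
            "max_price": "max(price)",
            "mean_price": "mean(price)",
        },
        # Groupings: the same kind of expressions, computed per facet bucket.
        "groupings": {
            "by_color": {
                "expressions": {"mean_price": "mean(price)"},
                "facets": {
                    "color": {"type": "value", "expression": "color"}
                },
            }
        },
    }
    # rows=0 because only the aggregations are wanted, not the documents.
    return {"q": query, "rows": 0, "analytics": json.dumps(request)}


an_params = analytics_params()
print(an_params["analytics"])
```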
  • 21. Tools for Analytics & Visualization
  • 22. Search Driven Analytics Motivation - Go beyond full text search - Self-service exploration of data - Provide tools for analysts to mine data without having to understand query languages - Create views of data for users
  • 23. Why SQL with Search? ● Known query language ● Eliminates re-training users on proprietary tools and query languages ● Third party BI tools use JDBC/ODBC ● Leverage powerful full text search ● Join Solr collections ● Join Solr collections with other data sources
  • 24. Analytics Visualization tools Banana (available with HDP Search) Solr 6.0 + (Solr SQL) - Apache Zeppelin Lucidworks Fusion (Spark SQL - Solr SQL) - Tableau - Apache Zeppelin - Jupyter - Any third party product that supports JDBC/ODBC Lucidworks Fusion App Insights
  • 25. Banana Dashboards Provided with HDP Search Easily create dashboards for a Solr collection Based on facet queries Requires basic knowledge of Solr
  • 29. FusionSQL - Using Spark and Solr together
  • 30. Tableau: Solr Collections look like tables
  • 31. Join across Solr Collections
  • 32. Fusion - Tableau: Self Service BI/Analytics
  • 34. SQL Benefits with Solr/Fusion • Leverage existing BI tools like Tableau and Zeppelin • Add full-text search and advanced Solr AI features to your SQL query • Ranking by relevance • Joins across collections • Fast and responsive queries at scale • Ask interesting questions of your data
  • 35. Fusion App Insights • Customizable dashboards to visualize Query Analytics. • Built-in Analytics reports based on Fusion AI Smart jobs for analyzing query performance. • Experiment analysis to give you feedback on how search variants are performing. • Thorough analytics on users, sessions, and all interactions (signals)
  • 37. Resources Solr Reference Guide: ● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html ● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients ● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau: https://lucidworks.com/2017/02/01/sql-in-fusion-3/ Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/