Introductory technical session on Apache Solr's (HDP Search) artificial intelligence and machine learning features to discover relationships and insights across big data in the enterprise. Discussions will include how Solr performs graph traversal, anomaly detection, NLP and time-series analysis, and how you can display this data to users with easy-to-create dashboards.
This technical session will review Apache Solr’s streaming expressions, which were introduced in Solr 6.5. With over 100 expressions and evaluators, conditional logic, variables, and data structures, these functions form the basis of a new paradigm that brings many features of the relational world into search. Together they amount to a powerful functional programming language that enables many parallel computing use cases, such as anomaly detection, streaming NLP, graph traversal, and time-series analysis.
Third-party tools for discovering and analyzing big data, such as Jupyter, Tableau, and Lucidworks Insights, will also be reviewed.
Speaker
Cassandra Targett, Lucidworks, Director of Engineering
Marcelline Saunders, Lucidworks, Director, Global Partner Enablement
2. Who we are
Cassandra Targett
Lucene/Solr Committer & PMC
Director of Engineering at Lucidworks
Solr and HDP Search Development
Marcelline Saunders
Director of Global Partner Enablement at Lucidworks
3. Lucidworks is the primary sponsor of the Apache Solr project
Employs over 40% of the active committers on the Solr project
Contributes over 70% of Solr's open source codebase
Based in San Francisco
Offices in Bangalore, Bangkok, New York City, London, Raleigh
Over 400 customers across the Fortune 1000
Fusion, a Solr-powered platform for search-driven apps
Consulting and support for organizations using Solr
Produces the world’s largest open source user conference for Lucene/Solr (now also AI!)
5. About Solr
Solr is the most popular search engine available today
Built on Lucene
Open source
Scalable
Distributed
Flexible
Extensible
Search Features:
● Admin UI
● Facets
● Hit highlights
● Multiple languages
● Spell check, auto-complete
6. What is HDP Search?
Developed by Lucidworks
Built & Distributed by
Hortonworks
Add-on package for HDP, which
includes:
● Apache Solr
● HDFS, Hive and Pig Connectors
● Ambari MPack for Solr
● Banana
● Documentation
8. AI Features in Solr
● Streaming Expressions
○ Math programming syntax
○ Train regression models
○ Classify results of a search
○ Parallel processing
○ Graph Traversal
○ Parallel SQL
● Learning-to-Rank
● Analytics Component
9. Streaming Expressions
Powerful stream processing language for Solr
● Suite of functions to query, transform, and aggregate your data
● Functions can be nested to perform multiple tasks in one request
● Work across your entire dataset
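The nesting of functions can be sketched in Python by composing expression strings and passing the result to a collection's /stream handler through the `expr` parameter. This is a minimal sketch under stated assumptions: the helper names (`search`, `rollup`, `stream_url`) and the host `localhost:8983` are hypothetical, the `techproducts` collection is borrowed from the Parallel SQL slide, and the code only builds the request URL without contacting Solr.

```python
from urllib.parse import urlencode


def search(collection, **params):
    """Render a search() source; keyword arguments become name="value" pairs."""
    rendered = ", ".join(f'{name}="{value}"' for name, value in params.items())
    return f"search({collection}, {rendered})"


def rollup(stream, over, *metrics):
    """Render a rollup() decorator that wraps another expression."""
    return f'rollup({stream}, over="{over}", {", ".join(metrics)})'


def stream_url(base, collection, expr):
    """Build a /stream request URL (hypothetical host), URL-encoding the expression."""
    return f"{base}/solr/{collection}/stream?{urlencode({'expr': expr})}"


# Inner function: emits tuples sorted by color.
inner = search("techproducts", q="*:*", fl="id,color", sort="color asc")
# Outer function: consumes the inner stream and counts tuples per color.
expr = rollup(inner, "color", "count(*)")

url = stream_url("http://localhost:8983", "techproducts", expr)
print(expr)
print(url)
```

Because each function returns plain text, any expression can be used as the input stream of another, which is how a single request can perform several tasks.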
10. What Can You Do?
● Request/response stream processing
● Batch stream processing
● Fast interactive MapReduce
● Aggregations (pushed down faceted and shuffling MapReduce)
● Parallel relational algebra (distributed joins, intersections, unions, complements)
● Publish/subscribe messaging
● Distributed graph traversal
● Machine learning and parallel iterative model training
● Anomaly detection
● Recommendation systems
● Retrieve and rank services
● Text classification and feature extraction
● Streaming NLP
● Build your own!
13. Stream Evaluators
input -> parameter (possibly from a field in a tuple)
output -> parameter (possibly from a field in a tuple)
● analyze
● abs
● add
● div
● log
● mult
● sub
● pow
● mod
● ceil
● floor
● sin
● asin
● sinh
● cos
● acos
● atan
● round
● sqrt
● cbrt
● and
● eq
● eor
● gteq
● gt
● if
● lteq
● lt
● not
● or
● raw
● sample
Stream Evaluators are functions that evaluate parameters and return a result. They can be used to transform values inside the tuples in a streaming expression, or can be used independently.
● regress
● predict
● standardize
● distance
● kmeans
● timeseries
● monteCarlo
● cumulativeProbability
● betaDistribution
● termVectors
● matrix
● rowCount
● mean
● describe
● percentile
● cov
...and many MORE
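To make the input/output idea concrete, the following Python sketch models a few evaluators (`mult`, `gt`, `if`) as functions whose parameters are either literals, field names looked up in the current tuple, or other evaluators. It is a toy model for illustration, not Solr's implementation; the field names `price` and `quantity`, the sample tuples, and the helper `resolve` are all hypothetical.

```python
def resolve(param, tup):
    """Evaluate a parameter: nested evaluator, field name in the tuple, or literal."""
    if callable(param):
        return param(tup)
    if isinstance(param, str) and param in tup:
        return tup[param]
    return param


def mult(a, b):
    """Model of mult(a, b): multiply two resolved parameters."""
    return lambda tup: resolve(a, tup) * resolve(b, tup)


def gt(a, b):
    """Model of gt(a, b): true when the first parameter is greater."""
    return lambda tup: resolve(a, tup) > resolve(b, tup)


def if_(condition, then, otherwise):
    """Model of if(cond, then, else); named if_ because 'if' is a Python keyword."""
    return lambda tup: resolve(then, tup) if resolve(condition, tup) else resolve(otherwise, tup)


tuples = [
    {"id": "1", "price": 10.0, "quantity": 3},
    {"id": "2", "price": 4.0, "quantity": 2},
]

total = mult("price", "quantity")            # reads two fields from each tuple
label = if_(gt(total, 20), "large", "small")  # evaluators nest like expressions

results = [{**t, "total": total(t), "label": label(t)} for t in tuples]
for row in results:
    print(row)
```

The same evaluator can also be called on its own with a single tuple, which mirrors the point that evaluators work both inside a stream and independently.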
14. Streaming Expression Examples
Parallel Batch Processing
Train a Logistic Regression Model
Distributed Joins
Pull Results from External Database
Classify Search Results
Rapid Export of all Search Results
Sources: https://lucene.apache.org/solr/guide/streaming-expressions.html http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
15. Parallel SQL
● SQL interface for writing streaming expressions
● Statements are parsed to proper streaming expression syntax
● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY,
LIMIT, etc.
rollup(
  search(techproducts, q="*:*", fl="id,color", sort="color asc"),
  over="color", count(*))
SELECT count(*) FROM techproducts
WHERE _text_='(*:*)' GROUP BY color
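The correspondence between GROUP BY and rollup can be illustrated with a small in-memory model. rollup expects its input stream to be sorted on the `over` field, which is why the search sorts by color; the sketch below relies on the same property by counting runs of consecutive tuples. The sample documents and the function name `rollup_count` are hypothetical, and this is a model of the idea rather than Solr's code.

```python
from itertools import groupby


def rollup_count(sorted_tuples, over):
    """Count consecutive tuples that share the same value of the 'over' field."""
    return [
        {over: key, "count(*)": sum(1 for _ in group)}
        for key, group in groupby(sorted_tuples, key=lambda t: t[over])
    ]


docs = [
    {"id": "1", "color": "red"},
    {"id": "2", "color": "blue"},
    {"id": "3", "color": "red"},
]

# Equivalent of sort="color asc": grouping by runs only works on sorted input.
stream = sorted(docs, key=lambda t: t["color"])
counts = rollup_count(stream, "color")
print(counts)
```

If the input were not sorted, the two "red" tuples would land in separate runs and be counted as separate groups, which is the reason for the sort in the expression.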
16. Graph Traversal
● Part of Solr’s broader Streaming
Expressions capability
● Implements a powerful, breadth-first
traversal
● Works across shards AND collections
● Supports aggregations
● Cycle aware
● Ability to both traverse AND score
nodes within the graph
17. Graph Traversal - Syntax
All movies that user "trey" watched
gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i")
All movies that viewers of a specific movie watched
gatherNodes(movielens,
gatherNodes(movielens,walk="123->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
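The nested traversal can be modeled in memory as two breadth-first hops, where the nodes gathered by the inner call become the roots of the outer call. This is a simplified sketch: the documents and the helper `gather_nodes` are hypothetical, and the model deliberately leaves out cycle tracking, `trackTraversal`, aggregations, and scoring.

```python
def gather_nodes(docs, roots, walk_field, gather_field):
    """One hop: match root nodes against walk_field and collect gather_field values."""
    roots = set(roots)
    return {doc[gather_field] for doc in docs if doc[walk_field] in roots}


# Hypothetical "movielens" documents: one document per (user, movie) viewing.
movielens = [
    {"user_id_i": 1, "movie_id_i": 123},
    {"user_id_i": 2, "movie_id_i": 123},
    {"user_id_i": 1, "movie_id_i": 456},
    {"user_id_i": 2, "movie_id_i": 789},
    {"user_id_i": 3, "movie_id_i": 999},
]

# Inner hop: walk="123->movie_id_i", gather="user_id_i"
viewers = gather_nodes(movielens, [123], "movie_id_i", "user_id_i")
# Outer hop: walk="node->user_id_i", gather="movie_id_i"
movies = gather_nodes(movielens, viewers, "user_id_i", "movie_id_i")

print(sorted(viewers))
print(sorted(movies))
```

In this toy model the starting movie also appears in the final set, because nothing records which nodes were already visited; user 3 and movie 999 are never reached because no viewer of movie 123 links to them.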
18. Graph Traversal - Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find 3-star hotels in NYC my friends stayed in
last year
19. Learning to Rank (LTR)
Rank query results based on trained models
Traditional relevance ranking uses algorithms that score how well the terms in a user query match the terms in a document (TF/IDF, BM25)
LTR allows you to rank results for user queries according to trained
models stored in Solr (trained outside Solr)
Factors for training data:
● Implicit: clicks, time spent on page, historical sales, previously
viewed documents
● Explicit: human judgement
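As a rough illustration of ranking with a trained model, the sketch below re-scores the top results of an initial ranking with a linear model, that is, a weighted sum of feature values. The feature names (`bm25`, `clicks`), the weights, and the documents are invented for illustration; in practice the weights would come from a model trained outside Solr on the implicit or explicit signals listed above. This is a stand-in for the idea, not Solr's LTR implementation.

```python
def rerank(results, weights, top_n):
    """Re-score the first top_n results with a linear model; keep the rest in order."""
    head, tail = results[:top_n], results[top_n:]

    def score(doc):
        # Linear model: sum of weight * feature value, missing features count as 0.
        return sum(w * doc["features"].get(name, 0.0) for name, w in weights.items())

    return sorted(head, key=score, reverse=True) + tail


# Initial ranking (for example, by BM25 alone), best first.
results = [
    {"id": "a", "features": {"bm25": 2.0, "clicks": 0.1}},
    {"id": "b", "features": {"bm25": 1.5, "clicks": 0.9}},
    {"id": "c", "features": {"bm25": 1.0, "clicks": 0.2}},
]

# Hypothetical trained weights: click behavior counts twice as much as BM25.
weights = {"bm25": 1.0, "clicks": 2.0}

reranked = rerank(results, weights, top_n=2)
print([doc["id"] for doc in reranked])
```

Only the top of the list is re-scored, which keeps the cost of applying the model bounded regardless of how many documents match the query.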
20. Analytics Component
Calculate complex statistical aggregations over result sets.
Expressions, functions and groupings of data from your documents:
● Expressions: calculations to perform over the result set to
return a single value
● Functions: variables re-used in expressions or groupings
● Groupings: facets, which can include functions or expressions
neg, round, ceil, if, gt, lt, add, sub, div, sum, count,
unique, percentile, date, concat, log, pow, mean, min, max
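The three building blocks can be pictured with a small in-memory model: a function is a reusable per-document calculation, an expression reduces the result set to a single value, and a grouping repeats that reduction per facet value. The documents, the field names (`category`, `price`, `quantity`), and the `sale` function are hypothetical; this sketches the concepts and is not the Analytics Component's request syntax.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result set.
docs = [
    {"category": "books", "price": 10.0, "quantity": 2},
    {"category": "books", "price": 5.0, "quantity": 4},
    {"category": "music", "price": 8.0, "quantity": 1},
]


def sale(doc):
    """Function: a per-document value that expressions and groupings can reuse."""
    return doc["price"] * doc["quantity"]


# Expression: one value computed over the whole result set.
overall_mean_sale = mean(sale(doc) for doc in docs)

# Grouping: the same expression computed separately for each facet value.
sales_by_category = defaultdict(list)
for doc in docs:
    sales_by_category[doc["category"]].append(sale(doc))
mean_sale_by_category = {cat: mean(vals) for cat, vals in sales_by_category.items()}

print(overall_mean_sale)
print(mean_sale_by_category)
```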
22. Search Driven Analytics
Motivation
- Go beyond full text search
- Self-service exploration of data
- Provide tools for analysts to mine data without having to
understand query languages
- Create views of data for users
23. Why SQL with Search?
● Known query language
● Eliminates re-training users on proprietary tools and query
languages
● Third party BI tools use JDBC/ODBC
● Leverage powerful full text search
● Join Solr collections
● Join Solr collections with other data sources
24. Analytics Visualization tools
Banana (available with HDP Search)
Solr 6.0+ (Solr SQL)
- Apache Zeppelin
Lucidworks Fusion (Spark SQL - Solr SQL)
- Tableau
- Apache Zeppelin
- Jupyter
- Any third party product that supports JDBC/ODBC
Lucidworks Fusion App Insights
25. Banana Dashboards
Provided with HDP
Search
Easily create
dashboards for a Solr
collection
Based on facet queries
Requires basic
knowledge of Solr
34. SQL Benefits with Solr/Fusion
• Leverage existing BI tools like Tableau and Zeppelin
• Add full-text search and advanced Solr AI features to your SQL query
• Ranking by relevance
• Joins across collections
• Fast and responsive queries at scale
• Ask interesting questions of your data
35. Fusion App Insights
• Customizable dashboards to visualize query analytics.
• Built-in analytics reports based on Fusion AI Smart jobs for analyzing query performance.
• Experiment analysis to give you feedback on how search variants are performing.
• Thorough analytics on users, sessions, and all interactions (signals).
37. Resources
Solr Reference Guide:
● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html
● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients
● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin
Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau
https://lucidworks.com/2017/02/01/sql-in-fusion-3/
Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/