SlideShare a Scribd company logo
Practical Machine Learning for
Smarter Search
with Solr and Spark
Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks
$ whoami
Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: Semantic Search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology
• Why Spark and Solr for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Example: ManyNewsgroups
• data exploration
• clustering: unsupervised ML
• classification: supervised ML
• recommender: collaborative filtering + content-based
• search ranking
Overview
Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?
Why do data engineering with Solr and Spark?
Solr Spark
• Data exploration and visualization
• Easy ingestion and feature
selection
• Powerful ranking features
• Quick and dirty classification and
clustering
• Simple operation and scaling
• Stats and math built in
• General purpose batch/streaming
compute engine
Whole collection analysis!
• Fast, large scale iterative
algorithms
• Advanced machine learning:
MLLib, Mahout, Deep Learning4j
• Lots of integrations with other big
data systems
Why does Spark need Solr?
Typical Hadoop / Spark data-engineering task, start with some data on
HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo
Now what? What’s in these files?
Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
• no more: tokens = str.toLowerCase().split(“s+“)
• relevancy / ranking
• realtime HTTP service layer
• Apache Lucene
• Grouping and Joins
• Stats, expressions,
transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete,
highlighting
• Cursors
• More Like This
• De-duplication
Why Spark for Solr?
• Spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly!
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank,
popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)
Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing,
integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …
• Initial exploration of ASF mailing-list archives
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
Example: Many NewsGroups
• Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into json
• facet:
• fields with low cardinality or with sensible ranges
• document size histogram
• projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a spark RDD via SolrRDD:
Many NewsGroups: Initial Exploration
• cleanup/filtering via spark DataFrame operations:
• create thread groups:
Many NewsGroups: Initial Exploration
• try other text analyzers: (no more str.split(“w+”)! )
Many NewsGroups: Initial Exploration
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
• Unsupervised machine learning:
• clustering documents with KMeans
• extract topics with Latent Dirichlet Allocation
• learn word vectors with Word2Vec
Many NewsGroups: Exploratory Data Science
• Vectorize and run KMeans:
Many NewsGroups: Exploratory Data Science
• Build topic models with LDA:
Many NewsGroups: Exploratory Data Science
• Build word vector representations with Word2Vec:
Many NewsGroups: Exploratory Data Science
• Now for some real Data Science:
Many NewsGroups: Supervised Learning
• What else could you do?
• Try other classification algs, cross-validate to pick!
• Recommender Systems
• content-based:
• mail-thread as “item”, head msgs grouped by
replier as “user” profile
• search query of users against items to recommend
• collaborative-filtering:
• users replying to a head msg “rate” them +-tively
• train a Spark ML ALS RecSys model
• Train search rankers in click logs
Many NewsGroups: Next steps?
Resources
• spark-solr: https://github.com/Lucidworks/spark-solr
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Apache Solr: http://lucene.apache.org/solr
• Apache Spark: http://spark.apache.org
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @pbrane

More Related Content

What's hot

ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
Daniel N
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
Bryan Yang
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
Amazon Web Services
 
Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!
Philips Kokoh Prasetyo
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Ricardo Peres
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
Jurriaan Persyn
 
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
South London Geek Nights
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
Kevin Watters
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
guest432cd6
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
Spark Summit
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
Stéphane Fréchette
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
Clifford James
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
Joey Wen
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
gagravarr
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
Andy Petrella
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
Sujit Pal
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
Eric Rodriguez (Hiring in Lex)
 

What's hot (19)

ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
 
Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 

Viewers also liked

Part 2 (machine learning overview) all machine learning is pattern search
Part 2 (machine learning overview)   all machine learning is pattern searchPart 2 (machine learning overview)   all machine learning is pattern search
Part 2 (machine learning overview) all machine learning is pattern search
International School of Engineering
 
Machine Learning for Search at LinkedIn
Machine Learning for Search at LinkedInMachine Learning for Search at LinkedIn
Machine Learning for Search at LinkedIn
Viet Ha-Thuc
 
Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
MLconf
 
Concept search for e commerce with solr
Concept search for e commerce with solrConcept search for e commerce with solr
Concept search for e commerce with solr
lucenerevolution
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
Charles Martin
 
eCommerce for Everyone: What to Expect in 2017 - State of Search
eCommerce for Everyone: What to Expect in 2017 - State of SearcheCommerce for Everyone: What to Expect in 2017 - State of Search
eCommerce for Everyone: What to Expect in 2017 - State of Search
Elizabeth Marsten
 
Learning to Rank Personalized Search Results in Professional Networks
Learning to Rank Personalized Search Results in Professional NetworksLearning to Rank Personalized Search Results in Professional Networks
Learning to Rank Personalized Search Results in Professional Networks
Viet Ha-Thuc
 
Machine Learning Search and SEO - Zenith; Duluth, MN.
Machine Learning Search and SEO - Zenith; Duluth, MN. Machine Learning Search and SEO - Zenith; Duluth, MN.
Machine Learning Search and SEO - Zenith; Duluth, MN.
Eric Enge
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
Tommaso Teofili
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 
Architecture of a search engine
Architecture of a search engineArchitecture of a search engine
Architecture of a search engine
Sylvain Utard
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm

Viewers also liked (13)

Part 2 (machine learning overview) all machine learning is pattern search
Part 2 (machine learning overview)   all machine learning is pattern searchPart 2 (machine learning overview)   all machine learning is pattern search
Part 2 (machine learning overview) all machine learning is pattern search
 
Machine Learning for Search at LinkedIn
Machine Learning for Search at LinkedInMachine Learning for Search at LinkedIn
Machine Learning for Search at LinkedIn
 
Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
 
Concept search for e commerce with solr
Concept search for e commerce with solrConcept search for e commerce with solr
Concept search for e commerce with solr
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
 
eCommerce for Everyone: What to Expect in 2017 - State of Search
eCommerce for Everyone: What to Expect in 2017 - State of SearcheCommerce for Everyone: What to Expect in 2017 - State of Search
eCommerce for Everyone: What to Expect in 2017 - State of Search
 
Learning to Rank Personalized Search Results in Professional Networks
Learning to Rank Personalized Search Results in Professional NetworksLearning to Rank Personalized Search Results in Professional Networks
Learning to Rank Personalized Search Results in Professional Networks
 
Machine Learning Search and SEO - Zenith; Duluth, MN.
Machine Learning Search and SEO - Zenith; Duluth, MN. Machine Learning Search and SEO - Zenith; Duluth, MN.
Machine Learning Search and SEO - Zenith; Duluth, MN.
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Architecture of a search engine
Architecture of a search engineArchitecture of a search engine
Architecture of a search engine
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 

Similar to Practical Machine Learning for Smarter Search with Spark+Solr

Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Lucidworks
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
Franz Inc. - AllegroGraph
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
Open Analytics
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
Thang Bui (Bob)
 
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data Science
Lucidworks
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
Dmitry Kan
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Domino Data Lab
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
Jesus Rodriguez
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
NoSQLmatters
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
Alkuvoima
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Lucidworks
 

Similar to Practical Machine Learning for Smarter Search with Spark+Solr (20)

Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Apache drill
Apache drillApache drill
Apache drill
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data Science
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
 

Recently uploaded

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 

Recently uploaded (20)

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

Practical Machine Learning for Smarter Search with Spark+Solr

  • 1. Practical Machine Learning for Smarter Search with Solr and Spark Jake Mannix @pbrane Lead Data Engineer, Lucidworks
  • 2. $ whoami Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D Previously: • Allen Institute for AI: Semantic Search on academic research publications • Twitter: account search, user interest modeling, content recommendations • LinkedIn: profile search, generic entity-to-entity recommender systems Prehistory: • other software companies, algebraic topology, particle cosmology
  • 3. • Why Spark and Solr for Data Engineering? • Quick intro to Solr • Quick intro to Spark • Example: ManyNewsgroups • data exploration • clustering: unsupervised ML • classification: supervised ML • recommender: collaborative filtering + content-based • search ranking Overview
  • 4. Practical Data Science with Spark and Solr Why does Solr need Spark? Why does Spark need Solr?
  • 5. Why do data engineering with Solr and Spark? Solr Spark • Data exploration and visualization • Easy ingestion and feature selection • Powerful ranking features • Quick and dirty classification and clustering • Simple operation and scaling • Stats and math built in • General purpose batch/streaming compute engine Whole collection analysis! • Fast, large scale iterative algorithms • Advanced machine learning: MLLib, Mahout, Deep Learning4j • Lots of integrations with other big data systems
  • 6. Why does Spark need Solr? Typical Hadoop / Spark data-engineering task, start with some data on HDFS: $ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015 … -rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo -rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo -rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo -rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo Now what? What’s in these files?
  • 7. Solr gives you: • random access data store • full-text search • fast aggregate statistics • just starting out: no HDFS / S3 necessary! • world-class multilingual text analytics: • no more: tokens = str.toLowerCase().split(“s+“) • relevancy / ranking • realtime HTTP service layer
  • 8. • Apache Lucene • Grouping and Joins • Stats, expressions, transformations and more • Lang. Detection • Extensible • Massive Scale/Fault tolerance Solr Key Features • Full text search (Info Retr.) • Facets/Guided Nav galore! • Lots of data types • Spelling, auto-complete, highlighting • Cursors • More Like This • De-duplication
  • 9. Why Spark for Solr? • Spark-shell: a Big Data REPL with all your fave JVM libs! • Build the index in parallel very, very quickly! • Aggregations • Boosts, stats, iterative global computations • Offline compute to update index with additional info (e.g. PageRank, popularity) • Whole corpus analytics and ML: clustering, classification, CF, rankers • General-purpose distributed computation • Joins with other storage (Cassandra, HDFS, DB, HBase)
  • 10. Spark Key Features • General purpose, high powered cluster computing system • Modern, faster alternative to MapReduce • 3x faster w/ 10x less hardware for Terasort • Great for iterative algorithms • APIs for Java, Scala, Python and R • Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems • Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …
  • 11. • Initial exploration of ASF mailing-list archives • Index it into Solr • Explore a bit deeper: unsupervised Spark ML • Exploit labels: predictive analytics Example: Many NewsGroups
  • 12. • Initial exploration of ASF mailing-list archives • index into Solr: just need to turn your records into json • facet: • fields with low cardinality or with sensible ranges • document size histogram • projects, authors, dates • find: broken fields, automated content, expected data missing, errors • now: load into a spark RDD via SolrRDD: Many NewsGroups: Initial Exploration
  • 13. • cleanup/filtering via spark DataFrame operations: • create thread groups: Many NewsGroups: Initial Exploration
  • 14. • try other text analyzers: (no more str.split(“w+”)! ) Many NewsGroups: Initial Exploration ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
  • 15. • Unsupervised machine learning: • clustering documents with KMeans • extract topics with Latent Dirichlet Allocation • learn word vectors with Word2Vec Many NewsGroups: Exploratory Data Science
  • 16. • Vectorize and run KMeans: Many NewsGroups: Exploratory Data Science
  • 17. • Build topic models with LDA: Many NewsGroups: Exploratory Data Science
  • 18. • Build word vector representations with Word2Vec: Many NewsGroups: Exploratory Data Science
  • 19. • Now for some real Data Science: Many NewsGroups: Supervised Learning
  • 20. • What else could you do? • Try other classification algs, cross-validate to pick! • Recommender Systems • content-based: • mail-thread as “item”, head msgs grouped by replier as “user” profile • search query of users against items to recommend • collaborative-filtering: • users replying to a head msg “rate” them +-tively • train a Spark ML ALS RecSys model • Train search rankers in click logs Many NewsGroups: Next steps?
  • 21. Resources • spark-solr: https://github.com/Lucidworks/spark-solr • Company: http://www.lucidworks.com • Our blog: http://www.lucidworks.com/blog • Apache Solr: http://lucene.apache.org/solr • Apache Spark: http://spark.apache.org • Fusion: http://www.lucidworks.com/products/fusion • Twitter: @pbrane

Editor's Notes

  1. I live at the intersection of information retrieval and machine learning, and the scalable / distributed systems engineering to enable them
  2. Two open source Apache projects: Solr is a fantastically scalable, production-grade search engine Spark is a high-performance, flexible distributed computation engine
  3. Download one? These aren’t big, but sometimes each chunk *is* big. These are moderately human-readable (after decompress), so you could “hdfs cat” them, but often it’s binary: parquet/avro/thrift/protobufs. Maybe you can “hdfs cat” them, too… but you want to explore, tinker, see what ugly rough edges there are.
  4. No HDFS means no NameNode SPOF, no need for HA NN, Ops to go with that.
  5. faceting does global reads / stats, but is “just counting”
  6. overview: buildVectorizer runs word count, filters out ultra-rare and too freq words, builds a dictionary with IDF values in it vectorize: applies the vectorizer as a UDF to make a vector field buildKMeansModel: runs kMeans over the vectors
  7. next slide: w2v - sooooooo much code! ;P
  8. this model may be applied to data you have that doesn’t have labels it may also be serialized, and loaded up into a service layer (plug for Fusion) Remember: just index these modified DataFrames back into Solr and take a peek at them!