Practical Machine Learning for Smarter Search with Solr and Spark

Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks

$ whoami
Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: Semantic Search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology

Overview
• Why Spark and Solr for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Example: Many NewsGroups
  • data exploration
  • clustering: unsupervised ML
  • classification: supervised ML
  • recommender: collaborative filtering + content-based
  • search ranking

Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?

Why do data engineering with Solr and Spark?
Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Lots of integrations with other big data systems

Why does Spark need Solr?
A typical Hadoop / Spark data-engineering task starts with some data on HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
Now what? What’s in these files?

Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
• no more: tokens = str.toLowerCase().split("\\s+")
• relevancy / ranking
• realtime HTTP service layer

Solr Key Features
• Full text search (information retrieval)
• Facets / guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive scale / fault tolerance

Why Spark for Solr?
• Spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly!
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)

Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …

Example: Many NewsGroups
• Initial exploration of ASF mailing-list archives
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics

Many NewsGroups: Initial Exploration
• Initial exploration of ASF mailing-list archives
  • index into Solr: just need to turn your records into JSON
  • facet:
    • fields with low cardinality or with sensible ranges
    • document size histogram
    • projects, authors, dates
  • find: broken fields, automated content, expected data missing, errors
  • now: load into a Spark RDD via SolrRDD:
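
The indexing and faceting bullets on this slide can be sketched without any client library. This is a minimal sketch, not the code from the talk: the collection name `mail`, the local URL, and the field names (`project`, `author`, `body`) are all assumptions made for illustration. It only builds the JSON body for Solr's `/update` handler and a facet-only `/select` URL; it does not send anything.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL and collection name; adjust for your own cluster.
SOLR = "http://localhost:8983/solr/mail"

def to_update_body(records):
    """Serialize records (dicts) as a JSON array of documents for the /update handler."""
    return json.dumps(list(records))

def facet_url(field, query="*:*"):
    """Build a select URL that asks only for facet counts on one field (rows=0)."""
    params = {"q": query, "rows": 0, "facet": "true", "facet.field": field, "wt": "json"}
    return SOLR + "/select?" + urlencode(params)

# Made-up records standing in for parsed mailing-list messages.
records = [
    {"id": "msg-1", "project": "lucene-solr-user", "author": "alice", "body": "How do I facet?"},
    {"id": "msg-2", "project": "spark-user", "author": "bob", "body": "RDD question"},
]
body = to_update_body(records)  # would be POSTed to SOLR + "/update?commit=true"
url = facet_url("project")      # facet counts per project, no documents returned
```

Faceting on a low-cardinality field such as the hypothetical `project` is a quick way to spot broken or missing values before any Spark job is written.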

• cleanup/filtering via Spark DataFrame operations:
• create thread groups:
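
One way to picture the thread-grouping step is to normalize subject lines and group on the result. The sketch below is plain Python standing in for a DataFrame group-by; it is not the code from the talk, and the subject-based key, the `thread_key` and `group_threads` helpers, and the sample messages are assumptions (real archives may thread on reply headers instead).

```python
import re
from collections import defaultdict

# Strip any run of leading reply/forward markers such as "Re:", "RE:", "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def thread_key(subject):
    """Normalize a subject so replies map to the same key as the head message."""
    return _PREFIX.sub("", subject).strip().lower()

def group_threads(messages):
    """Group (message_id, subject) pairs by normalized subject, keeping input order."""
    threads = defaultdict(list)
    for msg_id, subject in messages:
        threads[thread_key(subject)].append(msg_id)
    return dict(threads)

# Made-up messages for illustration.
messages = [
    ("m1", "Faceting on dates"),
    ("m2", "Re: Faceting on dates"),
    ("m3", "RE: Re: faceting on dates"),
    ("m4", "SolrCloud sizing"),
]
threads = group_threads(messages)
```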

• try other text analyzers: (no more str.split("\\W+")!)
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
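
To see why a naive split is worth replacing, compare it with even a slightly more careful tokenizer. This is only an illustration: the regex below is an assumption chosen for the example and is far simpler than a Lucene analyzer (no stemming, stop words, or language-specific rules).

```python
import re

def naive_tokens(text):
    """Lowercase and split on whitespace only, as in the naive one-liner."""
    return text.lower().split()

# Stand-in for a real analyzer: runs of letters/digits, allowing inner apostrophes.
_WORD = re.compile(r"[^\W_]+(?:'[^\W_]+)*")

def simple_tokens(text):
    """Lowercase and extract word-like runs, dropping surrounding punctuation."""
    return _WORD.findall(text.lower())

text = "Solr's faceting (really!) rocks."
naive = naive_tokens(text)    # punctuation stays glued to the tokens
better = simple_tokens(text)  # punctuation is dropped, the apostrophe is kept
```

With the naive version, "(really!)" and "rocks." become distinct terms from "really" and "rocks", which fragments term counts in every downstream model.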

Many NewsGroups: Exploratory Data Science
• Unsupervised machine learning:
  • clustering documents with KMeans
  • extract topics with Latent Dirichlet Allocation
  • learn word vectors with Word2Vec

• Vectorize and run KMeans:
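
The idea behind "vectorize, then cluster" can be shown at toy scale. The sketch below is a pure-Python stand-in, not Spark MLlib and not the code from the talk: it builds term-count vectors and runs Lloyd's k-means iterations. The documents and the choice of initial centers are assumptions made so the run is deterministic.

```python
from collections import Counter

def vectorize(docs):
    """Turn whitespace-tokenized docs into term-count vectors over a sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_centers, iterations=10):
    """Lloyd's algorithm: assign each vector to its nearest center, then recompute centers."""
    centers = [list(c) for c in init_centers]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))
            for v in vectors
        ]
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

# Made-up documents for illustration.
docs = [
    "solr facet query",
    "solr query facet facet",
    "spark rdd shuffle",
    "spark shuffle rdd rdd",
]
vocab, vectors = vectorize(docs)
# Deterministic initialization for the sketch: seed with the first and third document.
assignments, centers = kmeans(vectors, [vectors[0], vectors[2]])
```

At corpus scale the same two steps would be done with distributed feature extraction and a library implementation of k-means rather than Python lists.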

• Build topic models with LDA:

• Build word vector representations with Word2Vec:

Many NewsGroups: Supervised Learning
• Now for some real Data Science:
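
As an illustration of exploiting labels, here is a tiny multinomial naive Bayes classifier in plain Python. It is a stand-in chosen for brevity: which classifier and which labels the talk used are not stated here, so the labels (mailing-list names) and training texts below are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect per-label document counts, per-label token counts, and the vocabulary."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    label_counts, token_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data: (label, message text).
training = [
    ("solr", "facet query solr"),
    ("solr", "solr schema query"),
    ("spark", "rdd shuffle spark"),
    ("spark", "spark dataframe rdd"),
]
model = train_nb(training)
guess = predict_nb(model, "slow facet query")
```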

Many NewsGroups: Next steps?
• What else could you do?
  • Try other classification algorithms, cross-validate to pick!
  • Recommender systems
    • content-based:
      • mail thread as “item”, head msgs grouped by replier as “user” profile
      • search query of users against items to recommend
    • collaborative filtering:
      • users replying to a head msg “rate” it positively
      • train a Spark ML ALS RecSys model
  • Train search rankers on click logs
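
For the collaborative-filtering idea, the implicit "ratings" can be derived directly from reply records. A minimal sketch, assuming replies arrive as (user, thread id) pairs and that the reply count is used as the rating strength; both are assumptions for illustration, and the `implicit_ratings` helper is hypothetical.

```python
from collections import defaultdict

def implicit_ratings(replies):
    """Count replies per (user, thread); each reply is treated as a positive signal."""
    counts = defaultdict(int)
    for user, thread_id in replies:
        counts[(user, thread_id)] += 1
    # (user, item, rating) triples, the usual input shape for ALS-style recommenders.
    return sorted((u, t, float(n)) for (u, t), n in counts.items())

# Made-up reply log for illustration.
replies = [
    ("alice", "thread-1"),
    ("bob", "thread-1"),
    ("alice", "thread-1"),
    ("alice", "thread-2"),
]
ratings = implicit_ratings(replies)
```

Triples of this shape are what a matrix-factorization recommender would train on; mapping the string ids to integers is usually the remaining preprocessing step.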

Resources
• spark-solr: https://github.com/Lucidworks/spark-solr
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Apache Solr: http://lucene.apache.org/solr
• Apache Spark: http://spark.apache.org
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @pbrane
