Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of using Solr for data science activities and explore various use cases for Solr and data science.
Solr in a nutshell
Solr is both established & growing
Largest community of developers.
2500+ open Solr jobs.
Solr is the most widely used search solution on the planet.
Unmatched Solr expertise.
of the active
of the open source code is committed
World’s largest open source user conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
Solr’s Key Features
• Full-text search (information retrieval)
• Facets/guided navigation galore!
• Lots of data types
• Spelling, auto-complete,
• More Like This
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Language detection
• Massive Scale/Fault tolerance
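To make the search and faceting features above concrete, here is a minimal sketch that builds a Solr select URL for a full-text query with one field facet. The host, the collection name `products`, and the field name `brand` are assumptions for illustration; `q`, `rows`, `facet`, `facet.field` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_facet_query(base_url, collection, text, facet_field, rows=10):
    """Build a Solr /select URL for a full-text query with one field facet."""
    params = [
        ("q", text),                   # full-text query string
        ("rows", rows),                # number of documents to return
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # field whose values are counted
        ("wt", "json"),                # ask for a JSON response
    ]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection and field names.
url = build_facet_query("http://localhost:8983", "products", "wireless mouse", "brand")
print(url)
```

The URL can be fetched with any HTTP client; the facet counts come back alongside the matching documents in the same response.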
It is increasingly important to know what is important!
Corollary: The faster you know what is important, the better
Feature Selection and Data Reduction
• Feature Selection
• Analyzers for all types
• Easily get weights for terms
• Term Vectors
• Data Reduction
• Data quality tools
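As a rough illustration of using term weights for feature selection, the sketch below ranks the terms of each document by a textbook tf * log(N / df) weight and keeps the top few. This is a stand-in written in plain Python, not Solr's own term vector output or Lucene's exact scoring formula; the whitespace tokenizer and the tiny corpus are assumptions.

```python
import math
from collections import Counter


def top_terms(docs, k=3):
    """Return the k highest-weighted terms per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace analyzer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    selected = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest weight first; ties broken alphabetically for stable output.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        selected.append(ranked[:k])
    return selected


docs = [
    "solr search search engine",
    "solr facets navigation",
    "solr spelling suggestions",
]
features = top_terms(docs, k=2)
print(features)
```

Terms that occur in every document (here `solr`) get a weight of zero and drop out, which is the data-reduction effect the slide points at.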
Classification and Clustering
• Quick and dirty:
• kNN, others
• Carrot2 integration for search result clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers built on the index
• Easily build training and test sets via
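The "quick and dirty" kNN idea can be sketched in a few lines: find the labeled documents most similar to a new one and let them vote. The version below uses cosine similarity over raw term counts in plain Python as an assumption; with a search index, the neighbor lookup would be a similarity query instead. The training data and labels are made up for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Label text by majority vote among its k most similar training documents."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples.
training = [
    ("wireless optical mouse", "accessories"),
    ("mechanical gaming keyboard", "accessories"),
    ("usb wireless keyboard", "accessories"),
    ("led television 42 inch", "tv"),
    ("smart television hdmi", "tv"),
    ("curved led television", "tv"),
]
label = knn_classify("wireless gaming mouse", training, k=3)
print(label)
```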
• Built-in expressions, stats, function queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking optimized
• More coming!
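The "vector * matrix" view of search can be shown with a toy example: the query is a vector of term weights, the index is a term-by-document matrix, and their product is one score per document. The terms, weights and matrix below are invented for illustration, and the loop is a plain-Python sketch rather than how Lucene stores or traverses its index.

```python
def score_documents(query_vector, term_doc_matrix):
    """Multiply a query vector (1 x terms) by a term-document matrix (terms x docs)."""
    n_docs = len(term_doc_matrix[0])
    scores = [0.0] * n_docs
    for weight, row in zip(query_vector, term_doc_matrix):
        if weight == 0:
            continue  # terms absent from the query contribute nothing, so skip the row
        for doc_id, value in enumerate(row):
            scores[doc_id] += weight * value
    return scores


terms = ["solr", "ranking", "facets"]
# Rows are terms, columns are documents; values are hypothetical term weights.
matrix = [
    [1.0, 1.0, 0.0],  # solr
    [2.0, 0.0, 0.0],  # ranking
    [0.0, 3.0, 1.0],  # facets
]
query = [1.0, 1.0, 0.0]  # the query "solr ranking"
scores = score_documents(query, matrix)
ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
print(scores, ranked)
```

Skipping rows for terms that are not in the query is the same shortcut an inverted index gives: only the postings of the query terms are ever touched.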
Clicks, tweets, ratings, locations and much more can all be leveraged to provide high-quality recommendations to users and deeper insight for data scientists.
Signals power relevance
Index Time Enrichment
Increase the findability of documents and records with automatic creation of tags, fields
Result Manipulation
Curate the user experience in your application with artificial result ranking, document injections and obfuscation
Perform real-time decision making and routing in order to map a user's intention or
• Ships with a built-in Solr-based recommender out of the box (OOTB), but easy to extend
• Demo: eCommerce data set
• ~1.2M products
• ~4M clicks
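To give a feel for how click signals can drive recommendations, here is a minimal item co-occurrence sketch: products clicked in the same session are counted together, and the most frequent companions of a product are recommended. This is a generic technique chosen for illustration, not the built-in recommender mentioned above, and the sessions are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(sessions):
    """Count how often each pair of items is clicked within the same session."""
    co = defaultdict(Counter)
    for clicked in sessions:
        # set() so repeated clicks on one item in a session count once.
        for a, b in permutations(set(clicked), 2):
            co[a][b] += 1
    return co


def recommend(co, item, n=2):
    """Return the n items most often co-clicked with the given item."""
    neighbors = co.get(item, Counter())
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]


# Hypothetical click sessions.
sessions = [
    ["laptop", "mouse", "laptop-bag"],
    ["laptop", "mouse"],
    ["laptop", "monitor"],
    ["phone", "phone-case"],
]
co = build_cooccurrence(sessions)
recs = recommend(co, "laptop")
print(recs)
```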
Solr and Your Tools
• Data ingest:
• JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
• http://cran.r-project.org/web/packages/solr/index.html, amongst others
• Output formats: JSON, CSV, XML, custom
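As one way to get data from everyday tools into Solr, the sketch below turns CSV text into a JSON payload and prepares a POST to Solr's update handler using only the Python standard library. The host, the collection name `products` and the sample rows are assumptions; the request is built but not sent, so the example runs without a server.

```python
import csv
import io
import json
import urllib.request


def build_update_request(csv_text, base_url, collection):
    """Convert CSV rows to a JSON array of documents and wrap it in a POST request."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # one dict per CSV row
    payload = json.dumps(rows).encode("utf-8")
    url = f"{base_url}/solr/{collection}/update?commit=true"
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sample data and server location.
csv_text = "id,name,price\n1,wireless mouse,19.99\n2,usb keyboard,29.99\n"
request = build_update_request(csv_text, "http://localhost:8983", "products")
docs = json.loads(request.data)
print(request.full_url, docs)

# To actually index the documents against a running Solr instance:
# urllib.request.urlopen(request)
```

All values arrive as strings here; whether `price` ends up numeric depends on the collection's schema.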
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and
• Advanced Similarity Models
• Language modeling, Divergence from Randomness,
• Easy to plug in ranking
for Data Science
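The "vector space or probabilistic" choice can be illustrated by comparing a textbook tf-idf weight with BM25 for a single term. The sketch below is an approximation for intuition only: the tf-idf form is the plain tf * log(N / df), not Lucene's exact classic similarity, and the BM25 function follows the commonly published formula with k1 = 1.2 and b = 0.75. The corpus statistics are invented.

```python
import math


def tfidf_score(tf, df, n_docs):
    """Textbook vector-space weight: grows linearly with term frequency."""
    return tf * math.log(n_docs / df)


def bm25_score(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight: term frequency saturates and long documents are penalized."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm


# Hypothetical statistics: 1000 documents, term appears in 50 of them.
n_docs, df = 1000, 50
for tf in (1, 10, 100):
    print(
        tf,
        round(tfidf_score(tf, df, n_docs), 3),
        round(bm25_score(tf, df, n_docs, doc_len=100, avg_len=100), 3),
    )
```

The tf-idf weight keeps growing with every repetition of the term, while the BM25 weight levels off below idf * (k1 + 1), which is the practical difference between the two families for repeated terms.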