Solr for Data Science
Scalable search and analytics in one
Grant Ingersoll, CTO: @gsingers
http://github.com/lucidworks/solr-for-datascience
Solr in a nutshell
8M+ total
downloads
Solr is both established & growing
250,000+
monthly downloads
Largest community of developers.
2500+ open Solr jobs.
Solr is the most widely used search
solution on the planet.
Lucidworks
Unmatched Solr expertise.
1/3
of the active
committers
70%
of the open source
code is committed
Lucene/Solr Revolution
world’s largest open source user
conference dedicated to Lucene/Solr.
Solr has tens of thousands
of applications in production.
You use
Solr every day.
Solr’s Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete,
highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions,
transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
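As a rough illustration of the faceting feature listed above, the sketch below builds a faceted query URL and reshapes the flat facet list Solr returns by default in JSON. The collection name `products`, the field `category`, and the localhost URL are assumptions made for the example, not details from the slides.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=0):
    """Build a /select URL that asks Solr for facet counts."""
    params = [("q", query), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field to count.
    params += [("facet.field", f) for f in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def facet_pairs(flat_counts):
    """Turn Solr's default flat list [term, count, term, count, ...] into (term, count) pairs."""
    return list(zip(flat_counts[0::2], flat_counts[1::2]))


# Hypothetical collection and field names.
url = facet_query_url("http://localhost:8983/solr/products", "*:*", ["category"])
```

Setting `rows=0` returns only the counts, which is usually what a first exploratory pass over a data set needs.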
It is increasingly important to know
what is important!
Corollary: The faster you know what is important, the better
Data
Exploration
• Solr - Logstash - Kibana
• http://lucidworks.com/product/integrations/silk/
• Open source at:
• https://github.com/LucidWorks/banana
• https://github.com/LucidWorks/solrlogmanager
SiLK
• Feature Selection
• Analyzers for all types
• Easily get weights for terms
• Term Vectors
• Data Reduction
• Filters
• Analyzers
• Data quality tools
Feature Selection and Data Reduction
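To make the idea of analysis as data reduction concrete, here is a minimal pure-Python sketch of an analysis chain (tokenize, lowercase, stop-word filter) followed by term counting. It only imitates what Lucene analyzers and filters do; the stop list and function names are illustrative assumptions, not Solr's API.

```python
import re
from collections import Counter

# Illustrative stop list; a real deployment would configure its own.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on runs of letters and digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def term_frequencies(text):
    """Count how often each surviving term occurs in one document."""
    return Counter(analyze(text))


tf = term_frequencies("The index is the matrix, and the matrix is the index.")
```

In this sentence the filter removes seven of eleven tokens, which is the kind of reduction the slide refers to before any weighting is applied.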
• Quick and dirty:
• kNN, others
• Carrot^2 integration for search result
clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers
built on index
• Easily build training and test sets via
filter queries
Classification and Clustering
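One way to build training and test sets via filter queries, sketched under an assumption: each document carries an integer field (called `split_bucket` here, values 0 to 9) assigned at index time. That field and the label field `category` are hypothetical; only the `fq` range syntax is standard Solr.

```python
def split_params(label_field, bucket_field, train_buckets, test_buckets):
    """Return (train, test) Solr parameter dicts that differ only in their filter queries."""
    def params(lo, hi):
        return {
            "q": "*:*",
            # First fq keeps only labeled documents; second picks the bucket range.
            "fq": [f"{label_field}:[* TO *]", f"{bucket_field}:[{lo} TO {hi}]"],
            "wt": "json",
        }
    return params(*train_buckets), params(*test_buckets)


# Hypothetical field names; buckets 0-7 for training, 8-9 for testing.
train, test = split_params("category", "split_bucket", (0, 7), (8, 9))
```

Because the two bucket ranges do not overlap, the two result sets are disjoint, and the split can be changed without reindexing.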
• Built in expressions, stats, function
queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking optimized
matrix
• More coming!
Math
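The "vector * matrix" view of search can be shown with a toy example: a query vector multiplied by a term-document matrix yields one score per document. The terms and weights below are made up for illustration, and plain term counts stand in for whatever similarity model a real index would apply.

```python
# Rows are terms, columns are documents; each cell is a term weight
# (plain counts here, chosen only for illustration).
TERMS = ["solr", "search", "spark"]
MATRIX = [
    [2, 0, 1],  # "solr"
    [1, 1, 0],  # "search"
    [0, 0, 3],  # "spark"
]


def score(query_terms, terms=TERMS, matrix=MATRIX):
    """Multiply a 0/1 query vector by the term-document matrix: one score per document."""
    query_vector = [1 if t in query_terms else 0 for t in terms]
    n_docs = len(matrix[0])
    return [
        sum(query_vector[i] * matrix[i][d] for i in range(len(terms)))
        for d in range(n_docs)
    ]


def rank(scores):
    """Order document indexes by descending score."""
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)


scores = score({"solr", "search"})  # [3, 1, 1]
```

An inverted index stores only the non-zero cells of each row, so the product touches just the rows for the query's terms.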
Clicks, tweets, ratings, locations and much more can all
be leveraged to provide high quality recommendations
to users and deeper insight for data scientists
Signals power relevance
Query Modification
Perform real-time decision
making and routing in order to
map a user's intention or
enterprise policy
Result Manipulation
Curate the user experience in
your application with artificial
result ranking, document
injections and obfuscation
Index Time Enrichment
Increase the findability of
documents and records with
automatic creation of tags, fields
and metadata
• http://www.lucidworks.com/products/fusion
• Ships w/ built-in Solr-based Recommender OOTB,
but easy to extend
• Demo: eCommerce data set
• ~1.2M products
• ~4M clicks
Lucidworks Fusion
• Data ingest:
• JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
• http://cran.r-project.org/web/packages/solr/index.html, amongst
others
• Output formats: JSON, CSV, XML, custom
Solr and Your Tools
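Since Solr speaks HTTP, a client needs little more than the standard library. The sketch below assumes a Solr instance at a URL of your choosing and a collection named `products` (both hypothetical); the `wt` parameter selects the output format, and the parser expects the usual `response.docs` layout of Solr's JSON output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def select_url(base_url, query, rows=10, wt="json"):
    """Build a /select URL; wt picks the response writer (json, csv, xml)."""
    params = urlencode({"q": query, "rows": rows, "wt": wt})
    return base_url.rstrip("/") + "/select?" + params


def docs_from_json(body):
    """Pull the list of matching documents out of a Solr JSON response body."""
    return json.loads(body)["response"]["docs"]


def search(base_url, query, rows=10):
    """Run the query against a live Solr and return the matching documents."""
    with urlopen(select_url(base_url, query, rows)) as resp:
        return docs_from_json(resp.read().decode("utf-8"))
```

A call such as `search("http://localhost:8983/solr/products", "title:solr")` returns a list of dicts that can be handed to whatever analysis tool comes next; it requires a running Solr, so it is not executed here.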
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and
more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random,
more
• Easy to plug-in ranking
for Data Science
But what about?
• More Facets/Stats
• Combine pivots, ranges and stats
• Percentiles via t-digest
• HyperLogLog
• Deeper Spark integration for Solr
• Custom distributed computation and aggregations/maths
• Advanced schema on read options
• Time series? Trends? Anomaly Detection?
• Learn to rank?
What’s coming?
Lucidworks Open Source
• Logstash for Solr:
• https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
• https://github.com/LucidWorks/banana
• Effortless AWS deployment and monitoring:
• http://www.github.com/lucidworks/solr-scale-tk
• Data Quality Toolkit:
• https://github.com/LucidWorks/data-quality
• Spark Integration
• https://github.com/LucidWorks/spark-solr
• This code: http://github.com/lucidworks/solr-for-datascience
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Resources