Solr for Data Science
Scalable search and analytics in one
Grant Ingersoll, CTO: @gsingers
http://github.com/lucidworks/solr-for-datascience
Solr in a nutshell
8M+ total
downloads
Solr is both established & growing
250,000+
monthly downloads
Largest community of developers.
2500+ open Solr jobs.
Solr is the most widely used search
solution on the planet.
Lucidworks
Unmatched Solr expertise.
1/3
of the active
committers
70%
of the open source
code is committed
Lucene/Solr Revolution
world’s largest open source user
conference dedicated to Lucene/Solr.
Solr has tens of thousands
of applications in production.
You use
Solr every day.
Solr’s Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete,
highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions,
transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
It is increasingly important to know
what is important!
Corollary: The faster you know what is important, the better
Data
Exploration
• Solr - Logstash - Kibana
• http://lucidworks.com/product/integrations/silk/
• Open source at:
• https://github.com/LucidWorks/banana
• https://github.com/LucidWorks/solrlogmanager
SiLK
• Feature Selection
• Analyzers for all types
• Easily get weights for terms
• Term Vectors
• Data Reduction
• Filters
• Analyzers
• Data quality tools
Feature Selection and Data Reduction
• Quick and dirty:
• kNN, others
• Carrot^2 integration for search result
clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers
built on index
• Easily build training and test sets via
filter queries
Classification and Clustering
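The "training and test sets via filter queries" bullet can be sketched as follows. This is a minimal illustration, not the talk's own code: the `products` collection, the `category` label field, and the `split_bucket` field (an integer 0-9 assumed to be assigned uniformly at index time) are all hypothetical names. It only builds the request URLs, so no running Solr is needed.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical names: the "products" collection and the "category",
# "description" and "split_bucket" fields are illustrative assumptions.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def split_query_url(lo, hi, rows=1000):
    """Build a select URL for labeled docs whose bucket is in [lo, hi]."""
    params = [
        ("q", "*:*"),                            # match everything...
        ("fq", "category:[* TO *]"),             # ...that has a label
        ("fq", f"split_bucket:[{lo} TO {hi}]"),  # ...in this slice
        ("fl", "id,category,description"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


train_url = split_query_url(0, 7)  # buckets 0-7 for training
test_url = split_query_url(8, 9)   # buckets 8-9 for testing

# Show the filters each request carries.
print(parse_qs(urlsplit(train_url).query)["fq"])
print(parse_qs(urlsplit(test_url).query)["fq"])
```

Because each filter query is cached independently by Solr, the label filter can be reused across both requests while only the bucket range changes.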
• Built in expressions, stats, function
queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking optimized
matrix
• More coming!
Math
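The "search is essentially vector * matrix" idea can be illustrated with a toy example. This is a sketch under stated assumptions: the corpus and query are made up, and the smoothed TF-IDF weighting is a simple stand-in, not Lucene's actual scoring formula.

```python
import math

# Toy corpus; the documents and query are made up for illustration.
docs = [
    "solr search analytics",
    "spark analytics",
    "solr solr facets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency (one of many possible weightings)."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1.0


# Term-by-document matrix: one row per vocabulary term, one column per doc.
matrix = [[doc.count(term) * idf(term) for doc in tokenized] for term in vocab]


def score(query):
    """Query vector (1 x terms) times matrix (terms x docs): one score per doc."""
    q_terms = query.split()
    q_vec = [q_terms.count(term) for term in vocab]
    return [
        sum(q_vec[i] * matrix[i][j] for i in range(len(vocab)))
        for j in range(n_docs)
    ]


scores = score("solr search")
ranking = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
print(scores)
print(ranking)  # document indices, best match first
```

An inverted index stores this matrix sparsely, term by term, so only the rows for the query's terms are ever read.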
Clicks, tweets, ratings, locations and much more can all
be leveraged to provide high quality recommendations
to users and deeper insight for data scientists
Signals power relevance
Query Modification
Perform real-time decision making and routing in order to map a user's intention or enterprise policy
Index Time Enrichment
Increase the findability of documents and records with automatic creation of tags, fields and metadata
Result Manipulation
Curate the user experience in your application with artificial result ranking, document injections and obfuscation
• http://www.lucidworks.com/products/fusion
• Ships w/ built-in Solr-based Recommender OOTB,
but easy to extend
• Demo: eCommerce data set
• ~1.2M products
• ~4M clicks
Lucidworks Fusion
• Data ingest:
• JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
• http://cran.r-project.org/web/packages/solr/index.html, amongst
others
• Output formats: JSON, CSV, XML, custom
Solr and Your Tools
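As a rough sketch of pulling Solr output into a data-science tool, the snippet below requests CSV (`wt=csv`) and parses it with Python's standard library only. The endpoint, collection and field names are hypothetical, and the sample response body is made up so the example runs without a server; `fetch` shows the live call but is not invoked here.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and field names, used only for illustration.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def export_url(query, fields, rows=100):
    """Build a select URL that asks Solr for CSV output (wt=csv)."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "csv"}
    return SOLR_SELECT + "?" + urlencode(params)


def parse_csv(text):
    """Turn a CSV response body (header row first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch(query, fields, rows=100):
    """Run the query against a live Solr; not called in this sketch."""
    with urlopen(export_url(query, fields, rows)) as resp:
        return parse_csv(resp.read().decode("utf-8"))


# Made-up response body, standing in for what a live server would return.
sample = "id,price\nsku-1,9.99\nsku-2,24.50\n"
records = parse_csv(sample)
print(export_url("category:books", ["id", "price"]))
print(records)
```

Values arrive as strings, so numeric columns need converting before analysis.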
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and
more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random,
more
• Easy to plug-in ranking
for Data Science
But what about?
• More Facets/Stats
• Combine pivots, ranges and stats
• Percentiles via t-digest
• HyperLogLog
• Deeper Spark integration for Solr
• Custom distributed computation and aggregations/maths
• Advanced schema on read options
• Time series? Trends? Anomaly Detection?
• Learn to rank?
What’s coming?
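For the percentile and HyperLogLog items, here is a hedged sketch of what such a request might look like, assuming a Solr release whose stats component accepts `percentiles` and `cardinality` local params on `stats.field`. The collection and field names are hypothetical, and the code only builds the URL.

```python
from urllib.parse import urlencode, unquote_plus

# Hypothetical collection and field names; adjust to your own schema.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def stats_url(field, percentiles=(50, 90, 99), cardinality=True):
    """Build a rows=0 request asking only for statistics on one field."""
    pct = ",".join(str(p) for p in percentiles)
    flag = str(cardinality).lower()
    local_params = "{!percentiles='" + pct + "' cardinality=" + flag + "}"
    params = {
        "q": "*:*",
        "rows": 0,                            # stats only, no documents
        "stats": "true",
        "stats.field": local_params + field,  # local params prefix the field
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Decode the URL for readability when printing.
print(unquote_plus(stats_url("price")))
```

Both statistics are approximate by design: they trade a small, bounded error for memory use that stays modest on large, distributed collections.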
Lucidworks Open Source
• Logstash for Solr:
• https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
• https://github.com/LucidWorks/banana
• Effortless AWS deployment and monitoring:
• http://www.github.com/lucidworks/solr-scale-tk
• Data Quality Toolkit:
• https://github.com/LucidWorks/data-quality
• Spark Integration
• https://github.com/LucidWorks/spark-solr
• This code: http://github.com/lucidworks/solr-for-datascience
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Resources
Solr for Data Science

More Related Content

What's hot

Webinar: Rapid Solr Development with Fusion
Webinar: Rapid Solr Development with FusionWebinar: Rapid Solr Development with Fusion
Webinar: Rapid Solr Development with FusionLucidworks
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data Science
Lucidworks
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
Lucidworks (Archived)
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
Valentin Kropov
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Lucidworks
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Caserta
 
Building a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsBuilding a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patents
OpenSource Connections
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
Ran Wei
 
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkWebinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Lucidworks
 
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
Lucidworks
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
Bitly
 
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks FusionWebinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Lucidworks
 
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADRTweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
Lucidworks
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
Cloudera, Inc.
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
TUMRA | Big Data Science - Gain a competitive advantage through Big Data & Data Science
 
Use cases for cassandra in federal and state government
Use cases for cassandra in federal and state governmentUse cases for cassandra in federal and state government
Use cases for cassandra in federal and state government
OpenSource Connections
 
Uof memphis nosql mike king dell v1.5 feb18
Uof memphis nosql mike king dell v1.5 feb18Uof memphis nosql mike king dell v1.5 feb18
Uof memphis nosql mike king dell v1.5 feb18
Mike King
 
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Databricks
 

What's hot (20)

Webinar: Rapid Solr Development with Fusion
Webinar: Rapid Solr Development with FusionWebinar: Rapid Solr Development with Fusion
Webinar: Rapid Solr Development with Fusion
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data Science
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
 
Building a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsBuilding a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patents
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkWebinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
 
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry H...
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
 
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks FusionWebinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
 
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADRTweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Use cases for cassandra in federal and state government
Use cases for cassandra in federal and state governmentUse cases for cassandra in federal and state government
Use cases for cassandra in federal and state government
 
Uof memphis nosql mike king dell v1.5 feb18
Uof memphis nosql mike king dell v1.5 feb18Uof memphis nosql mike king dell v1.5 feb18
Uof memphis nosql mike king dell v1.5 feb18
 
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
 

Viewers also liked

Solr Anti - patterns
Solr Anti - patternsSolr Anti - patterns
Solr Anti - patterns
Rafał Kuć
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Grant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
Grant Ingersoll
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
Grant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
Grant Ingersoll
 
Taming Text
Taming TextTaming Text
Taming Text
Grant Ingersoll
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 

Viewers also liked (10)

Solr Anti - patterns
Solr Anti - patternsSolr Anti - patterns
Solr Anti - patterns
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 

Similar to Solr for Data Science

Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
Lucidworks
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Lucidworks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Scalable Search Analytics
Scalable Search AnalyticsScalable Search Analytics
Scalable Search Analytics
enterprisesearchmeetup
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Application of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLibApplication of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLib
David Nzoputa Ofili
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Petter Skodvin-Hvammen
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
CareerBuilder.com
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
OpenSource Connections
 
Solr 101
Solr 101Solr 101
Solr 101
Findwise
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Sease
 
Webinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with FusionWebinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with Fusion
Lucidworks
 
Getting started faster with LucidWorks for Solr
Getting started faster with LucidWorks for SolrGetting started faster with LucidWorks for Solr
Getting started faster with LucidWorks for Solr
Lucidworks (Archived)
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
Trey Grainger
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyond
Anshum Gupta
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
OpenSource Connections
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Migrating to SharePoint Online - How Micosoft Does IT
Migrating to SharePoint Online - How Micosoft Does ITMigrating to SharePoint Online - How Micosoft Does IT
Migrating to SharePoint Online - How Micosoft Does IT
Karuana Gatimu
 

Similar to Solr for Data Science (20)

EnterpriseSearch
EnterpriseSearchEnterpriseSearch
EnterpriseSearch
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Scalable Search Analytics
Scalable Search AnalyticsScalable Search Analytics
Scalable Search Analytics
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Application of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLibApplication of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLib
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
Solr 101
Solr 101Solr 101
Solr 101
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Webinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with FusionWebinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with Fusion
 
Getting started faster with LucidWorks for Solr
Getting started faster with LucidWorks for SolrGetting started faster with LucidWorks for Solr
Getting started faster with LucidWorks for Solr
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyond
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Migrating to SharePoint Online - How Micosoft Does IT
Migrating to SharePoint Online - How Micosoft Does ITMigrating to SharePoint Online - How Micosoft Does IT
Migrating to SharePoint Online - How Micosoft Does IT
 

More from Grant Ingersoll

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Grant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Grant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
Grant Ingersoll
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
Grant Ingersoll
 

More from Grant Ingersoll (11)

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Free Complete Python - A step towards Data Science
Solr for Data Science

  • 1. Solr for Data Science: Scalable search and analytics in one
      • Grant Ingersoll, CTO: @gsingers
  • 4. Solr in a nutshell
      • Solr is both established & growing: 8M+ total downloads, 250,000+ monthly downloads
      • Largest community of developers; 2,500+ open Solr jobs
      • Solr is the most widely used search solution on the planet
      • Solr has tens of thousands of applications in production
      • You use Solr every day
      • Lucidworks: unmatched Solr expertise
          • 1/3 of the active committers
          • 70% of the open source code is committed
      • Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr
  • 5. Solr’s Key Features
      • Full text search (Info Retr.)
      • Facets/Guided Nav galore!
      • Lots of data types
      • Spelling, auto-complete, highlighting
      • Cursors
      • More Like This
      • De-duplication
      • Apache Lucene
      • Grouping and Joins
      • Stats, expressions, transformations and more
      • Lang. Detection
      • Extensible
      • Massive Scale/Fault tolerance
  • 7. It is increasingly important to know what is important! Corollary: The faster you know what is important, the better
  • 9. SiLK
      • Solr - Logstash - Kibana
      • http://lucidworks.com/product/integrations/silk/
      • Open source at:
          • https://github.com/LucidWorks/banana
          • https://github.com/LucidWorks/solrlogmanager
  • 11. Feature Selection and Data Reduction
      • Feature Selection
          • Analyzers for all types
          • Easily get weights for terms
          • Term Vectors
      • Data Reduction
          • Filters
          • Analyzers
          • Data quality tools
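As a rough illustration of using term weights for feature selection, the sketch below ranks each document's terms by a simple tf / df ratio and keeps the highest-weighted ones. The dictionary layout, the sample counts, and the `select_features` helper are all hypothetical; they only loosely mimic the per-term frequency (tf) and document frequency (df) counts that term vectors expose, and real weighting schemes are usually more refined.

```python
def select_features(term_vectors, top_n):
    """Keep the top_n terms per document, ranked by a simple tf / df weight."""
    selected = {}
    for doc_id, terms in term_vectors.items():
        # Weight each term: frequent in this document, rare across the index.
        weighted = {term: stats["tf"] / stats["df"] for term, stats in terms.items()}
        # Sort by descending weight; break ties alphabetically for stable output.
        ranked = sorted(weighted, key=lambda term: (-weighted[term], term))
        selected[doc_id] = ranked[:top_n]
    return selected


# Made-up counts for illustration only.
term_vectors = {
    "doc1": {
        "solr": {"tf": 4, "df": 2},
        "the": {"tf": 10, "df": 100},
        "facet": {"tf": 2, "df": 4},
    },
    "doc2": {
        "spark": {"tf": 3, "df": 3},
        "the": {"tf": 8, "df": 100},
    },
}
features = select_features(term_vectors, top_n=2)
```

Common words such as "the" end up with low weights because their document frequency is high, which is the data-reduction effect the slide points at.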
  • 12. Classification and Clustering
      • Quick and dirty:
          • kNN, others
      • Carrot^2 integration for search result clustering
      • Integration with Mahout
      • Lucene provides Bayesian classifiers built on index
      • Easily build training and test sets via filter queries
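One way to picture building training and test sets via filter queries is sketched below. It only constructs request URLs and sends nothing. The host, the `products` collection, the `category` label field, and the `split_bucket` field (imagined here as an integer from 0 to 9 assigned at index time) are assumptions for illustration, not part of the talk.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr location and collection; adjust for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def split_query_url(label_query, bucket_range, rows=100):
    """Build a select URL whose fq parameters restrict results to one split."""
    low, high = bucket_range
    params = [
        ("q", "*:*"),
        ("fq", label_query),                        # only documents that carry a label
        ("fq", f"split_bucket:[{low} TO {high}]"),  # only documents in this split
        ("fl", "id,category"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Roughly 80% of buckets for training, the remaining 20% for testing.
train_url = split_query_url("category:[* TO *]", (0, 7))
test_url = split_query_url("category:[* TO *]", (8, 9))
```

Because each `fq` is a separate filter, the two splits share the same base query and differ only in the bucket range.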
  • 13. Math
      • Built-in expressions, stats, function queries make custom ranking a snap!
      • Search is essentially vector * matrix
      • Lucene index is a ranking optimized matrix
      • More coming!
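The "vector * matrix" view of search can be sketched with a toy example: a query is a vector of term weights, the index is a term-document matrix, and their product gives one score per document. The terms, documents, and weights below are invented, and real Lucene scoring applies similarity models and normalization that this sketch leaves out.

```python
# Toy term-document matrix: rows are terms, columns are documents.
TERMS = ["solr", "search", "spark"]
DOC_IDS = ["doc1", "doc2", "doc3"]
TERM_DOC_WEIGHTS = [
    [2.0, 0.0, 1.0],  # solr
    [1.0, 1.0, 0.0],  # search
    [0.0, 3.0, 1.0],  # spark
]


def score(query_weights):
    """Multiply a query row vector by the term-document matrix."""
    scores = [0.0] * len(DOC_IDS)
    for term_index, query_weight in enumerate(query_weights):
        if query_weight == 0.0:
            continue  # skip terms absent from the query, as an inverted index would
        for doc_index, weight in enumerate(TERM_DOC_WEIGHTS[term_index]):
            scores[doc_index] += query_weight * weight
    return dict(zip(DOC_IDS, scores))


def rank(query_weights):
    """Return document ids ordered from highest to lowest score."""
    scored = score(query_weights)
    return sorted(scored, key=scored.get, reverse=True)


query = [1.0, 1.0, 0.0]  # weights for the query "solr search"
ranking = rank(query)
```

Skipping zero-weight query terms mirrors why an inverted index is a "ranking optimized matrix": only the rows for terms in the query are ever touched.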
  • 14. Signals power relevance
      • Clicks, tweets, ratings, locations and much more can all be leveraged to provide high-quality recommendations to users and deeper insight for data scientists!
      • Query Modification: perform real-time decision making and routing in order to map a user's intention or enterprise policy
      • Index Time Enrichment: increase the findability of documents and records with automatic creation of tags, fields and meta-data
      • Result Manipulation: curate the user experience in your application with artificial result ranking, document injections and obfuscation
  • 15. Lucidworks Fusion
      • http://www.lucidworks.com/products/fusion
      • Ships w/ built-in Solr-based Recommender OOTB, but easy to extend
      • Demo: eCommerce data set
          • ~1.2M products
          • ~4M clicks
  • 16. Solr and Your Tools
      • Data ingest:
          • JSON, CSV, XML, Rich types (PDF, etc.), custom
      • Clients for Python, R, Java, .NET and more
          • http://cran.r-project.org/web/packages/solr/index.html, amongst others
      • Output formats: JSON, CSV, XML, custom
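A minimal sketch of talking to Solr from Python over plain HTTP, using only the standard library: one helper prepares a JSON ingest request and another builds a query URL that asks for CSV output. The host and the `products` collection are hypothetical, the helper names are invented for this example, and nothing is actually sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Solr collection URL; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr/products"


def build_update_request(docs):
    """Prepare a POST of JSON documents to the update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/update?{urlencode({'commit': 'true'})}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_select_url(query, response_format="json", rows=10):
    """Build a query URL; wt selects the output format (json, csv, xml)."""
    params = {"q": query, "wt": response_format, "rows": rows}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


docs = [{"id": "sku-1", "name": "example product"}]
update_request = build_update_request(docs)
csv_url = build_select_url("name:example", response_format="csv")
# To run against a live server, pass update_request or csv_url to
# urllib.request.urlopen().
```

CSV output is convenient for loading results straight into data frames or spreadsheets, while JSON suits programmatic clients.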
  • 17.
      • Vector Space or Probabilistic, it’s your choice!
      • Killer FST
      • Wicked fast
      • Pluggable compression, queries, indexing and more
      • Advanced Similarity Models
          • Lang. Modeling, Divergence from Random, more
      • Easy to plug-in ranking for Data Science
  • 19. What’s coming?
      • More Facets/Stats
          • Combine pivots, ranges and stats
          • Percentiles via t-digest
          • HyperLogLog
      • Deeper Spark integration for Solr
      • Custom distributed computation and aggregations/maths
      • Advanced schema on read options
      • Time series? Trends? Anomaly Detection?
      • Learn to rank?
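For a sense of what percentile and cardinality statistics could look like from a client, the sketch below builds a stats request URL. The slide lists these as upcoming, so treat the parameter syntax as an assumption: it follows the stats component's local-params form found in later Solr releases and may not match the version you run. The host, the `products` collection, and the `price` field are hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr location and collection; adjust for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def stats_url(field, percentiles, query="*:*"):
    """Build a request for approximate percentiles and distinct counts."""
    percentile_list = ",".join(str(p) for p in percentiles)
    # Assumed syntax: local params on stats.field request percentiles
    # (t-digest based) and cardinality (HyperLogLog based).
    stats_field = f"{{!percentiles='{percentile_list}' cardinality=true}}{field}"
    params = [
        ("q", query),
        ("rows", "0"),  # only the statistics are wanted, not the documents
        ("stats", "true"),
        ("stats.field", stats_field),
        ("wt", "json"),
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


price_stats_url = stats_url("price", [50, 95, 99])
```

Both statistics are approximations by design, trading a small error for bounded memory across a distributed index.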
  • 20. Lucidworks Open Source
      • Logstash for Solr:
          • https://github.com/LucidWorks/solrlogmanager
      • Banana (Kibana for Solr):
          • https://github.com/LucidWorks/banana
      • Effortless AWS deployment and monitoring:
          • http://www.github.com/lucidworks/solr-scale-tk
      • Data Quality Toolkit:
          • https://github.com/LucidWorks/data-quality
      • Spark Integration:
          • https://github.com/LucidWorks/spark-solr
  • 21. Resources
      • This code: http://github.com/lucidworks/solr-for-datascience
      • Company: http://www.lucidworks.com
      • Our blog: http://www.lucidworks.com/blog
      • Book: http://www.manning.com/ingersoll
      • Solr: http://lucene.apache.org/solr
      • Fusion: http://www.lucidworks.com/products/fusion
      • Twitter: @gsingers