SlideShare a Scribd company logo
1 of 17
Download to read offline
Search   Discover   Analyze




Large Scale Search, Discovery and
Analytics with Solr, Mahout and
Hadoop




Grant Ingersoll
Chief Scientist
Lucid Imagination


                                                  |   1
Search is Dead, Long Live Search



   Good keyword search is a                       Documents
    commodity and easy to get
    up and running


   The Bar is Raised                Content                      User
                                   Relationships               Interaction
     – Relevance is (always will
       be?) hard


   Holistic view of the data
    AND the users is critical
                                                    Access




                                                                       |     2
Topics



   Quick Background and needs
   Architecture
     – Abstract
     – Practical
   SDA In Practice
     – Components
     – Challenges and Lessons Learned
   Wrap Up




                                        |   3
Why Search, Discovery and Analytics (SDA)?

                                     User Needs
                                       – Real-time, ad hoc access to content
                                       – Aggressive Prioritization based on Importance
             Search
                                       – Serendipity
                                       – Feedback/Learning from past


                                     Business Needs
 Analytics            Discovery        – Deeper insight into users
                                       – Leverage existing internal knowledge
                                       – Cost effective




                                                                                |   4
What Do Developers Need for SDA?



   Fast, efficient, scalable search
     – Bulk and Near Real Time Indexing
     – Handle billions of records w/ sub-second search and faceting
   Large scale, cost effective storage and processing capabilities
     – Need whole data consumption and analysis
     – Experimentation/Sampling tools
     – Distributed In Memory where appropriate
   NLP and machine learning tools that scale to enhance discovery and
    analysis




                                                                      |   5
Abstract -> Practical SDA Architecture
                           Access (API, UI,Visualization)

                                  Search, Discovery and Analytics              Glue
                       Stats Mahout, R, GATE, Others
                          Pig, Machine   Docs     User                        Admin
                      Package Learning  Access Modeling

                                        Experiment Mgmt                       Service
                                                                               Mgmt
       Content                     Computation and Storage
      Acquisition
                                              DB
                                                               Dist.          Data
                      Search                 NoSQL
                                                              Process         Mgmt
                                              KV

                      Shards                  Shards                Shards
                       Shards                  Shards                Shards
                         Shards                   Logs                  DFS




                    Provisioning, Monitoring, Infrastructure


                                                                                        |   6
Computation and Storage


       Solr                  Hadoop                     HBase

• Document Index        • Stores Logs,          • Metric Storage
• Document                Raw files,            • User Histories
  Storage?                intermediate          • Document
                          files, etc.             Storage?
• SolrCloud             • WebHDFS
  makes sharding
  easy                  • Small file are an
                          unnatural act

Challenges
     • Who is the authoritative store? Solr or HBase?
     • Real time vs. Batch
     • Where should analysis be done?
                                                                |   7
Search In Practice



   Three primary concerns
     – Performance/Scaling


     – Relevance


     – Operations: monitoring, failover, etc.


   Business typically cares more about relevance
   Devs more about performance (and then ops)




                                                    |   8
Search with Solr: Scaling and NRT



   SolrCloud takes care of distributed indexing and search needs
     – Transaction logs for recovery
     – Automatic leader election, so no more master/worker
     – Have to declare number of shards now, but splitting coming soon
     – Use CloudSolrServer in SolrJ
   NRT Config tips:
     – 1 second soft commits for NRT updates
     – 1 minute hard commits (no searcher reopen)




                                                                         |   9
Search: Relevance



   ABT – Always Be Testing
     – Experiment management is critical
     – Top X + Random Sampling of Long Tail
     – Click logs
   Track Everything!
     – Queries
     – Clicks
     – Displayed Documents
     – Mouse/Scroll tracking???
   Phrases are your friend




                                              |   10
Discovery Components

       Serendipity             Organization            Data Quality

•   Trends                 • Importance           • Document factor
•   Topics                 • Clustering             Distributions
•   Recommendations        • Classification         • Length
•   Related Items            • Named Entities       • Boosts
•   More Like This         • Time Factors         • Duplicates
•   Did you mean?          • Faceting
•   Stat. Interesting
    Phrases

Challenges
        • Many of these are intense calculations or iterative
        • Many are subjective and require a lot of experimentation


                                                                      |   11
Discovery with Mahout



   Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
     – Collaborative Filtering
     – Classification
     – Clustering
   Also:
     – Collocations (Statistically Interesting Phrases)
     – SVD
     – Others
   Challenges:
     – High cost to iterative machine learning algorithms
     – Mahout is very command line oriented
     – Some areas less mature

                                                                             |   12
Aside: Experiment Management



   Plan for running experiments from the beginning across Search and
    Discovery components
     – Your analytics engine should help!
   Types of Experiments to consider
     – Indexing/Analysis
     – Query parsing
     – Scoring formulas
     – Machine Learning Models
     – Recommendations, many more
   Make it easy to do A/B testing across all experiments and compare and
    contrast the results



                                                                            |   13
Analytics Components



   Commonly used components
     – Solr
     – R Stats
     – Hive
     – Pig
     – Commercial


   Starting with Search and Discovery metrics and analysis gives context into
    where to make investments for broader analytics




                                                                                 |   14
Analytics in Practice



   Simple Counts:
     – Facets
     – Term and Document frequencies
     – Clicks
   Search and Discovery example metrics
     – Relevance measures like Mean Reciprocal Rank
     – Histograms/Drilldowns around Number of Results
     – Log and navigation analysis


   Data cleanliness analysis is helpful for finding potential issues in content




                                                                                   |   15
Wrap



   Search, Discovery and Analytics, when combined into a single, coherent
    system provides powerful insight into both your content and your users


   Solr + Hadoop + Mahout


   Design for the big picture when building search-based applications




                                                                             |   16
Find me



   http://www.lucidimagination.com


   grant@lucidimagination.com
   @gsingers




                                      |   17

More Related Content

What's hot

Concept Searching Overview Google Vs Fast
Concept Searching Overview  Google Vs FastConcept Searching Overview  Google Vs Fast
Concept Searching Overview Google Vs Fastmartingarland
 
Book Recommendation System using Data Mining for the University of Hong Kong ...
Book Recommendation System using Data Mining for the University of Hong Kong ...Book Recommendation System using Data Mining for the University of Hong Kong ...
Book Recommendation System using Data Mining for the University of Hong Kong ...CITE
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Carlos Castillo (ChaTo)
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesSlideTeam.net
 
Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.SlideTeam.net
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarPlatfora
 
Kuali update v4 - mw
Kuali update   v4 - mwKuali update   v4 - mw
Kuali update v4 - mwsarnoa
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 

What's hot (8)

Concept Searching Overview Google Vs Fast
Concept Searching Overview  Google Vs FastConcept Searching Overview  Google Vs Fast
Concept Searching Overview Google Vs Fast
 
Book Recommendation System using Data Mining for the University of Hong Kong ...
Book Recommendation System using Data Mining for the University of Hong Kong ...Book Recommendation System using Data Mining for the University of Hong Kong ...
Book Recommendation System using Data Mining for the University of Hong Kong ...
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templates
 
Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir Webinar
 
Kuali update v4 - mw
Kuali update   v4 - mwKuali update   v4 - mw
Kuali update v4 - mw
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 

Viewers also liked

Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 

Viewers also liked (10)

Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
Taming Text
Taming TextTaming Text
Taming Text
 

Similar to Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr

MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinarTed Dunning
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceRobert H. McDonald
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Umesh Ramalingachar
 
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...SEAD
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data LakesKiran Kamreddy
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineSalford Systems
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Calpont Corporation
 
How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0Enterprise 2.0 Conference
 
Ibm info sphere datastage and hadoop two best-of-breed solutions together-f...
Ibm info sphere datastage and hadoop   two best-of-breed solutions together-f...Ibm info sphere datastage and hadoop   two best-of-breed solutions together-f...
Ibm info sphere datastage and hadoop two best-of-breed solutions together-f...ArunshankarArjunan
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
Bi 4.0 Migration Strategy and Best Practices
Bi 4.0 Migration Strategy and Best PracticesBi 4.0 Migration Strategy and Best Practices
Bi 4.0 Migration Strategy and Best PracticesEric Molner
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Data Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionData Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionGarethKnight
 

Similar to Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr (20)

MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinar
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over Hadoop
 
Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability Science
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012
 
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
 
FAST Search for SharePoint
FAST Search for SharePointFAST Search for SharePoint
FAST Search for SharePoint
 
How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0
 
BI Introduction
BI IntroductionBI Introduction
BI Introduction
 
Ibm info sphere datastage and hadoop two best-of-breed solutions together-f...
Ibm info sphere datastage and hadoop   two best-of-breed solutions together-f...Ibm info sphere datastage and hadoop   two best-of-breed solutions together-f...
Ibm info sphere datastage and hadoop two best-of-breed solutions together-f...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Bi 4.0 Migration Strategy and Best Practices
Bi 4.0 Migration Strategy and Best PracticesBi 4.0 Migration Strategy and Best Practices
Bi 4.0 Migration Strategy and Best Practices
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Data Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionData Management for Librarians: An Introduction
Data Management for Librarians: An Introduction
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 

More from Grant Ingersoll

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 

More from Grant Ingersoll (11)

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 

Recently uploaded

SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxNeo4j
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfTejal81
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechProduct School
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2DianaGray10
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)IES VE
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024Brian Pichman
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingFrancesco Corti
 

Recently uploaded (20)

SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is going
 

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr

  • 1. Search Discover Analyze Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop Grant Ingersoll Chief Scientist Lucid Imagination | 1
  • 2. Search is Dead, Long Live Search  Good keyword search is a Documents commodity and easy to get up and running  The Bar is Raised Content User Relationships Interaction – Relevance is (always will be?) hard  Holistic view of the data AND the users is critical Access | 2
  • 3. Topics  Quick Background and needs  Architecture – Abstract – Practical  SDA In Practice – Components – Challenges and Lessons Learned  Wrap Up | 3
  • 4. Why Search, Discovery and Analytics (SDA)?  User Needs – Real-time, ad hoc access to content – Aggressive Prioritization based on Importance Search – Serendipity – Feedback/Learning from past  Business Needs Analytics Discovery – Deeper insight into users – Leverage existing internal knowledge – Cost effective | 4
  • 5. What Do Developers Need for SDA?  Fast, efficient, scalable search – Bulk and Near Real Time Indexing – Handle billions of records w/ sub-second search and faceting  Large scale, cost effective storage and processing capabilities – Need whole data consumption and analysis – Experimentation/Sampling tools – Distributed In Memory where appropriate  NLP and machine learning tools that scale to enhance discovery and analysis | 5
  • 6. Abstract -> Practical SDA Architecture Access (API, UI,Visualization) Search, Discovery and Analytics Glue Stats Mahout, R, GATE, Others Pig, Machine Docs User Admin Package Learning Access Modeling Experiment Mgmt Service Mgmt Content Computation and Storage Acquisition DB Dist. Data Search NoSQL Process Mgmt KV Shards Shards Shards Shards Shards Shards Shards Logs DFS Provisioning, Monitoring, Infrastructure | 6
  • 7. Computation and Storage Solr Hadoop HBase • Document Index • Stores Logs, • Metric Storage • Document Raw files, • User Histories Storage? intermediate • Document files, etc. Storage? • SolrCloud • WebHDFS makes sharding easy • Small file are an unnatural act Challenges • Who is the authoritative store? Solr or HBase? • Real time vs. Batch • Where should analysis be done? | 7
  • 8. Search In Practice  Three primary concerns – Performance/Scaling – Relevance – Operations: monitoring, failover, etc.  Business typically cares more about relevance  Devs more about performance (and then ops) | 8
  • 9. Search with Solr: Scaling and NRT  SolrCloud takes care of distributed indexing and search needs – Transaction logs for recovery – Automatic leader election, so no more master/worker – Have to declare number of shards now, but splitting coming soon – Use CloudSolrServer in SolrJ  NRT Config tips: – 1 second soft commits for NRT updates – 1 minute hard commits (no searcher reopen) | 9
  • 10. Search: Relevance  ABT – Always Be Testing – Experiment management is critical – Top X + Random Sampling of Long Tail – Click logs  Track Everything! – Queries – Clicks – Displayed Documents – Mouse/Scroll tracking???  Phrases are your friend | 10
  • 11. Discovery Components Serendipity Organization Data Quality • Trends • Importance • Document factor • Topics • Clustering Distributions • Recommendations • Classification • Length • Related Items • Named Entities • Boosts • More Like This • Time Factors • Duplicates • Did you mean? • Faceting • Stat. Interesting Phrases Challenges • Many of these are intense calculations or iterative • Many are subjective and require a lot of experimentation | 11
  • 12. Discovery with Mahout  Mahout’s 3 “C”s provide tools for helping across many aspects of discovery – Collaborative Filtering – Classification – Clustering  Also: – Collocations (Statistically Interesting Phrases) – SVD – Others  Challenges: – High cost to iterative machine learning algorithms – Mahout is very command line oriented – Some areas less mature | 12
  • 13. Aside: Experiment Management  Plan for running experiments from the beginning across Search and Discovery components – Your analytics engine should help!  Types of Experiments to consider – Indexing/Analysis – Query parsing – Scoring formulas – Machine Learning Models – Recommendations, many more  Make it easy to do A/B testing across all experiments and compare and contrast the results | 13
  • 14. Analytics Components  Commonly used components – Solr – R Stats – Hive – Pig – Commercial  Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics | 14
  • 15. Analytics in Practice  Simple Counts: – Facets – Term and Document frequencies – Clicks  Search and Discovery example metrics – Relevance measures like Mean Reciprocal Rank – Histograms/Drilldowns around Number of Results – Log and navigation analysis  Data cleanliness analysis is helpful for finding potential issues in content | 15
  • 16. Wrap  Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users  Solr + Hadoop + Mahout  Design for the big picture when building search-based applications | 16
  • 17. Find me  http://www.lucidimagination.com  grant@lucidimagination.com  @gsingers | 17

Editor's Notes

  1. The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues, now the large majority of them are around taking search to the next level: better relevance, personalization, recommendations, etc., i.e. how to have better relevance
  2. How do you gain insight?The Search boxis the UI for data these daysFeedback improvements into system for usersExtract key metrics for business understanding
  3. Make into images?
  4. All about ad hoc and bulk storage and computationAll about the analytics that drive your computationGlue to make it all work together – data where it needs to be when it needs to be thereAll are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open sourceI tend to stick to Apache and “commercial” friendly licenses, where possible
  5. Authoritative store: managing across, consistency, etc.Analysis should be done where it most makes sense given the location of the data and the type of analysis being doneHadoop and HBase stuff are all pretty straightforward
  6. Relevance – plan for relevance testing from day 1.
  7. Log and navigation: clicks, search trails, etc.Data cleanliness: Never viewed docs that are related to other documents
  8. Big Picture: too often devs are stuck in the weeds