Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
1. Search Discover Analyze
Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop
Grant Ingersoll
Chief Scientist
Lucid Imagination
2. Search is Dead, Long Live Search
Good keyword search is a commodity and easy to get up and running
The Bar is Raised
– Relevance is (always will be?) hard
A holistic view of the data AND the users is critical
[Diagram: Documents, Content, User Interaction, Relationships, Access]
3. Topics
Quick Background and needs
Architecture
– Abstract
– Practical
SDA In Practice
– Components
– Challenges and Lessons Learned
Wrap Up
4. Why Search, Discovery and Analytics (SDA)?
User Needs
– Real-time, ad hoc access to content
– Aggressive prioritization based on importance
– Serendipity
– Feedback/Learning from the past
Business Needs
– Deeper insight into users
– Leverage existing internal knowledge
– Cost effective
[Diagram: Search, Discovery, Analytics]
5. What Do Developers Need for SDA?
Fast, efficient, scalable search
– Bulk and Near Real Time Indexing
– Handle billions of records w/ sub-second search and faceting
Large scale, cost effective storage and processing capabilities
– Need whole data consumption and analysis
– Experimentation/Sampling tools
– Distributed In Memory where appropriate
NLP and machine learning tools that scale to enhance discovery and
analysis
6. Abstract -> Practical SDA Architecture
[Architecture diagram, top to bottom:]
– Access: API, UI, Visualization
– Search, Discovery and Analytics Glue: Experiment Mgmt, Stats Package (Mahout, R, GATE, Pig, others), Machine Learning, Docs/User Modeling, Admin Access, Service Mgmt
– Content Acquisition
– Computation and Storage: Dist. Search (shards), NoSQL (KV), DB, Logs, DFS, Data/Process Mgmt
– Provisioning, Monitoring, Infrastructure
7. Computation and Storage
Solr
• Document Index
• Document Storage?
• SolrCloud makes sharding easy
Hadoop
• Stores logs, raw files, intermediate files, etc.
• WebHDFS
• Small files are an unnatural act
HBase
• Metric Storage
• User Histories
• Document Storage?
Challenges
• Who is the authoritative store? Solr or HBase?
• Real time vs. Batch
• Where should analysis be done?
8. Search In Practice
Three primary concerns
– Performance/Scaling
– Relevance
– Operations: monitoring, failover, etc.
The business typically cares more about relevance; developers care more about performance (and then operations)
9. Search with Solr: Scaling and NRT
SolrCloud takes care of distributed indexing and search needs
– Transaction logs for recovery
– Automatic leader election, so no more master/worker
– Have to declare number of shards now, but splitting coming soon
– Use CloudSolrServer in SolrJ
NRT Config tips:
– 1 second soft commits for NRT updates
– 1 minute hard commits (no searcher reopen)
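The commit tips above map onto the update handler settings in solrconfig.xml. A minimal sketch (the values mirror the slide; tune them for your own indexing load):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Soft commit every second: new documents become searchable (NRT) -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
  <!-- Hard commit every minute: flush to stable storage, but don't
       reopen the searcher (soft commits already handle visibility) -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```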
10. Search: Relevance
ABT – Always Be Testing
– Experiment management is critical
– Top X + Random Sampling of Long Tail
– Click logs
Track Everything!
– Queries
– Clicks
– Displayed Documents
– Mouse/Scroll tracking???
Phrases are your friend
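As a sketch of what "track everything" buys you: with queries, displayed documents, and clicks logged, even a trivial per-query click-through-rate rollup becomes possible. The record layout and document IDs below are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hypothetical click-log records: (query, displayed doc ids, clicked doc id or None)
log = [
    ("solr nrt", ["d1", "d2", "d3"], "d2"),
    ("solr nrt", ["d1", "d2", "d3"], "d1"),
    ("mahout kmeans", ["d7", "d8"], None),  # query shown, nothing clicked
]

impressions = defaultdict(int)
clicks = defaultdict(int)
for query, displayed, clicked in log:
    impressions[query] += 1
    if clicked is not None:
        clicks[query] += 1

# Click-through rate per query: clicks / impressions
ctr = {q: clicks[q] / impressions[q] for q in impressions}
print(ctr)  # {'solr nrt': 1.0, 'mahout kmeans': 0.0}
```

The same aggregation over displayed-but-never-clicked documents feeds the relevance experiments described above.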
11. Discovery Components
Serendipity
• Trends
• Topics
• Recommendations
• Related Items
• More Like This
• Did you mean?
• Statistically Interesting Phrases
Organization
• Importance
• Clustering
• Classification
• Named Entities
• Time Factors
• Faceting
Data Quality
• Document factor distributions
• Length
• Boosts
• Duplicates
Challenges
• Many of these are intense calculations or iterative
• Many are subjective and require a lot of experimentation
12. Discovery with Mahout
Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
– Collaborative Filtering
– Classification
– Clustering
Also:
– Collocations (Statistically Interesting Phrases)
– SVD
– Others
Challenges:
– High cost of iterative machine learning algorithms
– Mahout is very command line oriented
– Some areas less mature
13. Aside: Experiment Management
Plan for running experiments from the beginning across Search and
Discovery components
– Your analytics engine should help!
Types of Experiments to consider
– Indexing/Analysis
– Query parsing
– Scoring formulas
– Machine Learning Models
– Recommendations, many more
Make it easy to do A/B testing across all experiments and compare and
contrast the results
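A minimal sketch of an A/B comparison on a per-query success metric (all data here is made up; a real experiment framework would also check statistical significance before declaring a winner):

```python
from collections import Counter

# Hypothetical A/B log: (variant, did the user click a result for this query?)
trials = [
    ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", True), ("B", False), ("B", True),
]

shown = Counter(variant for variant, _ in trials)
successes = Counter(variant for variant, clicked in trials if clicked)

# Success rate per variant: A = 2/3, B = 3/4
rates = {variant: successes[variant] / shown[variant] for variant in sorted(shown)}
print(rates)
```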
14. Analytics Components
Commonly used components
– Solr
– R
– Hive
– Pig
– Commercial
Starting with Search and Discovery metrics and analysis gives context into
where to make investments for broader analytics
15. Analytics in Practice
Simple Counts:
– Facets
– Term and Document frequencies
– Clicks
Search and Discovery example metrics
– Relevance measures like Mean Reciprocal Rank
– Histograms/Drilldowns around Number of Results
– Log and navigation analysis
Data cleanliness analysis is helpful for finding potential issues in content
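Mean Reciprocal Rank, mentioned above, is simple to compute once you have judged result lists: for each query, take the reciprocal of the rank of the first relevant document (0 if it is missing), then average. A sketch with made-up document IDs:

```python
def mean_reciprocal_rank(results):
    """results: list of (ranked doc ids, relevant doc id) pairs."""
    total = 0.0
    for ranked, relevant in results:
        if relevant in ranked:
            # Reciprocal of the 1-based rank of the relevant document
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(results)

judged = [
    (["d1", "d2", "d3"], "d1"),  # rank 1 -> 1.0
    (["d4", "d5", "d6"], "d6"),  # rank 3 -> 1/3
    (["d7", "d8"], "d9"),        # miss   -> 0
]
print(mean_reciprocal_rank(judged))  # (1.0 + 1/3 + 0) / 3 ≈ 0.444
```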
16. Wrap
Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
Solr + Hadoop + Mahout
Design for the big picture when building search-based applications
The bar is raised: when we first started Lucid, the problems were mostly about standing up Lucene or Solr or dealing with performance issues; now the large majority of them are about taking search to the next level: better relevance, personalization, recommendations, etc.
How do you gain insight? The search box is the UI for data these days. Feed improvements back into the system for users. Extract key metrics for business understanding.
Make into images?
All about ad hoc and bulk storage and computation. All about the analytics that drive your computation. Glue to make it all work together – data where it needs to be, when it needs to be there. All are examples of ways to do this; there are actually a fair number of viable alternatives for all of these pieces, all in open source. I tend to stick to Apache and "commercial"-friendly licenses, where possible.
Authoritative store: managing across stores, consistency, etc. Analysis should be done where it makes the most sense given the location of the data and the type of analysis being done. The Hadoop and HBase pieces are all pretty straightforward.
Relevance – plan for relevance testing from day 1.
Log and navigation: clicks, search trails, etc. Data cleanliness: never-viewed docs that are related to other documents.
Big Picture: too often devs are stuck in the weeds