• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr

on

  • 937 views

Berlin

Berlin

Statistics

Views

Total Views
937
Views on SlideShare
937
Embed Views
0

Actions

Likes
0
Downloads
15
Comments
0

0 Embeds 0

No embeds

Accessibility

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues, now the large majority of them are around taking search to the next level: better relevance, personalization, recommendations, etc., i.e. how to have better relevance
  • How do you gain insight?The Search boxis the UI for data these daysFeedback improvements into system for usersExtract key metrics for business understanding
  • Make into images?
  • All about ad hoc and bulk storage and computationAll about the analytics that drive your computationGlue to make it all work together – data where it needs to be when it needs to be thereAll are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open sourceI tend to stick to Apache and “commercial” friendly licenses, where possible
  • Authoritative store: managing across, consistency, etc.Analysis should be done where it most makes sense given the location of the data and the type of analysis being doneHadoop and HBase stuff are all pretty straightforward
  • Relevance – plan for relevance testing from day 1.
  • Log and navigation: clicks, search trails, etc.Data cleanliness: Never viewed docs that are related to other documents
  • Big Picture: too often devs are stuck in the weeds

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr Presentation Transcript

  • Search Discover AnalyzeLarge Scale Search, Discovery andAnalytics with Solr, Mahout andHadoopGrant IngersollChief ScientistLucid Imagination | 1
  • Search is Dead, Long Live Search Good keyword search is a Documents commodity and easy to get up and running The Bar is Raised Content User Relationships Interaction – Relevance is (always will be?) hard Holistic view of the data AND the users is critical Access | 2
  • Topics Quick Background and needs Architecture – Abstract – Practical SDA In Practice – Components – Challenges and Lessons Learned Wrap Up | 3
  • Why Search, Discovery and Analytics (SDA)?  User Needs – Real-time, ad hoc access to content – Aggressive Prioritization based on Importance Search – Serendipity – Feedback/Learning from past  Business Needs Analytics Discovery – Deeper insight into users – Leverage existing internal knowledge – Cost effective | 4
  • What Do Developers Need for SDA? Fast, efficient, scalable search – Bulk and Near Real Time Indexing – Handle billions of records w/ sub-second search and faceting Large scale, cost effective storage and processing capabilities – Need whole data consumption and analysis – Experimentation/Sampling tools – Distributed In Memory where appropriate NLP and machine learning tools that scale to enhance discovery and analysis | 5
  • Abstract -> Practical SDA Architecture Access (API, UI,Visualization) Search, Discovery and Analytics Glue Stats Mahout, R, GATE, Others Pig, Machine Docs User Admin Package Learning Access Modeling Experiment Mgmt Service Mgmt Content Computation and Storage Acquisition DB Dist. Data Search NoSQL Process Mgmt KV Shards Shards Shards Shards Shards Shards Shards Logs DFS Provisioning, Monitoring, Infrastructure | 6
  • Computation and Storage Solr Hadoop HBase• Document Index • Stores Logs, • Metric Storage• Document Raw files, • User Histories Storage? intermediate • Document files, etc. Storage?• SolrCloud • WebHDFS makes sharding easy • Small file are an unnatural actChallenges • Who is the authoritative store? Solr or HBase? • Real time vs. Batch • Where should analysis be done? | 7
  • Search In Practice Three primary concerns – Performance/Scaling – Relevance – Operations: monitoring, failover, etc. Business typically cares more about relevance Devs more about performance (and then ops) | 8
  • Search with Solr: Scaling and NRT SolrCloud takes care of distributed indexing and search needs – Transaction logs for recovery – Automatic leader election, so no more master/worker – Have to declare number of shards now, but splitting coming soon – Use CloudSolrServer in SolrJ NRT Config tips: – 1 second soft commits for NRT updates – 1 minute hard commits (no searcher reopen) | 9
  • Search: Relevance ABT – Always Be Testing – Experiment management is critical – Top X + Random Sampling of Long Tail – Click logs Track Everything! – Queries – Clicks – Displayed Documents – Mouse/Scroll tracking??? Phrases are your friend | 10
  • Discovery Components Serendipity Organization Data Quality• Trends • Importance • Document factor• Topics • Clustering Distributions• Recommendations • Classification • Length• Related Items • Named Entities • Boosts• More Like This • Time Factors • Duplicates• Did you mean? • Faceting• Stat. Interesting PhrasesChallenges • Many of these are intense calculations or iterative • Many are subjective and require a lot of experimentation | 11
  • Discovery with Mahout Mahout’s 3 “C”s provide tools for helping across many aspects of discovery – Collaborative Filtering – Classification – Clustering Also: – Collocations (Statistically Interesting Phrases) – SVD – Others Challenges: – High cost to iterative machine learning algorithms – Mahout is very command line oriented – Some areas less mature | 12
  • Aside: Experiment Management Plan for running experiments from the beginning across Search and Discovery components – Your analytics engine should help! Types of Experiments to consider – Indexing/Analysis – Query parsing – Scoring formulas – Machine Learning Models – Recommendations, many more Make it easy to do A/B testing across all experiments and compare and contrast the results | 13
  • Analytics Components Commonly used components – Solr – R Stats – Hive – Pig – Commercial Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics | 14
  • Analytics in Practice Simple Counts: – Facets – Term and Document frequencies – Clicks Search and Discovery example metrics – Relevance measures like Mean Reciprocal Rank – Histograms/Drilldowns around Number of Results – Log and navigation analysis Data cleanliness analysis is helpful for finding potential issues in content | 15
  • Wrap Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users Solr + Hadoop + Mahout Design for the big picture when building search-based applications | 16
  • Find me http://www.lucidimagination.com grant@lucidimagination.com @gsingers | 17