Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Large Scale Search, Discovery                                and Analysis in Action                                Grant I...
Search is Dead, Long Live Search• Good keyword search is a commodity and easy to get up and running                       ...
Topics• Background and needs• Architecture• Road Ahead• SDA In Action   • Components   • Challenges and Lessons Learned• W...
Confidential © Copyright 2012
Sample Use Cases    • Claims processing and analysis, including fraud analysis    • Large scale content acquisition and ac...
In Focus: Personalized Medicine                                              Alignment                                    ...
In Focus: Log Processing in Telecommunications    • Each year, large sums of money are lost due to       fraudulent calls ...
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Computation and Storage       LucidWorks                                     Hadoop                HBase       Search/Solr...
Search In Practice• Three primary concerns   • Performance/Scaling   • Relevance   • Operations: monitoring, failover, etc...
Search: Relevance• Always Be Testing  • Experiment management is critical  • Top X + sampling  • Click Logs• Track Everyth...
Discovery Components          Serendipity              Organization         Data Quality•   Trends                     • I...
Discovery with Mahout• Mahout’s 3 “C”s provide tools for helping across many   aspects of discovery   • Collaborative Filt...
Aside: Experiment Management• Plan for running experiments from the beginning   across Search and Discovery components   •...
Analytics in Practice• Many of the components discussed provide analytical  features  • Leverage existing tools: R, etc.• ...
Wrap• Search, Discovery and Analytics, when combined into   a single, coherent system provides powerful insight   into bot...
Discussion and Resources     • Questions?     • http://www.lucidworks.com     • grant@lucidworks.com     • @gsingers     C...
Upcoming SlideShare
Loading in …5
×

Large Scale Search, Discovery and Analytics in Action

1,567 views

Published on

Published in: Technology, Education
  • Be the first to comment

Large Scale Search, Discovery and Analytics in Action

  1. 1. Large Scale Search, Discovery and Analysis in Action Grant Ingersoll Chief Scientist September 18, 2012Confidential © Copyright 2012
  2. 2. Search is Dead, Long Live Search• Good keyword search is a commodity and easy to get up and running Documents• The Bar is Raised • Relevance is (always will be?) hard Content User• Holistic view of the data Relationships Interaction AND the users is critical• Search, Discovery and Analytics are the key to unlocking this view of users Access and data Confidential and Proprietary © 2012 LucidWorks
  3. 3. Topics• Background and needs• Architecture• Road Ahead• SDA In Action • Components • Challenges and Lessons Learned• Wrap UpConfidential and Proprietary© 2012 LucidWorks
  4. 4. Confidential © Copyright 2012
  5. 5. Sample Use Cases • Claims processing and analysis, including fraud analysis • Large scale content acquisition and access for: • Defense, intelligence and pharmaceutical applications • Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes • Analysis of Website and social media interactions • Access and processing of genetic information for improved medical treatments • Log processing and fraud detection in telecommunications Confidential and Proprietary5 © 2012 LucidWorks
  6. 6. In Focus: Personalized Medicine Alignment and other Genetic analysis Variations Patient DNA Standard Therapies Alternative Therapies Search and Faceting Confidential and Proprietary6 © 2012 LucidWorks
  7. 7. In Focus: Log Processing in Telecommunications • Each year, large sums of money are lost due to fraudulent calls and poor service • Logs are usually semi-structured and contain vital information about errors and fraud • Deeper batch analytics can provide insight into patterns across vast amounts of data • Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities Confidential and Proprietary7 © 2012 LucidWorks
  8. 8. Confidential © Copyright 2012
  9. 9. Confidential © Copyright 2012
  10. 10. Confidential © Copyright 2012
  11. 11. Confidential © Copyright 2012
  12. 12. Confidential © Copyright 2012
  13. 13. Computation and Storage LucidWorks Hadoop HBase Search/Solr• Document Index • Stores Logs, • Metric Storage Raw files,• Faceting intermediate • User files, etc. Histories/Profile • WebHDFS• SolrCloud makes sharding • Document easy • Small files are Storage an unnatural actChallenges • Who is the authoritative store? • Real time vs. Batch • Where should analysis be done? Confidential and Proprietary © 2012 LucidWorks
  14. 14. Search In Practice• Three primary concerns • Performance/Scaling • Relevance • Operations: monitoring, failover, etc.• Business typically cares more about relevance• Devs care more about performance at first…Confidential and Proprietary© 2012 LucidWorks
  15. 15. Search: Relevance• Always Be Testing • Experiment management is critical • Top X + sampling • Click Logs• Track Everything! • Queries • Clicks • Displayed Documents • Mouse/Scroll tracking?• Phrases are your friendsConfidential and Proprietary© 2012 LucidWorks
  16. 16. Discovery Components Serendipity Organization Data Quality• Trends • Importance • Document factor• Topics • Clustering Distributions• Recommendations • Classification • Length• Related Items • Named Entities • Boosts• More Like This • Time Factors • Duplicates• Did you mean? • Faceting• Stat. Interesting PhrasesChallenges• Many of these are intense calculations or iterative• Many are subjective and require a lot of experimentationConfidential and Proprietary© 2012 LucidWorks
  17. 17. Discovery with Mahout• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery • Collaborative Filtering • Classification • Clustering• Also: • Collocations (Statistically Interesting Phrases) • Singular Value Decomposition (SVD) • Others• Challenges: • High cost to iterative machine learning algorithms • Mahout is very command line oriented • Some areas less matureConfidential and Proprietary© 2012 LucidWorks
  18. 18. Aside: Experiment Management• Plan for running experiments from the beginning across Search and Discovery components • Your engine should help!• Types of Experiments to consider • Indexing/Analysis • Query parsing • Scoring formulas • Machine Learning Models • Recommendations, many more• Make it easy to do A/B testing across all experiments and compare and contrast the resultsConfidential and Proprietary© 2012 LucidWorks
  19. 19. Analytics in Practice• Many of the components discussed provide analytical features • Leverage existing tools: R, etc.• Simple Counts: • Facets • Term and Document frequencies • Clicks• Search and Discovery example metrics • Relevance measures like Mean Reciprocal Rank • Histograms/Drilldowns around Number of Results • Log and navigation analysis• Data cleanliness analysis is helpful for finding potential issues in contentConfidential and Proprietary© 2012 LucidWorks
  20. 20. Wrap• Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users• LucidWorks has combined many of these things into LucidWorks Big Data • http://www.lucidworks.com/products/lucidworks-big-data• Design for the big picture when building search-based applicationsConfidential and Proprietary© 2012 LucidWorks
  21. 21. Discussion and Resources • Questions? • http://www.lucidworks.com • grant@lucidworks.com • @gsingers Confidential and Proprietary21 © 2012 LucidWorks

×