Leveraging Solr and Mahout


My talk from last night's Big Data Warehouse meetup in NYC, on using Solr and Mahout to build next-generation data access tools.

  1. Leveraging Solr and Mahout for Next Gen Data Access and Insight
     Grant Ingersoll, Chief Scientist
     Confidential © Copyright 2012
  2. Search is Dead, Long Live Search
     • Modern Data Challenges are multi-structured
     • Search is a system building block
       - Text is only a part of the story
     • If the algorithms fit, use them!
     • Embrace fuzziness!
     • Scoring features are everywhere
     [Diagram labels: Content, Relationships, Users, Access]
     Confidential and Proprietary © 2012 LucidWorks
  3. Topics
     • Intros
     • Search (R)Evolution
     • Apache Solr
     • Apache Mahout
     • Search and Machine Learning
     • Scaling
  4. Grant’s Background
     • Co-founder:
       - LucidWorks (Chief Scientist)
       - Apache Mahout
     • Long time Lucene/Solr committer
     • Author: Taming Text
     • Background in IR and NLP
       - Built CLIR, QA and a variety of other search-based apps
  5. Search (R)evolution
     • Search use leads to search abuse
       - Denormalization frees your mind
       - Scoring is just a sparse matrix multiply
     • Lucene/Solr evolution
       - Non-free text usages abound
       - Many DB-like features
       - NoSQL before NoSQL was cool
       - Flexible indexing
       - Finite State Transducers FTW!
     • Scale
     • “This ain’t your father’s relevance anymore”
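
The slide's remark that scoring is just a sparse matrix multiply can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy corpus, the use of raw term counts as weights, and the helper names `build_index` and `score`. Real Lucene/Solr scoring uses more elaborate weighting; this only shows the shape of the computation.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the document IDs and text are illustrative only.
DOCS = {
    "d1": "solr search server",
    "d2": "mahout machine learning",
    "d3": "search and machine learning with solr",
}


def build_index(docs):
    """Store the term-document matrix sparsely: term -> {doc_id: count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return index


def score(index, query):
    """Multiply the sparse query vector by the sparse term-document matrix."""
    query_vec = Counter(query.split())
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only the non-zero entries for this term are ever touched.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


index = build_index(DOCS)
# Sort by descending score, breaking ties by document ID.
ranked = sorted(score(index, "solr search").items(), key=lambda kv: (-kv[1], kv[0]))
```

Documents that share no term with the query never appear in the result, which is why an inverted index makes this multiply cheap.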
  6. Apache Solr?
     • “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
     • Did I mention free?
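
As a rough illustration of the HTTP/JSON API mentioned in the quote, the sketch below builds a query URL for Solr's `select` handler with faceting enabled. The parameters `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr request parameters; the base URL, the field names, and the helper `build_select_url` are assumptions made for this example. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, query, facet_field=None, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_field is not None:
        # Ask Solr to return facet counts for one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "title:mahout", facet_field="category"
)
```

The resulting URL could be fetched with any HTTP client, and the JSON body would then carry both the matching documents and the facet counts.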
  7. Apache Mahout
     • Goal: create library of scalable machine learning algorithms
     • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
       - Collaborative Filtering
       - Classification
       - Clustering
     • Also:
       - Collocations (Statistically Interesting Phrases)
       - SVD
       - Java math, primitives libraries and more
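
The slide does not say how "statistically interesting" phrases are scored. One widely used statistic for collocations is Dunning's log-likelihood ratio (LLR), and choosing it here is an assumption. The sketch below computes LLR from a 2x2 contingency table of bigram counts; the function names are hypothetical and this is not Mahout's code.

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table.

    k11: times word A and word B occur together
    k12: times A occurs without B
    k21: times B occurs without A
    k22: times neither occurs
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))
```

A pair of words that co-occur exactly as often as chance predicts scores about zero, while a pair that almost always appears together scores high, so sorting candidate bigrams by this value surfaces likely phrases.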
  8. Search + Machine Learning
     • Search-driven applications present multiple opportunities for leveraging machine learning
       - Clustering: enhance discovery, outlier detection
       - Classification: queries, documents, users
       - Content Recommendation: collaborative filtering and personalization
       - NLP: phrases, named entities, co-reference, much more
     • Many of these can also power faceted navigation
     • Aside: Search can also often be used effectively to implement many machine learning algorithms
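
The aside about implementing machine learning with search can be sketched as a nearest-neighbor classifier: treat the text to classify as a query, retrieve the most similar labeled documents, and let them vote. The training data, the term-overlap ranking, and the function names are all assumptions for illustration; a real system would delegate the `search` step to an engine such as Solr.

```python
from collections import Counter

# Hypothetical labeled documents, for illustration only.
TRAINING = [
    ("solr faceted search index", "search"),
    ("lucene query index relevance", "search"),
    ("mahout clustering kmeans vectors", "ml"),
    ("classification training vectors model", "ml"),
]


def search(query, k):
    """Rank training documents by shared terms; return labels of the top k hits."""
    q_terms = set(query.split())
    hits = []
    for text, label in TRAINING:
        overlap = len(q_terms & set(text.split()))
        if overlap > 0:
            hits.append((overlap, label))
    # Stable sort keeps corpus order among equally scored hits.
    hits.sort(key=lambda hit: -hit[0])
    return [label for _, label in hits[:k]]


def classify(text, k=3):
    """Label the text by majority vote among its k nearest search hits."""
    votes = Counter(search(text, k))
    return votes.most_common(1)[0][0] if votes else None
```

For example, `classify("index relevance tuning")` retrieves only documents labeled `search`, so that label wins the vote; text that matches nothing returns `None`.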
  9. How and When
     [Architecture diagram; recoverable labels, grouped as best as the extracted text allows:]
     • Access APIs
     • Search View
     • Analytic Services: view into numeric/historic data
     • Personalization & Machine Learning Services: Classification, Recommendation
     • Shards 1, 2, 3 ... N
     • Document Store: Documents, Users, Logs
     • Classification Models
     • Discovery & Enrichment: clustering, classification, NLP, topic identification, search log analysis, user behavior
     • In memory, Replicated, Multi-tenant
     • Content Acquisition: ETL, batch or near real-time
     • Data: LucidWorks Search connectors, Push
  10. Scaling
      • Search
        - Solr Cloud = Large scale, distributed search and faceting
      • Machine Learning
        - Mahout is built on Hadoop for most things
        - SGD is sequential and really fast
      • Sometimes all you can do is make an educated guess
        - Storm, Kafka, etc. can help by allowing you to make estimates in near real time
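
To show what "SGD is sequential" means in practice, here is a minimal sketch of logistic regression trained by stochastic gradient descent, updating the model after every single example. It does not reproduce Mahout's implementation; the function names, learning rate, epoch count, and toy data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_train(examples, learning_rate=0.1, epochs=500):
    """Train logistic regression one example at a time."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(z)
            # Each update depends on the previous one, hence "sequential".
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


def predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical toy data: the label equals the first feature.
DATA = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
weights, bias = sgd_train(DATA)
```

Because each step only needs the current example and the current weights, the model can be trained in a single pass over a stream without holding the data set in memory.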
  11. Wrap
      • Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
      • LucidWorks has combined many of these things into LucidWorks Big Data
      • Design for the big picture when building search-based applications
  12. Resources
      • LucidWorks
        - @LucidImagineer
      • Me
        - @gsingers
      • Taming Text
        - @tamingtext