Leveraging Solr and Mahout

My talk from last night's Big Data Warehouse meetup in NYC on using Solr and Mahout to build next generation data access tools

Published in: Technology
  • This is a money slide where people should say “Wow, man”. They shouldn’t understand the implications of this, but they should be very, very aware that something big just slid into the room.
    Tech Building Block: not just text, not just users + queries.
    Embrace Fuzziness: especially in Big Data, it is the only way you are going to survive.
    TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  • Search Abuse: Can discuss how I started out just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more.
    Search has added more DB features over the years.
    TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  • Big Picture: too often devs are stuck in the weeds
  • Leveraging Solr and Mahout

    1. Leveraging Solr and Mahout for Next Gen Data Access and Insight
       Grant Ingersoll, Chief Scientist
       Confidential © Copyright 2012
    2. Search is Dead, Long Live Search
       • Modern Data Challenges are multi-structured
       • Search is a system building block
         - Text is only a part of the story
       • If the algorithms fit, use them!
       • Embrace fuzziness!
       • Scoring features are everywhere
       [Diagram labels: Content, Relationships, Users, Access]
       Confidential and Proprietary © 2012 LucidWorks
    3. Topics
       • Intros
       • Search (R)Evolution
       • Apache Solr
       • Apache Mahout
       • Search and Machine Learning
       • Scaling
    4. Grant’s Background
       • Co-founder:
         - LucidWorks – Chief Scientist
         - Apache Mahout
       • Long time Lucene/Solr committer
       • Author: Taming Text
         - www.manning.com/ingersoll
       • Background in IR and NLP
         - Built CLIR, QA and a variety of other search-based apps
    5. Search (R)evolution
       • Search use leads to search abuse
         - Denormalization frees your mind
         - Scoring is just a sparse matrix multiply
       • Lucene/Solr evolution
         - Non-free text usages abound
         - Many DB-like features
         - NoSQL before NoSQL was cool
         - Flexible indexing
         - Finite State Transducers FTW!
       • Scale
       • “This ain’t your father’s relevance anymore”
    6. Apache Solr?
       • “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
         - http://lucene.apache.org/solr
       • Did I mention free?
    7. Apache Mahout
       • Goal: create library of scalable machine learning algorithms
       • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
         - Collaborative Filtering
         - Classification
         - Clustering
       • Also:
         - Collocations (Statistically Interesting Phrases)
         - SVD
         - Java math, primitives libraries and more
    8. Search + Machine Learning
       • Search-driven applications present multiple opportunities for leveraging machine learning
         - Clustering – Enhance Discovery, outlier detection
         - Classification – Queries, Documents, Users
         - Content Recommendation – Collab. Filtering and personalization
         - NLP – phrases, named entities, co-reference, much more
       • Many of these can also power faceted navigation
       • Aside: Search can also often be used effectively to implement many machine learning algorithms
    9. How and When
       [Architecture diagram; recoverable labels:]
       • Access APIs
       • Search View
       • Analytic Services: view into numeric/historic data
       • Personalization & Machine Learning Services: Classification, Recommendation
       • Shards 1, 2, 3 ... N
       • Document Store: Documents, Users, Logs
       • Classification Models
       • Discovery & Enrichment: clustering, classification, NLP, topic identification, search log analysis, user behavior
       • Other labels: In memory, Replicated, Multi-tenant
       • Content Acquisition: ETL, batch or near real-time
       • Data: LucidWorks Search connectors, Push
    10. Scaling
        • Search
          - SolrCloud = Large scale, distributed search and faceting
            » http://wiki.apache.org/solr/SolrCloud
        • Machine Learning
          - Mahout is built on Hadoop for most things
          - SGD is sequential and really fast
        • Sometimes all you can do is make an educated guess
          - Storm, Kafka, etc. can help by allowing you to make estimates in near real time
    11. Wrap
        • Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
        • LucidWorks has combined many of these things into LucidWorks Big Data
          - http://www.lucidworks.com/products/lucidworks-big-data
        • Design for the big picture when building search-based applications
    12. Resources
        • LucidWorks
          - http://www.lucidworks.com
          - http://www.lucidworks.com/products/lucidworks-big-data
          - @LucidImagineer
        • Me
          - grant@lucidworks.com
          - @gsingers
        • Taming Text
          - http://www.manning.com/ingersoll
          - http://www.tamingtext.com
          - @tamingtext
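
Slide 5 claims that "scoring is just a sparse matrix multiply". The sketch below is one way to read that claim, not Lucene's actual scoring formula: the inverted index is treated as a sparse term-by-document matrix of TF-IDF weights, the query as a sparse term vector, and the scores as their product, computed by touching only nonzero entries. The function names, the weighting scheme, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a sparse term-by-document matrix: postings[term][doc_id] = weight."""
    n_docs = len(docs)
    term_counts = [Counter(text.lower().split()) for text in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    postings = defaultdict(dict)
    for doc_id, counts in enumerate(term_counts):
        for term, tf in counts.items():
            idf = math.log(1 + n_docs / doc_freq[term])  # illustrative weighting
            postings[term][doc_id] = tf * idf
    return postings, n_docs


def score(postings, n_docs, query):
    """Sparse matrix-vector product: only nonzero (term, doc) cells are visited."""
    q = Counter(query.lower().split())
    scores = [0.0] * n_docs
    for term, q_weight in q.items():
        for doc_id, weight in postings.get(term, {}).items():
            scores[doc_id] += weight * q_weight
    return scores


docs = [
    "solr is a search server",
    "mahout is a machine learning library",
    "search meets machine learning",
]
postings, n_docs = build_index(docs)
scores = score(postings, n_docs, "search server")
ranked = sorted(range(n_docs), key=lambda d: scores[d], reverse=True)
print(ranked)
```

Because the loop follows posting lists rather than scanning every document, the cost depends on the number of matching postings, which is the property that lets a search engine stand in for other sparse linear-algebra workloads.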

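
Slide 10 notes that "SGD is sequential and really fast". A minimal sketch of why, assuming a plain logistic regression trained by stochastic gradient descent: each example updates the weights immediately and is then discarded, so training is a single-threaded stream over the data. This is not Mahout's implementation; the feature layout, learning rate, and toy examples are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_train(examples, n_features, learning_rate=0.5, epochs=50):
    """Update the weights after every single example (sequential by design)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in examples:
            z = sum(weights[i] * v for i, v in features.items())
            error = label - sigmoid(z)
            # Only the features present in this example are touched.
            for i, v in features.items():
                weights[i] += learning_rate * error * v
    return weights


def predict(weights, features):
    return sigmoid(sum(weights[i] * v for i, v in features.items()))


# Hypothetical sparse features: 0 = bias, 1 = "search" term, 2 = "ml" term.
examples = [
    ({0: 1.0, 1: 1.0}, 1),
    ({0: 1.0, 2: 1.0}, 0),
    ({0: 1.0, 1: 1.0}, 1),
    ({0: 1.0, 2: 1.0}, 0),
]
weights = sgd_train(examples, n_features=3)
p_search = predict(weights, {0: 1.0, 1: 1.0})
p_ml = predict(weights, {0: 1.0, 2: 1.0})
print(p_search > 0.5, p_ml < 0.5)
```

Each update costs time proportional to the number of nonzero features in one example, which is why this style of learner can keep up with a stream without a batch cluster job.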