Big Data Warehousing: Leveraging Solr & Mahout


Over the past few years, relevant recommendations have become an expected and essential part of the customer experience. From the customer’s perspective, marketing interactions are becoming helpful and time-saving instead of generic, out of context, and annoying. If you shop at any of the major online retailers, such as Amazon or Bluefly, you may think they have somehow gotten inside your head as they present and recommend products relevant to you. This is a dramatic improvement over the traditional psycho-demographic profiling and targeting of the “old world”.

We talked about how Mahout can be leveraged to build a recommendation engine with a minimum of coding. We discussed how the open source search and machine learning capabilities of Apache Solr and Mahout can be combined to power large-scale, data-driven applications that pair real-time access with large-scale enrichment and discovery.
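
To make the recommendation idea concrete, here is a minimal sketch of item-based collaborative filtering by co-occurrence counting. It is plain Python rather than Mahout's Java API, and the purchase data, function names, and scoring rule are illustrative assumptions, not material from the talk.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of distinct items shares a user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, baskets, counts, top_n=3):
    """Score unseen items by summed co-occurrence with the user's items."""
    seen = set(baskets[user])
    scores = defaultdict(int)
    for item in seen:
        for other, c in counts[item].items():
            if other not in seen:
                scores[other] += c
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories (user -> items bought).
baskets = {
    "alice": ["boots", "scarf", "gloves"],
    "bob": ["boots", "scarf"],
    "carol": ["scarf", "gloves", "hat"],
    "dave": ["boots"],
}
counts = build_cooccurrence(baskets)
print(recommend("dave", baskets, counts))
```

At scale, the same counting step is the part a system like Mahout would distribute; the sketch only shows the shape of the computation on data that fits in memory.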

Caserta Concepts has grown beyond its roots as a provider of traditional data warehouse and BI consulting to also offer big data warehousing. If you’re a developer and are experienced in Hadoop, Hive, HBase, Mahout, Datameer or other Big Data technologies, we want to get to know you!

For more information, visit

Published in: Technology

  • This is a money slide where people should say “Wow, man”. They shouldn’t understand the implications of this, but they should be very, very aware that something big just slid into the room. Tech Building Block: not just text; not just users + queries. Embrace Fuzziness: especially in Big Data, it is the only way you are going to survive. TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  • Search Abuse: Can discuss how I started just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value lookup, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more. Search has added more DB features over the years. TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  • Big Picture: too often devs are stuck in the weeds
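
The speaker notes list record linkage among the unexpected uses of a search engine. As a rough illustration of why an inverted index suits that job, the sketch below indexes records by character trigrams and ranks candidates by Jaccard overlap. The class name, the sample records, and the 0.5 threshold are hypothetical choices for this example and are not how Lucene or Solr implement fuzzy matching.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of character trigrams of a normalized, padded string."""
    s = "  " + " ".join(text.lower().split()) + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """A tiny inverted index: trigram -> ids of records containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.grams = {}  # record id -> trigram set

    def add(self, rec_id, text):
        g = trigrams(text)
        self.grams[rec_id] = g
        for t in g:
            self.postings[t].add(rec_id)

    def search(self, text, min_score=0.5):
        """Rank indexed records by Jaccard overlap with the query trigrams."""
        q = trigrams(text)
        candidates = set()
        for t in q:
            candidates |= self.postings.get(t, set())
        hits = []
        for rec_id in candidates:
            g = self.grams[rec_id]
            score = len(q & g) / len(q | g)
            if score >= min_score:
                hits.append((rec_id, round(score, 2)))
        return sorted(hits, key=lambda h: (-h[1], h[0]))


# Hypothetical records to link against.
index = TrigramIndex()
index.add(1, "Grant Ingersoll")
index.add(2, "Apache Mahout")

# A misspelled, differently cased query still finds record 1.
matches = index.search("grant  INGERSOL")
print(matches)
```

The index narrows the comparison to records that share at least one trigram with the query, which is the same candidate-generation trick a search engine applies to terms.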

    1. Leveraging Solr and Mahout for Next Gen Data Access and Insight
       Grant Ingersoll, Chief Scientist
       Confidential © Copyright 2012
    2. Search is Dead, Long Live Search
       • Modern Data Challenges are multi-structured
       • Search is a system building block
         - Text is only a part of the story
       • If the algorithms fit, use them!
       • Embrace fuzziness!
       • Scoring features are everywhere
       [Diagram labels: Content, Relationships, Users, Access]
       Confidential and Proprietary © 2012 LucidWorks
    3. Topics
       • Intros
       • Search (R)Evolution
       • Apache Solr
       • Apache Mahout
       • Search and Machine Learning
       • Scaling
    4. Grant’s Background
       • Co-founder:
         - LucidWorks – Chief Scientist
         - Apache Mahout
       • Long time Lucene/Solr committer
       • Author: Taming Text
       • Background in IR and NLP
         - Built CLIR, QA and a variety of other search-based apps
    5. Search (R)evolution
       • Search use leads to search abuse
         - Denormalization frees your mind
         - Scoring is just a sparse matrix multiply
       • Lucene/Solr evolution
         - Non-free text usages abound
         - Many DB-like features
         - NoSQL before NoSQL was cool
         - Flexible indexing
         - Finite State Transducers FTW!
       • Scale
       • “This ain’t your father’s relevance anymore”
    6. Apache Solr?
       • “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
       • Did I mention free?
    7. Apache Mahout
       • Goal: create library of scalable machine learning algorithms
       • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
         - Collaborative Filtering
         - Classification
         - Clustering
       • Also:
         - Collocations (Statistically Interesting Phrases)
         - SVD
         - Java math, primitives libraries and more
    8. Search + Machine Learning
       • Search-driven applications present multiple opportunities for leveraging machine learning
         - Clustering – Enhance Discovery, outlier detection
         - Classification – Queries, Documents, Users
         - Content Recommendation – Collab. Filtering and personalization
         - NLP – phrases, named entities, co-reference, much more
       • Many of these can also power faceted navigation
       • Aside: Search can also often be used effectively to implement many machine learning algorithms
    9. How and When
       [Architecture diagram; labels recovered from the slide]
       • Access: APIs, Search View
       • Analytic Services: view into numeric/historic data
       • Personalization & Machine Learning Services: Classification, Recommendation
       • Shards: 1, 2, 3 ... N
       • Document Store: Documents, Users, Logs
       • Classification Models
       • In memory, Replicated, Multi-tenant
       • Discovery & Enrichment: clustering, classification, NLP, topic identification, search log analysis, user behavior
       • Content Acquisition: ETL, batch or near real-time
       • Data: LucidWorks Search connectors, Push
    10. Scaling
       • Search
         - Solr Cloud = Large scale, distributed search and faceting
       • Machine Learning
         - Mahout is built on Hadoop for most things
         - SGD is sequential and really fast
       • Sometimes all you can do is make an educated guess
         - Storm, Kafka, etc. can help by allowing you to make estimates in near real time
    11. Wrap
       • Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
       • LucidWorks has combined many of these things into LucidWorks Big Data
       • Design for the big picture when building search-based applications
    12. Resources
       • LucidWorks: @LucidImagineer
       • Me: @gsingers
       • Taming Text: @tamingtext
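
Slide 5 claims that scoring is just a sparse matrix multiply. The sketch below shows one way to read that claim: an inverted index is a sparse term-by-document matrix, and scoring a query multiplies that matrix by a sparse query vector. The weighting is a simplified TF-IDF chosen for illustration, not Lucene's actual similarity formula, and the documents and function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a sparse term-by-document matrix of TF-IDF weights.

    The matrix is stored as {term: {doc_id: weight}}, which has the same
    shape as an inverted index with weighted postings.
    """
    df = Counter()
    for text in docs.values():
        df.update(set(text.lower().split()))
    n = len(docs)
    matrix = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            idf = math.log(1 + n / df[term])
            matrix[term][doc_id] = tf * idf
    return matrix


def score(matrix, query):
    """Multiply the sparse matrix by a sparse query vector.

    Only the rows for terms present in the query are visited, so the work
    is proportional to the postings of the query terms.
    """
    q = Counter(query.lower().split())
    scores = defaultdict(float)
    for term, q_weight in q.items():
        for doc_id, weight in matrix.get(term, {}).items():
            scores[doc_id] += q_weight * weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical three-document corpus.
docs = {
    "d1": "solr search server",
    "d2": "mahout machine learning",
    "d3": "search and machine learning",
}
matrix = build_index(docs)
ranking = score(matrix, "machine learning search")
print([doc_id for doc_id, _ in ranking])
```

Seen this way, the "features" in the matrix need not be words at all, which is one reading of the deck's point that scoring features are everywhere.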