Extending Solr: Building a Cloud-like
   Knowledge Discovery Platform


       Trey Grainger, CareerBuilder
Overview
CareerBuilder’s Cloud-like Knowledge Discovery Platform
   • Scalable approaches to multi-lingual text
     analysis (with research study)
       Multiple fields vs Multiple Cores vs Single Field

   • Custom Scoring
       Payloads and on-the-fly bucket scoring
       Implementing a keyword spamming penalty

   • Solr as a Cloud Service
       Scalable, customizable search for everybody

   • Knowledge Discovery & Data Analytics
My background
 Trey Grainger
   • Search Technology Development Team Lead
     @ CareerBuilder.com


 Relevant Background:
   • Search & Recommendations
   • High-volume, N-tier Architectures
   • NLP, Relevancy Tuning, user group testing & machine
     learning


 Fun Side Project:
   • Founder and Site Architect @ Celiaccess.com
CareerBuilder’s Search Scale
 Over 1 million new jobs each month
 Over 40 million resumes
 ~150 globally distributed search servers
  (in the U.S., Europe, & Asia)
 Several thousand unique, dynamically generated
  indexes
 Over a million searches an hour
 >100 Million Search Documents
Job Search
Resume Search
Talent Network Search
Auto-Complete
Geo-spatial Search
Recommendations
 We classify all content (Jobs, Resumes, etc.) and index
  the classified content into Solr

 We use a combination of collaborative filtering and
  classification techniques

 We utilize a custom scorer and payloads to apply
  higher bucket weights to more relevant content

 Recommendations are real-time and largely driven by
  search
Job Recommendations
Resume Recommendations
Multi-lingual Analysis
 Approach 1: Different Field Per Language
   •   Advantages:
         Simple, easiest to implement
   •   Disadvantages:
         May require keeping duplicate copies of your text per language
         If searching across each field (dismax style), slows search down, especially if
          handling many languages


 Approach 2: Different Solr Core per language
   Each core has your field defined with a different Analyzer chain
   specific to that core’s language
   •   Advantages:
         Searching can be completely language-agnostic and additional overhead to search
          more languages simultaneously is negligible
   •   Disadvantages:
         Multi-lingual documents require indexing to multiple cores, potentially messing up
          relevancy and adding complexity
         Have to write your own language-dependent sharding
         If you don’t already have distributed search, this adds complexity and overhead
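
A minimal sketch of the language-dependent routing that Approach 2 requires, assuming a hypothetical core-naming scheme ("jobs_en", "jobs_fr", ...) and a hypothetical list of supported languages. Neither comes from the slides, and language detection itself is out of scope here:

```python
# Hypothetical naming scheme: one Solr core per language, e.g. "jobs_en".
CORE_PREFIX = "jobs"
SUPPORTED = {"en", "fr", "de"}  # assumed set of languages with their own core
DEFAULT_LANG = "en"             # assumed fallback


def cores_for(doc_languages):
    """Return the sorted list of cores a document should be indexed into."""
    langs = {lang for lang in doc_languages if lang in SUPPORTED}
    if not langs:
        # Fall back when no supported language was detected.
        langs = {DEFAULT_LANG}
    return sorted(f"{CORE_PREFIX}_{lang}" for lang in langs)
```

A multi-lingual document maps to more than one core, which is the duplication and relevancy risk listed in the disadvantages above.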
Multi-lingual Analysis
   Approach 3: All languages in one field
         •   Advantages:
               Only one field needed regardless of number of languages
               Avoids a field explosion or a Solr core explosion as you scale to handle more languages

         •   Disadvantages:
               Can end up with some “noise” in the index if you process most text in lots of languages
                (especially if stemming and not lemmatizing)
               Currently requires writing your own Tokenizer or Filter


   Strategy:
         •   1) Copy token stream and create a stemmer/lemmatizer for each language
             2) Pass the original into each stemmer/lemmatizer
             3) Stack the outputs of each stemmer/lemmatizer
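
The three steps above can be sketched in a few lines of Python. This is an illustration of the stacking idea only, not the actual Solr Tokenizer/Filter: the two "stemmers" are toy suffix rules invented for the example, and a real implementation would plug in proper per-language stemmers or lemmatizers.

```python
def stem_en(token):
    # Toy English rule (assumption): strip a trailing plural "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def stem_es(token):
    # Toy Spanish rule (assumption): strip a trailing plural "es".
    return token[:-2] if token.endswith("es") and len(token) > 4 else token


STEMMERS = [stem_en, stem_es]


def stack_tokens(tokens, stemmers=STEMMERS):
    """Return (position, term) pairs; variants of one token share its position."""
    out = []
    for position, token in enumerate(tokens):
        seen = []
        for stem in stemmers:
            term = stem(token)  # each stemmer receives the original token
            if term not in seen:  # drop duplicates produced by several stemmers
                seen.append(term)
        out.extend((position, term) for term in seen)
    return out
```

For example, stack_tokens(["flores", "red"]) yields "flore" and "flor" at position 0 and a single "red" at position 1. In Lucene terms, sharing a position corresponds to emitting the extra variants with a position increment of 0.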

Multi-lingual Analysis
 Case Study: Stemming vs. Lemmatization
  •   Example: dries >> dri vs dries >> dry




                          Measuring Recall Overlap Between Options

  Take-away: Lemmatization allows you to greatly increase recall while
  preserving the precision you lose with stemming

  e.g. English shows a 92% increase in recall using lemmatization with
  minimal impact on precision
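
A small sketch of why the two options differ in recall, using the "dries" example. The stemmer below is a simplified set of plural rules in the style of Porter step 1a (not the full algorithm), and the lemma dictionary and documents are made up for illustration:

```python
def toy_stem(word):
    # Simplified plural rules in the style of Porter step 1a.
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]  # "dries" -> "dri"
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


LEMMAS = {"dries": "dry", "dried": "dry"}  # hypothetical tiny dictionary


def lemmatize(word):
    return LEMMAS.get(word, word)


def matching_docs(query, docs, normalize):
    """Ids of documents containing a word that normalizes like the query."""
    target = normalize(query)
    return {doc_id for doc_id, text in docs.items()
            if target in {normalize(w) for w in text.split()}}


DOCS = {1: "paint dries fast", 2: "keep it dry", 3: "wet paint"}
```

With these toy inputs, a search for "dry" matches only document 2 under stemming (because "dries" becomes "dri") but documents 1 and 2 under lemmatization. Comparing the two result sets across a query log is one way to measure recall overlap between options.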
Custom Scoring
   Search Terms can be boosted differently:
     •   q=web^2 development^5 AND jobtitle:(software engineer)^10


   Some Fields can be weighted (scored) higher than others
     •   e.g. Field1^10, Field2^5, Field3^2, Field4^0.01
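
As a sketch, term boosts and field weights like the ones above can be sent to Solr as request parameters. This assumes the edismax query parser and its qf parameter, which the slides do not name explicitly; the field names are the placeholders from the slide:

```python
from urllib.parse import urlencode, parse_qs


def build_params(query, field_boosts):
    """Build a URL-encoded Solr parameter string with per-field weights."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return urlencode({"defType": "edismax", "q": query, "qf": qf})


params = build_params(
    "web^2 development^5",
    {"Field1": 10, "Field2": 5, "Field3": 2, "Field4": 0.01},
)
```

The resulting string would be appended to the select URL of whichever core or handler is being queried.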

   Content within Fields can be boosted differently
     •   design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] / experience [3] /
         careerbuilder [2] / design [2], …
         Field1: bucket=[1] boost=10; Field2: bucket=[2] boost=1.5; Field3: bucket=[ ] boost=1; Field4: bucket=[3]
         boost=1.5


     •   We can pass in a parameter to Solr at query time specifying the boost to apply to each
         bucket, e.g. …&bucketWeights=1:10;2:1.5;3:1.5

   You can also do index-time boosting, but this reduces your ability to do query-side
    relevancy experiments and requires norms to always be on

   By making all scoring parameters overridable at query time, we are able to do A / B
    testing to consistently improve our relevancy model
Stopping Keyword Spamming
 We already subclass PayloadTermQuery and tie in custom scoring
  for our bucket weights

 For each payload “bucket” (or across all buckets), we can count
  the number of hits and penalize the score if a particular keyword
  appears too many times

 Payload scoring then essentially becomes
    •   BucketBoost(payloadBucket) * HitMap(#hitsPerbucket)


 By adjusting our HitMap function, we can thus generate any kind of
  relevancy curve for how much each additional term adds to (or
  subtracts from) the relevancy score for that document
    •   ex: Bell curve, Linear, Bi-linear, Linear with drop-off, custom map, etc.
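
The scoring formula above can be sketched as follows. The "linear with drop-off" curve, its cap of 3 hits, and its penalty of 0.5 per extra hit are made-up values for illustration, not CareerBuilder's actual HitMap:

```python
def linear_with_dropoff(hits, cap=3, penalty=0.5):
    """Hypothetical HitMap: each hit adds 1 up to `cap`; later hits subtract `penalty`."""
    if hits <= cap:
        return float(hits)
    return max(0.0, cap - penalty * (hits - cap))


def payload_score(bucket, hits, bucket_boosts, hit_map=linear_with_dropoff):
    """BucketBoost(payloadBucket) * HitMap(#hitsPerBucket)."""
    return bucket_boosts.get(bucket, 1.0) * hit_map(hits)
```

With a bucket boost of 10, three hits score 30, five hits drop back to 20, and nine or more hits score 0, so repeating a keyword past the cap lowers the score instead of raising it. Swapping in a different hit_map function changes the curve without touching the rest of the scorer.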
CareerBuilder’s Search Cloud
 Goals:
  • Make search easy to use and accessible to all engineers (not
    just the search team)

  • Allow schema changes without mucking with Solr (on hundreds
    of servers)

  • Make Solr installs generic and independent of any particular
    implementation
Creating a virtual search engine
 3 Main Cloud Actions: Index, Search, Delete
Creating a virtual search engine
 Creating a Schema
Creating a virtual search engine
 Creating a Document




 Processing Results
  •   A QueryResult object comes back from the SearchEngine.Search method with all of
      the main types (search records, facets, meta info, etc) parsed out into objects
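
The original code screenshots are not part of this text, so the following is only a hypothetical in-memory stand-in that shows the shape of such an API: the three cloud actions (index, search, delete) and a parsed result object. The SearchEngine and QueryResult names come from the slide; every method signature and field here is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QueryResult:
    records: list  # matching documents
    facets: dict   # facet value -> count
    total: int     # number of hits


class SearchEngine:
    """Toy in-memory engine; a real client would talk to the search cloud."""

    def __init__(self, facet_field):
        self._docs = {}
        self._facet_field = facet_field

    def index(self, doc_id, doc):
        self._docs[doc_id] = doc

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

    def search(self, keyword):
        # Naive whole-word, case-insensitive match on the "text" field.
        hits = [d for d in self._docs.values()
                if keyword.lower() in d["text"].lower().split()]
        facets = Counter(d.get(self._facet_field) for d in hits)
        return QueryResult(records=hits, facets=dict(facets), total=len(hits))
```

The point of returning a typed result object is that callers never parse raw Solr responses themselves, which keeps search usable by engineers outside the search team.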



 Behind the Scenes:
  •   We have a distributed architecture that handles queuing all documents to the
      appropriate datacenters, feeding the clusters, and load-balancing
      searches across all available clusters for the given search pool.
Knowledge Discovery & Data Analytics
Clustering: Nursing
Clustering: .Net
Clustering: Hyperion Developer
Take Aways
 Know how your linguistics affect precision and recall
  and choose wisely; know how to tweak for your domain.

 A flexible software API that turns Solr into a SaaS-type
  cloud app can greatly increase agility and adoption of
  search.

 Search isn’t just about finding and navigating content…
  it can be used to learn from and create it, as well.
Contact
 Trey Grainger
       • trey.grainger@careerbuilder.com
       • http://www.careerbuilder.com
