Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Semantic & Multilingual Strategies in Lucene/Solr 
Trey Grainger 
Director of Engineering, Search & Analytics@CareerBuilder
Outline 
•Introduction 
•Text Analysis Refresher 
•Language-specific text Analysis 
•Multilingual Search Strategies 
•Auto...
About Me 
Trey Grainger 
Director of Engineering, Search & Analytics 
Joined CareerBuilderin 2007 as Software Engineer 
MB...
At CareerBuilder, SolrPowers...
Text Analysis Refresher
Text Analysis Refresher 
A text field in Lucene/Solrhas an Analyzer containing: 
①Zero or more CharFilters 
Takes incoming...
Text Analysis Refresher 
A text field in Lucene/Solrhas an Analyzer containing: 
①Zero or more CharFilters 
Takes incoming...
Text Analysis Refresher 
A text field in Lucene/Solrhas an Analyzer containing: 
①Zero or more CharFilters 
Takes incoming...
Text Analysis Refresher 
A text field in Lucene/Solrhas an Analyzer containing: 
①Zero or more CharFilters 
Takes incoming...
Language-specific Text Analysis
Example English Analysis Chains 
<fieldTypename="text_en" class="solr.TextField" 
positionIncrementGap="100"> 
<analyzer> ...
Per-language Analysis Chains 
*Some of the 32 different languages configurations in Appendix B of Solrin Action
Per-language Analysis Chains 
*Some of the 32 different languages configurations in Appendix B of Solrin Action
Which Stemmer do I choose? 
*From Solrin Action, Chapter 14
Common English Stemmers 
*From Solrin Action, Chapter 14
When Stemming goes awry 
Fixing Stemming Mistakes: 
•Unfortunately, every stemmer will have problem-cases that aren’t hand...
Stemming vs. Lemmatization 
•Stemming: algorithmic manipulation of text, based upon common per-language rules 
•Lemmatizat...
Multilingual Search Strategies
Multilingual Search Strategies 
How do you handle: 
…a different language per document? 
…multiple languages in the same d...
Strategy 1: Separate field per language 
*From Solrin Action, Chapter 14
Separate field per language 
<field name="id" type="string" indexed="true" stored="true" /> <field name="title" type="stri...
Separate field per language: one language per document 
<doc> 
<field name="id">1</field> 
<fieldname="title">The Adventur...
Separate field per language: multiple languages per document 
Query 1: 
http://localhost:8983/solr/field-per-language/sele...
Summary: Separate field per language 
*From Solrin Action, Chapter 14
Strategy 2: Separate collection per language 
*From Solrin Action, Chapter 14
Separate collection per language: schema.xml 
*From Solrin Action, Chapter 14
Separate collection per language: Indexing & Querying 
Indexing: 
cd $SOLR_IN_ACTION/example-docs/ 
java -jar -Durl=http:/...
Summary: Separate index per language 
*From Solrin Action, Chapter 14
Strategy 3: One Field for all languages 
*From Solrin Action, Chapter 14
One Field for all languages: Feature Status 
•Note: This feature is not yet committed to Solr 
•I’m working on it in my fr...
One Field for all languages 
Step 1: Define Multilingual Field 
schema.xml: 
<fieldTypename="multilingual_text" class="sia...
One Field For All Languages: Stacked Token Streams 
1) English Field 
2) Spanish Field 
3) English + Spanish combined in M...
Strategy 3: All languages in one field 
* 
*See Solrin Action, Chapter 14
Automatic Language Identification
Identifying languages in documents 
solrconfig.xml 
... 
<updateRequestProcessorChainname="langid"> 
<processorclass="org....
Identifying languages in documents 
Sending documents: 
cd $SOLR_IN_ACTION/example-docs/ 
java -Durl=http://localhost:8983...
Mapping data to language-specific fields 
solrconfig.xml 
... 
<updateRequestProcessorChainname="langid"> 
<processorclass...
Semantic Strategies
The need for Semantic Search 
User’s Query: machine learning research and development Portland, OR software engineer AND h...
Semantic Search Architecture –Query Parsing 
1)Generate Model of Domain-specific phrases 
•Can mine query logs or actual t...
machine learning 
Keywords: 
Search Behavior, 
Application Behavior, etc. 
Job Title Classifier, Skills Extractor, Job Lev...
Differentiating related terms 
Synonyms: cpa=> certified public accountant 
rn=> registered nurser.n. => registered nurseA...
Semantic Search “under the hood”
2014 Publications & Presentations 
Books: 
Solrin Action-A comprehensive guide to implementing scalable search using Apach...
Conclusion 
•Language analysis options for each language are very configurable 
•There are multiple strategies for handlin...
Contact Info 
Yes, WE ARE HIRING@CareerBuilder. Come talk with me if you are interested… 
Trey Grainger 
trey.grainger@car...
Previous presentations:
Upcoming SlideShare
Loading in …5
×

Semantic & Multilingual Strategies in Lucene/Solr

12,343 views

Published on

When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts of speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies such as: searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). These all have their own strengths and weaknesses depending upon your use case. This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!

Published in: Technology
  • You can now be your own boss and get yourself a very generous daily income. START FREE...♣♣♣ https://tinyurl.com/realmoneystreams2019
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Your opinions matter! get paid BIG $$$ for them! START NOW!!.. ◆◆◆ https://tinyurl.com/realmoneystreams2019
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Does sia.ch14.MultiTextField work with solr 6.2.0?
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • This is the best short summary I've seen of the pros and cons of various strategies for handling multilingual content in Solr. As this post was end of 2014 and I haven't seen anything in the recent Solr documentation, I am guessing Strategy 3: One Field for All Languages was not commited to the mainline Solr.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Semantic & Multilingual Strategies in Lucene/Solr

  1. 1. Semantic & Multilingual Strategies in Lucene/Solr Trey Grainger Director of Engineering, Search & Analytics@CareerBuilder
  2. 2. Outline •Introduction •Text Analysis Refresher •Language-specific text Analysis •Multilingual Search Strategies •Automatic Language Identification •Semantic Search Strategies (understanding “meaning”) •Conclusion
  3. 3. About Me Trey Grainger Director of Engineering, Search & Analytics Joined CareerBuilderin 2007 as Software Engineer MBA, Management of Technology –GA Tech BA, Computer Science, Business, & Philosophy –Furman University Mining Massive Datasets (in progress) -Stanford University Fun outside of CB: •Co-author of Solr in Action, plus several research papers •Frequent conference speaker •Founder of Celiaccess.com, the gluten-free search engine •Lucene/Solrcontributor
  4. 4. At CareerBuilder, SolrPowers...
  5. 5. Text Analysis Refresher
  6. 6. Text Analysis Refresher A text field in Lucene/Solrhas an Analyzer containing: ①Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ②One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solrin Action, Chapter 6
  7. 7. Text Analysis Refresher A text field in Lucene/Solrhas an Analyzer containing: ①Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ②One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solrin Action, Chapter 6
  8. 8. Text Analysis Refresher A text field in Lucene/Solrhas an Analyzer containing: ①Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ②OneTokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solrin Action, Chapter 6
  9. 9. Text Analysis Refresher A text field in Lucene/Solrhas an Analyzer containing: ①Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ②One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solrin Action, Chapter 6
  10. 10. Language-specific Text Analysis
  11. 11. Example English Analysis Chains <fieldTypename="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizerclass="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt” ignoreCase="true" /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType> <fieldTypename="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilterclass="solr.HTMLStripCharFilterFactory"/> <tokenizerclass="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" IignoreCase="true" expand="true"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.ASCIIFoldingFilterFactory"/> <filter class="solr.KStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType>
  12. 12. Per-language Analysis Chains *Some of the 32 different languages configurations in Appendix B of Solrin Action
  13. 13. Per-language Analysis Chains *Some of the 32 different languages configurations in Appendix B of Solrin Action
  14. 14. Which Stemmer do I choose? *From Solrin Action, Chapter 14
  15. 15. Common English Stemmers *From Solrin Action, Chapter 14
  16. 16. When Stemming goes awry Fixing Stemming Mistakes: •Unfortunately, every stemmer will have problem-cases that aren’t handled as you would expect •Thankfully, Stemmers can be overriden •KeywordMarkerFilter: protects a list of terms you specify from being stemmed •StemmerOverrideFilter: applies a list of custom term mappings you specify Alternate strategy: •Use Lemmatization(root-form analysis) instead of Stemming •Commercial vendorshelp tremendously in this space(see http://www.basistech.com/case-study-career-builder/) •The Hunspellstemmer enables dictionary-based support of varying quality in over 100 languages
  17. 17. Stemming vs. Lemmatization •Stemming: algorithmic manipulation of text, based upon common per-language rules •Lemmatization: finds the dictionary form of a term (lemma means “root”) -dramatically improves precision(only matching terms that “should” match), while not significantly impacting recall(all terms that should match do match). *From Solrin Action, Chapter 14
  18. 18. Multilingual Search Strategies
  19. 19. Multilingual Search Strategies How do you handle: …a different language per document? …multiple languages in the same document? …multiple languages in the same field? Strategies: 1)Separate field per language 2)Separate collection/core per language 3)All languages in one field
  20. 20. Strategy 1: Separate field per language *From Solrin Action, Chapter 14
  21. 21. Separate field per language <field name="id" type="string" indexed="true" stored="true" /> <field name="title" type="string" indexed="true" stored="true" /> <field name="content_english" type="text_english" indexed="true” stored="true" /> <field name="content_french" type="text_french" indexed="true” stored="true" /> <field name="content_spanish" type="text_spanish" indexed="true” stored="true" /> <fieldTypename="text_english" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizerclass="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory” ignoreCase="true" words="lang/stopwords_en.txt"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.KStemFilterFactory"/> </analyzer> </fieldType> <fieldTypename="text_spanish" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizerclass="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/ stopwords_es.txt" format="snowball"/> <filter class="solr.SpanishLightStemFilterFactory"/> </analyzer> </fieldType> <fieldTypename="text_french" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizerclass="solr.StandardTokenizerFactory"/> <filter class="solr.ElisionFilterFactory” ignoreCase="true" articles="lang/contractions_fr.txt"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt” format="snowball"/> <filter class="solr.FrenchLightStemFilterFactory"/> </analyzer> </fieldType> schema.xml *From Solrin Action, Chapter 14
  22. 22. Separate field per language: one language per document <doc> <field name="id">1</field> <fieldname="title">The Adventures of Huckleberry Finn</field> <field name="content_english">YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain'tno matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. <field> </doc> <doc> <field name="id ">2</field> <field name="title">Les Misérables</field> <field name="content_french">Nuln'auraitpule dire; tout cequ'onsavait, c'estque, lorsqu'ilrevintd'Italie, ilétaitprêtre. </field> </doc> <doc> <field name="id">3</field> <field name="title">Don Quixote</field> <field name="content_spanish">Demasiadacordurapuedeserla peorde las locuras, verla vidacomoesy no comodeberíade ser. </field> </doc> Query: http://localhost:8983/solr/field-per-language/select? fl=title& defType=edismax& qf=content_englishcontent_frenchcontent_spanish& q="he told the truth" OR"ilétaitprêtre" OR"verla vidacomoes" Response: { "response":{"numFound":3,"start":0,"docs":[ { "title":["The Adventures of Huckleberry Finn"]}, { "title":["Don Quixote"]}, { "title":["Les Misérables"]}] } *From Solrin Action, Chapter 14
  23. 23. Separate field per language: multiple languages per document Query 1: http://localhost:8983/solr/field-per-language/select? fl=title& defType=edismax& qf=content_englishcontent_frenchcontent_spanish& q="wisdom” Query 2: http://localhost:8983/solr/field-per-language/select?... q="sabiduría” Query 3: http://localhost:8983/solr/field-per-language/select?... q="sagesse” Response: (same for queries 1–3) { "response":{"numFound":1,"start":0,"docs":[ { "title":["Proverbs"]}] } Documents: <doc> <field name="id">4</field> <field name="title">Proverbs</field> <field name="content_spanish"> No la abandonesy ellavelarásobre ti, ámalay ellateprotegerá. Lo principal esla sabiduría; adquiere sabiduría, y con todolo queobtengasadquiereinteligencia. </field> <field name="content_english">Do not forsake wisdom, and she will protect you; love her, and she will watch over you. Wisdom is supreme; therefore get wisdom. Though it cost all you have, get understanding. </field> <field name="content_french">N'abandonnepas la sagesse, et ellete gardera, aime-la, et elleteprotégera. Voicile début de la sagesse: acquierslasagesse, procure-toile discernementau prix de tout cequetupossèdes. <field> </doc> *From Solrin Action, Chapter 14
  24. 24. Summary: Separate field per language *From Solrin Action, Chapter 14
  25. 25. Strategy 2: Separate collection per language *From Solrin Action, Chapter 14
  26. 26. Separate collection per language: schema.xml *From Solrin Action, Chapter 14
  27. 27. Separate collection per language: Indexing & Querying Indexing: cd $SOLR_IN_ACTION/example-docs/ java -jar -Durl=http://localhost:8983/solr/english/update post.jar ➥ch14/documents/english.xml java -jar -Durl=http://localhost:8983/solr/spanish/update post.jar ➥ch14/documents/spanish.xml java -jar -Durl=http://localhost:8983/solr/french/update post.jar ➥ch14/documents/french.xml Query (collections in SolrCloud): http://localhost:8983/solr/aggregator/select? shards=english,spanish,french df=content& q=query in any language here Query (specific cores): http://localhost:8983/solr/aggregator/select? shards=localhost:8983/solr/english, localhost:8983/solr/spanish, localhost:8983/solr/french& df=content& q=query in any language here Documents: All documents just have a single “content” field. The documents get routedto a different language-specific Solrcollection based upon the language of the content field. *From Solrin Action, Chapter 14
  28. 28. Summary: Separate index per language *From Solrin Action, Chapter 14
  29. 29. Strategy 3: One Field for all languages *From Solrin Action, Chapter 14
  30. 30. One Field for all languages: Feature Status •Note: This feature is not yet committed to Solr •I’m working on it in my free time. Currently it supports: •Update Request Processorwhich canautomatically detect the languages of documentsand choose the correct analyzers •Field Type which allows dynamically choosing one or more analyzers on a per-field (indexing) and per term (querying) basis. •Current Code from Solr in Actionis available and is freely available on github. •There is a JIRA ticket open to ultimately contribute this back to Solr: Solr-6492 •Some work is still necessary to make querying more user friendly.
  31. 31. One Field for all languages Step 1: Define Multilingual Field schema.xml: <fieldTypename="multilingual_text" class="sia.ch14.MultiTextField" sortMissingLast="true" defaultFieldType="text_general" fieldMappings="en:text_english, es:text_spanish, fr:text_french, de:text_german"/>[1] <field name="text" type="multilingual_text" indexed="true" multiValued="true" /> [1]Note that "text_english", "text_spanish", "text_french", and "text_german" refer to field types defined elsewhere in the schema.xml [2]Uses the "defaultFieldType", in this case "text_general", defined elsewhere in schema.xml <add><doc>… <field name="text">general keywords</field> [2] <field name="text”>en,es|theschool, lasescuelas</field>… </doc></add> <add><doc>… <field name="text">en|theschool</field> <field name="text">es|lasescuelas</field>… </doc></add> Step 2: Index documents http://localhost:8983/solr/collection1/select? q=es|escuelaOR en,es,de|schoolOR school [2] Step 3: Search
  32. 32. One Field For All Languages: Stacked Token Streams 1) English Field 2) Spanish Field 3) English + Spanish combined in Multilingual Text Field multilingual_text ①For each language requested, the appropriate field type is chosen ②The input text is passed separately to the Analyzer chain for each field type ③The resulting Token Streams from each Analyzer chain arestacked into a unified Token Stream based upon their position increments *Screenshot from Solrin Action, Chapter 14
  33. 33. Strategy 3: All languages in one field * *See Solrin Action, Chapter 14
  34. 34. Automatic Language Identification
  35. 35. Identifying languages in documents solrconfig.xml ... <updateRequestProcessorChainname="langid"> <processorclass="org.apache.solr.update.processor. LangDetectLanguageIdentifierUpdateProcessorFactory"> <lstname="invariants"> <strname="langid.fl">content, content_lang1,content_lang2,content_lang3</str> <strname="langid.langField">language</str> <strname="langid.langsField">languages</str> ... </lst> </processor> .. </updateRequestProcessorChain> … <requestHandlername="/update" class="solr.UpdateRequestHandler"> <lstname="invariants"> <strname="update.chain">langid</str> </lst> </requestHandler> ... schema.xml ... <field name="language" type="string" indexed="true" stored="true" /> <field name="languages" type="string" indexed="true" stored="true" multiValued="true"/> ... *See Solrin Action, Chapter 14
  36. 36. Identifying languages in documents Sending documents: cd $SOLR_IN_ACTION/example-docs/ java -Durl=http://localhost:8983/solr/langid/update ➥-jar post.jarch14/documents/langid.xml Query http://localhost:8983/solr/langid/select? q=*:*& fl=title,language,languages Results [{ "title":"TheAdventures of HuckelberryFinn", "language":"en", "languages":["en"]}, { "title":"LesMisérables", "language":"fr", "languages":["fr"]}, { "title":"DonQuoxite", "language":"es", "languages":["es"]}, { "title":"Proverbs", "language":"fr", "languages":["fr”, "en”,"es"]}] *See Solrin Action, Chapter 14
  37. 37. Mapping data to language-specific fields solrconfig.xml ... <updateRequestProcessorChainname="langid"> <processorclass="org.apache.solr.update.processor. LangDetectLanguageIdentifierUpdateProcessorFactory"> <lstname="invariants"> <strname="langid.fl">content</str> <strname="langid.langField">language</str> <strname="langid.map">true</str> <strname="langid.map.fl">content</str> <strname="langid.whitelist">en,es,fr</str> <strname="langid.map.lcmap"> en:englishes:spanishfr:french</str> <strname="langid.fallback">en</str> </lst> </processor> ... </updateRequestProcessorChain> ... Indexed Documents: [{ "title":"TheAdventures of Huckleberry Finn", "language":"en", "content_english":[ "YOU don't know about me without..."]}, { "title":"LesMisérables", "language":"fr", "content_french":[ "Nuln'auraitpule dire; tout ce..."]}, { "title":"DonQuixote", "language":"es", "content_spanish":[ "Demasiadacordurapuedeserla peor..."]}] }] *See Solrin Action, Chapter 14
  38. 38. Semantic Strategies
  39. 39. The need for Semantic Search User’s Query: machine learning research and development Portland, OR software engineer AND hadoopjava Traditional Query Parsing: (machine ANDlearningANDresearch ANDdevelopmentANDportland) OR(software ANDengineer ANDhadoopANDjava) Semantic Query Parsing: "machine learning" AND"research and development" AND"Portland, OR” AND"software engineer" ANDhadoopANDjava Semantically Expanded Query: ("machine learning"^10OR"data scientist" OR"data mining" OR"computer vision") AND("research and development"^10OR"r&d") ANDAND("Portland, OR"^10OR"Portland, Oregon" OR{!geofiltpt=45.512,-122.676 d=50sfield=geo}) AND("software engineer"^10OR"software developer") AND(hadoop^10OR"big data" ORhbaseORhive) AND(java^10 ORj2ee)
  40. 40. Semantic Search Architecture –Query Parsing 1)Generate Model of Domain-specific phrases •Can mine query logs or actual text of documents for significant phrases within your domain [1] 2)Feed known phrases to SolrTextTagger(uses LuceneFST for high-throughput term lookups) 3)Use SolrTextTaggerto perform entity extraction on incoming queries(tagging documents is also optional) 4)Shown on next slide: Pass extracted entities to a Query Augmentation phase to rewrite query with enhanced semantic understanding(synonyms, related keywords, related categories, etc.) [1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "CrowdsourcedQuery Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. [2]https://github.com/OpenSextant/SolrTextTagger
  41. 41. machine learning Keywords: Search Behavior, Application Behavior, etc. Job Title Classifier, Skills Extractor, Job Level Classifier, etc. Clustering relationships Semantic Query Augmentation keywords:((machine learning)^10OR { AT_LEAST_2: ("data mining"^0.9,matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55))} { BOOST_TO_TOP:(job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoopengineer"))} Modified Query: Related Occupations machine learning: {15-1031.00 .58Computer Software Engineers, Applications 15-1011.00 .55 Computer and Information Scientists, Research 15-1032.00 .52 Computer Software Engineers, Systems Software } machine learning: { software engineer .65, data manager .3, data scientist .25, hadoopengineer .2, } Common Job Titles Semantic Search Architecture –Query Augmentation Related Phrases machine learning: { data mining .9, matlab.8, data scientist .75, artificial intelligence .7, neural networks .55 } Known keyword phrases java developer machine learningregistered nurse
  42. 42. Differentiating related terms Synonyms: cpa=> certified public accountant rn=> registered nurser.n. => registered nurseAmbiguous Terms*: driver=> driver (trucking)~80% driver => driver (software)~20% Related Terms: r.n. => nursing, bsnhadoop=> mapreduce, hive, pig *differentiated based upon user and query context
  43. 43. Semantic Search “under the hood”
  44. 44. 2014 Publications & Presentations Books: Solrin Action-A comprehensive guide to implementing scalable search using Apache Solr Research papers: ●Towards a Job title Classification System ●Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior ●sCooL: A system for academic institution name normalization ●CrowdsourcedQuery Augmentation through Semantic Discovery of Domain-specific jargon ●PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems ●SKILL: A System for Skill Identification and Normalization Speaking Engagements: ●WSDM 2014 Workshop: “Web-Scale Classification: Classifying Big Data from the Web” ●Atlanta SolrMeetup ●Atlanta Big Data Meetup ●The Second International Symposium on Big Data and Data Analytics ●Lucene/SolrRevolution 2014 ●RecSys2014 ●IEEE Big Data Conference 2014
  45. 45. Conclusion •Language analysis options for each language are very configurable •There are multiple strategies for handling multilingual content based upon your use case •When in doubt, automatic language detection can be easily leveraged in your indexing pipeline •The next generation of query/relevancy improvements will be able to understand the intent of the user.
  46. 46. Contact Info Yes, WE ARE HIRING@CareerBuilder. Come talk with me if you are interested… Trey Grainger trey.grainger@careerbuilder.com@treygrainger http://solrinaction.com Conference discount (43% off):lusorevcftw Other presentations: http://www.treygrainger.com
  47. 47. Previous presentations:

×