3. About Me
Trey Grainger
Director of Engineering, Search & Analytics
Joined CareerBuilder in 2007 as Software Engineer
MBA, Management of Technology – GA Tech
BA, Computer Science, Business, & Philosophy – Furman University
Mining Massive Datasets (in progress) – Stanford University
Fun outside of CB:
• Co-author of Solr in Action, plus several research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
6. Text Analysis Refresher
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Take incoming text and “clean it up” before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream containing zero or more Tokens
③ Zero or more TokenFilters
Examine and optionally modify each Token in the Token Stream
*From Solr in Action, Chapter 6
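The three analyzer stages above can be sketched in plain Python. This is only an illustration of the order of operations, not the Lucene API: `strip_html`, `whitespace_tokenizer`, `lowercase`, `remove_stopwords`, and `analyze` are hypothetical names, and the stopword list is a toy.

```python
import re
from typing import Callable, List

# Hypothetical stand-ins for Lucene's CharFilter, Tokenizer, and TokenFilter.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """CharFilter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the cleaned text on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """TokenFilter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    """TokenFilter: drop tokens from a tiny, illustrative stopword list."""
    stopwords = {"a", "an", "the", "of"}
    return [t for t in tokens if t not in stopwords]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run CharFilters, then exactly one Tokenizer, then TokenFilters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The Adventures</b> of Huckleberry Finn",
              [strip_html], whitespace_tokenizer,
              [lowercase, remove_stopwords]))
```

Because the filters run in order, swapping `lowercase` and `remove_stopwords` would let "The" slip past the lowercase-only stopword list.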
16. When Stemming goes awry
Fixing Stemming Mistakes:
• Unfortunately, every stemmer has problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
• KeywordMarkerFilter: protects a list of terms you specify from being stemmed
• StemmerOverrideFilter: applies a list of custom term mappings you specify
Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space (see http://www.basistech.com/case-study-career-builder/)
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
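The two override mechanisms can be sketched around a toy stemmer. This is a minimal illustration in plain Python, not Lucene code: `naive_stem` is a deliberately crude suffix stripper (not Porter or any Solr stemmer), and `stem_tokens`, the protected set, and the override map are hypothetical names. Checking overrides before protected terms is an arbitrary choice made for the sketch.

```python
from typing import Dict, List, Set


def naive_stem(token: str) -> str:
    """Toy suffix-stripping stemmer; it makes mistakes on purpose."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stem_tokens(tokens: List[str],
                protected: Set[str],
                overrides: Dict[str, str]) -> List[str]:
    """Stem tokens, honoring custom mappings and protected terms first."""
    stemmed = []
    for token in tokens:
        if token in overrides:        # role played by a stemmer override list
            stemmed.append(overrides[token])
        elif token in protected:      # role played by a protected-terms list
            stemmed.append(token)
        else:
            stemmed.append(naive_stem(token))
    return stemmed


tokens = ["running", "news", "mice", "jobs"]
print(stem_tokens(tokens, protected=set(), overrides={}))
print(stem_tokens(tokens, protected={"news"}, overrides={"mice": "mouse"}))
```

Without the lists, the toy stemmer turns "news" into "new" and leaves "mice" untouched; with them, "news" survives and "mice" maps to "mouse".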
17. Stemming vs. Lemmatization
• Stemming: algorithmic manipulation of text, based upon common per-language rules
• Lemmatization: finds the dictionary form of a term (lemma means “root”)
- Dramatically improves precision (only matching terms that “should” match), while not significantly impacting recall (all terms that should match do match).
*From Solr in Action, Chapter 14
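The precision difference can be illustrated with a toy comparison. Everything here is an assumption made for the sketch: the suffix rules in `naive_stem` are invented to force a collision, and `LEMMAS` is a four-entry stand-in for a real lemmatization dictionary.

```python
from typing import Dict


def naive_stem(token: str) -> str:
    """Toy rule-based stemmer: strip the first matching suffix."""
    for suffix in ("ity", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical lemma dictionary: surface form -> dictionary form.
LEMMAS: Dict[str, str] = {
    "universities": "university",
    "university": "university",
    "universes": "universe",
    "universe": "universe",
}


def lemmatize(token: str) -> str:
    """Dictionary lookup; unknown terms are left unchanged."""
    return LEMMAS.get(token, token)


# The stemmer conflates two unrelated terms (a precision loss) ...
print(naive_stem("university"), naive_stem("universe"))
# ... while the dictionary keeps them apart and still groups the plurals.
print(lemmatize("universities"), lemmatize("universes"))
```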
19. Multilingual Search Strategies
How do you handle:
…a different language per document?
…multiple languages in the same document?
…multiple languages in the same field?
Strategies:
1) Separate field per language
2) Separate collection/core per language
3) All languages in one field
22. Separate field per language: one language per document
<doc>
<field name="id">1</field>
<field name="title">The Adventures of Huckleberry Finn</field>
<field name="content_english">YOU don't know about me without you have read
a book by the name of The Adventures of Tom Sawyer; but that ain't no
matter. That book was made by Mr. Mark Twain, and he told the truth,
mainly. There was things which he stretched, but mainly he told the truth.
</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">Les Misérables</field>
<field name="content_french">Nul n'aurait pu le dire; tout ce qu'on savait,
c'est que, lorsqu'il revint d'Italie, il était prêtre.
</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title">Don Quixote</field>
<field name="content_spanish">Demasiada cordura puede ser la peor de las
locuras, ver la vida como es y no como debería de ser.
</field>
</doc>
Query:
http://localhost:8983/solr/field-per-language/select?
fl=title&
defType=edismax&
qf=content_english content_french content_spanish&
q="he told the truth" OR "il était prêtre" OR "ver la vida como es"
Response:
{
"response":{"numFound":3,"start":0,"docs":[
{
"title":["The Adventures of Huckleberry Finn"]},
{
"title":["Don Quixote"]},
{
"title":["Les Misérables"]}]
}}
*From Solr in Action, Chapter 14
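The query on this slide is shown unencoded for readability. A minimal sketch of building the same request with Python's standard library follows; `build_select_url` is a hypothetical helper, and the host, core name, and parameters are copied from the slide rather than checked against a running Solr.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base: str, params: dict) -> str:
    """Percent-encode the parameters and append them to the select handler."""
    return base + "?" + urlencode(params)


params = {
    "fl": "title",
    "defType": "edismax",
    # One language-specific field per language, space-separated.
    "qf": "content_english content_french content_spanish",
    "q": '"he told the truth" OR "il était prêtre" OR "ver la vida como es"',
}
url = build_select_url(
    "http://localhost:8983/solr/field-per-language/select", params)
print(url)

# Round trip: decoding the query string recovers the original parameters.
decoded = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
print(decoded == params)
```

Encoding matters here because the phrase queries contain quotes, spaces, and accented characters that are not valid in a raw URL.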
23. Separate field per language: multiple languages per document
Query 1:
http://localhost:8983/solr/field-per-language/select?
fl=title&
defType=edismax&
qf=content_english content_french content_spanish&
q="wisdom"
Query 2:
http://localhost:8983/solr/field-per-language/select?...
q="sabiduría"
Query 3:
http://localhost:8983/solr/field-per-language/select?...
q="sagesse"
Response: (same for queries 1–3)
{
"response":{"numFound":1,"start":0,"docs":[
{
"title":["Proverbs"]}]
}}
Documents:
<doc>
<field name="id">4</field>
<field name="title">Proverbs</field>
<field name="content_spanish">No la abandones y ella velará sobre
ti, ámala y ella te protegerá. Lo principal es la sabiduría; adquiere
sabiduría, y con todo lo que obtengas adquiere inteligencia.
</field>
<field name="content_english">Do not forsake wisdom, and she will protect you; love her, and she will watch over you. Wisdom is supreme;
therefore get wisdom. Though it cost all you have, get understanding.
</field>
<field name="content_french">N'abandonne pas la sagesse, et elle te
gardera, aime-la, et elle te protégera. Voici le début de la sagesse:
acquiers la sagesse, procure-toi le discernement au prix de tout ce que tu possèdes.
</field>
</doc>
*From Solr in Action, Chapter 14
27. Separate collection per language: Indexing & Querying
Indexing:
cd $SOLR_IN_ACTION/example-docs/
java -Durl=http://localhost:8983/solr/english/update -jar post.jar ch14/documents/english.xml
java -Durl=http://localhost:8983/solr/spanish/update -jar post.jar ch14/documents/spanish.xml
java -Durl=http://localhost:8983/solr/french/update -jar post.jar ch14/documents/french.xml
Query (collections in SolrCloud):
http://localhost:8983/solr/aggregator/select?
shards=english,spanish,french&
df=content&
q=query in any language here
Query (specific cores):
http://localhost:8983/solr/aggregator/select?
shards=localhost:8983/solr/english,
localhost:8983/solr/spanish,
localhost:8983/solr/french&
df=content&
q=query in any language here
Documents:
All documents have just a single “content” field. The documents get routed to a different language-specific Solr collection based upon the language of the content field.
*From Solr in Action, Chapter 14
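The routing step can be sketched as a grouping function. This sketch assumes each document already carries a detected `language` code; how that code is produced is outside the sketch. `COLLECTION_FOR_LANGUAGE`, `route_documents`, and the fallback to the English collection are hypothetical choices, and the update URLs follow the pattern used in the indexing commands above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

SOLR_BASE = "http://localhost:8983/solr"

# Hypothetical mapping from language code to collection name.
COLLECTION_FOR_LANGUAGE = {"en": "english", "es": "spanish", "fr": "french"}


def route_documents(docs: Iterable[dict],
                    default_collection: str = "english") -> Dict[str, List[dict]]:
    """Group documents by the update URL of their language's collection."""
    batches: Dict[str, List[dict]] = defaultdict(list)
    for doc in docs:
        collection = COLLECTION_FOR_LANGUAGE.get(
            doc.get("language"), default_collection)
        # Every collection uses the same schema: one "content" field.
        batches[f"{SOLR_BASE}/{collection}/update"].append(
            {"id": doc["id"], "content": doc["content"]})
    return dict(batches)


docs = [
    {"id": "1", "language": "en", "content": "he told the truth"},
    {"id": "2", "language": "fr", "content": "il était prêtre"},
    {"id": "3", "language": "es", "content": "ver la vida como es"},
]
for update_url, batch in route_documents(docs).items():
    print(update_url, [d["id"] for d in batch])
```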
29. Strategy 3: One Field for all languages
*From Solr in Action, Chapter 14
30. One Field for all languages: Feature Status
• Note: This feature is not yet committed to Solr
• I’m working on it in my free time. Currently it supports:
• An Update Request Processor which can automatically detect the languages of documents and choose the correct analyzers
• A Field Type which allows dynamically choosing one or more analyzers on a per-field (indexing) and per-term (querying) basis.
• The current code from Solr in Action is freely available on GitHub.
• There is a JIRA ticket open to ultimately contribute this back to Solr: SOLR-6492
• Some work is still necessary to make querying more user friendly.
31. One Field for all languages
Step 1: Define Multilingual Field
schema.xml:
<fieldType name="multilingual_text" class="sia.ch14.MultiTextField"
sortMissingLast="true" defaultFieldType="text_general"
fieldMappings="en:text_english,
es:text_spanish,
fr:text_french,
de:text_german"/> [1]
<field name="text" type="multilingual_text" indexed="true" multiValued="true" />
Step 2: Index documents
<add><doc>…
<field name="text">general keywords</field> [2]
<field name="text">en,es|the school, las escuelas</field>…
</doc></add>
<add><doc>…
<field name="text">en|the school</field>
<field name="text">es|las escuelas</field>…
</doc></add>
Step 3: Search
http://localhost:8983/solr/collection1/select?
q=es|escuela OR en,es,de|school OR school [2]
[1] Note that "text_english", "text_spanish", "text_french", and "text_german" refer to field types defined elsewhere in the schema.xml
[2] Uses the "defaultFieldType", in this case "text_general", defined elsewhere in schema.xml
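The language-prefix convention used in the field values and query terms on this slide (`en,es|...`) can be read as "a comma-separated list of language codes, a pipe, then the text". A minimal sketch of that parsing step follows; `parse_multilingual_value` is a hypothetical helper, and the actual MultiTextField implementation may parse its input differently.

```python
from typing import List, Tuple


def parse_multilingual_value(value: str) -> Tuple[List[str], str]:
    """Split 'en,es|some text' into (['en', 'es'], 'some text').

    A value without a '|' prefix returns an empty language list, which the
    caller would treat as "use the default field type".
    """
    prefix, separator, rest = value.partition("|")
    if not separator:
        return [], value
    languages = [code.strip() for code in prefix.split(",") if code.strip()]
    return languages, rest


for value in ("general keywords",
              "en,es|the school, las escuelas",
              "es|las escuelas"):
    print(parse_multilingual_value(value))
```

One limitation of this simple approach: text that contains a literal `|` but no language prefix would be misread, so a real implementation would need an escaping rule or stricter prefix validation.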
32. One Field For All Languages: Stacked Token Streams
1) English Field
2) Spanish Field
3) English + Spanish combined in Multilingual Text Field
multilingual_text
① For each language requested, the appropriate field type is chosen
② The input text is passed separately to the Analyzer chain for each field type
③ The resulting Token Streams from each Analyzer chain are stacked into a unified Token Stream based upon their position increments
*Screenshot from Solr in Action, Chapter 14
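The stacking in step ③ can be sketched as a merge by absolute position. This is an illustration only, not the MultiTextField code: tokens are modeled as hypothetical `(term, position_increment)` tuples, and the two input streams are made-up analyzer outputs for the text "the schools".

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position increment relative to previous token)


def stack_token_streams(streams: List[List[Token]]) -> List[Token]:
    """Merge several token streams into one, stacking tokens that share a position."""
    positioned = []
    for stream_index, stream in enumerate(streams):
        position = 0
        for order, (term, increment) in enumerate(stream):
            position += increment  # convert increments to absolute positions
            positioned.append((position, stream_index, order, term))
    positioned.sort()  # by position, then by originating stream

    stacked: List[Token] = []
    previous_position = 0
    for position, _, _, term in positioned:
        # Tokens sharing a position get an increment of 0 after the first.
        stacked.append((term, position - previous_position))
        previous_position = position
    return stacked


# Made-up outputs: the English chain drops the stopword "the" (leaving a
# position gap) and stems "schools"; the Spanish chain keeps both tokens.
english = [("school", 2)]
spanish = [("the", 1), ("schools", 1)]
print(stack_token_streams([english, spanish]))
```

An increment of 0 means "same position as the previous token", which is how a phrase or proximity query can match either language's form of the term.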
33. Strategy 3: All languages in one field
*See Solr in Action, Chapter 14
39. The need for Semantic Search
User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java
Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "computer vision") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
40. Semantic Search Architecture – Query Parsing
1) Generate Model of Domain-specific phrases
• Can mine query logs or actual text of documents for significant phrases within your domain [1]
2) Feed known phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups)
3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also optional)
4) Shown on next slide: Pass extracted entities to a Query Augmentation phase to rewrite query with enhanced semantic understanding (synonyms, related keywords, related categories, etc.)
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
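The entity-extraction step can be approximated with a greedy longest-match over a set of known phrases. This sketch swaps in a plain Python set lookup where SolrTextTagger uses a Lucene FST, and it is not the SolrTextTagger API; `tag_known_phrases`, `to_query`, and the phrase list are hypothetical. Punctuation, locations such as "Portland, OR", and explicit operators in the user's query are left out to keep it short.

```python
from typing import Iterable, List


def tag_known_phrases(query: str,
                      known_phrases: Iterable[str],
                      max_phrase_words: int = 4) -> List[str]:
    """Greedy longest-match tagging of known phrases, left to right."""
    words = query.lower().split()
    known = {phrase.lower() for phrase in known_phrases}
    entities: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for size in range(min(max_phrase_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if size == 1 or candidate in known:
                entities.append(candidate)
                i += size
                break
    return entities


def to_query(entities: List[str]) -> str:
    """Quote multi-word entities and require all of them."""
    return " AND ".join(f'"{e}"' if " " in e else e for e in entities)


known = ["machine learning", "research and development", "software engineer"]
entities = tag_known_phrases(
    "machine learning research and development software engineer hadoop java",
    known)
print(to_query(entities))
```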
41. Semantic Search Architecture – Query Augmentation
Keywords: machine learning
Search Behavior, Application Behavior, etc.
Job Title Classifier, Skills Extractor, Job Level Classifier, etc.
Clustering relationships
Semantic Query Augmentation
Known keyword phrases:
java developer
machine learning
registered nurse
Related Phrases:
machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }
Common Job Titles:
machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }
Related Occupations:
machine learning: {
15-1031.00 .58 Computer Software Engineers, Applications
15-1011.00 .55 Computer and Information Scientists, Research
15-1032.00 .52 Computer Software Engineers, Systems Software }
Modified Query:
keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) })
{ BOOST_TO_TOP:(job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }
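The related-phrase weights on this slide can be turned into a boosted query clause. The sketch below emits a plain OR of boosted terms in standard Lucene query syntax; it does not reproduce the AT_LEAST_2 or BOOST_TO_TOP constructs shown in the modified query, and `expand_term` is a hypothetical helper.

```python
from typing import Dict


def quote(phrase: str) -> str:
    """Quote multi-word phrases so they are matched as phrases."""
    return f'"{phrase}"' if " " in phrase else phrase


def expand_term(term: str,
                related: Dict[str, float],
                original_boost: int = 10) -> str:
    """Build (term^boost OR related^weight ...) ordered by descending weight."""
    clauses = [f"{quote(term)}^{original_boost}"]
    for phrase, weight in sorted(related.items(), key=lambda item: -item[1]):
        clauses.append(f"{quote(phrase)}^{weight}")
    return "(" + " OR ".join(clauses) + ")"


related_phrases = {
    "data mining": 0.9,
    "matlab": 0.8,
    "data scientist": 0.75,
    "artificial intelligence": 0.7,
    "neural networks": 0.55,
}
print(expand_term("machine learning", related_phrases))
```

Boosting the original term far above the related phrases keeps exact matches at the top while still letting related documents into the result set.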
42. Differentiating related terms
Synonyms:
cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*:
driver => driver (trucking) ~80%
driver => driver (software) ~20%
Related Terms:
r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
44. 2014 Publications & Presentations
Books:
Solr in Action: A comprehensive guide to implementing scalable search using Apache Solr
Research papers:
● Towards a Job title Classification System
● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior
● sCooL: A system for academic institution name normalization
● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific jargon
● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
● SKILL: A System for Skill Identification and Normalization
Speaking Engagements:
● WSDM 2014 Workshop: “Web-Scale Classification: Classifying Big Data from the Web”
● Atlanta Solr Meetup
● Atlanta Big Data Meetup
● The Second International Symposium on Big Data and Data Analytics
● Lucene/Solr Revolution 2014
● RecSys 2014
● IEEE Big Data Conference 2014
45. Conclusion
• Language analysis options for each language are very configurable
• There are multiple strategies for handling multilingual content based upon your use case
• When in doubt, automatic language detection can be easily leveraged in your indexing pipeline
• The next generation of query/relevancy improvements will be able to understand the intent of the user.
46. Contact Info
Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…
Trey Grainger
trey.grainger@careerbuilder.com
@treygrainger
http://solrinaction.com
Conference discount (43% off): lusorevcftw
Other presentations: http://www.treygrainger.com