The document discusses learning-to-rank (LTR) models for job search ranking on an hourly job marketplace platform. It describes:
1) The complexity of matching job seekers to job postings given the many factors involved and limited historical data.
2) An iterative process of developing learning to rank models, testing improvements through A/B testing, and analyzing results to further tune the models over time.
3) Key factors considered in the models include job title/description matches, employer name, location matches, distance between seeker and job, and search/user attributes. Performance is evaluated on multiple metrics like application and conversion rates.
Interleaving, Evaluation to Self-learning Search @904Labs - John T. Kane
Presented at the Open Source Connections Haystack relevance conference: 904Labs' "Interleaving: from Evaluation to Self-Learning". 904Labs is the first to commercialize online learning to rank, a state-of-the-art approach to self-learning search ranking that automatically takes customer behavior into account to personalize search results.
Haystack 2018 - Algorithmic Extraction of Keywords, Concepts and Vocabularies - Max Irwin
Presentation given at the Haystack Conference, outlining research and techniques for the automatic extraction of keywords, concepts, and vocabularies from text corpora.
Crowdsourced query augmentation through the semantic discovery of domain spec... - Trey Grainger
Talk Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. We believe that the links between similar users’ queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
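To make the mining step concrete, here is a minimal, hypothetical Python sketch (not the system from the talk): it treats queries issued by the same user as candidate semantic neighbors and keeps only pairs shared by multiple users.

```python
from collections import defaultdict
from itertools import combinations

# Toy search log of (user_id, query) pairs. In the approach described
# above, queries issued by the same user are candidate semantic neighbors.
log = [
    ("u1", "registered nurse"), ("u1", "rn"), ("u1", "icu nurse"),
    ("u2", "rn"), ("u2", "registered nurse"),
    ("u3", "java developer"), ("u3", "software engineer"),
]

queries_by_user = defaultdict(set)
for user, query in log:
    queries_by_user[user].add(query)

# Count how many distinct users issued both queries in a pair.
pair_counts = defaultdict(int)
for queries in queries_by_user.values():
    for a, b in combinations(sorted(queries), 2):
        pair_counts[(a, b)] += 1

# Pairs shared by two or more users are kept as likely related phrases,
# which filters out most single-user noise.
print([(pair, n) for pair, n in pair_counts.items() if n >= 2])
# -> [(('registered nurse', 'rn'), 2)]
```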
This talk will feature some of my recent research into alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. A common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned. This has important implications for user-centric and semantically focused relevance.
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb... - Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.
As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs
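As a toy illustration of the query interpretation described above, here is a hedged Python sketch using a hand-built entity dictionary and greedy longest-match lookup. The real system uses probabilistic parsing over index data; every entry here is invented.

```python
# Hypothetical entity dictionary mapping surface forms to (type, canonical).
KNOWN_ENTITIES = {
    "senior": ("experience_level", "Senior"),
    "java developer": ("job_title", "Java Developer"),
    "portland, or": ("city", "Portland, OR"),
    "hadoop": ("skill", "Apache Hadoop"),
}

def tag_query(query, max_phrase_len=3):
    """Greedy longest-match tagging of known phrases in a query string."""
    tokens = query.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        for n in range(max_phrase_len, 0, -1):  # prefer longer phrases
            phrase = " ".join(tokens[i:i + n])
            if phrase in KNOWN_ENTITIES:
                tags.append(KNOWN_ENTITIES[phrase])
                i += n
                break
        else:  # no known phrase starts here; fall back to a plain keyword
            tags.append(("keyword", tokens[i]))
            i += 1
    return tags

print(tag_query("Senior Java Developer Portland, OR Hadoop"))
# -> [('experience_level', 'Senior'), ('job_title', 'Java Developer'),
#     ('city', 'Portland, OR'), ('skill', 'Apache Hadoop')]
```

Note how "portland, or" is recognized as a city rather than being split into a keyword and a boolean operator, which is the failure mode the talk highlights.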
Reflected Intelligence: Lucene/Solr as a self-learning data system - Trey Grainger
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you’ll learn how to do just that - how to evolve Lucene/Solr implementations into self-learning data systems that accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
Reflected intelligence: evolving self-learning data systems - Trey Grainger
In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.
Enhancing relevancy through personalization & semantic search - Trey Grainger
Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you'll learn how to implement advanced relevancy techniques which enable your search platform to "learn" from your content and users' behavior. Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results. You'll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine - Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs
Self-learned Relevancy with Apache Solr - Trey Grainger
Search engines are known for "relevancy", but the relevancy models that ship out of the box (BM25, classic tf-idf, etc.) are just scratching the surface of what's needed for a truly insightful application.
What if your search engine could automatically tune its own domain-specific relevancy model based on user interactions? What if it could learn the important phrases and topics within your domain, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain? What if you could further use SQL queries to explore these relationships within your own BI tools and return results in ranked order to deliver relevance-driven analytics visualizations?
In this presentation, we'll walk through how you can leverage the myriad of capabilities in the Apache Solr ecosystem (such as the Solr Text Tagger, Semantic Knowledge Graph, Spark-Solr, Solr SQL, learning to rank, probabilistic query parsing, and Lucidworks Fusion) to build self-learning, relevance-first search, recommendations, and data analytics applications.
The Apache Solr Semantic Knowledge Graph - Trey Grainger
What if instead of a query returning documents, you could alternatively return other keywords most related to the query: i.e. given a search for "data science", return me back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields) allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and will walk you through how to get up and running with an example dataset to explore the meaningful relationships hidden within your data.
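The core trick is comparing term statistics in a foreground document set (documents matching the query) against the background of the whole index. Here is a minimal sketch of that idea in Python, using a simple frequency ratio rather than the plugin's actual relatedness scoring:

```python
from collections import Counter

# Toy corpus standing in for the index; each document is a set of terms.
docs = [
    {"data", "science", "machine", "learning"},
    {"machine", "learning", "neural", "networks"},
    {"data", "science", "predictive", "modeling"},
    {"cooking", "recipes", "data"},
]

def related_terms(query_term, docs):
    """Score terms by how much more often they occur in documents
    matching the query (foreground) than in the corpus (background)."""
    foreground = [d for d in docs if query_term in d]
    fg = Counter(t for d in foreground for t in d)
    bg = Counter(t for d in docs for t in d)
    scores = {
        t: (fg[t] / len(foreground)) / (bg[t] / len(docs))
        for t in fg if t != query_term
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(related_terms("science", docs))
# "predictive" and "modeling" outrank corpus-wide terms like "data".
```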
Dice.com Bay Area Search - Beyond Learning to Rank Talk - Simon Hughes
This talk describes how to implement conceptual search (semantic search) within a modern search engine using the word2vec algorithm to learn concepts. We also cover how to auto-tune the search engine parameters using black box optimization techniques, and the problems of feedback loops encountered when building machine learning systems that modify the user behavior used to train the system.
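As a rough sketch of the word2vec step (toy sentences and gensim defaults; this is an illustration, not Dice.com's implementation, and production systems train on large query or document corpora):

```python
from gensim.models import Word2Vec

# Toy corpus; each sentence is a tokenized snippet of domain text.
sentences = [
    ["hadoop", "hive", "mapreduce", "big", "data"],
    ["hadoop", "hbase", "spark", "cluster"],
    ["java", "developer", "spring", "hibernate"],
    ["java", "engineer", "spring", "microservices"],
] * 50  # repeat the tiny corpus so the toy model has enough examples

model = Word2Vec(sentences, vector_size=32, window=3,
                 min_count=1, epochs=20, seed=42)

# Expand a query term into related concepts for conceptual search:
# match on the vector neighborhood rather than the literal keyword.
print(model.wv.most_similar("hadoop", topn=3))
```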
"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.
The Relevance of the Apache Solr Semantic Knowledge Graph - Trey Grainger
The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords based upon conceptual cohesion to reduce noise, summarize documents by extracting their most significant terms, generate recommendations and personalized search, and power numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. This talk will walk you through how to set up and use this plugin in concert with other open source tools (a probabilistic query parser, SolrTextTagger for entity extraction) to parse, interpret, and model the true intent of user searches far more accurately than traditional keyword-based approaches.
The task of keyword extraction is to automatically identify a set of terms that best describe a document. Automatic keyword extraction establishes a foundation for various natural language processing applications: information retrieval, automatic indexing and classification of documents, automatic summarization and high-level semantic description, etc. Although keyword extraction applications usually work on single documents (a document-oriented task), keyword extraction is also applicable to more demanding tasks, such as extraction from a whole collection of documents, from an entire website, or from tweets. In the era of big data, an effective and efficient method for automatic keyword extraction from huge amounts of multi-topic textual sources is of high importance.
We propose a novel Selectivity-Based Keyword Extraction (SBKE) method, which extracts keywords from source text represented as a network. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node, and is used to rank and extract keyword candidates. Selectivity slightly outperforms extraction based on standard centrality measures, so selectivity and its modification, generalized selectivity, are included in the SBKE method as node centrality measures. Selectivity-based extraction requires no linguistic knowledge, since it is derived purely from statistical and structural information of the network, and it can easily be ported to new languages and used in a multilingual scenario. The true potential of SBKE lies in its generality, portability, and low computational cost, which makes it a strong candidate for preparing collections that lack human annotations for keyword extraction. Portability was tested on Croatian, Serbian, and English texts: the method was developed on Croatian news and then applied to parallel abstracts of scientific publications in Serbian and English.
The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the method's performance across languages, since the experimental environment and data are controlled. The keyword extraction results, measured by F1 score, are 49.57% for English and 46.73% for Serbian if we disregard keywords that are not present in the abstracts; evaluated against the whole keyword set, the F1 scores are 40.08% and 45.71%, respectively. This work shows that SBKE can easily be ported to a new language, domain, and text type. One drawback remains: the method can extract only words that appear in the text.
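To make the selectivity measure concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) that builds a word co-occurrence network from adjacent tokens and ranks nodes by selectivity, i.e. a node's total link weight divided by its degree:

```python
from collections import defaultdict

def selectivity_ranking(sentences):
    """Rank words by node selectivity (average weight on a node's links)
    in a co-occurrence network built from adjacent tokens."""
    edge_weight = defaultdict(float)
    for tokens in sentences:
        for a, b in zip(tokens, tokens[1:]):  # adjacent co-occurrence
            if a != b:
                edge_weight[tuple(sorted((a, b)))] += 1.0

    strength = defaultdict(float)  # sum of link weights per node
    degree = defaultdict(int)      # number of distinct links per node
    for (a, b), w in edge_weight.items():
        strength[a] += w; strength[b] += w
        degree[a] += 1; degree[b] += 1

    selectivity = {n: strength[n] / degree[n] for n in strength}
    return sorted(selectivity.items(), key=lambda kv: kv[1], reverse=True)

docs = [["keyword", "extraction", "from", "text", "networks"],
        ["text", "networks", "support", "keyword", "extraction"]]
for word, score in selectivity_ranking(docs):
    print(f"{word}\t{score:.2f}")
```

As the abstract notes, nothing here requires linguistic knowledge: the ranking is derived purely from the network's statistics, which is what makes the method easy to port across languages.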
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.
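The SQL capability mentioned above can be exercised over HTTP. Here is a minimal sketch using Python's requests library; the collection and field names ("jobs", "title", "apply_count") are invented for illustration:

```python
import requests

# Solr (6+) exposes a SQL handler at /solr/<collection>/sql.
SOLR_SQL_URL = "http://localhost:8983/solr/jobs/sql"

stmt = """
SELECT title, apply_count
FROM jobs
WHERE title = 'developer'
ORDER BY apply_count DESC
LIMIT 10
"""

resp = requests.post(SOLR_SQL_URL, data={"stmt": stmt})
resp.raise_for_status()
# The result set ends with an EOF marker document, which we skip.
for doc in resp.json()["result-set"]["docs"]:
    if "EOF" not in doc:
        print(doc)
```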
Building Search & Recommendation EnginesTrey Grainger
In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t... - OpenSource Connections
As the largest online marketplace for hourly jobs in the US, Snag strives to connect millions of job seekers with part-time, full-time, hourly, and on-demand employment opportunities. Snag started building its learning-to-rank (LTR) search system on the Elasticsearch learning-to-rank plugin in 2017 and had switched all of its user queries to LTR by mid-2018, generating significant lift in overall search quality. While fine-tuning and maintaining the LTR system over the past 12 months, our team has come to realize that the continued success of the LTR system requires not only a great ranking model, but also an ecosystem of intelligent metadata services and reliable data infrastructure.
This talk is a collection of examples about the growing pains and remedies of iterating LTR beyond v1.0 at Snag. To start, we will address a few nuances of LTR as a machine-learning problem, e.g. high sample complexity, potential biases from training data, limitations of BM25-based features, incorporation of user preferences, evaluation metrics to please both human users and SEO bots, etc. Then, we will present a few of our newest developments to supplement the current LTR system, including our posting deduplication services, job title normalization services, and architectural designs of our next-generation signal platform and posting enrichment pipeline.
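For readers who want a feel for the modeling side, here is a minimal, hypothetical sketch of training a gradient-boosted LTR model with XGBoost on a toy judgment list; the features and data are invented, and this is not Snag's pipeline:

```python
import numpy as np
from xgboost import XGBRanker

# Toy features per (query, posting) pair:
# [BM25 title score, seeker-to-job distance in miles, posting age in days]
X = np.array([
    [12.3, 1.2, 3], [8.1, 4.5, 10], [3.3, 0.8, 1],   # query 1 candidates
    [9.7, 2.0, 7], [11.0, 9.9, 2], [2.2, 1.1, 30],   # query 2 candidates
])
y = np.array([2, 1, 0, 2, 1, 0])  # graded relevance judgments
group = [3, 3]                    # candidates per query, in row order

model = XGBRanker(objective="rank:ndcg", n_estimators=50)
model.fit(X, y, group=group)

# Score and sort candidate postings for a new query.
candidates = np.array([[10.0, 0.5, 2], [6.0, 3.0, 14]])
order = np.argsort(-model.predict(candidates))
print(candidates[order])
```

The nuances the talk lists (sample complexity, training-data bias, the limits of BM25 features) all show up as choices about what goes into X, y, and the judgment-collection process feeding them.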
Leveraging Machine Learning for Competitive Advantage by Dylan Hogg - Search ... - Search Party
Dylan Hogg, Search Party's Head of Data Science, explores how machine learning provides a set of tools to extract value and insights from data. His presentation, given at the Chief Data Officer Forum in Sydney, covers Search Party's scalable data platform, what machine learning is, and two different applications of machine learning within the product.
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma... - Databricks
From training on billions of ad impressions to scaling gradient-boosted trees with more than three million nodes, ad targeting at Yelp uses Apache Spark in many stages of its large-scale machine learning pipeline.
This session will explore examples of how Yelp employed and tweaked Spark to support big data feature engineering, visualizations and machine learning model training, evaluation and diagnostics. You’ll also hear about the challenges in building and deploying such a large-scale intelligent system in a production environment.
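As a tiny sketch of the kind of Spark ML training step described (toy data and features; not Yelp's actual pipeline):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("gbt-demo").getOrCreate()

# Toy ad-impression rows: (clicked label, relevance score, bid amount).
df = spark.createDataFrame(
    [(1.0, 0.9, 2.5), (0.0, 0.2, 0.5), (1.0, 0.7, 1.0), (0.0, 0.1, 3.0)],
    ["label", "relevance", "bid"],
)
assembled = VectorAssembler(inputCols=["relevance", "bid"],
                            outputCol="features").transform(df)

# Gradient-boosted trees from Spark's ML library; at scale the same
# API runs over billions of rows distributed across a cluster.
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(assembled)
model.transform(assembled).select("label", "prediction").show()
```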
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ... - Benjamin Le
An industry talk given at RecSys 2017 about job recommendations at LinkedIn and some of the challenges we faced and solved. https://recsys.acm.org/recsys17/industry-session-2/#content-tab-1-4-tab
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge - Dataiku
This is a presentation made on 13 August 2014 at the SF Data Mining Meetup at Trulia. It covers Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
12. The (Old) System
● System that is too complex to accurately tune the boosts: relevancy whack-a-mole
● Inventory content changes frequently
● Lacks data-driven input: assumption-driven, without proper statistical analysis
“If only there was a way to do this differently…”
14. Job search is both an IR and a match problem
Search / IR (e.g. YouTube): { User } → { Resource }
● Many to many
● Asymmetric
● Unlimited supply
Match (e.g. online chess): { Player } ↔ { Player }
● One to one
● Symmetric
● No extra supply
Job Search: { Job Seekers } → { Job Positions }
● One to many
● Asymmetric and bi-directional
● Limited supply, unlimited “attempts”
15. Hourly Jobs are not “Sticky”
● Fragmented: organized around “shifts”. A worker can be assigned 1 to 30+ hours per week; many hold multiple jobs.
● Transactional: workers stay at each job for 6 months on average.
● Lightly skilled: many hourly jobs require just a high school diploma.
Source: https://www.snag.co/employers/wp-content/uploads/2016/07/2016_SOTHW_Report-3.pdf
16. Hourly job search is often a recommendation
● Schedule and location can matter more than the actual duties of the job
● Queries are not explicit (40% don’t have keywords)
20. Learning to Rank Model (Development Environment)
● Relevancy Labels
  Abandonment: 0
  Click: 1
  Apply Intent: 2
● Features (see the feature-set sketch below)
  - match scores on job title, employer name, job type, ...
  - match scores on job description
  - match scores on query location (e.g. zip-code, city)
  - distance <position, seeker>
  - query string attributes (e.g. length, entity type)
  - posting attributes (e.g. position, requirements, industry, semantic representation)
  - ...
● Model: LambdaMART. Composability!
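As a rough illustration of how features like these could be wired into the esltr plugin shown on the next slides, here is a minimal sketch using the Elasticsearch LTR plugin's feature-set API. The field names (job_title, employer_name, location), feature names, and endpoint URL are assumptions for illustration, not the deck's actual schema.

```python
# A minimal sketch of registering a feature set with the Elasticsearch
# Learning to Rank plugin. Field names and the localhost endpoint are
# hypothetical; adapt them to the actual posting mapping.
import requests

featureset = {
    "featureset": {
        "features": [
            {   # text match score on the posting title
                "name": "job_title_match",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"job_title": "{{keywords}}"}},
            },
            {   # text match score on the employer name
                "name": "employer_name_match",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"employer_name": "{{keywords}}"}},
            },
            {   # distance decay between the seeker and the posting location
                "name": "seeker_posting_distance",
                "params": ["seeker_lat", "seeker_lon"],
                "template_language": "mustache",
                "template": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "gauss": {
                            "location": {
                                "origin": "{{seeker_lat}},{{seeker_lon}}",
                                "scale": "10km",
                            }
                        },
                    }
                },
            },
        ]
    }
}

resp = requests.put("http://localhost:9200/_ltr/_featureset/job_search",
                    json=featureset)
resp.raise_for_status()
```

Once a feature set like this is registered, the plugin can log per-document feature values at query time, which is the mechanism the 1.0 pipeline below relies on.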
21. Training Pipeline - esltr plugin 0.x (Development Environment)
[Pipeline diagram. Components: data warehouse (user events, posting docs); event sampler and posting sampler; relevancy label parser (producing relevancy scores and query info); posting collection and posting ingestion into a training index served by a dev search engine; feature backfilling (producing features); training data generator (producing training data, see the format sketch below); model generator (Ranklib) producing the ranking model, deployed to the prod search engine.]
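The diagram's "training data generator" feeds Ranklib, which consumes the plain-text LETOR/SVMlight format. A minimal sketch of that joining step, under assumed record shapes (the real generator's schema is not shown in the deck):

```python
# A minimal sketch of the "training data generator" stage: joining relevancy
# labels with backfilled feature values and emitting RankLib's LETOR format
# (<label> qid:<query> <idx>:<value> ...). Input records are hypothetical.
def write_ranklib_file(samples, path):
    """samples: iterable of (label, query_id, feature_vector) tuples,
    where feature_vector is a list of floats in a fixed order."""
    with open(path, "w") as out:
        for label, qid, features in samples:
            feats = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, 1))
            out.write(f"{label} qid:{qid} {feats}\n")

# Example: apply intent (2) and click (1) events for one query,
# and an abandonment (0) for another.
write_ranklib_file(
    [(2, 101, [0.83, 0.10, 2.4]),
     (1, 101, [0.55, 0.00, 5.1]),
     (0, 102, [0.12, 0.00, 9.8])],
    "training_data.txt",
)
```

RankLib can then train a LambdaMART model on the file, e.g. `java -jar RankLib.jar -train training_data.txt -ranker 6 -metric2t NDCG@10`.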
22. Training Pipeline - esltr plugin 1.0 (Development Environment)
[Pipeline diagram. Components: data warehouse fed by user events and live feature logs from the prod search engine; event sampler; relevancy label generator (producing relevancy scores); feature parser (producing features); training data generator (producing training data); model generator + HyperOpt producing the ranking model. Key change from 0.x: features are logged live at query time instead of backfilled against a dev index. A HyperOpt sketch follows below.]
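Since the 1.0 pipeline adds HyperOpt to the model generator, here is a minimal sketch of how LambdaMART hyper-parameters might be searched with it. `train_and_validate` is a hypothetical stand-in, not the deck's actual code.

```python
# A minimal sketch of hyper-parameter search with HyperOpt for a LambdaMART
# ranker, maximizing validation NDCG@10.
from hyperopt import fmin, tpe, hp, Trials

def train_and_validate(num_trees, num_leaves, learning_rate):
    """Hypothetical stand-in: train a model (e.g. via RankLib or xgboost)
    and return validation NDCG@10. Replace with a real training run."""
    return 0.70 - abs(learning_rate - 0.1)  # toy response surface

space = {
    "num_trees": hp.quniform("num_trees", 100, 1000, 50),
    "num_leaves": hp.quniform("num_leaves", 4, 64, 4),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),
}

def objective(params):
    ndcg10 = train_and_validate(
        num_trees=int(params["num_trees"]),
        num_leaves=int(params["num_leaves"]),
        learning_rate=params["learning_rate"],
    )
    return -ndcg10  # HyperOpt minimizes, so negate the metric

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print("best hyper-parameters:", best)
```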
23. Offline Validation Pre-Deployment (Development Environment)
● Re-ranking historical queries (see the NDCG@10 sketch below)
  Gives good directional guidance, but is not very accurate in absolute numbers due to 1) the inability to account for new items and 2) contamination from sponsored postings with artificially high rankings.
● Manual examination of common query patterns
  Great for sanity checks. Reveals details beyond relevancy labels. More indicative of future performance.
● Best of both worlds?
  Aljadda, Khalifeh & Korayem, Mohammed & Grainger, Trey. (2018). Fully Automated QA System for Large Scale Search and Recommendation Engines Leveraging Implicit User Feedback.
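For the historical re-ranking check, the core computation is NDCG@10 over the deck's relevancy labels. A minimal self-contained sketch (the example label lists are made up):

```python
# Compare NDCG@10 of the production order vs. a candidate LTR re-ranking
# over one historical query. Labels follow the deck's scheme
# (abandonment=0, click=1, apply intent=2).
import math

def dcg_at_k(labels, k=10):
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k=10):
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# labels listed in the order each ranker displayed the postings
prod_order = [0, 2, 0, 1, 0, 0, 1, 0, 0, 0]
ltr_order  = [2, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(f"prod NDCG@10: {ndcg_at_k(prod_order):.3f}")
print(f"LTR  NDCG@10: {ndcg_at_k(ltr_order):.3f}")
```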
24. Deployment via A/B testing (Production Environment)
Don’t modify the existing system.
25. Deployment via A/B testing (Production Environment)
a) Build a parallel system
b) Iterate
c) Test
d) Evaluate
29. Iteration 1 (Q2 2017)
● LTR Features
  1. job_title match score
  2. job_description match score
  3. employer_name match score
  4. city-state match score
  5. zipcode match score
  6. distance <query location, posting>
● Relevancy Labels
  Click: 1
  Apply Intent: 2
  Completed Applications: 3
● Success Criteria
  - NDCG@10
● Use Cases
  Site: desktop, mobile web
  User: registered
  Search Type:
  - zip-code location only
  - zip-code location + keyword
30. Relevancy Performance - Iteration 1
● Pros
  Immediate boost of NDCG for zipcode-only searches (~5%)
● Cons
  Keyword and location-only searches shared the same feature space, leading to a polarized user experience.
● Todo
  Add query-string-related attributes to the list of features
● When things don’t work:
  Query:
  - keyword: Starbucks
  - location: Arlington, VA, 22201
  Results:
  Rank  Employer    Location
  1     Starbucks   Arlington, VA, 22201
  2     Starbucks   Arlington, VA, 22203
  3     WholeFoods  Arlington, VA, 22201
  4     Starbucks   Washington, DC, 20007
31. Iteration 2 (Q3 2017)
● Success Criteria
  - NDCG@10
  - Application rate (# of applications / # of search sessions)
● Use Cases
  Site: desktop, mobile web
  User: registered, unregistered
  Search Type:
  - zip-code location only
  - zip-code location + keyword
  - text location only
  - text location + keyword
● LTR Features
  1. job_title match scores
  2. job_description match scores
  3. employer_name match scores
  4. location “match” level
  5. distance <seeker, posting>
  6. query location level
  7. query length
  8. platform (e.g. desktop, mobile)
  9. job seeker registration status
● Relevancy Labels
  Click: 1
  Apply Intent: 2
  Completed Applications: 3
32. Relevancy Performance - Iteration 2
● Pros
  More stable performance across the board.
● Cons
  A geo-location resolution rate of only ~95% hurt queries with text locations.
  Default text analyzers supplied noisy signals to LTR.
● Todo
  Enhance geo-coding logic.
  Define customized analyzers (e.g. stopwords, synonym filters, keyword markers) for every field used by the ranking model (see the analyzer sketch below).
● When things don’t work:
  Query:
  - keyword: Part time restaurant
  Results:
  Rank  Title              Employer
  1     Part time server   Chipotle
  2     Full time cook     KFC
  3     Part time Cashier  Restaurant Depot
  4     Cook               District Taco
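As one way to implement the analyzer Todo above, here is a hedged sketch of Elasticsearch index settings with domain stopwords, synonyms, and keyword markers. All terms, filter names, and the analyzer name are illustrative, not the team's actual configuration.

```python
# A minimal sketch of custom Elasticsearch analysis settings for a field
# used by the ranking model. Stopwords, synonyms, and protected brand
# keywords shown here are made-up examples.
job_title_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "job_stopwords": {
                    "type": "stop",
                    # boilerplate terms that add noise to match scores
                    "stopwords": ["hiring", "needed", "asap"],
                },
                "job_synonyms": {
                    "type": "synonym",
                    "synonyms": ["pt, part time", "ft, full time"],
                },
                "brand_keywords": {
                    # protect employer names from stemming/analysis
                    "type": "keyword_marker",
                    "keywords": ["starbucks", "chipotle"],
                },
            },
            "analyzer": {
                "job_title_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "brand_keywords",
                               "job_synonyms", "job_stopwords"],
                }
            },
        }
    }
}
# These settings would be applied at index creation time, and the analyzer
# referenced from the job_title field mapping.
```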
33. Iteration 3 (Q4 2017)
● Success Criteria
  - NDCG@10
  - Application rate (# of applications / # of search sessions)
  - Applicant conversion rate (# of applicants / # of users)
  - Applications per user (# of applications / # of users)
● Use Cases
  Site: desktop, mobile web
  User: registered, unregistered
  Search Type:
  - zip-code location only
  - zip-code location + keyword
  - text location only
  - text location + keyword
  - keyword only
● LTR Features
  1. job_title match scores
  2. job_description match scores
  3. employer_name match scores
  4. location “match” level
  5. distance <seeker, postings>
  6. query location level
  7. query length
  8. platform (e.g. desktop, mobile)
  9. job seeker registration status
  10. is_faceted flag
● Relevancy Labels
  Click: 1
  Apply Intent: 2
  Completed Applications: 3
34. Relevancy Performance - Iteration 3
● Pros
  Location-only searches were 10%+ better than baseline. Keyword searches broke even.
● Cons
  Large numbers of tied LTR scores artificially limited user options via presentation bias.
  Lack of features about job description context meant “clickbait” postings received too much exposure.
● Todo
  Randomize the ranking of postings with tied LTR scores on a per-user/session basis (see the sketch below).
  Add query-independent posting-level features.
● When things don’t work:
  Query:
  - keyword: PT (part time)
  - location: Arlington, VA
  Results:
  Rank  Title              Location
  1     Part time Cashier  Arlington, VA, 22201
  2     Drive Uber PT!     Arlington, VA, 22209
  3     Drive Uber PT!     Arlington, VA, 22202
  4     Drive Uber PT!     Arlington, VA, 22203
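One way the tie-randomization Todo could be implemented: break ties with a hash of the session and posting ids, so rankings stay stable within a session but vary across sessions. A minimal sketch; the function name and id formats are hypothetical.

```python
# Per-session tie-breaking for postings with equal LTR scores: the primary
# sort key is the score (descending), the secondary key is a session-specific
# pseudo-random value derived from a hash, so each session sees a stable but
# different ordering of the tied group.
import hashlib

def session_tiebreak(postings, scores, session_id):
    """postings: list of posting ids; scores: parallel list of LTR scores."""
    def key(idx):
        digest = hashlib.md5(
            f"{session_id}:{postings[idx]}".encode()).hexdigest()
        return (-scores[idx], int(digest, 16))
    return [postings[i] for i in sorted(range(len(postings)), key=key)]

# Three tied postings: different sessions get different (but repeatable) orders.
print(session_tiebreak(["uber_1", "uber_2", "cashier_9"],
                       [1.7, 1.7, 1.7], session_id="abc123"))
```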
35. Current Iteration (Q1 2018)
● Success Criteria
  - Application rate (# of applications / # of search sessions)
  - Applicant conversion rate (# of applicants / # of users)
  - Applications per user (# of applications / # of users)
  - Application diversity (# of distinct applied postings / # of applications)
● Use Cases
  Site: mobile apps, desktop, mobile web
  User: registered, unregistered
  Search Type:
  - zip-code location only
  - zip-code location + keyword
  - text location only
  - text location + keyword
  - keyword only
  - user coordinates only (a.k.a. Jobs Near Me)
  - user coordinates + keyword
● LTR Features
  1. job_title match scores
  2. job_description match scores
  3. employer_name match scores
  4. location “match” level
  5. distance <seeker, postings>
  6. query location level
  7. query length
  8. platform (e.g. desktop, mobile)
  9. job seeker registration status
  10. is_faceted flag
  11. location confidence level of postings (proxy for posting quality)
● Relevancy Labels
  Click: 1
  Apply Intent: 2
  Completed Applications: 3
36. Android App Live Performance (April 2018)
Metrics (a rough significance check follows below):
  Metric                      Control (80% of users)   Test (20% of users)   Average % Lift
  Application Rate            0.1273 (0.0005)          0.1409 (0.0011)       10.72%
  Applicant Conversion Rate   33.86% (0.20%)           36.64% (0.43%)        8.22%
  Apply Intent Diversity      0.676 (0.002)            0.759 (0.004)         12.40%
  Click Diversity             0.663 (0.002)            0.807 (0.004)         21.62%
Qualitative assessments:
● Signal regularisation: no particular field has an outsized impact on relevancy anymore
● Signal coordination: e.g. the interaction between text and location relevancy is more balanced
● Randomized ties => better match: randomization enables well-distributed matchings and better marketplace health, and partially corrects positional bias
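A rough significance check on the application-rate row, assuming the parenthesized values are standard errors of each metric (an assumption; the deck does not say what they represent):

```python
# Two-sample z-test on the application rate, treating the parenthesized
# values in the table as standard errors (this is an assumption).
import math

control, se_control = 0.1273, 0.0005
test, se_test = 0.1409, 0.0011

lift = (test - control) / control
z = (test - control) / math.sqrt(se_control ** 2 + se_test ** 2)
print(f"lift: {lift:.2%}, z-score: {z:.1f}")  # ~10.7% lift, z ≈ 11.3
```

Under that assumption the lift is far outside noise; the table's 10.72% figure presumably reflects a slightly different averaging (e.g. per-day lifts).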
37. Engineering Challenges
● Latency
  - API: window size reduced from 3000 to 1000 to 500
  - Igniter (posting ingestion) execution time
● Signal Quality
  - Randomization vs. result consistency
39. Lessons Learned - Model Development
● Relevancy tuning can create feedback loops. Look ahead.
  Changes in the ranking function sometimes trigger changes in user behavior, which in turn invalidate said ranking function. Treat relevancy tuning as interactive experiments, not a curve-fitting exercise.
● Apply strong model assumptions to correct deficiencies in old ranking functions.
  Use sound behavioral hypotheses from data analysis and qualitative user research to regulate model behavior. Historical data can be noisy. Let A/B tests be the final judge.
● Engineer the relevancy labels as well as the features.
  Implicit feedback is not an absolute measure of relevancy and should be modeled to account for biases and behavioral assumptions.
● Ranking functions are only as expressive as the features you feed them.
  Any relevancy insight that can’t be encoded as a meaningful difference in the feature space will not be reflected in the search results.
40. Lessons Learned - Engineering & Infrastructure
● Prioritize velocity of iteration (beware analysis paralysis)
● Work backwards from conclusions about system latency
42. Posting and Query Semantics Features
● Contextual information in posting descriptions contributes many relevancy signals
● Back-tests on both manually crafted bag-of-words features and machine-learned representations (e.g. via SVD, word2vec) already showed a significant lift in re-ranked NDCG (see the embedding sketch below)
● Some concerns about query-time performance and over-fitting of long NLP feature vectors
High context: “... hiring individuals to work as part-time Package Handlers... involves continual lifting, lowering and sliding packages that typically weigh 25 - 35 lbs… typically do not work on holidays.... working approximately 17.5 - 20 hours per week… outstanding education assistance of up to $2,625 per semester...”
Low context: “We have a part time opening for a delivery driver position. Must be authorized to work in the US”
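As a sketch of the word2vec-based representation mentioned above: average word vectors over a posting description to obtain a dense semantic feature vector. This uses gensim; the vector file path is hypothetical, and the real system may well have used a different pooling scheme.

```python
# A minimal sketch of a dense posting-description feature: the mean of
# word2vec vectors over the description's tokens.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("posting_word2vec.kv")  # trained offline (path is hypothetical)

def description_embedding(text):
    """Average in-vocabulary word vectors; zeros if nothing matches."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

high_ctx = description_embedding(
    "hiring individuals to work as part-time package handlers ...")
```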
43. Click / Relevancy Label Modeling (Model Improvements)
● Build multi-stage click models to account for factors that cannot be formulated as query-time LTR features (e.g. rank position, between-session correlations). See the sketch below.
● This creates a positive feedback loop that boosts potentially relevant postings with low exposure (and penalizes the reverse).
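A minimal sketch of the position-bias part of this idea: weight clicks by the inverse of per-rank examination propensity. The propensity values below are made up; in practice they would come from a click model fit on search logs (e.g. with PyClick, cited in the references).

```python
# Inverse-propensity-weighted click labels: a click at a deep rank counts
# more, because the posting was less likely to be examined at all.
# Propensities here are illustrative placeholders, not fitted values.
propensity = {1: 1.00, 2: 0.72, 3: 0.55, 4: 0.45, 5: 0.38}

def debiased_label(clicked, rank):
    if not clicked:
        return 0.0
    return 1.0 / propensity.get(rank, 0.3)

# A click at rank 4 yields a stronger relevancy signal than one at rank 1.
print(debiased_label(True, 1), debiased_label(True, 4))  # 1.0, ~2.22
```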
44. Personalized Matching (Model Improvements)
● Incorporate LTR features about matching signals between job seeker preferences/qualifications and job requirements (see the sketch below)
● (Potentially) an online learning module that dynamically adjusts the rankings shown to each user based on onsite behavior
“I want a part time job near my home! ...that pays >$15 per hour. No night shifts! ...in the retail industry, where I have 5 years of experience. Bonus points if it’s Harris Teeter…”
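A hedged sketch of what such preference/qualification match features might look like, using the seeker from the quote above. All field names and the record shapes are hypothetical.

```python
# Illustrative preference/qualification match features between a seeker
# profile and a job posting, suitable as inputs to the LTR model.
def preference_match_features(seeker, posting):
    return {
        "pay_meets_minimum": float(posting["pay_rate"] >= seeker["min_pay"]),
        "shift_compatible": float(posting["shift"] in seeker["shifts"]),
        "industry_experience_years":
            seeker["experience"].get(posting["industry"], 0),
        "preferred_employer": float(
            posting["employer"] in seeker["favorite_employers"]),
    }

seeker = {"min_pay": 15, "shifts": {"day"},
          "favorite_employers": {"Harris Teeter"},
          "experience": {"retail": 5}}
posting = {"pay_rate": 16, "shift": "day", "industry": "retail",
           "employer": "Harris Teeter"}
print(preference_match_features(seeker, posting))
```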
45. Engineering Improvements (Engineering & Infrastructure)
● Push-button training pipeline
● Automated push-button deployment for re-indexing
● Latency and scale improvements
47. References
● Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
● ES Learning to Rank plugin: http://elasticsearch-learning-to-rank.readthedocs.io/en/latest/
● Relevancy tuning: Turnbull, Doug, and John Berryman. Relevant Search: With Applications for Solr and Elasticsearch. Manning, 2016.
● LambdaMART: C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82, Microsoft Research, 2010.
● RankLib: https://sourceforge.net/p/lemur/wiki/RankLib
● xgboost: Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785-794.
  K. V. Rashmi and Ran Gilad-Bachrach. DART: Dropouts Meet Multiple Additive Regression Trees. April 2015.
● hyperopt: J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for Hyper-Parameter Optimization. Proc. Neural Information Processing Systems 24 (NIPS 2011), 2546-2554, 2011.
● Interleaving: O. Chapelle, T. Joachims, F. Radlinski, and Yisong Yue. Large-Scale Validation and Analysis of Interleaved Search Evaluation. ACM Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.
  T. Joachims. Evaluating Retrieval Performance Using Clickthrough Data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
● Document & query embeddings: Mitra, Bhaskar & Craswell, Nick. (2017). Neural Models for Information Retrieval.
  Hamed Zamani and W. Bruce Croft. 2017. Relevance-based Word Embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, USA, 505-514.
● Click models: Chuklin, A., Markov, I., & de Rijke, M. (2015). Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3), 1-115. With PyClick: https://github.com/markovi/PyClick
  Y. Hu, Y. Koren, and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. 2008 Eighth IEEE International Conference on Data Mining, Pisa, 2008, pp. 263-272.