With advanced search tools like Solr and Elasticsearch available, companies are embedding search in almost all of their products and websites; search is becoming mainstream. That frees us to focus on teaching the search engine new tricks to return more relevant results. One such trick is called "learning to rank". During the presentation, you'll learn what learning to rank is, when to apply it, and of course you'll see an example of how it works using Elasticsearch and a learning-to-rank plugin. After this presentation, you will know how to combine machine learning models and search.
Combining machine learning and search through learning to rank - Jettro Coenradie
In this presentation, we will go through all the steps to use machine learning to improve your search results. We'll discuss the search basics you need to know as well as some machine learning basics. After that, we use a sample application available at the URL https://rolling500.luminis.amsterdam to show improvements using a trained model and the learning to rank plugin in Elasticsearch.
Building Search & Recommendation Engines - Trey Grainger
In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about the history of LTR, a taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR to the TMDB dataset with Solr, with Elasticsearch, and without index support.
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine - lucenerevolution
This session will present a detailed tear-down and walk-through of a working soup-to-nuts recommendation engine that uses observations of multiple kinds of behavior to do combined recommendation and cross recommendation. The system is built using Mahout to do off-line analysis and Solr to provide real-time recommendations. The presentation will also include enough theory to provide useful working intuitions for those desiring to adapt this design.
The entire system including a data generator, off-line analysis scripts, Solr configurations and sample web pages will be made available on github for attendees to modify as they like.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets connected with each other to form the so called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is for sure an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data in a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... - Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
ChatGPT
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves applying various techniques and methods to extract insights from data sets, often with the goal of uncovering patterns, trends, relationships, or making predictions.
Here's an overview of the key steps and techniques involved in data analysis:
Data Collection: The first step in data analysis is gathering relevant data from various sources. This can include structured data from databases, spreadsheets, or surveys, as well as unstructured data such as text documents, social media posts, or sensor readings.
Data Cleaning and Preprocessing: Once the data is collected, it often needs to be cleaned and preprocessed to ensure its quality and suitability for analysis. This involves handling missing values, removing duplicates, addressing inconsistencies, and transforming data into a suitable format for analysis.
Exploratory Data Analysis (EDA): EDA involves examining and understanding the data through summary statistics, visualizations, and statistical techniques. It helps identify patterns, distributions, outliers, and potential relationships between variables. EDA also helps in formulating hypotheses and guiding further analysis.
Data Modeling and Statistical Analysis: In this step, various statistical techniques and models are applied to the data to gain deeper insights. This can include descriptive statistics, inferential statistics, hypothesis testing, regression analysis, time series analysis, clustering, classification, and more. The choice of techniques depends on the nature of the data and the research questions being addressed.
Data Visualization: Data visualization plays a crucial role in data analysis. It involves creating meaningful and visually appealing representations of data through charts, graphs, plots, and interactive dashboards. Visualizations help in communicating insights effectively and spotting trends or patterns that may be difficult to identify in raw data.
Interpretation and Conclusion: Once the analysis is performed, the findings need to be interpreted in the context of the problem or research objectives. Conclusions are drawn based on the results, and recommendations or insights are provided to stakeholders or decision-makers.
Reporting and Communication: The final step is to present the results and findings of the data analysis in a clear and concise manner. This can be in the form of reports, presentations, or interactive visualizations. Effective communication of the analysis results is crucial for stakeholders to understand and make informed decisions based on the insights gained.
Data analysis is widely used in various fields, including business, finance, marketing, healthcare, social sciences, and more. It plays a crucial role in extracting value from data, supporting evidence-based decision-making, and driving actionable insights.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... - Spark Summit
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability to use data frames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high performance Spark clusters. In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM and latest generation NVMe SSD’s and a 100GbE network with a goal of more performance, in a more space and energy efficient footprint.
ADV Slides: Graph Databases on the Edge - DATAVERSITY
Graph databases may be the unsung heroes of data platforms. They are poised to expand dramatically in the next few years as the nature of important analytics data expands dramatically into understanding. We live and work today in a highly connected world where individuals and their relationships organize perceptions, consumer behaviors, and many other business success factors. Where patterns are involved in relationships, it is imperative to understand them. Graph databases are the technology that is best-suited to determining and understanding data relationships.
This code-lite session is a primer on graph databases and the relationship data stored in them for the analytics architect in the enterprise. It will help you determine why, how, and where to apply graphs, and how to get started.
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
Pistoia Alliance harmonizing FAIR data catalog approaches webinar - Pistoia Alliance
Multiple groups in the life sciences community have started their journey towards data FAIR-ification by implementing Data Catalogs, a clear first step towards Finding your data. While in many cases the approaches are quite similar, in both origin and intent, differing implementations could end up hampering interoperability and reuse. The Pistoia Alliance and the Linked Data Community of Practice hosted a panel discussion describing three implementations and their downstream goals:
[1] Pharma cross-omics data catalogs,
[2] Clinical data catalogs
[3] Bioschemas for dataset discoverability on the inter/intranet
How to Feed a Data Hungry Organization – by Traveloka Data Team - Traveloka
In Traveloka's Inaugural Data Meetup held in April 2017, Ainun Najib (Head of Data), Dr. Philip Thomas (Lead Data Scientist), and Rendy B. Junior (Lead Data Engineer) shared about the journey that Traveloka's Data Team have taken so far so that the audience can learn from the struggles and triumphs in managing Traveloka's burgeoning data.
You will learn more about:
1) Data culture in Traveloka
2) Data engineering in Traveloka
3) Data science in Traveloka
To follow our LinkedIn page, visit bit.ly/TravelokaLinkedInPage
Safe Harbor Statement
Our discussion may include predictions, estimates or other information that might be considered conclusive. While these conclusive statements represent our current judgment on the best practices, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on our statements, which reflect our opinions only as of the date of this presentation. Please keep in mind that we are not obligating ourselves to revise or publicly release the results of any revision to these presentation materials in light of new information or future events.
Temporal and semantic analysis of richly typed social networks from user-gene... - Zide Meng
We propose an approach to detect topics, overlapping communities of interest, expertise, trends and activities in user-generated content sites and in particular in question-answering forums such as StackOverflow. We first describe QASM (Question & Answer Social Media), a system based on social network analysis to manage the two main resources in question-answering sites: users and content. We also introduce the QASM vocabulary used to formalize both the level of interest and the expertise of users on topics. We then propose an efficient approach to detect communities of interest. It relies on another method to enrich questions with a more general tag when needed. We compared three detection methods on a dataset extracted from the popular Q&A site StackOverflow. Our method based on topic modeling and user membership assignment is shown to be much simpler and faster while preserving the quality of detection. We then propose an additional method to automatically generate a label for a detected topic by analyzing the meaning and links of its bag of words. We conduct a user study to compare different algorithms to choose a label. Finally we extend our probabilistic graphical model to jointly model topics, expertise, activities and trends. We performed experiments with real-world data to confirm the effectiveness of our joint model, studying user behaviors and topic dynamics.
http://www-sop.inria.fr/members/Zide.Meng/
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... - Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives... - Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
17. Recap: Inverted Index

Term       Doc IDs  TTF (total term frequency)
against    5        1
band       6        1
beatles    3        1
machine    5        1
radiohead  4        1
rage       5        1
rolling    2        1
stones     2        1
the        5,6      2
u2         1        1

Doc ID  Artist
1       U2
2       Rolling Stones
3       Beatles
4       Radiohead
5       Rage Against the Machine
6       The Band
18. {
  "album": "War",
  "image": "rs-137178-883f5fe955b2745cd539.jpg",
  "information": "<p>U2 were on the cusp of becoming one of the Eighties' most important groups when their third album came out. It's the band's most overtly political album, with songs about Poland's Solidarity movement (\"New Year's Day\") and Irish unrest (\"Sunday Bloody Sunday\") charged with explosive, passionate guitar rock.</p>",
  "sequence": 223,
  "order": 279,
  "label": "Island",
  "artist": "U2",
  "year": "1983",
  "id": 350640
}
25. Learning to rank
Y = f(X)
X - the set of queries
Y - the ideal ordered set of documents for each query
f(X) - the model, with parameters learned by a training algorithm
26. Learning to rank
[Diagram: a set of queries X is fed through the trained model f(X), which outputs relevance scores Y, one ranked list per query]
27. Model Evaluation Types
• Binary relevance (MAP, Precision)
• Graded relevance, position based (DCG, NDCG)
• Graded relevance, cascade based (ERR)
http://bit.ly/eval-metric
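To make the position-based metrics above concrete, here is a minimal, self-contained sketch of DCG@k and NDCG@k (the function names and the sample grades are our own illustration, not from the talk):

```python
import math

def dcg_at_k(gains, k):
    """DCG@k: graded relevance summed over the top k, discounted by log2 of position."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """NDCG@k: DCG normalised by the ideal (descending-sorted) DCG, so 1.0 is a perfect ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Graded relevance (0-4) of one query's results, in the order the engine returned them:
print(ndcg_at_k([4, 1, 4, 0, 1], k=5))  # ~0.9414
```

Because NDCG is normalised per query, scores can be averaged across queries with different numbers of relevant documents.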
28. Learning to Rank approaches
• Pointwise - Calculate a score for each item independently, then sort by score
• Pairwise - Compare items two at a time; train a binary classifier to predict which document of each pair is the better one
• Listwise - Optimise the complete result list directly, using one of the evaluation measures averaged over all queries
http://bit.ly/list-wise-approach
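As a small illustration of the pairwise approach (our own sketch, not from the talk), graded judgements for one query can be turned into the binary training pairs a pairwise ranker learns from:

```python
def pairwise_examples(graded_docs):
    """graded_docs: list of (doc_id, grade) for one query.
    Returns ((better_candidate, other), label) pairs: +1 if the first
    document is graded higher, -1 if lower; equal grades yield no pair."""
    pairs = []
    for i, (d1, g1) in enumerate(graded_docs):
        for d2, g2 in graded_docs[i + 1:]:
            if g1 != g2:
                pairs.append(((d1, d2), 1 if g1 > g2 else -1))
    return pairs

# Hypothetical grades for the query "u2":
print(pairwise_examples([("u2-war", 4), ("radiohead-kid-a", 1), ("beatles-sgt-pepper", 1)]))
```

The two Radiohead/Beatles documents share a grade, so they produce no training pair; only pairs with a clear preference are used.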
29. LTR: Steps to take
1. Create a judgement list (ground truth)
2. Define features for the model
3. Log feature values during usage
4. Train and test the model
5. Deploy and use the model
31. Judgement List: Expert Panel
• Time-consuming and expensive
• Error-prone, because different judges disagree
http://bit.ly/ebay-hum-judge
32. Judgement List: Implicit Feedback
• Log user behaviour
• A click on its own is not a relevance judgement
• Compare actual clicks against expected clicks
33. Click models
• Random Click Model -> every document has the same chance of being clicked
• Click-Through-Rate Model -> exploits the fact that the first document is clicked far more often than the second
• Cascade Model -> a click on the third item also tells us something about the first two items: they were examined but skipped
• Dynamic Bayesian Network Model -> uses the historical clicks in a search session together with the actual relevance of a document
http://bit.ly/click-model
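A toy sketch of the Click-Through-Rate idea above: compare the clicks a document actually received at its rank with the clicks position bias alone would predict. The expected per-rank CTR values here are made-up numbers purely for illustration:

```python
# Hypothetical expected click-through rate per rank, e.g. estimated from historical logs:
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.08}

def relevance_signal(rank, impressions, clicks):
    """Ratio of actual to expected clicks at this rank.
    A value above 1 means the document attracted more clicks than its
    position alone predicts, hinting at above-average relevance."""
    expected_clicks = EXPECTED_CTR[rank] * impressions
    return clicks / expected_clicks

# Shown 1000 times at rank 2, clicked 240 times; position bias predicts only 150 clicks:
print(relevance_signal(rank=2, impressions=1000, clicks=240))  # 1.6
```

Signals like this can feed the grades in a judgement list without an expert panel, at the cost of the biases the more advanced click models above try to correct.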
36. Our Judgement List
# grade (0-4) queryid docId title
#
# qid:1: u2
# qid:2: metallica
# qid:3: beatles
#
# https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/
#
4 qid:1 #350967 U2-Achtung Baby
4 qid:1 #350696 U2-All That You Can't Leave Behind
1 qid:1 #350897 Radiohead-The Bends
1 qid:1 #351105 Radiohead-Kid A
1 qid:1 #351029 The Beatles-Sgt. Pepper's Lonely Hearts Club Band
4 qid:2 #350672 Metallica-Metallica
4 qid:2 #350816 Metallica-Master of Puppets
0 qid:2 #351029 The Beatles-Sgt. Pepper's Lonely Hearts Club Band
4 qid:3 #350976 Beatles-Meet the Beatles!
1 qid:3 #350881 The Byrds-Younger Than Yesterday
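A quick sketch of how such a RankLib-style judgement list can be parsed (our own illustration; in a full training file, feature columns would follow the qid before the `#` comment):

```python
def parse_judgements(lines):
    """Parse 'grade qid:N # comment' lines into {qid: [(grade, comment), ...]}."""
    judgements = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment-only lines
            continue
        head, _, comment = line.partition("#")
        grade_str, qid_str = head.split()[:2]
        qid = int(qid_str.split(":")[1])
        judgements.setdefault(qid, []).append((int(grade_str), comment.strip()))
    return judgements

sample = [
    "# qid:1: u2",
    "4 qid:1 #350967 U2-Achtung Baby",
    "1 qid:1 #350897 Radiohead-The Bends",
]
print(parse_judgements(sample))
```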
37. 2. Features for the model
• Elasticsearch queries
• Raw Term Statistics
• Document Frequency
• Total Term Frequency
• Also max, min, sum (in case of multiple terms, fields)
38. 2. Features for the model
{
"query": {
"match": {
"artist.analyzed": "{{keywords}}"
}
}
}
Two other features are similar for fields “album.analyzed” and “information”
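The `{{keywords}}` placeholder is a Mustache-style template parameter that is filled in with the user's query when features are logged. A minimal sketch of that substitution (the parameter name comes from the slide; the helper function is our own illustration):

```python
import json

# The feature query from the slide, with its Mustache-style placeholder:
feature_template = {"query": {"match": {"artist.analyzed": "{{keywords}}"}}}

def render(template, params):
    """Substitute {{name}} placeholders in a JSON query template."""
    rendered = json.dumps(template)
    for name, value in params.items():
        rendered = rendered.replace("{{%s}}" % name, value)
    return json.loads(rendered)

print(render(feature_template, {"keywords": "u2"}))
```

In practice the Elasticsearch learning-to-rank plugin performs this substitution server-side; each rendered query's score becomes one feature value for the (query, document) pair.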
41. 4. Train and test the model
• Makes use of RankLib
• The following algorithms are available:
• MART, RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART, ListNet, Random Forests
• Supported evaluation metrics to optimise on the training data:
• MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k
• Separate train, validation and test sets can be specified
• Feature sets can be normalised
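For reference, a typical RankLib command line combining the options above (the jar and file names are illustrative; `-ranker 6` selects LambdaMART):

```shell
java -jar RankLib.jar \
  -train train.txt -validate validation.txt -test test.txt \
  -ranker 6 \
  -metric2t NDCG@10 \
  -norm zscore \
  -save lambdamart-model.txt
```

The training file follows the RankLib format shown in the judgement-list slide, extended with `feature:value` columns per line.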
42. Models using RankLib
MART
Multiple Additive Regression Trees, a gradient boosting machine. Can be used for regression as well as classification.
RankNet
Compares two feature vectors using stochastic gradient descent with the help of a cost function; a pairwise approach.
RankBoost
Based on AdaBoost, combining many weak rankings into a single highly accurate ranking; also a pairwise approach.
AdaRank
Combines a number of weak learners in a linear way. Builds on AdaBoost, but is aimed more directly at ranking.
Coordinate Ascent
Optimises one parameter at a time while keeping the others constant, iterating until a convergence criterion is met.
LambdaRank
An optimisation of RankNet that works directly with gradients (lambdas) indicating how far, and in which direction, each document should move in the ranking.
LambdaMART
Combines the gradients of LambdaRank with the boosted regression trees of MART.
ListNet
Uses a listwise loss function, a neural network and gradient descent. Similar to RankNet; the main difference is a list-based instead of a pair-based loss function.
Random Forests
A number of trees vote for the most popular class for a feature vector. A single tree would barely beat a random choice, but a forest does.
Linear Regression
Usually too simplistic for a learning-to-rank problem with many features, but good to have available as a baseline.
43. Evaluation metrics
MAP
Mean Average Precision: the mean over queries of the average precision, taken at the position of each relevant document.
NDCG@k
Normalised DCG: DCG scaled to a value between 0 and 1 by dividing by the highest achievable (ideal) DCG.
DCG@k
Discounted Cumulative Gain: sums the graded relevance of all documents, discounted by position, so documents at the top of the ranking count more.
P@k
Precision at k: the fraction of the top k documents that are relevant.
RR@k
Reciprocal Rank: 1/r, where r is the rank of the first relevant document.
ERR@k
Expected Reciprocal Rank: discounts documents that appear below a very relevant document. (Can be used for listwise comparison.)
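The binary-relevance metrics in this table can be sketched in a few lines (our own illustration, using 0/1 relevance labels in ranked order):

```python
def precision_at_k(relevant, k):
    """P@k: fraction of the top-k results that are relevant (binary 0/1 labels)."""
    return sum(relevant[:k]) / k

def reciprocal_rank(relevant):
    """RR: 1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1 / rank
    return 0.0

ranked = [0, 0, 1, 1, 0]          # binary relevance of results in ranked order
print(precision_at_k(ranked, 5))  # 0.4
print(reciprocal_rank(ranked))    # first relevant result at rank 3 -> 1/3
```

Averaging the reciprocal rank over many queries gives the commonly reported Mean Reciprocal Rank (MRR).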