This document discusses learning to rank search results using machine learning techniques. It covers:
1. Creating a ground truth judgement list by obtaining labelled data from expert panels or implicit user feedback.
2. Defining features for the machine learning model to use, such as term statistics, document fields, and Elasticsearch queries.
3. Logging feature values during search to populate the training data.
4. Training and testing ranking models using algorithms like MART, RankNet, and LambdaRank in the RankLib library.
5. Deploying the trained model and continuing to gather implicit feedback in a feedback loop to improve the model over time.
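As a rough illustration of steps 1 to 4, the sketch below joins a judgement list with logged feature values and emits training lines in the `grade qid:N index:value # comment` text layout that RankLib reads. The judgements, feature values, and the helper name `to_ranklib_lines` are hypothetical, made up for this example rather than taken from the presentation.

```python
from typing import Dict, List, Tuple

# One judgement: (query id, document id, relevance grade). Hypothetical data.
Judgement = Tuple[int, str, int]


def to_ranklib_lines(
    judgements: List[Judgement],
    logged_features: Dict[Tuple[int, str], List[float]],
) -> List[str]:
    """Join judgements with logged feature values, one training line per pair."""
    lines = []
    # Training files group rows by query id; sorted() is stable, so the
    # original order of documents within each query is preserved.
    for qid, doc_id, grade in sorted(judgements, key=lambda j: j[0]):
        values = logged_features[(qid, doc_id)]
        # Feature indices are 1-based in this layout.
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
        lines.append(f"{grade} qid:{qid} {feats} # {doc_id}")
    return lines


judgements = [(1, "doc_a", 4), (2, "doc_c", 2), (1, "doc_b", 0)]
logged_features = {
    (1, "doc_a"): [12.3, 1.0],
    (1, "doc_b"): [0.4, 0.0],
    (2, "doc_c"): [7.5, 1.0],
}
training_lines = to_ranklib_lines(judgements, logged_features)
```

The resulting lines could then be written to a file and split into training and test sets before fitting a model.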
Applied Machine Learning for Ranking Products in an Ecommerce Setting (Databricks)
As a leading fashion e-commerce company in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale product and customer data. A major topic for the data science team is ranking products. If a visitor enters a search phrase, which products best fit the phrase, and in what order should they be shown? Ranking products is also important when a visitor opens a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used throughout the pipeline: retrieving and processing the search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch, and building Tableau dashboards. In this talk, we demonstrate how we use Spark to build the whole product-ranking pipeline and the challenges we faced along the way.
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ... (lucenerevolution)
See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011
Attendees will come away from this presentation with a good understanding of, and access to source code for, boosting and/or filtering documents by recency, popularity, and personal preferences. My solution improves upon the common “recipe”-based solution for boosting by document age. The framework also supports boosting documents by a popularity score, which is calculated and managed outside the index. I will present a few different ways to calculate popularity in a scalable manner. Lastly, my solution supports the concept of a personal document collection, where each user is only interested in a subset of the total number of documents in the index.
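For orientation, the common age-based recipe mentioned above is usually built on Solr's reciprocal function `recip(x, m, a, b)`, which computes `a / (m * x + b)`. The sketch below reproduces that arithmetic in plain Python. It is not the speaker's framework: the multiplicative combination with a popularity factor and the names `recency_boost` and `boosted_score` are assumptions made for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 milliseconds


def recip(x: float, m: float, a: float, b: float) -> float:
    """Reciprocal function a / (m * x + b), mirroring Solr's recip()."""
    return a / (m * x + b)


def recency_boost(age_ms: float) -> float:
    """Boost of 1.0 for a brand-new document, decaying to 0.5 after one year."""
    return recip(age_ms, 1.0 / MS_PER_YEAR, 1.0, 1.0)


def boosted_score(relevance: float, age_ms: float, popularity: float = 1.0) -> float:
    """Hypothetical combination: scale text relevance by recency and popularity."""
    return relevance * recency_boost(age_ms) * popularity
```

With `m` set to roughly one over the number of milliseconds in a year, a document's boost halves after one year and falls to a third after two, so older documents are demoted smoothly rather than cut off.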
Native Support of Prometheus Monitoring in Apache Spark 3.0 (Databricks)
Every production environment requires monitoring and alerting. Apache Spark has a configurable metrics system that lets users report Spark metrics to a variety of sinks. Prometheus is one of the popular open-source monitoring and alerting toolkits used together with Apache Spark.
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Kibana tutorial by Edureka will give you an introduction to the Kibana 5 Dashboard and help you get started with working on the ELK Stack. Below are the topics covered in this Kibana tutorial video:
1. Introduction To ELK Stack
2. Role Of Kibana In ELK
3. Kibana 5 Dashboard
4. Demo: Kibana For Visualization & Analytics
Grafana Loki is a newly developed log aggregation system that integrates well with Grafana dashboards, letting you link metrics with logs or use logs as a separate panel. It is open source and has a growing community.
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
Have you ever wondered how to implement your own operator pattern for your service X in Kubernetes? You can learn this in this session and see an example of an open-source project that spawns Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how the spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into the operator pattern not only in Go but also in Java. The pod with the Spark operator and, optionally, the Spark clusters expose metrics for Prometheus so it makes it eas
Log management has always been a complex topic, and over time various solutions of varying complexity have been tried, often ones that are difficult to integrate into your own application stack. We will give a general overview of the main systems for advanced real-time log aggregation (Fluentd, Graylog, and so on) and explain what led us to choose ELK to meet a need of our client: letting non-technical people consult the logs in a more understandable way.
The ELK stack (Elasticsearch, Logstash, Kibana) lets developers consult logs during debugging and in production without relying on the systems staff. We will show how we deployed the ELK stack and set it up to parse and structure Magento application logs.
So, what is the ELK Stack? "ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.
PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin... (Edureka!)
( ** Deep Learning Training: https://www.edureka.co/ai-deep-learning-with-tensorflow ** )
This Edureka PyTorch Tutorial (Blog: https://goo.gl/4zxMfU) will help you understand the important basics of PyTorch. It also includes a use case in which we will create an image classifier that will predict the accuracy of an image dataset using PyTorch.
Below are the topics covered in this tutorial:
1. What is Deep Learning?
2. What are Neural Networks?
3. Libraries available in Python
4. What is PyTorch?
5. Use-Case of PyTorch
6. Summary
Maxim Fateev - Beyond the Watermark - On-Demand Backfilling in Flink (Flink Forward)
http://flink-forward.org/kb_sessions/beyond-the-watermark-on-demand-backfilling-in-flink/
Flink has consistency guarantees and an efficient checkpointing model, which make it a good fit for Uber’s money-related use cases, such as driver incentives. However, Flink’s time-progress model is built around a single watermark, which is incompatible with Uber’s business need for generating aggregates retroactively. The talk covers our solution for on-demand backfilling. It also outlines other abstractions and features we expect Flink to support as it matures.
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Dense Retrieval with Apache Solr Neural Search.pdf (Sease)
Neural Search is an industry derivation from the academic field of Neural Information Retrieval. More and more frequently, we hear about how Artificial Intelligence (AI) permeates every aspect of our lives, and this also includes software engineering and Information Retrieval.
In particular, the advent of Deep Learning introduced the use of deep neural networks to solve complex problems that could not be solved simply by an algorithm. Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information. Search, in general, comprises four primary steps:
- generate a representation of the query that describes the information need
- generate a representation of the document that captures the information contained in it
- match the query and the document representations from the corpus of information
- assign a score to each matched document in order to establish a meaningful document ranking by relevance in the results.
With the Neural Search module, Apache Solr is introducing support for neural network based techniques that can improve these four aspects of search.
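The four steps above can be sketched with a brute-force vector search in plain Python. This is only an illustration under stated assumptions: the vectors are tiny hand-written stand-ins for embeddings that a neural model would produce, the function names are hypothetical, and a production engine would typically use an approximate nearest-neighbour index rather than scoring every document.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Match the query vector against every document vector and sort by score."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Steps 1 and 2: representations of the query and the documents (toy values).
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
query_vec = [1.0, 0.1]

# Steps 3 and 4: match, score, and order by relevance.
results = rank(query_vec, doc_vecs)
```

Here the query points almost entirely along the first dimension, so document `a` ranks first and the mixed document `c` second.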
Eland: A Python client for data analysis and exploration (Elasticsearch)
Python is a highly adopted language for data science and analysis. Eland is a Python client and toolkit for DataFrames, big data, machine learning, and ETL in Elasticsearch. Get an introduction to Eland with a hands-on demo where you’ll learn about the DataFrame implementation of Eland, as well as how to manage machine learning models.
Extending Flink SQL for stream processing use cases (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real-time applications. Flink SQL provides a SQL interface that implements standard SQL. While standard SQL provides a perfect interface for batch processing, in a stream processing context it can result in ambiguity and complex syntax. As an example, consider these three types of streams: append-only streams, retract streams, and upsert streams. Using standard SQL, we would represent all of these streams as a Table, the same Table concept used in batch processing. Such overloading of concepts can result in ambiguity in SQL statements in a streaming context. In this talk, we will present extensions to Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only syntactic sugar, and users should be able to use Flink SQL as is if they desire.
by
Hojjat Jafarpour
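To make the difference between the three stream types concrete, here is a small plain-Python sketch of how each kind of changelog could be folded into a table. It does not use Flink or its API; the event encodings (`"+"`/`"-"` flags for retractions, `None` as a delete marker for upserts) and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Optional, Tuple


def materialize_append(events: Iterable[tuple]) -> List[tuple]:
    """Append-only stream: every event is a new row and nothing is ever revised."""
    return list(events)


def materialize_retract(events: Iterable[Tuple[str, tuple]]) -> List[tuple]:
    """Retract stream: '+' adds a row, '-' withdraws a previously added row."""
    counts: Counter = Counter()
    for op, row in events:
        counts[row] += 1 if op == "+" else -1
    rows: List[tuple] = []
    for row, count in counts.items():
        rows.extend([row] * max(count, 0))
    return sorted(rows)


def materialize_upsert(
    events: Iterable[Tuple[Hashable, Optional[int]]],
) -> Dict[Hashable, int]:
    """Upsert stream: the latest value per key wins; None deletes the key."""
    table: Dict[Hashable, int] = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


retract_events = [("+", ("alice", 1)), ("-", ("alice", 1)), ("+", ("alice", 2))]
upsert_events = [("alice", 1), ("alice", 2), ("bob", 5), ("bob", None)]

retract_table = materialize_retract(retract_events)
upsert_table = materialize_upsert(upsert_events)
```

Both changelogs end with the same logical state for `alice`, yet they need different interpretation rules, which is the kind of ambiguity that arises when all three are presented as the same Table.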
YouTube Link: https://youtu.be/WvhQhj4n6b8
** Python Certification Training: https://www.edureka.co/python **
This Edureka PPT on 'What is Python?' will help you understand and learn the Python programming language and its features. It is currently one of the most widely adopted programming languages in the industry. Below are the topics covered in this Python Programming tutorial
Building High-Throughput, Low-Latency Pipelines in Kafka (confluent)
William Hill is one of the UK’s largest, most well-established gaming companies, with a global presence across 9 countries and over 16,000 employees. In recent years the gaming industry, and in particular sports betting, has been revolutionised by technology. Customers now demand a wide range of events and markets to bet on, both pre-game and in-play, 24/7. This has driven a business need to process more data, provide more updates and offer more markets and prices in real time.
At William Hill, we have invested in a completely new trading platform using Apache Kafka. We process vast quantities of data from a variety of feeds. This data is fed through a variety of odds compilation models before being piped out to UI apps for use by our trading teams to provide events, markets and pricing data to various end points across the whole of William Hill. We deal with thousands of sporting events, each with sometimes hundreds of betting markets, each market receiving hundreds of updates. This scales up to vast numbers of messages flowing through our system. We have to process, transform and route that data in real time. Using Apache Kafka, we have built a high-throughput, low-latency pipeline based on cloud-hosted microservices. When we started, we were on a steep learning curve with Kafka, microservices and associated technologies. This led to fast learnings and fast failings.
In this session, we will tell the story of what we built, what went well, what didn’t go so well and what we learnt. This is a story of how a team of developers learnt (and are still learning) how to use Kafka. We hope that you will be able to take away lessons on how to build a data processing pipeline with Apache Kafka.
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...NETWAYS
Back in the old days with the ELK Stack, ingesting logs (and other data) was straightforward: Logstash or maybe Fluentd. Today you have a lot more options: Beats have been around for a long time, but Elastic Agent is the hot new thing. And then there is also Open Telemetry, which is growing in use cases. What’s the right choice? This talk gives a quick overview of the current options and their tradeoffs, including some common scenarios and how one or more of the tools can solve your problems.
Learn how to get started with GraphQL, what its main advantage over a REST API is, and where to find more resources. These slides were used in the first of three webinars in our series on GraphQL on 1 December 2017.
In this presentation, we discuss how Elasticsearch handles operations such as insert, update, and delete. We also cover what an inverted index is and how segment merging works.
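As a minimal sketch of the inverted-index idea, the Python below maps each term to the sorted list of document ids that contain it and answers a query by intersecting those lists. It is a toy model, not Elasticsearch's implementation: tokenization is a lowercase whitespace split, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map every term to the sorted ids of the documents that contain it."""
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    posting_sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*posting_sets))


docs = {1: "quick brown fox", 2: "lazy brown dog"}
index = build_inverted_index(docs)
hits = search(index, "brown fox")
```

Looking a term up is a dictionary access instead of a scan over every document, which is what makes this structure suitable for full-text search.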
Building a real time big data analytics platform with solr (Trey Grainger)
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl... (Chester Chen)
GoPro’s cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing team decisions need to be distributed quickly and efficiently. We need to visualize the metrics to find trends or anomalies.
While building up the feature store for machine learning, we need to visualize the features. Google Facets is an excellent project for visualizing features, but can we visualize larger feature datasets?
These are issues we encountered at GoPro as part of the data platform evolution. In this talk, we will discuss some of the progress we have made at GoPro. We will talk about how to use Slack + Plot.ly to deliver analytics metrics and visualizations. We will also discuss our work to visualize large feature sets using Google Facets with Apache Spark.
Product recommendation system for an e-commerce site, by Koby KARP, Data Scientist (Equancy) & Hervé MIGNOT, Partner at Equancy
Recommendation remains a key tool for personalizing e-commerce sites, and the subject is far from exhausted. Accounting for the particularities of a market may require adapting the processing and the algorithms used. After a review of recommendation techniques, we will present the specific approach we adopted. The system was developed on Spark for data preparation and for computing the recommendation models. A simple API and its service were developed to deliver the recommendations to client applications.
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La... (NETWAYS)
At the intersection of search and AI, melding Large Language Models (LLMs) with OpenSearch opens transformative avenues. In this talk, we explore how LLMs can simplify the interaction between users and OpenSearch, converting natural language into OpenSearch queries. We will also leverage OpenSearch’s Vector Storage, enriching traditional term-based searches with semantic understanding. Dive into a future where search engines transcend being mere tools, becoming intuitive partners in knowledge discovery.
Machine learning techniques are powerful, but building and deploying such models for production use requires a lot of care and expertise.
Many books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and to go into detail on how to deploy sustainable machine learning pipelines.
This is our contribution to Data Science projects, as developed in our startup. The material is part of partner trainings and of the in-house design, development, and testing of course material and concepts in Data Science and Engineering. It covers data ingestion, data wrangling, feature engineering, data analysis, data storage, data extraction, querying data, and formatting and visualizing data for various dashboards. Data is prepared for accurate ML model predictions and Generative AI apps.
This is our project work at our startup for Data Science. It is part of our internal training and focuses on data management for AI, ML and Generative AI apps.
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications.
We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine.
We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
For more details:
https://sease.io/2020/04/the-importance-of-online-testing-in-learning-to-rank-part-1.html
https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html
Learning to rank (LTR from now on) is the application of machine learning techniques, typically supervised, in the formulation of ranking models for information retrieval systems.
With LTR becoming more and more popular (Apache Solr has supported it since January 2017 and Elasticsearch has had an open source plugin since 2018), organizations struggle with the problem of how to evaluate the quality of the models they train.
This talk explores all the major points in both Offline and Online evaluation.
Setting up correct infrastructures and processes for a fair and effective evaluation of the trained models is vital for measuring the improvements/regressions of a LTR system.
The talk is intended for:
– Product Owners, Search Managers, Business Owners
– Software Engineers, Data Scientists, and Machine Learning Enthusiasts
Expect to learn :
the importance of Offline testing from a business perspective
how Offline testing can be done with Open Source libraries
how to build a realistic test set from the original data set in input avoiding common mistakes in the process
the importance of Online testing from a business perspective
A/B testing and Interleaving approaches: details and Pros/ Cons
common mistakes and how they can distort the results obtained
Join us as we explore real-world scenarios and dos and don’ts from the e-commerce industry!
Combining machine learning and search through learning to rankJettro Coenradie
In this presentation, we will go through all the steps to use machine learning to improve your search results. We'll discuss the search basics you need to know as well as some machine learning basics. After that, we use a sample application available at the URL https://rolling500.luminis.amsterdam to show improvements using a trained model and the learning to rank plugin in Elasticsearch.
Combining machine learning and search through learning to rankJettro Coenradie
With advanced tools available for search like Solr and Elasticsearch, companies are embedding search in almost all their products and websites. Search is becoming mainstream. Therefore we can focus on teaching the search engine tricks to return more relevant results. One new trick is called "learning to rank". During the presentation, you'll learn what Learning To Rank is, when to apply it and of course, you'll get an example to show how it works using Elasticsearch and a learning to rank plugin. After this presentation, you have learned to combine machine learning models and search.
You are a developer who creates applications that generate logs, and you would like to monitor those logs to check what the application is doing in production. Or you are an operator in need of information about the whole platform: you need logs from the load balancer, proxy, database and the application, and if possible you would like to correlate them. Maybe you are an analyst and would like to create some graphs of the data you obtained. If one of these roles is yours, chances are you have heard about ELK, short for Elasticsearch, Logstash and Kibana. The goal of these projects is to obtain data (Logstash) and store it in a central repository (Elasticsearch) to make it searchable and available for analysis. Having all this data is nice, but making it visible is even better; that is where Kibana comes in. With Kibana you can create dashboards that give insight into your data. ELK is a proven technology stack for handling your logs. During this talk I will present the complete stack. I'll show you how to import data with Logstash, explain what happens in Elasticsearch and create a dashboard using Kibana. I will also discuss some choices you have to make while storing the data and go into a number of possible architectures for the ELK stack. At the end you will have a good idea of what ELK can do for you.
Search: the right tool, but what is the job. At nosqlmatters amsterdam 2013Jettro Coenradie
This presentation is a non-technical overview of what kind of tool a search solution is. Elasticsearch is used to explain the concepts as well as provide a number of jobs that can be performed with a search solution.
This presentation was given at the nosqlmatters road show conference in Amsterdam in October 2013
Creating polyglot and scalable applications on the jvm using Vert.xJettro Coenradie
In this presentation I show the basic vert.x options for creating polyglot applications that scale on the JVM. Later on the live presentation will also be published. It will be in Dutch though.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
Acorn Recovery: Restore IT infra within minutesIP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) by IP ServerOne. A DR solution that helps restore your IT infra within minutes.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
0x01 - Newton's Third Law: Static vs. Dynamic AbusersOWASP Beja
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it to their own needs. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota), was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found out at oe.cd/crps.
This presentation was uploaded with the author’s consent.
17. Recap: Inverted Index
Doc Id  Title
1       Fifa
2       Call of Duty
3       God of War
4       PES
5       Doodle God

Terms   doc_ids  ttf
fifa    1        1
call    2        1
of      2,3      2
duty    2        1
god     3,5      2
war     3        1
pes     4        1
doodle  5        1
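As a minimal sketch of how such an index could be built from the five titles on the slide: the function name is hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption (a real search engine analyzer does considerably more).

```python
from collections import defaultdict

# Titles from the slide, keyed by document id.
TITLES = {1: "Fifa", 2: "Call of Duty", 3: "God of War", 4: "PES", 5: "Doodle God"}


def build_inverted_index(docs):
    """Map each lowercased term to (sorted doc ids, total term frequency)."""
    postings = defaultdict(set)
    ttf = defaultdict(int)
    for doc_id, title in docs.items():
        # Assumption: split on whitespace only, no stemming or stop words.
        for term in title.lower().split():
            postings[term].add(doc_id)
            ttf[term] += 1
    return {term: (sorted(ids), ttf[term]) for term, ids in postings.items()}


index = build_inverted_index(TITLES)
for term, (doc_ids, total) in index.items():
    print(f"{term:8} {doc_ids} {total}")
```

Terms that occur in more than one title, such as "of" and "god", end up with two document ids and a total term frequency of 2.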
18.
{
  "title": "Call of Duty®: Black Ops 4",
  "image": "rs-137178-883f5fe955b2745cd539.jpg",
  "description": "<p>Digital Standard Edition includes: - 1,100 Call of Duty® Points* - Digital Edition Bonus Items: -- Specialist Outfit for all Specialists -- Gesture -- Calling Card, Emblem, Sticker and Tag inspired by the iconic Call of Duty®: Black Ops 4 skull. Black Ops is back! Featuring gritty, grounded, fluid Multiplayer combat, the biggest Zombies offering ever with three full undead adventures at launch, and Blackout, where the universe of Black Ops comes to life in one massive battle royale experience.</p>",
  "rating": 4.5,
  "numberOfRatings": 279,
  "vendor": "Activision",
  "price": 59.99,
  "releaseDate": "2018-12-10",
  "id": 350640
}
25. Learning To Rank
Learning to rank or machine-learned ranking is the application of machine
learning, typically supervised, semi-supervised or reinforcement learning, in the
construction of ranking models for information retrieval systems.
~ Wikipedia
http://bit.ly/ltr-wp
28. Model Evaluation Types (Errors)
• Binary relevance (MAP, Precision)
• Graded relevance, position based (DCG, NDCG)
  • The discount depends only on the position of the document
• Graded relevance, cascade based (ERR)
  • The discount depends on the user's interaction with the results
http://bit.ly/eval-metric
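29. MAP using Binary relevance
Query (X): fifa
Ranked result list (Y), each document marked relevant or not relevant:
Position  Doc  Judgement     Precision
1         23   relevant      1
2         9    not relevant
3         88   relevant      0.67
4         33   not relevant
5         45   relevant      0.6
Average Precision = (1 + 0.67 + 0.6) / 3 = 0.76
MAP is the mean of the Average Precision over all queries; with this single query it is 0.76.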
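The calculation on this slide can be sketched in a few lines. The precision values 1, 0.67 and 0.6 imply that the documents at positions 1, 3 and 5 are the relevant ones, which is what the example list below encodes. The function names are hypothetical, and dividing by the number of relevant documents found in the list (rather than by all relevant documents known for the query) is an assumption that matches the slide.

```python
def average_precision(relevance):
    """Average of precision@k, taken at each rank k that holds a relevant document."""
    hits = 0
    precisions = []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevance_per_query):
    """Mean of the average precision over several queries."""
    aps = [average_precision(r) for r in relevance_per_query]
    return sum(aps) / len(aps) if aps else 0.0


# Binary judgements for the five results of the "fifa" query (1 = relevant).
fifa = [1, 0, 1, 0, 1]
print(round(average_precision(fifa), 2))  # → 0.76
```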
30. DCG using Graded relevance
Query (X): fifa
Documents ranked on 0 - 4 relevance scale
Position  Doc  Relevance  Discounted Cumulative Gain
1         23   rel=3      3
2         9    rel=2      3 + 2/log2(2) = 5
3         88   rel=4      5 + 4/log2(3) = 7.52
4         33   rel=1      7.52 + 1/log2(4) = 8.02
5         45   rel=0      8.02 + 0/log2(5) = 8.02
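A small sketch of this running sum, using the formulation on the slide: the first document is not discounted and the document at position i ≥ 2 is divided by log2(i). Another common variant divides every position by log2(i + 1); the function names and the choice of variant here are assumptions for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rel_1 + sum of rel_i / log2(i) for i >= 2."""
    total = 0.0
    for position, rel in enumerate(relevances, start=1):
        discount = 1.0 if position == 1 else math.log2(position)
        total += rel / discount
    return total


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Grades of the five "fifa" results, in ranked order.
grades = [3, 2, 4, 1, 0]
print(dcg(grades), ndcg(grades))
```

Normalising by the ideal ordering is what makes scores comparable across queries with different numbers of relevant documents.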
32. Learning to rank - Model
[Diagram relating the input X, the Model, the label Y and the prediction Y', annotated with: Algorithm, Parameters, Cost function]
33. LTR: Steps to take
1. Create Judgement List (Ground Truth)
2. Define features for the model
3. Log features during usage
4. Training and testing the model
5. Deploying and using the model
6. Feedback loop
40. Judgement List: Implicit Feedback
• Log user behaviour
• Compare actual clicks versus expected clicks
• A click is not a relevance judgement
41. Judgement List: Implicit Feedback
• Use as a signal to the ranking algorithm -> Feature
• Use as Label to train the model -> Ground truth
42. Using the LTR Plugin and the python scripts
https://github.com/o19s/elasticsearch-learning-to-rank
43. Our Judgement List
# grade (0-4) queryid docId title
#
# Add your keyword strings below, the feature script will
# Use them to populate your query templates
#
# qid:1: fifa
# qid:2: football
# qid:3: call of duty
# qid:4: marvel
# qid:5: basketball
# qid:6: god
#
4 qid:1 # 1538781503000 FIFA 18
2 qid:1 # 1536187840000 EA SPORTS FIFA 16
3 qid:1 # 1538107776000 FIFA 19 Ultimate Edition
4 qid:1 # 1538694141000 FIFA 19
3 qid:1 # 1536293937000 EA SPORTS FIFA 17 Standard Edition
1 qid:1 # 1536022097000 FIFA 15
3 qid:2 # 1538509479000 PRO EVOLUTION SOCCER 2018
4 qid:2 # 1535257488000 Pro Evolution Soccer 2018 FC Barcelona Edition
1 qid:2 # 1536293937000 EA SPORTS FIFA 17 Standard Edition
1 qid:2 # 1538781636000 2MD: VR Football
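To make the file layout concrete, here is a hedged sketch of a parser for a judgement list shaped like the one above. The regular expressions, the `Judgement` type and the function name are illustrative assumptions, not the parsing code of the python scripts mentioned earlier.

```python
import re
from typing import NamedTuple


class Judgement(NamedTuple):
    grade: int
    qid: int
    doc_id: str
    title: str


# "# qid:1: fifa" declares the keywords for a query id.
KEYWORD_LINE = re.compile(r"^#\s*qid:(\d+):\s*(.+)$")
# "4 qid:1 # 1538781503000 FIFA 18" is one graded document.
JUDGEMENT_LINE = re.compile(r"^(\d+)\s+qid:(\d+)\s+#\s+(\S+)\s+(.*)$")


def parse_judgements(text):
    """Return ({qid: keywords}, [Judgement, ...]); other lines are ignored."""
    keywords, judgements = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        match = KEYWORD_LINE.match(line)
        if match:
            keywords[int(match.group(1))] = match.group(2).strip()
            continue
        match = JUDGEMENT_LINE.match(line)
        if match:
            grade, qid, doc_id, title = match.groups()
            judgements.append(Judgement(int(grade), int(qid), doc_id, title.strip()))
    return keywords, judgements


SAMPLE = """\
# grade (0-4) queryid docId title
# qid:1: fifa
# qid:3: call of duty
4 qid:1 # 1538781503000 FIFA 18
1 qid:2 # 1538781636000 2MD: VR Football
"""

keywords, judgements = parse_judgements(SAMPLE)
print(keywords)
print(judgements)
```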
44. 2. features for the model
• Raw Term Statistics
  • Document Frequency
  • Total Term Frequency
  • Also max, min, sum (in case of multiple terms, fields)
• Elasticsearch queries
45. 2. features for the model
{
  "query": {
    "match": {
      "title": "{{keywords}}"
    }
  }
}
{
  "query": {
    "match": {
      "description": "{{keywords}}"
    }
  }
}
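As a rough sketch, the two templated queries could be packaged as one feature set for the LTR plugin. The payload layout (`featureset`, `features`, `params`, `template_language`, `template`) follows my understanding of the plugin's documentation and should be checked against the plugin version in use; the feature names are hypothetical. Note that in this layout the template holds the inner query, without the outer "query" wrapper shown on the slide. The code only builds and prints the JSON and does not call Elasticsearch.

```python
import json


def match_feature(name, field):
    """One feature: a match query on `field`, filled in with the search keywords."""
    return {
        "name": name,  # hypothetical feature name
        "params": ["keywords"],
        "template_language": "mustache",
        "template": {"match": {field: "{{keywords}}"}},
    }


featureset = {
    "featureset": {
        "features": [
            match_feature("title_match", "title"),
            match_feature("description_match", "description"),
        ]
    }
}

print(json.dumps(featureset, indent=2))
```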
51. 4. Train and test model
• Making use of Ranklib
• Can specify separate train, validation and test set
• Can normalise feature sets
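Ranklib reads training data as one line per judged document: the grade, the query id, numbered feature values and an optional comment. A minimal sketch of producing such lines from logged feature values follows; the function name and the feature values are made up for illustration.

```python
def to_ranklib_line(grade, qid, features, doc_id=""):
    """Format one judged document as '<grade> qid:<qid> 1:<f1> 2:<f2> ... # <doc id>'."""
    pairs = " ".join(f"{i}:{value}" for i, value in enumerate(features, start=1))
    comment = f" # {doc_id}" if doc_id else ""
    return f"{grade} qid:{qid} {pairs}{comment}"


# Hypothetical logged values for two features (e.g. title and description match scores).
print(to_ranklib_line(4, 1, [12.3, 4.5], "1538781503000"))
# → 4 qid:1 1:12.3 2:4.5 # 1538781503000
```

Files in this format can then be passed to the Ranklib command line, which has options such as `-train`, `-validate`, `-test` and `-norm` for the separate data sets and feature normalisation mentioned above; consult the Ranklib documentation for the exact usage.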
52. Models using Ranklib
MART
Multiple Additive Regression Trees, a gradient boosting machine. Can be used for
regression as well as classification.
RankNet
Compares pairs of feature vectors, trained with stochastic gradient descent on a pairwise
cost function.
RankBoost
Based on AdaBoost, combining many weak rankings into a single highly accurate ranking.
Uses pairwise comparison.
AdaRank
Combines a number of weak learners in a linear way. Builds on AdaBoost, but directed
more at ranking.
Coordinate Ascent
Optimises one parameter at a time, keeping the others constant. Done iteratively until
some convergence criterion is met.
LambdaRank
Optimisation of RankNet that only looks at the gradients, represented by arrows
indicating how much each document needs to move up or down.
LambdaMART
Combines the gradients of LambdaRank with the use of MART.
ListNet
Uses a listwise loss function, a neural network and gradient descent. Similar to RankNet;
the only difference is the list versus pair loss function.
Random Forests
A number of trees vote for the most popular class for a vector of features. A single tree
would not be better than a random choice, but a forest is.
Linear Regression
Most of the time too simplistic for the learning to rank problem with lots of features, but
good to have available to at least try.
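To illustrate the pairwise idea behind RankNet, and the gradients that LambdaRank works with, here is a minimal sketch of the pairwise cross-entropy loss on two model scores. It is not Ranklib's implementation; the function names are hypothetical, and it assumes document i is known to be more relevant than document j.

```python
import math


def ranknet_pair_loss(score_i, score_j):
    """Loss log(1 + exp(-(s_i - s_j))) for a pair where i should rank above j."""
    diff = score_i - score_j
    # Numerically stable form of log(1 + exp(-diff)).
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


def ranknet_lambda(score_i, score_j):
    """Gradient of the pair loss with respect to score_i (negative: push i up)."""
    diff = score_i - score_j
    return -1.0 / (1.0 + math.exp(diff))


print(ranknet_pair_loss(2.0, 0.0))  # small loss: the pair is ordered correctly
print(ranknet_pair_loss(0.0, 2.0))  # larger loss: the pair is ordered wrongly
print(ranknet_lambda(0.0, 0.0))
```

The gradient is largest in magnitude for badly ordered pairs, which matches the picture of arrows showing how far a document needs to move.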
53. Evaluation metrics
MAP Mean Average Precision: the mean over all queries of the Average Precision, which
averages P@k over the positions k of the relevant documents
DCG@k Discounted Cumulative Gain: adds up the relevance of the top k documents, each
discounted by its position, making the first documents more important
NDCG@k Normalised DCG: a DCG with a value between 0 and 1, normalised by the
highest possible score
P@k Precision at k: the fraction of the top k documents that are relevant
RR@k Reciprocal Rank: 1/r, where r is the position of the first relevant document
ERR@k Expected Reciprocal Rank: discounts documents that are ranked below a very
relevant document
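Three of these metrics can be sketched briefly. The ERR function uses the commonly cited mapping from a grade g to a satisfaction probability, (2^g - 1) / 2^max_grade; that mapping, the 0 to 4 grade scale and the function names are assumptions for illustration rather than a description of Ranklib's code.

```python
def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant (binary judgements)."""
    if k <= 0:
        return 0.0
    return sum(1 for rel in relevance[:k] if rel) / k


def reciprocal_rank(relevance):
    """1 / position of the first relevant result, or 0 if there is none."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0


def expected_reciprocal_rank(grades, max_grade=4):
    """ERR: expected 1/position at which a user scanning top-down is satisfied."""
    err = 0.0
    p_reach = 1.0  # probability that the user gets as far as this position
    for position, grade in enumerate(grades, start=1):
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / position
        p_reach *= 1.0 - p_satisfied
    return err


print(precision_at_k([1, 0, 1, 0, 1], 3))
print(reciprocal_rank([0, 0, 1]))
print(expected_reciprocal_rank([3, 2, 4, 1, 0]))
```

Because a highly graded document near the top absorbs most of the probability mass, documents below it contribute little to ERR.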
58. 6. Feedback Loop
• Register clicks by users (and other events)
• Use click data for predicting labels
59. Click models
• Random Click Model -> Every document has the same chance of being
clicked
• Click Through Rate Model -> Uses the fact that the first document is
clicked far more than the second document
• Cascade Model -> A click in the third item also tells us something about
the first two items. Only one click per session is assumed.
• Dynamic Bayesian Network Model -> Supports multiple clicks in a search
session and the difference in actual relevance of a document.
http://bit.ly/click-model
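As an illustration of the Cascade Model, the sketch below estimates how attractive each document is from logged sessions. It assumes at most one click per session, that everything up to and including the clicked position was examined, and that nothing below it was; in a session without a click, every shown document counts as examined. The session format and function name are hypothetical.

```python
from collections import defaultdict


def estimate_attractiveness(sessions):
    """sessions: list of (ranked_doc_ids, clicked_position or None), 1-based positions."""
    examined = defaultdict(int)
    clicked = defaultdict(int)
    for docs, click_pos in sessions:
        # Documents below the click are assumed not to have been looked at.
        last_seen = click_pos if click_pos is not None else len(docs)
        for position, doc in enumerate(docs[:last_seen], start=1):
            examined[doc] += 1
            if position == click_pos:
                clicked[doc] += 1
    return {doc: clicked[doc] / examined[doc] for doc in examined}


sessions = [
    (["a", "b", "c"], 3),     # a and b were skipped, c was clicked
    (["a", "b", "c"], 1),     # a was clicked, b and c were never examined
    (["a", "b", "c"], None),  # nothing was clicked
]
print(estimate_attractiveness(sessions))
```

This is how a click on the third item also says something about the first two: they were examined and passed over.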
60. Dynamic Bayesian Network
[Diagram: chain of examination nodes Ei-1, Ei, Ei+1; nodes Ai, Ci and Si for position i; parameters au and su]
http://bit.ly/dbn-clickmodel
Ei - Did the user examine the url
Ai - Was the user attracted by the url
Ci - Did the user click the url
Si - Was the user satisfied with the landing page
au - Probability of being attracted by the url
su - Probability of being satisfied by landing page
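Under these definitions, the probability of examining and clicking each position can be sketched as follows. The sketch assumes the user always examines the first result, clicks only when examining and attracted, stops once satisfied, and otherwise moves on to the next result with probability gamma. The gamma parameter is not on the slide; it is an assumption here, and gamma = 1 means the user never abandons without being satisfied.

```python
def dbn_click_probabilities(attractiveness, satisfaction, gamma=1.0):
    """Return [(P(Ei=1), P(Ci=1)), ...] for a ranked list of urls.

    attractiveness[i] is au and satisfaction[i] is su for the url at position i.
    """
    p_examine = 1.0  # the first position is always examined
    result = []
    for a_u, s_u in zip(attractiveness, satisfaction):
        p_click = p_examine * a_u
        result.append((p_examine, p_click))
        # The user continues only if not satisfied here (probability 1 - au * su).
        p_examine *= gamma * (1.0 - a_u * s_u)
    return result


print(dbn_click_probabilities([0.5, 0.5], [0.5, 0.5]))
# → [(1.0, 0.5), (0.75, 0.375)]
```

Unlike the Cascade Model, a click does not end the session here: a user who clicks but is not satisfied keeps scanning, which is how multiple clicks per session are supported.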