Target Leakage in Machine Learning (ODSC East 2020)
Yuriy Guts
Target leakage is one of the most difficult problems in developing real-world machine learning models. Leakage occurs when the training data gets contaminated with information that will not be known at prediction time. Leakage can creep in at multiple stages, from data collection and feature engineering to partitioning and model validation. As a result, even experienced data scientists can inadvertently introduce leaks and become overly optimistic about the performance of the models they deploy. In this talk, we will walk through real-life examples of data leakage at different stages of the data science project lifecycle, and discuss countermeasures and best practices for model validation.
Talk from QCon SF on 2018-11-05
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of each of our members at the right time. With a catalog spanning thousands of titles and a diverse member base of over a hundred million accounts, recommending the titles that are just right for each member is crucial. But the job of recommendation does not end there. Why should you care about any particular title we recommend? What can we say about a new and unfamiliar title that will pique your interest? How do we convince you that a title is worth watching? Answering these questions is critical in helping our members discover great content, especially for unfamiliar titles. One way to do this is to consider the artwork or imagery we use to visually portray each title. If the artwork representing a title captures something compelling to you, then it acts as a gateway into that title and gives you some visual "evidence" for why the title might be good for you. Selecting good artwork is important because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we show for each title on the Netflix homepage. We will look at how to frame this as a machine learning problem using contextual multi-armed bandits in a recommendation system setting. We will also describe the algorithmic and system challenges involved in getting this type of approach for artwork personalization to succeed at Netflix scale. Finally, we will discuss some of the future opportunities that we see to expand and improve upon this approach.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes "best" interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we've faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I'll offer general principles and concrete examples of how to interview data scientists. I'll also touch on the challenges of sourcing and closing top candidates.
Interactive Recommender Systems with Netflix and Spotify
Chris Johnson
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, such a system sits between a lean-back recommendation experience and an active search for a specific piece of content. Beyond this positioning, we will discuss several aspects that are especially important for interactive recommender systems: the design of the user interface and its tight integration with the back-end algorithm; the computational efficiency of the recommender algorithm; and the right balance between exploiting the user's feedback to provide relevant recommendations and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user experiences we built, discuss the approaches followed to tackle the various aspects of interactive recommendations, and present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
Lessons Learned from Building Machine Learning Software at Netflix
Justin Basilico
Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.
Abstract:
Building a real system that incorporates machine learning can be a difficult effort, in terms of both the algorithmic and the engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we've encountered in developing real machine learning systems at Netflix and some of the lessons we've learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements: handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and staying flexible enough to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users, and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we've faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we've found to be useful across a wide variety of machine-learned systems.
Held on Friday, September 22, 2017
A study session on CyberAgent's data analysis platform and data utilization, and the technologies behind them: "Data Engineering and Data Analysis Workshop #2"
Artwork Personalization at Netflix (RecSys 2018)
Fernando Amat
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of our members at the right time. But the job of recommendation does not end there. The homepage should be able to convey to the member enough evidence of why a title may be good for her, especially for shows that the member has never heard of. One way to address this challenge is to personalize the way we portray the titles on our service. An important aspect of how to portray titles is through the artwork or imagery we display to visually represent each title. The artwork may highlight an actor that you recognize, capture an exciting moment like a car chase, or contain a dramatic scene that conveys the essence of a movie or show. It is important to select good artwork because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we use on the Netflix homepage. The system selects an image for each member and video to give better visual evidence for why the title might be appealing to that particular member.
These are the slides of my talk at the 2019 Netflix Workshop on Personalization, Recommendation and Search (PRS). This talk is based on previous talks on research we are doing at Spotify, but here I focus on the work we do on personalizing Spotify Home, with respect to success, intent & diversity. The link to the workshop is https://prs2019.splashthat.com/. This is research from various people at Spotify, and has been published at RecSys 2018, CIKM 2018 and WWW (The Web Conference) 2019.
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
Ehsan38
Ensuring the health of a modern large-scale recommendation system is a very challenging problem. To address this, we need to put proper logging in place, use sophisticated exploration policies, develop ML-interpretability tools, or even train new ML models to predict and detect issues in the main production model. In this talk, we shine a light on this less-discussed but important area and share some of the best practices, called RecSysOps, that we've learned while operating our increasingly complex recommender systems at Netflix. RecSysOps is a set of best practices for identifying issues and gaps, as well as diagnosing and resolving them, in a large-scale machine-learned recommender system. RecSysOps helped us to 1) reduce production issues, 2) increase recommendation quality by identifying areas of improvement, and 3) bring new innovations to our members faster by letting us spend more of our time on innovation and less on debugging and firefighting.
https://dl.acm.org/doi/10.1145/3460231.3474620
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Scaling AutoML-Driven Anomaly Detection With Luminaire
Databricks
Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source Python library for hands-off time-series anomaly detection. Luminaire provides a suite of models and built-in AutoML capabilities, which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and the performance of the Luminaire package in terms of detection and prediction accuracy as well as runtime efficiency.
While the adoption of machine learning and deep learning techniques continue to grow, many organizations find it difficult to actually deploy these sophisticated models into production. It is common to see data scientists build powerful models, yet these models are not deployed because of the complexity of the technology used or lack of understanding related to the process of pushing these models into production.
As part of this talk, I will review several deployment design patterns for both real-time and batch use cases. I'll show how these models can be deployed as scalable, distributed deployments within the cloud, scaled across Hadoop clusters, served as APIs, and deployed within streaming analytics pipelines. I will also touch on topics related to security, end-to-end governance, pitfalls, challenges, and useful tools across a variety of platforms. This presentation will involve demos and sample code for the deployment design patterns.
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
Jasjeet Thind
DevOps and Machine Learning: How do you test and deploy real-time machine learning services, given that machine learning algorithms produce nondeterministic behavior even for the same input?
Reviewing progress in the machine learning certification journey
Special Edition - Short tech talk on How to Network by Qingyue (Annie) Wang
Content review on AI and ML on Google Cloud by Margaret Maynard-Reid
A focused content review on ML problem framing, model evaluation, and fairness by Sowndarya Venkateswaran.
A discussion on sample questions to aid certification exam preparation.
An interactive Q&A session to clarify doubts and questions.
Previewing next steps and topics, including course completions and material reviews.
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Databricks
Current workshop diagnostics are based on manually-generated decision trees. This approach is increasingly reaching its limits due to growing variant diversity and the increasing complexity of vehicle systems. This session will describe BMW's new Apache Spark-enabled approach: use the available data from cars and workshops to train models that are able to predict the right part to switch, or the action to take.
You'll get an overview and presentation of BMW's complete pipeline, including ETL, model training based on Spark 2.1, serializing results along with metadata, and serving the gained insights as a web app. You'll also hear how Spark helped BMW leverage the information from millions of observations and thousands of features, learn what pitfalls they experienced (e.g. setting up a working dev-toolchain, working with 50K features, parallelizing well), and how you can avoid them.
Model Monitoring at Scale with Apache Spark and Verta
Databricks
For any organization whose core product or business depends on ML models (think Slack search, Twitter feed ranking, or Tesla Autopilot), ensuring that production ML models are performing with high efficacy is crucial. In fact, according to the McKinsey report on model risk, defective models have led to revenue losses of hundreds of millions of dollars in the financial sector alone. However, in spite of the significant harms of defective models, tools to detect and remedy model performance issues for production ML models are missing.
Based on our experience building ML debugging and robustness tools at MIT CSAIL and managing large-scale model inference services at Twitter, Nvidia, and now at Verta, we developed a generalized model monitoring framework that can monitor a wide variety of ML models, work unchanged in batch and real-time inference scenarios, and scale to millions of inference requests. In this talk, we focus on how this framework applies to monitoring ML inference workflows built on top of Apache Spark and Databricks. We describe how we can supplement the massively scalable data processing capabilities of these platforms with statistical processors to support the monitoring and debugging of ML models.
Learn how ML Monitoring is fundamentally different from application performance monitoring or data monitoring. Understand what model monitoring must achieve for batch and real-time model serving use cases. Then dig in with us as we focus on the batch prediction use case for model scoring and demonstrate how we can leverage the core Apache Spark engine to easily monitor model performance and identify errors in serving pipelines.
Customer Segmentation with R - Deep Dive into flexclust
Jim Porzak
Jim Porzak's presentation at useR! 2015 in Aalborg, Denmark. Learn how to segment customers based on stated-interest surveys using the flexclust package in R. Covers basic customer segmentation concepts, an introduction to flexclust, and solutions to three practical issues: the numbering problem, the stability problem, and the best choice of k.
Demystifying Data Science Webinar - February 14, 2018
Analytics8
In this webinar, we talked about data science and machine learning. Getting started is not as hard as you may think, and once you do, you'll see immediate business value.
[Tutorial] building machine learning models for predictive maintenance applic...
PAPIs.io
This talk introduces the landscape and challenges of predictive maintenance applications in industry, illustrates how to formulate the problem (data labeling and feature engineering) with three machine learning models (regression, binary classification, multi-class classification) using a publicly available aircraft engine run-to-failure data set, and showcases how the models can be conveniently trained and compared with different algorithms in Azure ML.
Business Applications of Predictive Modeling at Scale
Songtao Guo
Tutorial delivered in KDD 2016 San Francisco
Abstract
Predictive modeling is the art of building statistical models that forecast probabilities and trends of future events. It has broad applications in industry across different domains. Some popular examples include user intention predictions, lead scoring, churn analysis, etc. In this tutorial, we will focus on the best practice of predictive modeling in the big data era and its applications in industry, especially sales and marketing. We will start with an overview of how predictive modeling helps power and drive various key business use cases. We will introduce the essential concepts and state of the art in building end-to-end predictive modeling solutions, and discuss the challenges, key technologies, and lessons learned from our practice, followed by a case study. Moreover, we will discuss some practical solutions of building predictive modeling platform to scale the modeling efforts for data scientists and analysts, along with an overview of popular tools and platforms used across the industry.
Target Audience and Prerequisites
This tutorial is suitable for researchers, students, and practitioners of predictive modeling who are interested in the industry applications. Advanced techniques in data mining and statistical modeling are not required but some background in statistics and big data is expected.
A tremendous backlog of predictive modeling problems in industry and a short supply of trained data scientists have spiked interest in automation over the last few years. A new academic field, AutoML, has emerged. However, there is a significant gap between the topics that are academically interesting and the automation capabilities that are necessary to solve real-world industrial problems end-to-end. An even greater challenge is enabling a non-expert to build a robust and trustworthy AI solution for their company. In this talk, we'll discuss what an industry-grade AutoML system consists of and the scientific and engineering challenges of building it.
Similar to Target Leakage in Machine Learning
Paraphrase detection is an academically challenging NLP problem of detecting whether multiple phrases have the same meaning. In this talk, we'll go through the existing traditional and deep learning approaches for this task, and see how they apply in practice as a silver-winning solution to the popular Kaggle Quora Question Pairs competition.
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
Yuriy Guts
Whenever a functional language is mentioned, everyone talks about typical applications like DSLs, parsing, data mining, or scientific computing. But what about mainstream consumer-facing applications? Can you build your next project with Scala? What if your system must run in the cloud, on desktop computers, and on specialized hardware: can you still do it? We'll share our experience of implementing such a system for a German fitness startup, one that utilized custom-built devices, biometrics, cross-platform integration, and, of course, Scala. We'll discuss the challenges we ran into and the pitfalls you can avoid when your team decides to go functional.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Richard's entangled adventures in wonderland
Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction, vascularization, and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo, and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
What are greenhouse gases and how many gases are there that affect the Earth?
moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are the weather and the climate affected?
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is expanding quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
2. Machine Learning Engineer
Automated ML, time series forecasting, NLP
Lecturer
AI, Machine Learning, Summer/Winter ML Schools
Compete sometimes
Currently hold an Expert rank, top 2% worldwide
4. Leakage in a Nutshell
[Timeline diagram: "data known at prediction time" before the prediction point, "data unknown at prediction time" after it; "contamination" marks future data bleeding into the training data.]
5. Leakage in a Nutshell
[Same timeline diagram as slide 4.]
Training on contaminated data leads to overly optimistic expectations about model performance in production.
6. "But I always validate on random K-fold CV. I should be fine, right?"
10. Where is the leakage?
EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD
315981 Data Scientist 3 5,000.00 78,895.44
4691 Data Scientist 4 5,500.00 86,784.98
23598 Data Scientist 5 6,200.00 97,830.35
11. Target is a function of another column
EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD
315981 Data Scientist 3 5,000.00 78,895.44
4691 Data Scientist 4 5,500.00 86,784.98
23598 Data Scientist 5 6,200.00 97,830.35
The target can have different formatting or measurement units in different columns.
Forgetting to remove the copies will introduce target leakage.
Check out the example: example-01-data-collection.ipynb
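A minimal sketch of how such a copy can be caught (toy values from the slide's table; the 0.999 correlation threshold is an illustrative assumption, not something the talk prescribes):

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeID": [315981, 4691, 23598],
    "ExperienceYears": [3, 4, 5],
    "MonthlySalaryGBP": [5000.00, 5500.00, 6200.00],
    "AnnualIncomeUSD": [78895.44, 86784.98, 97830.35],  # target
})

# A column that is just a unit-converted copy of the target shows up
# as a (near-)perfect correlation with it.
corr = df.corr(numeric_only=True)["AnnualIncomeUSD"].drop("AnnualIncomeUSD")
print(corr[corr.abs() > 0.999])  # flags MonthlySalaryGBP -> drop before training
```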
12. Where is the leakage?
SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender
24092091 M18-25 15.31 25 135.10 0
4092034091 F40-60 35.81 3 5.01 1
329815 F25-40 13.09 32 128.52 1
94721835 M25-40 18.52 21 259.34 0
13. Feature is an aggregate of the target
SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender
24092091 M18-25 15.31 25 135.10 0
4092034091 F40-60 35.81 3 5.01 1
329815 F25-40 13.09 32 128.52 1
94721835 M25-40 18.52 21 259.34 0
E.g., the data can have derived columns created after the fact for reporting purposes
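A quick sketch that exposes this particular leak (the M/F-to-0/1 mapping is read off the toy table and is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "SubscriberID": [24092091, 4092034091, 329815, 94721835],
    "Group": ["M18-25", "F40-60", "F25-40", "M25-40"],
    "Gender": [0, 1, 1, 0],  # target (assumed: 0 = male, 1 = female)
})

# The reporting column "Group" embeds the target: its first character
# is the subscriber's gender, so a model using it will look perfect.
recovered = df["Group"].str[0].map({"M": 0, "F": 1})
print((recovered == df["Gender"]).all())  # True -> drop Group or strip the prefix
```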
14. Where is the leakage?
Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan
1 Y 80k Car Purchase 0 0
3 N 120k Small Business 3 1
1 Y 85k House Purchase 5 1
2 N 72k Marriage 1 0
15. Mutable data due to lack of snapshot-ability
Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan
1 Y 80k Car Purchase 0 0
3 N 120k Small Business 3 1
1 Y 85k House Purchase 5 1
2 N 72k Marriage 1 0
Database records get overwritten as more facts become available.
But these later facts won't be available at prediction time.
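The talk's countermeasure slides are not reproduced here, but one common mitigation, sketched below with a hypothetical schema, is to rebuild each feature from an immutable event log as of the row's prediction time instead of reading the mutable column:

```python
import pandas as pd

# Immutable event log: one row per reminder actually sent (hypothetical schema).
reminders = pd.DataFrame({
    "LoanID": [1, 2, 2, 2, 3],
    "SentAt": pd.to_datetime(
        ["2019-03-01", "2019-02-10", "2019-04-05", "2019-06-20", "2019-05-11"]),
})

# Loans with the timestamp at which each would have been scored.
loans = pd.DataFrame({
    "LoanID": [1, 2, 3],
    "PredictionTime": pd.to_datetime(["2019-02-01", "2019-03-01", "2019-06-01"]),
})

# Count only reminders that existed before the prediction time, rather than
# reading the current (overwritten) LatePaymentReminders value.
merged = loans.merge(reminders, on="LoanID", how="left")
merged["before_cutoff"] = merged["SentAt"] < merged["PredictionTime"]
print(merged.groupby("LoanID")["before_cutoff"].sum())
```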
17. My model is sensitive to feature scaling...
[Pipeline diagram: Dataset -> StandardScaler -> Partition into CV/Holdout -> Train / Evaluate]
18. My model is sensitive to feature scaling...
[Same pipeline diagram as slide 17.]
OOPS. WE'RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET
Check out the example: example-02-data-prep.ipynb
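The referenced notebook is not reproduced here; a minimal sketch of the anti-pattern on synthetic data might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

# ANTI-PATTERN: the scaler sees the full dataset, so the test rows'
# mean and std leak into the features the model will train on.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)
```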
19. Removing leakage in feature engineering
[Pipeline diagram: Dataset -> Partition -> train and eval splits; StandardScaler is fit on train, and the mean and std obtained on train are applied to transform eval.]
Obtain feature engineering/transformation parameters only on the training set.
Apply them to transform the evaluation sets (CV, holdout, backtests, ...).
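In scikit-learn this discipline is easiest to enforce with a Pipeline, which refits the scaler on the training portion of every fold automatically (a sketch with synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

# The scaler is fit inside each training fold only; the train-derived
# mean and std are then applied to that fold's validation part.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
```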
20. Encoding of different variable types
Text: learn the document-term matrix (DTM) columns from the training set only, then transform the evaluation sets (avoid leaking possible out-of-vocabulary words into the training pipeline).
Categoricals: create mappings on the training set only, then transform the evaluation sets (avoid leaking cardinality/frequency info into the training pipeline).
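A small sketch of both rules in scikit-learn (the vocabulary and categories below are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OrdinalEncoder

train_docs = ["good film", "bad film"]
eval_docs = ["great film"]               # "great" is out-of-vocabulary

vec = CountVectorizer().fit(train_docs)  # DTM columns learned on train only
X_eval_text = vec.transform(eval_docs)   # unseen words are simply ignored

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["red"], ["blue"]])             # category mapping from train only
X_eval_cat = enc.transform([["green"]])  # unseen category maps to -1
```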
28. Leakage in oversampling / augmentation
[Pipeline diagram: Original Data -> Oversampling or Augmentation -> Random Partitioning -> Train / Evaluate]
29. Leakage in oversampling / augmentation
[Second pipeline added: Pairwise Data (features for A and B) -> Augmentation (AB -> BA) -> Random Partitioning -> Train / Evaluate]
30. Leakage in oversampling / augmentation
[Same diagrams as slide 29.]
OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
31. Leakage in oversampling / augmentation
[Same diagrams as slide 29.]
First partition, then augment the training data.
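A sketch of the safe ordering, using plain duplication of the minority class as the oversampling step (all names and sizes are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

# 1) Partition first, so no augmented copy can land on both sides of the split.
train, test = train_test_split(df, stratify=df["y"], random_state=0)

# 2) Then oversample the minority class inside the training set only.
majority = train[train["y"] == 0]
minority = train[train["y"] == 1]
upsampled = resample(minority, replace=True, n_samples=len(majority),
                     random_state=0)
train_balanced = pd.concat([majority, upsampled])
```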
36. Reusing a CV split for multiple tasks
[Diagram: a 5-fold cross-validation layout with validation (V) and holdout (H) chunks marked in each of Fold 1 through Fold 5.]
Feature selection, hyperparameter tuning, model selection...
37. Reusing a CV split for multiple tasks
[Same diagram as slide 36.]
Feature selection, hyperparameter tuning, model selection...
OOPS. CAN OVERTUNE TO THE VALIDATION SET
BETTER USE DIFFERENT SPLITS FOR DIFFERENT TASKS
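One standard way to keep the splits separate is nested cross-validation: tune on inner splits, score on outer splits the tuner never saw (a sketch with synthetic data and an illustrative hyperparameter grid):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.integers(0, 2, size=500)

# The inner split picks C; the outer split evaluates the tuned model
# on folds the tuning procedure never touched.
inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)
tuner = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=inner)
scores = cross_val_score(tuner, X, y, cv=outer)
```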
38. Model stacking on in-sample predictions
[Diagram: Model 1 and Model 2 are each trained and then predict on the same data they were trained on; the 2nd-level model trains on those in-sample predictions.]
39. Model stacking on in-sample predictions
[Same diagram as slide 38.]
OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
40. Better way to stack
[Diagram: Model 1 and Model 2 each predict on held-out validation folds; the 2nd-level model trains on those out-of-fold predictions.]
Compute all meta-features only out-of-fold.
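With scikit-learn, cross_val_predict yields exactly these out-of-fold meta-features (a sketch with two illustrative base models; StackingClassifier performs the same bookkeeping internally):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.integers(0, 2, size=500)

# Each row's meta-feature comes from a model that never saw that row.
meta1 = cross_val_predict(LogisticRegression(), X, y, cv=5,
                          method="predict_proba")[:, 1]
meta2 = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5,
                          method="predict_proba")[:, 1]
second_level = LogisticRegression().fit(np.column_stack([meta1, meta2]), y)
```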
42. Case Studies
- Removing user IDs does not necessarily mean data anonymization (Kaggle: Expedia Hotel Recommendations, 2016)
- Anonymizing feature names does not mean anonymization either (Kaggle: Santander Value Prediction competition, 2018)
- Target can sometimes be recovered from the metadata or external datasets (Kaggle: Dato "Truly Native?" competition, 2015)
- Overrepresented minority class in pairwise data enables reverse engineering (Kaggle: Quora Question Pairs competition, 2017)
44. Leakage prevention checklist (not exhaustive!)
- Split the holdout away immediately and do not preprocess it in any way before final model evaluation.
- Make sure you have a data dictionary and understand the meaning of every column, as well as unusual values (e.g. negative sales) or outliers.
- For every column in the final feature set, try answering the question: "Will I have this feature at prediction time in my workflow? What values can it have?"
- Figure out preprocessing parameters on the training subset, freeze them elsewhere.
- Treat feature selection, model tuning, and model selection as separate "machine learning models" that need to be validated separately.
- Make sure your validation setup represents the problem you need to solve with the model.
- Check feature importance and prediction explanations: do top features make sense?