Eliciting systematic bias in probabilistic assessments through statistical comparison of assessments with outcomes. My presentation to a workshop on risk and uncertainty at the EAGE conference in Copenhagen in June.
Bias and overconfidence in oil and gas exploration
1.
2. Optimism and confidence in the oil industry
Graeme Keith MA PhD FIMA
Gentleman of Leisure
(On gardening leave following Total’s acquisition of Maersk Oil)
4. Aggregate sequences to reduce the range of reasonable outcomes
• Introduce some unreasonable outcomes
• Study sequence mean and variance
Validate sequences, not prospects
• Give up inference on individual prospects
• Open up for inference on systematic biases
Infer probabilities by fitting biases to data
• Make eliciting systematic biases the object of study
• Mode matching, maximum likelihood, Bayesian inference
5. The ubiquitous, iniquitous accumulating sequence plot
• Tradition is to cumulate sequences
• The purpose of accumulating is to reduce the range of reasonable outcomes, so all sequence plots are meaningless without some sense of the range…
6. The ubiquitous accumulating sequence plot with 80% confidence interval
• Simulate distributions (dashed)
• Or use the Central Limit Theorem (solid green)
• Vertical scale set by number of discoveries and not by deviations…
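As a concrete illustration of the band described above, here is a minimal sketch (Python, with a hypothetical array of assessed probabilities) of both routes the slide mentions: Monte Carlo simulation of the sequence of wells, and the Central Limit Theorem approximation.

```python
import numpy as np

def accumulation_band(p, n_sims=10_000, level=0.80, seed=0):
    """80% band for cumulative discoveries implied by assessed probabilities p.

    Returns (sim_lo, sim_hi, clt_lo, clt_hi), each with one entry per well:
    the simulated band and the Central-Limit-Theorem approximation."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    # Monte Carlo: each well is an independent Bernoulli(p_i); accumulate successes
    sims = rng.random((n_sims, p.size)) < p
    cum = np.cumsum(sims, axis=1)
    alpha = (1.0 - level) / 2.0
    sim_lo, sim_hi = np.quantile(cum, [alpha, 1.0 - alpha], axis=0)
    # CLT: mean and variance of a cumulative sum of independent Bernoullis
    mean = np.cumsum(p)
    sd = np.sqrt(np.cumsum(p * (1.0 - p)))
    z = 1.2816   # 90th percentile of the standard normal -> two-sided 80% band
    return sim_lo, sim_hi, mean - z * sd, mean + z * sd

# Hypothetical sequence of assessed probabilities of success
p_assessed = np.array([0.15, 0.30, 0.25, 0.50, 0.10, 0.40, 0.20, 0.35])
print(accumulation_band(p_assessed))
```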
7. Accumulating sequence plot showing success rates (sequence mean)
• Moving to the mean (success rate) shows reduction of uncertainty
• False time evolution – some of the trend is due to changes in sequence length, not comparing like with like
• Every point on the accumulation plot has a different sequence length
8. Sliding window plot showing success rates
• Pick a window length and slide it along the sequence
• Here the window length is 21 in a sequence of 125 wells
• First point is the sequence from well 1 to well 21
• Second is from well 2 to 22
• Last is from 105 to 125
• Use range to set window length
• Long enough to see errors, short enough to see time evolution
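A minimal sketch of the sliding-window calculation, assuming hypothetical arrays of 0/1 outcomes and assessed probabilities: the window mean of the assessed probabilities gives the predicted success rate to compare against the realized one.

```python
import numpy as np

def sliding_window_rates(outcomes, probs, window=21):
    """Realized vs. predicted success rate in a window slid along the sequence.

    outcomes: 0/1 results per well; probs: assessed probabilities of success.
    Returns (realized, predicted), one value per window position."""
    outcomes = np.asarray(outcomes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    kernel = np.ones(window) / window
    realized = np.convolve(outcomes, kernel, mode="valid")   # window means of outcomes
    predicted = np.convolve(probs, kernel, mode="valid")     # window means of assessed p
    return realized, predicted

# Hypothetical data: 125 wells
rng = np.random.default_rng(1)
p = rng.uniform(0.1, 0.5, size=125)
x = (rng.random(125) < p).astype(int)
realized, predicted = sliding_window_rates(x, p, window=21)
print(realized[:3], predicted[:3])   # first window: wells 1-21, second: wells 2-22, ...
```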
9. All the same considerations apply to volumes, including and not including failures
10. The mean is not the only sequence property we can scrutinize
• Sample variance = variance in prospect means + mean of prospect variances
• Prospect variance = p(1−p)
• Variance measures uncertainty, high when you don’t know anything
• Underprediction shows over-confidence
• Overprediction shows vagueness
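The decomposition above can be turned into a simple calibration check: compare the realized sample variance of the 0/1 outcomes with the variance predicted from the assessed probabilities. A sketch, assuming hypothetical outcome and probability arrays:

```python
import numpy as np

def variance_check(outcomes, probs):
    """Compare the realized sample variance of 0/1 outcomes with the variance
    predicted from the assessed probabilities:
        predicted = var(p_i) + mean(p_i * (1 - p_i))
    Realized above the prediction suggests over-confidence; below it, vagueness."""
    x = np.asarray(outcomes, dtype=float)
    p = np.asarray(probs, dtype=float)
    predicted = np.var(p) + np.mean(p * (1.0 - p))   # variance of means + mean of variances
    return np.var(x), predicted
```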
11. Similar considerations apply to volume variance, here restricted to discoveries
• Sample variance = variance in prospect means + mean of prospect variances
• Prospect variance
• Variance measures uncertainty, high when you don’t know anything
• Underprediction shows over-confidence
• Overprediction shows vagueness
• But distributions are highly skewed
12. Traditionally assess variance (confidence) through other visualizations, like the iniquitous probability plot
• Binning probabilities is another way of looking at confidence
• Actually just sequence aggregation
• Meaningless without distributions
(Probability bins: 0–20, 20–40, 40–60, 60–80, 80–100%)
13. Probability plot with 80% confidence interval
• Binning probabilities is another way of looking at confidence
• Actually just sequence aggregation
• Meaningless without distributions
• Distribution of prospects within bins is not usually uniform, especially for the highest and lowest
• Calculate distributions for the actual prospects in the bin
• 0.8⁵ ≈ 33% chance that all of the bins fall in the 80% CI!
• Ranges are a function both of probability (high and low probabilities have smaller ranges) and of the number of prospects in the bin
(Probability bins: 0–20, 20–40, 40–60, 60–80, 80–100%)
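One way to put numbers behind these bullets is to simulate the actual prospects in each probability bin, since their assessed probabilities are generally not uniform within the bin. A minimal sketch with hypothetical inputs:

```python
import numpy as np

def bin_intervals(probs, outcomes, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                  level=0.80, n_sims=10_000, seed=0):
    """Per probability bin: realized discoveries vs. an interval simulated from
    the actual prospects in the bin (their p_i need not be uniform in the bin)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(probs, dtype=float)
    x = np.asarray(outcomes, dtype=int)
    alpha = (1.0 - level) / 2.0
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & (p < hi)
        if not in_bin.any():
            continue
        pb = p[in_bin]
        # Simulate the number of discoveries among exactly these prospects
        sims = (rng.random((n_sims, pb.size)) < pb).sum(axis=1)
        lo_q, hi_q = np.quantile(sims, [alpha, 1.0 - alpha])
        results.append((f"{lo:.0%}-{hi:.0%}", x[in_bin].sum(), lo_q, hi_q))
    return results
```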
14. Similar approach to looking at confidence and variance in volume results
• Distribution of percentiles should be uniform; the percentile plot is one (rather crude) way of examining this
• Meaningless without distributions
(Percentile bins: >P20, P40–P20, P60–P40, P80–P60, P100–P80)
15. Percentile plot with 80% confidence interval
• Each prospect has a 20% chance of falling in a given bin
• Distribution of the count in a given bin is binomial with N = number of wells and p = 0.2
• 0.8⁵ ≈ 33% chance that all of the bins fall in range!
(Percentile bins: >P20, P40–P20, P60–P40, P80–P60, P100–P80)
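If each discovery independently has a 20% chance of landing in a given percentile bin, the count per bin is binomial and the 80% interval follows directly. A small sketch (the discovery count is illustrative):

```python
from scipy.stats import binom

def percentile_bin_interval(n_discoveries, p_bin=0.2, level=0.80):
    """80% interval for the number of discoveries in one percentile bin,
    assuming each discovery independently has probability p_bin of landing there."""
    alpha = (1.0 - level) / 2.0
    return (binom.ppf(alpha, n_discoveries, p_bin),
            binom.ppf(1.0 - alpha, n_discoveries, p_bin))

print(percentile_bin_interval(40))   # e.g. 40 discoveries spread over 5 bins of width 20 percentiles
```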
16. Empirical distribution provides a slightly more elegant way of looking at volume variance without looking at volume variance
• Empirical distribution looks at the proportion of discoveries that fall above given percentiles (plotted from P0 to P100)
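A sketch of the empirical exceedance curve, assuming each discovery comes with an assessed log-normal volume distribution; the `mu` and `sigma` arrays (assessed mean and standard deviation of log-volume) are hypothetical inputs.

```python
import numpy as np
from scipy.stats import norm

def empirical_exceedance(volumes, mu, sigma, n_grid=101):
    """Empirical exceedance curve for assessed log-normal volume distributions.

    For each discovery, u_i = Phi((log V_i - mu_i) / sigma_i) is the assessed
    cumulative probability of the realized volume.  The curve gives the
    proportion of discoveries with u_i above each level u; with faithful
    assessments it should track 1 - u (the diagonal from P0 to P100)."""
    v = np.asarray(volumes, dtype=float)
    u = norm.cdf((np.log(v) - np.asarray(mu)) / np.asarray(sigma))
    grid = np.linspace(0.0, 1.0, n_grid)
    exceedance = (u[None, :] > grid[:, None]).mean(axis=1)
    return grid, exceedance
```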
17. Investigate systematic biases by postulating a form of a systematic bias (same for every prospect in a sequence) and then analysing outcomes relative to predictions to elicit the bias parameters. Looking first at volume.

Assessed probability → systematic bias and confidence → "faithful" probability → outcomes:
  log Vᵢ ∼ N(μᵢ, σᵢ²)   (assessed log-volume distribution)
  μᵢᵗ = μᵢ + δμ          (systematic bias δμ in the mean)
  σᵢᵗ = φ σᵢ             (confidence factor φ on the spread)
so the outcomes follow log Vᵢ ∼ N(μᵢᵗ, (σᵢᵗ)²), and the residuals standardized by the faithful parameters, (log Vᵢ − μᵢᵗ)/σᵢᵗ, are standard normal once δμ and φ are correctly elicited.
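Under this model the bias parameters have closed-form maximum-likelihood estimates. The following is a minimal sketch of one of the fitting routes mentioned on slide 4 (maximum likelihood), assuming hypothetical arrays of realized discovery volumes and the corresponding assessed log-volume means and standard deviations; the sign reading in the comments follows the parametrization above.

```python
import numpy as np

def fit_volume_bias(volumes, mu, sigma):
    """ML estimates of (delta_mu, phi) in log V_i ~ N(mu_i + delta_mu, (phi*sigma_i)^2).

    mu, sigma are the assessed log-volume means and standard deviations.
    Under this sign convention, delta_mu < 0 indicates optimism (realized
    volumes fall short of assessments) and phi > 1 indicates over-confidence
    (assessed ranges too narrow)."""
    r = np.log(np.asarray(volumes, dtype=float)) - np.asarray(mu, dtype=float)
    s2 = np.asarray(sigma, dtype=float) ** 2
    w = 1.0 / s2
    delta_mu = np.sum(w * r) / np.sum(w)              # precision-weighted mean residual
    phi = np.sqrt(np.mean((r - delta_mu) ** 2 / s2))  # RMS of standardized residuals
    return delta_mu, phi
```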
18. Biases distort the empirical distribution curve in a predictable way
• Optimism: volumes fall consistently short
• Vagueness: volumes gather around P50
• Pessimism: volumes consistently exceed expectation
• Overconfidence: volumes gather around extremes
19. Pangloss Oil & Gas – "No oil left behind"
Small specialist exploration company. Humble with respect to uncertainty, but biased after their first, lucky, large discovery. (Empirical distribution plot, P0–P100.)
20. Hubris Industries – "World class geoscience"
Small exploration company made from ex-chiefs from large companies. Solid predictions on average but no humility at all. If it’s good, it’s very very good, and if it’s bad it’s horrid. (Empirical distribution plot, P0–P100.)
21. Now looking at success and failures

Assessed probability → systematic bias and confidence → "faithful" probability → outcomes. Two-parameter transform: δ captures optimism, ε captures confidence. Choose δ and ε to minimize Ψ = Σᵢ (xᵢ − rᵢ)².

In probability space:
  pᵢ   (assessed probability)
  qᵢ = pᵢ + ε (pᵢ − x̄)
  rᵢ = qᵢ + δ

In log-odds (decibel) space:
  πᵢ = 10 log₁₀( pᵢ / (1 − pᵢ) )
  θᵢ = πᵢ + ε (πᵢ − π₀)
  ρᵢ = θᵢ + δ
  rᵢ = 10^(ρᵢ/10) / (1 + 10^(ρᵢ/10))
with
  δ = r̄ − x̄
  ε = (mean(r²) − r̄²) / (mean(x·r) − x̄·r̄) − 1
solved numerically.
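A minimal sketch of the numerical fit in log-odds (decibel) space, choosing δ and ε by minimizing Ψ directly with a general-purpose optimizer (one of the fitting routes stated on the slide). The inputs are hypothetical arrays, and π₀ is taken here as the mean assessed log-odds, which is an assumption; the slide does not spell out its definition.

```python
import numpy as np
from scipy.optimize import minimize

def fit_probability_bias(p, x):
    """Fit (delta, epsilon) by minimizing Psi = sum_i (x_i - r_i)^2, using the
    log-odds (decibel) form of the transform.  p: assessed probabilities of
    success; x: 0/1 outcomes.  pi0 is taken as the mean assessed log-odds
    (an assumption -- not defined on the slide)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)  # guard the log-odds
    x = np.asarray(x, dtype=float)
    pi = 10.0 * np.log10(p / (1.0 - p))       # assessed probability in decibels
    pi0 = pi.mean()

    def r(params):
        delta, eps = params
        rho = pi + eps * (pi - pi0) + delta   # confidence stretch about pi0, then optimism shift
        return 10.0 ** (rho / 10.0) / (1.0 + 10.0 ** (rho / 10.0))

    def psi(params):
        return np.sum((x - r(params)) ** 2)

    result = minimize(psi, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    return result.x   # (delta, epsilon)
```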
22. Can’t visualize outcome biases directly, but can use the inferred parameters to plot the relationship between "faithful" and biased probabilities
• Optimism: probabilities shifted up
• Vagueness: probabilities gather around the mean
• Pessimism: probabilities shifted down
• Overconfidence: probabilities pushed out to extremes
23. Eeyore Enterprises – "Every silver lining has its cloud"
Lucky to survive a run of failures and non-commercial discoveries, EE are consistently conservative in their assessments.
24. Sybil Oil – "Better vaguely right than certainly wrong"
A culture of deciding by committee that can’t agree on anything. All probabilities close to the base rate and very large P10/P90 ratios. Outcomes falling within the uncertainty range is a KPI (but occasionally falling outside is not).
25. A taxonomy of systematic bias
• Presentation shows methods to elicit and quantify systematic assessment errors
• Methods reveal systematic trends otherwise hidden in traditional lookback visualizations
• Systematic bias models can be used to simulate lookback results to
  • develop intuition on what systematic failures can be seen in various visualizations
  • and what cannot reasonably be distinguished from noise with the sequence length available

Bias – Outcome – Volume:
• Optimism. Outcome: probabilities consistently too high, mean overpredicted¹. Volume: P50 volumes too high².
• Pessimism. Outcome: probabilities consistently too low, mean underpredicted. Volume: P50 volumes too low.
• Over-confidence. Outcome: probabilities too polarized, variance underpredicted³. Volume: ranges²,³ too narrow, variance underpredicted.
• Under-confidence. Outcome: probabilities too close to baseline, variance overpredicted. Volume: ranges too wide.

1) Probabilities biased up towards 50% will also result in an increase in variance, and probabilities biased downwards away from 50% will result in a lower variance.
2) It is mathematically more sound to work with the normal mean and variance, i.e. the mean and variance of the (approximately) normal distribution followed by the logarithm of the volumes. For this reason we work with the P50 (whose logarithm is the normal mean) and P10/P90 ratios (whose logarithm is proportional to the normal standard deviation).
3) Over-confidence is often mistaken for optimism or (less often) pessimism: falling outside of a confidence interval can be because the confidence interval is too small. One of the big advantages of the methods advocated here is the ability to tell the difference.
26. Future work: Disutility of poor probabilistic prediction
(Decision tree: drill or drop; if drilled, success or failure; on success, develop or abandon; payoffs are the volume value V, or zero, less the cost of the well.)

Systematic bias model may be used to predict the expected value erosion from the biases. In the decision model, decisions are based on biased probabilities, i.e. probabilities given by the bias model, but expectations are performed using the original "faithful" probabilities.

Figure shows value erosion for a simple decision model (no appraisal, but no development if the volume is less than the commercial threshold; NPV is linear with volume above the threshold). Results show substantial value erosion.

Note asymmetry: fortune favours the brave! (Well, more than the correspondingly timid at any rate.)
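To make the value-erosion calculation concrete, here is a minimal sketch of the kind of simple decision model described above (no appraisal; develop only above a commercial threshold; NPV linear in volume above it). The threshold, unit value, well cost, and prospect parameters are illustrative assumptions, and the lognormal expectation uses the standard partial-expectation formula; this is a sketch of the idea, not the author's implementation.

```python
import numpy as np
from scipy.stats import norm

def expected_npv(p, mu, sigma, threshold, value_per_unit, well_cost):
    """Expected NPV of drilling: success with probability p, volume lognormal
    (log V ~ N(mu, sigma^2)); develop only if V exceeds the commercial
    threshold; NPV linear in volume above the threshold, less the well cost."""
    t = (np.log(threshold) - mu) / sigma
    # E[(V - threshold)+] for lognormal V (partial-expectation formula)
    mean_excess = (np.exp(mu + 0.5 * sigma**2) * norm.cdf(sigma - t)
                   - threshold * norm.cdf(-t))
    return p * value_per_unit * mean_excess - well_cost

def value_erosion(p_true, mu_true, sig_true, p_bias, mu_bias, sig_bias,
                  threshold=10.0, value_per_unit=3.0, well_cost=20.0):
    """Expected value lost when drill/drop decisions use biased assessments
    but expectations follow the 'faithful' distributions (illustrative values)."""
    npv_true = expected_npv(p_true, mu_true, sig_true, threshold, value_per_unit, well_cost)
    npv_bias = expected_npv(p_bias, mu_bias, sig_bias, threshold, value_per_unit, well_cost)
    achieved = np.where(npv_bias > 0, npv_true, 0.0)    # drill only if it *looks* positive
    attainable = np.maximum(npv_true, 0.0)              # drill only if it *is* positive
    return np.sum(attainable - achieved)

# Example: optimistic volume assessments (P50 assessed 30% too high) on five prospects
mu_t = np.log(np.array([8.0, 12.0, 20.0, 5.0, 30.0])); sig = np.full(5, 0.8)
p = np.array([0.2, 0.3, 0.25, 0.4, 0.15])
print(value_erosion(p, mu_t, sig, p, mu_t + np.log(1.3), sig))
```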