The document discusses using data analysis to answer relevant questions: finding answers to questions using data, analyzing the data, and presenting the results. Specific techniques mentioned include regression as a tool for comparison, and handling heteroskedasticity by using robust standard errors rather than modeling the problem explicitly. An example question is whether men like beer more than women do.
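A minimal R sketch of that robust-errors approach, assuming a hypothetical data frame with a liking score and a 0/1 gender indicator; the sandwich and lmtest calls are standard, but the data and variable names are made up:

```r
# Compare mean beer ratings of men and women with a regression,
# then report heteroskedasticity-robust (White) standard errors.
library(sandwich)  # vcovHC: robust covariance estimators
library(lmtest)    # coeftest: tests with a custom covariance matrix

# Hypothetical data: 'rating' is a liking score, 'male' is 0/1.
set.seed(1)
beer <- data.frame(male = rbinom(200, 1, 0.5))
beer$rating <- 5 + 0.4 * beer$male + rnorm(200, sd = 1 + beer$male)

fit <- lm(rating ~ male, data = beer)

# Robust errors: no need to model the heteroskedasticity itself.
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```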
The document discusses the differences between prediction and causality. It notes that while correlation is useful for prediction, correlation does not necessarily imply causation. Causal discovery requires methods beyond simply observing correlations, since determining the fundamental causes of phenomena is often difficult. Deep learning methods that rely solely on correlations learned from data are a form of curve fitting rather than a way to understand causal relationships.
This document discusses advanced techniques for A/B testing beyond basic approaches. It describes simulating A/B tests on historical sales data to estimate the sample size needed to detect effects of different sizes. Stratified sampling techniques that block on variables like geography are presented to ensure more balanced comparisons. Bayesian A/B testing is introduced as a method that calculates the posterior probability that versions are different given the observed data. Causal tree methods are proposed for exploring effects within segments while avoiding multiple testing issues.
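A sketch of the simulation idea described above, with synthetic figures standing in for the historical sales data; only the structure of the power loop is the point:

```r
# Estimate the sample size needed to detect a given lift by
# simulating A/B tests; 'historical' is a synthetic stand-in
# for real past sales data.
set.seed(123)
historical <- rgamma(5000, shape = 2, scale = 50)

power_at_n <- function(n, lift, n_sim = 1000) {
  mean(replicate(n_sim, {
    a <- sample(historical, n, replace = TRUE)
    b <- sample(historical, n, replace = TRUE) * (1 + lift)
    t.test(a, b)$p.value < 0.05   # did this simulated test detect the lift?
  }))
}

# Share of simulated experiments that detect a 5% lift, by sample size:
sapply(c(200, 500, 1000, 2000), power_at_n, lift = 0.05)
```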
This document discusses a case study about an election poll for a political candidate, Mr. Allen. It presents three potential poll outcomes and asks which would be most encouraging for Mr. Allen. It also contains information about income growth in the US from 1993 to 2012, showing that the top 1% saw much higher growth than the average or the bottom 99%. Finally, it discusses a 1973 case regarding alleged bias against women applicants to graduate programs at UC Berkeley.
Sally Clark was accused in 1999 of murdering her two children after both deaths had been attributed to Sudden Infant Death Syndrome (SIDS). The probability of a SIDS death was put at 1 in 8,500. The document asks what other probabilities a judge would want to know, whether the reader would convict Sally Clark, and whether she was in fact convicted.
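The arithmetic at stake can be sketched in a few lines of R. The 1-in-8,500 figure is the one quoted above; the double-murder rate is a purely hypothetical placeholder, included only to show how Bayes' rule reframes the question:

```r
# The headline number: squaring the per-family SIDS rate,
# which assumes the two deaths are independent events.
p_sids <- 1 / 8500
p_two_sids <- p_sids^2   # ~1 in 72 million, under independence

# Bayes' rule asks the other question: given two infant deaths,
# which tiny prior is more plausible? (The double-murder rate
# below is a hypothetical placeholder, not a real estimate.)
p_double_murder <- 1 / 1e9
p_double_murder / (p_double_murder + p_two_sids)   # ~0.07, far from certainty
```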
This document discusses using data analysis to answer relevant questions. It provides examples of where to find open datasets and how to measure what you want to know from data. Bayes' theorem is explained using examples of Down syndrome screening and diagnostic test accuracy. The Sally Clark case is presented, with the probability of sudden infant death syndrome given, and questions are posed about what other probabilities a judge would want to know, whether she should be convicted, and whether she was actually convicted.
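The screening example reduces to a single application of Bayes' rule. A sketch in R, with assumed round numbers for prevalence and test accuracy rather than the figures from the original slides:

```r
# Bayes' rule for a screening test: P(condition | positive result).
# Prevalence, sensitivity and false-positive rate are illustrative
# round numbers, not the figures from the slides.
prevalence  <- 1 / 1000
sensitivity <- 0.90    # P(positive | condition)
false_pos   <- 0.05    # P(positive | no condition)

p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)
sensitivity * prevalence / p_positive   # ~0.018: most positives are false
```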
The document discusses how to find answers to relevant questions using data. It explains that the process involves forming a question, collecting relevant data, analyzing the data, and presenting the results. It also notes that data analysis is a valuable skill because data is becoming ubiquitous and cheap, while analysis remains scarce and complementary to data. The document provides examples of conditional probability problems and their solutions.
The document discusses a session on using data analysis to answer relevant questions. It provides examples of key concepts like conditional probability, conditional expected value, and linear regression. It also shares examples of using data and statistics, including a 1973 case analyzing admissions data from UC Berkeley by gender, and a 1987 example about predicting the outcome of a political election based on poll results.
The document discusses using data analysis to answer relevant questions. It covers key steps in the process: defining the question, collecting relevant data, performing analysis on the data, and presenting results. Randomness and uncertainty are acknowledged as limitations. Statistical hypothesis testing is introduced as a framework for making inferences about populations based on sample data. An example question and poll results are provided to illustrate hypothesis testing.
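A minimal illustration of such a test in R, using hypothetical poll counts (540 of 1,000 respondents in favor):

```r
# Is 54% support in a (hypothetical) poll of 1,000 respondents
# significantly different from 50%?
binom.test(540, 1000, p = 0.5)   # exact binomial test
prop.test(540, 1000, p = 0.5)    # normal approximation, with a CI
```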
This document appears to be notes from a data analysis presentation. It includes examples of conditional probability problems, explanations of Bayes' theorem and how new information can update probabilities, and an example calculating the probability of Down syndrome given a positive screening result. It also discusses a case where a woman's two children died of sudden infant death syndrome and questions whether she should have been convicted given the probability of that occurring.
This document discusses how to analyze data to answer relevant questions. It begins with an introduction to the data analysis process, including defining a question, collecting and analyzing data, and presenting results. It then provides two data visualizations of income growth for the average, the bottom 99%, and the top 1%, demonstrating the concentration of income gains at the top.
The document discusses how to find answers to relevant questions using data. It outlines the process of moving from asking a question, to collecting and analyzing data, and finally presenting the results. It also quotes Prof. Hal Varian saying that data analysis skills will be highly valuable as data becomes more ubiquitous and cheap, since analysis is complementary to data. Finally, it mentions a 2014 salary survey of technology professionals in the US.
The document discusses three tips for coding projects: 1) Stay organized by structuring projects into folders for data, code, and output. 2) Be clear by adding comments to code to explain sections and steps. 3) Don't repeat code by using functions like lapply and bind_rows to read in multiple files instead of separate read.csv lines. The final tip is to search for help when needed.
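The pattern that tip refers to might look like the following; the folder name and file layout are assumptions:

```r
# One expression instead of one read.csv() line per file.
library(dplyr)  # for bind_rows

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
all_data <- bind_rows(lapply(files, read.csv))
```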
This document provides tips for coding projects. It recommends staying organized by structuring projects into folders for data, code, and output. It also suggests being clear by using comments to explain code and create sections. Additionally, it advises not repeating code and instead using loops or conditionals to run code on different subsets of data. The final tip is to search for help when needed.
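A minimal sketch of that advice, assuming a hypothetical data frame `df` with `region`, `revenue`, and `price` columns:

```r
# Fit the same model to every regional subset instead of
# copy-pasting the code once per region.
models <- lapply(split(df, df$region), function(d) {
  lm(revenue ~ price, data = d)
})
sapply(models, coef)   # coefficients per region, side by side
```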
This document outlines programming tools for a 2015 winter course at CEU. It lists mini quizzes, project work, and reasons to code such as keeping track of work, being productive, and being capable of new things. It also lists project files, data files from 2010 to 2013, and program files (do_code) for Gretl and Stata.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravaganza (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was six months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
1. Uncovering political connections of firms using machine learning methods
BURN meetup, 9th February 2016
» János Divényi @janosdivenyi « » Jenő Pál @paljenczy «
47-48. Improve decision rule: caret (classification and regression training) offers one interface to many algorithms, streamlines the process of machine learning, and enables parallel computation with reproducibility; slide 48 adds doParallel as the parallel backend.
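Based on the package names on these slides, the workflow presumably follows the standard caret + doParallel pattern sketched below; the data set (mtcars) and the two model choices are illustrative assumptions, not taken from the talk:

```r
library(caret)       # one interface to many algorithms
library(doParallel)  # parallel backend for caret's resampling

cl <- makeCluster(2)        # two worker processes
registerDoParallel(cl)

set.seed(42)  # for fully reproducible parallel resampling,
              # see also the 'seeds' argument of trainControl()
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

dat <- transform(mtcars, am = factor(am))  # illustrative data set

# The same train() interface regardless of the algorithm:
fit_glm <- train(am ~ mpg + wt, data = dat, method = "glm",
                 family = "binomial", trControl = ctrl)
fit_rf  <- train(am ~ mpg + wt, data = dat, method = "rf",
                 trControl = ctrl)

stopCluster(cl)
```

The point of the single train() interface is that swapping method = "glm" for method = "rf" changes the algorithm without changing any of the surrounding code.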
57. Miklós Koren, Ádám Szeidl, Márta Bisztray, Anna Csonka, Krisztián Fekete, Attila Gáspár, Dániel Molnár, Gábor Nyéki, Krisztina Orbán, Rita Pető, Balázs Reizer, Mátyás Steiner, Bálint Szilágyi, Ferenc Szűcs, András Vereckei, Zsófia Kőműves, Olivér Kiss, Dániel Pass, Dávid Popper and others...
Thank you for your attention