This document discusses issues in statistics that data scientists can and cannot ignore when working with large datasets. It begins by outlining the talk and defining key terms in data science. It then explains that model assessment, such as estimating model performance on new data, becomes easier with more data, because statistical adjustments are no longer needed. However, more data and more variables are not always better, as noise, collinearity, and overfitting can still occur. Several examples are given where common machine learning algorithms can be fooled into achieving high accuracy on training data even when the target variable is random. The conclusion emphasizes that data science, statistics, and domain expertise each provide unique perspectives, and effective teams need to understand all three views.
Statistics in the age of data science, issues you can not ignore
1. Statistics in the age of data science, issues you can and can not ignore
John Mount (data scientist, not a statistician)
Win-Vector LLC
http://www.win-vector.com/
Session: http://conf.dato.com/speakers/dr-john-mount/
These slides, all data and code: http://winvector.github.io/DS/
2. This talk
•Our most important data science tools are our theories and methods. Let us look a bit at their fundamentals.
•Large data gives the apparent luxury of wishing away some common model evaluation biases (instead of needing to apply the traditional statistical corrections).
•Conversely, to work agilely data scientists must act as if a few naive axioms of statistical inference were true (though they are not).
•I will point out some common statistical issues that do and do not remain critical problems when you are data rich.
•I will concentrate on the simple case of supervised learning.
3. Outline
•(quick) What is data science?
•How can that work?
•An example critical task that gets easier when you have more data.
•What are some of the folk axioms of data science?
•How to design bad data.
4. What is Data Science: my position
•Data science is the continuation of data engineering and predictive analytics.
•More data allows domain naive models to perform well.
•See: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira, IEEE Intelligent Systems, vol. 24, no. 2, 2009, pp. 8-12.
•Emphasis on prediction over harder statistical problems such as coefficient inference.
•Strong preference for easy procedures that become more statistically reliable as you accumulate more data.
•Reliance on strong black-box tools.
6. Complicated domain theory doesn’t always preclude easily observable statistical signals
“Maybe the molecule didn’t go to graduate school.”
(Will Welch defending the success of his approximate molecular screening algorithm, given he is a computer scientist and not a chemist.)
Example approximate docking (in this case using SLIDE approximation, not Welch et al.’s Hammerhead).
“Database Screening for HIV Protease Ligands: The Influence of Binding-Site Conformation and Representation on Ligand Selectivity”, Volker Schnecke, Leslie A. Kuhn, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 242-251, AAAI Press, 1999.
7. A lot of deep statistics is about how to work correctly with small data sets
•From Statistics, 4th edition, David Freedman, Robert Pisani, Roger Purves, Norton, 2007.
8. What is a good example of a critical task that becomes easier when you have more data?
9. Model assessment
•Estimating the performance of your predictive model on future instances
•A critical task
•Gets easier when you have more data:
•Don’t need to rely on statistical adjustments
•Can reserve a single sample of data as a held-out test set (see “The Elements of Statistical Learning”, 2nd edition, section 7.2)
•Computationally cheaper than:
•leave-k-out cross validation
•k-fold cross validation
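The held-out assessment the slide describes can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the dataset and model are invented for the example, not the talk's original materials:

```python
# Minimal sketch of held-out model assessment (assumes scikit-learn).
# With plenty of rows we can afford to reserve a single test set and read
# off an honest performance estimate directly, no in-sample adjustments.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # optimistic in-sample estimate
test_acc = model.score(X_test, y_test)     # honest out-of-sample estimate
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

A single split is also much cheaper than fitting k models, which is part of its appeal at scale.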
11. Statistical Adjustments
•Attempt to estimate the value of a statistic or the performance of a model on new data using only training data.
•Examples:
•Sample size adjustment for variance: writing Σᵢ(xᵢ − x̄)²/(n−1) instead of Σᵢ(xᵢ − x̄)²/n
•“adjusted R-squared”, in-sample p-values, “AIC”, “BIC”, …
•Pointless to adjust in-sample quantities when you have enough data to estimate out of sample quantities directly (cross validation methods, train/test methods, or bootstrap methods).
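The variance adjustment above (Bessel's correction) is easy to see empirically. A small simulation, assuming NumPy; the population parameters are chosen only for illustration:

```python
# Bessel's correction in action: dividing by n systematically
# underestimates the population variance; dividing by n-1 removes the bias.
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population variance of N(0, 2^2)
n = 5           # small samples make the bias visible

biased, unbiased = [], []
for _ in range(20000):
    sample = rng.normal(0.0, 2.0, size=n)
    biased.append(np.var(sample, ddof=0))    # divide by n
    unbiased.append(np.var(sample, ddof=1))  # divide by n-1

print(f"mean of n-divided estimates:     {np.mean(biased):.3f}")   # ~ 3.2
print(f"mean of (n-1)-divided estimates: {np.mean(unbiased):.3f}")  # ~ 4.0
```

With large n the two estimates coincide, which is the slide's point: the adjustment matters only when data is scarce.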
14. Train/Test split continued
•Statistically inefficient
•Blocking issue for small data sets
•Largely irrelevant for large data sets
•Considered “cross validation done wrong” by some statisticians
•Cross validation techniques in fact estimate the quality of the fitting procedure, not the quality of the final returned model.
•Test or held-out procedures directly estimate the performance of the actual model built.
•Preferred by some statisticians and most data scientists.
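The procedure-versus-model distinction is concrete in code. A sketch assuming scikit-learn (illustrative data, not the talk's own example): k-fold cross-validation scores k throw-away models, so the averaged score estimates the fitting procedure; the final model trained on all the data is never itself scored.

```python
# k-fold cross-validation estimates the *fitting procedure*, not the final model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

proc = LogisticRegression(max_iter=1000)
fold_scores = cross_val_score(proc, X, y, cv=3)  # 3 throw-away models, each trained on 2/3 of the data
print("per-fold accuracies:", fold_scores.round(3))
print("procedure estimate: ", fold_scores.mean().round(3))

final_model = proc.fit(X, y)  # the model you actually ship; its own accuracy was never measured
```

A held-out test set, by contrast, scores the one model you will deploy.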
16. Data scientists rush in where statisticians fear to tread
•Large data sets
•Wide data sets
•Heterogeneous variables
•Collinear variables
•Noisy dependent variables
•Noisy independent variables
17. We have to admit: data scientists are a flourishing species
•Must be something to be learned from that.
•What axioms (true or false) would be needed to explain their success?
18. Data science axiom wish list
•Just a few:
•Wish more data was always better.
•Wish more variables were always better.
•Wish you can retain some fraction of training performance on future model application.
19. Is more data always better?
•In theory: yes (you could always simulate having less data by throwing some away).
•In practice: almost always yes.
•Absolutely for every algorithm every time? No.
20. More data can be bad for a fixed procedure (artificial example)
•Statistics / machine learning algorithms that depend on re-sampling to supply diversity can degrade in the presence of extra data.
•Case in point: random forest over shallow trees can lose tree diversity (especially when there are duplicate or near-duplicate variables).
21. Random forest example
•A data set where a random forest over shallow trees shows lower median accuracy on test data as we increase training data set size.
•(synthetic data set designed to hurt random forest, logistic model passes 0.5 accuracy)
•All code/data: http://winvector.github.io/DS/
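The shape of this experiment can be sketched as follows. This is an illustrative reconstruction assuming scikit-learn, not the authors' original construction from http://winvector.github.io/DS/, and the accuracy trend will depend on the exact data design:

```python
# Train shallow-tree random forests at increasing training sizes on data with
# duplicated signal columns plus noise, and watch held-out accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.normal(size=(n, 1))
    X = np.hstack([x] * 10 + [rng.normal(size=(n, 5))])  # 10 duplicate signal columns + 5 noise columns
    y = (x[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_test, y_test = make_data(2000)
accs = []
for n_train in (100, 1000, 10000):
    X_tr, y_tr = make_data(n_train)
    rf = RandomForestClassifier(n_estimators=50, max_depth=1, random_state=0)
    rf.fit(X_tr, y_tr)
    accs.append(rf.score(X_test, y_test))
    print(f"n_train={n_train:>6}: test accuracy = {accs[-1]:.3f}")
```

The point is the mechanism, not any particular numbers: shallow trees over redundant columns can converge toward the same split as data grows, starving the ensemble of diversity.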
22. Are more variables always better?
•In theory: yes.
•Consequence of the non-negativity of mutual information.
•Only true for training set performance, not performance on future instances.
•In practice: often.
•In fact: ridiculously easy to break:
•Noise variables
•Collinear variables
•Near constant variables
•Overfit
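The noise-variable failure mode is easy to demonstrate. A hedged illustration assuming scikit-learn (data invented for the example): adding pure-noise columns keeps pushing training accuracy up while doing nothing for held-out accuracy.

```python
# More variables "help" in-sample only: noise columns inflate training
# accuracy while held-out accuracy goes nowhere.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = (x[:, 0] + rng.normal(size=n) > 0).astype(int)
x_new = rng.normal(size=(n, 1))                       # fresh draw, same process
y_new = (x_new[:, 0] + rng.normal(size=n) > 0).astype(int)

results = {}
for k_noise in (0, 50, 150):
    X_tr = np.hstack([x, rng.normal(size=(n, k_noise))])
    X_te = np.hstack([x_new, rng.normal(size=(n, k_noise))])   # noise stays noise on new data
    m = LogisticRegression(max_iter=5000, C=1e6).fit(X_tr, y)  # effectively unregularized
    results[k_noise] = (m.score(X_tr, y), m.score(X_te, y_new))
    print(f"{k_noise:>3} noise columns: train={results[k_noise][0]:.2f} "
          f"test={results[k_noise][1]:.2f}")
```

With 150 noise columns and 200 rows the near-unregularized fit can nearly separate the training data, which is exactly the mutual-information caveat above: the gain is in-sample only.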
23. To benefit from more variables
•Need at least a few of the following:
•Enough additional data to falsify additional columns.
•Regularization terms / useful inductive biases.
•Variance reduction / bagging schemes.
•Dependent variable aware pre-processing (variable selection, partial least squares, word2vec, and not principal components projection).
24. Can’t we keep at least some of our training performance?
•Common situation:
•Near perfect fit on training data.
•Model performs like random guessing on new instances.
•Extreme overfit.
•One often hopes some regularized, ensemble, or transformed version of such a model would have at least some use on new instances.
25. Not the case
•For at least the following common popular machine learning algorithms we can design a simple data set where we get arbitrarily high accuracy on training even when the dependent variable is generated completely independently of all of the independent variables.
•Decision Trees
•Logistic Regression
•Elastic Net Logistic Regression
•Gradient Boosting
•Naive Bayes
•Random Forest
•Support Vector Machine
(All code/data: http://winvector.github.io/DS/ )
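One instance of this failure is easy to build. A sketch assuming scikit-learn (a simplified stand-in, not the authors' constructions from http://winvector.github.io/DS/): with enough continuous noise columns, an unrestricted decision tree fits a completely random target perfectly on training data, while new data shows it has learned nothing.

```python
# Perfect training accuracy on a target independent of every feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)          # target independent of all columns

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)            # 1.0: the tree memorized the noise
test_acc = tree.score(rng.normal(size=(n, p)),
                      rng.integers(0, 2, size=n))  # ~0.5: coin flipping
print(f"train accuracy: {train_acc:.2f}, new-data accuracy: {test_acc:.2f}")
```

No regularized or transformed version of this model can salvage anything, because there is nothing in the data to salvage.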
26. Lesson
•Can’t just tack on cross validation late in a machine learning algorithm’s design (or just use it to pick a few calibration parameters).
•Have to express model quality in terms of out of sample data throughout.
•And must understand some fraction of the above measurements will still be chimeric and falsely optimistic (statistical significance re-enters the picture).
27. How did we design the counterexamples?
•A lot of common machine learning algorithms fail in the presence of:
•Noise variables
•Duplicate examples
•Serial correlation
•Incompatible scales
•Punchline: all these things are common in typical under-curated real world data!
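The duplicate-examples failure deserves a concrete look, since it silently breaks cross-validation too. A hedged illustration assuming scikit-learn (data invented for the example): duplicated rows leak across randomly assigned folds, so cross-validation scores a pure memorizer far above its true chance-level performance.

```python
# Duplicate rows leak between folds: a 1-nearest-neighbor memorizer scores
# well under cross-validation on a target with no signal at all.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n, p = 300, 5
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)          # random target: nothing to learn
X_dup = np.vstack([X, X])               # every row duplicated (bad join logic)
y_dup = np.concatenate([y, y])

knn = KNeighborsClassifier(n_neighbors=1)     # pure memorizer
shuffled = rng.permutation(len(y_dup))
cv_score = cross_val_score(knn, X_dup[shuffled], y_dup[shuffled], cv=5).mean()
print(f"cross-validated accuracy on duplicated data: {cv_score:.2f}")  # far above 0.5
```

Most test rows find their exact twin in the training folds, so the score is inflated even though the labels are coin flips.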
28. The analyst themselves can be a source of additional exotic “can never happen” biases
•Neighboring duplicate and near-duplicate rows (bad join/sessionizing logic).
•Features with activation patterns depending on the size of the training set (opportunistic feature generation/selection).
•Leakage of facts about the evaluation set through repeated scoring (see “wacky boosting” by Moritz Hardt, which gives a reliable procedure to place high on Kaggle leaderboards without even looking at the data).
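The leaderboard-leakage attack can be simulated in a few lines (a simplified sketch of the idea, not Hardt's exact procedure): submit random label vectors, keep the ones that score lucky, and majority-vote them:

```python
import random

random.seed(0)

n = 200
secret = [random.randint(0, 1) for _ in range(n)]  # the hidden labels

def leaderboard(sub):  # the only access we have to the evaluation set
    return sum(s == t for s, t in zip(sub, secret)) / n

# Submit many random guesses, keep the lucky ones, majority-vote them.
kept = []
for _ in range(500):
    guess = [random.randint(0, 1) for _ in range(n)]
    if leaderboard(guess) > 0.5:
        kept.append(guess)

votes = [sum(col) for col in zip(*kept)]
final = [1 if 2 * v > len(kept) else 0 for v in votes]
print(leaderboard(final))  # well above chance, without touching any data
```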
29. Conclusions
•Data scientists, statisticians, and domain experts all see things differently.
•Data science emphasizes procedures that are conceptually easy and become more correct when scaled to large data. Its procedures can seem overly ambitious and pandering to domain/business needs.
•Statistics emphasizes procedures that are correct at all data scales, including difficult small-data problems. Its procedures can seem overly doctrinal and insensitive to original domain/business needs.
•Domain experts/scientists value correctness and foundations over implementability.
•An effective data science team must work agilely, understand statistics, and develop domain empathy.
•We need a deeper science of structured back-testing.
30. Thank you
For more, please check out my book, or contact me at win-vector.com.
Also follow our group on our blog http://www.win-vector.com/blog/ or on Twitter @WinVectorLLC.
Clearly we are ignoring some important domain science issues and statistical science issues, so how does data science work?
http://www.aaai.org/Papers/ISMB/1999/ISMB99-028.pdf
You may not get the whole story, but you may not miss the whole story.
Ch. 26 page 493. Statistical efficiency is a huge worry when you don’t have a lot of data.
“The Elements of Statistical Learning” 2nd edition section 7.2 page 222.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Does not matter when n is large. Can actually be quite complicated and require a lot of background to apply correctly. Prefer tools like the PRESS statistic to adjusted R-squared. Can use the training mean against out-of-sample instances, and so on.
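The PRESS statistic mentioned here is just the sum of squared leave-one-out prediction errors; a brute-force sketch for a simple univariate linear fit (hypothetical example data, not from the talk):

```python
import random

random.seed(5)

# PRESS (PREdiction Sum of Squares): sum of squared leave-one-out
# prediction errors, computed here by brute-force refitting of a
# simple univariate linear model.
pts = [(x, 3 * x + random.gauss(0, 1)) for x in range(20)]

def fit(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return my - b * mx, b  # intercept, slope

press = 0.0
for i, (xi, yi) in enumerate(pts):
    a, b = fit(pts[:i] + pts[i + 1:])    # refit without row i
    press += (yi - (a + b * xi)) ** 2    # genuine out-of-sample residual

a, b = fit(pts)
sse = sum((y - (a + b * x)) ** 2 for x, y in pts)  # in-sample sum of squares
print(press, sse)  # PRESS is always the larger (more honest) number
```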
k-fold cross-validation methods are a procedural alternative. Shown: 3-fold cross validation. We try to simulate the performance of a model on new data by never applying a model to any data used to construct it. Which cross-validation scheme you are using determines the pattern of arrows. Common to all schemes: there are many throw-away models. The larger the training portions, the more these models behave like a model trained on all of the available data.
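The mechanics of the scheme described here fit in a few lines (a self-contained sketch; the "model" is just a majority-class classifier so the example needs no library):

```python
import random

random.seed(3)

# Minimal 3-fold cross-validation: every held-out prediction comes from
# a throw-away model that never saw that row.
data = [(random.gauss(0, 1), random.randint(0, 1)) for _ in range(90)]
k = 3
folds = [data[i::k] for i in range(k)]

scores = []
for i in range(k):
    held_out = folds[i]
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    labels = [t for _, t in train]
    majority = max(set(labels), key=labels.count)  # fit the throw-away model
    acc = sum(t == majority for _, t in held_out) / len(held_out)
    scores.append(acc)

print(sum(scores) / k)  # pooled out-of-sample performance estimate
```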
Test/train split is an easier alternative that is less statistically efficient and depends on having good tools (that themselves cross-validate) during the training phase. The test set is held secret during model construction, tuning, and even early evaluation. Scoring within the train subset may in fact itself use both cross-validation and train/test sub-split methods. The actual model produced is scored on the test set (though some data scientists re-train on the entire data set as a final “model polish” step).
Splitting your available data into train and test is a way to try to simulate the arrival of future data. Like any simulation, it may fail. Controlled experiments are prospective designs that are somewhat more expensive and somewhat more powerful than this.
Data science is a bit looser than traditional statistical practice and moves a bit faster; what does that look like?
Axioms that are true are true in the extreme.
Random forest is a dominant machine learning algorithm in practice. This is a problem where logistic regression gets 85% accuracy as n increases, and the concept is reachable by the random forest model.
Note: collinear variables, while damaging to prediction, are nowhere near as large a hazard to prediction as they are to coefficient inference. And classic “x alone” methods of dealing with them become problematic in so-called “wide data” situations.
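The asymmetry in that note can be seen directly (a hypothetical sketch, not the talk's code): with two nearly identical columns, the individual coefficients are poorly determined, yet the joint prediction stays accurate:

```python
import random

random.seed(9)

n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.01) for a in x1]  # a nearly collinear copy
y = [a + random.gauss(0, 0.1) for a in x1]

# Ordinary least squares (no intercept) via the 2x2 normal equations.
s11 = sum(a * a for a in x1)
s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)
s1y = sum(a * c for a, c in zip(x1, y))
s2y = sum(b * c for b, c in zip(x2, y))
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

pred = [b1 * a + b2 * b for a, b in zip(x1, x2)]
rmse = (sum((p - c) ** 2 for p, c in zip(pred, y)) / n) ** 0.5
print(b1, b2)  # individually unstable; only their sum is pinned down near 1
print(rmse)    # yet the joint prediction stays accurate
```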
Principal components analysis is an “independent variables only” or “x alone” transform: a good idea over curated homogeneous variables, not good over wild wide datasets. word2vec ( https://code.google.com/p/word2vec/ ) can be considered not “x alone” as it presumably retains concept clusters from the grouping of its training source data (typically GoogleNews or Freebase naming); so it has a “y” (just not your “y”).
I.e., we see an arbitrarily good model on training, even when no model is possible.
Also have sometimes seen a reversal: the model is significantly worse than random on the test set. Being worse than random likely indicates a minor distribution change from training to test. The observed statistical significance is likely due to some process causing dependence between rows in a limited window (like serial correlation or bad sessionizing) and not evidence of anything usable.
These methods should be familiar: they are most of the supervised learning algorithms from everybody’s “top 10 data mining” or “top 10 Kaggle algorithms” lists.
Gradient boosting typically uses cross-validation to pick the number of trees, zero trees being no model.
Some of these problems even break test/train exchangeability, one of the major justifications of machine learning.
http://blog.mrtz.org/2015/03/09/competition.html
It is equally arrogant to completely ignore domain science as it is to believe you can always quickly become a domain expert.
http://www.win-vector.com/blog/2014/05/a-bit-of-the-agenda-of-practical-data-science-with-r/