1) The document discusses Vapnik's approach to statistical modeling and machine learning, focusing on the concepts of generalization, overfitting, and VC dimension.
2) It introduces the idea of Structural Risk Minimization (SRM), which controls a model's complexity through the VC dimension in order to maximize generalization. SRM selects the model that minimizes the sum of the empirical risk and the confidence interval.
3) As an example, it describes how SRM can be implemented in an industrial data mining context to optimize variables, model structure, and hyperparameters for tasks like classification and regression.
1. Les outils de modélisation des Big Data (Modeling tools for Big Data)
SEPIA
3 December 2013
Prof. Michel Béra
Chair of Statistical Risk Modeling (Chaire de Modélisation statistique du Risque)
CNAM/SITI/IMATH
1
2. • Outline of the talk
– Vapnik's inequality and the foundations of a new theory of robustness (1971 and 1995)
– Perspectives on the classical methods (NN, decision trees, factor analysis)
– The notion of data geometry and of extended spaces – the kernel trick – qualitative vs. quantitative: an outdated battle
– Big Data and the Vapnikian world, utopias and realities – notions of computational complexity
– Modern modeling: a chain of approaches, from blind Machine Learning to the finer points of evidence-based policy
2
3. Statistical history
[Timeline figure. Recoverable milestones:]
– 1930: Fisher; Kolmogorov–Smirnov
– 1950: Cramér
– 1960: mainframes; huge datasets start appearing
– 1974: VC dimension
– 1980: SRM (Vapnik)
– 1995: Support Vector Machines (Vapnik)
– 2001: start of the internet era; millions of records and thousands of variables
[Other labels on the figure: Theoretical Statistics ("data are as they are"); Applied Statistics ("modeling data then testing"); theory of ill-posed problems; empirical methods of conjuration (PCA, NN, Bayes); the curse of high-dimensional problems; "STOP!", "Watch out!", "GO!"]
3
4. 1. The world of Vapnik
- A 1995 lecture at Bell Labs (New Jersey)
4
5. Consistency: definition
1) A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as the original sample, converges to the model error measured on the original sample as the original sample size increases.
2) A model that is consistent is also said to generalize well, or to be robust.
5
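As a complement (not on the original slide), the standard formalization of this definition can be written in the notation the deck uses later, with R the expected risk, E the empirical risk, L the sample size, and w_L denoting the parameter returned by empirical risk minimization on a sample of size L:

% Standard consistency statement for empirical risk minimization (added for
% clarity; the slide gives only the verbal definition above):
\lim_{L \to \infty} \Pr\!\Big( \big| R(w_L) - \inf_{w \in W} R(w) \big| > \varepsilon \Big) = 0
\quad \text{and} \quad
\lim_{L \to \infty} \Pr\!\Big( \big| E(w_L) - \inf_{w \in W} R(w) \big| > \varepsilon \Big) = 0
\qquad \text{for every } \varepsilon > 0.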
6. Consistent training?
[Figure: two plots of % error versus number of training examples, each showing a training-error curve and a test-error curve.]
6
7. Generalization: definition
• The generalization capacity of a model describes how the model (e.g., its error function) will perform on data it has never seen before (data outside its training set).
• Good generalization means that the model's errors on new, unseen data will be of the same "size" as the known error on its training set. Such a model is also called "robust".
7
10. Vapnik approach to modeling (1)
• Vapnik's approach is based on a family of functions S = {f(X,w), w ∈ W}, from which the model is chosen as one specific function, described by a specific w.
• For Vapnik, the model function must, for a given row X, properly answer the question described by the target Y, i.e. predict Y; the quality of the answer is measured by a cost function Q.
• Different families of functions may provide the same "quality" of answer.
10
11. Vapnik approach to modeling (2)
• The whole trick is then to find a good family of functions S that not only answers the question described by the target Y in a "good way", but is also easy to understand, i.e. provides a good description that makes it easy to explain what underlies the data behaviour of the problem.
• VC dimension will be a key to understanding and controlling model robustness.
11
12. VC dimension - definition (1)
• Consider a sample (x1, ..., xL) from R^n.
• There are 2^L different ways to separate the sample into two sub-samples.
• A set S of functions f(X,w) shatters the sample if all 2^L separations can be realized by different functions f(X,w) from the family S.
12
13. VC dimension - definition (2)
A function family S has VC dimension h (h is an integer) if:
1) there is at least one sample of h vectors from R^n that can be shattered by functions from S, and
2) no sample of h+1 vectors from R^n can be shattered by any function from S.
13
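A minimal illustrative sketch (not part of the slides) of this definition: a brute-force shattering check, applied to a hypothetical family of threshold classifiers on the real line. Any single point is shattered, but no pair of points is, so this family has VC dimension 1.

import itertools
import numpy as np

def shatters(points, classifiers):
    """Return True if the classifier family shatters the points, i.e. every one
    of the 2^L possible labelings is realized by some classifier."""
    L = len(points)
    realized = {tuple(np.sign(f(points)).astype(int)) for f in classifiers}
    needed = set(itertools.product((-1, 1), repeat=L))
    return needed <= realized

# Hypothetical family: threshold classifiers f(x) = sign(x - t) on the real line.
thresholds = np.linspace(-2.0, 2.0, 401)
family = [lambda x, t=t: x - t for t in thresholds]

print(shatters(np.array([0.3]), family))        # True: one point is shattered
print(shatters(np.array([0.3, 0.7]), family))   # False: the labeling (+1, -1) is unrealizable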
14. Example: VC dimension
VC dimension:
- Measures the complexity of a solution (function).
- Is not directly related to the number of variables.
14
15. Other examples
• The VC dimension of hyperplanes in R^n is n+1.
• The VC dimension of the set of functions f(x,w) = sign(sin(w·x)), with c ≤ x ≤ 1, c > 0, where w is a free parameter, is infinite.
– Conclusion: the VC dimension is not always equal to the number n of input variables (X1, ..., Xn) of a given family S of functions from R^n to {-1,+1}.
15
16. Key Example: linear models -> y = <w|x> + b
• The VC dimension of the family S of linear models, under a constraint involving a constant C (the formulas are reconstructed below), depends on C and can take any value between 0 and n.
• This is the basis for Machine Learning approaches such as SVM (Support Vector Machines) or Ridge Regression.
16
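The formulas referred to on this slide were images and are not recoverable from the extraction. A plausible reconstruction, assuming the slide showed Vapnik's standard bound for norm-constrained hyperplanes on data contained in a ball of radius R, is:

% Assumed reconstruction (Vapnik's margin bound), not the slide's own formula:
S = \{\, f(x, w) = \langle w \mid x \rangle + b \;:\; \|w\| \le C \,\},
\qquad
h \;\le\; \min\!\left( R^2 C^2,\; n \right) + 1 .

Under this reading, tightening the constraint (smaller C) lowers the VC dimension regardless of the input dimension n, which is exactly the capacity-control lever used by SVMs and ridge regression.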
17. VC dimension: interpretation
• The VC dimension of S is an integer that measures the shattering (or separating) power ("complexity") of the function family S.
• We shall now show that the VC dimension, through a major theorem of Vapnik, gives a powerful indication of model consistency, hence "robustness".
17
18. What is a Risk Functional?
• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
[Figure: the risk functional R[f(x,w)] plotted over the parameter space (w), with its minimum at w*.]
18
19. Examples of Risk Functionals
• Classification:
– Error rate
– AUC
• Regression:
– Mean square error
19
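A short Python illustration of these risk functionals using scikit-learn metrics; the toy labels, scores and targets are made up for the example.

```python
# Minimal sketch of the risk functionals listed above, on toy predictions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

# Classification: error rate and AUC
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])   # scores f(x) used for the AUC
print("error rate:", 1 - accuracy_score(y_true, y_pred))
print("AUC       :", roc_auc_score(y_true, y_score))

# Regression: mean square error
t_true = np.array([1.0, 2.0, 3.0])
t_pred = np.array([1.1, 1.9, 3.3])
print("MSE       :", mean_squared_error(t_true, t_pred))
```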
20. Lift Curve
[Figure: lift curve, plotting the fraction of good customers selected against the fraction of customers selected, with customers ordered according to f(x) and the top-ranking customers selected; the ideal lift reaches 100% of the good customers before 100% of the customers are selected. The Gini index KI = M/O, with 0 ≤ KI ≤ 1, where M and O are the areas labelled on the figure for the model lift curve and the ideal lift curve.]
20
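Below is one possible implementation of the lift curve and of a KI-style Gini index, assuming (from the figure) that KI is the ratio of the area gained by the model's lift curve over the diagonal to the area gained by the ideal lift curve; this reading of M and O, and the helper names, are assumptions.

```python
# Sketch of a lift curve and a KI (Gini) index, assuming KI = M / O where
# M is the area between the model lift curve and the diagonal and O the
# area between the ideal lift curve and the diagonal (reading of the figure).
import numpy as np

def lift_curve(y_true, scores):
    """Fraction of good customers captured vs fraction of customers selected."""
    order = np.argsort(-scores)                  # customers ordered according to f(x)
    captured = np.cumsum(y_true[order]) / y_true.sum()
    selected = np.arange(1, len(y_true) + 1) / len(y_true)
    return selected, captured

def ki_index(y_true, scores):
    selected, captured = lift_curve(y_true, scores)
    rate = y_true.mean()
    ideal = np.minimum(selected / rate, 1.0)     # ideal lift: all good customers ranked first
    m = np.trapz(captured - selected, selected)  # area between model lift and diagonal
    o = np.trapz(ideal - selected, selected)     # area between ideal lift and diagonal
    return m / o

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
f = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3])
print("KI =", ki_index(y, f))                    # 1.0 for a perfect ranking, near 0 for random
```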
22. Learning Theory Problem (1)
• A model computes a function f(X, w)
• Problem: minimize over w the Risk Expectation
R(w) = ∫ Q(z, w) dP(z)
– w : a parameter that specifies the chosen model
– z = (X, y) : the possible values of the attributes (variables) and target
– Q : measures (quantifies) the model error cost
– P(z) : the underlying probability law (unknown) for the data z
22
23. Learning Theory Problem (2)
• We get L data points from the learning sample (z1, .., zL), assumed to be i.i.d. sampled from the law P(z).
• To minimize R(w), we start by minimizing the Empirical Risk over this sample (a small sketch follows the slide):
E(w) = (1/L) Σ i=1..L Q(zi, w)
• Examples of classical cost functions:
– classification (e.g. Q can be a cost function based on the cost of misclassified points)
– regression (e.g. Q can be a cost function of least-squares type)
23
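A minimal sketch of empirical risk minimization with a least-squares cost Q, using synthetic data as a stand-in for an i.i.d. sample drawn from the unknown law P(z).

```python
# Minimal sketch of empirical risk minimization with a least-squares cost
# Q(z, w) = (y - <w|x>)^2 over an i.i.d. sample (z_1, ..., z_L), z = (x, y).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
L, n = 200, 3
X = rng.normal(size=(L, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=L)          # data sampled from the (here known) law P(z)

def empirical_risk(w):
    # E(w) = (1/L) * sum_i Q(z_i, w), with a least-squares cost Q
    return np.mean((y - X @ w) ** 2)

w0 = minimize(empirical_risk, x0=np.zeros(n)).x    # w that minimizes the empirical risk
print("w minimizing the empirical risk:", np.round(w0, 3))
```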
24. Learning Theory Problem (3)
• Central problem for Statistical Learning Theory:
What is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?
• How to define and measure a generalization capacity
(“robustness”) for a model ?
24
25. Four Pillars for SLT (1 and 2)
• Consistency (guarantees generalization)
– Under what conditions will a model be consistent ?
• Model convergence speed (a measure for
generalization capacity)
– How does generalization capacity improve when
sample size L grows?
25
26. Four Pillars for SLT (3 and 4)
• Generalization capacity control
– How can model generalization be controlled efficiently, starting from the only information we have: our sample data?
• A strategy for good learning algorithms
– Is there a strategy that guarantees, measures and controls the generalization capacity of our learning model?
26
27. Vapnik main theorem
• Q: Under which conditions will a learning process (model) be consistent?
• A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h
• A finite VC dimension h not only guarantees generalization capacity (consistency); picking f from a family S with finite VC dimension h is the only way to build a model that generalizes
27
28. Model convergence speed (generalization
capacity)
• Q: What is the nature of the difference in model risk between learning data (sample: empirical risk) and test data (expected risk), for a sample of finite size L?
• A: This difference is no greater than a bound that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. h/L
This result is a theorem in the spirit of the Kolmogorov-Smirnov results, i.e. theorems that do not depend on the data’s underlying probability law.
28
29. Empirical risk minimization in LS case
• With probability 1-q, the following inequality holds:
R(w0) ≤ E(w0) + Φ(h/L, q)
where Φ is the confidence interval term, and w0 is the value of the parameter w that minimizes the Empirical Risk, i.e. E(w0) = min over w of E(w)
29
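The slide's own formula for the bound is not reproduced in the text; as an assumption, the sketch below plugs in the classical Vapnik confidence term for indicator losses, sqrt((h(ln(2L/h)+1) − ln(q/4))/L), simply to show that the confidence interval Φ grows with h/L.

```python
# Sketch of the confidence-interval term in the bound R(w0) <= E(w0) + Phi(h/L, q).
# The classical Vapnik form for indicator losses is assumed here
# (the slide's own formula is not reproduced in the transcript):
#   Phi = sqrt( (h * (ln(2L/h) + 1) - ln(q/4)) / L )
import numpy as np

def phi(h, L, q=0.05):
    return np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(q / 4)) / L)

L = 1000
for h in [5, 50, 200, 500]:
    print(f"h/L = {h / L:4.2f}  ->  confidence term = {phi(h, L):.3f}")
```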
31. “SRM” methodology: how to control model
generalization capacity
Expected Risk = Empirical Risk + Confidence Interval
• Minimizing the Empirical Risk alone will not always give a good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval
• What matters is not the numerical value of Vapnik’s bound, most often too large to be of any practical use; it is the fact that this bound is a non-decreasing function of the “richness”, i.e. shattering power, of the model function family
31
32. SRM strategy (1)
• With probability 1-q, R(w) ≤ E(w) + Φ(h/L, q)
• When h/L is too large, the second term of the inequality becomes large
• The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of this upper bound on R(w)
• To do this, one has to make h a controlled parameter
32
33. SRM strategy (2)
• Let us consider a sequence S1 ⊂ S2 ⊂ .. ⊂ Sn of model function families, with respective growing VC dimensions h1 < h2 < .. < hn
• For each family Si of the sequence, the inequality R(w) ≤ E(w) + Φ(hi/L, q) remains valid
33
34. SRM strategy (3)
SRM: find i such that the expected risk R(w) becomes minimum, for a specific h* = hi corresponding to a specific family Si of the sequence; build the model using f from Si
[Figure: risk as a function of model complexity (h/L): the empirical risk decreases, the confidence interval increases, and the total risk reaches its minimum at the best model, h*.]
34
35. How to choose h*: cross-validation
• The learning sample of size L is divided into two parts: a basic learning set of size L1 and a validation set of size L2
• For a given meta-parameter that controls the richness of the model family S, hence its h, a model is built on the basic learning set and its actual risk is measured on the validation set
• The meta-parameter is chosen so that the model’s actual risk is minimum on the validation set: this yields the best family, i.e. h* (see the sketch below)
• The final model is computed from this optimal family: the best trade-off between fit and robustness is achieved by construction
35
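A minimal sketch of this procedure, using nested polynomial families (the degree playing the role of the complexity meta-parameter) on a synthetic regression problem; the families, data and degree grid are illustrative assumptions.

```python
# Minimal sketch: choose h* by splitting the learning sample into a basic
# learning set and a validation set, over a sequence of nested polynomial
# families S1 ⊂ S2 ⊂ ... (the degree acts as the complexity meta-parameter).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.normal(size=200)
x_learn, x_val, y_learn, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

best_degree, best_risk = None, np.inf
for degree in [1, 2, 3, 5, 8, 12]:                       # growing VC dimension h1 < h2 < ...
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-6))
    model.fit(x_learn, y_learn)
    risk = np.mean((model.predict(x_val) - y_val) ** 2)   # actual risk estimated on the validation set
    if risk < best_risk:
        best_degree, best_risk = degree, risk
print("best family (degree):", best_degree, " validation risk:", round(best_risk, 4))
```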
36. Some Learning Machines
• Linear models
• Polynomial models
• Kernel methods
• Neural networks
• Decision trees
36
37. Learning Process
• Learning machines include:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees, Random Forests
• Learning is tuning:
– Parameters (weights w or α, threshold b)
– Hyperparameters (basis functions, kernels, number of
units, number of features/attributes)
37
38. Industrial Data Mining: implementation example
[Diagram: a System with inputs x1, x2, x3, .., xn and outputs y1, y2, .., yp produces the data (xk, yk). The modelling chain shown is: Data Preparation with Data Encoding into descriptors (hyper-parameters κ, σ); a Class of Models, here polynomials / ridge regression (parameters w, ridge hyper-parameter γ); a Loss Criterion, here KI (Gini index); and a Learning Algorithm, automatic via SRM.]
38
39. Data Encoding/Compression
• Encodes nominal and ordinal variables
numerically.
• Encodes continuous variables non-linearly.
• Compresses variables into robust categories.
• Handles missing values and outliers.
• This process includes adjustable hyper-
parameters.
39
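A generic pandas sketch of the operations listed above (numeric encoding of a nominal variable, robust binning of a continuous variable, explicit handling of missing values); the actual encoding used by the engine described in these slides is not specified, so this is only indicative.

```python
# Generic sketch of the encoding steps listed above (the engine's actual
# encoding is not specified in the slides; choices here are illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "green", "red"],        # nominal, with a missing value
    "income": [20e3, 50e3, 1e7, np.nan, 35e3],              # continuous, with outlier and NaN
})

# Nominal variable -> numeric codes, missing values kept as their own category
df["color_code"] = df["color"].fillna("MISSING").astype("category").cat.codes

# Continuous variable -> robust categories (quantile bins), NaN in a separate bin
df["income_bin"] = pd.qcut(df["income"], q=3, labels=False, duplicates="drop")
df["income_bin"] = df["income_bin"].fillna(-1).astype(int)  # -1 encodes "missing"
print(df)
```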
40. Multiple Structures
S1 ⊂ S2 ⊂ … ⊂ SN
• Weight decay / Ridge regression:
Sk = { w | ||w||^2 < ωk }, ω1 < ω2 < … < ωk
γ1 > γ2 > γ3 > … > γk (γ is the ridge)
• Feature selection:
Sk = { w | ||w||_0 < σk },
σ1 < σ2 < … < σk (σ is the number of features)
• Data compression:
κ1 < κ2 < … < κk (κ may be the number of clusters)
40
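A small sketch of the weight-decay structure in action: decreasing the ridge γ relaxes the implicit constraint on ||w||², so the fitted weight norm grows and the model moves through the nested structures S1 ⊂ S2 ⊂ …; data and γ values are illustrative.

```python
# Sketch: decreasing the ridge γ relaxes the constraint on ||w||^2, giving the
# nested structures S1 ⊂ S2 ⊂ ... (growing complexity). Values are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

for gamma in [100.0, 10.0, 1.0, 0.1, 0.01]:       # γ1 > γ2 > ... (shrinking ridge)
    w = Ridge(alpha=gamma).fit(X, y).coef_
    print(f"γ = {gamma:6.2f}   ||w||^2 = {np.sum(w**2):.3f}")
```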
41. Hyper-parameter selection
• w = parameter vector.
γ, σ, κ = hyper-parameters.
• Cross-validation with K-folds:
• For various values of γ, σ, κ:
– Adjust w on (K-1)/K of the training examples.
– Test on the remaining 1/K of the examples.
– Rotate the folds and average the test results (CV error).
– Select γ, σ, κ to minimize the CV error.
– Re-compute w on all training examples using the optimal γ, σ, κ (see the sketch below).
[Diagram: the data (X, y) are split into training data, divided into K folds for cross-validation, and test data kept aside for a prospective / “real” validation.]
41
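A minimal sketch of this K-fold procedure for a single hyper-parameter (the ridge γ) on synthetic data; the γ grid, K and the data are illustrative assumptions.

```python
# Minimal sketch of the K-fold procedure above: for each candidate γ, adjust w
# on K-1 folds, test on the remaining fold, average, then refit on all data
# using the γ that minimizes the CV error.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 15))
y = X[:, :3] @ np.array([1.0, -1.0, 2.0]) + 0.5 * rng.normal(size=200)

K = 5
best_gamma, best_cv = None, np.inf
for gamma in [100.0, 10.0, 1.0, 0.1, 0.01]:
    errors = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=gamma).fit(X[train_idx], y[train_idx])    # adjust w on K-1 folds
        errors.append(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))
    cv_error = np.mean(errors)                                        # rotate folds and average
    if cv_error < best_cv:
        best_gamma, best_cv = gamma, cv_error
final_model = Ridge(alpha=best_gamma).fit(X, y)     # re-compute w on all training examples
print("optimal γ:", best_gamma, " CV error:", round(best_cv, 4))
```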
42. SRM put to work : campaign optimization
[Figure: same lift-curve representation as on slide 20 (fraction of good customers selected vs. fraction of customers selected, customers ordered according to f(x), top-ranking customers selected, ideal lift, KI = M/O), now showing in addition the cross-validation (CV) lift and an index KR = 1 − G/O, where G and O are areas labelled on the figure.]
42
43. Summary
• Weight decay is a powerful means of avoiding overfitting.
• It is also known as “ridge regression”.
• It is grounded in the SRM theory.
• Multiple structures are used by most current DM engines: ridge, feature selection, data compression.
43
44. A few concrete examples
• Census: explain what determines whether someone earns more or less than $50,000/year
• Biostatistics data: feature reduction
44
45. Ockham’s Razor
• Principle proposed by William of
Ockham in the fourteenth
century: “Pluralitas non est
ponenda sine neccesitate”.
• Of two theories providing
similarly good predictions, prefer
the simplest one.
• Shave off unnecessary
parameters of your models.
45
46. Vision: the predictive modelling workbench
• Data mining / machine learning intervenes upstream to select, from a large set of variables and for a given problem, the “good” variables likely to support useful inference. This step can be “automated”
• The stratification, randomization and appropriate RCTs are then set up, based on these “particularly interesting” variables
• One finishes with tests on the results (a step that can also be automated)
• => an accelerator for producing results, for an ever more effective Evidence Based Policy
46