This document discusses the use of machine learning in official statistics. It begins by contrasting the traditional "data modeling" approach used in statistics with the newer "algorithmic modeling" approach used in machine learning. It then discusses how machine learning can be applied both in traditional statistical production processes based on primary survey data and in new multi-source production processes that incorporate alternative data sources such as big data. Specifically, machine learning can be used for tasks like imputation, outlier detection, and estimation. The document concludes that machine learning represents a paradigm shift for official statistics and is particularly well-suited for new data sources, as it prioritizes prediction over interpretability and generalizability.
antimo musone - We will talk about Machine Learning: what it is, what it is for, and what its fields of application are. We will analyze and see in action the various machine learning solutions available in the cloud (IBM's Watson and Microsoft's Azure ML), which allow companies, research centers, and developers to embed machine learning and predictive analytics over huge amounts of data into their applications, in order to offer ever more innovative and intelligent services. We will demo the platforms, revealing their pros and cons according to the needs we want to satisfy.
This presentation is based on "Statistical Modeling: The Two Cultures" by Leo Breiman. It compares the data modeling culture (statistics) with the algorithmic modeling culture (machine learning).
Improved correlation analysis and visualization of industrial alarm data (ISA Interchange)
The problem of multivariate alarm analysis and rationalization is complex and important in the area of smart alarm management due to the interrelationships between variables. Techniques for capturing and visualizing correlation information, especially directly from historical alarm data, are beneficial for further analysis. In this paper, the Gaussian kernel method is applied to generate pseudo-continuous time series from the original binary alarm data. This can reduce the influence of missed, false, and chattering alarms. By taking into account time lags between alarm variables, a correlation color map of the transformed (pseudo) data is used to show clusters of correlated variables, with the alarm tags reordered to better group the correlated alarms. Thereafter, correlation and redundancy information can easily be found and used to improve the alarm settings, and statistical methods such as singular value decomposition can be applied within each cluster to help design multivariate alarm strategies. Industrial case studies are given to illustrate the practicality and efficacy of the proposed method. This improved method is shown to be better than the alarm similarity color map when applied in the analysis of industrial alarm data.
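As a rough illustration of the kernel step described above, a binary alarm sequence can be convolved with a Gaussian kernel to produce a pseudo-continuous series. This is only a sketch: the kernel width, truncation, and toy alarm data below are invented for the example, not taken from the paper.

```python
import math

def gaussian_smooth(alarms, sigma=2.0):
    """Convolve a 0/1 alarm sequence with a Gaussian kernel to
    produce a pseudo-continuous time series."""
    n = len(alarms)
    half = int(3 * sigma)  # truncate the kernel at roughly +/- 3 sigma
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-half, half + 1)]
    norm = sum(kernel)
    smoothed = []
    for t in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            if 0 <= t + k < n:
                acc += alarms[t + k] * kernel[k + half]
        smoothed.append(acc / norm)
    return smoothed

# A chattering alarm (rapid on/off switching) becomes a smooth bump,
# which is why this transformation dampens chattering and false alarms.
series = [0, 1, 0, 1, 1, 0, 1, 0, 0, 0]
pseudo = gaussian_smooth(series, sigma=1.5)
```

Correlation analysis can then be run on the smoothed series instead of the raw binary flags.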
Data warehouses are structures holding large amounts of data collected from heterogeneous sources, to be used in a decision support system. Data warehouse analysis identifies hidden, initially unexpected patterns, but this analysis carries high memory and computation costs. Data reduction methods have been proposed to make it easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GAs), as evolutionary algorithms, and Multiple Correspondence Analysis (MCA), as a factor analysis method, to carry out this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, and aims to find the fact profile that is closest to a reference. GAs identify the candidate subsets, and the χ² formula of MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and n fact profiles extracted from the warehouse.
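The GA-plus-distance interplay described above might be sketched roughly as follows. The profiles, the χ²-style distance, the size penalty, and all GA parameters below are invented for illustration; the paper's actual MCA-based evaluation is richer than this stand-in.

```python
import random

random.seed(0)

def chi2_distance(a, b):
    """Chi-squared-style distance between two profiles."""
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)

def fitness(mask, reference, fact):
    """Distance over the selected dimensions only, with a small
    penalty so larger subsets are not favoured for free."""
    ref = [r for r, m in zip(reference, mask) if m]
    fct = [f for f, m in zip(fact, mask) if m]
    if not ref:
        return float("inf")
    return chi2_distance(ref, fct) + 0.01 * sum(mask)

def ga_reduce(reference, fact, p, pop_size=20, generations=30):
    """Tiny genetic algorithm: evolve bit-masks that select a
    reduced subset of the p dimensions."""
    pop = [[random.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, reference, fact))
        survivors = pop[: pop_size // 2]      # keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, p)      # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(p)           # point mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda m: fitness(m, reference, fact))

reference = [5.0, 1.0, 3.0, 2.0, 8.0, 1.0]
fact      = [5.1, 4.0, 3.0, 2.1, 0.5, 1.0]
best = ga_reduce(reference, fact, p=6)
```

The returned mask keeps the dimensions on which the fact profile agrees with the reference, dropping the others — the p' < p reduction the abstract describes.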
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env... (Intel® Software)
This session explains what the solutions desired by IT, Internet, and Silicon Valley companies can look like, how they may differ from those for more “classical” consumers of machine learning and analytics, and the challenges that current and future HPC development may have to cope with.
Collecting and analyzing data in real time doesn't have to be as stressful or hard as it sounds, especially if you want to collect real-time data using surveys. There is a short way and a long way to collect real-time survey data. The short way is to use software that can collect and analyze survey data when embedded into PowerPoint presentations or webinars. The long way is to use hard copies of surveys to collect the data and Excel to analyze it. This document will show you step by step how to collect and analyze survey data the long way.
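The "long way" tallying described above can also be scripted instead of done by hand in Excel. The question name and CSV layout below are invented for the example; any spreadsheet export with one row per respondent would work the same way.

```python
import csv
import io
from collections import Counter

# Hypothetical raw survey export, one row per respondent.
raw = """respondent,q1_satisfaction
1,Agree
2,Disagree
3,Agree
4,Neutral
5,Agree
"""

reader = csv.DictReader(io.StringIO(raw))
tally = Counter(row["q1_satisfaction"] for row in reader)

# Convert counts into the percentage breakdown you would otherwise
# build with a pivot table.
total = sum(tally.values())
percentages = {answer: 100.0 * n / total for answer, n in tally.items()}
```

With a real file, `io.StringIO(raw)` would be replaced by `open("survey.csv")`.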
Application of Exponential Gamma Distribution in Modeling Queuing Data (ijtsrd)
There are many events in daily life where a queue is formed. Queuing theory is the study of waiting lines, and it is crucial for analyzing queuing processes in everyday life. Queuing theory applies not only to daily life but also to computer program sequencing, networks, the medical field, the banking sector, and more. Researchers have applied many statistical distributions to queuing data. In this study, we apply a new distribution, the Exponential Gamma distribution, to fit data on the waiting time of bank customers before service is rendered. We compared the adequacy and performance of the results with other existing statistical distributions. The results show that the Exponential Gamma distribution is adequate and also performs better than the existing distributions. Ayeni Taiwo Michael | Ogunwale Olukunle Daniel | Adewusi Oluwasesan Adeoye, "Application of Exponential-Gamma Distribution in Modeling Queuing Data", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30097.pdf
Paper URL: https://www.ijtsrd.com/mathemetics/statistics/30097/application-of-exponential-gamma-distribution-in-modeling-queuing-data/ayeni-taiwo-michael
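The general workflow of fitting a waiting-time distribution can be illustrated with a plain gamma distribution and the method of moments; the Exponential-Gamma distribution of the paper is not in any standard library, so this is an illustrative stand-in, and the waiting times below are invented.

```python
import statistics

def gamma_method_of_moments(waits):
    """Fit an ordinary gamma distribution to waiting times by the
    method of moments: shape k = mean^2 / var, scale theta = var / mean."""
    mean = statistics.fmean(waits)
    var = statistics.pvariance(waits)
    k = mean * mean / var
    theta = var / mean
    return k, theta

# Hypothetical waiting times (minutes) of bank customers before service.
waits = [2.1, 3.4, 1.2, 5.6, 4.3, 2.8, 3.9, 6.1, 2.2, 3.0]
k, theta = gamma_method_of_moments(waits)
```

By construction k·θ reproduces the sample mean, so the fitted distribution has the right average waiting time; comparing such fits across candidate distributions (e.g. via a goodness-of-fit statistic) is the adequacy comparison the abstract describes.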
Counterfactual Learning for Recommendation (Olivier Jeunen)
Slides for our presentation at the REVEAL workshop for RecSys '19 in Copenhagen and a Data Science Leuven Meetup, titled "Counterfactual Learning for Recommendation".
Turnover Prediction of Shares Using Data Mining Techniques: A Case Study (csandit)
Predicting the total turnover of a company in the ever-fluctuating stock market has always proved to be a precarious and difficult task. Data mining is a well-known sphere of computer science that aims at extracting meaningful information from large databases. However, despite the existence of many algorithms for predicting future trends, their efficiency is questionable, as their predictions suffer from a high error rate. The objective of this paper is to investigate various existing classification algorithms to predict the turnover of different companies based on the stock price. The authorized dataset for predicting the turnover was taken from www.bsc.com and included the stock market values of various companies over the past 10 years. The algorithms were investigated using the ‘R’ tool. The feature selection algorithm Boruta was run on this dataset to extract the important and influential features for classification. With these extracted features, the total turnover of the company was predicted using algorithms such as Random Forest, Decision Tree, SVM, and Multinomial Regression. This prediction mechanism was implemented to predict the turnover of a company on an everyday basis and hence could help navigate dubious stock market trades. An accuracy rate of 95% was achieved by the above prediction process. Moreover, the importance of the stock market attributes was established as well.
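Boruta itself wraps a random forest, but the general flavor of ranking features by their relationship to the target can be sketched with a simple correlation filter. This is a deliberately simplified stand-in, not the paper's method, and the feature names and daily values below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(features, target):
    """Rank features by the absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical daily stock attributes vs. total turnover.
features = {
    "close_price": [10, 12, 11, 15, 14, 18],
    "num_trades":  [5, 3, 6, 2, 4, 1],
    "day_of_week": [1, 2, 3, 4, 5, 1],
}
turnover = [100, 125, 110, 160, 140, 185]
ranking = rank_features(features, turnover)
```

A proper wrapper method such as Boruta goes further by comparing each feature's importance against shuffled "shadow" copies, but the output is the same kind of ranked, pruned feature list used for the downstream classifiers.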
Machine Learning for Understanding Biomedical Publications (Grigorios Tsoumakas)
My keynote talk on "Machine Learning for Understanding Biomedical Publications" at the 12th Conference of the Hellenic Society for Computational Biology and Bioinformatics, HSCBB17
Single view vs. multiple views scatterplots (IJECEIAES)
Among all the available visualization tools, the scatterplot has been deeply analyzed through the years, and many researchers have investigated how to improve this tool to face new challenges. The scatterplot is considered one of the most functional of the data visual representations, due to its relative simplicity compared to other multivariable visualization techniques. Even so, one of the most significant unsolved challenges in data visualization is effectively displaying datasets with many attributes or dimensions, such as multidimensional or multivariate ones. The focus of this research is to compare the single view and the multiple views visualization paradigms for displaying multivariable datasets using scatterplots. A multivariable scatterplot has been developed as a web application to provide the single view tool, whereas for the multiple views visualization, the ScatterDice web app has been slightly modified and adopted as a traditional, yet interactive, scatterplot matrix. Finally, a taxonomy of tasks for visualization tools has been chosen to define the use case and the tests to compare the two paradigms.
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...) (Olivier Jeunen)
Slides for my Doctoral Symposium presentation at RecSys '19 in Copenhagen, titled "Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems".
Data mining is an interdisciplinary field, combining technologies from databases, machine learning, statistics, pattern recognition, information retrieval, artificial neural networks, knowledge-based systems, and data visualization. In practical terms, data mining is the exploration of large data sets to find hidden relationships and to summarize the data in forms that are both valid and understandable to the data owner. Clustering is an unsupervised procedure that partitions data items into groups so that items in the same group are more similar to one another than to items in other groups, according to some similarity or distance measure. Cluster analysis identifies groups of similar objects and is one of the most widely used methods in practical data mining applications. The objects being grouped can be physical, such as students, or abstract, such as customer behavior or handwriting. Many clustering algorithms have been proposed, falling into different families of clustering methods. The intention of this paper is to provide a classification of some prominent clustering algorithms.
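A minimal version of the clustering idea described above is k-means: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The points, choice of k, and iteration count below are invented for the example, and real implementations add restarts and convergence checks.

```python
import random

random.seed(1)

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples: alternate nearest-centroid assignment
    and centroid recomputation for a fixed number of iterations."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # leave a centroid in place if it lost all points
                centroids[i] = tuple(sum(c) / len(cluster)
                                     for c in zip(*cluster))
    return centroids, clusters

# Two visually obvious groups of 2-D points.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids, clusters = kmeans(points, k=2)
```

The items within each returned cluster are close to one another and far from the other cluster, which is exactly the similarity criterion the abstract states.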
This slide deck discusses predictive data analytics models and their applications in a broader context. It gives simple examples of regression and classification.
Re-Mining Association Mining Results Through Visualization, Data Envelopment ... (ertekg)
Download link: https://ertekprojects.com/gurdal-ertek-publications/blog/re-mining-association-mining-results-through-visualization-data-envelopment-analysis-and-decision-trees/
Re-mining is a general framework which suggests the execution of additional data mining steps based on the results of an original data mining process. This study investigates the multi-faceted re-mining of association mining results, develops and presents a practical methodology, and shows the applicability of the developed methodology through real world data. The methodology suggests re-mining using data visualization, data envelopment analysis, and decision trees. Six hypotheses, regarding how re-mining can be carried out on association mining results, are answered in the case study through empirical analysis.
Introduction to Data and Computation: Essential capabilities for everyone in ... (Kim Flintoff)
An overview seminar about the themes of the Curtin Institute for Computation, and some thoughts on the future role of these capabilities in Learning and Teaching.
Overview of machine learning concepts: overfitting and train/test splits; types of machine learning (supervised, unsupervised, and reinforcement learning); introduction to Bayes' theorem; linear regression (model assumptions; regularization: lasso, ridge, elastic net); classification and regression algorithms (Naïve Bayes, k-nearest neighbors, logistic regression, support vector machines (SVM), decision trees, and random forests); classification errors.
This slide deck gives a brief overview of supervised, unsupervised, and reinforcement learning. Algorithms discussed include Naive Bayes, k-nearest neighbors, SVM, decision trees, and Markov models. It also covers the difference between regression and classification, the difference between supervised and reinforcement learning, the iterative functioning of Markov models, and machine learning applications.
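The train/test split mentioned in the overviews above can be sketched in a few lines. The 80/20 ratio, the seed, and the toy data are illustrative choices, not anything prescribed by the slides.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test parts,
    so the model is evaluated on examples it never saw during training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
```

Holding out a test set like this is the basic defence against the overfitting problem those same overviews list first.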
Main points of this slide presentation:
1. What is statistics?
2. Applications
3. Applications of statistics in computer science and engineering
4. Machine learning's relation to statistics
5. Applications of statistics in data mining
6. Data mining's relation to statistics
7. Outline of applications
8. Details of some of these applications are given below
Thank you
Top 20 Data Science Interview Questions and Answers in 2023 (AnanthReddy38)
Here are the top 20 data science interview questions along with their answers:
What is data science?
Data science is an interdisciplinary field that involves extracting insights and knowledge from data using various scientific methods, algorithms, and tools.
What are the different steps involved in the data science process?
The data science process typically involves the following steps:
a. Problem formulation
b. Data collection
c. Data cleaning and preprocessing
d. Exploratory data analysis
e. Feature engineering
f. Model selection and training
g. Model evaluation and validation
h. Deployment and monitoring
What is the difference between supervised and unsupervised learning?
Supervised learning involves training a model on labeled data, where the target variable is known, to make predictions or classify new instances. Unsupervised learning, on the other hand, deals with unlabeled data and aims to discover patterns, relationships, or structures within the data.
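The contrast in this answer can be made concrete with a toy example: a supervised 1-nearest-neighbour prediction consults the labels, while an unsupervised grouping of the same points never looks at them. The data, labels, and distance threshold below are invented for the illustration.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: predict the label of the closest labelled point."""
    best = min(range(len(train_points)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(train_points[i], query)))
    return train_labels[best]

def group_by_threshold(points, threshold=3.0):
    """Unsupervised: link points closer than a threshold into groups,
    discovering structure without any labels."""
    groups = []
    for p in points:
        for g in groups:
            if any(sum((a - b) ** 2 for a, b in zip(p, q)) < threshold ** 2
                   for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

points = [(1, 1), (2, 1), (9, 9), (10, 9)]
labels = ["low", "low", "high", "high"]
prediction = nearest_neighbor_predict(points, labels, (1.5, 1.2))  # uses labels
groups = group_by_threshold(points)                                # ignores labels
```

Both functions see the same points; only the supervised one can answer "what class is this new point?", while the unsupervised one can only report "these points belong together".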
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns the training data too well, resulting in poor generalization to new, unseen data. To prevent overfitting, techniques like cross-validation, regularization, and early stopping can be employed.
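Cross-validation, named in this answer as a guard against overfitting, amounts to generating k disjoint train/test index splits so every example is tested on exactly once. The fold count and data size below are illustrative.

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n examples."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
```

Averaging a model's score over the k test folds gives a generalization estimate that a single lucky split cannot.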
What is feature engineering?
Feature engineering involves creating new features from the existing data that can improve the performance of machine learning models. It includes techniques like feature extraction, transformation, scaling, and selection.
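Of the techniques this answer lists, scaling is the simplest to show; here is min-max scaling of one numeric feature. The column values are invented for the example.

```python
def min_max_scale(values):
    """Rescale a numeric feature to the [0, 1] range so features
    measured in different units become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column carries no information
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 25, 40, 60]
scaled = min_max_scale(ages)
```

Distance-based models (k-nearest neighbors, SVMs) in particular benefit, since otherwise the feature with the largest raw range dominates the distance.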
The importance of model fairness and interpretability in AI systems (Francesca Lazzeri, PhD)
Machine learning model fairness and interpretability are critical for data scientists, researchers and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them.
In this session, Francesca will go over a few methods and tools that enable you to "unpack" machine learning models, gain insights into how and why they produce specific results, assess your AI systems' fairness, and mitigate any observed fairness issues.
Using open-source fairness and interpretability packages, attendees will learn how to:
- Explain model prediction by generating feature importance values for the entire model and/or individual data points.
- Achieve model interpretability on real-world datasets at scale, during training and inference.
- Use an interactive visualization dashboard to discover patterns in data and explanations at training time.
- Leverage additional interactive visualizations to assess which groups of users might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.
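One common way to generate the feature importance values mentioned in the first bullet is permutation importance: shuffle one feature's column and measure how much a model's accuracy drops. This is a generic sketch, not the API of the open-source packages the session uses, and the toy "model" and data are invented.

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Score each feature by how much shuffling its column
    degrades the model's accuracy on (X, y)."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the feature/target relationship
        permuted = [row[:j] + (v,) + row[j + 1:]
                    for row, v in zip(X, column)]
        importances.append(base - accuracy(permuted))
    return importances

# A toy "model" that only ever looks at feature 0, so shuffling
# feature 1 should cost it nothing.
predict = lambda row: 1 if row[0] > 0.5 else 0
X = [(0.9, 0.1), (0.8, 0.7), (0.2, 0.9), (0.1, 0.3)]
y = [1, 1, 0, 0]
imps = permutation_importance(predict, X, y, n_features=2)
```

Because it only needs a `predict` function, the same recipe works for any black-box model, which is what makes it useful for the model-agnostic explanation dashboards described above.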
Data Science for Business Managers - An intro to ROI for predictive analytics (Akin Osman Kazakci)
This module addresses critical business aspects of launching a predictive analytics project. How to establish the relationship with business KPIs is discussed. A notion of a "data hunt", for planning and acquiring external data to improve predictions, is introduced. Model quality and its role in the ROI of data and prediction tasks are explained. The module concludes with a glimpse of how collaborative data challenges can improve predictive model quality in no time.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (ijcseit)
Prediction is at the heart of modern statistics, where accuracy matters most. Combining algorithms with sound statistical methods yields more accurate predictions from data sets. The widespread use of algorithms has simplified mathematical models and reduced manual calculation. Prediction is the essence of data science and machine learning applications, giving control over situations. Implementing any model requires proper feature extraction, which supports proper model building and, in turn, precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution, using feature engineering techniques to improve the accuracy of different machine learning models.
S. Corradini, L. Martinez, 30 November - 1 December 2021 -
Webinar: Workplace inclusion: the national picture and Istat's experience
Title: The employment situation of people with disabilities
L. Lavecchia, 30 November - 1 December 2021 -
Webinar: The information framework for the Green Deal: developments and information demand on energy issues
Title: Measuring energy poverty in Italy
V. Buratta, 30 November - 1 December 2021 -
Webinar: The data strategy: the European initiative and the national response
Title: Istat's role in the National and European Data Strategy
E. Fornero, 30 November - 1 December 2021 -
Webinar: Gender statistics by default: the paradigm shift in statistics and beyond
Title: Illusions, clichés, and truths in the fight against gender inequality
A. Perrazzelli, 30 November - 1 December 2021 -
Webinar: Gender statistics by default: the paradigm shift in statistics and beyond
Title: Gender quality to support growth
A. Tinto, 30 November - 1 December 2021 -
Webinar: The effects of the pandemic on life satisfaction and well-being: analyses and perspectives
Title: The impact of the pandemic on the subjective component of Equitable and Sustainable Well-being
L. Becchetti, 30 November - 1 December 2021 -
Webinar: The effects of the pandemic on life satisfaction and well-being: analyses and perspectives
Title: The pandemic through subjective indicators at the international level: a paradox?
G. Onder, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: The ISS mortality surveillance system and new perspectives
C. Romano, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: New tools and surveys for relevant information during an emergency
S. Prati, M. Battaglini, G. Corsetti, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: The challenge for demography: timeliness and quality of information
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
G. Barcaroli, The use of machine learning in official statistics
1. The use of machine
learning in official
statistics
Giulio Barcaroli
Istituto Nazionale di Statistica
0
2. ① Data science and Machine Learning (ML)
② Two cultures in statistical modeling: data vs algorithmic modeling, to explain or to predict?
③ Official statistics: what kind of modeling?
④ Why ML in official statistics?
⑤ ML in the traditional statistical information production process
⑥ ML in a multi-source production process
⑦ Some conclusions
1
Outline
3. 2
2
Data science and
Machine Learning
“Data science is a concept to unify statistics,
data analysis, machine learning and their related
methods in order to understand and analyze actual
phenomena with data.”
“Machine learning is a subset of artificial
intelligence in the field of computer science that
often uses statistical techniques to give
computers the ability to learn (i.e., progressively
improve performance on a specific task) with data,
without being explicitly programmed.”
(Wikipedia)
4. «Statistical Modeling: The Two Cultures» (Breiman, 2001)
“There are two cultures in the use of statistical modeling to reach
conclusions from data.
o One assumes that the data are generated by a given stochastic data
model.
o The other uses algorithmic models and treats the data mechanism as
unknown.
The statistical community has been committed to the almost exclusive use of
data models.
Algorithmic modeling, both in theory and practice, has developed rapidly in
fields outside statistics. It can be used both on large complex data sets and
as a more accurate and informative alternative to data modeling on smaller
data sets. If our goal as a field is to use data to solve problems, then we
need to move away from exclusive dependence on data models and adopt a
more diverse set of tools.”
3
3
From data modeling to algorithmic modeling
5. “Assuming a stochastic data model for the inside of the black box.
A common data model is that data are generated by independent draws from
response variables = f(predictor variables, random noise, parameters)
Model validation: Yes–no using goodness-of-fit tests and residual examination”
4
4
From data modeling to algorithmic modeling
6. “The analysis in this culture considers the inside of the box complex and unknown. Their approach is to
find a function f(x) — an algorithm that operates on x to predict the responses y. Their black box looks
like this:
Model validation. Measured by predictive accuracy”
5
5
From data modeling to algorithmic modeling
7. 6
6
Machine Learning
“A core objective of a learner is to generalize from its experience.
Generalization in this context is the ability of a learning machine
to perform accurately on new, unseen examples/tasks after
having experienced a learning data set.
Classification machine learning models can be validated by
accuracy estimation techniques like the holdout method,
which splits the data in a training and test set (conventionally 2/3
training set and 1/3 test set designation) and evaluates the
performance of the training model on the test set.
In comparison, the N-fold-cross-validation method randomly
splits the data in k subsets where the k-1 instances of the data
are used to train the model while the k-th instance is used to test
the predictive ability of the training model. In addition to the
holdout and cross-validation methods, bootstrap, which samples
n instances with replacement from the dataset, can be used to
assess model accuracy.”
(Wikipedia)
8. 7
7
Primary and secondary data: towards a multisource
environment
“NSOs are currently facing unprecedented pressure to evaluate how they operate. Years of declining response rates
to primary data collection efforts and the proliferation of readily accessible data, which has made it easier for private
companies to produce statistics, is putting into question the role of NSOs. In response, many NSOs are looking to
tap into these alternative data sources to supplement, or even replace, data collected by traditional means.”
(UNECE Machine Learning Team)
So, the shift is from primary data (survey data) where the only source is represented by data are collected for
statistical purposes, to secondary data (administrative or Big Data sources).
The nature of secondary data, in particular the volume and variety of these data, makes the algorithmic
approach more convenient than the data modeling approach.
But even in the classical production process based on primary data (described by the Generic Statistic Business
Process Model) Machine Learning can be competitive in many phases.
9. 8
8
Modeling in Official Statistics: primary data process
Modeling is widely used in Official Statistics.
In the standard statistical information production
process, model based or model assisted
techniques are adopted in
- Sampling design (stratification)
- Data integration (record linkage and statistical matching)
- Data editing and imputation
- Outlier detection and handling
- Total non response handling
- Estimation
10. 9
9
Use of models in primary data production process
Examples of implicit definition of models:
• stratification in sample design
• donor search for imputation
• population totals for calibration
Examples of explicit definition of models:
• models for imputation
• models for outlier detection
• models for calibration
11. 10
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
f ? 2 1 1
h 21 2 1 2
a 3 2 2 4
i ? 3 1 1
e 7 3 3 1
g 13 4 1 3
10
Example: imputation (donor search)
Given a dataset with a number of missing values on a
variable Y, impute them with the hot-deck donor method.
1. Order the dataset by other variables X1, X2, …,Xp
2. Scan the dataset starting from the first record.
Whenever there is a missing value in the Y variable,
impute the value of the previous record.
Implicit model:
Y = f(X1,X2, … Xp)
No parameters, no indications about the quality of the
imputation.
12. 11
Id Y X1 X2 X3
f ? 2 1 1
i ? 3 1 1
11
Example: model based imputation
(traditional approach)
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Complete
data
Incomplete
data
Fit models
using complete
data
Apply to
impute
Evaluate
fitting
13. 12
Id Y X1 X2 X3
f ? 2 1 1
i ? 3 1 1
12
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
Id Y X1 X2 X3
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Complete
data:
“ground
truth”
Training
set
Validate
set
Incomplete
data
Fit models
using training
set
Evaluate
performance
on validate set
Choose the
best and apply
to impute
Example: model based imputation
(Machine Learning approach)
14. 13
13
Use of ML in a multi-source production process
e-commerce
e-recruitment
e-tendering
…
32,000 enterprises
Sample
selection
Data
collection
on 19,000
enterprises
Machine
learning
Population
frame
(ASIA)
Reference population:
184,000 enterprises
Predictors
Survey
data
Websites
and social
networks
Big Data:
Internet as
Data Source
Web scraping +
text processing
Document
Terms
Matrix
11,700 websites
14,000 URLs
15. 14
14
Use of ML in a multi-source production process
e-commerce
e-recruitment
e-tendering
…
Reference population:
184,000 enterprises
Predictors
Websites
and social
networks
Big Data:
Internet as
Data Source
Web scraping +
text processing
Document
Terms
Matrix
85,000 websites
Population
frame
(ASIA)
Predictions
Survey
data
Estimation
Estimates
16. 15
15
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
1. Logistic Regression
2. Classification Trees
3. Ensembles (Bagging, Boosting,
Random Forests)
4. Naïve Bayes
5. Neural Networks
6. Support Vector Machines
7. …
17. 16
16
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
Learner Accuracy Recall Precision F1-measure
Naïve Bayes 0.84 0.56 0.56 0.56
Logistic 0.84 0.57 0.57 0.57
Decision
Tree
0.87 0.64 0.64 0.64
Neural Net 0.88 0.65 0.66 0.66
Bagging 0.88 0.66 0.67 0.67
SVM 0.90 0.62 0.76 0.68
Boosting 0.90 0.71 0.71 0.71
Random
Forest
0.90 0.73 0.73 0.73
18. 17
17
Estimator Formula Weighting Description
Design based /
model assisted
𝑌 = ∑ 𝑟 𝑦 𝑘 𝑤 𝑘
𝑘=1
𝑟
𝑤 𝑘 = 𝑁 𝑈
𝑤 𝑘 weights are obtained by a
calibration procedure making
use of known totals in the
population
Model based
𝑌 = ∑ 𝑈2 𝑦 𝑘 𝑤 𝑘
′
𝑘=1
𝑈2
𝑤 𝑘
′
= 𝑁 𝑈1
Count of the predicted values
𝑦 𝑘 for all units for which it
was possible reach their
websites (population 𝑈2
),
calibrated in order to make
them representative of all
the population having
websites (𝑈1
).
Combined
𝑌 = ∑ 𝑈2 𝑦 𝑘 +
∑ 𝑟1( 𝑦 𝑘 − 𝑦 𝑘)𝑤 𝑘
′′
+
∑ 𝑟2 𝑦 𝑘 𝑤 𝑘
′′′
∑ 𝑘=1
𝑟1
𝑤 𝑘
′′
= 𝑁 𝑈2
and
∑ 𝑘=1
𝑟2
𝑤 𝑘
′′′
= 𝑁 𝑈1−𝑈2
Estimates produced by using
both survey data and
predicted values.
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
19. 18
18
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
20. 19
19
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
21. 20
20
Some conclusions
1. The adoption of Machine Learning is a real paradigm shift for Official Statistics
2. Algorithmic modeling is particularly suitable for new data sources
3. The fundamental principle at the basis of ML is to privilege the prediction capability
of a model, regardless of its interpretability
4. Generalizability is the main requirement
5. The evaluation of the accuracy of predictions is the key to choose the model
6. ML approach can/should be adopted in the traditional production process based
on primary (survey) data
7. ML approach is often the only suitable in a multi-source production process, where
new sources (Big Data) require algorithmic solutions able to handle their volume and
variety
22. 21
21
The use of Machine Learning and Official Statistics
Thank you!
barcarol@istat.it