Nick Schmidt of BLDS, LLC presented to the Maryland AI meetup on June 4, 2019 (https://www.meetup.com/Maryland-AI). Nick discusses ideas of fairness and how they apply to machine learning. He explores recent academic work on identifying and mitigating bias, and how his work in lending and employment can be applied to other industries. Nick explains how to measure whether an algorithm is fair and also demonstrates the techniques that model builders can use to ameliorate bias when it is found.
This document discusses decision trees, which are supervised learning algorithms used for both classification and regression. It describes key decision tree concepts like decision nodes, leaves, splitting, and pruning. It also outlines different decision tree algorithms (ID3, C4.5, CART), attribute selection measures like Gini index and information gain, and the basic steps for implementing a decision tree in a programming language.
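The implementation steps it mentions can be sketched in a few lines of R; the rpart package and the iris data are assumptions here, since the document names no library:

# Grow a classification tree (Gini splits by default), then prune it.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)   # complexity table used to choose the pruning point
pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
pruned         # decision nodes and leaves of the pruned tree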
Logistic Regression in Python | Logistic Regression Example | Machine Learnin... - Edureka!
** Python Data Science Training : https://www.edureka.co/python **
This Edureka video on Logistic Regression in Python will give you a basic understanding of the Logistic Regression machine learning algorithm, with examples. In this video, you will also see a demo of Logistic Regression using Python. Below are the topics covered in this tutorial:
1. What is Regression?
2. What is Logistic Regression?
3. Why use Logistic Regression?
4. Linear vs Logistic Regression
5. Logistic Regression Use Cases
6. Logistic Regression Example Demo in Python
Subscribe to our channel to get video updates. Hit the subscribe button above.
Machine Learning Tutorial Playlist: https://goo.gl/UxjTxm
Uncertainty Quantification with Unsupervised Deep learning and Multi Agent Sy... - Bang Xiang Yong
Presented at the MET4FOF Workshop, July 2020
I talk about our recent work on combining Bayesian deep learning with Explainable Artificial Intelligence (XAI) methods. In particular, we look at Bayesian autoencoders.
Linear Regression and Logistic Regression in ML - Kumud Arora
Linear regression and logistic regression are statistical modeling techniques. Linear regression predicts continuous dependent variables using independent variables, while logistic regression predicts binary dependent variables. Both aim to model relationships between variables by estimating coefficients. Logistic regression models the log odds of the dependent variable rather than the variable directly. Key evaluation metrics for classification include accuracy, precision, recall, and F1 score, which are calculated from a confusion matrix.
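A minimal R sketch of both models and of the confusion-matrix metrics; the built-in datasets and the 0.5 threshold are illustrative assumptions:

lin <- lm(dist ~ speed, data = cars)                        # linear regression
log_fit <- glm(am ~ wt, data = mtcars, family = binomial)   # models log odds of am
p  <- ifelse(predict(log_fit, type = "response") > 0.5, 1, 0)
cm <- table(predicted = factor(p, levels = 0:1), actual = mtcars$am)
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm[2, 2] / sum(cm[2, ])
recall    <- cm[2, 2] / sum(cm[, 2])
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)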
Instance-based learning algorithms like k-nearest neighbors (KNN) and locally weighted regression are conceptually straightforward approaches to function approximation problems. These algorithms store all training data and classify new query instances based on similarity to near neighbors in the training set. There are three main approaches: lazy learning with KNN, radial basis functions using weighted methods, and case-based reasoning. Locally weighted regression generalizes KNN by constructing an explicit local approximation to the target function for each query. Radial basis functions are another related approach using Gaussian kernel functions centered on training points.
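A minimal KNN sketch in R with the class package (the package and the train/test split are assumptions):

# Lazy learning: all training rows are kept; queries are classified by
# majority vote among the k nearest training points.
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3)
mean(pred == iris$Species[-idx])   # accuracy on the held-out queries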
Genetic algorithm for hyperparameter tuning - Dr. Jyoti Obia
This document discusses using genetic algorithms to tune hyperparameters in predictive models. It begins by providing an overview of genetic algorithms, describing them as a heuristic approach that mimics natural selection to generate multiple solutions. It then defines key terms related to genetic algorithms and chromosomes. The document outlines the genetic algorithm methodology and provides pseudocode. It applies this approach to tune hyperparameters C and gamma in an SVM model and finds it achieves higher accuracy than grid search in less computation time. In an appendix, it references related work and describes a spam email dataset used to classify emails as spam or not spam.
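A hedged sketch of the idea in R with the GA and e1071 packages; the packages, the iris stand-in for the spam dataset, and the search ranges are all assumptions, not the paper's actual code:

library(GA); library(e1071)
fitness <- function(p) {   # p = c(cost, gamma); maximize cross-validated accuracy
  m <- svm(Species ~ ., data = iris, cost = p[1], gamma = p[2], cross = 5)
  m$tot.accuracy
}
res <- ga(type = "real-valued", fitness = fitness,
          lower = c(0.1, 1e-4), upper = c(100, 1), popSize = 20, maxiter = 15)
res@solution   # the evolved (C, gamma) pair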
With R, Python, Apache Spark and a plethora of other open source tools, anyone with a computer can run machine learning algorithms in a jiffy! However, without an understanding of which algorithms to choose and when to apply a particular technique, most machine learning efforts turn into trial and error experiments with conclusions like "The algorithms don't work" or "Perhaps we should get more data".
In this lecture, we will focus on the key tenets of machine learning algorithms and how to choose an algorithm for a particular purpose. Rather than just showing how to run experiments in R, Python, or Apache Spark, we will provide an intuitive introduction to machine learning with just enough mathematics and basic statistics.
We will address:
• How do you differentiate Clustering, Classification and Prediction algorithms?
• What are the key steps in running a machine learning algorithm?
• How do you choose an algorithm for a specific goal?
• Where does exploratory data analysis and feature engineering fit into the picture?
• Once you run an algorithm, how do you evaluate its performance?
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing - Salah Amean
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression analysis. It works by finding a hyperplane in an N-dimensional space that distinctly classifies the data points. SVM selects the hyperplane that has the largest distance to the nearest training data points of any class, since the larger the margin, the lower the generalization error of the classifier. SVM can efficiently perform nonlinear classification by implicitly mapping inputs into high-dimensional feature spaces.
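A minimal sketch in R, assuming the e1071 package (the summary names no library); the radial kernel supplies the implicit high-dimensional mapping:

library(e1071)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)  # cost controls the margin
table(predicted = predict(fit, iris), actual = iris$Species)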
Enterprise-manufacturing systems integration requires a methodology to maintain the overall life cycle of a production system.
PERA is a possible solution.
Tools and standardized data interchange formats are required to support the use of the methodology.
ISA-95 & B2MML are suitable implementations.
Fuzzy c means clustering protocol for wireless sensor networks - mourya chandra
This document discusses clustering techniques for wireless sensor networks. It describes hierarchical routing protocols that involve clustering sensor nodes into cluster heads and non-cluster heads. It then explains fuzzy c-means clustering, which allows data points to belong to multiple clusters to different degrees, unlike hard clustering methods. Finally, it proposes using fuzzy c-means clustering as an energy-efficient routing protocol for wireless sensor networks due to its ability to handle uncertain or incomplete data.
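The fuzzy c-means step itself (not the routing protocol) can be sketched in R; e1071::cmeans and the iris data are assumptions:

library(e1071)
cm <- cmeans(as.matrix(iris[, 1:4]), centers = 3, m = 2)  # m = fuzzifier
head(cm$membership)   # each point belongs to every cluster to some degree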
This document discusses XGBoost, an optimized distributed gradient boosting library. It begins by explaining what problems XGBoost can solve like binary classification, regression, and ranking. It then discusses the key concepts in XGBoost including boosted trees, GBDT, tree ensembles, and additive training. XGBoost builds an ensemble of trees using gradient boosting and additive training to minimize loss. It provides efficient algorithms for split finding to construct trees level-by-level to maximize the loss drop at each step.
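A minimal binary-classification run with the xgboost R package; the dataset and parameter values are illustrative assumptions:

library(xgboost)
X   <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
bst <- xgboost(data = X, label = mtcars$am, nrounds = 20,
               objective = "binary:logistic", max_depth = 3, verbose = 0)
head(predict(bst, X))   # predicted probabilities from the boosted tree ensemble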
This document introduces feature selection in data mining and knowledge discovery. It discusses various feature selection methods like filters, wrappers and embedded methods. It presents variable selection in regression using penalized least squares with l0, l1 and l2 penalties. Maximizing penalized likelihood can also lead to sparse solutions and variable selection. The document outlines desired properties for penalty functions and discusses applications in taxonomy, financial engineering and challenges in high dimensional feature selection.
Exploratory Data Analysis (EDA) was promoted by John Tukey in 1977 to encourage visually examining data without hypotheses. EDA uses graphical and non-graphical techniques like histograms, scatter plots, box plots to summarize variable characteristics. EDA allows understanding data distributions and relationships without models through inspection and information graphics. Common EDA goals are describing typical values, variability, distributions, and relationships between variables.
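Tukey-style EDA needs only a few lines of base R (dataset chosen purely for illustration):

summary(airquality)                         # typical values, spread, missing counts
hist(airquality$Ozone)                      # distribution of one variable
boxplot(Ozone ~ Month, data = airquality)   # variability across groups
pairs(airquality[, 1:4])                    # pairwise relationships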
This document discusses DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm. DBSCAN groups together closely packed points considered core points, and separates clusters based on density rather than assigning points to predefined clusters. It requires two parameters, epsilon which defines a neighborhood distance, and MinPts specifying the minimum number of points required to form a dense region. Points are classified as core, border or noise based on their epsilon-neighborhood. DBSCAN forms clusters by linking core points that are density-reachable from each other, and identifies outliers as noise points not belonging to any cluster.
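A minimal sketch with the dbscan R package (the package and the parameter values are assumptions):

library(dbscan)
x  <- as.matrix(iris[, 1:2])
db <- dbscan(x, eps = 0.4, minPts = 5)  # the two parameters described above
table(db$cluster)                       # cluster labels; 0 marks noise points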
The document describes the k-means clustering algorithm. It provides illustrations of how k-means works by randomly selecting initial cluster centroids, calculating the distance between data points and centroids, and iteratively assigning data points to the closest centroid. The goal is to group similar data points together into k clusters.
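That loop is exactly what base R's kmeans() implements; a minimal sketch:

set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)  # 10 random initializations
km$centers                        # final centroids
table(km$cluster, iris$Species)   # grouping vs. the known species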
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
IRJET - Road Accident Prediction using Machine Learning Algorithm - IRJET Journal
This document summarizes a research paper that predicts road accidents using machine learning algorithms. It discusses how large datasets have enabled data mining techniques to discover useful information. The paper aims to determine the most suitable machine learning classification technique for road accident prediction. It uses logistic regression, an algorithm that predicts a binary outcome (yes/no). The researchers clean the data, divide it into training and testing sets, and use logistic regression in Jupyter notebooks with the Python programming language. It provides percentage predictions of accident likelihood to users through a website interface. The results show logistic regression can accurately predict accidents for numerical data but has limitations for non-numerical text data.
This document discusses feature selection concepts and methods. It defines features as attributes that determine which class an instance belongs to. Feature selection aims to select a relevant subset of features by removing irrelevant, redundant and unnecessary data. This improves learning accuracy, model performance and interpretability. The document categorizes feature selection algorithms as filter, wrapper or embedded methods based on how they evaluate feature subsets. It also discusses concepts like feature relevance, search strategies, successor generation and evaluation measures used in feature selection algorithms.
This document summarizes a machine learning workshop on feature selection. It discusses typical feature selection methods like single feature evaluation using metrics like mutual information and Gini indexing. It also covers subset selection techniques like sequential forward selection and sequential backward selection. Examples are provided showing how feature selection improves performance for logistic regression on large datasets with more features than samples. The document outlines the workshop agenda and provides details on when and why feature selection is important for machine learning models.
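As a rough stand-in for the sequential forward selection described, base R's step() can grow a model forward by AIC (the dataset and candidate features are assumptions):

null <- glm(am ~ 1, data = mtcars, family = binomial)   # start from no features
fwd  <- step(null, scope = ~ mpg + wt + hp + qsec,
             direction = "forward", trace = 0)           # add one feature at a time
formula(fwd)   # the subset retained by forward selection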
The document provides an introduction to linear algebra concepts for machine learning. It defines vectors as ordered tuples of numbers that express magnitude and direction. Vector spaces are sets that contain all linear combinations of vectors. Linear independence and basis of vector spaces are discussed. Norms measure the magnitude of a vector, with examples given of the 1-norm and 2-norm. Inner products measure the correlation between vectors. Matrices can represent linear operators between vector spaces. Key linear algebra concepts such as trace, determinant, and matrix decompositions are outlined for machine learning applications.
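The named concepts take roughly one line each in base R:

v <- c(3, 4); w <- c(1, 0)
sum(abs(v))            # 1-norm = 7
sqrt(sum(v^2))         # 2-norm (magnitude) = 5
sum(v * w)             # inner product = 3
A <- matrix(c(2, 0, 0, 3), 2, 2)
sum(diag(A)); det(A)   # trace = 5, determinant = 6
eigen(A)$values        # a simple matrix decomposition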
Data analysis is very important in Lean Six Sigma: in the DMAIC framework the Measure phase is crucial. Likewise, checking for relationships between variables is decisive in evaluating phenomena and improving processes. Without being statistics experts, with Excel it is possible to obtain useful information, in a simplified way, for analysis and problem solving.
Notes on some classic sorting algorithms. It is a useful topic for introducing aspects of computational complexity and of improving algorithm efficiency by relying on less elementary algorithmic strategies and more sophisticated data structures.
[http://www.mat.uniroma3.it/users/liverani/doc/sort.pdf]
One of the techniques used in two-player games in which one player is a machine is min-max. We examine it and build an object-oriented implementation of it in C++.
Bivariate phenomena are those that can be characterized by studying two variables jointly. If both variables are quantitative, an interdependence analysis can be carried out; otherwise, association measures (for qualitative characters) are used.
ACP - Analisi delle componenti principali (Principal Component Analysis)
2. Data matrix (n x p)
Columns = quantitative variables X1, ..., Xj, ..., Xp; rows = units, each row being an INDIVIDUAL PROFILE:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & & & & & \\ \vdots & & & & & \\ x_{i1} & & & & & \\ \vdots & & & & & \\ x_{n1} & \cdots & & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
3. Objective:
Reduce the number of variables (from p to q < p) in the presence of a set of strongly correlated variables (= redundant information).
FACTORIZATION
Comprises a family of methods for singling out the variables that explain most of the information: high variance = dispersed points = greater explanatory power.
CORRESPONDENCE ANALYSIS (qualitative variables)
PRINCIPAL COMPONENT ANALYSIS (quantitative variables)
4. ESSENTIAL PREMISE: CORRELATION
If there are no significant correlations between the variables, factorization methods are not applicable. By significant correlations we mean that at least half of the correlation coefficients are greater than |0.3|; otherwise, each variable would represent a dimension of its own, a PC by itself.
And what if the variables were instead all highly correlated? There would be a single PC explaining nearly 100% of the total variability of the original point cloud, and the search for dimensions underlying the original variables would make no sense.
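This premise can be checked in a few lines of R, assuming the crimini.txt data used later in these slides matches R's built-in USArrests (same four variables, same 50 states):

R   <- cor(scale(USArrests))   # correlation matrix of the standardized variables
off <- R[upper.tri(R)]         # the p(p-1)/2 distinct coefficients
mean(abs(off) > 0.3)           # share of coefficients above |0.3|; want >= 0.5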
6. In mathematical terms...
Describe the global variability of a set of variables by means of a subset of new variables, called principal components, mutually uncorrelated (= independent), obtained as linear combinations of the original variables and ordered so that the first component captures the maximum share of variability:

$$Y_1 = a_{11}X_1 + a_{12}X_2 + a_{13}X_3 + \dots + a_{1p}X_p$$
$$\vdots$$
$$Y_i = a_{i1}X_1 + a_{i2}X_2 + a_{i3}X_3 + \dots + a_{ip}X_p$$

where the coefficient $a_{ij}$ represents the weight (loading) that each variable $X_j$ has in determining the component, and it is what permits the interpretation of the component itself.
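In R, prcomp() returns exactly these weights; a sketch under the same USArrests assumption:

pca <- prcomp(USArrests, scale. = TRUE)
pca$rotation   # columns = components, rows = the weights a_ij of each variable
# The component scores Y are the standardized data times the loadings:
all.equal(unname(pca$x), unname(scale(USArrests) %*% pca$rotation))  # TRUE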
7. Choosing the number of components:
• Scree plot: keep the components whose eigenvalue lies above the bend, or "elbow", of the plot (Harman, 1976);
• Cumulative variance threshold: retain only the principal components needed to reach a cumulative variability of roughly 75-80%. If the first component alone explains that much, stop at the first;
• Kaiser rule (Kaiser, 1960): retain only the principal components, and the corresponding eigenvectors, whose eigenvalues are greater than or equal to 1.
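The three rules, applied in R (USArrests assumed, as before):

pca <- prcomp(USArrests, scale. = TRUE)
eig <- pca$sdev^2                    # eigenvalues of the correlation matrix
screeplot(pca, type = "lines")       # look for the elbow
cumsum(eig) / sum(eig)               # stop once the cumulative share reaches ~0.75-0.80
which(eig >= 1)                      # Kaiser rule: keep eigenvalues >= 1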
8. PCA: aids to interpretation
Absolute contribution: indicates the contribution made by a variable to the construction of a factorial axis (its squared coordinate divided by the inertia associated with the axis, i.e. the eigenvalue).
The quality of the representation is a function of the absolute and relative contributions of the various points.
9. PCA: aids to interpretation
Relative contribution: indicates how well a variable is represented on an axis, keeping in mind that the projection cannot always reproduce the original distance between two points.
It is computed as the squared cosine of the angle between the vectors corresponding to the point in the original space and to its projection. The closer this value is to 1, the smaller the angle between the two vectors, and therefore the better the representation.
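Both aids are returned directly by FactoMineR, which the plots later in the slides appear to come from (an assumption; any PCA routine could be used):

library(FactoMineR)
res <- PCA(USArrests, scale.unit = TRUE, graph = FALSE)
res$var$cos2      # relative contributions (squared cosines), variables x axes
res$var$contrib   # absolute contributions, in % of each axis's inertia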
10. Dataset: crimini.txt (available online)
Sample: the 50 states of the United States of America
Principal component analysis: an example
11. Variables analyzed:
MURDER: number of arrests for murder (per 100,000 inhabitants)
ASSAULT: number of arrests for assault (per 100,000 inhabitants)
URBANPOP: percentage of urban population
RAPE: number of arrests for rape (per 100,000 inhabitants)
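A sketch of loading the data and running the PCA; the file format is an assumption, and R's built-in USArrests can stand in if the file is unavailable:

crimini <- read.table("crimini.txt", header = TRUE, row.names = 1)
# crimini <- USArrests   # equivalent stand-in with the same four variables
library(FactoMineR)
res <- PCA(crimini, scale.unit = TRUE, ncp = 4, graph = FALSE)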
13. Eigenvalue plots
[Figure: bar plot of the eigenvalues, scree plot, and cumulative-variance plot (in %) for the four components.]
Choosing the number of components:
(1) Scree plot: keep the components whose eigenvalue lies above the elbow;
(2) Cumulative variance threshold: retain only the principal components reaching a cumulative variability of roughly 75-80%;
(3) Kaiser rule: retain only the principal components whose eigenvalue is greater than or equal to 1.
14. Eigenvalue matrix
- Trace of the matrix = total inertia = sum of the eigenvalues. If the original variables are standardized: total inertia = number of original variables = 4;
- Eigenvalue = the eigenvalue (λj) = the inertia (variance) explained by the j-th principal component;
- % of variance = the share of total inertia explained by the j-th principal component;
- cumulative % of variance = the share of total inertia explained by the j-th principal component together with the components preceding it.
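In FactoMineR this is exactly the res$eig table (res as in the sketch above):

res$eig             # columns: eigenvalue, percentage of variance, cumulative percentage
sum(res$eig[, 1])   # total inertia; equals 4 for 4 standardized variables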
15. Variable output
N.B. In R's output, Dim.j = Comp j.
N.B. If the original variables are standardized: COORDINATE = CORRELATION.
To gauge the importance of each variable with respect to a factor, it is enough to look at its coordinates (correlations): the higher the coordinates, the closer the point is both to the circumference and to the axis, and the more weight it carries in the construction of that axis.
[Output tables: COORDINATES of the variables on the components; CORRELATIONS of the variables with the components.]
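In FactoMineR the two matrices can be compared directly (same assumed res object):

res$var$coord   # coordinates of the variables on each dimension
res$var$cor     # variable/dimension correlations; identical when scale.unit = TRUE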
16. Variable output
[Output tables: SQUARED COSINES; CONTRIBUTIONS.]
N.B. It is always true that COSINE² = CORRELATION², while only if the original variables are standardized does COSINE² = COORDINATE².
The squared cosine, or relative contribution, answers the question: how much of each variable does a component explain?
The contribution, or absolute contribution, answers the question: how much of the component (in terms of inertia) does each variable explain?
(Average absolute contribution = 100/4 = 25%)
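Both remarks can be verified numerically (same assumed res object):

all.equal(res$var$cos2, res$var$cor^2)   # cosine^2 = correlation^2, always
colSums(res$var$contrib)                 # each axis's contributions sum to 100,
                                         # so 4 variables average 100/4 = 25%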
17. Correlation circle: components 1-2
[Figure: variables factor map (PCA) on Dim 1 (62.01%) and Dim 2 (24.74%), with Murder, Assault, UrbanPop and Rape plotted inside the unit circle.]
N.B. Only the variables with cos² > 0.5 in this plane are shown.
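The map can be reproduced with FactoMineR's plotting method; the select argument below filters by cos² as the slide's note describes (a sketch, assuming the res object from earlier):

plot(res, choix = "var", axes = c(1, 2), select = "cos2 0.5")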
18. Individuals plot
[Figure: individuals factor map (PCA) on Dim 1 (62.01%) and Dim 2 (24.74%), plotting all 50 states, from Alabama to Wyoming.]
20. Biplot: variables and individuals
[Figure: biplot of Comp.1 vs Comp.2 overlaying the 50 states with arrows for the four variables Murder, Assault, UrbanPop and Rape.]
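Base R can draw the same kind of plot from a prcomp fit; a minimal sketch:

pca <- prcomp(USArrests, scale. = TRUE)
biplot(pca)   # states as points, the four variables as arrows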