Principal Component Analysis (PCA) and LDA PPT Slides (AbhishekKumar4995)
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), machine learning (ML) techniques used for dimensionality reduction, feature extraction, and analyzing large amounts of data, are explained simply and interactively with scatter plot graphs and 2D and 3D projections of the principal components (PCs) for better understanding.
2. Curse of Dimensionality
This exponential growth in data causes high sparsity in the data set and needlessly increases storage space and processing time for the modelling algorithm. Think of an image recognition problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e. 921,600 dimensions. OMG. And that is why it is called the Curse of Dimensionality.
Growth is exponential: every time we increase the number of dimensions, more data has to be added to fill the empty spaces.
3. Curse of Dimensionality
● Refers to the problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space.
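As a quick illustration (my own, not from the slides), the number of grid cells needed to cover a unit hypercube at a fixed resolution of 10 cells per axis grows as 10^d, which is why the data requirement explodes with dimension:

# Illustration only: cells needed to cover [0, 1]^d at 10 cells per axis.
# Keeping the data density constant requires exponentially more samples.
for d in (1, 2, 3, 10, 20):
    print(f"d={d:2d}: {10 ** d:,} cells")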
4. For some algorithms, the number of features controls over-fitting (e.g. logistic regression).
5. Principal Component Analysis
● A mathematical procedure that uses simple matrix operations from linear algebra and statistics to transform a number of correlated variables into a smaller number of uncorrelated variables called principal components.
● Emphasizes variation and brings out strong patterns in a dataset.
● Makes data easy to explore and visualize.
● Still contains most of the information in the large set.
6. Principal Component Analysis (contd...)
Figure 1A: The data are represented in the X-Y coordinate system
Figure 1B: The PC1 axis is the first principal direction along which the samples show the largest variation
7. Goals of PCA
● To identify hidden patterns in a data set
● To reduce the dimensionality of the data by removing the noise and redundancy in the data
● To identify correlated variables
8. Principal Component Analysis (contd...)
● The PCA method is particularly useful when the variables within the data set are highly correlated.
● Correlation indicates that there is redundancy in the data.
● PCA can be used to reduce the original variables into a smaller number of new variables (= principal components) that explain most of the variance in the original variables.
9. Steps to implement PCA on a 2D dataset
Step 1: Normalize the data
Step 2: Calculate the covariance matrix
Step 3: Calculate the eigenvalues and eigenvectors
Step 4: Choosing principal components
Step 5: Forming a feature vector
Step 6: Forming Principal Components
10. Step 1: Normalization
This is done by subtracting the respective mean from the numbers in each column. So if we have two dimensions X and Y, all X become x′ and all Y become y′:
For all X: x′ = X − μx
For all Y: y′ = Y − μy
This produces a dataset whose mean is zero.
12. Calculation of correlation (contd...)
Let x and y be two variables of length n, with means mx and my. The variance of x, the variance of y, and the covariance of x and y are given by the following equations:
var(x) = Σ (xi − mx)² / (n − 1)
var(y) = Σ (yi − my)² / (n − 1)
cov(x, y) = Σ (xi − mx)(yi − my) / (n − 1)
mx : mean of the x variable
my : mean of the y variable
13. Calculation of correlation (contd...)
Correlation is an index measuring how strongly two variables are related to each other. Its value ranges from −1 to +1 (i.e. −1 ≤ r ≤ +1).
If r < 0, the variables are negatively correlated (e.g. y decreases when x increases)
If r > 0, the variables are positively correlated (e.g. y increases when x increases)
If r = 0, the variables have no linear correlation
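As a quick check (my own example, not from the slides), NumPy's np.cov and np.corrcoef compute these quantities directly, using the same n − 1 denominators:

import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

# np.cov returns [[var(x), cov(x, y)], [cov(x, y), var(y)]]
print(np.cov(x, y))

# np.corrcoef returns the correlation matrix; entry [0, 1] is r
print(np.corrcoef(x, y)[0, 1])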
14. Step 3: Calculation of Eigenvalues and Eigenvectors
Calculate the eigenvalues and eigenvectors of the covariance matrix A (by solving the characteristic equation directly; iterative schemes such as the power method also work):
|A − λI| = 0
Where, I is an identity matrix of the same dimension as A and λ is an eigenvalue.
For each value of λ, the corresponding eigenvector v is obtained by solving:
(A − λI)v = 0
15. Step 4: Choosing Components
● Sort the eigenvalues from largest to smallest so that they give us the components in order of significance.
● If we have a dataset with n variables, then we have the corresponding n eigenvalues and eigenvectors.
● The eigenvector (v) corresponding to the largest eigenvalue (λ) is called the first principal component.
● To reduce the dimensions, we choose the first p eigenvalues and ignore the rest.
16. Step 5: Forming a feature vector
Form a feature vector consisting of the selected eigenvectors:
Feature Vector = (eig1, eig2)
17. Step 6: Forming Principal Components
NewData = FeatureVectorᵀ × ScaledDataᵀ
NewData is the matrix consisting of the principal components,
FeatureVector is the matrix we formed using the eigenvectors we chose to keep, and
ScaledData is the scaled (mean-centred) version of the original dataset.
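Putting the six steps together, here is a minimal from-scratch sketch in NumPy (my own illustration; the function and variable names are not from the slides), before the sklearn version on the next slide:

import numpy as np

def pca_2d(X, p=1):
    # Step 1: normalize (mean-centre each column)
    Xc = X - X.mean(axis=0)
    # Step 2: covariance matrix (rowvar=False: columns are variables)
    C = np.cov(Xc, rowvar=False)
    # Step 3: eigenvalues and eigenvectors (eigh suits a symmetric matrix)
    vals, vecs = np.linalg.eigh(C)
    # Step 4: order components by decreasing eigenvalue
    order = np.argsort(vals)[::-1]
    # Step 5: feature vector = first p eigenvectors as columns
    W = vecs[:, order[:p]]
    # Step 6: NewData = FeatureVector^T x ScaledData^T
    return (W.T @ Xc.T).T

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_2d(X, p=1))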
18. Sample Implementation with sklearn

# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA

# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)

# create the PCA instance
pca = PCA(2)

# fit on data
pca.fit(A)

# access the principal axes (eigenvectors) and explained variance (eigenvalues)
print(pca.components_)
print(pca.explained_variance_)

# transform data
B = pca.transform(A)
print(B)

Output:
[[1 2]
 [3 4]
 [5 6]]
[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
[ 8.00000000e+00  2.25080839e-33]
[[ -2.82842712e+00  2.22044605e-16]
 [  0.00000000e+00  0.00000000e+00]
 [  2.82842712e+00 -2.22044605e-16]]

Note that the three points of A lie on a straight line, so the second explained variance is essentially zero: the first component captures all the information.
19. Applications of PCA
● A commonly used dimensionality reduction technique in domains like facial recognition, computer vision, and image compression.
● Used for finding patterns in high-dimensional data in fields such as finance, data mining, bioinformatics, and psychology.
20. Linear Discriminant Analysis (LDA)
● The objective of LDA is to perform dimensionality reduction.
● However, we want to preserve as much of the class-discriminatory information as possible.
● LDA helps you find the boundaries around clusters of classes.
● It projects your data points onto a line so that your clusters are as separated as possible, with each cluster having a relatively close distance to its centroid.
● It is a supervised algorithm, as it takes the class labels into consideration.
● Assume we have a set of D-dimensional samples x(1), x(2), …, x(N), N1 of which belong to class ω1 and N2 to class ω2.
○ We seek to obtain a scalar y by projecting the samples x onto a line: y = wᵀx
○ Of all the possible lines, we would like to select the one that maximizes the separability of the scalars.
21. Linear Discriminant Analysis (Cont…)
● It determines a new dimension, which is nothing but an axis, that should satisfy two criteria:
○ Maximize the distance between the centroids of the classes.
○ Minimize the variation (which LDA calls scatter) within each category.
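These two criteria are usually combined into a single objective, the Fisher criterion, which the slides do not write out: choose w to maximize J(w) = (wᵀ SB w) / (wᵀ SW w), where SB and SW are the between-class and within-class scatter matrices. For two classes the maximizer is w ∝ SW⁻¹(μ1 − μ2). A minimal sketch in NumPy (my own illustration; X1 and X2 hold the samples of ω1 and ω2 as rows):

import numpy as np

def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Fisher's solution: w proportional to Sw^-1 (m1 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)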
24. In the LDA graph, you can visualize the whole dataset and differentiate well among the classes. This is the major difference between PCA and LDA.
25. Different approaches to LDA
● Class-dependent transformation:
○ Involves maximizing the ratio of between-class variance to within-class variance.
○ The main objective is to maximize this ratio so that adequate class separability is obtained.
○ This class-specific approach uses two optimizing criteria for transforming the data sets independently.
● Class-independent transformation:
○ Involves maximizing the ratio of overall variance to within-class variance.
○ Uses only one optimizing criterion to transform the data sets; hence all data points, irrespective of their class identity, are transformed using this single transform.
○ In this type of LDA, each class is considered as a separate class against all other classes.
26. Mathematical Operations
For ease of understanding, the concept is applied to a two-class problem. Each data set has 100 2-D data points.
1. Formulate the data sets and the test sets, which are to be classified in the original space. For ease of understanding, represent each data set as a matrix whose rows are feature vectors.
27. Mathematical Operations (Cont...)
2. Compute the mean of each data set and the mean of the entire data set. Let μ1 and μ2 be the means of set 1 and set 2 respectively, and let μ3 be the mean of the entire data, obtained by merging set 1 and set 2, given by Equation 1:
μ3 = p1 μ1 + p2 μ2    (1)
where p1 and p2 are the a priori probabilities of the classes. In the case of this simple two-class problem, each probability factor is assumed to be 0.5.
28. Mathematical Operations (Cont...)
3. In LDA, within-class and between-class scatter are used to formulate criteria for class separability. Within-class scatter is the expected covariance of each of the classes.
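The scatter equations themselves are not reproduced here (the intervening slide 29 appears to be missing). Under the standard definitions that this treatment appears to follow, they can be computed as in this sketch (my own; p1 = p2 = 0.5 are the priors assumed on the previous slide):

import numpy as np

def scatter_matrices(X1, X2, p1=0.5, p2=0.5):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    m3 = p1 * m1 + p2 * m2                      # overall mean (Equation 1)
    # Within-class scatter: prior-weighted sum of the class covariances
    Sw = p1 * np.cov(X1, rowvar=False) + p2 * np.cov(X2, rowvar=False)
    # Between-class scatter: spread of the class means around m3
    Sb = np.outer(m1 - m3, m1 - m3) + np.outer(m2 - m3, m2 - m3)
    return Sw, Sb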
30. Mathematical Operations (Cont...)
4. The transformations are found as the eigenvector matrices of the criteria defined in Equations 7 and 8.
● An eigenvector of a transformation represents a 1-D invariant subspace of the vector space in which the transformation is applied.
● To obtain a non-redundant set of features, only the eigenvectors corresponding to non-zero eigenvalues are considered; the ones corresponding to zero eigenvalues are neglected.
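For the class-independent transform this amounts to an eigendecomposition of SW⁻¹SB. A sketch (my own, reusing scatter_matrices from the block above; X1 and X2 are the two class data sets):

Sw, Sb = scatter_matrices(X1, X2)
criterion = np.linalg.solve(Sw, Sb)   # Sw^-1 Sb, the class-independent criterion
vals, vecs = np.linalg.eig(criterion)
keep = np.abs(vals) > 1e-12           # keep only non-zero eigenvalues
W = np.real(vecs[:, keep])            # transformation matrix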
31. Mathematical Operations (Cont...)
5. Having obtained the transformation matrices, we transform the data sets using the single LDA transform or the class-specific transforms, whichever the case may be.
Similarly, the test vectors are transformed and are classified using the Euclidean distance of the test vectors from each class mean.
32. Mathematical Operations (Cont...)
● Once the transformations are completed using the LDA transforms, Euclidean (RMS) distance is used to classify data points.
● The Euclidean distance is computed from each test vector to the mean of each transformed class.
● The smallest of the n Euclidean distances classifies the test vector as belonging to class n.
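A sketch of this decision rule (illustrative only; y is a transformed test vector and class_means holds the mean of each transformed class, e.g. [Y1.mean(axis=0), Y2.mean(axis=0)]):

import numpy as np

def classify(y, class_means):
    # Assign y to the class whose transformed mean is nearest
    dists = [np.linalg.norm(y - m) for m in class_means]
    return int(np.argmin(dists))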
33. PCA vs. LDA
● LDA and PCA are both linear transformation techniques: LDA is supervised whereas PCA is unsupervised (PCA ignores class labels).
● PCA performs better when the number of samples per class is small, whereas LDA works better on large datasets with multiple classes, where class separability is an important factor in reducing dimensionality.
● PCA is oriented toward feature extraction, while LDA is oriented toward separating (classifying) the data.
● In PCA, the shape and location of the original data sets change when transformed to a different space, whereas LDA does not change the location but only tries to provide more class separability and draw a decision region between the given classes.
34. References
● LDA and PCA for dimensionality reduction [https://sebastianraschka.com/faq/docs/lda-vs-pca.html]
● A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, Feb. 2001.
● Principal Component Analysis [http://setosa.io/ev/principal-component-analysis/]
35. References (contd...)
● Introduction to Principal Components and Factor Analysis [ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf]
● Introduction to PCA [https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial]
● Linear Discriminant Analysis - A Brief Analysis [http://www.sthda.com/english/wiki/print.php?id=206]