TECHNIQUES FOR BIG DATA FEATURE EXTRACTION USING DISTANCE COVARIANCE BASED PCA
Big Data
 'Big Data' is a blanket term for any collection of data
sets so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications.
 Big data requires exceptional technologies to efficiently
process large quantities of data within tolerable
elapsed times. A 2011 McKinsey report suggests
suitable technologies include crowdsourcing, data fusion
and integration, genetic algorithms, machine learning,
natural language processing, signal processing,
simulation, time series analysis and visualization.
How Big is Big Data?
 Very large, distributed aggregations of loosely structured data – often
incomplete and inaccessible.
 Petabytes/exabytes of data, millions/billions of people, billions/trillions of
records.
 Loosely-structured and often distributed data.
 Flat schemas with few complex interrelationships
 Often involving time-stamped events
 Often made up of incomplete data
 Often including connections between data elements that must be
probabilistically inferred.
 Applications that involve big data can be transactional (e.g., Facebook,
PhotoBox) or analytic (e.g., ClickFox, Merced Applications).
 (Reference Wikibon.org)
Big Data
Big data can be of three types:
1. Large number of attributes (>16)
2. Large number of samples
3. Large number of both attributes and samples
I have tried to work on the first case.
What is Dimensionality Reduction?
 Dimensionality reduction or dimension reduction
is the process of reducing the number of random
variables under consideration (or attributes or
features or descriptors), and can be divided into
feature selection and feature extraction.
Feature Selection
 Filters: score each feature without running a classifier, e.g. by Pearson’s correlation
 Wrappers: Run a classifier again and again, each
time with a new set of features selected using
backward selection or forward selection.
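A filter of this kind can be sketched in a few lines. This is only an illustration under assumptions: the helper names `pearson` and `filter_select`, the toy data, and the choice to rank by absolute correlation with a target column are not from the slides.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treat it as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def filter_select(columns, target, k):
    """Return the indices of the k columns with the largest |r| against target."""
    scores = [abs(pearson(col, target)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Hypothetical toy data: three candidate features and one target.
features = [
    [1, 2, 3, 4, 5],   # rises with the target
    [2, 1, 2, 1, 2],   # roughly unrelated to the target
    [5, 4, 3, 2, 1],   # falls as the target rises
]
target = [1.1, 1.9, 3.2, 3.9, 5.1]
selected = filter_select(features, target, k=2)
```

Unlike a wrapper, no classifier is trained here; each feature is scored once, which is why filters are cheap on data with many attributes.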
Feature Extraction
 Feature extraction transforms the data in the high-
dimensional space to a space of fewer dimensions.
The data transformation may be linear, as in
principal component analysis (PCA), but many
nonlinear dimensionality reduction techniques also
exist. For multidimensional data, tensor
representation can be used in dimensionality
reduction through multilinear subspace learning.
Feature Extraction
 The main linear technique for dimensionality
reduction, principal component analysis, performs a
linear mapping of the data to a lower-dimensional
space in such a way that the variance of the data in
the low-dimensional representation is maximized.
What is Principal Component Analysis?
 Principal component analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of observations
of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of
principal components is less than or equal to the number of original
variables. This transformation is defined in such a way that the first
principal component has the largest possible variance (that is,
accounts for as much of the variability in the data as possible), and
each succeeding component in turn has the highest variance possible
under the constraint that it is orthogonal to (i.e., uncorrelated with)
the preceding components. Principal components are guaranteed to
be independent if the data set is jointly normally distributed. PCA is
sensitive to the relative scaling of the original variables.
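The definition above can be illustrated with a minimal sketch that uses NumPy and the eigendecomposition of the sample covariance matrix. The function name `pca` and the synthetic data are assumptions made for illustration; production code would more likely use an SVD-based routine.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]          # columns are orthonormal directions
    return Xc @ components, eigvals[order], components

# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])
scores, variances, components = pca(X, n_components=2)
```

The projected columns in `scores` are uncorrelated and their variances equal the selected eigenvalues. Because the covariance matrix depends on the units of each column, rescaling one variable changes the components, which is the scaling sensitivity mentioned above.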
That is fine, but show me the MATH!
 Online tutorial
(http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)
PCA and BIG DATA
 Big data containing thousands of attributes requires a lot of
computation time on an average computer.
 PCA becomes an important tool for drawing
inferences from such large data sets.
What is Distance Correlation?
 Distance correlation is a measure of statistical
dependence between two random variables or two
random vectors of arbitrary, not necessarily equal
dimension. An important property is that this measure of
dependence is zero if and only if the random variables
are statistically independent. This measure is derived
from a number of other quantities that are used in its
specification, specifically: distance variance, distance
standard deviation and distance covariance. These
take the same roles as the ordinary moments with
corresponding names in the specification of the Pearson
product-moment correlation coefficient.
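For reference, the relationship between these quantities in the standard definition can be written as follows. The notation (dCov, dVar, dCor) is the usual one and is assumed here, since the slides do not fix symbols.

```latex
\operatorname{dVar}^2(X) = \operatorname{dCov}^2(X, X), \qquad
\operatorname{dCor}^2(X, Y) =
  \frac{\operatorname{dCov}^2(X, Y)}
       {\sqrt{\operatorname{dVar}^2(X)\,\operatorname{dVar}^2(Y)}}
```

When the denominator is zero, the distance correlation is defined to be 0. The form mirrors Pearson's coefficient, with distance covariance in place of covariance and the distance standard deviations in place of the ordinary ones.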
Distance Covariance Solved Example
Sample Data
Column 1 Column 2
1 1
2 0
-1 2
0 3
Mean 0.5 1.5
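Distances
 For Column 1 (aij = pow(|ai^2 – aj^2|, 0.5)); the last column holds the row means
0     1.73  0     1     | 0.68
1.73  0     1.73  2     | 1.37
0     1.73  0     1     | 0.68
1     2     1     0     | 1.00
Column means: 0.68 1.37 0.68 1.00
Grand Mean: 0.933
 Note: this is not the Euclidean distance |ai – aj| used in the standard definition of distance covariance; for example, it assigns distance 0 to the values 1 and -1.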
Similarly
 Distances for column 2 (bij); the last column holds the row means
0     1     1.73  2.83  | 1.39
1     0     2     3     | 1.50
1.73  2     0     2.24  | 1.49
2.83  3     2.24  0     | 2.02
Column means: 1.39 1.50 1.49 2.02
Grand Mean: 1.60
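The distance tables for the two sample columns can be recomputed with a short script. This is a sketch under assumptions: the function names are made up for illustration, `slide_distance` implements the formula written on the slide with an absolute value added so the square root is always defined, and `euclidean_distance` is the ordinary |u - v| included for comparison.

```python
import math

def slide_distance(u, v):
    """Pairwise distance as written on the slide: sqrt(|u^2 - v^2|)."""
    return math.sqrt(abs(u * u - v * v))

def euclidean_distance(u, v):
    """Ordinary one-dimensional Euclidean distance |u - v|."""
    return abs(u - v)

def distance_matrix(values, dist):
    """Square matrix of dist(u, v) over all pairs of sample values."""
    return [[dist(u, v) for v in values] for u in values]

def row_means(matrix):
    return [sum(row) / len(row) for row in matrix]

col1 = [1, 2, -1, 0]   # Column 1 of the sample data
col2 = [1, 0, 2, 3]    # Column 2 of the sample data

a = distance_matrix(col1, slide_distance)
b = distance_matrix(col2, slide_distance)
a_means, b_means = row_means(a), row_means(b)
a_grand = sum(a_means) / len(a_means)
b_grand = sum(b_means) / len(b_means)

# The same table with the ordinary Euclidean distance, for comparison.
a_euclid = distance_matrix(col1, euclidean_distance)
```

Since each matrix is symmetric, its column means equal its row means. Comparing `a` with `a_euclid` shows where the two formulas disagree, most visibly for the pair of values 1 and -1.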
Centering both distance matrices
 Aij = aij – ~ai – ~aj + ~a
 where
 ~ai = mean of row i of a
 ~aj = mean of column j of a
 ~a = grand mean of a
Aij
-0.433 0.616 -0.433 0.250
0.616 -1.799 0.616 0.567
-0.433 0.616 -0.433 0.250
0.250 0.567 0.250 -1.067
Similarly
 We can calculate Bij
 Squared Distance Covariance = (sum over all i, j of Aij*Bij)/n^2
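The centering and averaging steps can be put together as follows. This is a minimal sketch, not the author's implementation: the function names are assumptions, and the pairwise distance follows the formula on the slide (with an absolute value), so the result differs from what a standard distance covariance routine based on |u - v| would return.

```python
import math

def distance_matrix(values):
    # Slide's pairwise formula sqrt(|u^2 - v^2|); replace the expression with
    # abs(u - v) to get the standard Euclidean version.
    return [[math.sqrt(abs(u * u - v * v)) for v in values] for u in values]

def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

def dcov_squared(x, y):
    """Average of the elementwise products Aij * Bij over all n^2 pairs."""
    A = double_center(distance_matrix(x))
    B = double_center(distance_matrix(y))
    n = len(x)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2

col1 = [1, 2, -1, 0]
col2 = [1, 0, 2, 3]
A = double_center(distance_matrix(col1))
value = dcov_squared(col1, col2)
```

A quick sanity check on any centered matrix is that every row and every column sums to zero, up to rounding.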
Distance Covariance Principal
Component Analysis
 After we have obtained the distance covariance matrix, we
can find its eigenvectors with the largest eigenvalues
and then use those eigenvectors to extract
new features.
 The original dataset can be multiplied by these eigenvectors
to generate the reduced dataset.
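One way to read this procedure is sketched below with NumPy. Several points are assumptions rather than details given in the slides: the "distance covariance matrix" is taken to be the attribute-by-attribute matrix of squared sample distance covariances, the pairwise distance is the standard |v_i - v_j|, and the names `centered_distances`, `distance_covariance_matrix` and `dpca` are hypothetical.

```python
import numpy as np

def centered_distances(v):
    """Double-centered pairwise distance matrix of a 1-D sample.

    Uses the standard distance |v_i - v_j|; the slides' variant
    sqrt(|v_i^2 - v_j^2|) could be substituted here.
    """
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_covariance_matrix(X):
    """p x p matrix whose (j, k) entry is the squared dCov of columns j and k."""
    n, p = X.shape
    centered = [centered_distances(X[:, j]) for j in range(p)]
    D = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            D[j, k] = (centered[j] * centered[k]).sum() / n ** 2
    return D

def dpca(X, n_components):
    """Project X onto the leading eigenvectors of the distance covariance matrix."""
    D = distance_covariance_matrix(X)
    eigvals, eigvecs = np.linalg.eigh(D)   # D is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    return X @ W, W                        # dataset multiplied by the eigenvectors

# The four-sample, two-column example from the slides.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 2.0], [0.0, 3.0]])
D = distance_covariance_matrix(X)
reduced, W = dpca(X, n_components=1)
```

Each entry of `D` needs an n x n distance matrix per attribute, so memory and time grow quadratically with the number of samples. Ordinary PCA usually centers the data before projecting; that step is left out here to follow the description above.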
PCA vs D-PCA
 The classical measure of dependence, the Pearson
correlation coefficient, is mainly sensitive to a linear
relationship between two variables. Distance correlation
was introduced in 2005 by Gábor J. Székely in several
lectures to address this deficiency of Pearson’s
correlation, namely that it can easily be zero for
dependent variables. Correlation = 0
(uncorrelatedness) does not imply independence while
distance correlation = 0 does imply independence. The
first results on distance correlation were published in
2007 and 2009.
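A small example makes the difference concrete: with y = x^2 on a symmetric sample, y is completely determined by x, yet the Pearson correlation is zero. The sketch below uses the standard distance |u - v| and hypothetical helper names; the data set is made up for illustration.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def centered(v):
    """Double-centered matrix of pairwise distances |p - q|."""
    n = len(v)
    d = [[abs(p - q) for q in v] for p in v]
    means = [sum(row) / n for row in d]   # symmetric: row means = column means
    grand = sum(means) / n
    return [[d[i][j] - means[i] - means[j] + grand for j in range(n)]
            for i in range(n)]

def dcov2(x, y):
    A, B = centered(x), centered(y)
    n = len(x)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2

def dcor(x, y):
    denom = math.sqrt(dcov2(x, x) * dcov2(y, y))
    return math.sqrt(dcov2(x, y) / denom) if denom > 0 else 0.0

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # y depends on x, but not linearly
r = pearson(x, y)        # zero: no linear relationship
d = dcor(x, y)           # clearly positive: the dependence is detected
```

On a finite sample the distance correlation of independent variables is generally small but not exactly zero, so in practice it is compared against a threshold or a permutation test rather than against 0.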
Confusion Matrix
Modifications of D-PCA
 1. pow((ai^2 – aj^2),0.5)/ai+aj
 2. pow((ai^2 – aj^2),0.5)/ai
 These modifications can be used to scale the data,
which can then eliminate the normalization step.
Results
Drawbacks
 Cannot handle time series data
 Cannot handle noisy data
 Assumes data distribution to be normal
 Sensitive to scaling of the data
Future work
 Rank correlation
 Distance based source separation

More Related Content

What's hot

KNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing ValuesKNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing Valuesijeei-iaes
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alRazzaqe
 
Cannonical Correlation
Cannonical CorrelationCannonical Correlation
Cannonical Correlationdomsr
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataIRJET Journal
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET Journal
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...ijiert bestjournal
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7Birat Sharma
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 

What's hot (18)

KNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing ValuesKNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing Values
 
Data reduction
Data reductionData reduction
Data reduction
 
Canonical correlation
Canonical correlationCanonical correlation
Canonical correlation
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et al
 
Cannonical Correlation
Cannonical CorrelationCannonical Correlation
Cannonical Correlation
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Datamining
DataminingDatamining
Datamining
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms Comparison
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
AI: Belief Networks
AI: Belief NetworksAI: Belief Networks
AI: Belief Networks
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 

Viewers also liked

Theory and the review of related literature
Theory and the review of related literatureTheory and the review of related literature
Theory and the review of related literatureJesullyna Manuel
 
Image feature extraction
Image feature extractionImage feature extraction
Image feature extractionRushin Shah
 
Matlab Feature Extraction Using Segmentation And Edge Detection
Matlab Feature Extraction Using Segmentation And Edge DetectionMatlab Feature Extraction Using Segmentation And Edge Detection
Matlab Feature Extraction Using Segmentation And Edge DetectionDataminingTools Inc
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsBuhwan Jeong
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
 

Viewers also liked (6)

Theory and the review of related literature
Theory and the review of related literatureTheory and the review of related literature
Theory and the review of related literature
 
Image feature extraction
Image feature extractionImage feature extraction
Image feature extraction
 
Matlab Feature Extraction Using Segmentation And Edge Detection
Matlab Feature Extraction Using Segmentation And Edge DetectionMatlab Feature Extraction Using Segmentation And Edge Detection
Matlab Feature Extraction Using Segmentation And Edge Detection
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 

Similar to Slides distancecovariance

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesIRJET Journal
 
A study on rough set theory based
A study on rough set theory basedA study on rough set theory based
A study on rough set theory basedijaia
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
 
High dimensionality reduction on graphical data
High dimensionality reduction on graphical dataHigh dimensionality reduction on graphical data
High dimensionality reduction on graphical dataeSAT Journals
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...IRJET Journal
 
IRJET- Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...
IRJET-  	  Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...IRJET-  	  Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...
IRJET- Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...IRJET Journal
 
Building Azure Machine Learning Models
Building Azure Machine Learning ModelsBuilding Azure Machine Learning Models
Building Azure Machine Learning ModelsEng Teong Cheah
 
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Mumbai Academisc
 
Survey on Supervised Method for Face Image Retrieval Based on Euclidean Dist...
Survey on Supervised Method for Face Image Retrieval  Based on Euclidean Dist...Survey on Supervised Method for Face Image Retrieval  Based on Euclidean Dist...
Survey on Supervised Method for Face Image Retrieval Based on Euclidean Dist...Editor IJCATR
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine LearningMehwish690898
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Jayanti Pande
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
SHAHBAZ_TECHNICAL_SEMINAR.docx
SHAHBAZ_TECHNICAL_SEMINAR.docxSHAHBAZ_TECHNICAL_SEMINAR.docx
SHAHBAZ_TECHNICAL_SEMINAR.docxShahbazKhan77289
 
Feature selection using PCA.pptx
Feature selection using PCA.pptxFeature selection using PCA.pptx
Feature selection using PCA.pptxbeherasushree212
 

Similar to Slides distancecovariance (20)

M5.pptx
M5.pptxM5.pptx
M5.pptx
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
 
A study on rough set theory based
A study on rough set theory basedA study on rough set theory based
A study on rough set theory based
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
 
High dimensionality reduction on graphical data
High dimensionality reduction on graphical dataHigh dimensionality reduction on graphical data
High dimensionality reduction on graphical data
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...
 
IRJET- Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...
IRJET-  	  Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...IRJET-  	  Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...
IRJET- Comparative Study of PCA, KPCA, KFA and LDA Algorithms for Face Re...
 
Building Azure Machine Learning Models
Building Azure Machine Learning ModelsBuilding Azure Machine Learning Models
Building Azure Machine Learning Models
 
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
 
Survey on Supervised Method for Face Image Retrieval Based on Euclidean Dist...
Survey on Supervised Method for Face Image Retrieval  Based on Euclidean Dist...Survey on Supervised Method for Face Image Retrieval  Based on Euclidean Dist...
Survey on Supervised Method for Face Image Retrieval Based on Euclidean Dist...
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine Learning
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
SHAHBAZ_TECHNICAL_SEMINAR.docx
SHAHBAZ_TECHNICAL_SEMINAR.docxSHAHBAZ_TECHNICAL_SEMINAR.docx
SHAHBAZ_TECHNICAL_SEMINAR.docx
 
Feature selection using PCA.pptx
Feature selection using PCA.pptxFeature selection using PCA.pptx
Feature selection using PCA.pptx
 

Recently uploaded

Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Recently uploaded (20)

Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

Slides distancecovariance

  • 1. TECHNIQUES FOR BIG DATA FEATURE EXTRACTION USING DISTANCE COVARIANCE BASED PCA
  • 2. Big Data: 'Big Data' is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualization.
  • 3. How Big is Big Data? Very large, distributed aggregations of loosely structured data, often incomplete and inaccessible. Petabytes/exabytes of data; millions/billions of people; billions/trillions of records. Loosely structured and often distributed data. Flat schemas with few complex interrelationships. Often involving time-stamped events. Often made up of incomplete data. Often including connections between data elements that must be probabilistically inferred. Applications that involve big data can be transactional (e.g., Facebook, PhotoBox) or analytic (e.g., ClickFox, Merced Applications). (Reference: Wikibon.org)
  • 4. Big Data: Big data can be of three types: 1. Large number of attributes (>16); 2. Large number of samples; 3. Large number of both attributes and samples. I have tried to work on the first case.
  • 5. What is Dimensionality Reduction?  Dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration (or attributes or features or descriptors), and can be divided into feature selection and feature extraction.
  • 6. Feature Selection  Filters: Pearson’s Correlation  Wrappers: Run a classifier again and again, each time with a new set of features selected using backward selection or forward selection.
  • 7. Feature Extraction: Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.
  • 8. Feature Extraction  The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized
  • 9. What is Principal Component Analysis?  Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
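The PCA procedure described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the ten-row, two-attribute dataset and the helper names `covariance_matrix` and `dominant_eigenpair` are assumptions, and the first principal component is found with power iteration (a simple stand-in for a full eigen-decomposition).

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered

def dominant_eigenpair(m, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector.

    Assumes the start vector (1, 0, ..., 0) is not orthogonal to the
    dominant eigenvector; otherwise the iteration would not find it.
    """
    p = len(m)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

# Illustrative data (an assumption, not taken from the slides).
rows = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

cov, centered = covariance_matrix(rows)
eigenvalue, pc1 = dominant_eigenpair(cov)

# Project each centered observation onto the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

# Share of total variance carried by the first component.
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))
print(pc1, eigenvalue, explained)
```

Keeping only `scores` reduces the two attributes to one while retaining most of the variance, which is the sense in which PCA performs feature extraction.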
  • 10. That is fine, but show me the MATH! Online tutorial (http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)
  • 11. PCA and BIG DATA: Big data containing thousands of attributes requires a lot of computation time on an average computer. PCA becomes an important tool when drawing inferences from such large data sets.
  • 12. What is Distance Correlation?  Distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation and distance covariance. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.
  • 13. Distance Covariance Solved Example: Sample Data
                 Column 1   Column 2
                  1          1
                  2          0
                 -1          2
                  0          3
        Mean      0.5        1.5
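  • 14. Distances for Column 1, computed as $a_{ij} = \sqrt{|a_i^2 - a_j^2|}$ (the standard definition of distance covariance uses the Euclidean distance $|a_i - a_j|$ instead):
        0      1.73   0      1
        1.73   0      1.73   2
        0      1.73   0      1
        1      2      1      0
        Row means (equal to the column means, since the matrix is symmetric): 0.68  1.365  0.68  1
        Grand mean: 0.932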
  • 15. Similarly, distances for Column 2 ($b_{ij}$):
        0      1      1.73   2.8
        1      0      2      3
        1.73   2      0      2.23
        2.8    3      2.23   0
        Row means (equal to the column means): 1.38  1.5  1.49  2.01
        Grand mean: 1.595
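  • 16. Centering both the columns: $A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$, where $\bar{a}_{i\cdot}$ is the mean of row $i$, $\bar{a}_{\cdot j}$ is the mean of column $j$, and $\bar{a}_{\cdot\cdot}$ is the grand mean of the distance matrix.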
  • 17. $A_{ij}$ (each row and column of the centered matrix sums to zero):
        -0.43    0.62   -0.43    0.25
         0.62   -1.80    0.62    0.57
        -0.43    0.62   -0.43    0.25
         0.25    0.57    0.25   -1.07
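  • 18. Similarly, we can calculate $B_{ij}$. The squared sample distance covariance is then $\mathrm{dCov}^2 = \frac{1}{n^2} \sum_{i,j} A_{ij} B_{ij}$, where $n$ is the number of samples.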
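The worked example in slides 13 to 18 can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, and the pairwise distance is the one written on slide 14, sqrt(|a_i^2 - a_j^2|), with the absolute value added so the square root is always real. Passing a different `dist` function (for example `lambda x, y: abs(x - y)`) gives the standard Euclidean version.

```python
import math

def slide_distance(x, y):
    """Pairwise distance as written on slide 14: sqrt(|x^2 - y^2|)."""
    return math.sqrt(abs(x * x - y * y))

def distance_matrix(values, dist=slide_distance):
    """n x n matrix of pairwise distances between the samples."""
    return [[dist(x, y) for y in values] for x in values]

def double_center(d):
    """A_ij = d_ij - row mean_i - column mean_j + grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_covariance_sq(xs, ys, dist=slide_distance):
    """Squared sample distance covariance: (1/n^2) * sum_ij A_ij * B_ij."""
    n = len(xs)
    a = double_center(distance_matrix(xs, dist))
    b = double_center(distance_matrix(ys, dist))
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

# Sample data from slide 13.
column_1 = [1, 2, -1, 0]
column_2 = [1, 0, 2, 3]

A = double_center(distance_matrix(column_1))
for row in A:
    print([round(value, 2) for value in row])

dcov2 = distance_covariance_sq(column_1, column_2)
print(dcov2)
```

A quick sanity check on any double-centered matrix is that every row and every column sums to zero, up to floating-point error.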
  • 19. Distance Covariance Principal Component Analysis: After we have obtained the distance covariance between every pair of attributes, we can find the eigenvectors of the resulting distance covariance matrix with the largest eigenvalues and use those eigenvectors to extract new features. Multiplying the original dataset by these eigenvectors generates the reduced dataset.
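One possible reading of this procedure is sketched below; it is an illustration, not the author's code. Assumptions: the matrix holds the squared sample distance covariance between each pair of attributes, distances are the standard Euclidean |x - y| rather than the variant on slide 14, the leading eigenvector is found by power iteration, and all function names are hypothetical.

```python
import math

def centered_distances(values):
    """Double-centered matrix of pairwise Euclidean distances |x - y|."""
    n = len(values)
    d = [[abs(x - y) for y in values] for x in values]
    means = [sum(row) / n for row in d]  # d is symmetric: row means = column means
    grand = sum(means) / n
    return [[d[i][j] - means[i] - means[j] + grand for j in range(n)]
            for i in range(n)]

def distance_covariance_matrix(columns):
    """p x p matrix of squared sample distance covariances between attributes."""
    n, p = len(columns[0]), len(columns)
    centered = [centered_distances(col) for col in columns]
    m = [[0.0] * p for _ in range(p)]
    for r in range(p):
        for s in range(r, p):
            total = sum(centered[r][i][j] * centered[s][i][j]
                        for i in range(n) for j in range(n))
            m[r][s] = m[s][r] = total / (n * n)
    return m

def top_eigenvector(m, iterations=500):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    p = len(m)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project(rows, vector):
    """Multiply each original row by the eigenvector to get one new feature."""
    return [sum(x * w for x, w in zip(row, vector)) for row in rows]

# Sample data from slide 13, one row per observation.
rows = [[1, 1], [2, 0], [-1, 2], [0, 3]]
columns = [list(col) for col in zip(*rows)]

M = distance_covariance_matrix(columns)
v1 = top_eigenvector(M)
reduced = project(rows, v1)
print(M, v1, reduced)
```

To keep k new features instead of one, the same projection would be repeated with the k leading eigenvectors; power iteration alone returns only the first, so a full symmetric eigen-solver would be the natural replacement there.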
  • 20. PCA vs D-PCA: The classical measure of dependence, the Pearson correlation coefficient, is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gábor J. Székely in several lectures to address this deficiency of Pearson's correlation, namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence, while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.
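A small numerical illustration of this point, as a sketch with hypothetical helper names: for y = x^2 on a symmetric range, y is completely determined by x, yet the Pearson correlation is zero, while the sample distance correlation (computed here with the standard Euclidean distance |x - y|) is clearly positive.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def centered_distances(values):
    """Double-centered matrix of pairwise distances |x - y|."""
    n = len(values)
    d = [[abs(x - y) for y in values] for x in values]
    means = [sum(row) / n for row in d]
    grand = sum(means) / n
    return [[d[i][j] - means[i] - means[j] + grand for j in range(n)]
            for i in range(n)]

def dcov_sq(xs, ys):
    """Squared sample distance covariance."""
    n = len(xs)
    a, b = centered_distances(xs), centered_distances(ys)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def distance_correlation(xs, ys):
    """Sample distance correlation, defined as 0 when either variance is 0."""
    denom = math.sqrt(dcov_sq(xs, xs) * dcov_sq(ys, ys))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov_sq(xs, ys), 0.0) / denom)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # fully dependent on xs, but not linearly

r = pearson(xs, ys)
dcor = distance_correlation(xs, ys)
print(r, dcor)
```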
  • 22. Modifications of D-PCA: 1. $\sqrt{|a_i^2 - a_j^2|} / (a_i + a_j)$; 2. $\sqrt{|a_i^2 - a_j^2|} / a_i$. These modifications can be used to scale the data, which may then eliminate the normalization step.
  • 24. Drawbacks: Cannot handle time series data. Cannot handle noisy data. Assumes data distribution to be normal. Sensitive to scaling of the data.
  • 25. Future work: Rank correlation. Distance-based source separation.