Principal Components Analysis - PyBay 2016

Rumman Chowdhury, Senior Data Scientist at Metis, discusses the intuition behind PCA

Published in: Data & Analytics

  1. Dimensionality Reduction using Principal Components Analysis. Rumman Chowdhury, Senior Data Scientist. @ruchowdh, rummanchowdhury.com, thisismetis.com
  2. Who is Rumman? Me: Political Science PhD, Data Scientist, Teacher, Do-Gooder. Check me out on Twitter: @ruchowdh, or on my website: rummanchowdhury.com (psst, I post cool jobs there). What's a Metis? Metis accelerates the careers of data scientists by providing full-time immersive bootcamps, evening part-time professional development courses, online training, and corporate programs.
  3. What is PCA? Why do we need dimensionality reduction? Intuition behind Principal Components Analysis. Coding example.
  4. What is Principal Components Analysis?
  5. What is PCA? - A shift in perspective - A reduction in the number of dimensions
  6. Why do we need dimensionality reduction?
  7. Curse of Dimensionality
  8. Curse of Dimensionality. One dimension: small space; being close is quite probable. Axis: Cigarettes per day.
  9. Curse of Dimensionality. Two dimensions. Axes: Height, Cigarettes per day.
  10. Curse of Dimensionality. Two dimensions: more space, but still not so much; being close is not improbable. Axes: Height, Cigarettes per day.
  11. Curse of Dimensionality. Three dimensions. Axes: Height, Cigarettes per day, Exercise.
  12. Curse of Dimensionality. Three dimensions: much larger space; being close is less probable. Axes: Height, Cigarettes per day, Exercise.
  13. Curse of Dimensionality. Four dimensions. Axes: Height, Age, Cigarettes per day, Exercise.
  14. Curse of Dimensionality. Four dimensions: omg, so much space; being close is quite improbable. Axes: Age, Height, Cigarettes per day, Exercise.
  15. Curse of Dimensionality. Thousand dimensions: Helloooo… hellooo.. helloo… Can anybody hear meee.. mee.. mee.. mee.. So alone….
  16. Curse of Dimensionality. Thousand dimensions: I specified you with such high resolution, with so much detail, that you don't look like anybody else anymore. You're unique.
  17. Curse of Dimensionality. Classification, clustering and other analysis methods become exponentially more difficult as the number of dimensions increases. Axes: Height, Cigarettes per day.
  18. Curse of Dimensionality. Classification, clustering and other analysis methods become exponentially more difficult as the number of dimensions increases. To understand how to divide that huge space, we need a whole lot more data (usually much more than we do or can have). Axes: Height, Cigarettes per day.
  19. Dimensionality Reduction. Lots of features and lots of data is best. But what if you don't have the luxury of ginormous amounts of data? Not all features provide the same amount of information. We can reduce the dimensions (compress the data) without necessarily losing too much information. Axes: Height, Cigarettes per day.
  20. Feature Extraction. Do I have to choose the dimensions among existing features? Axes: Height, Cigarettes per day.
  21. Feature Extraction. Do I have to choose the dimensions among existing features? Axes: Height, Cigarettes per day.
  22. Why do we need dimensionality reduction? - To better perform analyses - …without sacrificing the information we get from our features - To better visualize our data
  23. What is the intuition behind PCA?
  24. Axes: Variable 1, Variable 2.
  25. Axes: Height, Cigarettes per day. New axes: PC 1, PC 2.
  26. Ducks and Bunnies. Axes: PC 1, PC 2.
  27. Axes: Height, Cigarettes per day. New axis: 0.398 (Height) + 0.602 (Cigarettes)
  28. Axes: Height, Cigarettes. New axis: 0.398 (Height) + 0.602 (Cigarettes)
  29. Feature selection versus feature extraction.
      2D: Healthy_or_not = logit( β1(Height) + β2(Cigarettes per day) )
      Feature selection, 1D: Healthy_or_not = logit( β1(Height) )
      Feature extraction, 1D: Healthy_or_not = logit( β1(0.4*Height + 0.6*Cigarettes per day) )
      Advantage: You retain more information. Disadvantage: You lose interpretability.
  30. 3D → 2D Feature Extraction (PCA). Axes: Height, Cigarettes, Exercise.
  31. 3D → 2D Feature Extraction (PCA). Optimum plane. Axes: Height, Cigarettes, Exercise.
  32. 3D → 2D Feature Extraction (PCA). Optimum plane. Axes: Height, Cigarettes, Exercise. New axes: A1*(Height) + B1*(Cigarettes) + C1*(Exercise) and A2*(Height) + B2*(Cigarettes) + C2*(Exercise)
  33. PCA Math: Singular Value Decomposition. The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.
  34. Matrix Selection: Correlation or Covariance Matrix? Use the correlation matrix to calculate the principal components if variables are measured on different scales and you want to standardize them, or if the variances differ widely between variables. You can use the covariance or correlation matrix in all other situations.
  35. How do I know how many dimensions to reduce by? Kaiser Method: retain any components with eigenvalues greater than 1. Scree Test: bar plot that shows the variance explained by each component; ideally you will see a clear drop-off (elbow). Percent Variance Explained: calculate the cumulative sum of the variance explained by each component, and stop when you reach your chosen threshold.
  36. What is the intuition behind PCA? - We are attempting to resolve the curse of dimensionality - by shifting our perspective - and keeping the eigenvectors that explain the highest amount of variance. - We select those components based on our end goal, or by particular methods (Kaiser, Scree, % Variance).
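
The "Curse of Dimensionality" slides argue that points which sit close together in one dimension drift apart as dimensions are added. A minimal sketch of that claim, assuming points drawn uniformly from a unit hypercube (the function name, the sample size of 200, and the random seed are illustrative choices, not part of the talk):

```python
import numpy as np


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each point to its closest neighbor.

    Points are sampled uniformly from the unit hypercube [0, 1)^n_dims.
    This sampling scheme is an assumption made for illustration.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))

    # Pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T

    # A point is not its own neighbor, so mask the diagonal.
    np.fill_diagonal(sq_dists, np.inf)

    # Clip tiny negative values caused by floating-point rounding.
    nearest = np.sqrt(np.maximum(sq_dists.min(axis=1), 0.0))
    return float(nearest.mean())


# Same number of points, more and more dimensions.
distances = {
    d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 3, 4, 1000)
}
for n_dims, distance in distances.items():
    print(f"{n_dims:>4} dimensions: mean nearest-neighbor distance = {distance:.4f}")
```

With the number of points held fixed, the average distance to the nearest neighbor grows with each added dimension, which is the "so alone" effect from the thousand-dimension slides.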

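
The outline promises a coding example, but none survives in this transcript. What follows is a hedged sketch, not the speaker's code: it walks through the PCA math, matrix selection, and component selection slides using NumPy only. The data is synthetic and invented for illustration (a hypothetical latent "health" factor drives cigarettes and exercise), and the 90% variance threshold is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic data: none of these numbers come from the talk.
health = rng.normal(size=n)                      # assumed latent factor
height = 170 + 10 * rng.normal(size=n)           # centimeters
cigarettes = 10 - 4 * health + 2 * rng.normal(size=n)
exercise = 3 + 1.5 * health + 0.5 * rng.normal(size=n)
X = np.column_stack([height, cigarettes, exercise])

# The variables are on different scales, so standardize them first.
# The covariance of the standardized data is the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.cov(Z, rowvar=False)

# Eigendecomposition: eigenvectors give the directions of the new axes,
# eigenvalues give the variance of the data along each of them.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# The same variances fall out of a singular value decomposition of Z.
singular_values = np.linalg.svd(Z, compute_uv=False)
svd_variances = singular_values ** 2 / (n - 1)

# Component selection.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
kaiser_keep = int((eigenvalues > 1.0).sum())     # Kaiser: eigenvalue > 1
threshold = 0.90                                 # assumed cutoff
variance_keep = int(np.searchsorted(cumulative, threshold) + 1)

# 3D -> 2D feature extraction: project onto the first two components.
# Each new feature is A*(Height) + B*(Cigarettes) + C*(Exercise),
# computed on the standardized variables.
scores = Z @ eigenvectors[:, :2]

print("eigenvalues:", np.round(eigenvalues, 3))
print("cumulative variance explained:", np.round(cumulative, 3))
print("components kept by Kaiser:", kaiser_keep)
print("components kept by % variance:", variance_keep)
print("weights of PC 1:", np.round(eigenvectors[:, 0], 3))
```

Plotting `explained_ratio` as a bar chart would give the scree plot. In practice a library implementation such as scikit-learn's `PCA` class is the more common route; the manual version is shown here only because it mirrors the eigenvector and eigenvalue description on the slides.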