
BigML Release: PCA

BigML brings Principal Component Analysis (PCA) to the platform, a key unsupervised Machine Learning technique that transforms a dataset to yield uncorrelated features and reduce dimensionality. BigML's PCA implementation differs from other approaches in that it can handle numeric and non-numeric data types, including text, categorical, and items fields, as well as combinations of different data types. PCA can be used in any industry vertical as a preprocessing technique to improve supervised learning performance, with the caveat that some measure of interpretability may be sacrificed. It is commonly applied in fields with high-dimensional data, including bioinformatics, quantitative finance, and signal processing.

Published in: Data & Analytics

  1. Introducing Principal Component Analysis: PCA Release
  2. BigML, Inc. BigML PCA Release Webinar, Fall 2018 Release. Speaker: Gregory Antell, Ph.D., Machine Learning Architect and Product Manager. Moderator: Atakan Cetinsoy, VP of Predictive Applications. Contact: support@bigml.com. Twitter: @bigmlcom. Questions: please enter questions into the chat box; we will answer some via chat and others at the end of the session. Resources: https://bigml.com/releases/fall-2018
  3. Agenda: Principal Component Analysis. (1) Utility in Machine Learning Workflows; (2) High Dimensional Data in Machine Learning; (3) PCA Intuition and Methodology; (4) Use Cases with the BigML Dashboard; (5) BigML Implementation
  4. Agenda: Principal Component Analysis (same five items as slide 3)
  5. Steps of an ML Application (diagram): Problem Formulation, Data Acquisition, Feature Engineering, Modeling and Evaluations, Predictions, Measure Results. Additional diagram labels: Data Transformations, Task
  6. Steps of an ML Application (same diagram as slide 5)
     • More often than changing models, improvement comes from more data or better features
     • Garbage In, Garbage Out principle
     • Model training and hyper-parameter tuning can be automated; feature engineering (mostly) cannot
  7. Steps of an ML Application (same diagram as slide 5). Callout text: "Today's release further expands what is possible in" (the sentence continues into the diagram)
  8. Agenda: Principal Component Analysis (same five items as slide 3)
  9. High-dimensional Data (figure): a data matrix with features F1 … FN as columns (p features) and instances I1 … IN as rows (n instances). Machine Learning typically performs better when n >>> p
  10. Dangers of High-dimensional Data
     • Implicitly increases model complexity, prone to overfitting
     • Requires more observations in order to generalize well
     • Contains correlated or useless variables
     • Data is difficult to visualize
     • Takes a longer time to train models or make predictions
     Principal Component Analysis addresses all of these issues
  11. Model Complexity and Training Data (plot: test error vs. number of training examples, for Model 1 and Model 2)
     • Models with lower complexity will converge to higher test error rates
  12. Model Complexity and Training Data (same plot, with regions labeled "Less Complex Model Wins" and "More Complex Model Wins")
     • Models with lower complexity will converge to higher test error rates
     • A threshold exists where enough training data is available to favor the more complex model
     • With a fixed amount of data, less complex models are often favored
  13. Combating High-dimensional Data
     • Model: pruning, node threshold
     • Ensemble: bagging, randomization
     • Logistic Regression: L1 and L2 penalties
     • Deepnet: dropout
  14. Dimensionality Reduction
     Feature Selection
     • Preserves the original variables and selects a subset
     • Often uses recursive methods or statistical thresholds
     • Examples: RFE, Chi-Squared Test, Boruta
     Feature Extraction
     • Transforms original variables into variables better suited for modeling
     • Examples: word vectors, clustering
     • PCA falls into this category
     Reducing the dimensions will decrease model complexity
  15. Agenda: Principal Component Analysis (same five items as slide 3)
  16. Why Consider Using PCA?
     1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
     2. You want to generate variables that are not correlated
     3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains
  17. PCA in Machine Learning Workflows (diagram nodes: SOURCE, DATASET, TRAIN, TEST)
  18. PCA in Machine Learning Workflows (diagram adds: PCA)
  19. PCA in Machine Learning Workflows (diagram adds: two BATCH PROJECTION nodes)
  20. PCA in Machine Learning Workflows (diagram adds: NEW TRAIN FEATURES, NEW TEST FEATURES)
  21. PCA in Machine Learning Workflows (same diagram as slide 20). What's special about these new features?
  22. What Does PCA Yield? The original data matrix (features F1 … FN as columns, instances I1 … IN as rows) becomes a transformed data matrix (PC1 … PCN as columns, the same instances as rows). The new variables are the "principal components"
  23. Properties of Principal Components: each PC is a linear combination of the original variables, with its own set of weights
     $PC_1 = w_{1,1}F_1 + w_{1,2}F_2 + w_{1,3}F_3 + \dots + w_{1,N}F_N$
     $PC_2 = w_{2,1}F_1 + w_{2,2}F_2 + w_{2,3}F_3 + \dots + w_{2,N}F_N$
     …
     $PC_N = w_{N,1}F_1 + w_{N,2}F_2 + w_{N,3}F_3 + \dots + w_{N,N}F_N$
  24. Geometric Interpretation of PCA (figure)
  25. Intuition Behind Principal Components (figure)
  26. Intuition Behind Principal Components (figure, continued)
  27. Properties of Principal Components (figure: Original Data vs. Transformed Data). Principal Components are not correlated
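
The two properties stated on slides 23 and 27 (each PC is a weighted sum of the original features, and the PCs are not correlated with each other) can be checked numerically. The following is a minimal sketch in plain NumPy, not BigML's implementation: the data is synthetic, the variable names are made up for illustration, and the features are only centered (whether to also scale them is a separate choice not covered here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 instances, 3 features; F2 is built to track F1 closely.
f1 = rng.normal(size=200)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=200)
f3 = rng.normal(size=200)
X = np.column_stack([f1, f2, f3])

# Center the features, then eigendecompose their covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so that PC1 is the direction with the largest variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
W = eigvecs[:, order]  # column j holds the weights of PC(j+1)

# Every PC is a weighted sum of the (centered) original features.
scores = Xc @ W
pc1_first = sum(W[i, 0] * Xc[0, i] for i in range(X.shape[1]))

# Correlations between columns, before and after the transformation.
corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

print("PC1 weights:", np.round(W[:, 0], 3))
print("corr(F1, F2) before:", round(corr_before[0, 1], 3))
print("largest off-diagonal correlation after:",
      np.abs(corr_after - np.eye(3)).max())
```

The original features F1 and F2 are strongly correlated by construction, while the off-diagonal correlations between the PC columns are zero up to floating-point error.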
  28. Properties of Principal Components: Principal Components are sorted by the percentage of variance explained in the original data
  29. How to Reduce Dimensions, Approach #1: directly select how many PCs to keep
  30. How to Reduce Dimensions, Approach #2: select a threshold for the cumulative Percent Variance Explained
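
The workflow from slides 17 to 21 (fit PCA on the training split, then project both splits) and the two reduction approaches from slides 29 and 30 can be sketched together. This is an illustrative NumPy version under stated assumptions, not the BigML API: `fit_pca`, `n_components_for`, and `project` are hypothetical helper names, the data is synthetic, and the 95% threshold is an arbitrary example value.

```python
import numpy as np

def fit_pca(X_train):
    """Learn the mean, principal directions, and variance ratios from training data only."""
    mean = X_train.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions,
    # already sorted from largest to smallest singular value.
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of variance explained per PC
    return mean, Vt, explained

def n_components_for(explained, threshold):
    """Approach #2: smallest number of PCs whose cumulative variance reaches the threshold."""
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(explained))  # guard against rounding when threshold is ~1.0

def project(X, mean, Vt, k):
    """Project instances onto the first k PCs, reusing the training mean."""
    return (X - mean) @ Vt[:k].T

rng = np.random.default_rng(1)

# Hypothetical data: 5 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))
X_train, X_test = X[:200], X[200:]

mean, Vt, explained = fit_pca(X_train)

# Approach #1: pick the number of PCs directly.
train_two = project(X_train, mean, Vt, k=2)

# Approach #2: pick a cumulative variance threshold and derive the number of PCs.
k = n_components_for(explained, threshold=0.95)
train_new = project(X_train, mean, Vt, k)
test_new = project(X_test, mean, Vt, k)  # same transformation, no refitting on test data

print("variance explained per PC:", np.round(explained, 4))
print("PCs kept at the 95% threshold:", k)
print("new train/test shapes:", train_new.shape, test_new.shape)
```

The test split is projected with the mean and directions learned from the training split, so that both sets of new features live in the same coordinate system and no information from the test data enters the transformation.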
  31. Agenda: Principal Component Analysis (same five items as slide 3)
  32. Agenda: Principal Component Analysis (same five items as slide 3)
  33. BigML-Specific Implementation
     • Standard PCA only applies to numerical data
     • BigML uses three different data transformation methods in order to handle different data types
        • Numeric data: Principal Component Analysis (PCA)
        • Categorical data: Multiple Correspondence Analysis (MCA)
        • Mixed data: Factor Analysis of Mixed Data (FAMD)
     • BigML will automatically handle numeric, text, items, and categorical data without needing user input
  34. More Info: https://bigml.com/releases/fall-2018
  35. Questions? @bigmlcom support@bigml.com
