
Exact Data Reduction for Big Data by Jieping Ye

Recent technological innovations have enabled data collection of unprecedented size and complexity. Examples include web text data, social media data, gene expression images, neuroimages, and genome-wide association study (GWAS) data. Such data have great potential to address complex scientific and societal questions; however, analyzing them poses major challenges for scientists. As an emerging and powerful tool for analyzing massive collections of data, data reduction in terms of the number of variables and/or the number of samples has attracted tremendous attention in the past few years and has achieved great success in a broad range of applications. The intuition behind data reduction is the observation that many real-world data sets with complex structures and billions of variables and/or samples can usually be explained well by a few of the most relevant features and/or samples. Most existing methods for data reduction are based on sampling or random projection, and the final model built from the reduced data is an approximation of the true (original) model. In this talk, I will present fundamentally different approaches to data reduction in which there is no approximation: the final model constructed from the reduced data is identical to the original model constructed from the complete data. Finally, I will use several real-world examples to demonstrate the potential of exact data reduction for analyzing big data.

Published in: Technology

  1. Center for Evolutionary Medicine and Informatics. Sparse Screening for Exact Data Reduction. Jieping Ye, Arizona State University. Joint work with Jie Wang and Jun Liu.
  2. [Figure: wide data and tall data; sample reduction and feature reduction]
  3. Sparse Screening: A New Framework for Exact Data Reduction. The model learnt from the reduced data is identical to the model learnt from the full data. We focus on two models in this talk: Lasso for wide data (feature reduction) and SVM for tall data (sample reduction).
  4. [No text on this slide]
  5. Lasso/Basis Pursuit (Tibshirani, 1996; Chen, Donoho, and Saunders, 1999). Linear model $y = Ax + z$, with $y$ of size n×1, $A$ of size n×p, $x$ of size p×1, and $z$ of size n×1. Simultaneous feature selection and regression.
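In the notation of slide 5, the Lasso is usually stated as the optimization problem below. This is the standard textbook form rather than a formula recovered from the slide: the 1/2 scaling of the squared loss is an assumption, and λ > 0 is the regularization parameter that later slides refer to.

```latex
\min_{x \in \mathbb{R}^{p}} \; \frac{1}{2}\,\lVert y - A x \rVert_2^2 \;+\; \lambda\,\lVert x \rVert_1
```

Larger values of λ drive more entries of x to exactly zero, which is what makes feature selection and regression simultaneous.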
  6. Imaging Genetics (Thompson et al., 2013)
  7. Sparse Reduced-Rank Regression (Vounou et al., 2010, 2012)
  8. Structured Sparse Models: Group Lasso, Tree Lasso, Fused Lasso, Graph Lasso
  9. Sparsity has become an important modeling tool in genomics, genetics, signal and audio processing, image processing, neuroscience (theory of sparse coding), machine learning, statistics, …
  10. Optimization Algorithms for min loss(x) + λ×penalty(x): •Coordinate descent •Subgradient descent •Augmented Lagrangian Method •Gradient descent •Accelerated gradient descent •…
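As one concrete instance of the template on slide 10, here is a minimal coordinate descent sketch for the Lasso, assuming the loss is 0.5·‖y − Ax‖² and the penalty is ‖x‖₁. The function names `soft_threshold` and `lasso_cd` are illustrative, not taken from the talk or from any particular package, and the fixed iteration count stands in for a proper convergence check.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_cd(A, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - A x||^2 + lam*||x||_1.

    A is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(A), len(A[0])
    x = [0.0] * p
    residual = list(y)  # residual = y - A x, kept up to date incrementally
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never enters the model
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(A[i][j] * residual[i] for i in range(n)) + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= A[i][j] * delta
                x[j] = new_xj
    return x
```

With an identity design matrix the solution is just the soft-thresholded response, so `lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], 1.0)` gives a first coefficient of 2 and a second coefficient of exactly 0.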
  11. Lasso; Fused Lasso; Group Lasso; Sparse Group Lasso; Tree Structured Group Lasso; Overlapping Group Lasso; Sparse Inverse Covariance Estimation; Trace Norm Minimization. http://www.public.asu.edu/~jye02/Software/SLEP/
  12. More Efficiency? Very high dimensional data; non-smooth sparsity-inducing norms; multiple runs in model selection; a large number of runs in permutation tests.
  13. How to make any existing Lasso solver much more efficient?
  14. Data Reduction/Compression: original data (1M) to reduced data (1K)
  15. Data Reduction •Heuristic-based data reduction –Sure screening, random projection/selection –Resulting model is an approximation of the true model •Proposed data reduction methods –Exact data reduction via sparse screening •The model based on reduced data is identical to the one constructed from complete data
  16. Sparse Screening [Figure: without screening versus with screening (labels 1M, 1M, 1K); same solution]
  17. Large-Scale Sparse Screening
  18. Screening Rule: Motivation (El Ghaoui, Viallon, and Rabbani)
  19. Large-Scale Sparse Screening (Cont’d)
  20. More on the Dual Formulation •Solving the dual formulation is difficult •Providing a good (not exact) estimate of the optimal dual solution is easier •A good estimate of the optimal dual solution is sufficient for effective feature screening
  21. Screening Rule
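The formulas on slides 20 and 21 did not survive in the transcript. For orientation, the block below gives the standard dual of the Lasso and the screening rule that follows from its optimality conditions, assuming the primal is 0.5·‖y − Ax‖² + λ‖x‖₁ and writing a_i for the i-th column of A. The symbols θ (dual variable) and Θ (a region known to contain the dual optimum) follow the later slides; the exact scaling used in the talk may differ.

```latex
\begin{aligned}
\text{Dual problem:}\quad
  & \max_{\theta}\; \tfrac{1}{2}\lVert y \rVert_2^2
    - \tfrac{\lambda^{2}}{2}\Bigl\lVert \theta - \tfrac{y}{\lambda} \Bigr\rVert_2^2
    \quad \text{s.t. } \lvert a_i^{\top}\theta \rvert \le 1,\; i = 1,\dots,p \\
\text{Primal-dual link:}\quad
  & y = A\,x^{*}(\lambda) + \lambda\,\theta^{*}(\lambda) \\
\text{Exact rule:}\quad
  & \lvert a_i^{\top}\theta^{*}(\lambda) \rvert < 1
    \;\Longrightarrow\; x_i^{*}(\lambda) = 0 \\
\text{Relaxed rule:}\quad
  & \theta^{*}(\lambda) \in \Theta \ \text{and}\
    \sup_{\theta \in \Theta} \lvert a_i^{\top}\theta \rvert < 1
    \;\Longrightarrow\; x_i^{*}(\lambda) = 0
\end{aligned}
```

The dual optimum is the projection of y/λ onto the feasible polytope, which is why a good estimate of it (the region Θ) is enough to discard features safely without solving the dual exactly.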
  22. Sketch of Sparse Screening
  23. How to Estimate the Region Θ? (J. Wang et al., NIPS’13; J. Liu et al., ICML’14) Non-expansiveness:
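One way to read slide 23: because projection onto a convex set is non-expansive, the dual optimum at λ lies in a ball around the dual optimum at a reference value λ0. The sketch below implements the basic DPP test from the cited NIPS'13 paper for the special case λ0 = λmax, where the reference dual optimum is y/λmax. It assumes the Lasso scaling 0.5·‖y − Ax‖² + λ‖x‖₁ and λ > 0; the function name `dpp_screen` is hypothetical.

```python
import math


def dpp_screen(A, y, lam):
    """Basic DPP test with reference point lam_max.

    Returns a list keep where keep[i] is False when the test certifies that
    feature i has a zero coefficient at lam, and True otherwise.
    A is a list of n rows of length p; y is a list of n floats; lam > 0.
    """
    n, p = len(A), len(A[0])
    cols = [[A[r][j] for r in range(n)] for j in range(p)]
    corr = [sum(c * v for c, v in zip(col, y)) for col in cols]
    lam_max = max(abs(c) for c in corr)
    if lam >= lam_max:
        return [False] * p  # the Lasso solution is identically zero
    # Non-expansiveness: ||theta(lam) - theta(lam_max)|| <= ||y|| * (1/lam - 1/lam_max)
    y_norm = math.sqrt(sum(v * v for v in y))
    radius = y_norm * (1.0 / lam - 1.0 / lam_max)
    keep = []
    for col, c in zip(cols, corr):
        col_norm = math.sqrt(sum(v * v for v in col))
        # Upper bound on |a_i^T theta| over the ball centred at y / lam_max.
        upper = abs(c) / lam_max + radius * col_norm
        keep.append(upper >= 1.0)
    return keep
```

A solver then only needs the columns with `keep[i]` set to True; the discarded coefficients are filled in as zeros afterwards. The ball shrinks as λ approaches λmax, so this basic test is most effective for large λ.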
  24. Enhanced DPP. Use projections of rays: Define: Enhanced DPP:
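The definitions on slide 24 are missing from the transcript. The sketch below follows the EDPP rule as published by Wang et al., restricted to the reference point λ0 = λmax and the Lasso scaling 0.5·‖y − Ax‖² + λ‖x‖₁ with λ > 0. The names `edpp_screen`, `v1`, `v2`, and `v2_perp` mirror the paper's notation but are otherwise illustrative; this is not the seven-line implementation mentioned on slide 40.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def edpp_screen(A, y, lam):
    """EDPP test with reference point lam_max.

    Returns keep[i] = False when feature i is certified to have a zero
    coefficient at lam. A is a list of n rows of length p; lam > 0.
    """
    n, p = len(A), len(A[0])
    cols = [[A[r][j] for r in range(n)] for j in range(p)]
    corr = [dot(col, y) for col in cols]
    i_star = max(range(p), key=lambda j: abs(corr[j]))
    lam_max = abs(corr[i_star])
    if lam >= lam_max:
        return [False] * p  # the Lasso solution is identically zero
    theta0 = [v / lam_max for v in y]  # dual optimum at lam_max
    sign = 1.0 if corr[i_star] >= 0 else -1.0
    v1 = [sign * v for v in cols[i_star]]  # ray direction at lam_max
    v2 = [v / lam - t for v, t in zip(y, theta0)]
    # Component of v2 orthogonal to v1.
    scale = dot(v1, v2) / dot(v1, v1)
    v2_perp = [b - scale * a for a, b in zip(v1, v2)]
    center = [t + 0.5 * w for t, w in zip(theta0, v2_perp)]
    radius = 0.5 * math.sqrt(dot(v2_perp, v2_perp))
    return [
        abs(dot(col, center)) >= 1.0 - radius * math.sqrt(dot(col, col))
        for col in cols
    ]
```

On the toy problem with an identity design and y = (3, 0.5), `edpp_screen(A, y, 1.0)` keeps only the first feature, which matches the exact Lasso solution (2, 0).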
  25. Firmly Non-expansive Projection. Non-expansiveness: Firm non-expansiveness:
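The two inequalities named on slide 25 are not in the transcript. For the projection P onto a nonempty closed convex set, the standard statements are as follows (the symbols P, u, and v are chosen here for illustration):

```latex
\begin{aligned}
\text{Non-expansiveness:}\quad
  & \lVert P(u) - P(v) \rVert_2 \le \lVert u - v \rVert_2 \\
\text{Firm non-expansiveness:}\quad
  & \lVert P(u) - P(v) \rVert_2^{2}
    + \lVert (u - P(u)) - (v - P(v)) \rVert_2^{2}
    \le \lVert u - v \rVert_2^{2}
\end{aligned}
```

The second inequality implies the first and confines P(u) to a smaller ball, which is the kind of tighter region estimate that enhanced screening rules exploit.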
  26. Results on MNIST for a sequence of 100 parameter values along the λ/λmax scale from 0.05 to 1. The data matrix is of size 784×50,000.
  27. Evaluation on MNIST. Running time (s):
        solver    SAFE     DPP      EDPP    SDPP
        2245.26   685.12   233.85   45.56   9.34
      [Figure: bar chart of speedup for SAFE, DPP, EDPP, and SDPP; axis from 0 to 300]
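The speedups plotted on slide 27 can be recomputed from the timings, assuming each column reports the total running time with that rule applied and that speedup means solver time divided by that total. The helper name `speedups` is illustrative.

```python
# Running times in seconds, as listed on slide 27.
times = {"solver": 2245.26, "SAFE": 685.12, "DPP": 233.85, "EDPP": 45.56, "SDPP": 9.34}


def speedups(times, baseline="solver"):
    """Ratio of the baseline time to each other entry's time."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if name != baseline}


for name, s in speedups(times).items():
    print(f"{name}: {s:.1f}x")
```

Under that reading the ratios range from roughly 3x for SAFE to roughly 240x for SDPP, consistent with the 0 to 300 axis of the chart.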
  28. Evaluation on ADNI •Problem: GWAS to MRI ROI prediction (ADNI) –The size of the data matrix is 747 by 504095
        Method        ROI3       ROI8       ROI30      ROI69      ROI76      ROI83
        Lasso Solver  37975.31   37097.25   38258.72   36926.81   38116.29   37251.03
        SR            84.06      84.44      84.70      83.09      82.76      85.39
        SR+Lasso      217.08     215.90     223.39     214.36     212.04     211.57
        EDPP          43.56      45.75      45.70      45.01      44.31      44.16
        EDPP+Lasso    183.64     190.43     182.87     170.71     177.41     178.98
      Running time (in seconds) of the Lasso solver, the strong rule (SR; Tibshirani et al., 2012), and EDPP. The parameter sequence contains 100 values along the log λ/λmax scale from 100 log 0.95 to log 0.95.
  29. Sparse Screening Extensions
      •Group Lasso –J. Wang, J. Liu, J. Ye. Efficient Mixed-Norm Regularization: Algorithms and Safe Screening Methods. arXiv preprint arXiv:1307.4156.
      •Sparse Logistic Regression –J. Wang, J. Zhou, P. Wonka, J. Ye. A Safe Screening Rule for Sparse Logistic Regression. arXiv preprint arXiv:1307.4145.
      •Sparse Inverse Covariance Estimation –S. Huang, J. Li, L. Sun, J. Liu, T. Wu, K. Chen, A. Fleisher, E. Reiman, J. Ye. Learning brain connectivity of Alzheimer’s disease by exploratory graphical models. NeuroImage 50, 935-949. –Witten, Friedman, and Simon (2011); Mazumder and Hastie (2012)
      •Multiple Graphical Lasso –S. Yang, Z. Pan, X. Shen, P. Wonka, J. Ye. Fused Multiple Graphical Lasso. arXiv preprint arXiv:1209.2139.
  30. Wide versus Tall Data [Figure: wide data, tall data]
  31. Support Vector Machines •SVM is a maximum margin classifier. [Figure: points labeled +1 and -1, with the margin marked]
  32. Support Vectors •SVM is determined by the so-called support vectors. Support vectors are those data points that the margin pushes up against. The non-support vectors are irrelevant to the classifier. Can we make use of this observation? [Figure: points labeled +1 and -1]
  33. The Idea of Sample Screening: Original Problem, then Screening, then Smaller Problem to Solve
  34. Guidelines for Sample Screening (J. Wang, P. Wonka, and J. Ye, ICML’14)
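The guidelines on slide 34 are not in the transcript. They presumably rest on the optimality conditions of the SVM; the block below states those conditions for a soft-margin SVM without a bias term, which is an assumed simplification. The symbols w, C, and α_i are the usual primal weights, cost parameter, and dual variables, and are not taken from the slide.

```latex
\begin{aligned}
\text{Primal:}\quad
  & \min_{w}\; \tfrac{1}{2}\lVert w \rVert_2^{2}
    + C \sum_{i=1}^{n} \bigl[\, 1 - y_i\, w^{\top} x_i \,\bigr]_{+},
  \qquad w^{*} = \sum_{i=1}^{n} \alpha_i^{*}\, y_i\, x_i,\quad \alpha_i^{*} \in [0, C] \\
\text{Guidelines:}\quad
  & y_i\, {w^{*}}^{\top} x_i > 1 \;\Longrightarrow\; \alpha_i^{*} = 0
    \quad \text{(non-support vector)} \\
  & y_i\, {w^{*}}^{\top} x_i < 1 \;\Longrightarrow\; \alpha_i^{*} = C
\end{aligned}
```

In both cases the dual variable is known before solving, so the sample can be fixed and removed from the optimization. Since w* is unknown in advance, a practical rule has to bound the margin using an estimate of w*, which is where relaxed guidelines come in.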
  35. Relaxed Guidelines
  36. Sketch of SVM Screening
  37. Synthetic Studies •We use the rejection rate to measure the performance of the screening rules: the ratio of the number of data instances whose membership can be identified by the rule to the total number of data instances.
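The rejection rate defined on slide 37 can be sketched as follows. The inputs are hypothetical: each pair is a lower and upper bound on the margin y_i·wᵀx_i at the optimum, which a screening rule would supply (how DVI derives such bounds is not shown in the transcript). The set labels "R" and "L" and the function names are illustrative.

```python
def identify_membership(lower, upper):
    """Classify one instance from bounds on its margin at the optimum.

    Returns "R" if the margin is certainly above 1 (dual variable at 0),
    "L" if it is certainly below 1 (dual variable at its upper bound),
    and None if the bounds straddle 1 and the instance must be kept.
    """
    if lower > 1.0:
        return "R"
    if upper < 1.0:
        return "L"
    return None


def rejection_rate(bounds):
    """Fraction of instances whose membership the bounds can identify."""
    if not bounds:
        raise ValueError("bounds must contain at least one instance")
    identified = sum(
        1 for lower, upper in bounds if identify_membership(lower, upper) is not None
    )
    return identified / len(bounds)
```

For example, with bounds `[(1.2, 2.0), (0.1, 0.8), (0.9, 1.1), (1.5, 1.6)]` three of the four instances are identified, so the rejection rate is 0.75 and only one instance is left for the solver.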
  38. Performance of DVI for SVM on Real Data Sets. Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for SVM on three real data sets.
      IJCNN
        Solver:            Total 4669.14
        Solver + SSNSV:    SSNSV 2.08, Init. 92.45, Total 2018.55, Speedup 2.31
        Solver + ESSNSV:   ESSNSV 2.09, Init. 91.33, Total 1552.72, Speedup 3.01
        Solver + DVI:      DVI 0.99, Init. 42.67, Total 828.02, Speedup 5.64
      Wine
        Solver:            Total 76.52
        Solver + SSNSV:    SSNSV 0.02, Init. 1.56, Total 21.85, Speedup 3.50
        Solver + ESSNSV:   ESSNSV 0.03, Init. 1.60, Total 17.17, Speedup 4.47
        Solver + DVI:      DVI 0.01, Init. 0.67, Total 11.62, Speedup 6.59
      Covertype
        Solver:            Total 1675.46
        Solver + SSNSV:    SSNSV 2.73, Init. 35.52, Total 220.58, Speedup 7.60
        Solver + ESSNSV:   ESSNSV 2.89, Init. 36.13, Total 156.23, Speedup 10.72
        Solver + DVI:      DVI 1.27, Init. 12.57, Total 21.26, Speedup 79.18
  39. Experiments on Real Data Sets. Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for LAD on three real data sets.
      Telescope
        Solver:            Total 122.34
        Solver + DVI:      DVI 0.28, Init. 0.12, Total 12.14, Speedup 9.86
      Computer
        Solver:            Total 5.85
        Solver + DVI:      DVI 0.08, Init. 0.05, Total 0.28, Speedup 19.21
      Telescope
        Solver:            Total 21.43
        Solver + DVI:      DVI 0.06, Init. 0.1, Total 0.19, Speedup 114.91
  40. Resource •Tutorial webpages of our screening rules, which include sample code, implementation instructions, illustration materials, etc.: http://www.public.asu.edu/~jwang237/screening.html •Seven-line implementation of the EDPP rule •The list is growing quickly
  41. Summary •Developed exact data reduction approaches –Exact data reduction via feature screening –Exact data reduction via sample screening •The model based on reduced data is identical to the one constructed from complete data •Results show that screening leads to a significant speedup •Extend exact data reduction to other sparse learning formulations –Sparsity on features, samples, networks, etc.
