SlideShare a Scribd company logo
Center for Evolutionary Medicine and Informatics 
Sparse Screening for Exact Data Reduction 
Jieping Ye 
Arizona State University 
1 
Joint work with Jie Wang and Jun Liu
Center for Evolutionary Medicine and Informatics 
2 
wide data 
tall data 
sample 
reduction 
feature 
reduction
Center for Evolutionary Medicine and Informatics 
The model learnt from the reduced data is identical to the model learnt from the full data. 
We focus on two models in this talk: 
Lasso for wide data (feature reduction) 
SVM for tall data (sample reduction) 
3 
Sparse Screening: A New Framework for Exact Data Reduction
Center for Evolutionary Medicine and Informatics 
4
Center for Evolutionary Medicine and Informatics 
Lasso/Basis Pursuit (Tibshirani, 1996, Chen, Donoho, and Saunders, 1999) 
… 
= 
× 
+ 
y 
A 
z 
n×1 
n×p 
n×1 
p×1 
x 
5 
Simultaneous feature selection and regression
Center for Evolutionary Medicine and Informatics 
Imaging Genetics (Thompson et al. 2013) 
6
Center for Evolutionary Medicine and Informatics 
Sparse Reduced-Rank Regression 
7 
Vounou et al. (2010, 2012)
Center for Evolutionary Medicine and Informatics 
Structured Sparse Models 
8 
Group Lasso 
Tree Lasso 
Fused Lasso 
Graph Lasso
Center for Evolutionary Medicine and Informatics 
9 
Sparsity has become an important modeling tool in genomics, genetics, signal and audio processing, image processing, neuroscience (theory of sparse coding), machine learning, statistics …
Center for Evolutionary Medicine and Informatics 
Optimization Algorithms 
•Coordinate descent 
•Subgradient descent 
•Augmented Lagrangian Method 
•Gradient descent 
•Accelerated gradient descent 
•… 
10 
min loss(x) + λ×penalty(x)
Center for Evolutionary Medicine and Informatics 
Lasso 
Fused Lasso 
Group Lasso 
Sparse Group Lasso 
Tree Structured Group Lasso 
Overlapping Group Lasso 
Sparse Inverse Covariance Estimation 
Trace Norm Minimization 
http://www.public.asu.edu/~jye02/Software/SLEP/ 
11
Center for Evolutionary Medicine and Informatics 
More Efficiency? 
12 
Very high dimensional data 
Non-smooth sparsity-induced norms 
Multiple runs in model selection 
A large number of runs in permutation test
Center for Evolutionary Medicine and Informatics 
How to make any existing Lasso solver much more efficient? 
13
Center for Evolutionary Medicine and Informatics 
14 
1M 
1K 
Data Reduction/Compression 
original data reduced data
Center for Evolutionary Medicine and Informatics 
Data Reduction 
•Heuristic-based data reduction 
–Sure screening, random projection/selection 
–Resulting model is an approximation of the true model 
•Propose data reduction methods 
–Exact data reduction via sparse screening 
•The model based on reduced data is identical to the one constructed from complete data 
15
Center for Evolutionary Medicine and Informatics 
16 
with screening 
same solution 
1M 
1M 
1K 
without screening 
Sparse Screening
Center for Evolutionary Medicine and Informatics 
Large-Scale Sparse Screening
Center for Evolutionary Medicine and Informatics 
Screening Rule: Motivation 
Ghaoui, Viallon, and Rabbani.
Center for Evolutionary Medicine and Informatics 
Large-Scale Sparse Screening (Cont’d)
Center for Evolutionary Medicine and Informatics 
More on the Dual Formulation 
•Solving the dual formulation is difficult 
•Providing a good (not exact) estimate of the optimal dual solution is easier 
• A good estimate of the optimal dual solution is sufficient for effective feature screening 
20
Center for Evolutionary Medicine and Informatics 
Screening Rule 
21
Center for Evolutionary Medicine and Informatics 
Sketch of Sparse Screening 
22
Center for Evolutionary Medicine and Informatics 
How to Estimate the Region Θ? 
J. Wang et al. NIPS’13; J. Liu et al. ICML’14 
Non-expansiveness:
Center for Evolutionary Medicine and Informatics 
Enhanced DPP 
24 
Use projections of rays: 
Define: 
Enhanced DPP:
Center for Evolutionary Medicine and Informatics 
Firmly Non-expansive Projection 
25 
Non-expansiveness: 
Firmly non-expansiveness:
Center for Evolutionary Medicine and Informatics 
26 
Results on MNIST along a sequence of 100 parameter values along the λ/λmax scale from 
0.05 to 1. The data matrix is of size 784x50,000
Center for Evolutionary Medicine and Informatics 
27 
Evaluation on MNIST 
solver 
SAFE 
DPP 
EDPP 
SDPP 
time (s) 
2245.26 
685.12 
233.85 
45.56 
9.34 
0 
100 
200 
300 
SAFE 
DPP 
EDPP 
SDPP 
Speedup
Center for Evolutionary Medicine and Informatics 
Evaluation on ADNI 
•Problem: GWAS to MRI ROI prediction (ADNI) 
–The size of the data matrix is 747 by 504095 
Method 
ROI3 
ROI8 
ROI30 
ROI69 
ROI76 
ROI83 
Lasso Solver 
37975.31 
37097.25 
38258.72 
36926.81 
38116.29 
37251.03 
SR 
84.06 
84.44 
84.70 
83.09 
82.76 
85.39 
SR+Lasso 
217.08 
215.90 
223.39 
214.36 
212.04 
211.57 
EDDP 
43.56 
45.75 
45.70 
45.01 
44.31 
44.16 
EDDP+Lasso 
183.64 
190.43 
182.87 
170.71 
177.41 
178.98 
Running time (in seconds) of the Lasso solver, strong rule (Tibshriani et al, 2012), and EDPP. The parameter sequence contains 100 values along the log λ/λmax scale from 100 log 0.95 to log 0.95.
Center for Evolutionary Medicine and Informatics 
Sparse Screening Extensions 
•Group Lasso 
–J Wang, J Liu, J Ye. Efficient Mixed-Norm Regularization: Algorithms and Safe Screening Methods. arXiv preprint arXiv:1307.4156. 
•Sparse Logistic Regression 
–J Wang, J Zhou, P Wonka, J Ye. A Safe Screening Rule for Sparse Logistic Regression. arXiv preprint arXiv:1307.4145. 
•Sparse Inverse Covariance Estimation 
–S Huang, J Li, L Sun, J Liu, T Wu, K Chen, A Fleisher, E Reiman, J Ye. Learning brain connectivity of Alzheimer’s disease by exploratory graphical models. NeuroImage 50, 935-949. 
–Witten, Friedman and Simon (2011), Mazumder and Hastie (2012) 
•Multiple Graphical Lasso 
–S Yang, Z Pan, X Shen, P Wonka, J Ye. Fused Multiple Graphical Lasso. arXiv preprint arXiv:1209.2139. 
29
Center for Evolutionary Medicine and Informatics 
Wide versus Tall Data 
30 
wide data 
tall data
Center for Evolutionary Medicine and Informatics 
Support Vector Machines 
•SVM is a maximum margin classifier. 
31 
denotes +1 
denotes -1 
Margin
Center for Evolutionary Medicine and Informatics 
Support Vectors 
•SVM is determined by the so-called support vectors. 
32 
Support Vectors are those data points that the margin pushes up against 
denotes +1 
denotes -1 
The non-support vectors are irrelevant to the classifier. 
Can we make use of this observation?
Center for Evolutionary Medicine and Informatics 
The Idea of Sample Screening 
33 
Original Problem 
Screening 
Smaller Problem to Solve
Center for Evolutionary Medicine and Informatics 
Guidelines for Sample Screening 
34 
J. Wang, P. Wonka, and J. Ye. ICML’14.
Center for Evolutionary Medicine and Informatics 
Relaxed Guidelines 
35
Center for Evolutionary Medicine and Informatics 
Sketch of SVM Screening 
36
Center for Evolutionary Medicine and Informatics 
Synthetic Studies 
37 
•We use the rejection rates to measure the performance of the screening rules, the ratio of the number of data instances whose membership can be identified by the rule to the total number of data instances.
Center for Evolutionary Medicine and Informatics 
Performance of DVI for SVM on Real Data Sets 
38 
Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for SVM on three real data sets. 
IJCNN, , 
Speedup 
Solver 
Total 
4669.14 
Solver + SSNSV 
SSNSV 
2.08 
2.31 
Init. 
92.45 
Total 
2018.55 
Solver + ESSNSV 
ESSNSV 
2.09 
3.01 
Init. 
91.33 
Total 
1552.72 
Solver + DVI 
DVI 
0.99 
5.64 
Init. 
42.67 
Total 
828.02 
Wine, , 
Speedup 
Solver 
Total 
76.52 
Solver + SSNSV 
SSNSV 
0.02 
3.50 
Init. 
1.56 
Total 
21.85 
Solver + ESSNSV 
ESSNSV 
0.03 
4.47 
Init. 
1.60 
Total 
17.17 
Solver + DVI 
DVI 
0.01 
6.59 
Init. 
0.67 
Total 
11.62 
Covertype, , 
Speedup 
Solver 
Total 
1675.46 
Solver + SSNSV 
SSNSV 
2.73 
7.60 
Init. 
35.52 
Total 
220.58 
Solver + ESSNSV 
ESSNSV 
2.89 
10.72 
Init. 
36.13 
Total 
156.23 
Solver + DVI 
DVI 
1.27 
79.18 
Init. 
12.57 
Total 
21.26
Center for Evolutionary Medicine and Informatics 
Experiments on Real Data Sets 
39 
Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for LAD on three real data sets. 
Telescope, , 
Speedup 
Solver 
Total 
122.34 
Solver + DVI 
DVI 
0.28 
9.86 
Init. 
0.12 
Total 
12.14 
Computer, , 
Speedup 
Solver 
Total 
5.85 
Solver + DVI 
DVI 
0.08 
19.21 
Init. 
0.05 
Total 
0.28 
Telescope, , 
Speedup 
Solver 
Total 
21.43 
Solver + DVI 
DVI 
0.06 
114.91 
Init. 
0.1 
Total 
0.19
Center for Evolutionary Medicine and Informatics 
Resource 
40 
•Tutorial webpages of our screening rules, which include sample codes, implementation instructions, illustration materials, etc. 
http://www.public.asu.edu/~jwang237/screening.html 
Seven lines implementation of EDPP rule 
The list is growing quickly
Center for Evolutionary Medicine and Informatics 
Summary 
•Developed exact data reduction approaches 
–Exact data reduction via feature screening 
–Exact data reduction via sample screening 
•The model based on reduced data is identical to the one constructed from complete data 
•Results show screening leads to a significant speedup. 
•Extend exact data reduction to other sparse learning formulations 
–Sparsity on features, samples, networks etc 
41

More Related Content

Similar to Exact Data Reduction for Big Data by Jieping Ye

Healthcare
HealthcareHealthcare
Healthcare
Gaurav Dubey
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choih
Hongyoon Choi
 
Contrast Pattern Aided Regression and Classification
Contrast Pattern Aided Regression and ClassificationContrast Pattern Aided Regression and Classification
Contrast Pattern Aided Regression and Classification
Artificial Intelligence Institute at UofSC
 
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASEMISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
IJDKP
 
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
Artificial Intelligence Institute at UofSC
 
How deep learning reshapes medicine
How deep learning reshapes medicineHow deep learning reshapes medicine
How deep learning reshapes medicine
Hongyoon Choi
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...cambridgeWD
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...cambridgeWD
 
Deep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpointsDeep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpoints
Valery Tkachenko
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
Chandan Reddy
 
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
The Statistical and Applied Mathematical Sciences Institute
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
Ashish Salve
 
Collaborative Database and Computational Models for Tuberculosis Drug Discovery
Collaborative Database and Computational Models for Tuberculosis Drug DiscoveryCollaborative Database and Computational Models for Tuberculosis Drug Discovery
Collaborative Database and Computational Models for Tuberculosis Drug Discovery
Sean Ekins
 
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design IssuesExtending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
nQuery
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
Erik D. Davenport
 
Health Data Science Seminar Series
Health Data Science Seminar SeriesHealth Data Science Seminar Series
Health Data Science Seminar Series
Office for National Statistics
 
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptxGGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
BHAGWAT NAWADE
 
ICU case presentation-english 2016
ICU case presentation-english 2016ICU case presentation-english 2016
ICU case presentation-english 2016Vasilios Papaioannou
 

Similar to Exact Data Reduction for Big Data by Jieping Ye (20)

Healthcare
HealthcareHealthcare
Healthcare
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choih
 
Contrast Pattern Aided Regression and Classification
Contrast Pattern Aided Regression and ClassificationContrast Pattern Aided Regression and Classification
Contrast Pattern Aided Regression and Classification
 
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASEMISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
 
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regress...
 
How deep learning reshapes medicine
How deep learning reshapes medicineHow deep learning reshapes medicine
How deep learning reshapes medicine
 
Statistics78 (2)
Statistics78 (2)Statistics78 (2)
Statistics78 (2)
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Deep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpointsDeep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpoints
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
PMED: APPM Workshop: Overview of Methods for Subgroup Identification in Clini...
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
5 5 10
5 5 105 5 10
5 5 10
 
Collaborative Database and Computational Models for Tuberculosis Drug Discovery
Collaborative Database and Computational Models for Tuberculosis Drug DiscoveryCollaborative Database and Computational Models for Tuberculosis Drug Discovery
Collaborative Database and Computational Models for Tuberculosis Drug Discovery
 
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design IssuesExtending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
 
Health Data Science Seminar Series
Health Data Science Seminar SeriesHealth Data Science Seminar Series
Health Data Science Seminar Series
 
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptxGGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
GGWS_M3_L5_Estimation_of_heritability_from_GWAS_summary_statistics.pptx
 
ICU case presentation-english 2016
ICU case presentation-english 2016ICU case presentation-english 2016
ICU case presentation-english 2016
 

More from BigMine

Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
BigMine
 
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
BigMine
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
BigMine
 
Big Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina MorikBig Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina Morik
BigMine
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
BigMine
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
BigMine
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...
BigMine
 
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
BigMine
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
BigMine
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
BigMine
 

More from BigMine (10)

Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
 
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
 
Big Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina MorikBig Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina Morik
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 

Recently uploaded

FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Exact Data Reduction for Big Data by Jieping Ye

  • 1. Center for Evolutionary Medicine and Informatics Sparse Screening for Exact Data Reduction Jieping Ye Arizona State University 1 Joint work with Jie Wang and Jun Liu
  • 2. Center for Evolutionary Medicine and Informatics 2 wide data tall data sample reduction feature reduction
  • 3. Center for Evolutionary Medicine and Informatics The model learnt from the reduced data is identical to the model learnt from the full data. We focus on two models in this talk: Lasso for wide data (feature reduction) SVM for tall data (sample reduction) 3 Sparse Screening: A New Framework for Exact Data Reduction
  • 4. Center for Evolutionary Medicine and Informatics 4
  • 5. Center for Evolutionary Medicine and Informatics Lasso/Basis Pursuit (Tibshirani, 1996, Chen, Donoho, and Saunders, 1999) … = × + y A z n×1 n×p n×1 p×1 x 5 Simultaneous feature selection and regression
  • 6. Center for Evolutionary Medicine and Informatics Imaging Genetics (Thompson et al. 2013) 6
  • 7. Center for Evolutionary Medicine and Informatics Sparse Reduced-Rank Regression 7 Vounou et al. (2010, 2012)
  • 8. Center for Evolutionary Medicine and Informatics Structured Sparse Models 8 Group Lasso Tree Lasso Fused Lasso Graph Lasso
  • 9. Center for Evolutionary Medicine and Informatics 9 Sparsity has become an important modeling tool in genomics, genetics, signal and audio processing, image processing, neuroscience (theory of sparse coding), machine learning, statistics …
  • 10. Center for Evolutionary Medicine and Informatics Optimization Algorithms •Coordinate descent •Subgradient descent •Augmented Lagrangian Method •Gradient descent •Accelerated gradient descent •… 10 min loss(x) + λ×penalty(x)
  • 11. Center for Evolutionary Medicine and Informatics Lasso Fused Lasso Group Lasso Sparse Group Lasso Tree Structured Group Lasso Overlapping Group Lasso Sparse Inverse Covariance Estimation Trace Norm Minimization http://www.public.asu.edu/~jye02/Software/SLEP/ 11
  • 12. Center for Evolutionary Medicine and Informatics More Efficiency? 12 Very high dimensional data Non-smooth sparsity-induced norms Multiple runs in model selection A large number of runs in permutation test
  • 13. Center for Evolutionary Medicine and Informatics How to make any existing Lasso solver much more efficient? 13
  • 14. Center for Evolutionary Medicine and Informatics 14 1M 1K Data Reduction/Compression original data reduced data
  • 15. Center for Evolutionary Medicine and Informatics Data Reduction •Heuristic-based data reduction –Sure screening, random projection/selection –Resulting model is an approximation of the true model •Propose data reduction methods –Exact data reduction via sparse screening •The model based on reduced data is identical to the one constructed from complete data 15
  • 16. Center for Evolutionary Medicine and Informatics 16 with screening same solution 1M 1M 1K without screening Sparse Screening
  • 17. Center for Evolutionary Medicine and Informatics Large-Scale Sparse Screening
  • 18. Center for Evolutionary Medicine and Informatics Screening Rule: Motivation Ghaoui, Viallon, and Rabbani.
  • 19. Center for Evolutionary Medicine and Informatics Large-Scale Sparse Screening (Cont’d)
  • 20. Center for Evolutionary Medicine and Informatics More on the Dual Formulation •Solving the dual formulation is difficult •Providing a good (not exact) estimate of the optimal dual solution is easier • A good estimate of the optimal dual solution is sufficient for effective feature screening 20
  • 21. Center for Evolutionary Medicine and Informatics Screening Rule 21
  • 22. Center for Evolutionary Medicine and Informatics Sketch of Sparse Screening 22
  • 23. Center for Evolutionary Medicine and Informatics How to Estimate the Region Θ? J. Wang et al. NIPS’13; J. Liu et al. ICML’14 Non-expansiveness:
  • 24. Center for Evolutionary Medicine and Informatics Enhanced DPP 24 Use projections of rays: Define: Enhanced DPP:
  • 25. Center for Evolutionary Medicine and Informatics Firmly Non-expansive Projection 25 Non-expansiveness: Firmly non-expansiveness:
  • 26. Center for Evolutionary Medicine and Informatics 26 Results on MNIST along a sequence of 100 parameter values along the λ/λmax scale from 0.05 to 1. The data matrix is of size 784x50,000
  • 27. Center for Evolutionary Medicine and Informatics 27 Evaluation on MNIST solver SAFE DPP EDPP SDPP time (s) 2245.26 685.12 233.85 45.56 9.34 0 100 200 300 SAFE DPP EDPP SDPP Speedup
  • 28. Center for Evolutionary Medicine and Informatics Evaluation on ADNI •Problem: GWAS to MRI ROI prediction (ADNI) –The size of the data matrix is 747 by 504095 Method ROI3 ROI8 ROI30 ROI69 ROI76 ROI83 Lasso Solver 37975.31 37097.25 38258.72 36926.81 38116.29 37251.03 SR 84.06 84.44 84.70 83.09 82.76 85.39 SR+Lasso 217.08 215.90 223.39 214.36 212.04 211.57 EDDP 43.56 45.75 45.70 45.01 44.31 44.16 EDDP+Lasso 183.64 190.43 182.87 170.71 177.41 178.98 Running time (in seconds) of the Lasso solver, strong rule (Tibshriani et al, 2012), and EDPP. The parameter sequence contains 100 values along the log λ/λmax scale from 100 log 0.95 to log 0.95.
  • 29. Center for Evolutionary Medicine and Informatics Sparse Screening Extensions •Group Lasso –J Wang, J Liu, J Ye. Efficient Mixed-Norm Regularization: Algorithms and Safe Screening Methods. arXiv preprint arXiv:1307.4156. •Sparse Logistic Regression –J Wang, J Zhou, P Wonka, J Ye. A Safe Screening Rule for Sparse Logistic Regression. arXiv preprint arXiv:1307.4145. •Sparse Inverse Covariance Estimation –S Huang, J Li, L Sun, J Liu, T Wu, K Chen, A Fleisher, E Reiman, J Ye. Learning brain connectivity of Alzheimer’s disease by exploratory graphical models. NeuroImage 50, 935-949. –Witten, Friedman and Simon (2011), Mazumder and Hastie (2012) •Multiple Graphical Lasso –S Yang, Z Pan, X Shen, P Wonka, J Ye. Fused Multiple Graphical Lasso. arXiv preprint arXiv:1209.2139. 29
  • 30. Center for Evolutionary Medicine and Informatics Wide versus Tall Data 30 wide data tall data
  • 31. Center for Evolutionary Medicine and Informatics Support Vector Machines •SVM is a maximum margin classifier. 31 denotes +1 denotes -1 Margin
  • 32. Center for Evolutionary Medicine and Informatics Support Vectors •SVM is determined by the so-called support vectors. 32 Support Vectors are those data points that the margin pushes up against denotes +1 denotes -1 The non-support vectors are irrelevant to the classifier. Can we make use of this observation?
  • 33. Center for Evolutionary Medicine and Informatics The Idea of Sample Screening 33 Original Problem Screening Smaller Problem to Solve
  • 34. Center for Evolutionary Medicine and Informatics Guidelines for Sample Screening 34 J. Wang, P. Wonka, and J. Ye. ICML’14.
  • 35. Center for Evolutionary Medicine and Informatics Relaxed Guidelines 35
  • 36. Center for Evolutionary Medicine and Informatics Sketch of SVM Screening 36
  • 37. Center for Evolutionary Medicine and Informatics Synthetic Studies 37 •We use the rejection rates to measure the performance of the screening rules, the ratio of the number of data instances whose membership can be identified by the rule to the total number of data instances.
  • 38. Center for Evolutionary Medicine and Informatics Performance of DVI for SVM on Real Data Sets 38 Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for SVM on three real data sets. IJCNN, , Speedup Solver Total 4669.14 Solver + SSNSV SSNSV 2.08 2.31 Init. 92.45 Total 2018.55 Solver + ESSNSV ESSNSV 2.09 3.01 Init. 91.33 Total 1552.72 Solver + DVI DVI 0.99 5.64 Init. 42.67 Total 828.02 Wine, , Speedup Solver Total 76.52 Solver + SSNSV SSNSV 0.02 3.50 Init. 1.56 Total 21.85 Solver + ESSNSV ESSNSV 0.03 4.47 Init. 1.60 Total 17.17 Solver + DVI DVI 0.01 6.59 Init. 0.67 Total 11.62 Covertype, , Speedup Solver Total 1675.46 Solver + SSNSV SSNSV 2.73 7.60 Init. 35.52 Total 220.58 Solver + ESSNSV ESSNSV 2.89 10.72 Init. 36.13 Total 156.23 Solver + DVI DVI 1.27 79.18 Init. 12.57 Total 21.26
  • 39. Center for Evolutionary Medicine and Informatics Experiments on Real Data Sets 39 Comparison of SSNSV (Ogawa et al., ICML’13), ESSNSV and DVIs for LAD on three real data sets. Telescope, , Speedup Solver Total 122.34 Solver + DVI DVI 0.28 9.86 Init. 0.12 Total 12.14 Computer, , Speedup Solver Total 5.85 Solver + DVI DVI 0.08 19.21 Init. 0.05 Total 0.28 Telescope, , Speedup Solver Total 21.43 Solver + DVI DVI 0.06 114.91 Init. 0.1 Total 0.19
  • 40. Center for Evolutionary Medicine and Informatics Resource 40 •Tutorial webpages of our screening rules, which include sample codes, implementation instructions, illustration materials, etc. http://www.public.asu.edu/~jwang237/screening.html Seven lines implementation of EDPP rule The list is growing quickly
  • 41. Center for Evolutionary Medicine and Informatics Summary •Developed exact data reduction approaches –Exact data reduction via feature screening –Exact data reduction via sample screening •The model based on reduced data is identical to the one constructed from complete data •Results show screening leads to a significant speedup. •Extend exact data reduction to other sparse learning formulations –Sparsity on features, samples, networks etc 41