Optimal Subsampling Strategy for Logistic
Regression
Qianshun Cheng and Tian Tian
University of Illinois at Chicago
Background
Introduction
Massive data sets appear more and more frequently in
modern scientific research.
How to extract useful information from massive data has
become a hot research problem.
– Truncate and merge
– Subsampling based algorithms
Advantages and disadvantages of
subsampling-based algorithms
Advantages
– Efficiently downsize data
– Easy computation and implementation
Disadvantages
– Sampling errors
– Loss of efficiency in extracting information
Motivation for our strategy
Is there a way to better preserve the majority of the information
contained in the full data?
Logistic regression model
Unknown parameter $\beta = (\beta_0, \ldots, \beta_m)^T$.
The binary response $Y_i$ at feature vector $X_i$ is modeled as
$$\mathrm{Prob}(Y_i = 1 \mid X_i) = P(X_i, \beta) = \frac{\exp(X_i^T \beta)}{1 + \exp(X_i^T \beta)}, \quad i = 1, \ldots, n. \tag{1}$$
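As a minimal sketch of model (1), the success probability can be evaluated as follows. The helper name `logistic_prob` and the use of NumPy are assumptions made for illustration; the tanh form is algebraically equal to exp(η)/(1 + exp(η)) and is used only to avoid overflow.

```python
import numpy as np

def logistic_prob(X, beta):
    """P(X_i, beta) from model (1) for each row X_i of X (hypothetical helper)."""
    # Linear predictor eta_i = X_i^T beta; X is assumed to include the intercept column.
    eta = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    # exp(eta) / (1 + exp(eta)) rewritten as 0.5 * (1 + tanh(eta / 2)) for stability.
    return 0.5 * (1.0 + np.tanh(0.5 * eta))
```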
(Locally) D-optimal designs
D-optimal designs: how should the feature values $X_i$ be assigned so
that the determinant of the information matrix with respect to
$\beta$ is maximized?
Theorem (Yang, Zhang and Huang, 2011):
Under logistic model (1), a D-optimal design with respect to
$\beta$ is
$$\xi^* = \{(C^*_{l1}, 1/2^m),\ (C^*_{l2}, 1/2^m),\ l = 1, \ldots, 2^{m-1}\},$$
where $C^*_{lj} = (1, a_{l,1}, \ldots, a_{l,m-1}, (-1)^{j-1} c^*)$, $j = 1, 2$.
– $c^*$ minimizes the function $f(c) = c^{-2} (\Psi(c))^{-m-1}$, where $\Psi(x) = \frac{[P'(x)]^2}{P(x)(1 - P(x))}$;
– $a_{l,k}$ is a boundary point of the design space in the $k$-th dimension,
$k = 1, \ldots, m - 1$.
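The value c∗ in the theorem can be approximated numerically. The sketch below assumes that P is the logistic function of the scalar linear predictor, so that P′ = P(1 − P) and Ψ(c) reduces to P(c)(1 − P(c)); it then minimizes log f(c) on a grid. The function names, the search interval and the grid step are illustrative choices, not part of the poster.

```python
import numpy as np

def log_f(c, m):
    """log f(c) = -2 log c - (m + 1) log Psi(c), assuming Psi(c) = P(c)(1 - P(c))."""
    c = np.asarray(c, dtype=float)
    # log P(c) = -log(1 + e^{-c}) and log(1 - P(c)) = -log(1 + e^{c}).
    log_psi = -(np.logaddexp(0.0, -c) + np.logaddexp(0.0, c))
    return -2.0 * np.log(c) - (m + 1) * log_psi

def c_star(m, lo=1e-3, hi=10.0, step=1e-4):
    """Grid-search minimizer of f on [lo, hi) (illustrative, not the authors' code)."""
    grid = np.arange(lo, hi, step)
    return float(grid[np.argmin(log_f(grid, m))])
```

Under this assumption, `c_star(1)` is about 1.543, and the minimizer shrinks as m grows.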
Subsampling Algorithm
Algorithm
(I). Given the data set $\{(Y_i, X_i^T),\ i = 1, \ldots, n\}$, choose a
subsample of size $r_0$ by random sampling;
(II). Fit the model to the subsample and obtain an initial estimate
$\hat\beta = (\hat\beta_0, \ldots, \hat\beta_m)$;
(III). Calculate $c_i = X_i^T \hat\beta$ and obtain
$B = \{i \mid \min\{|c_i - c^*|, |c_i + c^*|\} \le \delta\}$;
(IV). From $\{(Y_i, X_i^T),\ i \in B\}$, pick the $\frac{r_1}{2(m-1)}$ points with
the largest and the $\frac{r_1}{2(m-1)}$ points with the smallest
first-dimension components $X_{i1}$;
(V). Remove the chosen points from the set $B$, and then
continue to the next dimension. Collect the data after the
maximums and minimums in each of the $m - 1$
dimensions have been located;
(VI). The newly collected $r_1$ data points serve as the starting
subsample for the next iteration, in which the above steps
are repeated.
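One pass through steps (II) to (V) might be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: column 0 of `X` is taken to be the intercept, the fit uses a plain Newton-Raphson loop with a tiny ridge term, and the names `fit_logistic` and `select_subsample` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Step (II): Newton-Raphson logistic MLE; X must contain the intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))       # fitted probabilities
        w = p * (1.0 - p)                                  # Newton weights
        hessian = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])  # tiny ridge (assumption)
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def select_subsample(X, beta_hat, c_star, delta, r1):
    """Steps (III)-(V): indices of 2k(m-1) rows, which equals r1 when 2(m-1) divides r1."""
    m = X.shape[1] - 1                       # number of covariates; column 0 is the intercept
    if m < 2:
        raise ValueError("need at least two covariates")
    k = r1 // (2 * (m - 1))                  # points taken from each tail per dimension
    if k < 1:
        raise ValueError("r1 is too small for this m")
    c = X @ beta_hat                         # c_i = X_i^T beta_hat
    band = np.flatnonzero(np.minimum(np.abs(c - c_star), np.abs(c + c_star)) <= delta)
    if band.size < 2 * k * (m - 1):
        raise ValueError("band B is too small; increase delta")
    chosen = []
    for dim in range(1, m):                  # covariate dimensions 1, ..., m - 1
        order = band[np.argsort(X[band, dim])]
        picked = np.concatenate([order[:k], order[-k:]])   # k smallest and k largest
        chosen.append(picked)
        band = np.setdiff1d(band, picked)    # step (V): remove chosen points from B
    return np.concatenate(chosen)
```

For step (VI), the returned indices would be used to refit with `fit_logistic(X[idx], y[idx])` before repeating the selection.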
Simulation settings for small sample size
scenarios
Total sample size n = 10000.
Starting subsample size r0 = 200.
Parameter dimension m = 7.
True parameter value β = (0.5, · · · , 0.5).
Variance-covariance structure Σ is compound symmetry
with diagonal entries being 1 and off-diagonal 0.5.
– NzNormal
– MzNormal
– Mixed Normal
– T3
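A data set matching these settings could be generated roughly as follows. The sketch makes several assumptions that the poster does not spell out: the covariates are mean-zero multivariate normal (one reading of "MzNormal"), an intercept column is included so that β has m + 1 entries, and the function name `simulate_mz_normal` is hypothetical.

```python
import numpy as np

def simulate_mz_normal(n=10000, m=7, beta_value=0.5, rho=0.5, seed=0):
    """Simulate (X, y) under model (1) with compound-symmetry covariates (assumed setup)."""
    rng = np.random.default_rng(seed)
    # Compound symmetry: 1 on the diagonal, rho off the diagonal.
    sigma = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
    covariates = rng.multivariate_normal(np.zeros(m), sigma, size=n)
    X = np.column_stack([np.ones(n), covariates])          # intercept column (assumption)
    beta = np.full(m + 1, beta_value)                      # all entries equal to 0.5 by default
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))            # logistic probabilities
    y = (rng.random(n) < p).astype(int)                    # Bernoulli responses
    return X, y, beta, sigma
```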
Simulation settings for large sample size
scenarios
Total sample size n = 500000.
Starting subsample size r0 = 1000.
Other settings same as above.
– Mixed Normal
– T3
Simulation results
Simulation results (small sample size)
[Figure: MSE against r1 (600 to 1000) for three methods: New Algorithm, mVc, Random Sampling. Panels: (a) MzNormal, (b) NzNormal, (c) MixNormal, (d) T3.]
Simulation results (large sample size)
[Figure: MSE against r1 (1000 to 5000) for the same three methods. Panels: (a) MixNormal, (b) T3.]
Ongoing Work
Incorporate LEV algorithm into sampling.
Incorporate higher order terms or interaction terms into
model building.
Incorporate model selection/averaging problem into current
structure.
Email: qcheng5@uic.edu, ttian3@uic.edu CCASA Student Showcase 2016 MSCS, UIC
