Optimal Subsampling Strategy for Logistic Regression
Qianshun Cheng and Tian Tian
University of Illinois at Chicago
Background
Introduction
Massive data sets arise more and more frequently in modern scientific research.
How to extract useful information from massive data has become a central research problem. Two common strategies:
– Truncate and merge
– Subsampling-based algorithms
Advantages and disadvantages of subsampling-based algorithms
Advantages
– Efficiently downsize data
– Easy computation and implementation
Disadvantages
– Sampling errors
– Limited efficiency in extracting information
Motivation for our strategy
Is there a way to better preserve the majority information
contained in the full data?
Logistic regression model
Unknown parameter β = (β0, · · · , βm)^T;
Binary response Yi at feature vector Xi is modeled as follows,
Prob(Yi = 1 | Xi) = P(Xi, β) = exp(Xi^T β) / (1 + exp(Xi^T β)), i = 1, ..., n.  (1)
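Model (1) can be evaluated directly; a minimal sketch with NumPy, using a numerically stable form of the logistic function:

```python
import numpy as np

def logistic_prob(X, beta):
    """Prob(Y = 1 | X) = exp(X^T beta) / (1 + exp(X^T beta)), computed stably."""
    eta = X @ beta  # linear predictor X_i^T beta for each row
    # algebraically equal to exp(eta)/(1+exp(eta)), but avoids overflow for large |eta|
    return np.where(eta >= 0,
                    1.0 / (1.0 + np.exp(-eta)),
                    np.exp(eta) / (1.0 + np.exp(eta)))

# example: two points with an intercept column
X = np.array([[1.0, 0.0], [1.0, 2.0]])
beta = np.array([0.0, 1.0])
p = logistic_prob(X, beta)
```

Here `logistic_prob` is an illustrative helper name, not part of the poster's method.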
(locally) D-optimal designs
D-optimal designs: How to assign feature value Xi’s such
that the determinant of the information matrix with respect to
β can be maximized?
Theorem (Yang, Zhang and Huang, 2011):
Under logistic model (1), a D-optimal design with respect to β is
ξ* = {(C*_{l1}, 1/2^m), (C*_{l2}, 1/2^m), l = 1, · · · , 2^{m−1}}
where C*_{lj} = (1, a_{l,1}, · · · , a_{l,m−1}, (−1)^{j−1} c*), j = 1, 2.
– c* minimizes the function f(c) = c^{−2} (Ψ(c))^{−(m+1)}, where Ψ(c) = [P′(c)]² / (P(c)(1 − P(c)));
– a_{l,k} is a boundary point of the design space in the k-th dimension, k = 1, · · · , m − 1.
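As a numerical illustration, c* can be found by one-dimensional minimization. For the logistic link, P′(c) = P(c)(1 − P(c)), so Ψ(c) reduces to P(c)(1 − P(c)). A sketch assuming SciPy (the function names are ours, not the poster's):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def P(c):
    """Logistic cdf P(c) = 1 / (1 + exp(-c))."""
    return 1.0 / (1.0 + np.exp(-c))

def Psi(c):
    """For the logistic link P'(c) = P(c)(1 - P(c)), so
    Psi(c) = [P'(c)]^2 / (P(c)(1 - P(c))) simplifies to P(c)(1 - P(c))."""
    return P(c) * (1.0 - P(c))

def c_star(m):
    """Minimize f(c) = c^{-2} (Psi(c))^{-(m+1)} over c > 0."""
    f = lambda c: c**-2 * Psi(c)**-(m + 1)
    res = minimize_scalar(f, bounds=(1e-3, 10.0), method="bounded")
    return res.x

cs = c_star(7)  # c* for the parameter dimension used in the simulations
```

The minimizer is interior, so the choice of bracketing interval is not critical.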
Subsampling Algorithm
Algorithm
(I). Given data set {(Yi, X_i^T), i = 1, · · · , n}, choose a subsample of size r0 by random sampling;
(II). Fit the data and obtain an initial estimate β̂ = (β̂0, · · · , β̂m);
(III). Obtain B = {i | min{|ci − c*|, |ci + c*|} ≤ δ} by calculating ci = X_i^T β̂;
(IV). From {(Yi, X_i^T), i ∈ B}, pick r1/(2(m−1)) Xi's and Xj's, where the Xi's and Xj's are the r1/(2(m−1)) largest and smallest values, respectively, among the first-dimension components;
(V). Remove the chosen points from set B, and then continue to the next dimension. Collect data after the maxima and minima in each of the m − 1 dimensions have been searched for and located;
(VI). The newly collected r1 data points serve as the starting subsample for the next iteration, where the above steps are repeated.
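One iteration of steps (II)–(V) can be sketched as follows, assuming NumPy; `fit` stands in for any logistic-MLE routine (a hypothetical placeholder), X carries an intercept in column 0, and the sweep runs over the first m − 1 feature columns:

```python
import numpy as np

def one_iteration(X, Y, idx0, c_star, delta, r1, fit):
    """One pass of steps (II)-(V); a sketch, assuming |B| stays well
    above 2 * quota so the extremes picked in each dimension are distinct."""
    beta_hat = fit(X[idx0], Y[idx0])              # step (II): pilot estimate
    ci = X @ beta_hat                             # step (III): c_i = X_i^T beta_hat
    B = set(np.where(np.minimum(np.abs(ci - c_star),
                                np.abs(ci + c_star)) <= delta)[0].tolist())
    m = X.shape[1] - 1                            # number of feature dimensions
    quota = r1 // (2 * (m - 1))                   # picks per extreme per dimension
    chosen = []
    for k in range(1, m):                         # steps (IV)-(V): sweep m - 1 dims
        pool = np.fromiter(B, dtype=int)
        order = pool[np.argsort(X[pool, k])]      # sort remaining B by dimension k
        picked = np.concatenate([order[:quota], order[-quota:]])
        chosen.extend(picked.tolist())
        B.difference_update(picked.tolist())      # remove chosen points from B
    return np.asarray(chosen)                     # ~r1 points for the next round
```

In practice `fit` would be a Newton or IRLS logistic solver; any implementation returning β̂ works here.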
Simulation settings for small sample size
scenarios
Total sample size n = 10000.
Starting subsample size r0 = 200.
Parameter dimension m = 7.
True parameter value β = (0.5, · · · , 0.5).
Variance–covariance structure Σ is compound symmetry with diagonal entries 1 and off-diagonal entries 0.5.
Feature distributions considered:
– NzNormal
– MzNormal
– Mixed Normal
– T3
Simulation settings for large sample size
scenarios
Total sample size n = 500000.
Starting subsample size r0 = 1000.
Other settings same as above.
Feature distributions considered:
– Mixed Normal
– T3
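Under these settings, one simulated data set can be generated as below; a sketch assuming NumPy, where mean = 0 corresponds to the MzNormal case (a nonzero mean gives NzNormal, and the Mixed Normal and T3 settings would swap in different feature distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, m, beta, rho=0.5, mean=0.0):
    """Generate one data set: features from a multivariate normal with
    compound-symmetry covariance (1 on the diagonal, rho off-diagonal),
    responses drawn from logistic model (1)."""
    Sigma = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    Xf = rng.multivariate_normal(np.full(m, mean), Sigma, size=n)
    X = np.column_stack([np.ones(n), Xf])        # prepend intercept column
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # Prob(Y_i = 1 | X_i)
    Y = rng.binomial(1, p)
    return X, Y

# small-sample-size setting: n = 10000, m = 7, beta = (0.5, ..., 0.5)
X, Y = simulate(10000, 7, np.full(8, 0.5))
```

The same function covers the large-sample-size scenario by passing n = 500000.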
Simulation results
Simulation results (small sample size)
[Figure: MSE (log scale) versus second-stage subsample size r1 from 600 to 1000, comparing the New Algorithm, mVc, and Random Sampling under four feature distributions: (a) MzNormal, (b) NzNormal, (c) MixNormal, (d) T3.]
Simulation results (large sample size)
[Figure: MSE (log scale) versus second-stage subsample size r1 from 1000 to 5000, comparing the New Algorithm, mVc, and Random Sampling under two feature distributions: (a) MixNormal, (b) T3.]
Ongoing Work
Incorporate LEV algorithm into sampling.
Incorporate higher order terms or interaction terms into
model building.
Incorporate model selection/averaging problem into current
structure.
Email: qcheng5@uic.edu, ttian3@uic.edu CCASA Student Showcase 2016 MSCS, UIC