QMC: Transition Workshop - Overview of Research in Working Group 4: Representative Points for Small Data and Big Data Problems - V. Roshan Joseph & Simon Mak, May 7, 2018


The Statistical and Applied Mathematical Sciences Institute

This talk gives an overview of the research done in Working Group IV (WG4). It briefly introduces some of the problems in deterministic Bayesian computation, big data reduction, estimation of normalizing constants in Bayesian sampling, and stochastic programming, and then presents two ongoing projects in WG4.

In the first part of the talk, we propose a new active data reduction method for large-scale Gaussian process (GP) modeling. GP modeling with big data is well known to be computationally demanding, since it requires O(N^3) work and O(N^2) memory, where N >> 1 is the number of data points. The goal is to first reduce the big data to a smaller dataset of n << N points, then use the smaller dataset for efficient GP modeling. Our reduction method is guided by a bound on the GP prediction error, which provides an exploration-exploitation trade-off for sequentially selecting the reduced data. We demonstrate the effectiveness of this approach in simulations and a climate model application.

In the second part of the talk, we present a new adaptive modeling framework for estimating the normalizing constant of an (unnormalized) posterior density that we assume is expensive to evaluate. The method first uses posterior samples to fit an approximating surface, then uses this fit to obtain a closed-form estimate of the normalizing constant. A key novelty is the use of semi-definite (convex) programming, which allows efficient and adaptive estimation with a small number of posterior samples. We explore the effectiveness of our approach in simulations and a real-world application.


Overview of Research in WG IV:
Representative Points for Small-Data and Big-Data Problems
V. Roshan Joseph and Simon Mak
Supported by NSF DMS 1712642

Big-Data Problems
• Reduce big data to reduce future computational cost
(Figure: big data reduced to representative points)

Small-Data Problems
• Obtain expensive small data with minimum cost
(Figure: expensive data-generating mechanism yielding representative points)

Big Data: An Example
Kernel Ridge Regression
• Inputs: 90 song features (e.g., loudness, pitch, timbre)
• Output: song release date (e.g., 2007)
• Data: N = 515,345 songs
• Computation: O(N^3)
• Storage: O(N^2)
• https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
• Mak and Joseph (2017)
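
To make the cost on this slide concrete, the sketch below fits kernel ridge regression on a small synthetic sample and estimates the memory a dense N x N kernel matrix would need at N = 515,345. The Gaussian kernel, the bandwidth, the ridge parameter, the synthetic data, and all function names are illustrative assumptions, not choices stated in the talk.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_fit(x, y, ridge=1e-2, bandwidth=1.0):
    """Solve (K + ridge * I) alpha = y; the dense solve is the O(N^3) step."""
    k = gaussian_kernel(x, x, bandwidth)      # dense N x N matrix: O(N^2) storage
    return np.linalg.solve(k + ridge * np.eye(len(x)), y)

def krr_predict(x_train, alpha, x_new, bandwidth=1.0):
    """Predict at x_new from the fitted coefficients alpha."""
    return gaussian_kernel(x_new, x_train, bandwidth) @ alpha

def kernel_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n matrix of 8-byte floats."""
    return n * n * bytes_per_entry

# Small synthetic stand-in for the song data (hypothetical, 3 features).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)

alpha = krr_fit(x, y)
pred = krr_predict(x, alpha, x)
print("training RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
print("kernel matrix at N = 515345:", kernel_matrix_bytes(515345) / 1e12, "TB")
```

At N = 515,345 the dense kernel matrix alone needs roughly 2 TB in double precision, which is why reducing the data before fitting is attractive.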

How to Reduce Big Data?
• Stratified sampling (Dalenius 1950; Cox 1957)
• Principal points (Flury 1990)
• Quantizers (Lloyd 1957; Max 1960)
• MSE-rep points (Fang and Wang 1994)
– K-means clustering
– Cannot produce a reduced point set that retains the original distribution -> not a “true” representative point set!
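
The last point can be checked numerically. The sketch below runs plain Lloyd (k-means) iterations on a one-dimensional standard normal sample and compares the spread of the resulting centers with the spread of the data. The sample size, the number of centers, the quantile initialization, and the use of the standard deviation as the comparison are all assumptions made for illustration.

```python
import numpy as np

def lloyd_1d(x, k, iters=200):
    """Plain Lloyd (k-means) iterations in one dimension."""
    # Start from evenly spaced sample quantiles so no cluster starts empty.
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if members.size:              # keep the old center if a cluster empties
                centers[j] = members.mean()
    return np.sort(centers)

rng = np.random.default_rng(1)
data = rng.normal(size=20000)             # stand-in "big data" drawn from N(0, 1)
centers = lloyd_1d(data, k=16)

print("std of data:   ", data.std())
print("std of centers:", centers.std())
```

If the centers followed the same distribution as the data, the two spreads would be close. In one dimension, mean-squared-error quantizers are known to place points with asymptotic density proportional to the cube root of the data density, so the centers tend to be spread more widely than the data.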

Support Points
• Can be efficiently optimized using difference-of-convex programming.
• Mak and Joseph (2018)
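
The defining formula on this slide did not survive extraction. As a hedged illustration, the sketch below evaluates a sample-based energy-distance criterion of the kind used for support points, with a large sample standing in for the target distribution. The function names and the toy data are hypothetical, and the code only evaluates the criterion; it does not implement the difference-of-convex optimization mentioned above.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff**2).sum(axis=2))

def energy_criterion(points, sample):
    """(2/(nN)) sum_i sum_m ||x_i - y_m||  -  (1/n^2) sum_i sum_j ||x_i - x_j||."""
    cross = pairwise_dist(points, sample).mean()    # average over all n*N pairs
    within = pairwise_dist(points, points).mean()   # average over all n^2 pairs
    return 2.0 * cross - within

rng = np.random.default_rng(2)
big = rng.normal(size=(1000, 2))                    # stand-in for the big data
subsample = big[rng.choice(len(big), size=25, replace=False)]
clumped = 0.1 * subsample                           # same points shrunk toward the origin

print("random subsample:", energy_criterion(subsample, big))
print("clumped points:  ", energy_criterion(clumped, big))
```

A lower value means the reduced set mimics the big sample better in the energy-distance sense, so a set that ignores the spread of the data scores worse than even a random subsample.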

Small Data: An Example
(Figure panels: Physical Experiment; FEM Experiment)
• Friction drilling (Miller and Shih 2007)

Bayesian Model
(Model equations appear as an image on the slide.)
• One evaluation of the unnormalized posterior takes about 15.4 seconds on a 3.2 GHz laptop => 10,000 MCMC samples would take 43 hours!
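
The 43-hour figure can be checked directly, assuming one evaluation of the unnormalized posterior per MCMC sample (the slide does not state this, but it matches the numbers):

```latex
\[
10{,}000 \times 15.4\ \text{s} = 154{,}000\ \text{s}
= \frac{154{,}000}{3600}\ \text{h} \approx 42.8\ \text{h}
\]
```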

Minimum Energy Design
• Joseph, Dasgupta, Tuo, and Wu (2015)
• Posterior density: $f(x) = \frac{1}{C}\, h(x)$
• Normalizing constant C is not needed!
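
A minimal sketch of why C drops out, under stated assumptions: points are picked greedily from a finite candidate set, each point carries a charge q(x) = h(x)^(-1/(2p)) in p dimensions, and each new point minimizes the sum of (q(x_i) q(x) / ||x_i - x||)^k over the points already chosen. The charge and the default k = 4p follow my reading of Joseph et al. (2015); the greedy candidate-set search, the function name `greedy_med`, and the toy density are simplifications for illustration, not the authors' algorithm.

```python
import numpy as np

def greedy_med(candidates, log_h, n_points, k=None):
    """Greedy minimum-energy-style selection of n_points candidate indices."""
    n_cand, p = candidates.shape
    k = 4 * p if k is None else k             # assumed default exponent
    log_q = -log_h / (2.0 * p)                # log of the charge h^(-1/(2p))
    chosen = [int(np.argmax(log_h))]          # start at the highest-density candidate
    for _ in range(n_points - 1):
        picked = candidates[chosen]
        d = np.linalg.norm(candidates[:, None, :] - picked[None, :, :], axis=2)
        d = np.maximum(d, np.finfo(float).tiny)   # avoid log(0) at chosen points
        log_terms = k * (log_q[:, None] + log_q[chosen][None, :] - np.log(d))
        # Log-sum-exp over the chosen points gives the log of the energy sum.
        m = log_terms.max(axis=1)
        score = m + np.log(np.exp(log_terms - m[:, None]).sum(axis=1))
        score[chosen] = np.inf                # never pick the same candidate twice
        chosen.append(int(np.argmin(score)))
    return np.array(chosen)

rng = np.random.default_rng(3)
cand = rng.uniform(-4.0, 4.0, size=(500, 2))
log_h = -0.5 * (cand**2).sum(axis=1)          # toy unnormalized log-density

design_a = greedy_med(cand, log_h, n_points=20)
design_b = greedy_med(cand, log_h + np.log(7.0), n_points=20)   # h rescaled by 7
print("same design after rescaling h:", np.array_equal(design_a, design_b))
```

Rescaling h multiplies every charge by the same constant, which multiplies every energy term by the same factor and leaves the minimizer unchanged. That is the sense in which only h, not C, is needed.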

• Number of evaluations = 654 (Joseph, Wang, Gu, Lv, and Tuo 2017)

MED+MCMC
• Approximate the log-unnormalized posterior using a Gaussian process, then use MCMC
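
The idea can be sketched as follows, under several assumptions: a regular grid stands in for the MED points, the Gaussian process uses a fixed squared-exponential kernel with a constant mean, only its posterior mean is used as the surrogate, and a random-walk Metropolis sampler is confined to the region the design covers. None of these choices is specified in the talk, and all names are hypothetical.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq / length**2)

def fit_gp_mean(x_design, y_design, length=1.0, nugget=1e-8):
    """Return the GP posterior-mean function (constant prior mean, fixed kernel)."""
    mu = y_design.mean()
    k = rbf(x_design, x_design, length) + nugget * np.eye(len(x_design))
    weights = np.linalg.solve(k, y_design - mu)
    return lambda x: mu + rbf(np.atleast_2d(x), x_design, length) @ weights

def metropolis(log_density, start, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis on a scalar-valued log-density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(start, dtype=float)
    lp = log_density(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        prop = x + step * rng.normal(size=x.size)
        lp_prop = log_density(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept or reject
            x, lp = prop, lp_prop
        chain[t] = x
    return chain

def log_h(x):
    """Toy unnormalized log-posterior; imagine each call is expensive."""
    return -0.5 * (x**2).sum(axis=-1)

grid = np.linspace(-3.0, 3.0, 7)
design = np.array([[a, b] for a in grid for b in grid])   # stand-in for MED points
gp_mean = fit_gp_mean(design, log_h(design))               # 49 expensive evaluations

def surrogate(x):
    """Cheap surrogate log-density, restricted to the design region."""
    if np.any(np.abs(x) > 3.0):
        return -np.inf
    return float(gp_mean(x)[0])

chain = metropolis(surrogate, start=np.zeros(2), n_steps=5000, step=0.8)
print("chain mean:", chain.mean(axis=0))
```

All 5,000 MCMC steps call only the surrogate, so the expensive function is evaluated just at the design points.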

Support Points
$$\min_{x_1,\dots,x_n}\ \frac{2}{nC}\sum_{i=1}^{n}\int \|x_i - x\|\, h(x)\,dx \;-\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|$$
• Normalizing constant C doesn’t factor out!
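
One way to read the last bullet, assuming the Euclidean norm and f = h/C as on the Minimum Energy Design slide, is to split the objective into a part that carries C and a part that does not (the labels A and B are introduced here for illustration):

```latex
\[
\frac{2}{nC}\sum_{i=1}^{n}\int \|x_i - x\|\, h(x)\,dx
\;-\;
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|
\;=\; \frac{1}{C}\,A(x_1,\dots,x_n) \;-\; B(x_1,\dots,x_n),
\]
\[
A = \frac{2}{n}\sum_{i=1}^{n}\int \|x_i - x\|\, h(x)\,dx,
\qquad
B = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|.
\]
```

Because C rescales only A, the minimizer of A/C − B generally changes with C, so knowing h alone is not enough to find the support points.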

Research Questions
• Can we do fast Gaussian process approximation with big data?
• Can we adaptively estimate the normalizing constant?
– Simon’s talk!

Thanks
• Lulu Kang
• Lester Mackey
• Fred Hickernell
• Mac Hyman
• Scott Schmilder
• Joe Marion
• Raaz Dwivedi
• Kan Zhang
• Cheng Cheng
• Matthias Sachs

References
Support points
• Mak, S. and Joseph, V. R. (2018). “Support Points”. Annals of Statistics, to appear. https://arxiv.org/abs/1609.01811
• Mak, S. and Joseph, V. R. (2017). “Projected Support Points: A New Method for High-Dimensional Data Reduction”. Under review. https://arxiv.org/abs/1708.06897
Minimum energy designs
• Joseph, V. R., Dasgupta, T., Tuo, R., and Wu, C. F. J. (2015). “Sequential Exploration of Complex Surfaces Using Minimum Energy Designs”. Technometrics, 57, 64-74.
• Joseph, V. R., Wang, D., Gu, L., Lv, S., and Tuo, R. (2017). “Deterministic Sampling of Expensive Posteriors Using Minimum Energy Designs”. https://arxiv.org/abs/1712.08929
