"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Deep Learning Theory Seminar (Chap 3, part 2)
1. Deep Learning Theory Lecture Note
Chapter 3 (part 2)
2022.04.06.
KAIST ALIN-LAB
Sangwoo Mo
2. • Maurey sampling technique
• Let 𝑋 = 𝔼𝑉 for a random variable 𝑉 supported on a set 𝑆
• The finite-sample approximation of 𝑋 is 𝑋̂ = (1/𝑘) ∑ᵢ₌₁ᵏ 𝑉ᵢ, for 𝑉ᵢ iid sampled from 𝑝(𝑉)
• Here, 𝑋̂ ≈ 𝑋 as 𝑘 → ∞ (precisely, 𝔼‖𝑋 − 𝑋̂‖² = 𝑂(1/𝑘))
• It is very intuitive… let’s prove it!
(3.3) How to sample finite networks?
𝑉 is on a Hilbert space (i.e., has an inner product)
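The 𝑂(1/𝑘) rate is easy to check numerically. A minimal sketch (my own toy setup, not from the slides): take 𝑉 uniform over a finite set 𝑆 ⊂ ℝᵈ, so 𝑘 ⋅ 𝔼‖𝑋 − 𝑋̂‖² should stay roughly constant as 𝑘 grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
S = rng.normal(size=(50, d))   # finite support set S in R^d (toy Hilbert space)
X = S.mean(axis=0)             # X = E[V] under the uniform distribution p(V)

def mse(k, trials=2000):
    """Monte Carlo estimate of E||X - X_hat||^2 for X_hat = (1/k) sum_i V_i."""
    idx = rng.integers(0, len(S), size=(trials, k))   # k iid draws per trial
    X_hat = S[idx].mean(axis=1)                       # (trials, d) sample means
    return ((X_hat - X) ** 2).sum(axis=1).mean()

for k in (10, 40, 160):
    # k * E||X - X_hat||^2 should hover around E||V||^2 - ||X||^2
    print(k, round(k * mse(k), 3))
```

In fact the rate is exact here: 𝔼‖𝑋 − 𝑋̂‖² = (𝔼‖𝑉‖² − ‖𝑋‖²)/𝑘, so the printed products agree up to Monte Carlo noise.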
4. • Maurey sampling technique
• Formal statement
𝔼‖𝑋 − 𝑋̂‖² = 𝑂(1/𝑘), which goes to zero as 𝑘 → ∞
(1) We bound the 𝔼 form
(2) If the 𝔼 over 𝑉₁, … , 𝑉ₖ is some value 𝐾, then there exists some realization 𝑈₁, … , 𝑈ₖ with value ≤ 𝐾
(we need only one realization of 𝑈₁, … , 𝑈ₖ that satisfies (1/𝑘) ∑ᵢ 𝑈ᵢ ≈ 𝑋)
This technique is called the “probabilistic method”!
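Step (1) is a short variance computation in the Hilbert-space norm. The following derivation is my reconstruction (consistent with the 𝑂(1/𝑘) rate above), not copied from the slides:

```latex
\mathbb{E}\,\Bigl\| X - \tfrac{1}{k}\sum_{i=1}^{k} V_i \Bigr\|^2
  = \frac{1}{k^2} \sum_{i=1}^{k} \mathbb{E}\,\| V_i - X \|^2
  = \frac{\mathbb{E}\|V\|^2 - \|X\|^2}{k}
  \le \frac{\sup_{v \in S} \|v\|^2}{k},
```

since the cross terms 𝔼⟨𝑉ᵢ − 𝑋, 𝑉ⱼ − 𝑋⟩ vanish for 𝑖 ≠ 𝑗 by independence and 𝔼𝑉ᵢ = 𝑋. Step (2) then picks one realization 𝑈₁, … , 𝑈ₖ whose value is at most this expectation.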
6. • Sampling finite-width network
• Lemma 3.1 assumes that 𝑝(𝑉) is a probability distribution – (1) nonnegative, (2) sums to 1
• However, the “weight distribution” of our infinite-width NN is not a probability distribution!
• Our infinite-width NN
• The weight distribution of (𝑤, 𝑏) is sin ⋯ – (1) it can be negative, (2) its sum is not 1
• Q. How to extend Maurey sampling for general weight distribution?
7. • Sampling finite-width network
• For simplicity, let the infinite-width NN be 𝑓(𝑥) = ∫ 𝑔(𝑥; 𝑤) 𝑑𝜇(𝑤), where 𝜇 is a signed measure over weight vectors 𝑤 ∈ ℝᵈ
• 𝑔 is some abstract node (e.g., 𝑔(𝑥; 𝑤) = 𝜎(𝑎⊺𝑥 + 𝑏) for 𝑤 = {𝑎, 𝑏})
• To convert the general signed measure 𝜇 to a probability measure,
1) Introduce a sign parameter 𝑠 ∈ {±1} and consider the nonnegative measures 𝜇±
• For 𝜇 = 𝜇₊ − 𝜇₋, both 𝜇₊ and 𝜇₋ are nonnegative (Jordan decomposition)
• Attach 𝑠 = +1 to the 𝜇₊ region and 𝑠 = −1 to the 𝜇₋ region (Pr(𝑠 = +1) = ‖𝜇₊‖/‖𝜇‖)
2) Normalize the nonnegative measure to 𝜇/‖𝜇‖ so that it sums to 1
• Multiply the output 𝑔(𝑥; 𝑤, 𝑠) by the normalizing constant ‖𝜇‖
• After this conversion, Maurey sampling extends to a general signed measure
‖ Infinite NN – Finite NN ‖
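The two-step conversion above can be sketched for a discrete signed measure. Everything here (the toy 𝜇, the ReLU node 𝑔, the dimensions) is illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete signed measure: mass mu[j] (possibly negative) on weight W[j]
W = rng.normal(size=(20, 4))        # rows w = (a, b): a in R^3, bias b
mu = rng.normal(size=20)            # signed masses

def g(x, w):
    """Abstract node g(x; w) = sigma(a^T x + b) with ReLU sigma."""
    a, b = w[:-1], w[-1]
    return max(a @ x + b, 0.0)

# 1) Jordan decomposition mu = mu_plus - mu_minus, both nonnegative;
#    attach sign s = +1 to mu_plus atoms and s = -1 to mu_minus atoms
mu_plus, mu_minus = np.maximum(mu, 0.0), np.maximum(-mu, 0.0)
norm = mu_plus.sum() + mu_minus.sum()            # ||mu|| (total variation)

# 2) Normalize to a probability over the 2 * len(W) signed atoms (w, s)
probs = np.concatenate([mu_plus, mu_minus]) / norm
signs = np.concatenate([np.ones(20), -np.ones(20)])

def f_infinite(x):
    """'Infinite-width' NN: f(x) = sum_j mu[j] * g(x; W[j])."""
    return sum(m * g(x, w) for m, w in zip(mu, W))

def f_sampled(x, k):
    """Maurey estimate: average of k samples of ||mu|| * s * g(x; w)."""
    idx = rng.choice(len(probs), size=k, p=probs)
    return norm * np.mean([signs[i] * g(x, W[i % len(W)]) for i in idx])

x = rng.normal(size=3)
```

Each sample has expectation ∑ⱼ (𝜇₊,ⱼ − 𝜇₋,ⱼ) 𝑔(𝑥; 𝑤ⱼ) = 𝑓(𝑥), so `f_sampled(x, k)` approaches `f_infinite(x)` as 𝑘 grows, at the same 𝑂(1/𝑘) mean-squared rate as before.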
8. • Sampling finite-width network
• Applying Maurey sampling, the approx. error of the finite-width NN is bounded in both settings:
• (3.1) Univariate case: approx. error ≤ ⋯
• (3.2) Barron’s construction: approx. error ≤ ⋯