FEDBN: FEDERATED LEARNING
ON NON-IID FEATURES VIA LOCAL
BATCH NORMALIZATION
Paper presentation by Anam ur rehman
Contact: anamur.rehman@studenti.polito.it
Published as a conference paper at ICLR 2021
Authors: Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, Qi Dou
FEDBN: Federated learning [1]
[1] Jakub Konečný et al. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. 2016.
Classical Machine Learning:
• Centralized data storage
• Training computations happen at the central server.
What if instead...
 Data stays distributed on remote devices
 Devices maintain control of their own data
 Training is done locally on remote devices
 One global model is learned via aggregation
Autonomous cars generate, on average, around 4 GB of data per hour of driving.
FEDBN: Federated learning
Applications [1]
• Transportation: self-driving cars
• Healthcare: predictions on patient data
• Cybersecurity: spam filtering
• Smart applications: voice recognition, next word prediction, etc.
[1] Read more: Priyanka et al. Federated Learning: Opportunities and Challenges, 2021
Challenges [1]
• Communication overheads: presence of stragglers
• Heterogeneity: system and statistical (in contrast to classical distributed learning)
• Privacy concerns
FEDBN: Federated learning
Statistical heterogeneity among local datasets:
• Unbalancedness: clients may have different amounts of data
 Example: a spam filter for email
• Covariate shift: the statistical distribution of inputs varies among clients
 Example: handwritten digit recognition across different writing styles and image sources
• Concept shift: the same features may correspond to different labels for different clients
 Example: in sentiment analysis, the same text may carry different sentiment labels for different clients
FEDBN: Example of non-IID datasets
Digits dataset: MNIST, MNIST-M, USPS, SynthDigits, SVHN
[Figure: sample images from the five digit datasets, illustrating covariate shift among datasets]
FEDBN: Related work
• FedAvg[1]: Federated Averaging
[1] Brendan McMahan et al. Communication-efficient learning of deep networks from decentralized data. 2017.
At each communication round:
1. The server randomly selects a subset of K clients and sends them the current global model.
2. Each selected client k updates this model on its local data via SGD, then sends the new local model back to the server.
3. The server aggregates the local models to form a new global model (a sketch follows below).
- Convergence is not guaranteed; in heterogeneous settings FedAvg can diverge [1].
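A minimal sketch of one FedAvg round in PyTorch, assuming hypothetical client objects that expose a data loader and a sample count (the names loader and n_samples are illustrative, not from the paper):

    import copy, itertools, random
    import torch
    import torch.nn.functional as F

    def fedavg_round(global_model, clients, k, local_steps, lr):
        # One FedAvg communication round (a sketch, not the authors' code).
        selected = random.sample(clients, k)              # 1. sample K clients
        states, weights = [], []
        for c in selected:
            model = copy.deepcopy(global_model)           #    send them the global model
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for x, y in itertools.islice(c.loader, local_steps):
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()   # 2. local SGD on client data
                opt.step()
            states.append(model.state_dict())
            weights.append(c.n_samples)
        total = sum(weights)
        avg = {key: sum((w / total) * s[key].float()      # 3. weighted average of
                        for w, s in zip(weights, states)) #    all local models
               for key in states[0]}
        global_model.load_state_dict(avg)                 # becomes the new global model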
FEDBN: Related work
• FedProx[1]: Federated Optimization in Heterogeneous Networks
[1] Tian Li et al. In Conference on Machine Learning and Systems, 2020.
Slide credit: Tian Li, MLSys presentation.
+ Limits the impact of heterogeneous local updates via a proximal term (see the objective below)
+ Safely incorporates partial work from stragglers
+ Generalizes FedAvg; allows any local solver
+ Provides theoretical convergence guarantees
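In FedProx, each selected client k approximately minimizes a proximal local objective around the current global model $w^t$, where $F_k$ is client k's local loss and $\mu \ge 0$ controls how far local updates may drift:

$$\min_{w} \; h_k(w; w^t) \;=\; F_k(w) \;+\; \frac{\mu}{2}\,\lVert w - w^t \rVert^2 .$$

Setting $\mu = 0$ recovers FedAvg as a special case.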
FEDBN: Related work
• SiloBN[1]: Siloed Federated Learning for Multi-Centric Histopathology Datasets
Key idea: each client (silo) keeps its BN statistics local, while the learnable BN parameters are shared and averaged together with the rest of the network.
[1] Mathieu Andreux et al. Siloed federated learning for multi-centric histopathology datasets, pp. 129–139. Springer, 2020.
Slide credits: [1]
FEDBN: Batch Normalization
[1] Sergey Ioffe et al. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015
Why do we use it? To reduce internal covariate shift in neural networks [1].
How does it work? A BN layer normalizes each mini-batch and then rescales the result; γ and β are the only learnable parameters of the BN layer.
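The standard BN transform from [1], for activations $x_i$ in a mini-batch $B$ with mean $\mu_B$ and variance $\sigma_B^2$ ($\epsilon$ is a small constant for numerical stability):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta .$$

At inference time, $\mu_B$ and $\sigma_B^2$ are replaced by running estimates accumulated during training.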
FEDBN: Problem with non-IID data
Consider a simple, non-convex learning problem: two clients fit the model

$$f_w(x_i) = \cos(w\,x_i), \qquad \text{with noisy labels } y_i = f_{w^*}(x_i) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

subject to a covariate shift between the clients' inputs:

$$x_1 \sim \mathcal{N}(\mu, \sigma_1^2), \qquad x_2 \sim \mathcal{N}(\mu, \sigma_2^2), \qquad \sigma_1^2 \neq \sigma_2^2 .$$

[Figure: each client's local squared loss as a function of $w$, before and after local BN. Without BN, the clients' minimizers $w_1^*$ and $w_2^*$ differ; applying local BN aligns the two loss landscapes.]
FEDBN: Why not just take the average? (SiloBN)
Client 1: $w_1^*$ denotes the optimal weight, $\gamma_1^*$ the optimal BN parameter.
Observation 1: for the fixed optimal weight $w_1^*$, changing $\gamma$ deteriorates model quality.
Observation 2: for the fixed optimal BN parameter $\gamma_1^*$, changing $w$ deteriorates model quality.
So averaging either $w$ or $\gamma$ across clients moves Client 1 away from its optimum; this motivates keeping BN parameters local.
FEDBN: How it works?
• Local training: each client trains all model parameters, including its BN layers, on its own data.
• Global aggregation: the server averages all parameters except those of the BN layers, which never leave the clients (see the update rule below).
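Writing $\theta_{k,\ell}$ for the parameters of layer $\ell$ at client $k$ and $p_k$ for client $k$'s aggregation weight, one round of FedBN can be summarized as (a schematic restatement, not the paper's exact notation):

$$\theta_{k,\ell}^{t+1} = \begin{cases} \sum_{j=1}^{K} p_j\, \theta_{j,\ell}^{t} & \text{if layer } \ell \text{ is not a BN layer,}\\[4pt] \theta_{k,\ell}^{t} & \text{if layer } \ell \text{ is a BN layer (kept local).} \end{cases}$$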
FEDBN: PyTorch implementation
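A minimal sketch of FedBN's aggregation step, assuming BN modules carry 'bn' in their state-dict keys as in the reference repo; the function name and weighting scheme are illustrative, not the authors' exact code (see med-air/FedBN for the real implementation):

    import copy
    import torch

    def fedbn_aggregate(client_states, weights):
        # Average every parameter across clients EXCEPT batch-norm entries
        # (weights, biases, running mean/var), which stay local to each client.
        total = sum(weights)
        new_states = [copy.deepcopy(s) for s in client_states]
        for key in client_states[0]:
            if 'bn' in key:                     # skip all BN-related entries (naming assumption)
                continue
            avg = sum((w / total) * s[key].float()
                      for w, s in zip(weights, client_states))
            for s in new_states:                # shared layers: same averaged value everywhere
                s[key] = avg
        return new_states                       # one state dict per client

Note that, unlike FedAvg, FedBN produces per-client models: the shared layers are identical across clients, while each client retains its own BN layers.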
FEDBN: How it Really Works?
Source: med-air/FedBN (github.com)
[Figure: global aggregation step in the reference implementation]
FEDBN: Results on the digits benchmark (FedAvg vs FedBN)
• Outperforms FedAvg, most notably on the SVHN dataset
• Faster convergence
• Smooth and robust convergence
FEDBN: Results; what if communication is done at different frequencies?
FEDBN: Results; what if the dataset size varies for each client?
FEDBN: Contributions
• Provides convergence guarantees.
• Improves convergence behavior on non-IID (feature-shifted) datasets.
• Goes one step further in protecting clients' data privacy, since BN parameters never leave the device.
FEDBN: Take home message
• Use batch normalization
• Keep it local
• Enjoy smooth and fast convergence
Useful links
• FedProx presentation by Tian Li: Federated Optimization in Heterogeneous Networks
• PyTorch implementation of FedBN: med-air/FedBN (github.com)
• Brendan McMahan's Talk: Guarding user Privacy with Federated Learning
