A short introduction to Vertical Federated Learning, presented at the Summer School MegaData: Federated Machine Learning 2023 at the University of Tartu.
2. Federated Learning
“Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a
machine learning problem, under the coordination of a central server or service provider. Each client’s
raw data is stored locally and not exchanged or transferred; instead, focused updates intended for
immediate aggregation are used to achieve the learning objective.”
Kairouz et al., Advances and open problems in federated learning, 2019
8. Step 1 - Secure Data Alignment
Monica Scannapieco, et al., 2007. Privacy Preserving Schema and Data Matching. https://doi.org/10.1145/1247480.1247553
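The alignment idea can be illustrated in a few lines. The Python sketch below shows the simplest variant: the parties exchange only salted hashes of their record IDs, so nothing beyond the intersection is revealed. This is a deliberate simplification of the secure schema/data matching protocols in the paper; the shared salt and all toy IDs are illustrative assumptions.

```python
# A minimal sketch of privacy-preserving ID alignment via salted hashing.
# This simplifies the secure matching of Scannapieco et al.; the shared
# secret salt and the toy IDs below are illustrative assumptions.
import hashlib

def blind(ids, salt):
    """Map each record ID to a salted hash that hides the raw value."""
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "shared-secret"                      # agreed upon out of band
party_a = blind(["u1", "u2", "u3"], salt)   # A's record IDs
party_b = blind(["u2", "u3", "u4"], salt)   # B's record IDs

# Only the hashes are exchanged; matching hashes reveal the common IDs.
common = party_a.keys() & party_b.keys()
print(sorted(party_a[h] for h in common))   # ['u2', 'u3']
```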
9. Secure Model Training in VFL
Yang, et al., Federated Machine Learning: Concept and Applications
•Step 1: collaborator C creates an encryption key pair and sends the public key to A and B;
•Step 2: A and B encrypt and exchange the intermediate results needed for the gradient and loss calculations;
•Step 3: A and B each compute their encrypted gradients and add a random mask; B also computes the encrypted loss; A and B send the encrypted values to C;
•Step 4: C decrypts and sends the decrypted gradients and loss back to A and B; A and B unmask the gradients and update their model parameters accordingly (a minimal sketch of this exchange follows below).
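As a rough illustration of steps 1-4, here is a minimal Python sketch of the masked, encrypted gradient update for party B's side (A is symmetric), assuming the python-paillier (phe) package for additively homomorphic encryption. The toy data and the single update step are simplifications, not the full protocol.

```python
# A minimal sketch of one encrypted gradient update (party B's side;
# A's side is symmetric), assuming python-paillier: pip install phe.
import numpy as np
from phe import paillier

# Step 1: coordinator C creates the key pair; the public key goes to A and B.
public_key, private_key = paillier.generate_paillier_keypair()

# Toy vertically partitioned data: A holds 2 features, B holds 1 plus labels.
X_A = np.array([[1.0, 2.0], [3.0, 4.0]])
X_B = np.array([[0.5], [1.5]])
y = np.array([1.0, 2.0])
theta_A, theta_B = np.zeros(2), np.zeros(1)

# Step 2: A encrypts its partial predictions and sends them to B.
enc_u_A = [public_key.encrypt(float(v)) for v in X_A @ theta_A]

# B forms the encrypted residuals d_i = u_i^A + u_i^B - y_i.
u_B = X_B @ theta_B
enc_d = [ea + float(ub - yi) for ea, ub, yi in zip(enc_u_A, u_B, y)]

# Step 3: B computes its encrypted gradient and adds a random mask.
mask = np.random.rand(X_B.shape[1])
enc_grad_B = [sum(d * float(x) for d, x in zip(enc_d, X_B[:, j])) + float(mask[j])
              for j in range(X_B.shape[1])]

# Step 4: C decrypts the masked gradient; B unmasks it and updates its model.
grad_B = np.array([private_key.decrypt(g) for g in enc_grad_B]) - mask
theta_B -= 0.01 * grad_B
```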
10. Vertical Federated Linear Regression
Yang, et al., Federated Machine Learning: Concept and Applications
11. Vertical Federated Linear Regression
Yang, et al., Federated Machine Learning: Concept and Applications
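These two slides refer to the standard formulation from Yang et al., restated here for reference (party A holds features only, party B holds features and the labels; constant factors may differ from the paper):

```latex
% Objective over the aligned samples, split between parties A and B:
\min_{\Theta_A,\,\Theta_B} \; \sum_i \big(\Theta_A x_i^A + \Theta_B x_i^B - y_i\big)^2
  + \frac{\lambda}{2}\big(\|\Theta_A\|^2 + \|\Theta_B\|^2\big)

% With partial predictions u_i^A = \Theta_A x_i^A, u_i^B = \Theta_B x_i^B
% and residual d_i = u_i^A + u_i^B - y_i, the local gradients are
\frac{\partial \mathcal{L}}{\partial \Theta_A} = \sum_i d_i\, x_i^A + \lambda\,\Theta_A,
\qquad
\frac{\partial \mathcal{L}}{\partial \Theta_B} = \sum_i d_i\, x_i^B + \lambda\,\Theta_B .
```

Only the encrypted u_i and d_i cross party boundaries; the raw features never leave their owners.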
12. Secure Evaluation in VFL
Yang, et al., Federated Machine Learning: Concept and Applications
Is the evaluation secure enough? Can C infer the raw data of A and B?
Possible solution: Secure Multiparty Computation (SMC); a minimal secret-sharing sketch follows below.
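To make the SMC idea concrete, here is a minimal Python sketch of additive secret sharing, the basic building block: each value is split into random shares, so any single party (including C) sees only noise, yet sums can still be computed on the shares.

```python
# A minimal sketch of additive secret sharing: no single share reveals
# anything about the secret, but sums of shares reconstruct correctly.
import random

P = 2**61 - 1  # a large prime modulus

def share(secret, n=3):
    """Split a secret into n additive shares modulo P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

a_shares = share(42)    # A's secret, split across the parties
b_shares = share(100)   # B's secret, split across the parties

# Each party adds its shares locally; the result reconstructs to a + b.
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 142
```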
13. Do we really need a coordinator?
(Yang et al., Parallel Distributed Logistic Regression for Vertical Federated Learning without Third-Party Coordinator, 2019)
14. Existing Vertically Federated Learning Algorithms
•Linear regression
(Gascon, et al., Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing
Technologies, 2017(4):345-364, 2017.)
•Association rule-mining
(Vaidya, Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining, pages 639-644. ACM, 2002.)
•K-means clustering
(Vaidya, Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 206-215, 2003.)
•Logistic regression
(Hardy et al., Private federated learning on vertically partitioned data via entity resolution and additively homomorphic
encryption, arXiv:1711.10677, 2017.)
•Random forest
(Liu, et al., Federated forest. arXiv:1905.10053, 2019.)
•XGBoost
(Cheng, et al., Secureboost: A lossless federated learning framework. arXiv:1901.08755, 2019.)
30. Incentive/Reward Allocation to Parties in VFL
● What is the contribution of each party?
● What does each party bring to the table?
● How can parties be rewarded with incentives fairly?
● How can the allocated incentives be explained to the parties?
31. Existing Approaches in FL for Incentive Allocation
Approaches explored for incentive allocation in FL:
● Game theory: Shapley value, Stackelberg game
● Auction theory
● Contract theory
Only Shapley values have been explored so far for VFL settings!
32. Designing Pipeline for Fair Incentive Allocation in VFL
Client Selection → Contribution Measurement → Incentive Allocation → Explanation
(A minimal sketch of Shapley-value contribution measurement follows below.)
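For the contribution-measurement stage, here is a minimal Python sketch of the Shapley value computed over parties; the `utility` callback is a hypothetical stand-in for, e.g., the validation score of a model trained on a coalition's combined features.

```python
# A minimal sketch of Shapley-value contribution measurement over parties.
# `utility` maps a coalition (set of parties) to a score; the toy accuracy
# table below is an illustrative assumption.
from itertools import combinations
from math import factorial

def shapley(parties, utility):
    n = len(parties)
    values = {}
    for p in parties:
        others = [q for q in parties if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                s = set(coalition)
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Toy utility: validation accuracy of a model trained per coalition.
acc = {frozenset(): 0.0, frozenset("A"): 0.5, frozenset("B"): 0.4,
       frozenset("C"): 0.1, frozenset("AB"): 0.8, frozenset("AC"): 0.6,
       frozenset("BC"): 0.5, frozenset("ABC"): 0.9}
print(shapley(["A", "B", "C"], lambda s: acc[frozenset(s)]))
```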
33. Open Challenges in VFL
● Communication Overhead
● Asynchronism
● Data Scarcity
● Data Redundancy
● Defense Mechanisms for Backdoor Attacks
● High Dimensions
● Fairness: Model Fairness, Collaborative Fairness
● Explainability
36. Centralized Linear Regression Model
Features: X = {x1, …, x6}
Training samples: 7000
Testing samples: 3000
Learning rate: 0.01
Epochs: 50
R2 score: 0.99
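The centralized baseline is easy to reproduce. The Python sketch below follows the stated configuration (6 features, 7000/3000 split, learning rate 0.01, 50 epochs) and uses the ground-truth weights from the comparison table further down; the synthetic data generation itself is an assumption.

```python
# A minimal sketch of the centralized baseline under the stated setup;
# the synthetic data (and its noise level) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, 5.0, 3.0, 4.0, 1.0, 6.0])   # weights from slide 40
X = rng.normal(size=(10_000, 6))
y = X @ true_w + rng.normal(scale=0.1, size=10_000)
X_tr, X_te, y_tr, y_te = X[:7000], X[7000:], y[:7000], y[7000:]

w = np.zeros(6)
for _ in range(50):                          # epochs
    for i in rng.permutation(7000):          # per-sample SGD, learning rate 0.01
        w -= 0.01 * (X_tr[i] @ w - y_tr[i]) * X_tr[i]

ss_res = np.sum((y_te - X_te @ w) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
print("R2:", round(1 - ss_res / ss_tot, 4), "weights:", w.round(2))
```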
37. Vertical Partitioning of the Dataset
Full dataset: features X = {x1, …, x6}; target Y; 7000 training samples, 3000 testing samples.
Client 1: X = (x1, x2), 2 features, 7000 training samples
Client 2: X = (x3, x4), 2 features, 7000 training samples
Client 3: X = (x5, x6), 2 features, 7000 training samples
38. Conventional Machine Learning
Each client trains its own linear regression model on its local features only; target Y.
Client 1: X = (x1, x2), 2 features, 7000 training / 3000 testing samples
Client 2: X = (x3, x4), 2 features, 7000 training / 3000 testing samples
Client 3: X = (x5, x6), 2 features, 7000 training / 3000 testing samples
R2 score with local features only: 0.3054
39. Vertical Federated Linear Regression
Guest party (the client with labels):
•Complete a forward propagation using local data
•Receive the forward outputs (intermediate results) from the host parties
•Calculate the loss from the loss function
•Send the loss to the host parties
•Compute gradients
•Update the local model
Host party:
•Complete a forward propagation using local data
•Send intermediate results to the guest party
•Receive the loss computed by the guest party
•Compute gradients
•Update the local model
(A minimal sketch of this loop follows below.)
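Here is a minimal plaintext Python sketch of this guest/host loop for the three-client setup above (without the encryption layer of slide 9); the synthetic data mirrors the baseline sketch and is an assumption.

```python
# A minimal plaintext sketch of the guest/host loop: each party forwards
# on its local features, the guest turns the summed outputs into a
# residual, and every party updates only its own weights.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, 5.0, 3.0, 4.0, 1.0, 6.0])
X = rng.normal(size=(7000, 6))
y = X @ true_w + rng.normal(scale=0.1, size=7000)

# Vertical partition: the guest holds (x1, x2) plus the labels.
splits = [X[:, 0:2], X[:, 2:4], X[:, 4:6]]
weights = [np.zeros(2) for _ in splits]

for _ in range(50):                                  # epochs
    for i in rng.permutation(7000):                  # per-sample SGD
        # Each party completes a forward pass on its local data.
        partial = [Xp[i] @ w for Xp, w in zip(splits, weights)]
        # Hosts send intermediate results; the guest computes the residual
        # (the derivative of the squared loss) and sends it back.
        residual = sum(partial) - y[i]
        # Every party updates its local model with its own features only.
        for k, Xp in enumerate(splits):
            weights[k] -= 0.01 * residual * Xp[i]

print([w.round(2) for w in weights])   # ≈ [2, 5], [3, 4], [1, 6]
```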
40. Comparison of Weights After Convergence
                                    w1     w2     w3     w4     w5     w6
Actual weights                      2.0    5.0    3.0    4.0    1.0    6.0
After convergence (centralized)     2.01   4.91   3.006  3.996  1.03   5.897
After convergence (VFL)             1.95   4.87   2.90   3.88   1.06   5.91
41. Evaluation of the Model in VFL (Logistic Regression)
Client 1 (guest): contains the labels Y; X = (x1, x2), 2 features, 7000 training / 3000 testing samples
Client 2 (host): does not contain labels; X = (x3, …, x5), 3 features, 7000 training / 3000 testing samples
Client 3 (host): does not contain labels; X = (x6), 1 feature, 7000 training / 3000 testing samples
Each client runs its own local logistic regression model; the final prediction combines Client 1 + Client 2 + Client 3 outputs.
R2_SCORE: 0.99
(A minimal sketch of the combined evaluation follows below.)
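Evaluation follows the same split-computation pattern. A minimal Python sketch, with illustrative partial scores standing in for each client's local output:

```python
# A minimal sketch of the combined evaluation: partial scores are summed
# and the sigmoid is applied once to the total, so no client ever needs
# the others' raw features. The toy partial outputs are illustrative.
import numpy as np

def predict_vfl(partials):
    """partials: per-client score vectors, e.g. X_k @ w_k (guest adds bias)."""
    logits = np.sum(partials, axis=0)      # Client1 + Client2 + Client3 outputs
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid on the combined score

p1 = np.array([0.4, -1.2, 2.0, 0.1, -0.5])   # guest, features (x1, x2)
p2 = np.array([1.1, 0.3, -0.7, 0.9, 0.2])    # host, features (x3, ..., x5)
p3 = np.array([-0.2, 0.5, 0.3, -1.0, 0.8])   # host, feature (x6)
print((predict_vfl([p1, p2, p3]) >= 0.5).astype(int))
```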