A Decentralised Platform for Provenance Management of Machine Learning Software Systems
1. A Decentralised Platform for Provenance Management of Machine Learning Software Systems
Nguyen Khoi Tran, Bushra Sabir, M. Ali Babar, Nini Cui; CREST – The University of Adelaide, Australia
Mehran Abolhasan and Justin Lipman; University of Technology Sydney, Australia
16th European Conference on Software Architecture (ECSA), Mon 19 - Fri 23 September 2022, Prague, Czech Republic
2. What if a key feature of your software is built by dozens of teams around the world …
… but you have little idea how it was built?
1 Bernardi, L., Mavridis, T., Estevez, P.: 150 successful machine learning models: 6 lessons learned at Booking.com. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1743–1751 (2019)
2 Nahar, N., Zhou, S., Lewis, G., Kästner, C.: Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. Organization 1(2), 3 (2022)
3. Scenario: Distributed ML DevOps
Roles: Client / Operator, Dataset Admin, Model Developer, Model Verifier, Auditor
Workflow steps:
(1) I have an idea for an ML application
(2) I hire a company to collect data for me
(3) I gather the data
(4) I outsource labeling to Amazon Mechanical Turk
(5) I pass the data to the appointed developers
(6) I develop the ML model
(7) I test and verify the model
(8) I return the model to the client
(9) I seek third-party validation
4. Scenario: Distributed ML DevOps goes wrong
The same roles and nine workflow steps as in slide 3, now annotated with what can go wrong:
• Deliberate mislabeling (poisoning)
• Dataset tampering
• Vulnerability in ML frameworks
• Model swapping
• Cover-up
• Not enough information
5. How to capture and preserve the records of “who did what” to ML assets (a.k.a. workflow provenance information) in a distributed ML workflow environment?
6. Existing Approach: A Centralised Platform
[Diagram: the five roles (Client / Operator, Dataset Admin, Model Developer, Model Verifier, Auditor) around a central Provenance Database]
The Provenance Database holds:
• Dataset provenance (e.g., data sheet)
• Asset provenance
• Model development records
• Model testing records (e.g., model card)
• Auditing results
7. Problem: Security
[Diagram: the centralised Provenance Database from slide 6]
• Authenticity: how to know that the information is real and comes from an authorized entity?
• Integrity: how to prevent and detect tampering with provenance records?
• Non-repudiation: how to prevent a party from falsely denying its records?
8. Problem: Resilience
[Diagram: the centralised Provenance Database from slide 6]
• Info availability: how to ensure that the provenance records are always accessible?
• Process availability: how to ensure that new provenance info can always be recorded?
• Fault tolerance: how to avoid a single point of failure?
9. Problem: Decentralisation and Trust
[Diagram: the centralised Provenance Database from slide 6]
• Disintermediation: who would manage and operate this centralized database?
• Information flow: how to avoid sending sensitive information to a third party?
• User-driven: how to enable users to control what provenance information they submit, and how?
12. Design Principles
P1: If you use provenance … you control it … you manage and store it
P2: Use your existing tools; keep info flow within your organisation
P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience
13. Decentralized Software Platform
[Architecture diagram] Each participant (Dataset Admin, Model Developer, Model Verifier, Model Operator, Auditor) is shown with its own ProML Node; the nodes exchange Provenance Update Broadcasts.
Components labelled inside a ProML Node:
• User Interface: CLI Client, Capturing Library, Query Interface
• Provenance Capturing
• Provenance Querying
• Blockchain Wallet, Signer
• Provider Clients Service: IPFS Client, Blockchain Client
External services labelled in the diagram:
• Storage Provider and Content Distribution Network, holding Dataset and Model
• Blockchain Provider and Blockchain, holding the Provenance Update Process, Dataset Provenance and Model Provenance
The diagram is annotated with design principles P1, P2 and P3 from slide 12.
14. User-driven Provenance Capturing
P2: Use your existing tools; keep info flow within your organisation
[Diagram: a ProML Node (User Interface with Capturing Library, Provenance Capturing, Blockchain Wallet, Provider Clients Service with IPFS Client and Blockchain Client) connected to the Content Distribution Network and the Blockchain]
Capture steps:
(1) Develop: the Model Developer writes an ML training script / notebook
(2) Embed: calls to the Logging API are embedded in the script
(3) Send pm_i
(4a) Submit payload (Storage Provider)
(4b) Return CID (Storage Provider)
(5) Craft tx_pm_i
(6) Sign tx_pm_i (Signer)
(7) Submit tx_pm_i (Blockchain Provider)
(8) Validate and insert tx_pm_i (Blockchain)
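The capture steps (3) to (8) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a SHA-256 digest stands in for an IPFS CID, an HMAC stands in for the wallet's digital signature (a real blockchain wallet uses asymmetric signatures, which is what gives non-repudiation), and `ToyLedger`, `store_payload`, `craft_tx`, `sign_tx` and `capture` are hypothetical names, not part of ProML.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


def store_payload(store: dict, payload: dict) -> str:
    """Steps 4a/4b: keep the payload off-chain, return a content identifier.

    Assumption: a SHA-256 hex digest stands in for a real IPFS CID.
    """
    blob = json.dumps(payload, sort_keys=True).encode()
    cid = hashlib.sha256(blob).hexdigest()
    store[cid] = blob
    return cid


def craft_tx(asset_id: str, operation: str, cid: str, sender: str) -> dict:
    """Step 5: build the provenance update transaction."""
    return {"to": asset_id, "op": operation, "cid": cid, "from": sender}


def sign_tx(tx: dict, key: bytes) -> dict:
    """Step 6: attach a signature (HMAC is only a stand-in for a wallet signature)."""
    body = json.dumps(tx, sort_keys=True).encode()
    return {**tx, "sig": hmac.new(key, body, hashlib.sha256).hexdigest()}


@dataclass
class ToyLedger:
    """In-memory stand-in for the blockchain (steps 7 and 8)."""
    keys: dict                      # sender name -> key known to the ledger
    txs: list = field(default_factory=list)

    def submit(self, signed: dict) -> bool:
        """Validate the signature, then append the transaction."""
        tx = {k: v for k, v in signed.items() if k != "sig"}
        key = self.keys.get(tx.get("from"))
        if key is None:
            return False
        expected = sign_tx(tx, key)["sig"]
        if not hmac.compare_digest(expected, signed.get("sig", "")):
            return False
        self.txs.append(signed)
        return True


def capture(store, ledger, key, sender, asset_id, operation, payload):
    """Steps 3 to 8 end to end; returns the accepted transaction or None."""
    cid = store_payload(store, payload)
    signed = sign_tx(craft_tx(asset_id, operation, cid, sender), key)
    return signed if ledger.submit(signed) else None
```

Only the content identifier goes into the transaction, so the payload itself can stay with a storage provider the user trusts.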
Exemplary logging API (function: parameters)
• selectData(): datasetID, datasetVersion, datasetMetadata (columnInfo, labelInfo)
• preprocessData(): processedDataset, datasetMetadata (columnInfo, labelInfo)
• engineerFeatures(): featureList, featureSelectAlg (algConfigs)
• train(): classifierInfo (type, library, version, hyperparameters), model
• evaluate(): trainingSetRatio, F1, acc, trainingDuration
• validate(): F1, acc, recall, precision, Matthew, MSE, Fowlkes
• deploy(): model, deploymentInfo
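To make the logging API more concrete, here is a minimal sketch of how a capturing library might expose three of the functions listed above to a training script. The function and parameter names follow the slide; the `ProvenanceLogger` class, its `records` list and the record layout are assumptions made for illustration and are not the actual ProML library.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLogger:
    """Hypothetical capturing library: collects one record per logged step."""
    model_id: str
    records: list = field(default_factory=list)

    def _log(self, step: str, **params) -> dict:
        # Each record is what would later be sent on as a provenance update.
        record = {"model": self.model_id, "step": step, "params": params}
        self.records.append(record)
        return record

    def selectData(self, datasetID, datasetVersion, datasetMetadata):
        return self._log("selectData", datasetID=datasetID,
                         datasetVersion=datasetVersion,
                         datasetMetadata=datasetMetadata)

    def train(self, classifierInfo, model):
        return self._log("train", classifierInfo=classifierInfo, model=model)

    def evaluate(self, trainingSetRatio, F1, acc, trainingDuration):
        return self._log("evaluate", trainingSetRatio=trainingSetRatio,
                         F1=F1, acc=acc, trainingDuration=trainingDuration)


# Usage inside an existing training script (all values are made up):
log = ProvenanceLogger(model_id="ML1")
log.selectData("kdd99", "v1", {"columnInfo": ["duration"], "labelInfo": ["attack"]})
log.train({"type": "RandomForest", "library": "scikit-learn",
           "version": "1.0", "hyperparameters": {"n_estimators": 100}},
          model="model.pkl")
log.evaluate(trainingSetRatio=0.8, F1=0.91, acc=0.93, trainingDuration=12.5)
```

The training code itself is untouched; the developer only adds the logging calls, which is the sense in which capturing is user-driven.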
15. Where we are …
Problem: Security (authenticity, integrity, non-repudiation)
• Solution: to be covered next
Problem: Resilience (availability and fault tolerance)
• Replicate assets and provenance information with blockchain and IPFS
Problem: Decentralization (disintermediation, user-driven, control information flow)
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
16. How?
Leaning on the blockchain’s Replicated State Machine model
P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience
[Diagram: Replicated State Machines]
• Smart Contract SC1 consists of SC1’s logic and SC1’s state
• BC State i: SC1’s state has Var a = 1
• Blockchain transaction: To: SC1; Instruction: use SC1’s logic to update Var a; Parameter: new a = 2; signed by sender
• BC State i+1: SC1’s state has Var a = 2
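The replicated state machine idea can be illustrated with a toy example: every replica starts from the same state and applies the same ordered transactions with the same logic, so all replicas end in the same state. The `Replica` class and the transaction fields below are hypothetical, and signature checking and consensus on ordering are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One node's copy of smart contract SC1's state."""
    a: int = 1  # BC State i: Var a = 1

    def apply(self, tx: dict) -> None:
        # SC1's logic: the only supported instruction updates Var a.
        if tx["to"] == "SC1" and tx["instruction"] == "update_a":
            self.a = tx["new_a"]


# The transaction from the slide: update Var a to 2.
ordered_txs = [{"to": "SC1", "instruction": "update_a", "new_a": 2, "sender": "alice"}]

replicas = [Replica() for _ in range(3)]
for replica in replicas:
    for tx in ordered_txs:
        replica.apply(tx)

# BC State i+1: every replica now holds Var a = 2.
states = [replica.a for replica in replicas]
```

Because each replica can recompute the state independently, no single node has to be trusted to report it.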
17. Artefact-as-a-State-Machine Approach
Mapping:
• ML assets → smart contracts
• Asset’s life cycle events → state transitions → smart contract logic
• Provenance updates → blockchain transactions
P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience
[Diagram: Replicated State Machines]
• MLAsset ML1 consists of ML1’s logic and ML1’s state
• BC State i: ML1’s state has Var model_Info = Null
• Provenance update: To: MLAsset ML1; From: Model Developer; Instruction: declare that the model has been trained; Parameter: model_info = IPFS hash; signed by model developer
• BC State i+1: ML1’s state has Var model_Info = IPFS_Hash
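As a rough sketch of the artefact-as-a-state-machine mapping, the class below models one ML asset whose provenance updates are state transitions. The state names are taken from the life cycle shown on the sample data slide; the strictly linear ordering, the `MLAsset` class and its `update` method are assumptions for illustration, and a real deployment would implement this logic in a smart contract rather than in Python.

```python
LIFECYCLE = [
    "Initial State", "Registered", "Selected Dataset", "Pre-processed Dataset",
    "Engineered Feature Sets", "Trained Model", "Evaluated Model",
    "Validated Model", "Deployed",
]


class MLAsset:
    """Hypothetical stand-in for the smart contract of one ML asset."""

    def __init__(self, asset_id: str):
        self.asset_id = asset_id
        self.state = LIFECYCLE[0]
        self.model_info = None   # Var model_Info = Null
        self.history = []        # accepted provenance updates, in order

    def update(self, new_state: str, sender: str, payload_hash=None) -> None:
        """Accept a provenance update only if it is the next life cycle step."""
        current = LIFECYCLE.index(self.state)
        if new_state not in LIFECYCLE or LIFECYCLE.index(new_state) != current + 1:
            raise ValueError(f"invalid transition {self.state!r} -> {new_state!r}")
        self.state = new_state
        if new_state == "Trained Model":
            self.model_info = payload_hash  # Var model_Info = IPFS_Hash
        self.history.append((sender, new_state, payload_hash))
```

Encoding the workflow as allowed transitions means an out-of-order or skipped step is rejected instead of silently recorded.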
18. Where we are …
Problem: Security (authenticity, integrity, non-repudiation)
• Artefact-as-a-State-Machine
• Embed provenance update into blockchain transactions
• Embed workflow in blockchain smart contracts
Problem: Resilience (availability and fault tolerance)
• Replicate assets and provenance information with blockchain and IPFS
Problem: Decentralization (disintermediation, user-driven, control information flow)
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
20. Methodology: Case Study
Case
• Adversarial ML research project
• Repeatedly training an ML-based intrusion detection model on various forms of the poisoned KDD99 dataset
• Would benefit from provenance records
Study
• Adapt ProML to the workflow
• Deploy ProML node
• Deploy ProML smart contracts on Ropsten
• Embed ProML in training scripts
• Extract statistics from blockchain transactions
Operations (ID: operation)
• D1: Register a dataset
• D2: Update a dataset
• ML1: Register an ML model
• ML2-1: Update model's provenance: selecting a dataset
• ML2-2: Update model's provenance: preprocessing a dataset
• ML2-3: Update model's provenance: feature engineering
• ML2-4: Update model's provenance: training the model
• ML2-5: Update model's provenance: evaluating a model
• ML2-6: Update model's provenance: validating a model
• ML2-7: Update model's provenance: deploying a model
21. Sample Data
Model life cycle states: Initial State → Registered → Selected Dataset → Pre-processed Dataset → Engineered Feature Sets → Trained Model → Evaluated Model → Validated Model → Deployed
Sample transaction hashes:
• ML2-1 (Select Dataset): 0x1e440b6842ab9efd56ff995a7cc43d08a1f75ece170226c25aeacfc1946a1c66
• ML2-3 (Feature engineering): 0x0e75eb311c8f4d0a89948a701729e3696d6da33bec5a7e6403543c4d676ea380
• ML2-4 (Training): 0xd816547ccc817d8cd3b28a56a84e8f2bd960ab3c648e6425bee2eade363e2501
• ML2-5 (Evaluation): 0x4a30e2905f6f774d02f80a366566492698dacd47ca2a90ff55bfd56c1f910cbc
• ML2-6 (Validation): 0xfb67bbc4e7391ca8711d6fa9f06a688a774329deef05e732c49749b3d44657fa
23. How long to persist ML Provenance?
[Bar chart: latency in seconds (y-axis, 0 to 200) for operations D1, D2, ML1 and ML2-1 to ML2-7; series L@1, L@6 and L@12]
Data labels recovered from the chart (seconds): 153.26, 148.58, 149.68, 169.67, 167.63, 180.00, 140.00, 126.14, 161.00, 140.75
• 16 seconds to store ML provenance on a blockchain
• 80 seconds to achieve 6 confirmations (Bitcoin’s threshold)
• 150 seconds to achieve 12 confirmations (high degree of confidence)
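A quick back-of-the-envelope check on these three figures: if latency grows roughly linearly with the number of confirmations (an assumption, not a claim from the slides), the reported numbers imply about 12 to 13 seconds per additional confirmation. The helper name below is hypothetical.

```python
# Reported latencies from the slide: seconds until k confirmations.
reported = {1: 16.0, 6: 80.0, 12: 150.0}


def seconds_per_extra_confirmation(k_from: int, k_to: int) -> float:
    """Average extra wait per confirmation between two reported points."""
    return (reported[k_to] - reported[k_from]) / (k_to - k_from)


first_segment = seconds_per_extra_confirmation(1, 6)    # (80 - 16) / 5
second_segment = seconds_per_extra_confirmation(6, 12)  # (150 - 80) / 6
```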
24. How costly is an update?
[Bar chart: gas units (y-axis, 0 to 1,600,000) for operations D1, D2, ML1 and ML2-1 to ML2-7]
Gas per operation: D1 771284, D2 792153, ML1 1378111, ML2-1 66904, ML2-2 413502, ML2-3 413495, ML2-4 595752, ML2-5 186649, ML2-6 232120, ML2-7 95958
• Asset registration is costlier than updating
• USD $160 for registering a new asset
• USD $47 per update (depending on the amount of info)
• USD $0.000163542 per gas unit (April 2022)
• Might not matter for a private blockchain!
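The dollar figures can be reproduced from the gas numbers and the quoted gas price. The sketch below assumes the ten chart labels map to the operations in left-to-right order, and that the "$160" figure averages the three dataset and model operations D1, D2 and ML1 while the "$47" figure averages ML2-1 to ML2-7; that grouping is inferred from the arithmetic, not stated on the slide.

```python
USD_PER_GAS = 0.000163542  # April 2022 price quoted on the slide

# Gas units per operation, read from the chart labels (order assumed).
GAS = {
    "D1": 771284, "D2": 792153, "ML1": 1378111,
    "ML2-1": 66904, "ML2-2": 413502, "ML2-3": 413495, "ML2-4": 595752,
    "ML2-5": 186649, "ML2-6": 232120, "ML2-7": 95958,
}


def average_cost_usd(operations) -> float:
    """Mean cost in USD over the given operation IDs."""
    ops = list(operations)
    return sum(GAS[op] for op in ops) / len(ops) * USD_PER_GAS


asset_ops = ["D1", "D2", "ML1"]
update_ops = [op for op in GAS if op.startswith("ML2-")]

asset_cost = average_cost_usd(asset_ops)    # about 160 USD
update_cost = average_cost_usd(update_ops)  # about 47 USD
```

On a private blockchain the gas price is set by the operators, so the same gas counts need not translate into monetary cost.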
25. Security Threats and Countermeasures
Threats (threat: target)
• T1 – Tampering: at-rest data (datasets, models, provenance)
• T2 – Tampering: in-transit data (datasets, models, provenance)
• T3 – Spoofing: provenance records, verification results
• T4 – Repudiation: provenance records
• T5 – DoS: data stores
• T6 – DoS: provenance capturing process
• T7 – DoS: provenance retrieval process
Countermeasures
• Anchor off-chain artefacts to on-chain records
• Embed provenance updates in blockchain transactions
• Store provenance on blockchain
• Store assets on P2P content network
• Embed provenance recording process in smart contracts
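Anchoring an off-chain artefact to an on-chain record comes down to storing a digest of the artefact on-chain and recomputing it whenever the artefact is retrieved. A minimal sketch follows, assuming SHA-256 as the digest and using hypothetical helper names; it is not the ProML implementation.

```python
import hashlib
import hmac


def digest(artefact: bytes) -> str:
    """Digest that would be recorded on-chain when the artefact is registered."""
    return hashlib.sha256(artefact).hexdigest()


def verify_artefact(artefact: bytes, anchored_digest: str) -> bool:
    """Detect tampering by comparing the recomputed digest with the anchored one."""
    return hmac.compare_digest(digest(artefact), anchored_digest)


# Made-up dataset contents, for illustration only.
dataset = b"duration,protocol,label\n0,tcp,normal\n"
anchored = digest(dataset)                       # stored in the on-chain record

untouched_ok = verify_artefact(dataset, anchored)
tampered_ok = verify_artefact(dataset.replace(b"normal", b"attack"), anchored)
```

The check only shows that the bytes differ from what was anchored; who changed them is answered by the signed provenance records.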
27. What if a key feature of your software is built by dozens of teams around the world … but you have little idea how it was built?
[Recap of the distributed ML DevOps scenario and its threats from slide 4, with the Client / Operator role shown as Model Operator]
28. ProML: Decentralised Provenance Management
Problem: Security (authenticity, integrity, non-repudiation)
• Artefact-as-a-State-Machine
• Embed provenance update into blockchain transactions
• Embed workflow in blockchain smart contracts
Problem: Resilience (availability and fault tolerance)
• Replicate assets and provenance information with blockchain and IPFS
Problem: Decentralization (disintermediation, user-driven, control information flow)
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
Design principles
P1: If you use provenance … you control it … you manage and store it
P2: Use your existing tools; keep info flow within your organisation
P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience