A Decentralised Platform for Provenance Management of Machine Learning Software Systems
Nguyen Khoi Tran, Bushra Sabir, M. Ali Babar, Nini Cui; CREST – The University of Adelaide, Australia
Mehran Abolhasan and Justin Lipman; University of Technology Sydney, Australia
16th European Conference on Software Architecture (ECSA)
Mon 19 - Fri 23 September 2022, Prague, Czech Republic
What if a key feature of your software is built
by dozens of teams around the world …
… but you have little idea how it was built?
1 Bernardi, L., Mavridis, T., Estevez, P.: 150 successful machine learning models: 6 lessons learned at Booking.com. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1743–1751 (2019)
2 Nahar, N., Zhou, S., Lewis, G., Kästner, C.: Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. Organization 1(2), 3 (2022)
Scenario: Distributed ML DevOps
Roles: Auditor, Model Verifier, Model Developer, Client / Operator, Dataset Admin
1. I have an idea for an ML application
2. I hire a company to collect data for me
3. I gather the data
4. I outsource labeling to Amazon Mechanical Turk
5. I pass the data to the appointed developers
6. I develop the ML model
7. I test and verify the model
8. I return the model to client
9. I seek third-party validation
Scenario: Distributed ML DevOps goes wrong
• Deliberate mislabeling (poisoning)
• Dataset tampering
• Vulnerability in ML frameworks
• Model swapping
• Cover-up
• Not enough information
How to capture and preserve the
records of “who did what” to ML assets
(a.k.a., workflow provenance information)
in a distributed ML workflow environment?
Existing Approach: A Centralised Platform
Roles: Auditor, Model Verifier, Model Developer, Client / Operator, Dataset Admin
Provenance Database:
• Dataset Provenance (e.g., data sheet)
• Auditing results
• Asset Provenance
• Model development records
• Model testing records (e.g., model card)
Problem: Security
Authenticity: how to know that the information
is real and comes from an authorized entity?
Integrity: how to prevent and detect the
tampering of provenance records?
Non-repudiation: how to prevent a
party from falsely denying its records?
Problem: Resilience
Info Availability: how to ensure
that the provenance records
are always accessible?
Fault tolerance: how to avoid
a single point of failure?
Process Availability: how to
ensure that recording of new
provenance info is available?
Problem: Decentralisation and Trust
Disintermediation: Who
would manage and operate
this centralized database?
Information-flow: How to
avoid sending sensitive
information to a third party?
User-driven: How to enable users to
control what and how to submit
provenance information?
Problem Summary
Security: Authenticity, Integrity, Non-repudiation
Resilience: Availability and fault tolerance
Decentralization: Disintermediation, User-driven, Control information flow
ProML Platform
Blockchain-aided Decentralisation and Security
Design Principles
P1: If you use provenance … you control it … you manage and store it
P2: Use your existing tools. Keep info flow within your organisation.
P3: Embed provenance records in blockchain for security. Embed provenance update process in smart contracts for resilience.
Decentralized Software Platform
[Figure: five participants (Auditor, Model Verifier, Model Developer, Model Operator, Dataset Admin), each connected to a ProML Node; the nodes exchange Provenance Update Broadcasts.]
ProML Node components:
• User Interface: CLI Client, Capturing Library, Query Interface
• Service: Provenance Capturing, Provenance Querying
• Provider Clients: IPFS Client, Blockchain Client
• Blockchain Wallet: Signer
External providers:
• Storage Provider: Content Distribution Network (Dataset, Model)
• Blockchain Provider: Blockchain (Provenance Update Process, Dataset Provenance, Model Provenance)
P2: Use your existing tools. Keep info flow within your organisation.
User-driven Provenance Capturing
Actors: Model Developer, Storage Provider, Signer, Blockchain Provider
1. Develop the ML training script / notebook (Model Developer)
2. Embed calls to the logging API
3. Send pm_i
4a. Submit payload (to the Storage Provider)
4b. Return CID
5. Craft tx_pm_i
6. Sign tx_pm_i (Signer)
7. Submit tx_pm_i (to the Blockchain Provider)
8. Validate and insert tx_pm_i
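Steps 3 to 8 of the capturing flow can be sketched in a few lines of Python. This is only an illustration under loud assumptions: an in-memory dict stands in for the storage provider, a SHA-256 digest stands in for an IPFS CID, a list stands in for the blockchain, and HMAC-SHA256 stands in for the wallet's signature (a real deployment would use an asymmetric signature, since a shared secret gives no non-repudiation). All function names are hypothetical and are not ProML's API.

```python
import hashlib
import hmac
import json


def store_payload(storage: dict, payload: dict) -> str:
    """Steps 4a/4b: store the payload, return a content identifier.

    Assumption: a SHA-256 hex digest stands in for a real IPFS CID.
    """
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    cid = hashlib.sha256(blob).hexdigest()
    storage[cid] = blob
    return cid


def craft_tx(asset_id: str, operation: str, cid: str, sender: str) -> dict:
    """Step 5: build the provenance update transaction (tx_pm_i)."""
    return {"to": asset_id, "operation": operation, "cid": cid, "from": sender}


def sign_tx(tx: dict, secret_key: bytes) -> dict:
    """Step 6: sign the transaction (HMAC is a stand-in for a wallet signature)."""
    message = json.dumps(tx, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return {**tx, "signature": signature}


def validate_and_insert(ledger: list, signed_tx: dict, secret_key: bytes) -> bool:
    """Steps 7-8: recompute the signature and append the tx only if it matches."""
    unsigned = {k: v for k, v in signed_tx.items() if k != "signature"}
    expected = sign_tx(unsigned, secret_key)["signature"]
    if not hmac.compare_digest(expected, signed_tx["signature"]):
        return False
    ledger.append(signed_tx)
    return True


storage, ledger = {}, []
key = b"model-developer-demo-key"  # hypothetical key material

# Step 3: the provenance message pm_i produced by a logging call.
pm_i = {"operation": "train", "classifierInfo": {"type": "RandomForest"}}

cid = store_payload(storage, pm_i)
tx = sign_tx(craft_tx("ML1", "train", cid, "model-developer"), key)
accepted = validate_and_insert(ledger, tx, key)

# A transaction altered after signing fails validation and is not inserted.
tampered = {**tx, "cid": "0" * 64}
rejected = not validate_and_insert(ledger, tampered, key)
```

Only the content identifier goes into the transaction; the payload itself stays with the storage provider, which keeps the on-chain record small.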
Exemplary logging API
Function: Parameters
selectData(): datasetID, datasetVersion, datasetMetadata: columnInfo, labelInfo
preprocessData(): processedDataset, datasetMetadata: columnInfo, labelInfo
engineerFeatures(): featureList, featureSelectAlg: algConfigs
train(): classifierInfo: type, library, version, hyperparameters; model
evaluate(): trainingSetRatio, F1, acc, trainingDuration
validate(): F1, acc, recall, precision, Matthew, MSE, Fowlkes
deploy(): model, deploymentInfo
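To make the table concrete, here is a minimal sketch of how such logging calls might look inside a training script. The function and parameter names come from the table; the `ProvenanceLogger` class, its `log` method, and the in-memory `records` list are hypothetical and are not the actual ProML capturing library.

```python
from typing import Any

# Function names and top-level parameters as listed in the table above.
LOGGING_API = {
    "selectData": {"datasetID", "datasetVersion", "datasetMetadata"},
    "preprocessData": {"processedDataset", "datasetMetadata"},
    "engineerFeatures": {"featureList", "featureSelectAlg"},
    "train": {"classifierInfo", "model"},
    "evaluate": {"trainingSetRatio", "F1", "acc", "trainingDuration"},
    "validate": {"F1", "acc", "recall", "precision", "Matthew", "MSE", "Fowlkes"},
    "deploy": {"model", "deploymentInfo"},
}


class ProvenanceLogger:
    """Hypothetical in-memory logger that checks calls against the table."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self.records: list[dict[str, Any]] = []

    def log(self, function: str, **params: Any) -> dict[str, Any]:
        expected = LOGGING_API[function]  # KeyError for an unknown function
        missing = expected - set(params)
        unexpected = set(params) - expected
        if missing or unexpected:
            raise ValueError(
                f"{function}: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
            )
        record = {"model": self.model_id, "function": function, "params": params}
        self.records.append(record)
        return record


# Usage inside a (hypothetical) training script.
logger = ProvenanceLogger("ML1")
logger.log(
    "selectData",
    datasetID="KDD99",
    datasetVersion="v1",
    datasetMetadata={"columnInfo": ["duration", "protocol_type"], "labelInfo": ["label"]},
)
logger.log(
    "train",
    classifierInfo={"type": "RandomForest", "library": "scikit-learn",
                    "version": "1.0", "hyperparameters": {"n_estimators": 100}},
    model="model.pkl",
)
```

Each record would then become the provenance message that the node stores and anchors on the blockchain.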
Where we are …
Problem: Solution
Security (Authenticity, Integrity, Non-repudiation): How?
Resilience (Availability and fault tolerance):
• Replicate assets and provenance information with blockchain and IPFS
Decentralization (Disintermediation, User-driven, Control information flow):
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
Leaning on the Blockchain’s Replicated State Machine Model
P3: Embed provenance records in blockchain for security. Embed provenance update process in smart contracts for resilience.
[Figure: Replicated State Machines. A blockchain transaction moves smart contract SC1 from BC State i to BC State i+1.]
Smart Contract SC1's Logic
BC State i: Smart Contract SC1's State: Var a = 1
Blockchain Transaction:
• To: SC1
• Instruction: Use SC1's logic to update Var a
• Parameter: new a = 2
• Signed by sender
BC State i+1: Smart Contract SC1's State: Var a = 2
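The replicated state machine idea can be illustrated with a small Python sketch. Every replica starts from the same state (Var a = 1) and applies the same transaction with the same deterministic logic, so all replicas end in the same state (Var a = 2). The `Replica` and `Transaction` classes are hypothetical; consensus, ordering, and signature checking are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    to: str        # target smart contract, e.g. "SC1"
    new_a: int     # parameter: the new value of Var a
    sender: str    # the signature check on the sender is omitted in this sketch


class Replica:
    """One node's copy of the blockchain state."""

    def __init__(self) -> None:
        self.state = {"SC1": {"a": 1}}  # BC State i
        self.height = 0

    def apply(self, tx: Transaction) -> None:
        # SC1's logic: deterministically update Var a.
        if tx.to not in self.state:
            raise KeyError(f"unknown contract {tx.to!r}")
        self.state[tx.to]["a"] = tx.new_a
        self.height += 1  # BC State i+1


replicas = [Replica() for _ in range(3)]
tx = Transaction(to="SC1", new_a=2, sender="alice")

# Every replica applies the same transaction in the same order.
for replica in replicas:
    replica.apply(tx)

states = [replica.state for replica in replicas]
```

Because the update logic is deterministic, any single replica can serve the current state, which is where the resilience argument comes from.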
Artefact-as-a-State-Machine Approach
Mapping ML assets → Smart Contracts
Asset’s Life cycle events → State Transitions → Smart Contract Logic
Provenance updates → Blockchain Transactions
P3: Embed provenance records in blockchain for security. Embed provenance update process in smart contracts for resilience.
[Figure: Replicated State Machines. A provenance update moves MLAsset ML1 from BC State i to BC State i+1.]
MLAsset ML1's Logic
BC State i: MLAsset ML1's State: Var model_Info = Null
Provenance Update:
• To: MLAsset ML1, From: Model Developer
• Instruction: Declare that model has been trained
• Parameter: model_info = IPFS hash
• Signed by model developer
BC State i+1: MLAsset ML1's State: Var model_Info = IPFS_Hash
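A rough Python sketch of the artefact-as-a-state-machine mapping follows. It assumes a strictly linear life cycle, uses the state names from the Sample Data slide later in the deck, and uses hypothetical operation names that mirror the logging API; the actual contract logic is not shown in the slides and may differ.

```python
from enum import Enum


class State(Enum):
    INITIAL = "Initial State"
    REGISTERED = "Registered"
    SELECTED_DATASET = "Selected Dataset"
    PREPROCESSED_DATASET = "Pre-processed Dataset"
    ENGINEERED_FEATURES = "Engineered Feature Sets"
    TRAINED = "Trained Model"
    EVALUATED = "Evaluated Model"
    VALIDATED = "Validated Model"
    DEPLOYED = "Deployed"


# Assumption: strictly linear life cycle. operation -> (required state, next state)
TRANSITIONS = {
    "register": (State.INITIAL, State.REGISTERED),
    "selectData": (State.REGISTERED, State.SELECTED_DATASET),
    "preprocessData": (State.SELECTED_DATASET, State.PREPROCESSED_DATASET),
    "engineerFeatures": (State.PREPROCESSED_DATASET, State.ENGINEERED_FEATURES),
    "train": (State.ENGINEERED_FEATURES, State.TRAINED),
    "evaluate": (State.TRAINED, State.EVALUATED),
    "validate": (State.EVALUATED, State.VALIDATED),
    "deploy": (State.VALIDATED, State.DEPLOYED),
}


class MLAsset:
    """Stand-in for the smart contract that represents one ML asset."""

    def __init__(self, asset_id: str) -> None:
        self.asset_id = asset_id
        self.state = State.INITIAL
        self.model_info = None  # "Var model_Info = Null" on the slide
        self.history = []       # accepted provenance updates, in order

    def update(self, operation: str, sender: str, ipfs_hash: str) -> None:
        required, target = TRANSITIONS[operation]
        if self.state is not required:
            raise ValueError(f"{operation!r} not allowed in state {self.state.value!r}")
        if operation == "train":
            self.model_info = ipfs_hash  # "Var model_Info = IPFS_Hash"
        self.state = target
        self.history.append((operation, sender, ipfs_hash))


ml1 = MLAsset("ML1")
for op in ["register", "selectData", "preprocessData", "engineerFeatures", "train"]:
    ml1.update(op, sender="model-developer", ipfs_hash=f"hash-of-{op}")
```

An out-of-order update, such as calling `deploy` before validation, is rejected by the transition table, which is how the workflow itself gets enforced by the contract logic.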
Where we are …
Problem: Solution
Security (Authenticity, Integrity, Non-repudiation):
• Artefact-as-a-State-Machine
• Embed provenance update into blockchain transactions
• Embed workflow in blockchain smart contracts
Resilience (Availability and fault tolerance):
• Replicate assets and provenance information with blockchain and IPFS
Decentralization (Disintermediation, User-driven, Control information flow):
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
Evaluation
Performance, Cost, Security in a real-world deployment
Methodology: Case Study
Case
• Adversarial ML research project
• Repeatedly training an ML-based intrusion detection model on various forms of poisoned KDD99 dataset
• Would benefit from provenance records
Study
• Adapt ProML to the workflow
• Deploy ProML node
• Deploy ProML smart contracts on Ropsten
• Embed ProML in training scripts
• Extract statistics from blockchain transactions
ID: Operation
D1: Register a dataset
D2: Update a dataset
ML1: Register an ML model
ML2-1: Update model's provenance: selecting a dataset
ML2-2: Update model's provenance: preprocessing a dataset
ML2-3: Update model's provenance: feature engineering
ML2-4: Update model's provenance: training the model
ML2-5: Update model's provenance: evaluating a model
ML2-6: Update model's provenance: validating a model
ML2-7: Update model's provenance: deploying a model
Sample Data
Asset states: Initial State → Registered → Selected Dataset → Pre-processed Dataset → Engineered Feature Sets → Trained Model → Evaluated Model → Validated Model → Deployed
ML2-4: Training:
0xd816547ccc817d8cd3b28a56a84e8f2bd960ab3c648e6425bee2eade363e2501
ML2-3: Feature engineering:
0x0e75eb311c8f4d0a89948a701729e3696d6da33bec5a7e6403543c4d676ea380
ML2-6: Validation:
0xfb67bbc4e7391ca8711d6fa9f06a688a774329deef05e732c49749b3d44657fa
ML2-5: Evaluation:
0x4a30e2905f6f774d02f80a366566492698dacd47ca2a90ff55bfd56c1f910cbc
ML2-1: Select Dataset:
0x1e440b6842ab9efd56ff995a7cc43d08a1f75ece170226c25aeacfc1946a1c66
Sample Transaction
(ML2-4: Training)
How long to persist ML Provenance?
[Figure: bar chart of latency in seconds for operations D1, D2, ML1, ML2-1 to ML2-7, with series L@1, L@6 and L@12. Data labels, in the order extracted: 153.26, 148.58, 149.68, 169.67, 167.63, 180.00, 140.00, 126.14, 161.00, 140.75.]
• 16 seconds to store ML provenance on a blockchain
• 80 seconds to achieve 6 confirmations (Bitcoin's threshold)
• 150 seconds to achieve 12 confirmations (high degree of confidence)
How costly is an update?
[Figure: bar chart of gas units for operations D1, D2, ML1, ML2-1 to ML2-7. Data labels, in the order extracted: 771284, 792153, 1378111, 66904, 413502, 413495, 595752, 186649, 232120, 95958.]
• Asset registration is costlier than updating
• USD $160 for registering a new asset
• USD $47 per update (depending on amount of info)
• USD $0.000163542 per gas unit (April 2022)
• Might not matter for private blockchain!
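The dollar figures can be cross-checked against the gas numbers with a short calculation. The sketch below assumes that the chart's data labels appear in the same order as the operations on the x-axis, and that the "registering" figure averages D1, D2 and ML1 while the "per update" figure averages ML2-1 to ML2-7. Neither grouping is stated on the slide; they are simply the reading under which the averages land near the quoted values.

```python
USD_PER_GAS = 0.000163542  # rate quoted on the slide (April 2022)

# Assumption: data labels are in x-axis order (D1 ... ML2-7).
GAS = {
    "D1": 771284, "D2": 792153, "ML1": 1378111,
    "ML2-1": 66904, "ML2-2": 413502, "ML2-3": 413495, "ML2-4": 595752,
    "ML2-5": 186649, "ML2-6": 232120, "ML2-7": 95958,
}


def usd(gas_units: int) -> float:
    """Convert gas units to USD at the quoted rate."""
    return gas_units * USD_PER_GAS


def mean_usd(operations: list[str]) -> float:
    """Average USD cost over a group of operations."""
    return sum(usd(GAS[op]) for op in operations) / len(operations)


# Assumed grouping behind the two headline figures.
registration_cost = mean_usd(["D1", "D2", "ML1"])
update_cost = mean_usd([op for op in GAS if op.startswith("ML2-")])

print(f"registration group: ~USD {registration_cost:.2f}")
print(f"model update group: ~USD {update_cost:.2f}")
```

Under these assumptions the two averages come out at roughly USD 160 and USD 47, in line with the figures on the slide.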
Security Threats and Countermeasures
Threat: Target
T1 – Tampering: At-rest data (datasets, models, provenance)
T2 – Tampering: In-transit data (datasets, models, provenance)
T3 – Spoofing: Provenance records, verification results
T4 – Repudiation: Provenance records
T5 – DoS: Data stores
T6 – DoS: Provenance capturing process
T7 – DoS: Provenance retrieval process
Countermeasures:
• Embed provenance recording process in smart contracts
• Store provenance on blockchain
• Store assets on P2P content network
• Anchor off-chain artefacts to on-chain records
• Embed provenance updates in blockchain transactions
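The "anchor off-chain artefacts to on-chain records" countermeasure comes down to comparing a freshly computed digest of the artefact with the digest recorded on-chain. The sketch below uses SHA-256 from Python's standard library as a stand-in for the content hash; the function names are hypothetical, and fetching the on-chain value is left out.

```python
import hashlib


def digest(artefact: bytes) -> str:
    """Content digest of an off-chain artefact (stand-in for an IPFS hash)."""
    return hashlib.sha256(artefact).hexdigest()


def verify_artefact(artefact: bytes, on_chain_digest: str) -> bool:
    """True only if the artefact still matches the digest anchored on-chain."""
    return digest(artefact) == on_chain_digest


# At training time: the developer anchors the model's digest on-chain.
model_bytes = b"serialized model weights"
anchored = digest(model_bytes)

# Later: anyone holding the artefact can check it against the anchor.
ok = verify_artefact(model_bytes, anchored)
swapped = verify_artefact(b"a different model", anchored)
```

A swapped or tampered model produces a different digest, so the check fails even though the artefact itself never touched the blockchain.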
Summary and Future Research
ProML: Decentralised Provenance Management
Problem: Solution
Security (Authenticity, Integrity, Non-repudiation):
• Artefact-as-a-State-Machine
• Embed provenance update into blockchain transactions
• Embed workflow in blockchain smart contracts
Resilience (Availability and fault tolerance):
• Replicate assets and provenance information with blockchain and IPFS
Decentralization (Disintermediation, User-driven, Control information flow):
• Decentralized platform architecture
• User-driven provenance extraction
• Limit information flow within a trust domain
CRICOS 00123M

More Related Content

Similar to A Decentralised Platform for Provenance Management of Machine Learning Software Systems

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Containerized DBs in a Machine Data environment with Crate.io
Containerized DBs in a Machine Data environment with Crate.ioContainerized DBs in a Machine Data environment with Crate.io
Containerized DBs in a Machine Data environment with Crate.ioClaus Matzinger
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Automating your AWS Security Operations
Automating your AWS Security OperationsAutomating your AWS Security Operations
Automating your AWS Security OperationsAmazon Web Services
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon
 
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...confluent
 
FSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsFSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsAmazon Web Services
 
Automating your AWS Security Operations
Automating your AWS Security OperationsAutomating your AWS Security Operations
Automating your AWS Security OperationsEvident.io
 
Perth Meetup August 2021
Perth Meetup August 2021Perth Meetup August 2021
Perth Meetup August 2021Michael Price
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Databricks
 
Digital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfDigital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfssuserd23711
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04nihshowandtell
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuantUniversity
 
What are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdfWhat are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdftsaaroacademy
 

Similar to A Decentralised Platform for Provenance Management of Machine Learning Software Systems (20)

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Containerized DBs in a Machine Data environment with Crate.io
Containerized DBs in a Machine Data environment with Crate.ioContainerized DBs in a Machine Data environment with Crate.io
Containerized DBs in a Machine Data environment with Crate.io
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Automating your AWS Security Operations
Automating your AWS Security OperationsAutomating your AWS Security Operations
Automating your AWS Security Operations
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at Scale
 
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...
Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail...
 
FSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsFSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital Markets
 
Automating your AWS Security Operations
Automating your AWS Security OperationsAutomating your AWS Security Operations
Automating your AWS Security Operations
 
Perth Meetup August 2021
Perth Meetup August 2021Perth Meetup August 2021
Perth Meetup August 2021
 
Emmert_Resume
Emmert_ResumeEmmert_Resume
Emmert_Resume
 
IoT meets Big Data
IoT meets Big DataIoT meets Big Data
IoT meets Big Data
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
 
Digital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfDigital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdf
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
What are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdfWhat are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdf
 
E2matrix
E2matrixE2matrix
E2matrix
 

More from CREST @ University of Adelaide

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...CREST @ University of Adelaide
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsCREST @ University of Adelaide
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...CREST @ University of Adelaide
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingCREST @ University of Adelaide
 
Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...CREST @ University of Adelaide
 
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...CREST @ University of Adelaide
 
Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...CREST @ University of Adelaide
 
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...CREST @ University of Adelaide
 
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...CREST @ University of Adelaide
 
Detecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewDetecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewCREST @ University of Adelaide
 
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...CREST @ University of Adelaide
 
Energy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingEnergy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingCREST @ University of Adelaide
 

More from CREST @ University of Adelaide (20)

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
 
Making Software and Software Engineering visible
Making Software and Software Engineering visibleMaking Software and Software Engineering visible
Making Software and Software Engineering visible
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
 
Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...
 
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
 
Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...
 
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
 
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
 
Detecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewDetecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic Review
 
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
 
Data Quality for Software Vulnerability Dataset
Data Quality for Software Vulnerability DatasetData Quality for Software Vulnerability Dataset
Data Quality for Software Vulnerability Dataset
 
Mod2Dash Presentation
Mod2Dash PresentationMod2Dash Presentation
Mod2Dash Presentation
 
Run-time Patching and updating Impact Estimation
Run-time Patching and updating Impact EstimationRun-time Patching and updating Impact Estimation
Run-time Patching and updating Impact Estimation
 
ECSA 2023 Ubuntu Case Study
ECSA 2023 Ubuntu Case StudyECSA 2023 Ubuntu Case Study
ECSA 2023 Ubuntu Case Study
 
Energy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingEnergy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data Processing
 
Designing Quality-Driven Blockchain Networks
Designing Quality-Driven Blockchain NetworksDesigning Quality-Driven Blockchain Networks
Designing Quality-Driven Blockchain Networks
 
Privacy Engineering in the Wild
Privacy Engineering in the WildPrivacy Engineering in the Wild
Privacy Engineering in the Wild
 
Security Data Quality Challenges
Security Data Quality ChallengesSecurity Data Quality Challenges
Security Data Quality Challenges
 

A Decentralised Platform for Provenance Management of Machine Learning Software Systems

  • 1. A Decentralised Platform for Provenance Management of Machine Learning Software Systems
      • Nguyen Khoi Tran, Bushra Sabir, M. Ali Babar, Nini Cui; CREST – The University of Adelaide, Australia
      • Mehran Abolhasan and Justin Lipman; University of Technology Sydney, Australia
      • 16th European Conference on Software Architecture (ECSA), Mon 19 - Fri 23 September 2022, Prague, Czech Republic
  • 2. What if a key feature of your software is built by dozens of teams around the world … but you have little idea how it was built?
      • [1] Bernardi, L., Mavridis, T., Estevez, P.: 150 successful machine learning models: 6 lessons learned at Booking.com. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1743–1751 (2019)
      • [2] Nahar, N., Zhou, S., Lewis, G., Kästner, C.: Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. Organization 1(2), 3 (2022)
  • 3. Scenario: Distributed ML DevOps
      • Actors: Auditor, Model Verifier, Model Developer, Client / Operator, Dataset Admin
      • 1. I have an idea for an ML application
      • 2. I hire a company to collect data for me
      • 3. I gather the data
      • 4. I outsource labeling to Amazon Mechanical Turk
      • 5. I pass the data to the appointed developers
      • 6. I develop the ML model
      • 7. I test and verify the model
      • 8. I return the model to the client
      • 9. I seek third-party validation
  • 4. Scenario: Distributed ML DevOps goes wrong
      • Same actors and the same nine-step workflow as the previous slide, from "I have an idea for an ML application" to "I seek third-party validation"
      • What can go wrong: deliberate mislabeling (poisoning); dataset tampering; vulnerability in ML frameworks; model swapping; cover-up; not enough information
  • 5. How to capture and preserve the records of “who did what” to ML assets (a.k.a., workflow provenance information) in a distributed ML workflow environment?
  • 6. Existing Approach: A Centralised Platform
      • Actors: Auditor, Model Verifier, Model Developer, Client / Operator, Dataset Admin
      • A single Provenance Database holds: dataset provenance (e.g., data sheet); auditing results; asset provenance; model development records; model testing records (e.g., model card)
  • 7. Problem: Security (of the centralised Provenance Database)
      • Authenticity: how to know that the information is real and comes from an authorized entity?
      • Integrity: how to prevent and detect the tampering of provenance records?
      • Non-repudiation: how to prevent a party from falsely denying its records?
  • 8. Problem: Resilience (of the centralised Provenance Database)
      • Info availability: how to ensure that the provenance records are always accessible?
      • Fault tolerance: how to avoid a single point of failure?
      • Process availability: how to ensure that recording of new provenance info is available?
  • 9. Problem: Decentralisation and Trust (in the centralised Provenance Database)
      • Disintermediation: who would manage and operate this centralized database?
      • Information flow: how to avoid sending sensitive information to a third party?
      • User-driven: how to enable users to control what provenance information to submit and how?
  • 10. Problem Summary
      • Security: authenticity; integrity; non-repudiation
      • Resilience: availability and fault tolerance
      • Decentralization: disintermediation; user-driven; control information flow
  • 12. Design Principles
      • P1: If you use provenance, you control it, and you manage and store it
      • P2: Use your existing tools; keep info flow within your organisation
      • P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience
  • 13. Decentralized Software Platform
      • Each actor (Auditor, Model Verifier, Model Developer, Model Operator, Dataset Admin) runs its own ProML Node; nodes exchange provenance update broadcasts
      • ProML Node, user interface: CLI client; capturing library; query interface
      • ProML Node, service: provenance capturing; provenance querying
      • ProML Node, provider clients: IPFS client; blockchain client; blockchain wallet
      • External providers: storage provider; signer; blockchain provider
      • Content distribution network: dataset; model
      • Blockchain: provenance update process; dataset provenance; model provenance
      • Annotated with design principles P1, P2 and P3
  • 14. User-driven Provenance Capturing (P2: use your existing tools; keep info flow within your organisation)
      • ProML Node components involved: user interface (capturing library); service (provenance capturing); provider clients (IPFS client, blockchain client, blockchain wallet); content distribution network (dataset, model); blockchain (provenance update process, dataset provenance, model provenance)
      • Diagram elements: Model Developer; ML training script / notebook; calls to logging API; Storage Provider; Signer; Blockchain Provider
      • Steps: 1. Develop; 2. Embed; 3. Send pm_i; 4a. Submit payload; 4b. Return CID; 5. Craft tx_pm_i; 6. Sign tx_pm_i; 7. Submit tx_pm_i; 8. Validate and insert tx_pm_i
      • Exemplary logging API (function: parameters)
          • selectData(): datasetID, datasetVersion, datasetMetadata: columnInfo, labelInfo
          • preprocessData(): processedDataset, datasetMetadata: columnInfo, labelInfo
          • engineerFeatures(): featureList, featureSelectAlg: algConfigs
          • train(): classifierInfo: type, library, version, hyperparameters; model
          • evaluate(): trainingSetRatio, F1, acc, trainingDuration
          • validate(): F1, acc, recall, precision, Matthew, MSE, Fowlkes
          • deploy(): model, deploymentInfo
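The capturing steps on slide 14 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ProML implementation: `InMemoryStorage`, `ProvenanceLogger` and the snake_case method names are hypothetical, a SHA-256 digest stands in for an IPFS CID, an HMAC stands in for a blockchain wallet signature, and a local list stands in for the blockchain provider.

```python
import hashlib
import hmac
import json


class InMemoryStorage:
    """Hypothetical stand-in for the storage provider (IPFS in the slides)."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        # A SHA-256 hex digest plays the role of the CID here;
        # real IPFS CIDs use a different encoding.
        cid = hashlib.sha256(payload).hexdigest()
        self._blobs[cid] = payload
        return cid

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]


class ProvenanceLogger:
    """Hypothetical capturing library called from a training script."""

    def __init__(self, storage, sender: str, signing_key: bytes):
        self.storage = storage
        self.sender = sender
        self._key = signing_key
        self.submitted = []  # stands in for the blockchain provider

    def _capture(self, asset_id: str, operation: str, **params) -> dict:
        # Step 3: serialise the provenance message pm_i.
        payload = json.dumps({"operation": operation, "params": params},
                             sort_keys=True).encode()
        # Steps 4a/4b: store the payload off-chain, receive its identifier.
        cid = self.storage.put(payload)
        # Step 5: craft a transaction that carries only the identifier.
        tx = {"to": asset_id, "from": self.sender,
              "operation": operation, "cid": cid}
        # Step 6: sign. HMAC is only a placeholder for a wallet signature.
        body = json.dumps(tx, sort_keys=True).encode()
        tx["signature"] = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        # Step 7: submit (here: append to a local list).
        self.submitted.append(tx)
        return tx

    def select_data(self, asset_id, dataset_id, dataset_version):
        return self._capture(asset_id, "selectData",
                             datasetID=dataset_id,
                             datasetVersion=dataset_version)

    def train(self, asset_id, classifier_info):
        return self._capture(asset_id, "train", classifierInfo=classifier_info)


# Calls a training script might embed (step 2).
storage = InMemoryStorage()
log = ProvenanceLogger(storage, sender="model-developer",
                       signing_key=b"demo-key")
log.select_data("ML1", dataset_id="kdd99", dataset_version="v1")
tx = log.train("ML1", {"type": "RandomForest", "library": "scikit-learn"})
print(tx["operation"], tx["cid"][:12])
```

Keeping the payload off-chain and putting only its identifier in the transaction is one way to read the "submit payload / return CID" steps; step 8 (validation and insertion by the blockchain) is outside this sketch.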
  • 15. Where we are … (problem: solution)
      • Security (authenticity, integrity, non-repudiation): not yet addressed
      • Resilience (availability and fault tolerance): replicate assets and provenance information with blockchain and IPFS
      • Decentralization (disintermediation, user-driven, control information flow): decentralized platform architecture; user-driven provenance extraction; limit information flow within a trust domain
  • 16. How? Leaning on the Blockchain’s Replicated State Machine Model (P3: embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience)
      • Replicated State Machines: Smart Contract SC1’s Logic
      • BC State i: Smart Contract SC1’s State, Var a = 1
      • Blockchain Transaction: To: SC1; Instruction: use SC1’s logic to update Var a; Parameter: new a = 2; signed by sender
      • BC State i+1: Smart Contract SC1’s State, Var a = 2
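The replicated state machine idea on slide 16 can be illustrated with a toy example. It is a sketch under simplifying assumptions: `Replica`, `Transaction` and the `set_a` instruction are hypothetical names, there is no consensus protocol, and the sender's signature is not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    to: str           # target contract
    instruction: str  # which piece of contract logic to run
    parameter: int    # new value for Var a
    sender: str       # signature checking is omitted in this sketch


class Replica:
    """One node holding its own copy of contract SC1's state."""

    def __init__(self):
        self.state = {"SC1": {"a": 1}}  # BC State i

    def apply(self, tx: Transaction) -> None:
        # Every replica runs the same deterministic logic on the same input.
        if tx.to != "SC1" or tx.instruction != "set_a":
            raise ValueError("unknown contract or instruction")
        self.state[tx.to]["a"] = tx.parameter  # BC State i+1


replicas = [Replica() for _ in range(3)]
tx = Transaction(to="SC1", instruction="set_a", parameter=2, sender="alice")
for replica in replicas:
    replica.apply(tx)
print([r.state["SC1"]["a"] for r in replicas])  # prints [2, 2, 2]
```

Because each replica starts from the same state and applies the same transaction with the same logic, all copies end in the same state, which is the property the platform relies on.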
  • 17. Artefact-as-a-State-Machine Approach (P3: embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience)
      • Mapping: ML assets → smart contracts; asset’s life cycle events → state transitions → smart contract logic; provenance updates → blockchain transactions
      • Replicated State Machines: MLAsset ML1’s Logic
      • BC State i: MLAsset ML1’s State, Var model_Info = Null
      • Provenance Update: To: MLAsset ML1; From: Model Developer; Instruction: declare that model has been trained; Parameter: model_info = IPFS hash; signed by model developer
      • BC State i+1: MLAsset ML1’s State, Var model_Info = IPFS_Hash
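A minimal Python sketch of the artefact-as-a-state-machine mapping is given below. The state names follow the life cycle listed in the talk's sample data (Initial State through Deployed); the `MLAsset` class, the strictly linear ordering of transitions and the single-owner permission check are assumptions made for illustration, and an actual smart contract may be more permissive.

```python
from enum import Enum


class State(Enum):
    INITIAL = "Initial State"
    REGISTERED = "Registered"
    SELECTED_DATASET = "Selected Dataset"
    PREPROCESSED_DATASET = "Pre-processed Dataset"
    ENGINEERED_FEATURES = "Engineered Feature Sets"
    TRAINED = "Trained Model"
    EVALUATED = "Evaluated Model"
    VALIDATED = "Validated Model"
    DEPLOYED = "Deployed"


class InvalidTransition(Exception):
    """Raised when an update does not fit the asset's current state."""


class MLAsset:
    # operation -> (state required before, state reached after)
    TRANSITIONS = {
        "register": (State.INITIAL, State.REGISTERED),
        "selectData": (State.REGISTERED, State.SELECTED_DATASET),
        "preprocessData": (State.SELECTED_DATASET, State.PREPROCESSED_DATASET),
        "engineerFeatures": (State.PREPROCESSED_DATASET, State.ENGINEERED_FEATURES),
        "train": (State.ENGINEERED_FEATURES, State.TRAINED),
        "evaluate": (State.TRAINED, State.EVALUATED),
        "validate": (State.EVALUATED, State.VALIDATED),
        "deploy": (State.VALIDATED, State.DEPLOYED),
    }

    def __init__(self, owner: str):
        self.owner = owner
        self.state = State.INITIAL
        self.records = []  # accepted provenance updates, in order

    def update(self, sender: str, operation: str, ipfs_hash: str) -> None:
        # Authenticity stand-in: only the registered owner may update.
        if sender != self.owner:
            raise PermissionError(f"{sender} may not update this asset")
        if operation not in self.TRANSITIONS:
            raise InvalidTransition(f"unknown operation {operation}")
        required, reached = self.TRANSITIONS[operation]
        if self.state is not required:
            raise InvalidTransition(
                f"{operation} needs state {required.value}, "
                f"asset is in {self.state.value}")
        self.state = reached
        self.records.append((operation, sender, ipfs_hash))


asset = MLAsset(owner="model-developer")
for op in ["register", "selectData", "preprocessData",
           "engineerFeatures", "train"]:
    asset.update("model-developer", op, ipfs_hash=f"hash-of-{op}")
print(asset.state.value)  # prints Trained Model
```

Rejecting out-of-order or unauthorised updates inside the asset's own logic is what lets the update process itself, and not only the stored records, be replicated and checked by every node.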
  • 18. Where we are … (problem: solution)
      • Security (authenticity, integrity, non-repudiation): Artefact-as-a-State-Machine; embed provenance update into blockchain transactions; embed workflow in blockchain smart contracts
      • Resilience (availability and fault tolerance): replicate assets and provenance information with blockchain and IPFS
      • Decentralization (disintermediation, user-driven, control information flow): decentralized platform architecture; user-driven provenance extraction; limit information flow within a trust domain
  • 19. Evaluation: performance, cost and security in a real-world deployment
  • 20. Methodology: Case Study
      • Case: an adversarial ML research project that repeatedly trains an ML-based intrusion detection model on various forms of poisoned KDD99 dataset, and would benefit from provenance records
      • Study: adapt ProML to the workflow; deploy ProML node; deploy ProML smart contracts on Ropsten; embed ProML in training scripts; extract statistics from blockchain transactions
      • Operations (ID: operation)
          • D1: Register a dataset
          • D2: Update a dataset
          • ML1: Register an ML model
          • ML2-1: Update model's provenance: selecting a dataset
          • ML2-2: Update model's provenance: preprocessing a dataset
          • ML2-3: Update model's provenance: feature engineering
          • ML2-4: Update model's provenance: training the model
          • ML2-5: Update model's provenance: evaluating a model
          • ML2-6: Update model's provenance: validating a model
          • ML2-7: Update model's provenance: deploying a model
  • 21. Sample Data
      • Life cycle states: Initial State; Registered; Selected Dataset; Pre-processed Dataset; Engineered Feature Sets; Trained Model; Evaluated Model; Validated Model; Deployed
      • ML2-1: Select Dataset: 0x1e440b6842ab9efd56ff995a7cc43d08a1f75ece170226c25aeacfc1946a1c66
      • ML2-3: Feature engineering: 0x0e75eb311c8f4d0a89948a701729e3696d6da33bec5a7e6403543c4d676ea380
      • ML2-4: Training: 0xd816547ccc817d8cd3b28a56a84e8f2bd960ab3c648e6425bee2eade363e2501
      • ML2-5: Evaluation: 0x4a30e2905f6f774d02f80a366566492698dacd47ca2a90ff55bfd56c1f910cbc
      • ML2-6: Validation: 0xfb67bbc4e7391ca8711d6fa9f06a688a774329deef05e732c49749b3d44657fa
  • 23. How long to persist ML Provenance?
      • [Bar chart: latency in seconds per operation (D1, D2, ML1, ML2-1 to ML2-7) for series L@1, L@6 and L@12; data labels shown: 153.26, 148.58, 149.68, 169.67, 167.63, 180.00, 140.00, 126.14, 161.00, 140.75]
      • 16 seconds to store ML provenance on a blockchain
      • 80 seconds to achieve 6 confirmations (Bitcoin’s threshold)
      • 150 seconds to achieve 12 confirmations (high degree of confidence)
  • 24. How costly is an update?
      • [Bar chart: gas units per operation; data labels in axis order: D1 771284; D2 792153; ML1 1378111; ML2-1 66904; ML2-2 413502; ML2-3 413495; ML2-4 595752; ML2-5 186649; ML2-6 232120; ML2-7 95958]
      • Asset registration is costlier than updating
      • USD $160 for registering a new asset
      • USD $47 per update (depending on amount of info)
      • USD $0.000163542 per gas unit (April 2022)
      • Might not matter for private blockchain!
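The dollar figures on slide 24 follow from multiplying gas units by the quoted gas price. The sketch below reproduces the arithmetic; it assumes that the ten gas values on the chart appear in the same order as the operations on its axis (D1, D2, ML1, ML2-1 to ML2-7), which is an inference from the slide and not something it states.

```python
GAS_PRICE_USD = 0.000163542  # per gas unit, April 2022 (from the slide)

# Assumption: chart data labels read in axis order.
gas_used = {
    "D1": 771284, "D2": 792153, "ML1": 1378111,
    "ML2-1": 66904, "ML2-2": 413502, "ML2-3": 413495,
    "ML2-4": 595752, "ML2-5": 186649, "ML2-6": 232120, "ML2-7": 95958,
}


def cost_usd(gas_units: int) -> float:
    """Convert gas units to US dollars at the quoted price."""
    return gas_units * GAS_PRICE_USD


def mean_cost_usd(operations) -> float:
    """Average dollar cost over a group of operations."""
    return sum(cost_usd(gas_used[op]) for op in operations) / len(operations)


dataset_and_registration = mean_cost_usd(["D1", "D2", "ML1"])
model_updates = mean_cost_usd([op for op in gas_used if op.startswith("ML2-")])

print(round(dataset_and_registration))  # prints 160
print(round(model_updates))             # prints 47
```

Under that ordering assumption, the average over D1, D2 and ML1 comes to about USD $160 and the average over the seven ML2 updates to about USD $47, matching the figures quoted on the slide.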
  • 25. Security Threats and Countermeasures
      • Threats (threat: target)
          • T1 – Tampering: at-rest data (datasets, models, provenance)
          • T2 – Tampering: in-transit data (datasets, models, provenance)
          • T3 – Spoofing: provenance records, verification results
          • T4 – Repudiation: provenance records
          • T5 – DoS: data stores
          • T6 – DoS: provenance capturing process
          • T7 – DoS: provenance retrieval process
      • Countermeasures listed: embed provenance recording process in smart contracts; store provenance on blockchain; store assets on P2P content network; anchor off-chain artefacts to on-chain records; embed provenance updates in blockchain transactions
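The countermeasure "anchor off-chain artefacts to on-chain records" can be illustrated with a small tamper check. This is a sketch of the general idea only: the function names are hypothetical, and SHA-256 is used as the fingerprint for simplicity, whereas the slides refer to IPFS hashes.

```python
import hashlib
import hmac


def fingerprint(artifact: bytes) -> str:
    """Content hash that would be written into the on-chain record."""
    return hashlib.sha256(artifact).hexdigest()


def matches_on_chain(artifact: bytes, on_chain_hash: str) -> bool:
    """Recompute the hash of a retrieved artefact and compare it."""
    return hmac.compare_digest(fingerprint(artifact), on_chain_hash)


model_bytes = b"serialised model weights"
on_chain_hash = fingerprint(model_bytes)  # recorded at training time

print(matches_on_chain(model_bytes, on_chain_hash))         # prints True
print(matches_on_chain(model_bytes + b"!", on_chain_hash))  # prints False
```

Any change to the stored dataset or model changes its hash, so a mismatch with the on-chain record reveals tampering without the artefact itself having to live on the blockchain.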
  • 26. Summary and Future Research
  • 27. What if a key feature of your software is built by dozens of teams around the world … but you have little idea how it was built?
      • Recap of the scenario: Auditor, Model Verifier, Model Developer, Model Operator and Dataset Admin follow the nine-step workflow, from "I have an idea for an ML application" (including outsourced labeling on Amazon Mechanical Turk) to "I seek third-party validation"
      • What can go wrong: deliberate mislabeling (poisoning); dataset tampering; vulnerability in ML frameworks; model swapping; cover-up; not enough information
  • 28. ProML: Decentralised Provenance Management (problem: solution)
      • Security (authenticity, integrity, non-repudiation): Artefact-as-a-State-Machine; embed provenance update into blockchain transactions; embed workflow in blockchain smart contracts
      • Resilience (availability and fault tolerance): replicate assets and provenance information with blockchain and IPFS
      • Decentralization (disintermediation, user-driven, control information flow): decentralized platform architecture; user-driven provenance extraction; limit information flow within a trust domain
      • P1: If you use provenance, you control it, and you manage and store it
      • P2: Use your existing tools; keep info flow within your organisation
      • P3: Embed provenance records in blockchain for security; embed provenance update process in smart contracts for resilience