2. CREST Health Analytics Platform
• The CREST platform is a one-stop solution for health data storage and analysis.
• It provides a comprehensive, flexible, and scalable ecosystem of frameworks.
• It supports capturing, processing, analysing, and visualising large volumes of
health data that are too complex for traditional data-processing applications.
3. Use cases and scenarios
Scenario: In emergency health situations such as a pandemic or flooding, there is a strong
need for predictive analytics to forecast the demand for medical supplies.
Solution: Health analytics with the CREST platform to provide
• Predictive analysis using outbreak patterns and other historical data
• Monitoring of cases – case numbers and patients' health
• Recommendations on resources for healthcare facilities
5. CREST Infrastructure and data management
• Automated infrastructure deployment (Infrastructure as Code, IaC)
• Network configuration
• Software installation
• Benchmarking experimentation testbed
• Run-time patching recovery
• Data storage and management
• Big data storage solutions cluster configuration
6. Use cases and scenarios
• Determining energy efficiency of various data workloads for low-powered devices
• Measuring performance and resource usage for various data distribution flows
• Modelling effects of node mobility under different networking scenarios
• Automated comparison of multiple data storage and processing solutions
• Detecting and recovering from broken run-time patches
12. Supply Chain Provenance of ML-based Software
Nguyen Khoi Tran, M. Ali Babar, Mingyu Guo
CREST – The University of Adelaide, Australia
13. Scenario: Distributed ML DevOps
Participants: Client/Operator, Dataset Admin, Model Developer, Model Verifier, Auditor
1. I have an idea for an ML application
2. I hire a company to collect data for me
3. I gather the data
4. I outsource labeling to Amazon Mechanical Turk
5. I pass the data to the appointed developers
6. I develop the ML model
7. I test and verify the model
8. I return the model to the client
9. I seek third-party validation
14. Scenario: Distributed ML DevOps goes wrong
The same participants (Client/Operator, Dataset Admin, Model Developer, Model Verifier,
Auditor) and workflow steps 1–9 as above, but now with threats along the way:
• Deliberate mislabeling (poisoning)
• Dataset tampering
• Vulnerability in ML frameworks
• Model swapping
• Cover-up
• Not enough information
15. How to capture and preserve the records of “who did what” to ML assets
(a.k.a. workflow provenance information) in a distributed ML workflow environment?
16. Existing Approach: A Centralised Platform
All participants (Client/Operator, Dataset Admin, Model Developer, Model Verifier,
Auditor) read and write a shared Provenance Database, which holds:
• Dataset provenance (e.g., data sheets)
• Asset provenance: model development records and model testing records (e.g., model cards)
• Auditing results
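As a hedged illustration of what such a provenance database might hold, the sketch below models the record types named on this slide (data sheet, development records, model card) as Python dataclasses. All field names are illustrative assumptions, not an actual ProML schema.

```python
# Hypothetical record types for a centralised provenance database.
# Only the artefact kinds (data sheet, model card, audit results) come
# from the slide; every field name here is an assumption.
from dataclasses import dataclass, asdict


@dataclass
class DatasetProvenance:
    """Data-sheet-style record created by the Dataset Admin."""
    dataset_id: str
    version: str
    collected_by: str
    labelled_by: str


@dataclass
class ModelProvenance:
    """Development and testing records, roughly a model card."""
    model_id: str
    dataset: DatasetProvenance
    developer: str
    test_f1: float
    audit_result: str = "pending"  # filled in later by the Auditor


record = ModelProvenance(
    model_id="ML2",
    dataset=DatasetProvenance("D1", "v1", "acme-collectors", "mturk"),
    developer="dev-team-a",
    test_f1=0.92,
)
print(asdict(record)["dataset"]["labelled_by"])  # mturk
```

Note that `asdict` flattens the nested dataset record, which is convenient when serialising such records for storage.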
18. Decentralized Software Platform
• Each participant (Dataset Admin, Model Developer, Model Verifier, Model Operator,
Auditor) runs a ProML node; the nodes synchronise via provenance update broadcasts.
• Inside a ProML node – provenance capturing: user interface, CLI client, capturing
library, blockchain wallet, and signer; provenance querying: query interface and
blockchain client; service providers/clients: storage provider (IPFS client) and
blockchain provider.
• Content distribution network: stores datasets and models.
• Blockchain: stores dataset provenance, model provenance, and the provenance update process.
Design principles:
• P1: If you use provenance, you control it – you manage and store it.
• P2: Use your existing tools; keep information flow within your organisation.
• P3: Embed provenance records in the blockchain for security; embed the provenance
update process in smart contracts for resilience.
19. User-driven Provenance Capturing
(Design principle P2: use your existing tools; keep information flow within your organisation)
1. Develop: the Model Developer writes an ML training script / notebook
2. Embed: calls to the Logging API are embedded in the script
3. The running script sends the provenance message pmi to the local ProML node
4a. The node submits the payload to the storage provider; 4b. the provider returns a CID
5. The node crafts the transaction tx_pmi
6. The Signer signs tx_pmi
7. The node submits tx_pmi to the blockchain provider
8. The blockchain validates and inserts tx_pmi

Exemplary logging API:
Function            Parameters
selectData()        datasetID, datasetVersion, datasetMetadata (columnInfo, labelInfo)
preprocessData()    processedDataset, datasetMetadata (columnInfo, labelInfo)
engineerFeatures()  featureList, featureSelectAlg (algConfigs)
train()             classifierInfo (type, library, version, hyperparameters), model
evaluate()          trainingSetRatio, F1, acc, trainingDuration
validate()          F1, acc, recall, precision, Matthew, MSE, Fowlkes
deploy()            model, deploymentInfo
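To make the exemplary logging API above concrete, here is a minimal stand-in: each call records a “who did what” provenance event. The `ProMLLogger` class and its in-memory buffering are illustrative assumptions; only the function names and parameters come from the table.

```python
# Stand-in for the exemplary logging API. The class name and the
# in-memory event buffer are assumptions for illustration; a real
# ProML node would forward each event as a provenance message.
class ProMLLogger:
    def __init__(self):
        self.events = []

    def _log(self, action, **params):
        self.events.append({"action": action, **params})

    def select_data(self, dataset_id, dataset_version, **metadata):
        self._log("selectData", datasetID=dataset_id,
                  datasetVersion=dataset_version, **metadata)

    def train(self, classifier_info):
        self._log("train", classifierInfo=classifier_info)

    def evaluate(self, f1, acc, training_duration):
        self._log("evaluate", F1=f1, acc=acc,
                  trainingDuration=training_duration)


# Inside a training script or notebook, the calls are embedded at the
# exact spots the developer wants captured (step 2, "Embed"):
log = ProMLLogger()
log.select_data("D1", "v2", columnInfo=["age", "bp"], labelInfo="risk")
log.train({"type": "RandomForest", "library": "sklearn",
           "version": "1.4", "hyperparameters": {"n_estimators": 100}})
log.evaluate(f1=0.91, acc=0.93, training_duration="42s")
print(len(log.events))  # 3
```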
20. Sample Data
Asset lifecycle: Initial State → Registered → Selected Dataset → Pre-processed Dataset →
Engineered Feature Sets → Trained Model → Evaluated Model → Validated Model → Deployed
Sample provenance transactions (Ropsten test network):
• ML2-1, Select Dataset:
0x1e440b6842ab9efd56ff995a7cc43d08a1f75ece170226c25aeacfc1946a1c66
• ML2-3, Feature engineering:
0x0e75eb311c8f4d0a89948a701729e3696d6da33bec5a7e6403543c4d676ea380
• ML2-4, Training:
0xd816547ccc817d8cd3b28a56a84e8f2bd960ab3c648e6425bee2eade363e2501
• ML2-5, Evaluation:
0x4a30e2905f6f774d02f80a366566492698dacd47ca2a90ff55bfd56c1f910cbc
• ML2-6, Validation:
0xfb67bbc4e7391ca8711d6fa9f06a688a774329deef05e732c49749b3d44657fa
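The lifecycle on this slide can be sketched as a simple state machine. How transitions are actually enforced is an assumption here; in ProML the check would live in the on-chain provenance update process, not in local Python.

```python
# Sketch of the asset lifecycle from the slide as a linear state
# machine. The enforcement logic is an illustrative assumption.
LIFECYCLE = [
    "Initial State", "Registered", "Selected Dataset",
    "Pre-processed Dataset", "Engineered Feature Sets",
    "Trained Model", "Evaluated Model", "Validated Model", "Deployed",
]


def advance(state: str, update: str) -> str:
    """Accept an update only if it moves the asset to the next state."""
    i = LIFECYCLE.index(state)
    if i + 1 < len(LIFECYCLE) and LIFECYCLE[i + 1] == update:
        return update
    raise ValueError(f"illegal transition {state!r} -> {update!r}")


# Replay a full, valid history of provenance updates:
state = "Initial State"
for step in LIFECYCLE[1:]:
    state = advance(state, step)
print(state)  # Deployed
```

An out-of-order update, such as deploying an unvalidated model, would raise an error under this check.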
Good morning everyone, I'm Triet, a research fellow in CREST.
And today I'd like to share with you an overview of our research on software security.
To start with, I believe that you all can recognize most if not all the icons on the screen here.
In fact, it's hard to imagine a day in which we don't use any of these software apps. At least for me, I don't even remember how many emails I've sent and received this week.
Nowadays, software exists everywhere and has become a part of our daily life. They've drastically changed the way we live, work, and interact.
And of course, software is also widely used for the healthcare domain. A notable and recent example would be the contact tracing app, or specifically CovidSafe app we have here in Australia.
And if you think about these apps, these aren't just "software", but they're actually "AI powered software". For example, Google products use advanced AI recommender systems to show us the next video to watch on YouTube.
In the context of healthcare, and specifically COVID-19, AI has, as far as I know, been utilised to predict hot-spot locations or the number of new positive cases in the coming week, helping governments come up with suitable preventive measures early.
As we can see, these software apps and technologies are very useful for us, but they also contain security risks that can lead to catastrophic consequences.
For example, last year, you may have heard about the Log4J vulnerability, which took the entire Internet by storm back then.
This vulnerability could be exploited to affect millions of systems around the world, and it's estimated that billions of dollars will be lost to the cyber attacks caused by this one vulnerability alone.
And this is just one example among the thousands of critical vulnerabilities discovered every single year. So you can see how much damage vulnerabilities can cause if we don't prevent and address them in time.
And the vision of our research is to prevent such dangerous vulnerabilities. Specifically, we aim to develop tools and techniques and distill practices to provide early information and warnings about software vulnerabilities for both expert and non-expert users.
Our research mainly leverages various data sources to develop high-performing and robust AI/data-driven techniques that automate, and give insights into, the whole vulnerability lifecycle: detecting vulnerabilities early, assessing their probability of exploitation and their impacts, and recommending how developers should plan and prioritise mitigation and fixing. Currently, our research targets both traditional and contemporary AI-based systems, as well as the supporting infrastructure for these systems.
For AI systems, we've focused on phishing detection, for example systems that filter out spam emails designed to trick users into clicking malicious links and giving up their personal data.
It's worth noting that so far, we have not analysed vulnerabilities in healthcare apps. So, I believe that can be one area of collaboration that we can explore in the meeting today.
And that's all about our current software security research. Thank you.
Can we leverage decentralised technology to solve this problem?
Let's see how these principles manifest in the platform. According to the first design principle, we structure the ProML platform as a collection of peer nodes, called ProML nodes. All participants who collaborate on an ML model can deploy a ProML node within their organisation. Each ProML node acts as a representative of a participant. All of the interconnected ProML nodes have equal rights and responsibility to access, update, and secure ML provenance information.
ProML nodes are synchronised with each other using a blockchain protocol. The ProML nodes themselves can act as full nodes to form a blockchain network. Alternatively, the framework can rely on a remote blockchain network.
ProML nodes are also gateways for participants to interact with the provenance information. Through APIs and command-line interfaces, participants' existing ML toolsets submit and query provenance information; no new tools are necessary. It should also be noted that these exchanges of information happen within organisational boundaries, thus fulfilling design principle P2.
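To illustrate the query side of this gateway role, here is a hedged sketch of how a participant's tooling might ask its local ProML node for an asset's provenance trail. The in-memory store, the record shape, and the function name are all assumptions for illustration.

```python
# Hypothetical query interface of a local ProML node. In reality the
# records are replicated via the blockchain; a dict stands in here.
import json

NODE_STORE = {
    "ML2": [
        {"action": "selectData", "by": "dataset-admin"},
        {"action": "train", "by": "model-developer"},
    ],
}


def query_provenance(asset_id: str) -> str:
    """What a CLI call like `proml query <asset>` might return."""
    return json.dumps(NODE_STORE.get(asset_id, []), indent=2)


print(query_provenance("ML2"))
```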
ProML also supports a peer-to-peer content distribution network as an alternative venue for storing and distributing models and datasets.
Let's take a closer look at how provenance information is captured.
The process is initiated by a workflow participant, such as a model developer. They embed function calls to a Logging API provided by their local ProML node.
When the training script or notebook runs, the function calls will happen at the exact spot specified by the model developer and submit the requested provenance information to the local ProML node.
If a participant chooses to, the ProML node can offload the payload part of the submitted provenance information, such as a dataset or the binary of a model, and replace the payload with its corresponding hash. This process is called offloading.
After offloading, the ProML node transforms the submitted provenance information into a blockchain transaction and signs it on behalf of the participant.
Finally, the ProML node submits the transaction to the blockchain via its local blockchain client. Once the transaction has been mined, the new information becomes available to all workflow participants.
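The capturing steps just described (offload, craft, sign, submit) can be sketched end to end as follows. SHA-256 stands in for an IPFS content identifier and HMAC for the wallet's transaction signature; both are simplifying assumptions, not ProML's actual cryptography.

```python
# End-to-end sketch of provenance capturing. sha256 stands in for an
# IPFS CID and HMAC for the signer's ECDSA signature -- assumptions
# made so the example stays self-contained.
import hashlib
import hmac
import json

storage = {}  # stand-in for the content distribution network


def offload(payload: bytes) -> str:
    """Store a heavy payload and return its content address."""
    cid = hashlib.sha256(payload).hexdigest()
    storage[cid] = payload
    return cid


def craft_and_sign(record: dict, signing_key: bytes) -> dict:
    """Craft a transaction from a provenance record and sign it."""
    body = json.dumps(record, sort_keys=True).encode()
    sig = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return {"record": record, "signature": sig}


model_binary = b"...serialized model weights..."
record = {"action": "train", "payload_cid": offload(model_binary)}
tx = craft_and_sign(record, signing_key=b"participant-wallet-key")

# The payload stays retrievable by its hash after offloading:
assert storage[record["payload_cid"]] == model_binary
print(tx["signature"][:8])
```

Because the record embeds only the payload's hash, anyone can later verify that a retrieved dataset or model matches what was signed.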
Here are some sample transactions on the Ropsten network that capture key provenance updates; the hex strings are the hashes of the corresponding blockchain transactions.
For instance, the training record ML2-4 contains the information required by researchers in the case project, such as hyperparameters, type and version of the utilised ML training library.