Privacy-Preserving Data Analysis Workflows for eScience

Computing-intensive experiments in modern sciences have become increasingly data-driven, perfectly illustrating the challenges of the Big Data era. These experiments are usually specified and enacted in the form of workflows that need to manage (i.e., read, write, store, and retrieve) sensitive data such as persons' past diseases and treatments. While there is an active body of research on how to protect sensitive data by, for instance, anonymizing datasets, few approaches assist scientists in identifying the datasets generated by the workflows that need to be anonymized, and in setting the anonymization degree that must be met. We present in this paper a preliminary approach for setting and inferring the anonymization requirements of datasets used and generated by a workflow execution. The approach was implemented and showcased using a concrete example, and its efficiency was assessed through validation exercises.

Published in: Engineering

Privacy-Preserving Data Analysis Workflows for eScience

  1. 1. Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio, Edvan Soares and Mahmoud Berhamgi. Contact: kbelhajj@gmail.com
  2. 2. Data-driven analysis pipelines
        • Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
        • Tools for automating frequently performed data-intensive activities
        • Provenance for the resulting datasets: the method followed, the resources used, the datasets used
        Khalid Belhajjame @ DarliAP Workshop, 2019
  3. 3. Example application domains [Credit: Carole A. Goble]: GWAS and pharmacogenomics (association study of Nevirapine-induced skin rash in a Thai population); trypanosomiasis (sleeping sickness parasite) in African cattle; astronomy and heliophysics; library document preservation; systems biology of micro-organisms; observing systems simulation experiments (JPL, NASA); biodiversity and invasive species modelling.
  4. 4. • In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
        • There is, therefore, a serious concern that inappropriate manipulation of datasets during experiments could lead to the leak or misuse of sensitive data.
        • Publishing the provenance of the executions of such workflows raises privacy concerns.
  5. 5. To our knowledge, no existing proposal assists scientists in the task of anonymizing the provenance of their experiments.
        Our objective: we seek to assist scientists in the task of anonymizing workflow provenance to preserve the privacy of individuals.
        • Most related work in the area has focused on the problem of securing workflow provenance and policing access to it.
        • Protecting the integrity of provenance data from corruption using cryptographic techniques [Hasan and Khan, 2017; Lyle and Martin, 2010].
        • Deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies [Chebotko et al., 2008; Cohen Boulakia et al., 2008].
        • Policy languages allowing scientists to specify relationships between datasets and the workflow modules, and their properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010].
        • Protecting the privacy of the modules that compose the workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
  6. 6. ‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - Frank McSherry (https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
        Example [Credit: Steve Touw, Immuta]: for the values $320k, $340k, $330k and $30M, the sensitivity of the median is ~10k and the sensitivity of the mean is ~30M.
  7. 7. • For our work, we chose the most fundamental privacy model for anonymization, namely k-anonymity, which was proposed to protect individual privacy in data publishing.
        • While k-anonymity is less powerful than differential privacy, it is suitable for our purposes, given that it provides the means for:
          • exploring the provenance of workflows,
          • examining the data products used and generated by the workflows,
          • preserving (to a certain extent) lineage information between data products.
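To make the k-anonymity model concrete: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The following is a minimal sketch of that check, not the authors' implementation; the function names, column names and sample records are hypothetical.

```python
from collections import Counter


def anonymity_degree(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    This is the size of the smallest group of records that share the
    same values for all quasi-identifier columns.
    """
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return anonymity_degree(records, quasi_identifiers) >= k


# Hypothetical, already generalized patient records (illustration only).
patients = [
    {"age": "30-39", "zip": "750**", "disease": "flu"},
    {"age": "30-39", "zip": "750**", "disease": "asthma"},
    {"age": "40-49", "zip": "690**", "disease": "flu"},
    {"age": "40-49", "zip": "690**", "disease": "diabetes"},
]

degree = anonymity_degree(patients, ["age", "zip"])
```

With these sample records, each (age, zip) combination is shared by two records, so the table satisfies k = 2 but not k = 3.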
  8. 8. • A workflow is defined by the triple:
        • An operation op in OP is defined as:
        • The data links:
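The formal definitions on this slide are not preserved in the transcript. As a rough sketch of the kind of model the slide names (a workflow with a set of operations OP and data links), the structure could be represented as follows. The field names, and the assumption that the three components are a name, the operations and the data links, are hypothetical and not taken from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    inputs: tuple[str, ...]   # names of input parameters
    outputs: tuple[str, ...]  # names of output parameters


@dataclass(frozen=True)
class DataLink:
    source: tuple[str, str]  # (operation name, output parameter)
    target: tuple[str, str]  # (operation name, input parameter)


@dataclass(frozen=True)
class Workflow:
    name: str
    operations: tuple[Operation, ...]
    data_links: tuple[DataLink, ...]

    def workflow_inputs(self):
        """Input parameters that are not fed by any data link."""
        fed = {link.target for link in self.data_links}
        return [
            (op.name, i)
            for op in self.operations
            for i in op.inputs
            if (op.name, i) not in fed
        ]


# Hypothetical two-step workflow (illustration only).
wf = Workflow(
    name="analysis",
    operations=(
        Operation("select", ("records",), ("cohort",)),
        Operation("aggregate", ("cohort_in",), ("stats",)),
    ),
    data_links=(DataLink(("select", "cohort"), ("aggregate", "cohort_in")),),
)
```

Here a parameter is identified by the pair (operation name, parameter name), mirroring the <op, p> notation used on the later slides.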
  13. 13. • Sensitive parameters: to specify that a given input or output parameter carries sensitive data, we use a boolean function that is true if the data bound to <op, p> during the execution are sensitive.
          • Anonymity degree: we use a function to specify the anonymity degree of the parameter <op, p> with respect to a workflow instance insWf.
  14. 14. • Manually identifying the sensitive parameters of a workflow and setting their anonymity degrees can be tedious.
          • This is especially the case when the workflow includes a large number of operations.
          • We assist the scientist in this task by leveraging parameter dependencies.
  15. 15. • A parameter <op, p> depends on a parameter <op', p'> in a workflow (DWf) if, during the execution of (DWf), the data bound to <op', p'> contribute to or influence the data bound to <op, p>.
          • Given a workflow (DWf), the dependencies between its parameters are inferred as follows:
            • Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs.
            • If the workflow (DWf) contains a data link connecting an output <op, o> to an input <op', i>, then <op', i> depends on <op, o>.
            • We also transitively derive dependencies between the operation parameters: if <op, p> depends on <op', p'> and <op', p'> depends on <op'', p''>, then <op, p> depends on <op'', p''>.
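The three inference rules on this slide (outputs depend on inputs, data links propagate dependencies, and transitivity) can be sketched as a small fixpoint computation. This is an illustrative sketch only, not the authors' implementation; the function name, the dictionary-based workflow encoding and the sample workflow are assumptions.

```python
def infer_dependencies(operations, data_links):
    """Infer parameter dependencies of a workflow.

    operations: {op_name: (input_names, output_names)}   (assumed encoding)
    data_links: iterable of ((op, output), (op2, input)) (assumed encoding)
    Returns {param: set of params it depends on}, transitively closed,
    where a param is the pair (operation name, parameter name).
    """
    deps = {}
    # Rule 1: every output of an operation depends on all of its inputs.
    for op, (inputs, outputs) in operations.items():
        for i in inputs:
            deps.setdefault((op, i), set())
        for o in outputs:
            deps.setdefault((op, o), set()).update((op, i) for i in inputs)
    # Rule 2: an input fed by a data link depends on the linked output.
    for source, target in data_links:
        deps.setdefault(source, set())
        deps.setdefault(target, set()).add(source)
    # Rule 3: transitive closure, repeated until nothing changes.
    changed = True
    while changed:
        changed = False
        for direct in deps.values():
            reachable = set()
            for d in direct:
                reachable |= deps.get(d, set())
            if not reachable <= direct:
                direct |= reachable
                changed = True
    return deps


# Hypothetical two-step workflow (illustration only).
operations = {
    "select": (["records"], ["cohort"]),
    "aggregate": (["cohort_in"], ["stats"]),
}
data_links = [(("select", "cohort"), ("aggregate", "cohort_in"))]
deps = infer_dependencies(operations, data_links)
```

In this example the final output <aggregate, stats> ends up depending on the workflow input <select, records> through the data link, which is the information the following slides rely on.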
  16. 16. • A parameter <op', p'> that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive.
          • Note that we say "may be" sensitive. This is because an operation that consumes sensitive datasets may produce non-sensitive datasets.
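As a minimal sketch of this rule, assuming the parameter dependencies have already been computed and are given as a mapping from each parameter to the set of parameters it depends on, the candidate sensitive parameters can be flagged as follows. The function name and the sample data are hypothetical.

```python
def possibly_sensitive(dependencies, sensitive_inputs):
    """Return the non-input parameters that may carry sensitive data.

    dependencies: {param: set of params it (transitively) depends on}
    sensitive_inputs: set of workflow inputs declared sensitive
    The result is only a set of candidates: an operation that consumes
    sensitive data may still produce non-sensitive data.
    """
    return {
        param
        for param, sources in dependencies.items()
        if param not in sensitive_inputs and sources & sensitive_inputs
    }


# Hypothetical dependencies for a small workflow (illustration only).
dependencies = {
    ("select", "records"): set(),
    ("select", "criteria"): set(),
    ("select", "cohort"): {("select", "records"), ("select", "criteria")},
    ("report", "template"): set(),
    ("report", "layout"): {("report", "template")},
}
candidates = possibly_sensitive(dependencies, {("select", "records")})
```

The candidates are meant to be reviewed by the scientist, who can discard those that turn out not to be sensitive.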
  17. 17. • In addition to helping the designer identify sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to the dataset instances of those sensitive parameters.
          • The anonymity degree of a parameter <op', p'> for a workflow execution insWf can be defined as the maximum anonymity degree of the sensitive datasets that are used as inputs to the workflow and that contribute to the dataset instances of <op', p'>.
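The maximum rule can be sketched as follows, again assuming precomputed dependencies and a mapping from sensitive workflow inputs to the anonymity degree set for one execution. The function name and the sample values are hypothetical.

```python
def inferred_degree(param, dependencies, input_degrees):
    """Anonymity degree inferred for `param` in one workflow execution.

    dependencies: {param: set of params it (transitively) depends on}
    input_degrees: {sensitive workflow input: anonymity degree k}
    Returns the maximum degree among the sensitive inputs that contribute
    to `param`, or None if no sensitive input contributes to it.
    """
    degrees = [
        input_degrees[source]
        for source in dependencies.get(param, ())
        if source in input_degrees
    ]
    return max(degrees, default=None)


# Hypothetical execution: two sensitive inputs with different degrees.
dependencies = {
    ("merge", "out"): {("merge", "patients"), ("merge", "treatments")},
    ("plot", "figure"): {("plot", "style")},
}
input_degrees = {("merge", "patients"): 5, ("merge", "treatments"): 10}
merged_degree = inferred_degree(("merge", "out"), dependencies, input_degrees)
```

Taking the maximum is the conservative choice: a dataset that is k-anonymous for the largest required k is also k-anonymous for every smaller k, so the requirements of all contributing inputs are met.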
  18. 18. [Figure: architecture. Components: data owners (sensitive and non-sensitive data), public data repositories (non-sensitive data), and a trusted workflow environment comprising a workflow workbench, a workflow execution engine, a data anonymizer and a private data repository. Interactions numbered 1 to 7: share, launch execution, get inputs, store outputs, publish data, get data, launch data anonymization.]
  19. 19. • For validation purposes, we used 20 different CWL workflows [1] and performed 500 executions per workflow. We computed the overhead of our method in terms of the computation of parameter dependencies, the identification of sensitive parameters and the computation of anonymity degrees.
          • The results showed that the overhead is small compared to the execution of the workflow: on average, it takes less than a millisecond to perform all the necessary computation.
          [1] view.commonwl.org/workflows
  20. 20. • We presented an approach for preserving privacy in the context of scientific workflows that rely heavily on large datasets.
          • We have shown how data plays a role in i) identifying sensitive operation parameters in the workflow and ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters.
          • This is preliminary work that opens up opportunities for more research in the field of anonymization of workflow data.
  21. 21. Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio, Edvan Soares and Mahmoud Berhamgi. Contact: kbelhajj@gmail.com
