The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system - Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets, that can only be shared under some restrictions.
Big Data Repository for Structural Biology: Challenges and Opportunities by P...datascienceiqss
SBGrid (Morin et al., 2013, eLIFE and www.sbgrid.org) is a Harvard based structural biology global computing consortium with a primary focus on the curation of research software. Dr. Sliz will discuss a recent SBGrid project that aims to establish a repository for experimental datasets from SBGrid laboratories. Issues of handling large data volumes, data validation and repository sustainability will be addressed in this talk.
Data Citation Implementation Guidelines By Tim Clarkdatascienceiqss
This document outlines guidelines for improving the reproducibility of scholarly research through better data citation practices. It recommends depositing cited datasets in archival repositories, using persistent identifiers that meet JDDCP criteria, and having identifiers resolve to landing pages that provide both human- and machine-readable metadata about the dataset. Landing pages should be retained even if the underlying data is removed. Repositories are responsible for maintaining identifier persistence and researchers should treat data as first-class objects in the scholarly process. Following these guidelines would radically improve transparency and enable both humans and machines to access and interpret cited data.
Lesson 8 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Lesson 2 in a set of 10 created by DataONE on Best Practices fo Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
DataONE Education Module 10: Legal and Policy IssuesDataONE
This document discusses legal, ethical and policy issues related to managing research data. It defines key concepts like copyright, licenses and waivers, and explains why identifying ownership and control is important. Restrictions on data use and sharing are discussed, including protecting privacy and following regulations. Open licensing is presented as a way to facilitate sharing while still giving credit. The importance of behaving ethically and respecting licenses is emphasized.
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
Data commons are emerging as a solution to challenges in analyzing and sharing large biomedical datasets. A data commons co-locates data with cloud computing infrastructure and software tools to create an interoperable resource for the research community. Examples include the NCI Genomic Data Commons and the Open Commons Consortium. The open source Gen3 platform supports building disease- or project-specific data commons to facilitate open data sharing while protecting patient privacy. Developing interoperable data commons can accelerate research through increased access to data.
What is Data Commons and How Can Your Organization Build One?Robert Grossman
1. Data commons co-locate large biomedical datasets with cloud computing infrastructure and analysis tools to create shared resources for the research community.
2. The NCI Genomic Data Commons is an example of a data commons that makes over 2.5 petabytes of cancer genomics data available through web portals, APIs, and harmonized analysis pipelines.
3. The Gen3 platform is an open source software stack for building data commons that can interoperate through common APIs and data models to support reproducible, collaborative research across projects.
Big Data Repository for Structural Biology: Challenges and Opportunities by P...datascienceiqss
SBGrid (Morin et al., 2013, eLIFE and www.sbgrid.org) is a Harvard based structural biology global computing consortium with a primary focus on the curation of research software. Dr. Sliz will discuss a recent SBGrid project that aims to establish a repository for experimental datasets from SBGrid laboratories. Issues of handling large data volumes, data validation and repository sustainability will be addressed in this talk.
Data Citation Implementation Guidelines By Tim Clarkdatascienceiqss
This document outlines guidelines for improving the reproducibility of scholarly research through better data citation practices. It recommends depositing cited datasets in archival repositories, using persistent identifiers that meet JDDCP criteria, and having identifiers resolve to landing pages that provide both human- and machine-readable metadata about the dataset. Landing pages should be retained even if the underlying data is removed. Repositories are responsible for maintaining identifier persistence and researchers should treat data as first-class objects in the scholarly process. Following these guidelines would radically improve transparency and enable both humans and machines to access and interpret cited data.
Lesson 8 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Lesson 2 in a set of 10 created by DataONE on Best Practices fo Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
DataONE Education Module 10: Legal and Policy IssuesDataONE
This document discusses legal, ethical and policy issues related to managing research data. It defines key concepts like copyright, licenses and waivers, and explains why identifying ownership and control is important. Restrictions on data use and sharing are discussed, including protecting privacy and following regulations. Open licensing is presented as a way to facilitate sharing while still giving credit. The importance of behaving ethically and respecting licenses is emphasized.
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
Data commons are emerging as a solution to challenges in analyzing and sharing large biomedical datasets. A data commons co-locates data with cloud computing infrastructure and software tools to create an interoperable resource for the research community. Examples include the NCI Genomic Data Commons and the Open Commons Consortium. The open source Gen3 platform supports building disease- or project-specific data commons to facilitate open data sharing while protecting patient privacy. Developing interoperable data commons can accelerate research through increased access to data.
What is Data Commons and How Can Your Organization Build One?Robert Grossman
1. Data commons co-locate large biomedical datasets with cloud computing infrastructure and analysis tools to create shared resources for the research community.
2. The NCI Genomic Data Commons is an example of a data commons that makes over 2.5 petabytes of cancer genomics data available through web portals, APIs, and harmonized analysis pipelines.
3. The Gen3 platform is an open source software stack for building data commons that can interoperate through common APIs and data models to support reproducible, collaborative research across projects.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
This a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data:From Pipelines in Data Commons to AI in Data Ecosystems.
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021dkNET
Abstract
Good data stewardship is the cornerstone of knowledge, discovery, and innovation in research. The FAIR Data Principles address data creators, stewards, software engineers, publishers, and others to promote maximum use of research data. The principles can be used as a framework for fostering and extending research data services.
This talk will provide an overview of the FAIR principles and the drivers behind their development by a broad community of international stakeholders. We will explore a range of topics related to putting FAIR data into practice, including how and where data can be described, stored, and made discoverable (e.g., data repositories, metadata); methods for identifying and citing data; interoperability of (meta)data; best-practice examples; and tips for enabling data reuse (e.g., data licensing). Practical examples of how FAIR is applied will be provided along the way.
Presenter: Christopher Erdmann, Engagement, support, and training expert on the NHLBI BioData Catalyst project at University of North Carolina Renaissance Computing Institute
dkNET Webinars Information: https://dknet.org/about/webinar
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
The DataTags System: Sharing Sensitive Data with ConfidenceMerce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
DataONE Education Module 01: Why Data Management?DataONE
Lesson 1 in a set of 10 created by DataONE on Best Practices fo Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
There are two cultures in data science and analytics - those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers . We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps that has borrowed some of the techniques of DevOps.
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESMicah Altman
This talk, is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu)
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as the Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Sciences.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
This document discusses metadata, including what it is, why it is important, common components of metadata records, examples of metadata standards, and tips for writing good metadata. Metadata captures key details about data, such as who created it, when, how, and why, to facilitate discovering, understanding, and reusing the data. Standards provide consistency for computer interpretation and searching. Good metadata includes specific, accurate, and complete details to fully document data.
In this webinar, we gave a general introduction of the dkNET portal and showed how dkNET can be used to address a variety of use cases, including:
1) Find funding sources for your research of interest
2) Determine what study section have reviewed this type of research
3) Help with new NIH guidelines for rigor and reproducibility
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication or a validated target or a drug submission. Our failure rates are exceedingly high especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. Using both technical and social solutions together knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions to enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. When taken together more accurate, timely and inclusive decision-making is fostered.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article
Carole Goble ,
Better Software, Better Research
Issue No.05 - Sept.-Oct. (2014 vol.18)
pp: 4-8
IEEE Computer Society
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
Current trends in data security nursing research pptNursing Path
The document discusses current trends in data security. It begins by defining data security and its goals of confidentiality and integrity. Traditional SQL-based access control and views are described as having limitations. Two main attacks are discussed: SQL injection due to poor application implementation of security policies, and unintended information leakage when published data is combined from multiple sources. Current research topics aim to address leakage, enforce complex privacy policies, and allow secure sharing of data through techniques like encryption and secure computation. The challenges of moving policy implementation closer to the database are also discussed.
This document summarizes key points about data science and privacy regulation:
1. Regulation aims to alter behavior according to standards to achieve defined outcomes, and can involve standard-setting, information gathering, and modifying behavior.
2. With "big data", problems arise for the laissez-faire conception of privacy regulation due to market failures, insider threats, and mass surveillance capabilities.
3. Designing for privacy is important, such as data minimization, decentralization, consent requirements, and easy-to-use privacy interfaces. The "data exhaust" from ubiquitous data collection threatens privacy in Europe.
Should We Expect a Bang or a Whimper? Will Linked Data Revolutionize Scholar Authoring and Workflow Tools?
Jeff Baer, Senior Director of Product Management, Research Development Services, Proquest
The NIDDK Information Network (dkNET) portal facilitates access to research resources relevant to metabolic, digestive, and kidney diseases. dkNET functions as a search engine to discover resources across millions of records in hundreds of biomedical databases. It provides the ability to search across data sources, a resource registry, and the literature. dkNET aims to make it easy to find relevant research resources through its concept-based search interface.
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...SEAD
This document discusses the Sustainable Environment Actionable Data (SEAD) project, which aims to lower the costs and increase the value of data curation through a data lifecycle approach. SEAD provides lightweight data services to support sustainability research, including secure project workspaces, active and social curation tools, and integrated lifecycle support for data from ingest to long-term preservation. By leveraging technologies like Web 2.0 and standards, SEAD simplifies and automates curation processes using metadata captured from data producers and users. This allows curation activities to begin earlier in the data lifecycle and be distributed across researchers and curators.
Maintaining Data Confidentiality in Association Rule Mining in Distributed En...IJSRD
The data in real world applications is distributed at multiple locations, and the owner of the databases may be different people. Thus to perform mining task, the data needs to be kept at central location which causes threat to the privacy of corporate data. Hence the key challenge is to applying mining on distributed source data with preserving privacy of corporate data. The system addresses the problem of incrementally mining frequent itemsets in dynamic environment. The assumption made here is that, after initial mining the source undergoes into small changes in each time. The privacy of data should not be threatened by an adversary i.e. the miner and target database owner should not be able to recover original data from transformed data.
Sharing Data Through Plots with Plotly by Alex Johnsondatascienceiqss
Plotly is a web-based visualization and data analysis platform. Sharing Plotly plots in your dataverse creates a rich experience in which each user can interact with the data in whatever way best fits their needs: from zooming in and examining individual data points, to changing colors and styles, to importing the raw data into their favorite software environment for further analysis.
Persistent Identifier Services and their Metadata by John Kunzedatascienceiqss
Persistent identifiers (Pids) provide machine-actionable links to data and metadata that are vital to APIs (application programming interfaces) for publishing and citation. APIs are essentially request/response patterns that use Pids to reference things and metadata to describe not only the things themselves, but also any actions requested or taken. As a result, metadata design and standardization is wedded to API design and enhancement. With Pids as nouns and metadata as adjectives and qualifiers, Pid services play a key role in API implementation.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
This a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data:From Pipelines in Data Commons to AI in Data Ecosystems.
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021dkNET
Abstract
Good data stewardship is the cornerstone of knowledge, discovery, and innovation in research. The FAIR Data Principles address data creators, stewards, software engineers, publishers, and others to promote maximum use of research data. The principles can be used as a framework for fostering and extending research data services.
This talk will provide an overview of the FAIR principles and the drivers behind their development by a broad community of international stakeholders. We will explore a range of topics related to putting FAIR data into practice, including how and where data can be described, stored, and made discoverable (e.g., data repositories, metadata); methods for identifying and citing data; interoperability of (meta)data; best-practice examples; and tips for enabling data reuse (e.g., data licensing). Practical examples of how FAIR is applied will be provided along the way.
Presenter: Christopher Erdmann, Engagement, support, and training expert on the NHLBI BioData Catalyst project at University of North Carolina Renaissance Computing Institute
dkNET Webinars Information: https://dknet.org/about/webinar
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
The DataTags System: Sharing Sensitive Data with ConfidenceMerce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
DataONE Education Module 01: Why Data Management?DataONE
Lesson 1 in a set of 10 created by DataONE on Best Practices fo Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
There are two cultures in data science and analytics - those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers . We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps that has borrowed some of the techniques of DevOps.
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESMicah Altman
This talk, is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu)
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as the Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Sciences.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
This document discusses metadata, including what it is, why it is important, common components of metadata records, examples of metadata standards, and tips for writing good metadata. Metadata captures key details about data, such as who created it, when, how, and why, to facilitate discovering, understanding, and reusing the data. Standards provide consistency for computer interpretation and searching. Good metadata includes specific, accurate, and complete details to fully document data.
In this webinar, we gave a general introduction of the dkNET portal and showed how dkNET can be used to address a variety of use cases, including:
1) Find funding sources for your research of interest
2) Determine what study section have reviewed this type of research
3) Help with new NIH guidelines for rigor and reproducibility
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication or a validated target or a drug submission. Our failure rates are exceedingly high especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. Using both technical and social solutions together knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions to enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. When taken together more accurate, timely and inclusive decision-making is fostered.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article
Carole Goble ,
Better Software, Better Research
Issue No.05 - Sept.-Oct. (2014 vol.18)
pp: 4-8
IEEE Computer Society
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
Current trends in data security nursing research pptNursing Path
The document discusses current trends in data security. It begins by defining data security and its goals of confidentiality and integrity. Traditional SQL-based access control and views are described as having limitations. Two main attacks are discussed: SQL injection due to poor application implementation of security policies, and unintended information leakage when published data is combined from multiple sources. Current research topics aim to address leakage, enforce complex privacy policies, and allow secure sharing of data through techniques like encryption and secure computation. The challenges of moving policy implementation closer to the database are also discussed.
This document summarizes key points about data science and privacy regulation:
1. Regulation aims to alter behavior according to standards to achieve defined outcomes, and can involve standard-setting, information gathering, and modifying behavior.
2. With "big data", problems arise for the laissez-faire conception of privacy regulation due to market failures, insider threats, and mass surveillance capabilities.
3. Designing for privacy is important, such as data minimization, decentralization, consent requirements, and easy-to-use privacy interfaces. The "data exhaust" from ubiquitous data collection threatens privacy in Europe.
Should We Expect a Bang or a Whimper? Will Linked Data Revolutionize Scholar Authoring and Workflow Tools?
Jeff Baer, Senior Director of Product Management, Research Development Services, Proquest
The NIDDK Information Network (dkNET) portal facilitates access to research resources relevant to metabolic, digestive, and kidney diseases. dkNET functions as a search engine to discover resources across millions of records in hundreds of biomedical databases. It provides the ability to search across data sources, a resource registry, and the literature. dkNET aims to make it easy to find relevant research resources through its concept-based search interface.
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...SEAD
This document discusses the Sustainable Environment Actionable Data (SEAD) project, which aims to lower the costs and increase the value of data curation through a data lifecycle approach. SEAD provides lightweight data services to support sustainability research, including secure project workspaces, active and social curation tools, and integrated lifecycle support for data from ingest to long-term preservation. By leveraging technologies like Web 2.0 and standards, SEAD simplifies and automates curation processes using metadata captured from data producers and users. This allows curation activities to begin earlier in the data lifecycle and be distributed across researchers and curators.
Maintaining Data Confidentiality in Association Rule Mining in Distributed En...IJSRD
The data in real world applications is distributed at multiple locations, and the owner of the databases may be different people. Thus to perform mining task, the data needs to be kept at central location which causes threat to the privacy of corporate data. Hence the key challenge is to applying mining on distributed source data with preserving privacy of corporate data. The system addresses the problem of incrementally mining frequent itemsets in dynamic environment. The assumption made here is that, after initial mining the source undergoes into small changes in each time. The privacy of data should not be threatened by an adversary i.e. the miner and target database owner should not be able to recover original data from transformed data.
Sharing Data Through Plots with Plotly by Alex Johnsondatascienceiqss
Plotly is a web-based visualization and data analysis platform. Sharing Plotly plots in your dataverse creates a rich experience in which each user can interact with the data in whatever way best fits their needs: from zooming in and examining individual data points, to changing colors and styles, to importing the raw data into their favorite software environment for further analysis.
Persistent Identifier Services and their Metadata by John Kunzedatascienceiqss
Persistent identifiers (Pids) provide machine-actionable links to data and metadata that are vital to APIs (application programming interfaces) for publishing and citation. APIs are essentially request/response patterns that use Pids to reference things and metadata to describe not only the things themselves, but also any actions requested or taken. As a result, metadata design and standardization is wedded to API design and enhancement. With Pids as nouns and metadata as adjectives and qualifiers, Pid services play a key role in API implementation.
Dataverse in China: Internationalization, Curation and Promotion by Yin Shenqindatascienceiqss
Zhang Jilong & Yin Shenqin will discuss the internationalization development work done by Fudan University to support a Chinese language user interface in Dataverse. Additionally, the practice of data curation at Fudan University will be presented, as well as the branding and dissemination of Dataverse in China.
Center for Open Science and the Open Science Framework: Dataverse Add-on by S...datascienceiqss
The Open Science Framework (OSF: http://osf.io; supported and maintained by the Center for Open Science - COS: http://centerforopenscience.org/) is a free, open source workflow management service and repository designed for scientists to manage and connect everything across their research process. One of the first add-on connections was Dataverse, which provides value to users through an easy connection as a repository service. This talk will introduce the Dataverse add-on connection and provide a technical view of how it was built and how it connects the OSF and Dataverse.
Political Analysis Dataverse by Jonathan N. Katzdatascienceiqss
Political Analysis is the official journal of the Society for Political Methodology. It publishes quantitative and qualitative political methodology articles. Until 2012, data sharing policies were voluntary or lacked verification. Now, authors must upload data and code to the journal's Dataverse before publication. There has been 100% compliance, with over 15,000 downloads of the 241 replication datasets. Issues include occasional missing files or personal information in initial submissions, and the replication review occurring outside the journal management system. The publisher is working to integrate Dataverse access directly into the system.
Geospatial Data Visualization: WorldMap Integration by Raman Prasaddatascienceiqss
The Boston Area Research Initiative (BARI) has started work to integrate the Dataverse and Harvard WorldMap systems. This will give Dataverse users the opportunity to map shapefiles as well as tabular files. The maps will be saved in the WorldMap system, allow users to take advantage of the WorldMap’s rich feature set as well as share the maps through Dataverse. This effort involved building APIs on both systems and connecting them via a small web application.
Metadata & Data Curation Services by Thu-Mai Christiandatascienceiqss
The Odum Institute was an early adopter of the Dataverse Network™ (DVN) virtual archive platform, transferring all of its holdings to the Virtual Data Center (VDC), the DVN’s precursor, in 2005. This presentation will illustrate the Odum Institute Data Archive’s integration of the Dataverse Network™ into its current data curation pipeline process and discuss the Dataverse Network’s role in the Institute’s tiered levels of data curation services.
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...datascienceiqss
Scholars Portal, a program of the Ontario Council of University Libraries (OCUL), provides the technical infrastructure to store, preserve, and provide access to shared digital library collections in Ontario - including hosting a local instance of Dataverse since 2011. As part of a national project known as Portage (a project of the Canadian Association of Research Libraries), Scholars Portal is partnering with Artefactual Systems, Dataverse, the University of British Columbia, the University of Alberta, and others, to integrate Dataverse with preservation software Archivematica. When completed, this project will facilitate the long-term preservation of research data according to the Open Archival Information System (OAIS) Reference Model.
The Project TIER Dataverse: Archiving and Sharing Replicable Student Research...datascienceiqss
The document discusses Project TIER, which aims to teach students protocols for documenting their research in a way that allows for easy and exact replication of statistical results. It develops standards for student papers, theses, and independent projects. Project TIER also seeks to disseminate these methods of transparent and replicable research to other institutions. The document provides an example of the directory structure used to archive a senior thesis according to Project TIER's documentation protocol.
Data Publishing Models by Sünje Dallmeier-Tiessendatascienceiqss
Data Publishing is becoming an integral part of scholarly communication today. Thus, it is indispensable to understand how data publishing works across disciplines. Are there best practices others can learn from or even data publishing standards? How do they impact interoperability in the Open Science landscape? The presentation will look at a range of examples, and the main building blocks of data publishing today. The work has been conducted as part of the RDA Data Publishing Workflows group.
Dataverse in the Universe of Data by Christine L. Borgmandatascienceiqss
Data repositories are much more than "black boxes" where data go in but may never come out. Rather, they are situated in communities, with contributors, users, reusers, and repository staff who may engage actively or passively with participants. This talk will explore the roles that Dataverse plays – or could play – in individual communities.
DataTags, The Tags Toolset, and Dataverse IntegrationMichael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Achieving Privacy in Publishing Search logsIOSR Journals
The document discusses algorithms for publishing search logs while preserving user privacy. It analyzes a search log using an algorithm that produces three types of outputs: query counts, a query-action graph showing query-result click counts, and a query-reformulation graph showing query suggestions clicked. The algorithm adds noise to query counts before publishing to achieve differential privacy. It aims to provide useful aggregated information for applications like search improvement while preventing re-identification of individual user data in the search log.
The document outlines 4 steps for successful data mapping and identification during discovery: 1) Conduct a thorough investigation of where potentially responsive data is located; 2) Understand who has access to the data as these individuals can help locate additional data sources; 3) Consider data retention, destruction and archiving policies to prevent accidental data loss; 4) Identify hard copies of documents that may supplement electronic data. Taking these steps to map out a data acquisition plan can increase the value of the discovery process and the likelihood of case success.
Multilevel Privacy Preserving by Linear and Non Linear Data DistortionIOSR Journals
This document discusses privacy-preserving techniques for data mining called multilevel privacy preserving. It introduces the concept of generating multiple perturbed copies of data at different trust levels to protect privacy while allowing useful data mining. Key techniques discussed include data perturbation through adding random noise or distorting values, as well as data modification through aggregation, suppression, and swapping. Maintaining privacy is achieved by ensuring the noise added to different copies has a "corner-wave" covariance structure so statistical values do not differ significantly from the original data.
When we start to look at the promises of Big Data and the rapid evolution of tools and practices that yield amazing actionable insight, we must first look at why we’re doing it. How will this initiative support our strategy? What areas of improvement are we targeting? While there is an argument for the “happy accident” — data analysis that points us in new directions and opportunities — every tool and process should clearly align with our strategies and missions.
The document summarizes an informational webinar about efficiently handling subject access requests through automation. It discusses the growing volume and complexity of subject access requests, as well as the challenges of the manual process. The webinar promotes automating tasks like deduplication, redaction, classification, searching, and producing documents through a software solution from ZyLAB that can help organizations scale to meet demand while reducing costs and risks. Automation through ZyLAB's eDiscovery platform is presented as helping make the subject access request process more efficient.
Final Project RequirementsSubmit your final project topic here. .docxlmelaine
Final Project Requirements
Submit your final project topic here. Include a short paragraph describing your project and how you intend to research it. (Below find the project topic selected)
Here is a list of your upcoming project deliverables:
· Submit final project topic. (Done, see below for selected topic)
· Submit a brief abstract describing your final project.
· Submit final project materials.
. 500-700 word, double spaced, written in APA format, showing sources and a bibliography
· Residency
. Prepare a presentation on your final topic
How to Write an Abstract?
https://www.wikihow.com/Write-an-Abstract
How to Write a Research Paper?
https://owl.purdue.edu/owl/purdue_owl.html
Writing in APA Format with WORD
View this video for a quick overview of using APA format with WORD
https://www.youtube.com/watch?v=_ODakMMqvIs
Selected Topic for Final Project
CLOUD COMPUTING
Cloud computing is the delivery of multiple services through the Internet. These resources include tools and applications like data storage, servers, databases, networking, and software. Rather than keeping files on a proprietary hard drive or local storage device, cloud-based storage makes it possible to save them to a remote database. As long as an electronic device has access to the web, it has access to the data and the software programs to run it. Cloud computing is a popular option for people and businesses for a number of reasons including cost savings, increased productivity, speed and efficiency, performance, and security.
A fundamental concept behind cloud computing is that the location of the service, and many of the details such as the hardware or operating system on which it is running, are largely irrelevant to the user. The cloud’s main appeal is to reduce the time to market of applications that need to scale dynamically. Increasingly, however, developers are drawn to the cloud by the abundance of advanced new services that can be incorporated into applications, from machine learning to internet of things (IoT) connectivity.
My workplace Eco-system is currently cloud based as we have teams across the world who utilizes the services anytime/anywhere they want. Our database/application servers are spread across nation and stand-up servers are placed strategically across the nation as a part of disaster recovery plan. This project gives me an opportunity to dig deep and understand the cloud policies and processes in our cloud based ecosystem.
Week 12 Paper
on Sat, Jan 25 2020, 12:32 AM
67% highest match
Attachments (1)
· Submission_Text.html 67%
Word Count: 558
Submission_Text.html
1 MONITORING AND EVALUATING THE DATABASE FRAMEWORKS IS BASIC TO TENDING TO A FEW SOX GUIDELINES. 2 “AN EXHAUSTIVE AUDITING TECHNIQUE TRACKS CLIENT ACTION, INFORMATION AND SCHEMA ADJUSTMENTS, SECURITY CHANGES AS WELL AS OTHER ACTIVITIES, ASSISTING WITH UNCOVERING BOTH GENUINE AND POTENTIAL SECURITY DANGERS.” (AFYOUNI, 2018). COMPREHENSIVE AUDITING IS LIKEWISE NECESSARY TO MEETING ...
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document discusses security challenges for big data platforms and provides recommendations. It notes that as big data technologies become more mainstream, security is increasingly important. However, big data platforms are less mature than traditional databases and can combine different types of organizational data. The document outlines considerations for designing secure big data environments, including performance, compliance, security, and access. It recommends best practices like sending security context with data, managing identities centrally, and tracking implications of combining data sets.
Analytics and Data Visualization in the Legal MarketLexisNexis
Legal work requires being able to identify and recognize patterns in patterns in information – or data – quickly and efficiently. According to this Blue Hill Research report, this practice’s popularity is continuing to grow, and plays a pivotal role in today’s legal industry, allowing those who practice law to gain a better understanding of relationships between datasets. In fact, the ability to see data visually may give users the opportunity of identifying unnoticed relationships between these data sets.
The document provides an introduction to big data, including definitions and characteristics. It discusses how big data can be described by its volume, variety, and velocity. It notes that big data is large and complex data that is difficult to process using traditional data management tools. Common sources of big data include social media, sensors, and scientific instruments. Challenges in big data include capturing, storing, analyzing, and visualizing large and diverse datasets that are generated quickly. Distributed file systems and technologies like Hadoop are well-suited for processing big data.
Big data offers opportunities but also security and privacy issues due to its large volume, velocity, and variety. Some key security issues include insecure computation, lack of input validation and filtering, and privacy concerns in data mining and analytics. Recommendations to enhance big data security include securing computation code, implementing comprehensive input validation and filtering, granular access controls, and securing data storage and computation. Case studies on security issues include vulnerability to fake data generation, challenges with Amazon's data lakes, possibility of sensitive information mining, and the rapid evolution of NoSQL databases lacking security focus.
Practical and Actionable Threat Intelligence CollectionSeamus Tuohy
A great deal of the existing human rights reporting and analysis aggregate and strip away contextual information in order to produce “quantified knowledge” that is technically reliable and useful for governmental decision making. The results produced often end up too delayed, partial, distorted, and misleading to be used by local actors and human rights defenders to directly respond to the threats that they face. Those who could benefit most from the human rights knowledge being collected and shared in the digital world are those that existing repositories of information serve the least.
In this presentation I will provide concrete guidance on approaches for adopting data-rich, practical, and actionable threat information collection. In this content heavy 1.5 hour talk I will discuss a range of tools and techniques for seeking out sources of actionable information, distinguishing valuable information from useless but interesting information, and streamlining your information collection and analysis process to allow you to focus on your real work.
This talk WON’T be focused on collecting or sharing threat intelligence and/or human rights research aimed at evidence creation or changing the public dialogue. It WILL be focused on helping you identify, collect, and use publicly available sources of information to respond to your changing threat landscape.
Big data is used for structured, unstructured and semi-structured large volume of data which is difficult to
manage and costly to store. Using explanatory analysis techniques to understand such raw data, carefully
balance the benefits in terms of storage and retrieval techniques is an essential part of the Big Data. The
research discusses the Map Reduce issues, framework for Map Reduce programming model and
implementation. The paper includes the analysis of Big Data using Map Reduce techniques and identifying
a required document from a stream of documents. Identifying a required document is part of the security in
a stream of documents in the cyber world. The document may be significant in business, medical, social, or
terrorism.
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond HillClaraZara1
Big data is used for structured, unstructured and semi-structured large volume of data which is difficult to manage and costly to store. Using explanatory analysis techniques to understand such raw data, carefully balance the benefits in terms of storage and retrieval techniques is an essential part of the Big Data. The research discusses the MapReduce issues, framework for MapReduce programming model and implementation. The paper includes the analysis of Big Data using MapReduce techniques and identifying a required document from a stream of documents. Identifying a required document is part of the security in a stream of documents in the cyber world. The document may be significant in business, medical, social, or terrorism.
CuttingEEG - Open Science, Open Data and BIDS for EEGRobert Oostenveld
Starting with education, inception of research questions, planning, acquisition, analysis and reporting, there are multiple points where Open Science should play a role. In my presentation at the CuttingEEG conference in Paris, I argue that we should not only be sharing primary outcomes as Open Access publications, but that openness involves the full research cycle. Specifically, I will be sharing my experience with Open Data, privacy challenges and possibilities under the GDPR, Open Source for sharing analysis methods, dealing with imperfections in science and versioning of data, code and results. Finally, I will introduce BIDS for EEG, a new effort to increase the impact of shared and well-documented EEG data.
The document recommends an approach for a Federal Data Reference Model (DRM) to address issues with the current stovepiped data systems across government agencies. It proposes a federated data management approach using the DRM framework to provide common data definitions and enable horizontal and vertical information sharing. This would allow agencies to more easily integrate and share data both internally and externally. The DRM is based on model-driven architecture principles to provide a virtual representation of all data sources and abstract away data storage details.
The document recommends an approach for a Federal Data Reference Model (DRM) to address issues with the current stovepiped data systems across government agencies. It proposes a federated data management approach using the DRM framework to enable horizontal and vertical information sharing across agencies. This would provide a common structure and tools to describe data, enable data integration and interoperability, and facilitate information exchange with external partners. It also discusses the related Net-Centric Data Strategy and role of Communities of Interest in implementing the goals of making data visible, accessible, understandable, and trusted across the government.
Closing the data source discovery gap and accelerating data discovery comprises three steps: profile, identify, and unify. This white paper discusses how the Attivio
platform executes those steps, the pain points each one addresses, and the value Attivio provides to advanced analytics and business intelligence (BI) initiatives.
The document discusses trends in e-discovery in 2014 and expected trends going forward. In 2014, powerful new data analytics tools and technologies for search and review became available, including predictive coding and cloud-based solutions. The use of predictive coding is increasing slowly due to lack of clear legal standards. Mobile devices and information governance are also growing areas of focus, with recent data breaches increasing urgency around proper data management. Looking ahead, more holistic and combined approaches to discovery are expected, as well as increased use of technologies like automated classification and disposal to support information governance programs.
Similar to DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai (20)
Citing Data in Journal Articles using JATS by Deborah A. Lapeyredatascienceiqss
This document discusses how to cite data sources in journal articles using the Journal Article Tag Suite (JATS). It recommends using the <mixed-citation> element to cite data in references and outlines elements like <data-title>, <version>, and <pub-id> that can be used to represent different components of data citations. It also proposes new attributes and values for existing JATS elements to better support citing datasets, databases, and other research materials.
This document outlines a project between the Odum Institute and IQSS Dataverse team to integrate the Dataverse data repository system with iRODS, an open source data management system. The goals are to expand storage options for Dataverse, integrate curation workflows, and connect Dataverse to national research data infrastructure. A prototype will be developed to enable automated ingest of data from Dataverse to iRODS using rules and APIs. Challenges include migrating both systems to newer versions while maintaining authentication between them. An initial prototype is expected in August 2015.
DataTags: Sharing Privacy Sensitive Data by Latanya Sweeneydatascienceiqss
The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system - Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets, that can only be shared under some restrictions.
Data Analysis in Dataverse & Visualization of Datasets on Historical Maps by ...datascienceiqss
The CLIO INFRA project (http://www.clio-infra.eu) hosted at the International Institute of Social History (IISH, http://www.socialhistory.org) integrates a number of data(hubs) on global social, economic and institutional indicators over the past five centuries. Considering a switch to Dataverse, the IISH is currently developing visualization tools that fit on top of Dataverse, especially historical maps with temporally accurate borders, such as provided in the NLGIS (http://nlgis.nl), Labour conflicts (http://laborconflicts.socialhistory.org) projects.
This talk will introduce the Dataverse widget developed to extract, transform and load data from datasets storage to platform independent internal data warehouses with possibilities to validate and check the quality of research datasets.
TwoRavens: A Graphical, Browser-Based Statistical Interface for Data Reposito...datascienceiqss
TwoRavens is a gesture-based web application that allows users to easily explore, analyze, and conduct interactive statistical modeling on tabular data files stored in Harvard's Dataverse, which contains nearly 25,000 datasets. It provides easy access to descriptive statistics and uses a directed graph interface for statistical modeling. The application is browser-based, device independent, and does not require any statistical expertise. Future directions include automated statistical model selection, accumulating user results, and an interface for curator privacy models.
The MIT Libraries hosts a Dataverse instance to house and disseminate datasets purchased by the Libraries for use by MIT affiliates in their research. Access is restricted to MIT IP ranges and eventually Shibboleth authentication to comply with licensing terms that limit dissemination to university affiliates. Future plans include deeper integration with the Libraries' digital systems, expanding subject coverage, involving more librarians, and potentially centralizing some management functions.
American Journal of Political Science & The Odum Institute: Promoting Researc...datascienceiqss
The document discusses the American Journal of Political Science's (AJPS) new policy requiring authors to submit replication files to reproduce study results. It notes that the Odum Institute verifies these replication files and has so far verified 18 out of 25 total submissions, with most requiring 1-2 resubmissions. It outlines next steps to improve the verification workflow and more directly connect articles and datasets.
Data in Brief and Dataverse: Incentivizing Authors to Share Data by Paige Sha...datascienceiqss
Data in Brief, an Open Access journal published by Elsevier, exclusively publishes data articles wherein researchers describe their datasets. Data in Brief requires that all data be made publicly available either directly with the article as supplementary files or in a public repository. Data in Brief has teamed up with Dataverse to provide a venue for authors to archive and openly share their data.
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...datascienceiqss
It would be useful to be able to discover what kinds of data are contained in the myriad general-purpose public data repositories. It would be even better if it were possible to query that data and/or have that data conform to a particular context-dependent data format. This was the ambition of the Data FAIRport project. I will be demonstrating the "strawman" demonstration of a fully-functional Data FAIRport, where the meta/data in a public repository can be "projected" into one of a number of different context-dependent formats, such that it can be cross-queried in combination with the (potentially "projected") data from other repositories.
Contributing Code to Dataverse by Gustavo Duranddatascienceiqss
Dataverse is an open source data repository platform built using Java and Java EE technologies like JSF, EJB, and JPA. It uses PostgreSQL for the database, Solr for indexing and search, and supports file storage on the file system with plans to abstract this. Developers can contribute code through GitHub pull requests following an agile development process. The architecture includes models for datasets, fields, permissions, and APIs. Third party plugins can integrate with Dataverse through its APIs for search, data access, deposit, and administration.
Towards a common deposit api (the dataverse example) Elizabeth Quigley + Phil...datascienceiqss
For the past few years Dataverse has been using the SWORD protocol as the standard for a Data Deposit API, but is this the standard all repositories should use for Data Deposit APIs? We will discuss the good parts and the challenges of this approach. Additionally this presentation will lead into the Panel Discussion consisting of various stakeholders from publishers, domain and general repositories, funding agencies, researchers, and industry.
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
A Visual Guide to 1 Samuel | A Tale of Two HeartsSteve Thomason
These slides walk through the story of 1 Samuel. Samuel is the last judge of Israel. The people reject God and want a king. Saul is anointed as the first king, but he is not a good king. David, the shepherd boy is anointed and Saul is envious of him. David shows honor while Saul continues to self destruct.
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...EduSkills OECD
Andreas Schleicher, Director of Education and Skills at the OECD presents at the launch of PISA 2022 Volume III - Creative Minds, Creative Schools on 18 June 2024.
How Barcodes Can Be Leveraged Within Odoo 17Celine George
In this presentation, we will explore how barcodes can be leveraged within Odoo 17 to streamline our manufacturing processes. We will cover the configuration steps, how to utilize barcodes in different manufacturing scenarios, and the overall benefits of implementing this technology.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
1. Latanya Sweeney, Mercè Crosas, David O'Brien, Alexandra Wood, Urs
Gasser, Steve Chong, Micah Altman, Michael Bar-Sinai and others.
presented by
Michael Bar-Sinai
mbarsinai@iq.harvard.edu @michbarsinai
the programs interactively executed by
The language we develop to create them
suggested wording for questions, sub-
e harmonized decision-graphs take a long
we plan to support a special TODO type,
harmonized decision-graphs can be
ing useful views of the harmonized
Below are two views of a harmonized
(based on HTML5) and another static
was automatically generated by our
al information as well as basic wording
researcher may be different).
gulations related to IRBs, consent and
w chart of questions for an interview of a
gal experts review our approach and all
and prudent with respect to data sharing
show parts of the HIPAA harmonized
ed'Decision@Graph' CONCLUSIONS'
The DataTags project will allow researchers to publish their data, without
breaching laws or regulations. Using a simple interview process, the
system and researcher will generate a machine actionable data policy
appropriate for a dataset – its “DataTags”. This policy will later by used by
systems like Dataverse to decide how the data should be made available,
and to whom. The system will also be able to generate a customized DUA
based on these tags – a task that is currently done manually, consuming a
lot of time and resources.
The programming language for Tag Space and Harmonized decision-graph
description, and the tools related to it, will be able to describe general
harmonized decision-graphs, not just in the legal field. While easy to learn,
the language relies on Graph Theory, a robust foundation that will allow
various tools, including model checking and program/harmonized
decision-graph validations.
We believe DataTags will dramatically improve the rate of data sharing
among researchers, while maintaining legal compliance and at no cost to
the researcher or her institution. As a result, we expect more data to be
available for researchers, with fewer barriers of access.
REFERENCES'
[1] Sweeney L. Operationalizing American Jurisprudence for Data Sharing.
White Paper. 2013
[2] http://www.security.harvard.edu/research-data-security-policy
ACKNOWLEDGEMENTS''
Bob Gellman – validating the current harmonized decision-graph we have
is HIPAA compliant.
1.
Person-specific
[PrivacyTagSet ]
2.
Explicit consent
[PrivacyTagSet ]
YES
1.1.
Tags= [GREEN, store=clear, transfer=clear, auth=none, basis=not applicable, identity=not person-specific, harm=negligible]
[PrivacyTagSet (EncryptionType): Clear(AuthenticationType): None(EncryptionType): Clear(DuaAgreementMethod): None]
NO
ng?
YES
NO
nt, effort=___, harm=___]
e(EncryptionType): Clear]
shing and Credit Cards)
deo Cameras)
y)
D)
sts)
rmonized*decision*graph,*compu/ng*
AA*compliance*
r DataTags to be successful. From the data
gging process may be experienced as a
unfamiliar terms, and carrying dire legal
tly. Thus, the interview process and its user
nviting, non-intimidating and user-
legal or technical terms are used, a
ily available.
ocess depends on the answers, existing
display (such as progress bars or a check
to convey the progress made so far in a
engaged in the process is an open research
dy.
er'Interface'
Dataset&
Interview&
Handling1
Access&
Control&
DUAs,&Legal&
Policies&
Data&Tags&
Dataset&
Dataset&
Dataset& Dataset&
Shame&
Civil&
Penal>es&
Criminal&
Penal>es&
Max&
Control&
No&Risk&
Minimal&
Direct&&
Access&
Criminal&
Penal>es&
Privacy&
Preserving&
Access&
Minimal&
Privacy&
Preserving&
Minimal&
Differen>al&
Privacy&
ε=1&
ε=1/10&
ε=1/100&
ε=1/1000&
Custom&
Agreement&
Overview*of*a*dataset*ingest*workflow*in*Dataverse,*showing*the*
role*of*the*DataTags*project*in*the*process.**
Create and maintain a user-friendly system that allows researchers to
share data with confidence, knowing they comply with the laws and
regulations governing shared datasets.
We plan to achieve the above by the following efforts:
1. Describe the space of possible data policies using orthogonal
dimensions, allowing an efficient and unambiguous description of each
policy.
2. Harmonize American jurisprudence into a single decision-graph for
making decisions about data sharing policies applicable to a given
dataset.
3. Create an automated interview for composing data policies, such that
the resulting policy complies with the harmonized laws and regulations
(initially assuming the researcher’s answers correctly described the
dataset).
4. Create a set of “DataTags” – fully specified data policies (defined in
Describing a Tag Space), that are the only possible results of a tagging
process.
5. Create a formal language for describing the data policies space and the
harmonized decision-graph, complete with a runtime engine and
inspection tools.
6. Create an inviting, user-friendly web-based automated interview system
to allow researchers to tag their data sets, as part of the Dataverse
system.
and credible. Funding agencies and publications increasingly require data
sharing too. Sharing data while maintaining protections is usually left to
the social science researcher to do with little or no guidance or assistance.
It is no easy feat. There are about 2187 privacy laws at the state and federal
levels in the United States [1]. Additionally, some data sets are collected or
disseminated under binding contracts, data use agreements, data sharing
restrictions etc. Technologically, there is an ever-growing set of solutions
to protect data – but people outside of the data security community may not
know about them and their applicability to any legal setting is not clear.
The DataTags project aims to help social scientists share their data widely
with necessary protections. This is done by means of interactive
computation, where the researcher and the system traverse a decision
graph, creating a machine-actionable data handling policy as they go. The
system then makes guarantees that releases of the data adhere to the
associated policy.
OBJECTIVES'
executed and reasoned about.
Part of the tooling effort is creating useful views of the harmonized
decision-graph and its sub-parts. Below are two views of a harmonized
decision-graph – one interactive (based on HTML5) and another static
(based on Graphviz). The latter was automatically generated by our
interpreter. Nodes show technical information as well as basic wording
(actual wording presented to the researcher may be different).
We have already harmonized regulations related to IRBs, consent and
HIPAA and made a summary flow chart of questions for an interview of a
researcher. We have also had legal experts review our approach and all
agreed it was sufficient, proper and prudent with respect to data sharing
under HIPAA. The views below show parts of the HIPAA harmonized
decision-graph.
and to whom. The sy
based on these tags –
lot of time and resour
The programming lan
description, and the t
harmonized decision-
the language relies on
various tools, includi
decision-graph valida
We believe DataTags
among researchers, w
the researcher or her
available for research
[1] Sweeney L. Oper
White Paper. 2013
[2] http://www.securi
A
Bob Gellman – valida
is HIPAA compliant.
Harm(Level( DUA(Agreement(
Method(
Authen$ca$on( Transit( Storage(
No(Risk( Implicit( None( Clear( Clear(
Data$is$non)confiden.al$informa.on$that$can$be$stored$and$shared$freely
Minimal( Implicit( Email/OAuth( Clear( Clear(
May$have$individually$iden.fiable$informa.on$but$disclosure$would$not$cause$material$harm$
Shame( Click(Through( Password( Encrypted( Encrypted(
May$have$individually$iden.fiable$informa.on$that$if$disclosed$could$be$expected$to$damage$a$
person’s$reputa.on$or$cause$embarrassment
Civil(Penal$es( Signed( Password( Encrypted( Encrypted(
May$have$individually$iden.fiable$informa.on$that$includes$Social$Security$numbers,$financial$
informa.on,$medical$records,$and$other$individually$iden.fiable$informa.on
Criminal(
Penal$es(
Signed(
(
Two:Factor( Encrypted( Encrypted(
May$have$individually$iden.fiable$informa.on$that$could$cause$significant$harm$to$an$individual$
if$exposed,$including$serious$risk$of$criminal$liability,$psychological$harm,$loss$of$insurability$or$
employability,$or$significant$social$harm
Maximum(
Control(
Signed(
(
Two:Factor( Double(
Encrypted(
Double(
Encrypted(
Defined$as$such,$or$may$be$life)threatening$(e.g.$interviews$with$iden.fiable$gang$members).
Screenshot*of*a*ques/on*screen,*part*of*the*tagging*process.*Note**
the*current*data*tags*on*the*right,*allowing*the*user*to*see*what*
was*achieved*so*far*in*the*tagging*process.*
In order to define the tags and their possible values, we are developing a
formal language, designed to allow legal experts with little or no
programming experience to write interviews. This will enable frequent
updates to the system, a fundamental requirement since laws governing
research data may change. Below is the full tag space needed for HIPAA
compliance, and part of the code used to create it.
Representing the tag space as a graph allows us to reason about it using
Graph Theory. Under these terms, creating DataTags to represent a data
policy translates to selecting a sub-graph from the tag space graph. A single
node n is said to be fully-specified in sub-graph S, if S contains an edge
from n to one of its leafs. A Compound node c is said to be fully-specified
in sub-graph S if all its single and compound child nodes are fully
specified in sub-graph S.
A tagging process has to yield a sub-graph in which the root node (shown
in yellow) is fully-specified.
Describing'a'Tag'Space'
DataType: Standards, Effort, Harm.!
!
Standards: some of HIPAA, FERPA,!
ElectronicWiretapping,!
CommonRule.!
Effort: one of Identified, Identifiable, !
DeIdentified, Anonymous.!
Harm: one of NoRisk, Minimal, Shame, Civil,!
Criminal, MaxControl.!
!
The*tag*space*graph*needed*for*HIPAA*compliance,*and*part*of*the*code*used*to*
describe*it.*Base*graph*for*the*diagram*was*created*by*our*language*
interpreter.*
DataTags
blue
green
orange
red
crimson
None1yr
2yr
5yr
No Restriction
Research
IRB
No Product
None
Email
OAuth
Password
none
Email
Signed
HIPAA
FERPA
ElectronicWiretapping CommonRule
Identified
Reidentifiable
DeIdentified
Anonymous
NoRisk
Minimal Shame
Civil
Criminal
MaxContro
l
Anyone
NotOnline
Organization
Group
NoOne
NoMatching
NoEntities
NoPeople NoProhibition
Contact NoRestriction
Notify
PreApprove
Prohibited
Click
Signed
SignWithId
Clear
Encrypt
DoubleEncrypt
Clear
Encrypt
DoubleEncrypt
code
Handling
DataType
DUA
Storage
Transit
Authentication
Standards
Effort
Harm
TimeLimit
Sharing
Reidentify
Publication
Use
Acceptance
Approval
Compund
Simple
Aggregate
Value
1.
Person-specific
[PrivacyTagSet ]
2.
Explicit consent
[PrivacyTagSet ]
YES
1.1.
Tags= [GREEN, store=clear, transfer=clear, auth=none, basis=not applicable, identity=not person-specific, harm=negligible]
[PrivacyTagSet (EncryptionType): Clear(AuthenticationType): None(EncryptionType): Clear(DuaAgreementMethod): None]
NO
2.1.
Did the consent have any restrictions on sharing?
[PrivacyTagSet ]
YES
3.
Medical Records
[PrivacyTagSet ]
NO
2.1.2.
Add DUA terms and set tags from DUA specifics
[PrivacyTagSet ]
YES
2.1.1.
Tags= [GREEN, store=clear, transfer=clear, auth=none, basis=Consent, effort=___, harm=___]
[PrivacyTagSet (EncryptionType): Clear(AuthenticationType): None(EncryptionType): Clear]
NO
YES NO
3.1.
HIPAA
[PrivacyTagSet ]
YES
3.2.
Not HIPAA
[PrivacyTagSet ]
NO
3.1.5.
Covered
[PrivacyTagSet ]
YES
4.
Arrest and Conviction Records
[PrivacyTagSet ]
NO
3.1.5.1.
Tags= [RED, store=encrypt, transfer=encrypt, auth=Approval, basis=HIPAA Business Associate, effort=identifiable, harm=criminal]
[PrivacyTagSet (EncryptionType): Encrypted(AuthenticationType): Password(EncryptionType): Encrypted(DuaAgreementMethod): Sign]
YES NO
5.
Bank and Financial Records
[PrivacyTagSet ]
YESNO
6.
Cable Television
[PrivacyTagSet ]
YESNO
7.
Computer Crime
[PrivacyTagSet ]
YESNO
8.
Credit reporting and Investigations (including ‘Credit Repair,’ ‘Credit Clinics,’ Check-Cashing and Credit Cards)
[PrivacyTagSet ]
YESNO
9.
Criminal Justice Information Systems
[PrivacyTagSet ]
YESNO
10.
Electronic Surveillance (including Wiretapping, Telephone Monitoring, and Video Cameras)
[PrivacyTagSet ]
YESNO
11.
Employment Records
[PrivacyTagSet ]
YESNO
12.
Government Information on Individuals
[PrivacyTagSet ]
YESNO
13.
Identity Theft
[PrivacyTagSet ]
YESNO
14.
Insurance Records (including use of Genetic Information)
[PrivacyTagSet ]
YESNO
15.
Library Records
[PrivacyTagSet ]
YESNO
16.
Mailing Lists (including Video rentals and Spam)
[PrivacyTagSet ]
YESNO
17.
Special Medical Records (including HIV Testing)
[PrivacyTagSet ]
YESNO
18.
Non-Electronic Visual Surveillance (also Breast-Feeding)
[PrivacyTagSet ]
YESNO
19.
Polygraphing in Employment
[PrivacyTagSet ]
YESNO
20.
Privacy Statutes/State Constitutions (including the Right to Publicity)
[PrivacyTagSet ]
YESNO
21.
Privileged Communications
[PrivacyTagSet ]
YESNO
22.
Social Security Numbers
[PrivacyTagSet ]
YESNO
23.
Student Records
[PrivacyTagSet ]
YESNO
24.
Tax Records
[PrivacyTagSet ]
YESNO
25.
Telephone Services (including Telephone Solicitation and Caller ID)
[PrivacyTagSet ]
YESNO
26.
Testing in Employment (including Urinalysis, Genetic and Blood Tests)
[PrivacyTagSet ]
YESNO
27.
Tracking Technologies
[PrivacyTagSet ]
YESNO
28.
Voter Records
[PrivacyTagSet ]
YESNO
YES NO
YES NO
Two*views*of*the*same*harmonized*decision*graph,*compu/ng*
HIPAA*compliance*
Usability is a major challenge for DataTags to be successful. From the data
publisher point of view, a data tagging process may be experienced as a
daunting chore containing many unfamiliar terms, and carrying dire legal
consequences if not done correctly. Thus, the interview process and its user
interface will be designed to be inviting, non-intimidating and user-
friendly. For example, whenever legal or technical terms are used, a
layman explanation will be readily available.
As the length of the interview process depends on the answers, existing
best practices for advancement display (such as progress bars or a check
list) cannot be used. Being able to convey the progress made so far in a
gratifying way, keeping the user engaged in the process is an open research
question which we intend to study.
User'Interface'
In*order*to*make*the*tagging*process*approachable*and*nonE
in/mida/ng,*whenever*a*technical*or*a*legal*term*is*used,*an*
explana/on*is*readily*available.*Shown*here*is*part*of*the*final*
tagging*page,*and*an*explained*technical*term.**
Dataset&
Inter
Hand
Acc
Con
DUAs,
Poli
Data
Overview*of*a*d
role
earchers to handle research data. We extend this to a 6-level scale for
cifying data policies regarding security and privacy of data. The scale is
ed on the level of harm malicious use of the data may cause. The
umns represent some of the dimensions of the data policy space.
the runtime and the researcher. The language we develop to create them
will support tagging statements, suggested wording for questions, sub-
routines and more. As we realize harmonized decision-graphs take a long
time to create and verify legally, we plan to support a special TODO type,
such that partially implemented harmonized decision-graphs can be
executed and reasoned about.
Part of the tooling effort is creating useful views of the harmonized
decision-graph and its sub-parts. Below are two views of a harmonized
decision-graph – one interactive (based on HTML5) and another static
(based on Graphviz). The latter was automatically generated by our
interpreter. Nodes show technical information as well as basic wording
(actual wording presented to the researcher may be different).
We have already harmonized regulations related to IRBs, consent and
HIPAA and made a summary flow chart of questions for an interview of a
researcher. We have also had legal experts review our approach and all
agreed it was sufficient, proper and prudent with respect to data sharing
under HIPAA. The views below show parts of the HIPAA harmonized
decision-graph.
breaching laws or regulations. Using a simple interview process, the
system and researcher will generate a machine actionable data policy
appropriate for a dataset – its “DataTags”. This policy will later by used by
systems like Dataverse to decide how the data should be made available,
and to whom. The system will also be able to generate a customized DUA
based on these tags – a task that is currently done manually, consuming a
lot of time and resources.
The programming language for Tag Space and Harmonized decision-graph
description, and the tools related to it, will be able to describe general
harmonized decision-graphs, not just in the legal field. While easy to learn,
the language relies on Graph Theory, a robust foundation that will allow
various tools, including model checking and program/harmonized
decision-graph validations.
We believe DataTags will dramatically improve the rate of data sharing
among researchers, while maintaining legal compliance and at no cost to
the researcher or her institution. As a result, we expect more data to be
available for researchers, with fewer barriers of access.
REFERENCES'
[1] Sweeney L. Operationalizing American Jurisprudence for Data Sharing.
White Paper. 2013
[2] http://www.security.harvard.edu/research-data-security-policy
ACKNOWLEDGEMENTS''
Bob Gellman – validating the current harmonized decision-graph we have
is HIPAA compliant.
m(Level( DUA(Agreement(
Method(
Authen$ca$on( Transit( Storage(
isk( Implicit( None( Clear( Clear(
is$non)confiden.al$informa.on$that$can$be$stored$and$shared$freely
mal( Implicit( Email/OAuth( Clear( Clear(
have$individually$iden.fiable$informa.on$but$disclosure$would$not$cause$material$harm$
me( Click(Through( Password( Encrypted( Encrypted(
have$individually$iden.fiable$informa.on$that$if$disclosed$could$be$expected$to$damage$a$
on’s$reputa.on$or$cause$embarrassment
Penal$es( Signed( Password( Encrypted( Encrypted(
have$individually$iden.fiable$informa.on$that$includes$Social$Security$numbers,$financial$
ma.on,$medical$records,$and$other$individually$iden.fiable$informa.on
inal(
l$es(
Signed(
(
Two:Factor( Encrypted( Encrypted(
have$individually$iden.fiable$informa.on$that$could$cause$significant$harm$to$an$individual$
posed,$including$serious$risk$of$criminal$liability,$psychological$harm,$loss$of$insurability$or$
oyability,$or$significant$social$harm
mum(
rol(
Signed(
(
Two:Factor( Double(
Encrypted(
Double(
Encrypted(
ned$as$such,$or$may$be$life)threatening$(e.g.$interviews$with$iden.fiable$gang$members).
order to define the tags and their possible values, we are developing a
mal language, designed to allow legal experts with little or no
gramming experience to write interviews. This will enable frequent
ates to the system, a fundamental requirement since laws governing
earch data may change. Below is the full tag space needed for HIPAA
mpliance, and part of the code used to create it.
presenting the tag space as a graph allows us to reason about it using
ph Theory. Under these terms, creating DataTags to represent a data
icy translates to selecting a sub-graph from the tag space graph. A single
e n is said to be fully-specified in sub-graph S, if S contains an edge
m n to one of its leafs. A Compound node c is said to be fully-specified
ub-graph S if all its single and compound child nodes are fully
cified in sub-graph S.
agging process has to yield a sub-graph in which the root node (shown
yellow) is fully-specified.
Describing'a'Tag'Space'
pe: Standards, Effort, Harm.!
rds: some of HIPAA, FERPA,!
ElectronicWiretapping,!
CommonRule.!
: one of Identified, Identifiable, !
DeIdentified, Anonymous.!
one of NoRisk, Minimal, Shame, Civil,!
Criminal, MaxControl.!
tag*space*graph*needed*for*HIPAA*compliance,*and*part*of*the*code*used*to*
describe*it.*Base*graph*for*the*diagram*was*created*by*our*language*
interpreter.*
DataTags
blue
green
orange
red
crimson
None1yr
2yr
5yr
No Restriction
Research
IRB
No Product
None
Email
OAuth
Password
none
Email
Signed
HIPAA
FERPA
ElectronicWiretapping CommonRule
Identified
Reidentifiable
DeIdentified
Anonymous
NoRisk
Minimal Shame
Civil
Criminal
MaxContro
l
Anyone
NotOnline
Organization
Group
NoOne
NoMatching
NoEntities
NoPeople NoProhibition
Contact NoRestriction
Notify
PreApprove
Prohibited
Click
Signed
SignWithId
Clear
Encrypt
DoubleEncrypt
Clear
Encrypt
DoubleEncrypt
code
Handling
DataType
DUA
Storage
Transit
Authentication
Standards
Effort
Harm
TimeLimit
Sharing
Reidentify
Publication
Use
Acceptance
Approval
Compund
Simple
Aggregate
Value
1.
Person-specific
[PrivacyTagSet ]
2.
Explicit consent
[PrivacyTagSet ]
YES
1.1.
Tags= [GREEN, store=clear, transfer=clear, auth=none, basis=not applicable, identity=not person-specific, harm=negligible]
[PrivacyTagSet (EncryptionType): Clear(AuthenticationType): None(EncryptionType): Clear(DuaAgreementMethod): None]
NO
2.1.
Did the consent have any restrictions on sharing?
[PrivacyTagSet ]
YES
3.
Medical Records
[PrivacyTagSet ]
NO
2.1.2.
Add DUA terms and set tags from DUA specifics
[PrivacyTagSet ]
YES
2.1.1.
Tags= [GREEN, store=clear, transfer=clear, auth=none, basis=Consent, effort=___, harm=___]
[PrivacyTagSet (EncryptionType): Clear(AuthenticationType): None(EncryptionType): Clear]
NO
YES NO
3.1.
HIPAA
[PrivacyTagSet ]
YES
3.2.
Not HIPAA
[PrivacyTagSet ]
NO
3.1.5.
Covered
[PrivacyTagSet ]
YES
4.
Arrest and Conviction Records
[PrivacyTagSet ]
NO
3.1.5.1.
Tags= [RED, store=encrypt, transfer=encrypt, auth=Approval, basis=HIPAA Business Associate, effort=identifiable, harm=criminal]
[PrivacyTagSet (EncryptionType): Encrypted(AuthenticationType): Password(EncryptionType): Encrypted(DuaAgreementMethod): Sign]
YES NO
5.
Bank and Financial Records
[PrivacyTagSet ]
YESNO
6.
Cable Television
[PrivacyTagSet ]
YESNO
7.
Computer Crime
[PrivacyTagSet ]
YESNO
8.
Credit reporting and Investigations (including ‘Credit Repair,’ ‘Credit Clinics,’ Check-Cashing and Credit Cards)
[PrivacyTagSet ]
YESNO
9.
Criminal Justice Information Systems
[PrivacyTagSet ]
YESNO
10.
Electronic Surveillance (including Wiretapping, Telephone Monitoring, and Video Cameras)
[PrivacyTagSet ]
YESNO
11.
Employment Records
[PrivacyTagSet ]
YESNO
12.
Government Information on Individuals
[PrivacyTagSet ]
YESNO
13.
Identity Theft
[PrivacyTagSet ]
YESNO
14.
Insurance Records (including use of Genetic Information)
[PrivacyTagSet ]
YESNO
15.
Library Records
[PrivacyTagSet ]
YESNO
16.
Mailing Lists (including Video rentals and Spam)
[PrivacyTagSet ]
YESNO
17.
Special Medical Records (including HIV Testing)
[PrivacyTagSet ]
YESNO
18.
Non-Electronic Visual Surveillance (also Breast-Feeding)
[PrivacyTagSet ]
YESNO
19.
Polygraphing in Employment
[PrivacyTagSet ]
YESNO
20.
Privacy Statutes/State Constitutions (including the Right to Publicity)
[PrivacyTagSet ]
YESNO
21.
Privileged Communications
[PrivacyTagSet ]
YESNO
22.
Social Security Numbers
[PrivacyTagSet ]
YESNO
23.
Student Records
[PrivacyTagSet ]
YESNO
24.
Tax Records
[PrivacyTagSet ]
YESNO
25.
Telephone Services (including Telephone Solicitation and Caller ID)
[PrivacyTagSet ]
YESNO
26.
Testing in Employment (including Urinalysis, Genetic and Blood Tests)
[PrivacyTagSet ]
YESNO
27.
Tracking Technologies
[PrivacyTagSet ]
YESNO
28.
Voter Records
[PrivacyTagSet ]
YESNO
YES NO
YES NO
Two*views*of*the*same*harmonized*decision*graph,*compu/ng*
HIPAA*compliance*
Usability is a major challenge for DataTags to be successful. From the data
publisher point of view, a data tagging process may be experienced as a
daunting chore containing many unfamiliar terms, and carrying dire legal
consequences if not done correctly. Thus, the interview process and its user
interface will be designed to be inviting, non-intimidating and user-
friendly. For example, whenever legal or technical terms are used, a
layman explanation will be readily available.
As the length of the interview process depends on the answers, existing
best practices for advancement display (such as progress bars or a check
list) cannot be used. Being able to convey the progress made so far in a
gratifying way, keeping the user engaged in the process is an open research
question which we intend to study.
User'Interface'
In*order*to*make*the*tagging*process*approachable*and*nonE
in/mida/ng,*whenever*a*technical*or*a*legal*term*is*used,*an*
explana/on*is*readily*available.*Shown*here*is*part*of*the*final*
tagging*page,*and*an*explained*technical*term.**
Dataset&
Interview&
Handling1
Access&
Control&
DUAs,&Legal&
Policies&
Data&Tags&
Dataset&
Dataset&
Dataset& Dataset&
Shame&
Civil&
Penal>es&
Criminal&
Penal>es&
Max&
Control&
No&Risk&
Minimal&
Direct&&
Access&
Criminal&
Penal>es&
Privacy&
Preserving&
Access&
Minimal&
Privacy&
Preserving&
Minimal&
Differen>al&
Privacy&
ε=1&
ε=1/10&
ε=1/100&
ε=1/1000&
Custom&
Agreement&
Overview*of*a*dataset*ingest*workflow*in*Dataverse,*showing*the*
role*of*the*DataTags*project*in*the*process.**
Ben-Gurion University
of the Negev
2. Why Share Data?
• Transparency
• Collaboration
• Research acceleration
• Reproducibility
• Data citation
• Compliance with requirements from
sponsors and publishers
3. Sharing Data is Nontrivial
• Law is complex
• Thousands of privacy laws in the US alone, at federal,
state and local level, usually context-specific
• 2187 [Sweeney, 2013]
• Technology is complex
• e.g. Encryption standards change constantly, as new
vulnerabilities are found
• Specific dataset details (may be) complex
5. deterministic, does not require
anyone except the user
(at runtime)
algorithm may prompt the user for information while running
non-intimidating, does
not assume user is a
legal or a technological
expert
how to transfer, store and disseminate a
dataset, license and DUAs…
Human readable, machine actionable
legally and
technologically
up-to-date.
User-friendly, interactive algorithmic system
for
generating proper dataset handling policies
12. DataTag Definition
DataTags: code, basis, Handling, DataType, DUA, IP, identity, FERPA, CIPSEA.
TODO: IP.
code: one of
blue (Non-confidential information that can be stored and shared freely.),
green (Potentially identifiable but not…),
yellow (Potentially harmful personal information…),
orange (May include sensitive, identifiable personal information…),
red (Very sensitive identifiable personal information…),
crimson (Requires explicit permission for each transaction…)
.
Handling: storage, transit, authentication, auth.
Storage: one of clear, encrypt, doubleEncrypt.
Authentication: some of None, Email, OAuth, Password
- - - continued
13. single value
set of values
compound value
future value
DataTag Definition
DataTags: code, basis, Handling, DataType, DUA, IP, identity, FERPA, CIPSEA.
TODO: IP.
code: one of
blue (Non-confidential information that can be stored and shared freely.),
green (Potentially identifiable but not…),
yellow (Potentially harmful personal information…),
orange (May include sensitive, identifiable personal information…),
red (Very sensitive identifiable personal information…),
crimson (Requires explicit permission for each transaction…)
.
Handling: storage, transit, authentication, auth.
Storage: one of clear, encrypt, doubleEncrypt.
Authentication: some of None, Email, OAuth, Password
- - - continued
14. Current DataTag Definition
• 10 Compound
DataTags, Legal, Assertions…
• 9 Sets
Mostly referring to specific parts of a law
• 19 Single values
TimeLimit, Sharing, Acceptance, Approval…
18. Current Legal Scope
Privacy of medical records
• The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, 45 C.F.R. Part 160 and Subparts A and E of Part
164, which regulates the use and disclosure of protected health information held by covered entities such as health care
providers
• The substance abuse confidentiality regulations, 42 C.F.R. Part 2, which protect the confidentiality of the medical records of
patients seeking treatment for alcohol or drug abuse
Privacy of education records
• The Family Educational Rights and Privacy Act (FERPA), 20 U.S.C. § 1232g, which safeguards student records maintained by an
educational agency or institution
• The Protection of Pupil Rights Amendment (PPRA), 20 U.S.C. § 1232h, which establishes privacy-related procedures for certain
surveys, analyses, and evaluations funded by the US Department of Education
• The Education Sciences Reform Act of 2002, 20 U.S.C. § 9573, which restricts the collection, use, and dissemination of
education data in research conducted by the Institute of Education Sciences
Privacy of government records
• The Privacy Act of 1974, 5 U.S.C. § 552a, which establishes fair information practices for protecting personally identifiable
records maintained by federal agencies
• The Confidential Information Protect and Statistical Efficiency Act (CIPSEA), 44 U.S.C. § 3501 note, which protects confidential
data collected by U.S. statistical agencies
• Title 13 of the U.S. Code, which protects the confidentiality of Census Bureau data
• The Driver’s Privacy Protection Act (DPPA), 18 U.S.C. §§ 2721-2725, which restricts the disclosure of personal information from
state department of motor vehicle records Warning - we are still in beta!
19. Selecting the DataTags
• Many possibilities
• Active effort using DataLog/Prolog
• Supervised by Prof. Steve Chong, Harvard SEAS
• Procedural Questionnaire
20. Selecting the DataTags
• Many possibilities
• Active effort using DataLog/Prolog
• Supervised by Prof. Steve Chong, Harvard SEAS
• Procedural Questionnaire
21. Questionnaire Code
(>ppraCompliance< ask:
(text: Do the data contain any information collected from students...?)
(terms:
(Elementary or secondary school: Refers to...))
(no: (end))
)
(>ppra2< ask:
(text: Do the data contain information ... )
(terms:
(Survey or assessment: Refers to any academic or non-academic ...))
(yes:
(>ppra2a< ask:
(text: Do the data contain any personally identifiable information ...)
(terms:
(Deidentified information: Information that..)
(Biometric record: Refers to a record of one ...)
(Targeted request: Refers to a request by a ...)
)
(yes:
(set: PPRA=protected, Code=orange, Harm=civil,
Transit=encrypt, Storage=encrypt, Effort=identifiable))
(no:
(set: PPRA=protectedDeidentified, Code=green, Harm=minimal,
Transit=clear, Storage=clear, Effort=deidentified))
) continued...
22. ppra2a
ask
Do the data from the survey or
assessment contain any personally
identifiable information about a
student? Personally identifiable
informati...
Set
Legal=[ EducationRecords:[ PPRA:[ protectedDeidentified ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
no
Set
Legal=[ EducationRecords:[ PPRA:[ protected ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
ppraCompliance
ask
Do the data contain any information
collected from students in an elementary
or secondary school?
ppra2
ask
Do the data contain information
collected from students in a survey or
assessment concerning any of the
following eight categories? ...
yes no
yes no
Current Questionnaire
23. questionnaire.flow-c1
$1
ferpaCompliance
ppraCompliance
dua
DPPA1
medicalRecordsCompliance
Gov1
govRecsCompliance
start
ask
Do the data concern living persons?
questionnaire.flow-c1/ppraCompliance
questionnaire.flow-c1/ferpaCompliance
Set
Code=green
yes
Set
Code=blue
Assertions=[ Identity:noPersonData ]
no
questionnaire.flow-c1/medicalRecordsCompliance
questionnaire.flow-c1/govRecsCompliance
todo
Arrest and Conviction Records, Bank and
Financial Records, Cable Television,
Computer Crime, Credit reporting and
Investigations [including 'Credit
Repair', 'Credit Clinics', Check-Cashing
and Credit Cards], Criminal Justice
Information Systems, Electronic
Surveillance [including Wiretapping,
Telephone Monitoring, and Video
Cameras], Employment Records, Government
Information on Individuals, Identity
Theft, Insurance Records [including use
of Genetic Information], Library
Records, Mailing Lists [including Video
rentals and Spam], Special Medical
Records [including HIV Testing],
Non-Electronic Visual Surveillance.
Breast-Feeding, Polygraphing in
Employment, Privacy Statutes/State
Constitutions [including the Right to
Publicity], Privileged Communications,
Social Security Numbers, Student
Records, Tax Records, Telephone Services
[including Telephone Solicitation and
Caller ID], Testing in Employment
[including Urinalysis, Genetic and Blood
Tests], Tracking Technologies, Voter
Records
questionnaire.flow-c1/dua
ferpa15
ask
Are the data being disclosed for the
purpose of conducting an audit or
evaluation of a federal- or
state-supported education program?
Set
Legal=[ EducationRecords:[ FERPA:[ audit ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt ]
yes
REJECT
An educational agency or institution is
likely breaching its FERPA duties
because it is disclosing, or a third
party is re-disclosing, non-directory
PII without parental consent where no
obvious FERPA exception applies.
no
Set
Legal=[ EducationRecords:[ FERPA:[ study ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt ]
todo
DUA subroutine
Set
Legal=[ EducationRecords:[ FERPA:[ directoryInfo ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:identifiable ] ]
todo
DUA subroutine
ferpa10
ask
Do the data include personally
identifiable information about a
student? Personally identifiable
information about a student includes,
but i...
Set
Legal=[ EducationRecords:[ FERPA:[ deidentified ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
no
ferpa11
ask
Did the educational agency or
institution designate all of the
personally identifiable information in
the data as directory information?
The...
yes
Set
Legal=[ EducationRecords:[ FERPA:[ consent ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt ]
todo
Consent sub-routine
ferpa13
ask
Are the data being disclosed to a school
official or contractor with a legitimate
educational interest?
ferpa14
ask
Are the data being disclosed for the
purpose of conducting a study for, or on
behalf of, the educational agency or
institution in order to d...
no
Set
Legal=[ EducationRecords:[ FERPA:[ schoolOfficial ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt ]
yes
ferpa12
ask
Did the parents or guardians of the
students, or the students themselves if
they were adults or emancipated minors
at the time of the data c...yes
no
Set
Legal=[ EducationRecords:[ FERPA:[ directoryOptOut ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
ferpa9
ask
Do the data contain any information
derived from institutional records
directly related to a student? This
refers to records maint...
yes no
ferpaCompliance
ask
Do the data contain any information
derived from records maintained by an
educational agency or institution, or by
a person or entity acting...
yes no
ferpa11a
ask
Did any of the students in the data, or
their parents, if the students are under
18, request to opt out of the release of
their information ...
no
yes
no
yes
no
yes
ppra6
ask
Does the school in which the information
was collected receive any funds from the
US Department of Education?
ppra7
ask
Were parents provided with notice and an
opportunity to opt out of the collection
and disclosure of the information?
yes no
ppra2a
ask
Do the data from the survey or
assessment contain any personally
identifiable information about a
student? Personally identifiable
informati...
Set
Legal=[ EducationRecords:[ PPRA:[ protectedDeidentified ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
no
Set
Legal=[ EducationRecords:[ PPRA:[ protected ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
REJECT
Please ask the party that collected the
data whether written consent from the
parents was obtained. Dataset cannot be
accepted without clarifying this issue.
Set
Legal=[ EducationRecords:[ PPRA:[ optOutProvided ] ] ]
yes
REJECT
Possible PPRA violation due to school’s
failure to provide parents with the
opportunity to opt out of the data
collection
no
Set
Legal=[ EducationRecords:[ PPRA:[ marketing ] ] ]
ppra3
ask
Was the survey or assessment funded, in
whole or in part, by the US Department
of Education?
ppra5
ask
Was any of the personal information in
the data collected from students for the
purpose of marketing or sale?
no
ppra4
ask
Was written consent obtained from the
parents of the students who participated
in the survey or assessment, or from the
students themselves ...
yes
ppraCompliance
ask
Do the data contain any information
collected from students in an elementary
or secondary school?
ppra2
ask
Do the data contain information
collected from students in a survey or
assessment concerning any of the
following eight categories? ...
yes no
REJECT
Possible PPRA violation due to failure
to obtain written consent for collection
of information protected under the PPRA
yes no
Set
Legal=[ EducationRecords:[ PPRA:[ consent ] ] ]
no
yes
Not Sure
no yes
Set
Handling=[ DUA:[ TimeLimit:_1yr ] ]
duaAdditional
ask
Did the data have any restrictions on
sharing, e.g. stated in an agreement or
policy statement?
dua
todo
Data use agreements
duaTimeLimit
ask
Is there any reason why we cannot store
the data indefinitely? Limiting the time
a dataset could be held interferes with
good scienc...
ask
For how long should we keep the data?
yes
Set
Handling=[ DUA:[ TimeLimit:none ] ]
no
Set
Legal=[ ContractOrPolicy:yes ]
Set
Handling=[ DUA:[ TimeLimit:_5yr ] ]
Set
Handling=[ DUA:[ TimeLimit:none ] ]
1 year 5 yearsindefinitely
Set
Handling=[ DUA:[ TimeLimit:_50yr ] ]
50 years
Set
Legal=[ ContractOrPolicy:no ]
yes no
DPPA2
ask
Do the department of motor vehicles
records contain an individual's
photograph or image, Social Security
number, or medical or disability in...
DPPA2a
ask
Has express consent been obtained from
each person whose photograph, Social
Security number, or medical or
disability information appears in...
yes
Set
Code=yellow
Assertions=[ DataType:[ Harm:shame Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
no
DPPA6
ask
Was consent obtained from the
individuals whose information is
contained in the data by the requester
of the records?
DPPA7
ask
Were the data obtained for use in
research activities?
no
DPPA6a
ask
Was the obtained consent limited in any
way, such as purpose, duration, parties
that may receive the data?
yes
DPPA3
ask
Is disclosure of the personal
information required by the Driver's
Privacy Protection Act? Disclosure is
required, among other reasons, in c...
Set
Legal=[ GovernmentRecords:[ DPPA:[ required ] ] ]
yes
DPPA4
ask
Is the disclosure of these records made
in accordance with the driver's license
record disclosure law of the state from
which the records we...
no
Set
Legal=[ GovernmentRecords:[ DPPA:[ highlyRestricted ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
Set
Legal=[ GovernmentRecords:[ DPPA:[ research ] ] ]
DPPA5
ask
Was consent obtained from the person
about whom the information pertains by
the State DMV in response to a request
for an individual record?...
no
DPPA5a
ask
Was the obtained consent limited in any
way, such as purpose, duration, parties
that may receive the data?
yes
DPPA1
ask
Do the data contain personal information
obtained from a state department of
motor vehicles?yes no
yes
REJECT
Possible violation of the DPPA due to
disclosure of highly restricted personal
information without the consent of the
individuals in the data
no
yes
DPPA8
ask
Was the personal information obtained
under one of the other fourteen
permissible uses for the disclosure of
drivers' records, including: ...
no
Set
Legal=[ GovernmentRecords:[ DPPA:[ exception ] ] ]
Set
Legal=[ GovernmentRecords:[ DPPA:[ stateConsentLimited ] ] ]
REJECT
Possible violation of state law on
disclosure of drivers' records
Set
Legal=[ GovernmentRecords:[ DPPA:[ requesterConsentLimited ] ] ]
yes
Set
Legal=[ GovernmentRecords:[ DPPA:[ requesterConsentBroad ] ] ]
no
REJECT
possible violation of the DPPA because
the depositor has not cited one of the
required or permissible uses for
disclosure.
yes
no
yes
Set
Legal=[ GovernmentRecords:[ DPPA:[ stateConsentBroad ] ] ]
noyes
no
MR2a
ask
Were the data obtained from a federally
assisted drug abuse program after March
20, 1972, or a federally assisted
alcohol abuse program afte...
MR7
ask
Do the data contain information from a
covered entity or business associate of
a covered entity?
no
MR3
ask
Do any of the substance abuse records
contain patient identifying information,
or information that would identify a
patient as an alcohol or...
yes
MR2
ask
Do the data contain information related
to substance abuse diagnosis, referral,
or treatment?
yes no
MR8a
ask
Has an expert in statistical or
scientific principles and methods for
deidentification certified that the data
have been deidentified?
MR9
ask
Do the data constitute a limited data
set under the HIPAA Privacy Rule?
no
Set
Legal=[ MedicalRecords:[ HIPAA:[ expertDetermination ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
yes
medicalRecordsCompliance
ask
Do the data contain health or medical
information?
yes no
Set
Legal=[ MedicalRecords:[ HIPAA:[ limitedDataset ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
Set
Legal=[ MedicalRecords:[ HIPAA:[ businessAssociateContract ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
REJECT
For additional guidance on determining
whether an individual or organization is
a covered entity, please review the
charts provided by the Centers for
Medicare & Medicaid Services at
http://www.cms.gov/Regulations-and-Guidance/HIPAA-Administrative-Simplification/HIPAAGenInfo/Downloads/CoveredEntitycharts.pdf
http://www.cms.gov/Regulations-and-Guidance/HIPAA-Administrative-Simplification/HIPAAGenInfo/Downloads/CoveredEntitycharts.pdf
MR4a
ask
Were the data maintained by the
Department of Veterans Affairs either
shared with the patients’ consent for
the purposes of scientific resea...
Set
Legal=[ MedicalRecords:[ Part2:veteransMedicalData ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt ]
yes
REJECT
reject data and flag for review;
possible violation of 38 U.S.C. 4132 due
to the release of VA medical records
without a valid exception.
no
Not Sure
no
MR8
ask
Have all direct identifiers been removed
from the data?
yes
Set
Legal=[ MedicalRecords:[ Part2:scientificResearch ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
MR4
ask
Were the data maintained in connection
with the US Department of Veterans
Affairs or US Armed Services?
yes
MR5
ask
Did the individuals provide written
consent for the disclosure of their
information?
no
MR10
ask
Do the data contain protected health
information from medical records? In
other words, do the data contain any
health information that can b...
MR11
ask
Do the data contain health information
from patients who have provided written
authorization for their information to
be disclosed to or use...
yes no
MR12
ask
Has an oversight body such as an
Institutional Review Board [IRB] or
Privacy Board reviewed how the data will
be used or disclosed and appro...
Set
Legal=[ MedicalRecords:[ HIPAA:[ waiver ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
MR13
ask
Were the data disclosed pursuant to a
HIPAA business associate contract?
no
MR6
ask
Were the data shared by the substance
abuse program for scientific research
purposes?
no yes
yes no
no
Set
Legal=[ MedicalRecords:[ HIPAA:[ authorization ] ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
no
Set
Legal=[ MedicalRecords:[ Part2:consent ] ]
Code=orange
Assertions=[ DataType:[ Harm:civil Effort:identifiable ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
yes
Set
Legal=[ MedicalRecords:[ Part2:deidentified ] ]
Code=green
Assertions=[ DataType:[ Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
no
no
Set
Legal=[ MedicalRecords:[ HIPAA:[ safeHarborDeidentified ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
yes
yes
no
todo
Privacy Act DUA
PA2
ask
Do the data contain personally
identifiable information?
Set
Legal=[ GovernmentRecords:[ PrivacyAct:[ identifiable ] ] ]
yes
Set
Legal=[ GovernmentRecords:[ PrivacyAct:[ deidentified ] ] ]
no
CIPSEA3
ask
Are the data being shared in
identifiable form?
Set
Legal=[ GovernmentRecords:[ CIPSEA:[ deidentified ] ] ]
no
Set
Legal=[ GovernmentRecords:[ CIPSEA:[ identifiable ] ] ]
yes
Census3
ask
Has a Census Bureau Disclosure Review
Board issued an approval memorandum
granting formal permission for the data
to be published?
Set
Legal=[ GovernmentRecords:[ Census:[ CensusPublished ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
yes
REJECT
Possible violation of Title 13 due to
failure to obtain permission from the
Census Bureau to publish the Census
data.
no
ESRA1
ask
Do the data contain information from an
education research project that receives
funding from the Institute of Education
Sciences? Examples ...
REJECT
Possible violation of ESRA due to
failure to obtain authorization to
disclose the data from the Institute of
Education Sciences
CIPSEA1
ask
Were the data collected directly from
respondents for exclusively statistical
purposes under a pledge of
confidentiality?
PA1
ask
Are the data maintained by the agency in
a system of records?
no
CIPSEA2
ask
Is the agency that collected the data a
statistical agency or unit from this
list: USDA Economic Research Service,
National Agricultural Sta...
yes
Gov2
ask
Are you an employee or an agent, such as
a contractor, consultant, researcher, or
employee of a private organization,
operating under a cont...
Census1
ask
Were the data collected by the US Census
Bureau?
yes
no
no
Census2
ask
Were the data published by the Census
Bureau?
yes
Set
Legal=[ GovernmentRecords:[ Census:[ CensusPublished ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal Effort:deidentified ] ]
Handling=[ Transit:clear Storage:clear ]
no
ESRA2
ask
Have the data been published previously,
or did you submit a copy of the data to
the Institute of Education Sciences
[IES] Data Security Off...
yes
Set
Legal=[ GovernmentRecords:[ ESRA:[ public ] ] ]
Code=green
Assertions=[ DataType:[ Harm:minimal ] ]
Handling=[ Transit:clear Storage:clear ]
no
ESRA3
ask
Has the IES Data Security Office limited
access to the data to individuals
authorized by an IES license agreement
or individuals who have ex...
yes
yes no
no
Set
Legal=[ GovernmentRecords:[ ESRA:[ restricted ] ] ]
Code=yellow
Assertions=[ DataType:[ Harm:shame ] ]
Handling=[ Transit:encrypt Storage:encrypt ]
yes
todo
CIPSEA DUA
Gov1
ask
Were the data collected by a federal
agency or a researcher under contract
with or with funds from a federal
agency?
no
yes
yes no
no yes
govRecsCompliance
questionnaire.flow-c1/Gov1
questionnaire.flow-c1/DPPA1
Current Questionnaire
24. Creating the Questionnaire
Legal team at the Berkman Center for Internet and Society
• Identifying regulatory and contractual
requirements most relevant to research data sharing
• Clustering legal requirements and approaches into
general categories
• Drafting annotated DataTags questionnaires
• Supporting implementation process
25. The Big Picture
• DataTags is part of the Privacy Tools for Sharing Research Data project
(http://privacytools.seas.harvard.edu)
• (near!) Future versions of Dataverse will be able to consult a DataTags tagging
server during the data deposit process, and carry out the resultant dataset
handling policy
• Resultant tags will be used to create, e.g. DUAs, licenses, parameters for
privacy-preserving access
Risk%Assessment%
and-
De/
Identification-
Differential%
Privacy-
Customized%&%
Machine/
Actionable-
Terms%of%Use-
Data%Tag%
Generator-
Data%
Set-
Query%
Access-
Restricted%
Access-
Consent%
from%
subjects-
Open%
Access%to%
Sanitized%
Data%Set-
IRB%
proposal%
&%review-
Policy%
Proposals%
and%Best%
Practices-
Database%of%
Privacy%Laws%
&%
Regulations-
Deposit%in%repository-
IntegraFon'of'Tools'