This presentation covers challenges in, and offers suggestions for, creating flexible, interoperable data systems for the life sciences.
Our regular Introduction to Data Management (DM) workshop (90 minutes). Covers very basic DM topics and concepts. The audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
S. Venkataraman (DCC) talks about the basics of Research Data Management and how to apply this when creating or reviewing a Data Management Plan (DMP). He discusses data formats and metadata standards, persistent identifiers, licensing, controlled vocabularies and data repositories.
link to : dcc.ac.uk/resources
IDCC Workshop: Analysing DMPs to inform research data services: lessons from ... – Amanda Whitmire
A workshop as part of the International Digital Curation Conference 2016 on DMP development and support. This presentation demonstrates how we can use data management plans as a source of information to better understand researcher data stewardship practices and how to support them. Be sure to see the slide notes to better understand the presentation (most slides are just photos/icons).
Talk at a JISC Repositories conference intended for repository managers or research managers on some of the issues involved. The talk originally had to be given unaided because of a technology problem!
Research Data (and Software) Management at Imperial (Everything you need to ...) – Sarah Anna Stewart
A presentation on research data management tools, workflows and best practices at Imperial College London with a focus on software management. Presented at the 2017 session of the HPC Summer School (Dept. of Computing).
Presentation on electronic records management and archival issues. Originally presented at the Fall 2008 meeting of the Southeastern Wisconsin Archivists Group
This slideshow was used in an Introduction to Research Data Management course for the Social Sciences Division, University of Oxford, on 2015-05-27. It provides an overview of some key issues, looking at both day-to-day data management, and longer term issues, including sharing, and curation.
This is a presentation for the Erwin Hahn Institute in Essen, explaining the background, functional design and technical architecture of the Donders Repository. Furthermore, it explains how it aligns with DCCN project management and with the researchers' workflow.
This presentation was delivered at the Elsevier Library Connect Seminar on 6 October 2014 in Johannesburg, 7 October 2014 in Durban and 9 October 2014 in Cape Town and gives an overview of the potential role that librarians can play in research data management
This slideshow was used in a Preparing Your Research Data for the Future course taught in the Medical Sciences Division, University of Oxford, on 2015-06-08. It provides an overview of some key issues, focusing on long-term data management, sharing, and curation.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article: Carole Goble, “Better Software, Better Research”, IEEE Internet Computing, vol. 18, no. 5, Sept.-Oct. 2014, pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
Introduction to research data management; Lecture 01 for GRAD521 – Amanda Whitmire
Lesson 1: Introduction to research data management. From a series of lectures from a 10-week, 2-credit graduate-level course in research data management (GRAD521, offered at Oregon State University).
The course description is: "Careful examination of all aspects of research data management best practices. Designed to prepare students to exceed funder mandates for performance in data planning, documentation, preservation and sharing in an increasingly complex digital research environment. Open to students of all disciplines."
Major course content includes: Overview of research data management, definitions and best practices; Types, formats and stages of research data; Metadata (data documentation); Data storage, backup and security; Legal and ethical considerations of research data; Data sharing and reuse; Archiving and preservation.
See also, "Whitmire, Amanda (2014): GRAD 521 Research Data Management Lectures. figshare. http://dx.doi.org/10.6084/m9.figshare.1003835. Retrieved 23:25, Jan 07, 2015 (GMT)"
University of Bath Research Data Management training for researchers – Jez Cope
Slides from a workshop on Research Data Management for research staff and students at the University of Bath.
Part of the Research360 project (http://blogs.bath.ac.uk/research360).
Authors: Cathy Pink and Jez Cope, University of Bath
These are the slides presented by Denis Engemann in the Open Science Panel discussion at the BIOMAG 2018 meeting in Philadelphia. You can find the original version on https://speakerdeck.com/dengemann/mne-hcp-pitch-biomag-2018
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...) – Tom Plasterer
As scientists in the life sciences, we are trained to pursue singular goals around a publication, a validated target, or a drug submission. Our failure rates are exceedingly high, especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. By using technical and social solutions together, knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions that enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. Taken together, these foster more accurate, timely and inclusive decision-making.
Microfilm or Digitize: Which is Right for You? – Brad Houston
Presentation on reformatting options for active and inactive records. Originally presented at the 2009 Annual Conference of the International Institute of Municipal Clerks, May 20, 2009
A presentation I gave at the 2018 Molecular Med Tri-Con in San Francisco, February 2018. It addresses the general challenge of biomedical data management and some of the things to consider when evaluating solutions, and concludes with a brief summary of some of the tools and platforms in this space.
2. Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksums
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
3. Geek Cred: My First Petabyte, 2008
5. State of the art: 2001
“Gene expression data … are meaningful only in the context of a detailed description of the conditions under which they were generated … including the particular state of the living system under study … and the perturbations to which it has been subjected.”
6. State of the art: 2019
Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample.
7. 2016: Data Quality Matters
Ask a computational biologist / data scientist what fraction of their time is spent fighting data quality, formatting, and similar issues.
8. State of the Art: 2020
A miscommunication between the wet lab and the bioinformatics group resulted in an “embarrassing miscommunication” … to the press.
9. Genomic Data Production in Context (genomic data production @ Broad)
I did research computing at Broad from 2014 to 2017.
10. ExAC / gnomAD: A powerful example
ExAC (the Exome Aggregation Consortium) and gnomAD (the Genome Aggregation Database) represent vast amounts of work to harmonize both phenotype and consent.
11. The IT Services Perspective
Filesystem metadata:
– File attributes: size, format, creation and modification times
– Permissions: ownership, Access Control Lists
– Access / usage patterns: which files are accessed, and by whom
– Compressibility / deduplication: lies (that is, optimistic projections by vendors)
One function of a research computing team is to bridge the gap between data storage (usable capacity as provided by enterprise IT) and data services (semantically usable data).
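The filesystem attributes listed above can be harvested with nothing but the Python standard library. This is a minimal sketch; the record field names (`size_bytes`, `owner_uid`, …) are illustrative, not a fixed schema.

```python
import os
import stat
from pathlib import Path

def fs_metadata(path):
    """Collect basic filesystem metadata for one file."""
    st = os.stat(path)
    return {
        "path": str(path),
        "size_bytes": st.st_size,
        "modified": st.st_mtime,
        "mode": stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
        "owner_uid": st.st_uid,             # resolve to a name via pwd on Unix
    }

def walk_tree(root):
    """Yield one metadata record per file under root."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            yield fs_metadata(p)
```

Access and usage patterns require audit logging at the storage layer; a static walk like this only sees the attributes, not who reads the files.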
12. The filesystem / directory tree is your default metadata database
It is what your team is using today. Any proposal less functional than descriptive filenames will fail.
13. “If you have four groups working on a compiler, you’ll get a four-pass compiler.”
Eric S. Raymond, The New Hacker’s Dictionary, 1996
14. Most primary data files (in bioinformatics) include valuable metadata, usually in the header. These can be quite verbose.
15. Bioinformatics as a discipline is filled with duplication. “hg19” here is identical to “build 37” from the previous slide.
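One cheap defense against this kind of duplication is a normalization table for reference-genome labels. The alias table below is a hypothetical illustration (and, like the slide, glosses over real differences between hg19 and b37, such as chromosome naming), not an authoritative registry.

```python
# Illustrative alias table: maps common free-text labels to one canonical name.
REFERENCE_ALIASES = {
    "hg19": "GRCh37",
    "b37": "GRCh37",
    "build 37": "GRCh37",
    "grch37": "GRCh37",
    "hg38": "GRCh38",
    "grch38": "GRCh38",
}

def canonical_reference(label):
    """Normalize a reference genome label; unknown labels pass through unchanged."""
    return REFERENCE_ALIASES.get(label.strip().lower(), label.strip())
```

Applying this at metadata-capture time means queries for “everything aligned to GRCh37” actually find the files labeled “hg19”.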
16. Many file headers include the command line and parameters that were used to generate the file. This is the default method for storing experimental “provenance”.
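As a sketch of scraping that provenance, assuming the VCF convention of `##key=value` metadata lines at the top of the file (the `##bcftools_viewCommand` line in the example data is illustrative of such a line):

```python
def scrape_vcf_header(lines):
    """Collect the '##' metadata lines from the top of a VCF file,
    and pull out any that record the generating command line."""
    header, commands = [], []
    for line in lines:
        if not line.startswith("##"):
            break                        # header ends at the #CHROM line / data
        header.append(line.rstrip("\n"))
        key, _, value = line[2:].partition("=")
        if "command" in key.lower():     # e.g. ##bcftools_viewCommand=...
            commands.append(value.strip())
    return {"header": header, "commands": commands}
```

The same pattern (read until the header sentinel, keep the raw lines, pick out command fields) applies to SAM `@PG` lines and similar formats.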
18. We don’t have containers for metadata
Container technology (Docker / Singularity) revolutionized software deployment: instead of installers and configurators, we ship a whole operating system with the app pre-installed. We do not have a similar solution for experimental metadata, and we have not found a way to package and ship domain experts and researchers.
19. NoSQL is a delightful prototyping tool
• Schema-flexible databases (MongoDB, or PostgreSQL with JSONB, …) do not require a fully defined schema.
• You gain flexibility at the cost of consistency and possibly performance.
• This makes them ideal for prototyping.
“Plan to throw one away; you will, anyhow.” – Fred Brooks, The Mythical Man-Month, 1975
20. What goes in the NoSQL?
Unique key to identify the file
The path to where it is stored
A checksum (to find duplicates later)
The header (scrape and store wholesale)
Whatever else the lab said was important
– Perhaps column headers from that spreadsheet they use…
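The list above can be turned into a small document builder for the metadata store. A minimal sketch: the field names, the naive "#"-prefix header scrape, and reading the whole file into memory to checksum it are all simplifying assumptions.

```python
import hashlib
import uuid
from pathlib import Path

def metadata_record(path, extra=None):
    """Build one metadata document: unique key, storage path, checksum
    (for de-duplication later), scraped header, plus whatever else the
    lab said was important."""
    data = Path(path).read_bytes()
    header = [ln for ln in data.decode(errors="replace").splitlines()
              if ln.startswith("#")]          # naive header scrape
    return {
        "_id": str(uuid.uuid4()),             # unique key for the file
        "path": str(path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "header": header,
        "lab_fields": dict(extra or {}),      # e.g. spreadsheet column values
    }
```

With a driver such as pymongo, a record like this could go straight into a collection via `insert_one`; for prototyping, a list of dicts works just as well.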
21. Capture metadata at the time of creation
• Metadata needs to be captured at the point of data creation.
• This amounts to putting more work on staff who are likely already overburdened and time-conscious.
• Very little of the benefit of rigorous data processes will be felt in the lab (at least at first).
22. The Data Tzar
Data Tzar: clearly empowered; the title sparks curiosity.
Data Janitor: data are trash; low-prestige job.
Data Monkey: disrespectful, vaguely racist.
Chief Data Officer: they mostly seem to work on licensing.
23. The Data Tzar: Day 1
Engage with data generators
– Tools to make their lives easier: dashboards, alerting systems, backups, routine analysis, QC checks
– Go “breadth first” across the enterprise
– Do not start with the whole genomes
Sneakily harvest metadata
Necessary resources (day 1):
– 1 – 2 early-career bioinformatics programmers
– Access to an infrastructure engineer
– A modest budget on your cloud provider of choice
24. The Data Tzar: First Year
Do a lot of favors, build a lot of Shiny apps
Convene working groups around specific types of data
Create crosscutting dashboards for leadership
Make friends with the heads of information security and compliance.
Prepare a budget proposal
25. FAIR Data (within the enterprise)
Findable
• A NoSQL database of metadata and checksums
• It’s plenty for a good long time.
Accessible
• Federated identity management
• An architecture of S3 buckets and production “roles”
Interoperable
• “It’s much easier to go FAR than to go FAIR”
Reusable
• Data standards, ontologies, and a strong policy framework, including electronic consents for human subjects data.
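One immediate payoff of storing checksums is findability of duplicates. A sketch, assuming records shaped like the earlier "what goes in the NoSQL" list (a `sha256` field and a `path` field):

```python
from collections import defaultdict

def find_duplicates(records):
    """Group metadata records by checksum; any checksum appearing under
    more than one path is a candidate duplicate set."""
    by_sum = defaultdict(list)
    for rec in records:
        by_sum[rec["sha256"]].append(rec["path"])
    return {s: paths for s, paths in by_sum.items() if len(paths) > 1}
```

In a real document store this would be an aggregation (group by checksum, filter on count), but the in-memory version shows the idea.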
26. The Clinical Data Ecosystem
Incredible opportunities here, and rapidly developing data silos. There is an incredible wealth of data available to support both clinical care and research. Unfortunately, it is carved up and isolated. The phrase I hear most frequently from hospital CIOs: “no upside.”
– Patient journals, consumer products, longitudinal data from other providers, …
– Electronic Medical Records: possibility of a self-normal (N of 1) over time
– Diagnostic Imaging
– Clinical Notes: natural language processing has strong potential
– Hospital Telemetry: innovations in the basics of clinical observation
– Primary Lab Data: pressure to avoid incidental findings; prevent bias
27. Appropriate Use and Consent
“We should be up front with participants that we can’t protect their privacy completely, and we should ensure that the most appropriate legislation is in place to protect participants from being exploited in any way.”
– Eric Schadt, CEO, Sema4
28. Policies and Governance
Appropriate Usage – human-readable document: expectations of privacy and standards of behavior.
Data Classification – governance document: defines the major categories of data (corporate sensitive, clinical, …) and standards for handling each.
Written Information Security Policy (WISP) – technical document: defines how systems must be configured to protect sensitive data and operations.
Vendor Qualification – business SOP establishing practices around how vendor access and systems should be managed.
29. Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksums
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us