Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration


Talk on Urban Legend project at AAAI Symposium on Discovery Informatics:

Published in: Technology, Education
  • Walk through pieces 1 by 1, also mention that this is very much an uncompleted work in progress

    1. 1. Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration Anita de Waard VP Research Data Collaborations
    2. 2. Outline: • Life is complicated • A small pilot • Context and next steps
    3. 3. Life is complicated! 1. Interspecies variability > A specimen is not a species! 2. Gene expression variability > Knowing genes is not knowing how they are expressed! 3. Microbiome > An animal is an ecosystem! 4. Systems biology > Whole is more than the sum of its parts! 5. Models vs. experiment > Are we talking about the same things? In a way we can all use? 6. Dynamics > Life is not in equilibrium! => Reductionism doesn’t work for living systems!
    4. 4. Statistics could help… With enough observations, trends and anomalies can be detected: • “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.” The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234 • “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.” Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.
    5. 5. …but biological research is insular. • Biology is small: size 10^-5 – 10^2 m, scientist can work alone (‘King’ and ‘subjects’). • Biology is messy: it doesn’t happen behind a terminal. • Biology is competitive: many people with similar skill sets, vying for the same grants. • In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, HE physics or astronomy (and they’re not all they’re cracked up to be, either…)). [Diagram: research cycle with the stages Ponder, Prepare, Observe, Analyze, Communicate]
    6. 6. What if we could connect experiments? Across labs, experiments: track reagents and how they are used. [Diagram: linked experiment cycles labeled Prepare, Observations, Analyze, Communicate]
    7. 7. What if we could connect experiments? Compare outcome of interactions with these entities. [Diagram: linked experiment cycles labeled Prepare, Observations, Analyze, Communicate]
    8. 8. What if we could connect experiments? Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments. Reason collectively! [Diagram: linked experiment cycles labeled Prepare, Observations, Analyze, Think, Communicate]
    9. 9. Research Data Management today: Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
    10. 10. An Urban Legend is born: • How can we make a standard neuroscience wet lab more data-sharing savvy? • Incorporate structured workflows into the daily practice of a typical electrophysiology lab (the Urban Lab at CMU) – What does it take? – Where are points of conflict? • 1-year pilot, funded by Elsevier RDS: – CMU: Shreejoy Tripathy, manage/user test – Elsevier: development, UI, project management
    11. 11. Goal: Enable Effective data sharing: • Effective data sharing = “someone who is not the person who collected the data can understand the experiment and data” (Shreejoy’s definition) – So datasets should be more or less self-describing – > 90% of data sharing use cases are an experimentalist sharing data with a future version of herself or with a labmate • Not just experimental data file, but also the experimental metadata: – What was done? What does this variable mean? – This is usually stored in paper lab notebooks, understandable by only the experimenter
    12. 12. Main Assumptions: 1. Effective data sharing includes raw data files + experimental metadata (typically stored in a lab notebook) 2. You know most about an experiment while you’re performing it 3. Improved data practices can make labs more productive and more creative [File name shown on slide: SDB_MC_12_voltages.mat]
    13. 13. Components:
    14. 14. Metadata App:
    15. 15. Data integration: • Syncing of metadata app and electrophysiology data acquisition via server • Each trace of experimental data annotated with metadata • Currently IGOR Pro specific; support for pClamp and other acquisition packages to be added later as needed
    16. 16. Electrophysiology Data Looks like this:
    17. 17. Semantic Integration: The entity table uses a scope and an attributes field to create a NoSQL-like, hierarchical key/value structure in PostgreSQL with the built-in hstore extension. Ontology tables (normalized SQL) map keys, values, and scopes to ontology information. Entity fields: • ID : UUID • Investigator : references investigators table • created : timestamp • last_modified : timestamp • scope : string ~ /[A-Z]\d+(::[A-Z]\d+)*/ • attributes : hstore (string → string mapping)
    18. 18. Data dashboard (planned): • Use collected metadata to sort experiments: organize by mouse strain, neuron type, animal age • Enable in-browser analyses: track provenance of analyzed data back to raw data: “what was that outlier?” • Simple link in to publishing/data sharing tools: “we can publish papers no one else can”
    19. 19. Next steps Urban Legend Project: • Populate data server with many experiments: – Are people using it? Why/why not? – What questions can we answer now that we couldn’t before? • Export data to neuroscience databases: NIF, INCF Dataspace • How adaptable is this solution for use in other labs? • Can we scale this up and make it sustainable? • Software is available! Ready to swap this simple system for something better: point is process! • How does it fit into a larger data infrastructure within the institution/nationally/internationally?
    20. 20. Elsevier Research Data Services: • Main goal: make research data optimally available, discoverable and reusable • Collaboration is tailored to partner’s unique needs: – Working with a few domain-specific and institutional repositories and institutions – Aspects where collaboration is needed are discussed – Collaboration plan is drawn up using SLA: agree on time, conditions, etc. • 2013/2014: series of pilots, studies and reports to enable feasibility study: – What are key needs? – Can Elsevier play a role: skillsets, partnerships? – Is there a (transparent) business model for this?
    21. 21. Institutional Context: [Diagram: flows among Researchers, Electronic Lab Notebooks, Generic Data Storage (such as Dropbox), Research Data Repositories, a Unified Metadata Layer, the Institutional Repository, the Institution (Library, Research Office), and Funding Agencies. Labels include: Data Flow, Deposit / Store, Curation, Indexing, Indexing & Search, Integrated Data Search, Integrated Performance Query, Usage/Citation reporting, Performance Reporting]
    22. 22. Data Initiatives: • Data Citation group: – Synthesize principles of proper data citation – ‘Declaration of Data Citation Principles’, 8 principles of successful data citation • Resource Identification Initiative: – Promote research resource identification, discovery, and reuse – Resource Identification Portal – Central location for obtaining research resource identifiers (RRIDs) for materials and software used in biomedical research. Examples: • Antibody: Abgent Cat# AP7251E, ABR:AB_2140114 • Tool: CellProfiler Image Analysis Software, NIFRegistry:nif-0000-00280 • Organism: MGI:MGI:3840442
    23. 23. Summary: • Life is complicated: knowledge needs to be connected! • A small pilot: “Urban Legend” • Context and next steps: – Working with institutions and databases to piece together this puzzle – Force11 is contributing some pieces
    24. 24. Thank you! Collaborations and discussions gratefully acknowledged: • CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Rick Gerkin, Santosh Chandrasekaran, Matthew Geramita, Eduard Hovy • UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky • NIF/Force11: Maryann Martone, Anita Bandrowski • OHSU: Melissa Haendel, Nicole Vasilevsky • California Digital Library: Carly Strasser, John Kunze, Stephen Abrams • Elsevier: Mark Harviston, Jez Alder, David Marques
    25. 25. Questions? Anita de Waard VP Research Data Collaborations
    26. 26. Scopes: A scope follows the format L#::L#::L#... where L is a letter identifier and # is any number of decimal digits. Example: P1::S1::R3 = Animal Prep 1, Slice 1, Run 3. A letter need not be globally unique, only unique within its chain. Example: P1::S1::E1 (Electrode) is different from P1::S1::R1::E1 (Run-Electrode). Scopes are 1-indexed.
    27. 27. Attributes: Each scope has an attributes field that consists of multiple key/value pairs. Keys are unique and not tied to a scope (e.g. electrode_name instead of name). A key can be a choice, a scalar (with units), or a free-text field; which one is determined by the ontology tables.
    28. 28. Downsides to Flexible Schema: Converting between the flat scopes and a true hierarchy (say, in JSON) is rather complicated and led to many errors in the app. It is very easy to get corrupted data in the app. The schema is closely aligned to the way the Lua app did things. A flexible schema was a good choice, but scopes were not a good choice for hierarchies.
    29. 29. Raw Data: For use in the data dashboard. Standardized on HDF5. Files uploaded via FTP, in batches or individually. The username, the filename, and metadata within the HDF5 file are used to identify the associated metadata records.
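
The scope format on slide 26 and the flat-to-hierarchy conversion that slide 28 calls error-prone can be illustrated with a small Python sketch. This is not the project's code: the function names `parse_scope` and `nest`, the nested `attributes`/`children` layout, and the sample attribute values are all assumptions made for illustration. Only the scope pattern (a letter plus decimal digits, chained with `::`, 1-indexed) and the key `electrode_name` come from the slides.

```python
import json
import re
from typing import Dict, List, Tuple

# Scope pattern as described on the slides: letter + digits, chained with "::".
SCOPE_RE = re.compile(r"[A-Z]\d+(?:::[A-Z]\d+)*")


def parse_scope(scope: str) -> List[Tuple[str, int]]:
    """Split 'P1::S1::R3' into [('P', 1), ('S', 1), ('R', 3)]."""
    if not SCOPE_RE.fullmatch(scope):
        raise ValueError(f"malformed scope: {scope!r}")
    levels = [(part[0], int(part[1:])) for part in scope.split("::")]
    if any(index < 1 for _, index in levels):
        raise ValueError(f"scopes are 1-indexed: {scope!r}")
    return levels


def nest(rows: Dict[str, Dict[str, str]]) -> dict:
    """Fold flat {scope: attributes} rows into a nested dict (hypothetical layout).

    Missing intermediate levels are created empty. Note that 'P01' and 'P1'
    end up as the same level here, which is a choice made by this sketch.
    """
    root: dict = {"attributes": {}, "children": {}}
    for scope, attributes in rows.items():
        node = root
        for letter, index in parse_scope(scope):
            node = node["children"].setdefault(
                f"{letter}{index}", {"attributes": {}, "children": {}}
            )
        node["attributes"].update(attributes)
    return root


# Hypothetical rows; the attribute values are made up for illustration.
rows = {
    "P1": {"animal_age": "21"},
    "P1::S1::E1": {"electrode_name": "A"},      # electrode on the slice
    "P1::S1::R1::E1": {"electrode_name": "B"},  # electrode within a run
}
tree = nest(rows)
print(json.dumps(tree, indent=2))
```

The two `E1` entries land in different branches because a letter only has to be unique within its chain. The sketch only covers the flat-to-nested direction; the reverse conversion and conflict handling, where the slides report most of the trouble, are left out.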