Data Management Open House

Speaker notes
  • MH: Introduce
  • MH: Introduce
  • JW
  • MH: introduce everyone
  • MH: A grassroots effort to accelerate the pace and nature of scholarly communication and e-scholarship through technology, education, and community. Why 11? We were born in 2011.
  • MH: FORCE11 comprises a diverse set of participants to best aid in the redefinition of scholarly communication.
  • MH: An (un)conference where stakeholders came together as equals to discuss issues. An incubator for change. What would you do to change scholarly communication if you had $1K? M. Haendel was an award winner.
  • MH: Why we are here today, and how all of you can help. JW: Put slide 2 here, perhaps? This still has too much text; slide 2 is much less intimidating.
  • JW
  • RC: The traditional model of scientific communication is fairly straightforward. Successful research is shared via presentations and papers after data are collected and analyzed.
  • RC: This model is slow, even when considered within the context of electronic journals. A recent study clocked the average timeframe from submission to publication for biomedical journals at over 9 months: http://www.openaccesspublishing.org/2013/09/06/the-publishing-delay-in-scholarly-peer-reviewed-journals/.
  • RC: The traditional model is also very formalized with respect to when in the research cycle the science is shared (well after the study has taken place), how it is disseminated (peer-reviewed articles), to whom one is communicating (most often scientists in your specialized field), and how impact is measured (citation counts to articles).
  • RC: Finally, it is unilateral in that it doesn't facilitate dynamic, real-time interaction between scientists outside of the society meeting or conference. Nor does it further conversation between scientists and the public.
  • The Internet has had a profound effect on science and scientific communication: the traditional model I just described is being reimagined, ultimately in the pursuit of advancing the scientific process. But the traditional model of scholarly communication still dominates how many scientists manage and share their research and data.
  • RC: The volume of literature has exploded since the first online journals were launched in the late 1980s. Today, virtually all science journals are online. There are over 28,000 active peer-reviewed journals, publishing nearly 2 million articles per year, with a new paper published every 20 seconds. This is a huge industry, with revenues of about 10 billion/year. 50% of new research is freely available online either immediately or within 12 months of publication, but the other 50% lives behind high paywalls, limiting the scope of science available to potential readers (human and computer; scientists and the public). Infographic: http://www.sciencemag.org/site/special/scicomm/infographic.jpg
  • RC: We have also seen a proliferation of new publishing modes and models. This includes a variety of open access publishers and journals with new peer-review models, such as open and post-publication peer review, and new economic models, wherein authors, funders, and libraries share the cost of publication. New modes include, but are not limited to, self-publication and social media, such as science blogs and Twitter, and data sharing via public repositories.
  • Communication is also occurring at more points across the research cycle. For instance, ideas are shared and developed via online conversations on blogs and Twitter. Code and data are being released as they are built and recorded via open lab notebooks. This activity complements and feeds the traditional products of research: papers and presentations.
  • RC: However, if scientists don’t thoughtfully and actively manage their research products in this new system, the advantages are minimized and all this new stuff becomes noise.
  • Data. Complex…
  • JW
  • JW
  • JW
  • JW
  • JW: This is raw crystallography data, collected at OHSU. It is visual, high-resolution data.
  • The image then gets integrated along the spots, transforming the image into a series of mathematical values.
  • The "model" we are used to seeing is actually a mathematical representation of how well a model (the sticks) fits into the mathematical distillation of the raw image. This looks static, but is actually a best representation along one axis of the data (which is to say, confidence levels). Crystallography boils down to solving the "phase problem", which can be done two ways: brute force (holy hell!), or by using an existing model as a jumping-off point. The latter is the fastest and most efficient way of solving structures, and is, in fact, what I did to solve this structure. I got the previously published data from pdb.org, which is also where I deposited my data. The point of this is threefold: 1) data comes in many shapes and forms, 2) data transforms, and 3) data helps inform more data.
  • JW: This is raw crystallography data, collected at OHSU. It is visual, high-resolution data.
  • Ask them to think about what type of data they deal with/generate. Give them a couple of minutes.
  • Ask if they have additional data types that they brainstormed. JW: Yes, we need this slide if we are to cover the examples listed later. Also, we are eventually getting to altmetrics, which means the third quadrant; therefore, it is important to cover here.
  • Data. Complex…
  • Data. Complex…
  • JW
  • JW
  • Add metadata not only to your experimental results, but also to your process during research, such as resources, protocols, etc. There are ways to apply metadata to every moving part of your research (a minimal metadata sidecar sketch appears after these notes).
  • JW: This is raw crystallography data, collected at OHSU. It is visual, high-resolution data.
  • JW: This is raw crystallography data, collected at OHSU. It is visual, high-resolution data.
  • JW: This is raw crystallography data, collected at OHSU. It is visual, high-resolution data.
  • JW
  • NV: The literature was the place we would go to find information: to get protocols, information about techniques, and resources/reagents. Assuming you got to your relevant paper, look at the methods section: is there enough information there for you to be able to reuse/reproduce the experiment or technique?
  • NV: For example, if you look in the materials and methods section for an antibody used in a western blot, oftentimes only the name is reported, along with the vendor and the vendor's location. Say here that the authors met the journal standards, but that those standards really aren't sufficient.
  • NV: However, there are several antibodies generated against one target, so how do you know which one works in this assay? We need to report catalog numbers…
  • NV: Alternatively, report the Antibody Registry ID. It is a permanent identifier that stays with the antibody, even as vendors change or catalog numbers change. It is similar to GenBank, but for antibodies. Most resources can be reported more specifically than publisher guidelines require; those guidelines are not intended to support reproducibility.
  • NV: An area with poor data standards shows poor reproducibility. Here we showed how irreproducible many studies were, simply due to a lack of specificity in the resources used in the experiments. We therefore developed guidelines to support resource reporting, and these are now in effect in a number of journals, with more to come. OHSU participates in the Reproducibility Initiative, aimed at developing policies and tools to aid scientific reproducibility. Some bioinformatics tools to aid reproducibility are Workflow4Ever and RunMyCode.org. Outcomes from data standards: reproducibility and data reuse. Place URLs in a separate document, not on the slide: www.scienceexchange.com/reproducibility, www.wf4ever-project.org, runmycode.org
  • Bioinformatics workflow standards such as Workflow4Ever and RunMyCode have been developed to help with the standardization and sharing of scientific workflows and code. RunMyCode is a repository where people can share or reuse code that is associated with scientific publications. For data manipulations, here is an example of tools that can help with reproducibility.
  • MH: Yes 28.0%, No 26.9%, I don't know 45.1% (175 answered the question).
  • MH: Put the URL in a supporting document; it is too distracting here. http://www.usgs.gov/datamanagement/plan/datastandards.php
  • MH: Each type serves a different purpose. Reporting guidelines serve to ensure that a minimum of metadata is reported, so that someone else can know what your data is about. Terminology artifacts allow some of the data to be structured for reuse and interoperability; think of these as interoperability handles. Exchange formats provide the syntax for the data structure, and further enable data integration and mashup.
  • MH: Each type serves a different purpose. Reporting guidelines serve to ensure that a minimum of metadata is reported, so that someone else can know what your data is about. Terminology artifacts allow some of the data to be structured for reuse and interoperability; think of these as interoperability handles. Exchange formats provide the syntax for the data structure, and further enable data integration and mashup. (A sketch combining all three kinds of standards appears after these notes.)
  • MH: Which one to use? We need a solution to help identify the right standard, and to contribute to and/or extend existing ones to best support community reproducibility and reuse.
  • MH: Both of these resources provide a survey of data standards of all three types: reporting guidelines, terminology artifacts (including ontologies), and exchange formats. BioSharing has a biology focus; CDISC has a clinical focus. There are others; these are just two resources. Takeaway: there are different standards, and no standard meets everyone's needs.
  • NV: This is the transition back to Melissa.
  • MH: Reusing data is not as easy as dumpster diving. You don't always know that a Coke can or a keyboard key can be a critical data element. JW: Oh. My. God.
  • MH: Slide from Chris Mungall. Ontologies provide the handle by which data from different databases and of different types can be linked and integrated for maximal biological knowledge. Do we need this slide? JW: Maybe not IN the deck, but at the back. If somebody asks what an ontology is during the Q&A, we can bring it up. I did this all the time for my seminars: always have extra slides at the back end for potential questions.
  • MH: Ontologies, unlike a file system, allow data to be classified in many different ways using logic and standardized identifiers.
  • MH: When data is encoded using ontologies, it can be mashed up in novel ways. Here, we are using clinical phenotype data and comparing it with model organism phenotype data to identify candidate genes for undiagnosed human diseases (a phenotype comparison sketch appears after these notes). JW: Please let me clean up the original image. The pixelated borders are driving me nuts, and the human head has some white pixels that can very easily and quickly be cleaned up!
  • MH: Those pesky data sharing mandates: what are they really for? Does dumping my data into a data repository with no metadata or use of standards really help? Answer: no, it doesn't. If you want your data to be a first-class citizen as a scholarly product that can be cited and actually reused, then you need to go a bit further. Need to add links to policies.
  • Transition: How can I meet data sharing requirements and actually make my data reusable? Answer: Just like any experiment or quality statistical approach, you need to plan ahead. There are tools to help. The library can help too.
  • FigShare, Dryad, Data.gov
  • MH: add link
  • We want people to come to the library for help with archiving/data publication. Where can you keep your data? Does it have sensitive info (yes/no)? Does it need to be archived? Make a decision tree for one-on-one meetings (a minimal sketch of such a decision tree appears after these notes).
  • What does this mean? It means storing data or performing analyses on (often) unsecured shared servers that may exist anywhere in the world. Why should you care? Tools like Dropbox and Google Docs are research efficiency lifesavers, but they come with an IP risk as well as the risk of sharing PHI data. Similarly, Amazon cloud servers and genomics data analysis platforms are all too easy to set up or use, and can lead to PHI data being leaked.
  • MH: Example: DOIs for publications, data, or other research products, e.g. doi:10.1371/journal.pbio.1001339. A URI will resolve to a single location on the web (a DOI resolution sketch appears after these notes). URIs for people.
  • RC
  • Scientific output and potential impact are more complex, dynamic, and diverse than peer-reviewed papers. Actively managing your research footprint, which of course includes your data, can positively affect your scientific impact.
  • MH: I updated it a bit.
  • MELISSA
  • Robin: add a better title? Needs cleanup still.
  • Grab info for NIH. Melissa will also talk about the NSF biosketch and how everything you create speaks to you as a scientist: make it citable! End with your scholarly footprint; lead into the breakouts.
  • JW
  • JW
  • MH: Should add links to libguide, library pages etc.
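
A minimal sketch (in Python) of the metadata note above: describing both a result file and the process that produced it in a small JSON "sidecar" that travels with the data. Every field name and value here is a hypothetical example, not a formal metadata standard.

```python
# Hypothetical example: a JSON metadata "sidecar" describing a result file
# and the process (protocol, instrument, reagents) that produced it.
import json

metadata = {
    "title": "Western blot quantification, experiment 42",
    "creator": "Jane Researcher",
    "date_collected": "2013-10-09",
    "protocol": "Lab blotting protocol v2.1 (protocols/blot_v2.1.pdf)",
    "instrument": "Infrared imaging scanner",
    "reagents": ["Anti-FGFR2 antibody, Antibody Registry ID (placeholder): AB_0000000"],
    "units": "relative fluorescence intensity",
    "related_files": ["blot_2013-10-09_raw.tif", "blot_2013-10-09_quantified.csv"],
}

# Keeping one sidecar file per dataset means the description travels with the data.
with open("blot_2013-10-09_quantified.metadata.json", "w") as fh:
    json.dump(metadata, fh, indent=2)
```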
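
A sketch tying together the three kinds of standards described in the notes: a reporting guideline (minimum required fields), a terminology artifact (standardized identifiers for values), and an exchange format (a shared syntax). The required-field list is hypothetical; the two ontology identifiers are real but chosen only for illustration.

```python
import json

# 1. Reporting guideline: the minimum metadata a record must carry.
REQUIRED_FIELDS = {"sample_id", "organism", "tissue", "assay"}

# 2. Terminology artifact: free-text labels mapped to standardized identifiers.
TERM_IDS = {
    "house mouse": "NCBITaxon:10090",     # NCBI Taxonomy
    "cerebral cortex": "UBERON:0000956",  # Uberon anatomy ontology
}

record = {
    "sample_id": "S-042",
    "organism": "house mouse",
    "tissue": "cerebral cortex",
    "assay": "microarray expression profiling",
}

missing = REQUIRED_FIELDS - set(record)
if missing:
    raise ValueError(f"Missing required fields: {sorted(missing)}")

# Attach identifiers so other databases can point at exactly the same concepts.
record["organism_id"] = TERM_IDS[record["organism"]]
record["tissue_id"] = TERM_IDS[record["tissue"]]

# 3. Exchange format: a common syntax (JSON here) for moving the record around.
print(json.dumps(record, indent=2))
```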
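
A sketch of the cross-species mashup idea: once human (HP) and mouse (MP) phenotype terms are linked, a patient profile can be scored against a mouse model. The term IDs come from the Pfeiffer syndrome slide later in the deck; the equivalence mapping itself is a tiny hypothetical table standing in for a real curated cross-species ontology.

```python
# Patient and mouse-model phenotype profiles as ontology term IDs
# (IDs taken from the Pfeiffer syndrome example on slide 46).
patient = {"HP:0000244", "HP:0000316", "HP:0000327"}      # brachyturricephaly, hypertelorism, maxillary hypoplasia
mouse_model = {"MP:0000435", "MP:0001300", "MP:0000097"}  # shortened head, ocular hypertelorism, short maxilla

# Hypothetical human-to-mouse term mapping; real mappings come from curated
# cross-species phenotype ontologies.
hp_to_mp = {
    "HP:0000244": "MP:0000435",
    "HP:0000316": "MP:0001300",
    "HP:0000327": "MP:0000097",
}

# Translate the patient's profile into mouse terms and score the overlap.
translated = {hp_to_mp[t] for t in patient if t in hp_to_mp}
shared = translated & mouse_model
similarity = len(shared) / len(translated | mouse_model)  # Jaccard index
print(f"Shared phenotypes: {len(shared)}, similarity: {similarity:.2f}")
```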
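
A minimal sketch of the decision tree mentioned in the storage note: where data might live, given whether it contains sensitive information and whether it needs to be archived. The recommendations are illustrative placeholders, not OHSU policy.

```python
def recommend_storage(has_sensitive_info: bool, needs_archiving: bool) -> str:
    """Toy decision tree: suggest a storage location for a dataset."""
    if has_sensitive_info:
        # PHI or other sensitive data stays on access-controlled, institution-managed storage.
        return "Institution-managed secure storage; avoid consumer cloud services."
    if needs_archiving:
        # Publishable, non-sensitive data can go to a public repository.
        return "Deposit in a domain-specific or general repository (e.g. Dryad, figshare)."
    return "Backed-up local or shared network storage, with versioned file names."

print(recommend_storage(has_sensitive_info=False, needs_archiving=True))
```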
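
A sketch showing what it means for a DOI to resolve to a single current location on the web, using the example DOI from the notes. It assumes the third-party requests package is installed.

```python
import requests

doi = "10.1371/journal.pbio.1001339"  # the example DOI from the notes
response = requests.get(f"https://doi.org/{doi}", allow_redirects=False)

# The doi.org resolver answers with an HTTP redirect to the article's current home.
print(response.status_code)              # typically a 302 redirect
print(response.headers.get("Location"))  # the URL the DOI currently points to
```
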
Transcript

    1. 1. DATA MANAGEMENT OPEN HOUSE OHSU Library October 9th, 2013 #OHSUdata @force11rescomm
    2. 2. 0 | Introductions 1 | Scientific Communication 2 | Making it work for you 3 | Your impact 4 | Hands On 5 | Making it matter
    3. 3. 0 | Introductions
    4. 4. Melissa Haendel, Ontology Development Group and DMICE | Nicole Vasilevsky, Ontology Development Group | Jackie Wirz, Research Specialist, Research Roadmap SOM | Robin Champieux, Scholarly Communication
    5. 5. http://www.force11.org/ @force11rescomm
    6. 6. Who is FORCE11? Publishers, library and information scientists, policy makers, tool builders, funders, scholars. Science, Social Science, Humanities. Free to join!
    7. 7. Beyond-the-PDF San Diego, Jan 2011 | Amsterdam, March 2013 www.force11.org/beyondthepdf2 | #btpdf2
    8. 8. How does OHSU fit in? We won $1K to find out. Today | Discuss data-research cycle, reproducibility, and communication of findings Later | Data playground with researchers:  Your data needs  Identify the material and services you need  Get paid $50
    9. 9. 1 | Scientific Communication
    10. 10. Once upon a time…. Research, Present, Publish. Repeat.
    11. 11. You might say it wears a uniform
    12. 12. Our relationship is so one-sided.
    13. 13. From Paper to Tweet
    14. 14. www.sciencemag.org/site/special/scicomm/infographic.jpg
    15. 15. New Modes & Models
    16. 16. Manage Your Footprint
    17. 17. Data can be pretty complex…
    18. 18. Data does not speak for itself…
    19. 19. You speak for your data
    20. 20. You need to manage it
    21. 21. But, even more fundamentally…
    22. 22. what does data mean to you?
    23. 23. asdf
    24. 24. asdf
    25. 25. You speak for your data
    26. 26. How do you speak for your data when you are not around?
    27. 27. Do you know what metadata is? a. Philosophy b. describes data c. dating site d. data
    28. 28. Title Author Call number Publisher ISBN
    29. 29. Your metadata should make your data understandable to others… without your involvement. - Anne Gilliland
    30. 30. 2 | Making it work for you
    31. 31. http://www.phd2published.com/wp-content/uploads/2011/09/publications_image.jpg
    32. 32. Biomed Res Int. 2013;2013:350419.
    33. 33. A Solution: Antibody Registry The Antibody Registry www.antibodyregistry.org
    34. 34. Data standards can help with reproducibility Average of ~50% of resources were not identifiable Vasilevsky et al., 2013 PeerJ 1:e148 www.force11.org/node/4463 biosharing.org/bsg-000532
    35. 35. Data Analysis Pipeline Reproducibility Platforms RESOURCES www.wf4ever-project.org runmycode.org galaxyproject.org/
    36. 36. Are you aware of data standards in your field? @OHSU, 72% said no or didn’t know!
    37. 37. Data standards are the rules by which data are described and recorded. In order to share, exchange, and understand data, we must standardize the format as well as the meaning. www.usgs.gov/datamanagement/plan/datastandards.php Data Standards
    38. 38. Types of data standards Reporting guidelines Terminology Artifacts (includes ontologies) Exchange Formats Can be used together
    39. 39. Reporting guidelines Terminology Artifacts (includes ontologies) Exchange Formats MIAME Data standards examples
    40. 40. Many microarray transcriptomics standards JAMIA:sea-of-standards
    41. 41. www.cdisc.org RESOURCES Minimum Information for Biological and Biomedical Investigations biosharing.org/
    42. 42. But it isn’t just about reproducibility… It’s about
    43. 43. Data reuse? www.erp-recycling.org
    44. 44. Ontologies as a tool for unification (diagram: disease-phenotype databases, disease phenotype ontology, expression data, gene function data, cell and tissue ontology, GO annotations)
    45. 45. Ontologies classify data in multiple ways. For example, there are many useful ways to classify organism parts: its parts and their arrangement; its relation to other structures (what is it part of, connected to, adjacent to, overlapping?); its shape; its function; its developmental origins; its species or clade; its evolutionary history. Cajal 1915, "Accept the view that nothing in nature is useless, even from the human point of view." http://www.boloncol.com/images/stories/boletin19/cajal16.jpg
    46. 46. Cross-species phenotype ontologies aid candidate gene identification for undiagnosed diseases. Human disease: PFEIFFER SYNDROME. Most similar mouse model: CD1.Cg-Fgfr2tm4Lni/H. Mouse phenotypes: shortened head MP:0000435, malocclusion MP:0000120, ocular hypertelorism MP:0001300, short maxilla MP:0000097, premature suture closure MP:0000081. Human phenotypes: Brachyturricephaly HP:0000244, Hypoplasia of the maxilla HP:0000327, Dental crowding HP:0000678, Hypertelorism HP:0000316, Coronal craniosynostosis HP:0004440.
    47. 47. Data Sharing Mandates
    48. 48. How can I make my data reusable? There are tools to help!
    49. 49. Tools for research management RESOURCES www.labguru.com www.labarchives.com
    50. 50. Data management plan tool RESOURCES https://dmp.cdlib.org/
    51. 51. What to do with data? Storage: back up in multiple locations (local hard drive, removable storage, shared network, cloud server). Versioning: file name versioning, Dropbox, version control software (CVS, SVN, Git). Publication: data sharing repositories (local repository, domain specific, generic public repository).
    52. 52. Computing in the cloud
    53. 53. Uniquely identifying data  Digital Object Identifier (DOI)  Uniform Resource Identifier (URI) www.flickr.com/photos/pmeimon
    54. 54. Data journals and repositories RESOURCES figshare.com datadryad.org thedata.org n2t.net/ezid www.dataone.org data.rutgers.edu/ nature.com/scientificdata/
    55. 55. 3 | Your impact
    56. 56. Thinking Beyond the PDF Raw Science Small publications Self-publishing Datasets Code Experimental design Argument or passage Blogging Microblogging Comments & Reviews Annotations Single figure publications Nanopublications
    57. 57. Who are you?
    58. 58. Services to identify yourself and your impact RESOURCES ImpactStory impactstory.org www.plumanalytics.com orcid.org
    59. 59. Alternative publishing mechanisms RESOURCES rubriq.com scalar.usc.edu thedata.org
    60. 60. http://theconversation.com/scientists-must-share-early-and-share-often-to-boost-citations-18699
    61. 61. Citing products of your research
    62. 62. 4 | Hands On
    63. 63. What is your scientific footprint?
    64. 64. 5 | Making it Matter
    65. 65.  Legitimate, citable products of research  Same importance as traditional citations  Data management is central Data
    66. 66. Data citation principles. http://thedata.org/files/thedata_new2/files/datacitationprinciples-datacite.pdf
    67. 67. Data Management 101 libguides.ohsu.edu/data
    68. 68. Thank you!
