This presentation will describe two studies undertaken to build two separate data catalogs: the first for NIH-funded datasets and the second for institutional datasets created within an academic medical center.
To inform the creation of an NIH data catalog, the purpose of the first study was to a) develop a set of minimal metadata elements used to describe datasets, and b) carry out an analysis to identify datasets in NIH-funded research articles that do not provide an indication that their data has been shared in a data repository. This study served as the foundation for developing an index of all NIH-funded datasets, and showed which repositories researchers use most often to share their data.
The second study was spurred on by the first, and involved interviewing institutional faculty members and researchers to learn more about how they collect data, what challenges they face when collecting data, whether they’ve thought about sharing data, and what they would find most useful from an institutional data catalog. The results of this study informed the workflows, metadata creation, and requirements for building a data catalog within the medical center. Additionally, interview responses were used to further inform the data services provided by the health sciences library, including education, research consultations and clinical quality improvement initiatives.
Both studies provide examples of how a librarian working in the health sciences can contribute to, and participate in, data-related services within their institution.
1. Table of Contents
I. NIH Data Discovery Index
• Methodology
• Findings
• Questions raised
II. Institutional Data Interviews
• Methodology
• Findings
III. Outcomes
• Benefits to the library
By: Charles Dickens
3. NIH Big Data to Knowledge (BD2K)
Facilitating Broad Use of Biomedical Big Data
4. NIH Data Discovery Index
Datasets are CITABLE
Datasets are DISCOVERABLE
Datasets are LINKED TO THE LITERATURE
Datasets are PART OF THE RESEARCH ECOSYSTEM
5. NIH Data Sharing Repositories
http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
12. 383
What category of dataset was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new dataset(s) were created, what type(s) of data were collected?
What existing dataset(s) were used, if any?
How many datasets are there in each article?
13. Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
19. Data Types
Image
Genetic or Genomic
Chemical
Biochemical
Electrical (Electrophysiological)
Optical – non-image
Behavioral
Computational Simulation or model
Magnetic Resonance – non-image
Structural
Physiological
Questionnaire/Survey
Clinical Measures
Geospatial
25. Book of the Second
Understanding institutional data challenges via interviews
26. Institutional Data Catalog
• Organize and describe institutional research data
• Promote collaboration within the institution
• Promote a culture of sharing and transparency
27. Methodology
• Literature review
• Identify researchers/PIs using active grant system
• Analyzed datasets in researchers’ papers before interviews
– Used NIH Data Discovery Index method
29. Data Interviews
[Bar chart: Challenges Organizing Data – Basic Science Researchers. Scale: 0 to 7. Categories: Postdocs or student leaves with data; Lack of standards/procedures; Size of data; Messiness/Disconnect between datasets; Too challenging.]
30. Data Interviews
[Bar chart: Challenges Preserving Data – Basic Science Researchers. Scale: 0 to 6. Categories: Storage expense; Changes in software; Lack of IT resources; Lack of preservation procedures (readme, plans, postdoc etc.); Data in multiple storage locations; Storage space.]
31. Data Interviews
[Bar chart: Challenges Organizing Data – Clinical Researchers. Scale: 0 to 6. Categories: Data quality; Messiness/Disconnect between datasets; Poor data output formats; Can't search data; Data loss; Team miscommunication on who's using data.]
32. Data Interviews
[Bar chart: Experience with Data Sharing. Scale: 0 to 9. Series: Basic Science; Clinical. Categories: Collaboration only; Unknown parties; Data repository; General public; Primary results only; Do not share.]
33. Only the best of times…
How the library benefitted from this exercise
38. Acknowledgements
BD2K Project
• Lou Knecht, Jim Mork, Kathel Dunn, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Dr. Donald Lindberg
Annotators
• Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott
40. Images
Ponderings for All Things Blog. 2010. Available from: http://ponderingsofallthings.blogspot.com/2010/05/tale-of-two-cities-charles-dickens.html
Reading Charles Dickens Blog. Manette in Bastille. 2012. Available from: http://readingcharlesdickens.com/wp-content/uploads/2012/07/Manette-in-Bastille-253x300.jpg
Grandma’s Graphics. Old Scrooge sat busy in his counting-house. 2000. Available from: http://www.grandmasgraphics.com/graphics/childrens/childrens379_2000.jpg
Sungardas Blog. Apple to Orange. 2010. Available from: http://blog.sungardas.com/wp-content/uploads/Apple-to-Orange.jpg
Patel R. Questions?. Flickr. 2007. Available from: https://www.flickr.com/photos/23679420@N00/545653437/
Biomedical Engineering Laboratory. Wikimedia. 2012. Available from: http://upload.wikimedia.org/wikipedia/commons/a/a3/Biomedical_Engineering_Laboratory.jpg
Keeping in mind some of the issues we faced with data description, and the lack of standards for developing metadata for biomedical data sets, I am now going to switch gears to talk about the second component of my project, which involved searching PubMed and PMC for data sets that have not been deposited in a repository. The exercise was worthwhile: it showed how much work would be required to describe all the data sets created in an article, and whether there was a sufficient way to describe the different types of data that are created.
This slide shows the stages of exclusion we went through to arrive at the final sample we analyzed.
It is important to mention here that although the search found a large number of name variations among the exclusions, only 230 articles in total were excluded once the results were overlapped with PubMed’s search.
XML keyword exclusions. There is an issue here: you’ll notice the bar on the right is titled “Multiple Keywords”. This is unfortunate, because whenever an article mentioned more than one repository it was counted as a multiple rather than towards each repository itself. What is interesting about these multiples, however, is that a number of articles mentioned several repositories.
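To make the counting issue concrete, here is a minimal sketch of the two ways of tallying mentions. It is not the script used for the study: the article IDs, the repository names chosen, and both function names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: each article maps to the repository names found in its XML.
# The IDs and matches below are invented for illustration only.
article_mentions = {
    "article-1": ["GenBank"],
    "article-2": ["GenBank", "dbGaP"],
    "article-3": ["dbGaP", "GenBank", "ClinicalTrials.gov"],
    "article-4": [],
}


def tally_lumped(mentions):
    """Count as the chart did: an article naming 2+ repositories goes into one bucket."""
    counts = Counter()
    for repos in mentions.values():
        unique = set(repos)
        if len(unique) > 1:
            counts["Multiple Keywords"] += 1
        elif unique:
            counts[next(iter(unique))] += 1
    return counts


def tally_per_repository(mentions):
    """Alternative: credit every repository an article names, once per article."""
    counts = Counter()
    for repos in mentions.values():
        counts.update(set(repos))
    return counts


lumped = tally_lumped(article_mentions)
per_repo = tally_per_repository(article_mentions)
print(lumped)
print(per_repo)
```

With the lumped tally, two of the three articles that mention GenBank disappear into “Multiple Keywords”; the per-repository tally credits each repository directly, at the cost of the totals no longer adding up to the number of articles.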
Added phrases from the 45 repository names to the Compound Word Dictionary (aka Phrase List), if not already present in the indexes for a database. For PMC, that means any mention in the full text will now be a search access point (retrievable); BUT the re-indexing of PMC has not yet been done, so none of these phrases are searchable yet. I can’t get a definite date out of NCBI; I will search every Monday until I see one of the phrases post (this is my test search because I have a PMCID where it occurs in the acknowledgements: Neuroimaging Informatics Tools and Resources Clearinghouse). For PubMed, that means any mention in the citation (essentially the article title or abstract) will get retrieved. Kathi Canese told me it was implemented for PubMed this week (after last weekend’s full re-index of the database), but I can’t get my test search to retrieve in PubMed even though I know 5 citations have that phrase. So I’ll wait for another weekend re-indexing and test again next week before reporting this to Kathi. This has nothing to do with the SI (Secondary Source Identifier) field, aka DatabanksList, where LO picks up mention of only certain repositories from the full text article. We may be expanding the list of SI databanks for the coming indexing year, but no decision on that yet. A request for advice has been in with Dennis Benson, NCBI, since August 6.
We had 30 NLM staff and BD2K staff look at the 383 articles – 25 articles each, with two people looking at the same 25 for validation.
This slide illustrates the different measurements and data collection steps that occur within a single article, and it exemplifies the complexity of the data we are working with.
Hard to imagine how some data would be repurposed – e.g., virology, basic science. Access to data is necessary for validation and reproducibility, but it is VERY difficult to use without the accompanying article that provides context, describes methodology, and provides the logic for drawing conclusions from multiple data sets. Is the publication the best route for accessing such data? If so, what does that mean for a data catalog?