British Library Datasets Programme Feb 2011


Presentation to JISC Repository Support Programme Winter School, Feb 2011.

  • Name of person presenting the data?
  • Intro to British Library. Facts.
  • [Slide to remind people we are focusing on research data] When you read a research article, you are reading someone’s interpretation of some underlying evidence. And that’s a subjective interpretation. When we talk about data, we are really talking about the solid evidence that underpins these research articles.
  • Data is the foundation for research. It is an essential component of the scientific record. It is time-consuming and costly to produce, and re-acquisition may be impossible. It is therefore essential that it is preserved and shared.
  • Despite the importance of this data, there is… As a result, datasets are … Many researchers are willing to share their data, but … . This is like grey literature in the 1980s and early 1990s, before the web. If you wanted a conference paper, you either had to show up in person or know someone who was there. This split research institutes and universities into haves and have-nots. You had to be on the ‘secret paper passing network’ to learn about the hottest, latest research. This means that almost 80% of researchers put their datasets where? On their laptops, desktops, desk drawers, departmental servers. This is not a serious way to run a serious business! Management practices vary tremendously: some researchers have good practices, but many do not, placing the data at risk. 88% said they would make data available and 43% expressed the need to access others’ data. Researchers who produce the essential data that drive new science are often unrewarded, and data centres have considerable challenges justifying their budget and existence. And how is the state of resource discovery for datasets? UKRDS Study: (1) Data is difficult to retain and manage once project funding ceases (compare grey and published articles). (2) Only 12% do not make their data available, but informal networks are predominant. (3) 43% expressed the need to access others’ data.
  • As a result, datasets are: Difficult to discover Difficult to access In danger of being lost This is widely recognised…
  • [The Economist - citation appears in print and web versions, so not to save space] Good luck finding the data! Cannot: - Validate the author’s claims - Investigate the data for other interesting facts…
  • I am here to tell you about the Datasets Programme, which has come about because of rapid changes in the digital landscape. People are generating and sharing ever increasing volumes of data. We refer to collections of data as datasets. While the nature of datasets varies across disciplines, researchers within each discipline typically agree on what constitutes a dataset for them. Examples of datasets include (1) volcanic data, (2) a sound archive, (3) a cluster of chromosomes inside a breast cancer cell, and (4) a UK poll of voting intention (blue Conservative, red Labour, yellow Liberal Democrat). Within the Datasets Programme, we consider a dataset to be an organised collection of digital objects that is produced or consumed during research. We emphasise the role that the dataset plays in the research activity, its importance to researchers, its impact, and its potential for reuse. Despite the differing nature of datasets, many of the services required by researchers are shared, such as methods of citation, discovery, and preservation.
  • The current situation for data isn’t good. Articles are well catered for by libraries and publishers. The underlying data is being neglected. Unsatisfactory.
  • We can even ask the question: are datasets first-class citizens in the record of science? Contrast this situation with the one that we have for research articles. Libraries ensure long-term storage and management of articles. There are well-established services for giving access to articles. Nearly all published articles are held in multiple national libraries. Articles and citations form the backbone of impact analysis of researchers. Catalogues and full-text search support discovery. Clearly, this is an untenable situation and we need to take action!
  • The datasets programme has been established to explore how the Library can help… Not only do we want to ensure data is preserved, we envision a future where… Our approach is to foster collaboration and…
  • How can we achieve this? We are working on a number of projects – see www.bl.uk/datasets
  • DataCite 2 We see Persistent identification as a key component for this…
  • So what can organisations like the British Library do to help address these issues? Libraries have a reasonable level of credibility with identifiers and metadata to enable discovery and enhance access. We are cross-discipline, have established relationships with publishers, universities, researchers and funders, and play a core role in the national research infrastructure. We feel that we can address some of the barriers that we are seeing to data citation. We are clear that we do not want to re-invent the wheel and that we want to ensure that the right incentives are there.
  • DataCite 3 The approach that DataCite is taking – using DOIs - has some important social benefits. Researchers, authors, publishers are comfortable, understand, and know how to use them. They put datasets on a level playing field with articles. [Add citation of data in an article… REAL ONE!]
  • Example Project 1 – DataCite. Our long term vision is to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence. Members: Germany - TIB; Germany - Gesis Leibniz Institute; Germany - German Library of Medicine; United Kingdom - The British Library; France - INIST; Switzerland - ETH Zürich; Netherlands - TU Delft; Denmark - TIC; Canada - CISTI; Australia - ANDS; USA - CDL; USA - Purdue.
  • Today we will be talking about DataCite, an international association of 15 organisations, founded at the British Library in December 2009. We have just had our 1 year anniversary. We are working together to…
  • What is a DOI? A unique identifier, similar in concept to an ISBN. It consists of a prefix and a suffix.
  • (NOTE – this DOI will not resolve!)
  • Built a service for minting DOIs. This is what we will tell you about today, BUT FIRST we will quickly introduce DOIs.

Transcript

  • 1. British Library Datasets Programme JISC RSP Winter School February 2011 Max Wilkinson
  • 2. Today’s Talk
      • The British Library
      • Data in scholarly communication
      • The problem with data
      • The Datasets Programme
        • Vision
        • Strategy
        • Activity (DataCite)
      • Other Projects
  • 3. The British Library
    • Exists for everyone who wants to do research – for academic, personal, and commercial purposes.
    • Covers all subject areas – sciences, technology, medicine, arts, humanities, social sciences…
    • Receives a copy of every item published in the UK.
    • Holds over 150 million items, with 3 million items added each year.
    • Used by over 16,000 people each day (on site and online).
  • 4. The British Library: some facts and figures
    • Helping people advance knowledge to enrich lives.
    • British Library Act 1972: National centre for reference, study, bibliographical and other information services, in relation both to scientific and technological matters, and to the humanities.
    • Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004): UK research base must have ready and efficient access to information of all kinds, such as experimental data sets, journals, theses, conference proceedings and patents. This is the life blood of research and innovation.
    • The largest document supply service in the world. Secure e-delivery and ‘just in time’ digitisation enable desktop delivery within 2 hours.
    • GIA Funding 08/09: £94.8m operational, £12m capital. Other funding secured 07/08: c.£33m.
    • National library of the UK. Serves researchers, business, libraries, education & the general public.
    • Collection includes over 2m sound recordings, 5m reports, theses and conference papers, and the world’s largest patents collection (c.50m).
    • 3 main sites in London and Yorkshire. Circa 2,000 staff.
    • Business and IP Centre: Providing inspiration, and enabling protection of creative capital and business development.
    • Generates value to the UK economy each year of 4.4 times public funding.
    • Collection fills over 600km of shelving and grows at 11km per year.
    • 70 TB of digital material through voluntary deposit.
  • 5. Who do we serve?
      • The Researcher – We provide access to research level materials to all sectors including academia, industry, government, charities and NGOs.
      • Business - The British Library also has a critical role supporting businesses of all sizes, from individual entrepreneurs through to major organisations.
      • The Learner - We have an important role to play in supporting education from primary schools to developing future researchers of any age.
      • The Library Community – We play a key role in supporting the wider UK Library Community and information network.
      • The General Public - The services we offer include exhibitions and events, tours and web services which digitally showcase our collection.
  • 6. Modern science relies on good data
  • 7. Scholarly record Discovery Access Record Permanence Citation Metadata Exposure Trust Fabrics Copyright Scholarly record
  • 8. The Foundation for Research
    • Data is a crucial component of the scholarly record.
    • Re-acquisition may be impossible
    • Datasets are essential to the British Library’s mission to advance the World’s knowledge.
  • 9. Current Situation
    • No effective way to link between datasets and articles;
    • No widely used method to identify datasets;
    • No widely used method to cite datasets.
  • 10. As a result…
    • Datasets are:
    • Difficult to discover
    • Difficult to access
    • In danger of being lost
  • 11. Difficult to Discover. Good luck finding the data! “Source: Committee on Climate Change”
  • 12. Data are diverse in the Digital Landscape
    • Seismic measurements taken by a geologist.
    • An audio archive of birdsong created by an ornithologist.
    • Genetic data collected by a medical researcher.
    • A survey of public opinions collected by a sociologist.
  • 13. Re-join the gap…
      • (No) effective way to link between articles and datasets
      • (No) widely used method to identify datasets
      • (No) widely used method to cite datasets
    Articles Underlying data
  • 14. Datasets – first class citizens?
      • Data is difficult to manage after project funding ceases
      • Informal networks provide the primary means of sharing
      • Only 21% use a national or international facility
      • Datasets are not included in impact analysis
      • Good luck finding it or getting permission to use it (your discipline may vary)
    Source: UKRDS Study: The Data Imperative. Managing the UK’s research data for future use (Feb 2009)
  • 15. Scholarly record Discovery Access Record Permanence Citation Metadata Exposure Trust Fabrics Copyright Scholarly record
  • 16. Research training based on scholarly communication Discovery Access Record Permanence Citation Metadata Exposure Trust Fabrics Copyright Scholarly record Rarely includes data
  • 17. Scholarly communication requires intellectual exchanges Discovery Access Record Permanence Citation Metadata Exposure Trust Fabrics Copyright Scholarly record No such data fabric
  • 18. Scholarly discourse requires a record and provenance Discovery Access Record Permanence Citation Metadata Exposure Trust Fabrics Copyright Scholarly record Almost non-existent for data
  • 19. The Datasets Programme
    • We envision a future where researchers can:
    • Discover, access, reuse, and reference datasets.
    • Track the impact of the data that they generate and receive appropriate credit.
    • Our approach is to:
    • Provide a focus for the community to establish needs, requirements and agreement.
    • Explore novel technology and creative solutions.
  • 20. Two key concepts
    • INCENTIVE
    • SUSTAINABILITY
  • 21. Projects and activities: www.bl.uk/datasets. Follow us on Twitter @datasetsBL
  • 22. A Key Component for Many Goals Persistent Identification Make Visible Find Access Track Impact Verify Reuse Cite ?
  • 23. Citation using Digital Object Identifiers (DOIs)
      • Dataset
      • G. Yancheva, N. R. Nowaczyk et al (2007)
      • Rock magnetism and X-ray flourescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA
      • Article Citation
      • G. Yancheva, N. R. Nowaczyk et al (2007)
      • Influence of the intertropical convergence zone on the East Asian monsoon
      • Nature 445, 74-77
    How to reference: Published Article (Abstract or full text), doi:10.1038/nature05431. The DOI system offers an easy, internet-actionable way to connect the article citation with the underlying publication. But a complete scholarly record would also link to the evidential datasets and their location, e.g. PANGAEA.
  • 24. doi:10.1038/nature05431 leads to a landing page
  • 25. Connecting an Article with the Underlying Data
    • Digital Object Identifiers (DOIs) offer a solution
    • Most widely used identifier for scientific articles
    • Researchers, authors, publishers know how to use them
    • Put datasets on the same playing field as articles
      • Dataset
      • Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA.
      • doi:10.1594/PANGAEA.587840
    • URIs are commonly used but can decay
    • (e.g. Wren JD: URL decay in MEDLINE- a 4-year follow-up study . Bioinformatics. 2008, Jun 1;24(11):1381-5).
     
  • 26. doi:10.1594/PANGAEA.587840
  • 27. Dataset citation using Digital Object Identifiers (DOIs)
    • Scholarly record is complete
      • Dataset
      • G. Yancheva, N. R. Nowaczyk et al (2007)
      • Rock magnetism and X-ray flourescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA
      • doi:10.1594/PANGAEA.587840
      • Article
      • G. Yancheva, N. R. Nowaczyk et al (2007)
      • Influence of the intertropical convergence zone on the East Asian monsoon
      • Nature 445, 74-77
      • doi:10.1038/nature05431
    Data Citation
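    The dataset citation layout used on these slides (creators, year, title, publisher, DOI) can be sketched in a few lines of Python. This is only an illustration: the `DatasetCitation` class and its field names are hypothetical and are not taken from DataCite's metadata schema or service.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Hypothetical container for the citation elements shown on the slides."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, without the "doi:" label

    def format(self) -> str:
        # Layout assumed from the slide example:
        # Creators (Year). Title. Publisher. doi:<DOI>
        return (
            f"{self.creators} ({self.year}). {self.title}. "
            f"{self.publisher}. doi:{self.doi}"
        )


citation = DatasetCitation(
    creators="Yancheva et al",
    year=2007,
    title="Analyses on sediment of Lake Maar",
    publisher="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)
print(citation.format())
```

    Keeping the DOI as a separate field, rather than embedding it in free text, is what lets the same record be cited, linked and tracked.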
  • 28. Projects – DataCite
    • DataCite is an international consortium which aims to:
    • Establish easier access to scientific research data on the Internet
    • Increase acceptance of research data as legitimate, citable contributions to the scientific record
    • Support data archiving that will permit results to be verified and re-purposed for future study.
  • 29. DataCite
    • Support researchers by enabling them to locate, identify, and cite research datasets with confidence
    • Support data centres by providing persistent identifiers for datasets, workflows and standards for data publication
    • Support publishers by enabling research articles to be linked to the underlying data
    DataCite : Data Centres :: CrossRef : Publishers
  • 30. Digital Object Identifier (DOI)
    • doi:10.4124/0003.569 (prefix: 10.4124, suffix: 0003.569)
  • 31. DOI prefix
    • doi:10.4124/0003.569 (prefix: 10.4124, suffix: 0003.569)
    • The British Library provides data centres with a unique prefix for DataCite DOIs
      • For example, Archaeology Data Service uses 10.5284
  • 32. DOI suffix
    • doi:10.4124/0003.569 (prefix: 10.4124, suffix: 0003.569)
    • Suffix generated by the data centre
    • Guidelines for DOI syntax are being developed
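    The prefix/suffix structure described on slides 30 to 32 can be illustrated with a short Python sketch. It assumes only what the slides state, plus the general DOI convention that the prefix begins with "10." and is separated from the suffix by the first slash; the function name `split_doi` is hypothetical.

```python
from __future__ import annotations


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    cleaned = doi.strip()
    # Drop an optional "doi:" label, as used on the slides.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Only the first "/" separates prefix from suffix; later ones stay in the suffix.
    prefix, sep, suffix = cleaned.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


print(split_doi("doi:10.4124/0003.569"))
print(split_doi("doi:10.1594/PANGAEA.587840"))
```

    The suffix is left untouched because, as slide 32 notes, it is generated by the data centre and its syntax is not fixed here.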
  • 33. Resolving a DOI
    • doi:10.4124/0003.569 (prefix: 10.4124, suffix: 0003.569)
    • Resolving the DOI:
    • http://dx.doi.org/10.4124/0003.569
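    Turning a DOI into a resolvable address, as on this slide, amounts to appending it to the resolver's base URL. The Python sketch below shows that step only; the function name `doi_to_url` is hypothetical, the base URL is the one shown on the slide, and no network request is made.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address as shown on the slide


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI without contacting the resolver."""
    cleaned = doi.strip()
    # Drop an optional "doi:" label.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, keeping "/" literal.
    return RESOLVER + quote(cleaned, safe="/")


print(doi_to_url("doi:10.1594/PANGAEA.587840"))
```

    Because only the resolver knows where the landing page currently lives, a citation that carries the DOI keeps working even if the data centre moves the dataset.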
  • 34. DOIs resolve to an open landing page
  • 35. DataCite Service
    • Built a service for data centres to mint DOIs for datasets and store associated metadata (http://api.datacite.org)
    • British Library is trialling the service with several UK data centres, including:
  • 36. Projects and activities: www.bl.uk/datasets
  • 37. SageCite: Data citation in bioinformatics workflow
    • Sage Bionetworks data capture and analysis workflow (Taverna: myExperiment)
    • Data Citation service integration points and citation targets (e.g. data-models)
    • Recommendations
    • Benefits analysis
    SageCite: Integration of data citation services into multi-contributor bioinformatics workflow. Establishing data attribution and credit mechanisms. ► INCENTIVE. Sage Bionetworks: Aggregating datasets from contributors to create massive coherent datasets that can be used for systems-level analysis of disease.
  • 38. Dryad UK: Repository sustainability
    • Expand Publisher base
    • Seamless integration into publisher workflow
    • Sustainability models for datasets supplementary to publication
    Dryad UK: Define a business case and pilot service integrating DataCite DOIs and dataset archiving into publisher workflows. ► SUSTAINABILITY. Leveraging the Dryad Consortium, which is addressing the acquisition and storage of long-tail supplementary data.
  • 39. For more information on the BL Datasets Programme
    • Max Wilkinson: Programme Manager; Datasets
    • Email: [email_address]
    • Website: www.bl.uk/datasets
    • Follow us on Twitter @datasetsBL