The Past, Present and Future of Data



Opening keynote for NZ eResearch Symposium 2010. Discusses the past, present and future of data.

  • Start with a question.
    What is the difference between these?
  • And these?
  • Thanks to machines like these, we now know that at the genetic level…
  • It’s only 1% of this.
    But that’s just genetics. What about culture?
  • We now know that a range of species (including crows!) are tool users, and they pass on particular techniques
    Think of this as a chimpanzee tutorial…
  • But this sort of transmission of culture doesn’t transcend either time or space. You need to be in the same time and place to learn.
  • For our species one of the big breakthroughs was the development of language. This now allowed for easier transmission than show and tell, but still didn’t address the time and space problem.
    So, where am I going with this? To data of course…
  • These are data from 7,000 BCE
    Each token is a particular value
    Initially they were used on their own (a bit like coins today)
  • Then around 4,000 BCE we see the emergence of these: bullae
    Explain: Seal (identify), signs for what is traded, contents as tokens.
    Essentially the first written contracts
  • To avoid having to literally break the contract to see what numbers it contained, the next step was to provide a representation on the outside.
    Then in 2900 BCE some genius made the crucial conceptual leap: if we have the numbers in symbolic form on the outside, do we need them in physical form on the inside? Answer: No, and so we get clay tablets. And those strange marks next to the numbers? The very beginnings of pictographic writing…
  • And then very quickly, the first libraries.
    I don’t have time to cover the entire history of writing, but just want to make the point that writing came from the need to capture and manage data. Or to put it another way, much of what we regard as civilisation started with accounting. Any accounting graduates in the audience?
  • So, let’s fast forward about 45 centuries to the present and look at the state of data in scholarly communication. Unfortunately, it’s inconvenient, imprisoned, invisible, inaccessible, and ignored
  • Need to retype
  • Near impossible to liberate. Talk about ChemXSeer example and DataThief Java application
  • Too transformed
  • A discipline scientist may know how to get these data, but I don’t
  • The only journal like this that I know of. Anecdotal evidence suggests it is hard to get papers with negative results published
    All of the above problems are really about difficulties in getting to the data so it can be re-used.
    But why would you want to re-use data?
  • NOTE: Some of these arguments apply at the individual, national, and global levels
    Efficiency – don’t reinvent wheel
    Validation – repeatability of research
    Integrity – of scholarly record
    Value for Money – public money funded it, so it should be available to the public (ClimateGate!)
    Self-interest – sharing with a future self, greater visibility

    So, what are some good stories around data sharing?
  • Hubble Space Telescope (HST) operating since 1990
    Observations are proposed, and if accepted, data is collected and made available to the proposers – who then write a research paper
    Each year around 1,000 proposals are reviewed and approximately 200 are selected, for a total of 20,000 individual observations
    Data is stored at the Space Telescope Science Institute and made available after an embargo period
  • GO = General Observation program
    AR = Archival Reuse
  • From Wikipedia: “A DNA microarray is a multiplex technology used in molecular biology. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles (10⁻¹² moles) of a specific DNA sequence, known as probes (or reporters). These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. Since an array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel. Therefore arrays have dramatically accelerated many types of investigation.”
  • Heather Piwowar looked at the citation history of cancer microarray clinical trial publications
    Found that publicly available data was associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin

  • Climate researchers need to be able to run their models forward (forecasting) and backwards (backcasting) to check that they are correct.

    The southern limit of whaling is constrained by sea ice, and since 1931 whaling records have been collected for every whale caught. This paper took these records and used them to reconstruct the position of the sea-ice edge over time.

    His analysis indicates that the Antarctic summer sea-ice edge has moved southwards by 2.8° of latitude between the mid 1950s and early 1970s

    This suggests a decline in the area covered by sea ice of some 25%
  • A number of initiatives around the world are working to do a better job with data: NSF DataNet (Bill at end of conference), JISC Managing Research Data, NL SURF/DANS
  • I want to talk about one from New Zealand’s West Island…
  • 28
  • So, how are we doing this? We’ve got a whole series of programs of activity, but one way to visualise the infrastructure that is needed is to distinguish…
  • The current picture for Australian (and other) research data
  • The components that ANDS is adding to produce the ARDC
  • So, if that is a partial view of the present (Bill will tell you more tomorrow, I’m sure), what about the future?
  • Talk about how ANDS was a founding member of DataCite. TIB in Germany was another founding member and is providing the data DOIs for this example
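    (Aside, not from the talk: a DataCite data DOI behaves exactly like an article DOI — it resolves through the standard doi.org proxy, which redirects to the repository's landing page for the dataset. The helper function below is my own illustrative sketch; the DOI used is one cited on an earlier slide.)

```python
# Minimal sketch: a dataset DOI, like an article DOI, is resolved by the
# standard doi.org proxy, which redirects to the data's landing page.
# This helper only builds the resolver URL; it does not fetch anything.

def doi_to_url(doi: str) -> str:
    """Return the doi.org resolver URL for a DOI."""
    return f"https://doi.org/{doi}"

# An article DOI cited on an earlier slide; a DataCite data DOI resolves the same way.
print(doi_to_url("10.1098/rsta.2005.1569"))  # -> https://doi.org/10.1098/rsta.2005.1569
```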
  • So, to conclude:
    The need to manage data is not just a modern problem – it drove crucial developments in Western civilisation nearly 9,000 years ago
    For most of the last two hundred years, data has largely been the neglected stepdaughter in scholarly communication, eclipsed by its more glamorous sister the journal article. And I’ve reviewed some of the attendant problems arising from this
    Two things are driving a change in this approach: the shift to more data-intensive research and growth in information systems that can better manage and make available the underlying data
    I showed you some of the bits of the future that are starting to appear – forerunners of the way the research world might look for many disciplines in the next 10-20 years
    Or to put it another way, data is what helped to make it possible to go from this <click>
  • Thanks to all those who made their images available under CC licensing for re-use
  • And thank you for the opportunity to speak to you this morning.
  • The Past, Present and Future of data

    1. The Past, Present and Future of Data Dr Andrew Treloar Director of Technology Australian National Data Service
    2. THE PAST
    3. THE PRESENT
    4. Inconvenient data DOI: 10.1098/rsta.2005.1569
    5. Imprisoned data DOI: 10.1098/rsta.2006.1793
    6. Invisible data DOI: 10.1098/rsta.2006.1793
    7. Inaccessible data
    8. Ignored negative data
    9. Why re-use data? • Efficiency • Validation • Integrity • Value for money • Self-interest
    10. Cancer Micro-array trials
    11. Piwowar et al., “Sharing Detailed Research Data Is Associated with Increased Citation Rate”
    12. Climate Archæology de la Mare, William K., 1997, "Abrupt mid-twentieth-century decline in Antarctic sea-ice extent from whaling records", Nature, vol. 389, pp. 87–90, 4 September 1997
    13. ANDS
    14. Australian National Data Service (ANDS) • An initiative of the Australian Government being conducted as part of the National Collaborative Research Infrastructure Strategy ($A24M) and the Super Science Initiative ($A48M) • A collaboration between Monash University, the Australian National University and CSIRO • Nearly 50 staff, funded to mid 2013 • More researchers re-using more data more often • Data as a first-class object
    15. Unmanaged
    16. Managed
    17. Disconnected
    18. Connected
    19. Invisible
    20. Findable
    21. Single use
    22. Reusable
    23. THE FUTURE
    24. “The future is already here – it’s just not very evenly distributed” William Gibson
    25. Create: Open Notebook Science
    26. Describe, Store: PIC Cloud Demo
    27. Discover, Access: RDA Demo
    28. Identify: Journal Demo • “Elsevier and PANGAEA (Publishing Network for Geoscientific & Environmental Data) announced their next step in interconnecting the diverse elements of scientific research. Elsevier articles at ScienceDirect are now enriched with graphical information linking to associated research data sets that are deposited at PANGAEA. This enrichment functionality offers a blueprint of how Elsevier would like to work with data set repositories all over the world [emphasis added].”
    29. CONCLUSION
    30. 2001 • d=annotation_701469&v=TSW69UwxKbU&feature=iv – 5:04 through 6:00
    31. Acknowledgements • NASA/courtesy of • Clip of 2001 shown in accordance with section 47(2) of Copyright Act 1994 No 143 (as at 07 July 2010)
    32. Questions/Links