2. How do I access the 23 Things?
Overview
• ands.org.au/23-things
How often do I need to do a Thing?
• 1 Thing is released each week
• Complete at your own pace
UWA participation
• Monthly catch-ups in BJM which include a national ANDS webinar
• Discussions in our Google+ UWA-only group
4. Week | Mon | Tue | Wed | Thu | Fri
- | - | 1 March (Kick-off Webinar) | 2 | 3 | 4
1 - Getting started with data | 7 | 8 | 9 | 10 | 11
2 - Issues in research data management | 14 | 15 | 16 | 17 | 18
3 - Data in the research lifecycle | 21 | 22 | 23 | 24 | 25
Easter Week | 28 | 29 | 30 | 31 | 1 April
4 - Repositories for data discovery | 4 | 5 | 6 | 7 | 8
- | 11 | 12 (Webinar Catch-up) | 13 | 14 | 15
5 - Repositories for data sharing | 18 | 19 | 20 | 21 | 22
6 - Long-lived data: curation & preservation | 25 | 26 | 27 | 28 | 29
7 - Data citation for access & attribution | 2 May | 3 | 4 | 5 | 6
8 - Citation metrics for data | 9 | 10 | 11 | 12 | 13
9 - Licensing data for reuse | 16 | 17 | 18 | 19 | 20
- | 23 | 24 (Webinar Catch-up) | 25 | 26 | 27
10 - Sharing sensitive data | 30 | 31 | 1 June | 2 | 3
11 - What's my schema? | 6 | 7 | 8 | 9 | 10
12 - Vocabularies for data description | 13 | 14 | 15 | 16 | 17
13 - Walk the crosswalk | 20 | 21 | 22 | 23 | 24
- | 27 | 28 (Webinar Catch-up) | 29 | 30 | 1 July
14 - Identifiers and linked data | 4 | 5 | 6 | 7 | 8
- | 11 | 12 | 13 | 14 | 15
15 - Data management plans | 18 | 19 | 20 | 21 | 22
16 - What are publishers & funders saying about data? | 25 | 26 | 27 | 28 | 29
- | 1 August | 2 (Webinar Catch-up) | 3 | 4 | 5
17 - Data literacy & outreach | 8 | 9 | 10 | 11 | 12
18 - Data interviews: talk the talk | 15 | 16 | 17 | 18 | 19
19 - Exploring APIs and Apps | 22 | 23 | 24 | 25 | 26
20 - Do it with data! | 29 | 30 | 31 | 1 September | 2
- | 5 | 6 (Webinar Catch-up) | 7 | 8 | 9
- | 12 | 13 | 14 | 15 | 16
21 - Tools of the trade | 19 | 20 | 21 | 22 | 23
22 - What's in a name? | 26 | 27 | 28 | 29 | 30
23 - Making connections | 3 October | 4 | 5 | 6 | 7
- | 10 | 11 | 12 | 13 | 14
- | 17 | 18 (Webinar Catch-up) | 19 | 20 | 21
5. Calendar of Events
Catch-up webinars and topic review at BJM
each month in 2016
• 1 March
• 12 April
• 24 May
• 28 June
• 2 August
• 6 September
• 18 October
9. Getting started with research data
• What "research data" are we talking about?
Thing 1
10. • Open up this record of research data collected during a CSIRO voyage
which explored the sea floor (i.e. the benthic zone) of the Marmion Lagoon,
located just off Perth, in 2007
Thing 1
13. • your favourite research data tech or software story or
experience (e.g. did you compete in GovHack 2015 or
Science Hackfest Melbourne, 4-6 March 2016?)
• a software tool or service for research data you think others
might be interested in
• a question or research data problem to crowdsource a
solution
Thing 1
16. Issues in research data management
Research data is for everyone. Governments and universities
all around Australia and the world are now encouraging
researchers to better manage their data so others can use it.
Research data might be critical to solving the big questions of
our time, but so much data are being lost or poorly managed.
Thing 2
17. Issues in research data management
• https://www.youtube.com/watch?v=66oNv_DJuPc
• As you watch the cartoon, note the data management mistakes which
interest or appal you.
Thing 2
18. Issues in research data management
"Big Data" is a term we're hearing with increasing frequency. Data management
for Big Data brings much complexity: citing dynamic data, software, high volume
compute, storage costs, transfer of petabytes of data, preservation,
provenance and more.
• Read this post and presentation titled "Big Data: The 5Vs Everyone Must
Know".
Thing 2
27. • Laboratory notebooks are used by researchers to formally record their
research activities. As research has become increasingly digital and
collaborative, the utility of traditional hard copy lab notebooks has been
challenged. Not surprisingly, then, electronic lab notebooks (ELNs) have
emerged as an alternative.
• Effective data management for constantly updated data, such as that
within ELNs, is a real challenge for projects that wish to publish their
data during the project.
Thing 2
Issues in research data management
28. • Definition
Electronic lab notebook (ELN) software allows scientists to access, search
and share results of their experiments. An ELN is essentially a computer
program that is meant to replace traditional paper laboratory notebooks so
that scientists and researchers can search their records more easily and have
more efficient means to back up and copy their data onto other electronic
devices. ELNs encourage collaboration, as it is possible for multiple
researchers or scientists to view lab data at the same time. ELNs also have
the capacity to work alongside other research instruments so that additional
data can be incorporated quickly and efficiently. ELNs should be supported by
strong security measures to ensure that the data and the researchers’
process of creating the data are not jeopardized in any way. Additionally, ELNs
should be flexible to change if a particular research process is altered or new
data is required. This flexibility is best addressed when developing the
specific software for an ELN.
Thing 2
Issues in research data management
29. • Read "International team of scientists open sources search for malaria
cure", about how an international team of scientists and citizen scientists
is using open source ELNs to speed up the search for a cure for malaria.
Thing 2
Issues in research data
management
30. • You can see their
open ELNs here
Thing 2
Issues in research data
management
32. Thing 2
Issues in research
data management
http://www.wellcome.ac.uk/News/Media-office/Press-releases/2016/WTP060169.htm
33. Data in the research lifecycle
• Data and its management change over time. Here we look at data and
research lifecycles and make connections between them.
• Data often have a longer lifespan than the research project that creates
them.
• Follow-up projects may analyse or add to the data, and data may be
reused by other researchers.
• Journal publishers are increasingly mandating that the data underpinning
a journal article be retained and made accessible for the long term.
Thing 3
34. Data in the research lifecycle
• A data lifecycle shows the different phases a dataset goes through as the
research project moves from
o "having a brilliant idea" to
o "making ground-breaking discoveries" to
o "telling the world about it"
Thing 3
http://www.data-archive.ac.uk/create-manage/life-cycle
41. Thing 3
Data in the research lifecycle
http://www.dcc.ac.uk/resources/curation-lifecycle-model
• Digital Curation Centre
• Take a look at the DCC Curation Lifecycle Model, which concentrates on
preservation and curation within data management.
42. Thing 3
Data in the research lifecycle
http://www.dcc.ac.uk/resources/curation-lifecycle-model
• What could we add?
45. Thing 3
Data in the research lifecycle
http://www.library.uwa.edu.au/research/services
46. Data Discovery
• Repositories enable discovery of data by publishing data descriptions
("metadata") about the data they hold - like a library catalogue describes
the materials held in a library.
• Most repositories provide access to the data itself, but not always.
Thing 4
47. Data Discovery
• Data portals or aggregators draw together research data records from a
number of repositories.
• e.g. Research Data Australia (RDA) aggregates records from over 100
Australian research repositories.
• https://researchdata.ands.org.au/measuring-effects-human-leptonychotes-weddellii/640511/
Thing 4
56. Data Discovery
Thing 4
DCC checklist for evaluating data repositories
What does this checklist cover and what does it exclude?
Choosing a long-term service to look after data means asking questions similar to
those you ask when choosing a publisher; ‘if I hand this over, will they review it,
safeguard the content, and make sure it is accessible for as long as it is of
value?’ This checklist relates these questions to the following key considerations:
1. Is a reputable repository available?
2. Will it take the data you want to deposit?
3. Will it be safe in legal terms?
4. Will the repository sustain the data value?
5. Will it support analysis and track data usage?
See more at: http://www.dcc.ac.uk/resources/how-guides-checklists/where-keep-research-data#1
57. Contacts
Contact UWA 23 Things Coordinators:
Caroline Clark
caroline.clark@uwa.edu.au
Nola Steiner
nola.steiner@uwa.edu.au
Katina Toufexis
katina.toufexis@uwa.edu.au
Editor's Notes
Data are distinct pieces of information, usually formatted in a special way. Strictly speaking, data is the plural of datum, a single piece of information. In practice, however, people use data as both the singular and plural form of the word. In database management systems, data files are the files that store the database information.
Research data is data that is collected, observed, or created, for purposes of analysis to produce original research results. The word “data” is used throughout this site to refer to research data.
Research data can be generated for different purposes and through different processes, and can be divided into different categories. Each category may require a different type of data management plan.
Observational: data captured in real-time, usually irreplaceable. For example, sensor data, survey data, sample data, neurological images.
Experimental: data from lab equipment, often reproducible, but can be expensive. For example, gene sequences, chromatograms, toroid magnetic field data.
Simulation: data generated from test models where model and metadata are more important than output data. For example, climate models, economic models.
Derived or compiled: data is reproducible but expensive. For example, text and data mining, compiled database, 3D models.
Reference or canonical: a (static or organic) conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals.
Research data may include all of the following:
Text or Word documents, spreadsheets
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Audiotapes, videotapes
Photographs, films
Test responses
Slides, artifacts, specimens, samples
Collection of digital objects acquired and generated during the process of research
Data files
Database contents including video, audio, text, images
Models, algorithms, scripts
Contents of an application such as input, output, log files for analysis software, simulation software, schemas
Methodologies and workflows
Standard operating procedures and protocols
The following research records may also be important to manage during and beyond the life of a project:
Correspondence including electronic mail and paper-based correspondence
Project files
Grant applications
Ethics applications
Technical reports
Research reports
Master lists
Signed consent forms
See what different formats data comes in.
Choose one of the 4 specialised data repositories below, or find another data repository of interest, particularly one in a discipline you are unfamiliar with. Spend some time browsing around your chosen repository to get a feel for the data available.
Think about how the data here differs from data you are familiar with. Consider, for example, format, size and access method.
Share an idea about how cross-disciplinary research could be affected by discipline data conventions, and also one way cross-disciplinary data access can be facilitated.
The researcher could have copied the data from the USB stick to a shareable storage option like AARNet's Cloudstor (first 100 GB free).
As the software was no longer supported, the researcher could have extracted, or tried to extract, the data (in the proprietary format) to another machine readable format (e.g. CSV or XML).
The researcher could have copied the data from the USB stick to a secure and backed up system (most institutions have such systems).
The researcher's continued reluctance to share the data he had collected and his repeated assertions that all the information about the data was in his journal article. This underlined, for me anyway, the researcher's basic lack of understanding surrounding the value and usefulness of the data he collected for other researchers. He failed utterly to consider the possibility that others may not only want to view his data, but actually make use of it in their own research activities. This then led on to all the mistakes he made, such as: failing to abide by his publisher's open access mandate; putting the data on a USB without making any copies and then losing it; and, my favorite, not labeling his fields with useful names and then forgetting what they measured. I suspect this basic lack of understanding is one of the biggest barriers to the practice of good research data management.
- not abiding by funder and publisher rules and regulations for retaining and sharing the research data
- not using open source software that can reduce some of these problems
- not having reader-friendly data, with legends, guidelines and key word definitions
- not thinking about whether the research is replicable before publishing the article
- not thinking about storing the data appropriately for the subject area and the size of the data that the project produces
- not using research infrastructure available to academics either through the university/research institute or online
- not engaging with the wider research community in a productive manner to expand the boundaries of science.
* no catalogue of what the data actually is and what the columns represent in a spreadsheet
* no software that will read the data files
* researchers' general reluctance to share data, thinking that all the information that anyone would/should ever need is in the article
* using USB hard drives to store the data, and then only having one copy.
How to avoid it - good data management:
* multiple copies
* secure, backed-up storage
* abiding by publisher/funder mandates
* sharing on Research Data Australia or a data repository
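As a rough illustration of the "multiple copies" point above, the following Python sketch copies a data file to more than one location and checks each copy against a SHA-256 checksum. The function names, folder names and sample file are hypothetical, and real backups would normally go to separate, institutionally managed storage rather than folders on the same disk.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destinations: list[Path]) -> dict[str, bool]:
    """Copy source into each destination folder and report whether checksums match."""
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        results[str(target)] = sha256_of(target) == expected
    return results


def demo() -> dict[str, bool]:
    """Run the sketch on a small made-up file inside a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "survey.csv"  # hypothetical data file
        source.write_text("site,depth_m\nA,4.2\n")
        return copy_and_verify(source, [root / "backup_1", root / "backup_2"])


if __name__ == "__main__":
    print(demo())
```

Recording the checksum alongside the data also gives later users a way to confirm that a copy has not been corrupted.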
This article uses the 5 Vs (volume, variety, velocity, veracity and value) as a concept for how big data can be managed more successfully.
In late November 2012, the Open Source Malaria (OSM) team gained a new member who lived and worked almost 17,000 kilometers away from the synthetic chemistry hub at the University of Sydney. Of course, collaboration across continents is not unusual for scientists, but until recently, recruitment in less than 140 characters certainly was.
Patrick Thompson—who’d just submitted his PhD thesis at the University of Edinburgh—responded to a Twitter request for synthetic help on an important new target for the team. True to his promise, Patrick later delivered several compounds for biological testing in Dundee, Scotland. Although it turned out that the molecules Patrick made weren’t so good at killing the malaria parasite, these "negative" results provided invaluable data for the team.
Patrick’s contribution would not have been possible in a regular drug discovery program. Veiled in secrecy and often complicated by patents and intellectual property issues, chemists aren’t always the best at sharing their results, at least not until they are published in peer reviewed journals—and sometimes after significant cherry picking. This means that lots of data, especially "negative" data often only resides in piles of dusty paper lab notebooks, hidden from all but the immediate scientific community.
Avoiding the loss of vast quantities of data is just one of the reasons behind the formation of the OSM team. The open source drug discovery project commenced in 2011, when Matthew Todd’s lab received funding from the Medicines for Malaria Venture (MMV) and then from the Australian Research Council in the form of a linkage grant. GlaxoSmithKline (GSK), a leading pharmaceutical company, had just published a revolutionary paper containing potential antimalarial medicines and placed the information into the public domain. This open GSK data was the initial impetus behind the OSM project and led to the team synthesizing and evaluating three different series of compounds.
The laws of open science
The OSM project operates along very similar lines to traditional medicinal chemistry projects in that the team is looking for an antimalarial drug candidate suitable for Phase 1 clinical trials. However, the day to day running of the project works quite differently and is probably most clearly defined by the team’s commitment to The Six Laws of Open Science:
First law: All data are open and all ideas are shared
Second Law: Anyone can take part at any level
Third Law: There will be no patents
Fourth Law: Suggestions are the best form of criticism
Fifth Law: Public discussion is much more valuable than private email
Sixth Law: An open project is bigger than, and is not owned by, any given lab
The team uses online electronic lab notebooks (ELN) to record all experimental procedures and data. This means that anyone with access to the Internet can search for information from the project. All data, results and conclusions are posted in real time—even when things don’t quite turn out as planned! As the team processes and uploads raw data to the ELN, other scientists are free to compare their own data or to draw different conclusions to the OSM team and provide feedback in the comments section below each ELN entry. This means that people can actually use the data generated by the OSM team, for whatever purpose they wish. The transparent nature of the project also means that there should be less room for error and that results could be easily reproduced in other laboratories.
The team is ardently opposed to patents, meaning they may need to navigate murky waters if and when they discover an excellent drug candidate. For decades, patents have been an essential part of the process required for bringing new medicines to the market, but the OSM team hopes to change this model.
"There's a growing number of people questioning whether we need patents for the development of some drugs. Penicillin and the polio vaccine didn't need them. Maybe new medicines for malaria don't either," said Matthew Todd.
Malaria is a catastrophic disease that mainly affects the world’s poorest people and so it is the ideal starting point for an open source drug discovery effort. New medicines for malaria have to be affordable and ideally administered in a single dose. Attempting to profit from those in dire need of life saving medicine would be morally reprehensible, and therefore the team believes it’s time to throw patents out the window and encourage scientists to work together and openly in order to cure malaria as expediently as possible.
Coordinating in the open
The team uses G+, Twitter, and Facebook as social media platforms for the discussion of results, promotion of the science and also (as in the case of Patrick and some other key members of the team) for recruitment of new members. GitHub has proven to be a valuable tool for project organization and discussion. The team avoids email as much as possible in order to facilitate open discussion and garner input from a variety of experts.
Both members of the core team and volunteers regularly update and maintain the project wiki for use by OSM, the wider scientific community, and, of course, interested members of the public. This is just one area of the project where non-specialists are able to contribute and free up the chemists so that they can spend more time at the bench making compounds.
Achieving success
Another great success story for open science and OSM is the collaboration established with a group of 40 Lawrence University undergraduate students. The team at Sydney developed a robust method for the synthesis of a particular family of compounds, which was followed by Stefan Debbert’s lab class using different combinations of related starting materials. The class made lots of new molecules, learned how to prove the structure and purity of their offerings and had fun along the way. They evaluated the molecules for their activity against the malaria parasite and posted all experimental data to the project’s ELN.
There are marked differences between open source science and the original open source movement, but scientists certainly have a great deal to learn from the software community. Open science removes the traditional hierarchy of research and encourages scientists of all levels—student or professor—to engage and contribute. Synthetic chemists need more than just a computer and access to the Web, and of course not just anyone has access to a lab and the skills required to make molecules. However, the OSM team is trying to lower the barrier to participation, while still conducting science of the highest standard. Until open science is just called "science," accelerating the discovery of a cure for malaria and encouraging others to work more openly are the true measures of success for an initiative such as OSM.
DATA
Data, any information in binary digital form, is at the centre of the Curation Lifecycle.
This includes:
Digital Objects: simple digital objects (discrete digital items such as text files, image files or sound files, along with their related identifiers and metadata) or complex digital objects (discrete digital objects made by combining a number of other digital objects, such as websites).
Databases: structured collections of records or data stored in a computer system.
FULL LIFECYCLE ACTIONS
Description and Representation Information: Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand and render both the digital material and the associated metadata.
Preservation Planning: Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.
Community Watch and Participation: Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.
Curate and Preserve: Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.
SEQUENTIAL ACTIONS
Conceptualise: Conceive and plan the creation of data, including capture method and storage options.
Create or Receive: Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.
Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.
Appraise and Select: Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.
Ingest: Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.
Preservation Action: Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.
Store: Store the data in a secure manner adhering to relevant standards.
Access, Use and Reuse: Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.
Transform: Create new data from the original, for example:
by migration into a different format, or
by creating a subset, by selection or query, to create newly derived results, perhaps for publication
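The two kinds of Transform action described above can be sketched in a few lines of Python. This is only an illustrative example under stated assumptions: the sample data, column names and function names are made up, and CSV and JSON are stand-ins for whatever formats a real project uses.

```python
import csv
import io
import json

# Hypothetical sample dataset, held in memory as CSV text.
SAMPLE = "site,year,depth_m\nA,2007,4.2\nB,2007,6.8\nA,2008,4.5\n"


def migrate_csv_to_json(csv_text: str) -> str:
    """Migrate the data into a different format (CSV to JSON)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


def select_subset(csv_text: str, column: str, value: str) -> list[dict[str, str]]:
    """Create a subset by selection: keep rows where column equals value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == value]


if __name__ == "__main__":
    print(migrate_csv_to_json(SAMPLE))
    print(select_subset(SAMPLE, "site", "A"))
```

In both cases the original data is left untouched and the transformed version is new data, which would need its own description and metadata.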
OCCASIONAL ACTIONS
Dispose: Dispose of data which has not been selected for long-term curation and preservation, in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data's nature may, for legal reasons, necessitate secure destruction.
Reappraise: Return data which fails validation procedures for further appraisal and re-selection.
Migrate: Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data's immunity from hardware or software obsolescence.
See more at: http://www.dcc.ac.uk/resources/curation-lifecycle-model
Thing 3 asks us to:
Share a comment about a modification or addition you would include to make this model contextualised to your situation.
One of the comments in the meetup included:
Clickable links to their resources
Our equivalent
Have a close look at the record to see the ways the Australian Antarctic Division has made this record discoverable and accessible.
Citation info
Licensing info
Note how many times this dataset has been cited and how to cite this data. We will look at data citation in more detail in Thing 7.
Citation Info at UWA
Licensing Info at UWA
This doesn’t present all the research data repositories Australia has to offer: is anything missing?