HKU Data Curation course MLIM7350 student final project: a 30-minute data curation workshop for researchers. Topics covered: the concept of data curation, tools for data management, and data repository options.
2. Outline
1. What is Data Curation?
2. Why Data Curation?
3. How to Start Data Curation?
4. How to Organize Data?
5. Which Data Formats to Use?
6. Where to Preserve and Share Data?
4. What is Data Curation?
“Data Curation is maintaining, and adding value to, a trusted body of digital information for current and future use; it encompasses the active management of data throughout the research lifecycle.”
Digital Curation Centre (DCC)
http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles
5. What is Data Curation?
Two lifecycle models: the DCC Curation Lifecycle Model and the DataONE model
http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf
http://www.dataone.org/sites/all/documents/L02_DataSharing.ppt
Data Curation Model: a process of Creation, Preservation, and Reuse
7. Why Data Curation?
80% of data are unavailable after 20 years
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
8. Why Data Curation?
New York University, Health Sciences Library
https://youtu.be/N2zK3sAtr-4
A story about a data-sharing request that may happen to researchers...
12. How to Start Data Curation?
Data Management Planning Tool
A tool for researchers to start managing their data or to write a data management plan for a funding proposal
https://dmp.cdlib.org/
18. How to Organize Data?
Metadata Standard: Dublin Core
15 standard elements for describing data resources
http://wiki.dublincore.org/index.php/User_Guide
http://seopressor.com/wp-content/uploads/2015/11/dublin-core-elements-2.jpg
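The element set is simple enough to apply with nothing beyond the standard library. The sketch below builds a small Dublin Core record (a subset of the 15 elements) and serializes it as XML; all field values, including the DOI, are hypothetical illustrations, not taken from the slides:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical example values for a subset of the 15 DC elements.
record = {
    "title": "Rainfall Measurements, Hong Kong 2015",
    "creator": "Lee, E.",
    "subject": "hydrology",
    "description": "Daily rainfall readings from 12 stations.",
    "date": "2015-12-31",
    "format": "text/csv",
    "identifier": "doi:10.0000/example",  # hypothetical DOI
    "language": "en",
}

def to_dc_xml(fields):
    """Serialize a field dict as a simple Dublin Core XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        el = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = to_dc_xml(record)
```

A record like this can accompany a dataset as a separate metadata file, making the data discoverable without opening the data itself.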
19. How to Organize Data?
Tips for File Naming and Renaming
https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
✓ Date format: YYYYMMDD or YYMMDD
✗ Overly long file names
✗ Special characters, e.g. ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |
Use leading “0”s for clarity and to ensure files sort in sequential order:
✓ 001, 002, ... 010, 011 ... 100, 101, etc.
✗ 1, 2, ... 10, 11 ... 100, 101, etc.
Avoid spaces (file names with spaces must be enclosed in quotes on the command line):
✓ Underscores, e.g. file_name.xxx
✓ Dashes, e.g. file-name.xxx
✓ No separation, e.g. filename.xxx
✓ Camel case, e.g. FileName.xxx
✗ Spaces, e.g. file name.xxx
Renaming tools:
● Bulk Rename Utility (Windows; free)
● Renamer 4 (Mac)
● PSRenamer (Linux, Mac, or Windows; free)
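The naming tips above can also be enforced in a few lines of code rather than by hand; here is a minimal sketch (the function name and its parameters are hypothetical, not from any of the tools listed):

```python
import re
from datetime import date

# Allow only letters, digits, dot, underscore, and dash in file names.
SPECIAL = re.compile(r"[^A-Za-z0-9._-]")

def safe_name(description, seq, ext, when=None, width=3):
    """Build a file name following the slide's tips:
    YYYYMMDD date prefix, underscores instead of spaces,
    no special characters, zero-padded sequence number."""
    when = when or date.today()
    stem = SPECIAL.sub("", description.replace(" ", "_"))
    return f"{when:%Y%m%d}_{stem}_{seq:0{width}d}.{ext}"

print(safe_name("lab results (final)", 7, "csv", date(2016, 3, 1)))
# → 20160301_lab_results_final_007.csv
```

The zero-padded sequence (`007`) is what keeps files sorting in true numeric order in any file browser.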
20. How to Organize Data?
Tips for Organizing Spreadsheets
Be consistent
✓ Use consistent codes for categorical variables
Fill in all of the cells
✓ Use “NA” or “-” to fill blank cells for missing data
Create a data dictionary
✓ Use a separate file to describe the data
No calculations in the raw data files
✗ Calculations and graphs in the raw data file
Don’t use font color or highlighting as data
✓ Use an additional column to indicate outliers
Make backups
✓ Keep a copy of the file with a new version number, e.g. file_v1.xlsx, file_v2.xlsx
✓ Write-protect the file when finished entering the data
For more details: http://kbroman.org/dataorg/
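Two of these tips (explicit “NA” markers and a flag column instead of cell highlighting) can be sketched with the standard library; the column names and the plausible-range check are hypothetical examples:

```python
import csv, io

# A tiny messy sheet: one missing cell, one implausible value.
raw = """subject,weight_kg
s01,61.2
s02,
s03,598.0
"""

rows = list(csv.DictReader(io.StringIO(raw)))

for r in rows:
    if not r["weight_kg"]:
        r["weight_kg"] = "NA"   # explicit marker instead of an empty cell
        r["plausible"] = "NA"
    else:
        w = float(r["weight_kg"])
        # Flag questionable values in a dedicated column,
        # rather than with font color or cell highlighting.
        r["plausible"] = "no" if not (30 <= w <= 200) else "yes"
```

A flag column survives export to CSV and is machine-readable, whereas cell colouring is lost the moment the file leaves the spreadsheet program.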
21. How to Organize Data?
Data Cleaning Tool: OpenRefine
“A free, open source, powerful tool for working with messy data”
http://openrefine.org/
https://github.com/OpenRefine
https://github.com/OpenRefine/OpenRefine/wiki/Sample-Datasets
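The idea behind OpenRefine's key-collision clustering, which merges variants of the same value that differ in spacing, capitalization, or punctuation, can be sketched in a few lines. This is a simplified re-implementation of the “fingerprint” keying method, not OpenRefine's own code:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalise a messy string roughly the way OpenRefine's
    key-collision 'fingerprint' method does: trim, lowercase,
    strip punctuation, then sort and de-duplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

messy = ["The University of Hong Kong",
         "university of hong kong, the",
         "THE UNIVERSITY OF HONG  KONG"]

clusters = defaultdict(list)
for v in messy:
    clusters[fingerprint(v)].append(v)  # variants collapse to one key
```

All three spellings collapse into a single cluster, which can then be replaced by one canonical form, exactly the workflow OpenRefine offers interactively.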
22. How to Organize Data?
Network and Graph Visualization Tool: Gephi
“Interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”
https://gephi.org/
https://gephi.org/images/screenshots/preview2.png
23. How to Organize Data?
Data Visualization Tool: Silk
“Create interactive data visualizations, publish websites, and tell interactive stories.”
https://www.silk.co/home
https://www.silk.co/help/charts-tutorial/
25. Forgotten Technologies...
Photo (CC): The Wolf Law Library - https://www.flickr.com/photos/wolflawlibrary/8747894458/
26. Which Data Formats to Use?
Recommended formats (better for preservation, reuse, and sharing) versus other acceptable formats:

Tabular data
Recommended: SPSS portable format (.por); comma-separated values (.csv)
Acceptable: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb), MS Excel (.xls/.xlsx), dBase (.dbf), OpenDocument Spreadsheet (.ods)

Geospatial data
Recommended: ESRI Shapefile (.shp, .shx, .dbf; .prj, .sbx, .sbn optional); CAD data (.dwg)
Acceptable: ESRI Geodatabase format (.mdb), Adobe Illustrator (.ai), CAD data (.dxf or .svg)

Textual data
Recommended: Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml)
Acceptable: Hypertext Mark-up Language (.html), MS Word (.doc/.docx)

Image data
Recommended: TIFF 6.0 uncompressed (.tif)
Acceptable: JPEG (.jpeg, .jpg, .jp2), GIF (.gif), TIFF other versions (.tiff), RAW image format (.raw), Photoshop files (.psd), BMP (.bmp), PNG (.png)

Audio data
Recommended: Free Lossless Audio Codec (FLAC) (.flac)
Acceptable: MPEG-1 Audio Layer 3 (.mp3), Audio Interchange File Format (.aif), Waveform Audio Format (.wav)

Video data
Recommended: MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2)
Acceptable: AVCHD video (.avchd)

Documentation and scripts
Recommended: Rich Text Format (.rtf); PDF (.pdf); plain text (.txt)
Acceptable: MS Word (.doc/.docx)

https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
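A quick way to act on a list like this is to scan a project folder for files not yet in a recommended format; a minimal sketch, where the function name and the extension subset are hypothetical:

```python
from pathlib import Path

# A subset of extensions from the "recommended" column above.
RECOMMENDED = {".csv", ".por", ".txt", ".xml", ".rtf",
               ".tif", ".flac", ".mp4", ".pdf", ".shp"}

def flag_for_conversion(folder):
    """Return names of files whose extension is not on the
    recommended preservation list, sorted for stable output."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() not in RECOMMENDED)
```

Running this over a data folder before deposit gives a to-do list of files to export to an open format (e.g. .xlsx to .csv, .docx to .txt or .pdf).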
27. Which Data Formats to Use?
5 ★ OPEN DATA
★ Available on the web in any format, but with an open licence, to be Open Data
★★ Available as machine-readable structured data
★★★ As (2), plus in a non-proprietary format
★★★★ All the above, plus use URIs to identify things, so that people can point at your stuff
★★★★★ All the above, plus link your data to other data to provide context
For more details: http://5stardata.info/en/
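The jump from 3-star (open-format CSV) to 4- and 5-star data can be sketched in plain Python: mint URIs for the things in each row, then link out to other datasets. The base URI, column names, and the DBpedia link below are hypothetical illustrations:

```python
import csv, io

# Hypothetical base URI; in practice use a namespace you control.
BASE = "http://example.org/rainfall/"

raw = "station,rainfall_mm\nHKO,2398.5\nTKL,1881.2\n"

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = BASE + row["station"]  # ★4: URIs identify things
    triples.append((subject, BASE + "rainfall_mm", row["rainfall_mm"]))

# ★5: link your data to other data for context (hypothetical link).
triples.append((BASE + "HKO",
                "http://www.w3.org/2002/07/owl#sameAs",
                "http://dbpedia.org/resource/Hong_Kong_Observatory"))
```

Once every row has a URI, anyone on the web can point at an individual record, and the `sameAs` link lets consumers pull in context from the linked dataset.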
29. Where to Preserve and Share Data?
Institutional Repository: HKU Scholars Hub
● Enhances visibility of HKU authors and their research
● Opportunities for collaboration
● ~325 datasets
● http://hub.hku.hk/
30. Where to Preserve and Share Data?
Disciplinary Repository
● Global online archiving platforms for particular subjects; some provide free storage
● GitHub - open source code and software: https://github.com
● Figshare - reserve a DOI for publication: https://figshare.com
● Dryad - research data in science and medicine: http://datadryad.org
● GigaDB - research data in biology and biomedicine: http://gigadb.org/site/index
● Dataverse - one of the largest collections of science datasets: http://dataverse.org
31. REFERENCES
Mallery, M. (2014). DMPTool: Guidance and resources for your data management plan; https://dmp.cdlib.org. Technical Services Quarterly, 31(2), 197-199.
Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018. doi:10.1038/sdata.2016.18
32. THANKS!
Any questions?
You can find me at lernest@hku.hk
CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival
Editor's Notes
This data curation workshop aims to provide the fundamental concepts of data curation and practical tools for data management. To help researchers follow the workshop, it is structured around the following 6 questions:
What is Data Curation?
Why Data Curation?
How to Start Data Curation?
How to Organize Data?
Which Data Formats to Use?
Where to Preserve and Share Data?
Keywords: adding value; current and future use; active management; lifecycle
Two examples of data curation models: the DCC lifecycle and the DataONE model.
The main idea of data curation is that it is a continuous process of creation, preservation, and reuse.
Data curation is more than just preservation: it organizes the data through metadata and enhances the re-usability of the data.
A few examples are used to show that data management is important for the preservation and re-use of data.
A statistic showing that 80% of data are unavailable after 20 years; scientists are losing their data at a rapid rate.
An entertaining cartoon video explaining why a researcher cannot use the data because of poor data management, such as broken data formats and poorly organized file names.
A case study in which the useful data of an agricultural researcher could not be recovered after his death.
Data curation is also needed to satisfy local requirements or policies set by the institution or government. In HK, there is only institutional policy.
Data Management Planning Tool: a very simple and useful tool for researchers to start managing their data or writing a funding proposal.
A wide variety of useful templates to choose from.
Visibility settings; co-worker editing function.
Guidelines for the planning.
Preview and export of the plan.
Dublin Core as the metadata standard.
File renaming tools and tips (in particular, do not use spaces in file names).
OpenRefine, for data cleaning: a sample dataset is used to demonstrate that it is a handy tool when there is a lot of data and we need to merge variants of the same value, such as differences in spacing, capitalization, and articles (a/an/the).
Gephi, for social network analysis and visualization.
Silk, for data publishing and visualization.
Some data storage technologies, such as the floppy disk and the cassette tape, are already obsolete.
Some formats are a better choice for preservation, and a list of recommended formats is provided. For example, CSV is better than XLS/XLSX, TXT is better than DOC/DOCX, and TIFF is better than JPG.
The recommendation is similar for data sharing; the 5-star open data scheme is a simple indicator of which formats are better for sharing. Most researchers use PDF and XLS for sharing; however, CSV is a better option.
RDF - Resource Description Framework; a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines. (http://www.nature.com/articles/sdata201618#ref1)
LOD - Linked Open Data; linked data which is released under an open licence that does not impede its free reuse (https://www.w3.org/DesignIssues/LinkedData.html)
A data repository is suggested for the preservation and sharing of data. In the case of HKU, there is an institutional repository, the HKU Scholars Hub, which holds approximately 325 datasets at the moment.
Disciplinary repositories are online platforms for archiving data on a particular subject, and most of them are free:
GitHub is a repository for open source code and software, for example OpenRefine.
Figshare enables reserving a DOI for a publication. (A DOI is a persistent link for a publication.)
Dryad is a repository for research data in science and medicine.
The Dataverse Network is a repository containing all kinds of scientific data. It holds one of the largest collections of social science data.