MLIM7350 PROJECT
DATA CURATION WORKSHOP
The University of Hong Kong
Ernest LAM
Apr 27, 2017
Outline
1. What is Data Curation?
2. Why Data Curation?
3. How to Start Data Curation?
4. How to Organize Data?
5. Which Data Formats to Use?
6. Where to Preserve and Share Data?
1.
What is Data Curation?
“Data Curation is maintaining and
adding value to, a trusted body of digital
information for current and future use; It
encompasses the active management of
data throughout the research lifecycle.”
Digital Curation Centre (DCC)
http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles
Data Curation Model
A process of Creation, Preservation, Reuse
DCC Lifecycle Model: http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf
DataOne Model: http://www.dataone.org/sites/all/documents/L02_DataSharing.ppt
2.
Why Data Curation?
80% of data are unavailable after 20 years
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
New York University, Health Sciences Library
https://youtu.be/N2zK3sAtr-4
A story about a data-sharing request that could happen to any researcher...
To ensure the use and reuse of data
● Case study: an ecologist was unable to obtain an agricultural researcher's useful data after the researcher's death
http://www.sciencemag.org/careers/2014/04/chasing-down-data-you-need
To meet local requirements and policies
● HKU’s Policy on the Management of Research Data and Records
http://www.rss.hku.hk/integrity/research-data-records-management
3.
How to Start Data
Curation?
Data Management Planning Tool
A tool that helps researchers start managing their data or write a data management plan for a funding proposal
https://dmp.cdlib.org/
Data Management Planning Tool
A list of templates to choose from
Data Management Planning Tool
Visibility Setting: Public, Institutional, Private
Co-workers can edit, view, and download the plan
Data Management Planning Tool
Built-in guidance to help with planning
Data Management Planning Tool
Preview
Export to PDF / DOCX / Print
4.
How to Organize
Data?
Metadata Standard: Dublin Core
15 standard elements for describing data resources
http://wiki.dublincore.org/index.php/User_Guide
http://seopressor.com/wp-content/uploads/2015/11/dublin-core-elements-2.jpg
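To make the 15 elements concrete, here is a minimal sketch of a Dublin Core record built with Python's standard library. The element names and the namespace URI come from the Dublin Core Metadata Element Set 1.1; the `build_dc_record` helper and the `metadata` wrapper element are hypothetical names chosen for illustration, and the sample values are taken from this deck's title slide.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 standard Dublin Core elements.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    Hypothetical helper: the wrapper element name "metadata" is an
    assumption, not something Dublin Core prescribes.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    # Emit elements in the canonical order, skipping any that are absent.
    for name in DC_ELEMENTS:
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return ET.tostring(root, encoding="unicode")


record = build_dc_record({
    "title": "Data Curation Workshop",
    "creator": "Ernest LAM",
    "date": "2017-04-27",
})
print(record)
```

All 15 elements are optional and repeatable in Dublin Core, so a real record would usually carry more fields than this three-field example.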
https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
Tips for File Naming
✓ Date format: YYYYMMDD or YYMMDD
✗ Overly long file names
✗ Special characters, e.g. ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |
Use leading “0” for clarity and to ensure files sort in sequential order
✓ "001, 002, ...010, 011 ... 100, 101, etc."
✗ "1, 2, ...10, 11 ... 100, 101, etc."
Avoid spaces: file names with spaces must be enclosed in quotes on the command line
✓ Underscores, e.g. file_name.xxx
✓ Dashes, e.g. file-name.xxx
✓ No separation, e.g. filename.xxx
✓ Camel case, e.g. FileName.xxx
✗ Spaces, e.g. file name.xxx
Tools                  OS                         Free?
Bulk Rename Utility    Windows                    Yes
Renamer 4              Mac
PSRenamer              Linux, Mac, or Windows     Yes
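The naming tips above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name `make_file_name`, the date-description-sequence ordering, and the default padding width of three digits are choices made for this example, not part of the Stanford guidance.

```python
import re
from datetime import date


def make_file_name(description, when, sequence, extension, width=3):
    """Build a file name such as 20170427_soil_samples_007.csv.

    Hypothetical helper: applies the YYYYMMDD date format, replaces
    spaces and special characters with underscores, and zero-pads the
    sequence number so that names sort in sequential order.
    """
    # Collapse every run of characters other than letters, digits,
    # or dashes into a single underscore.
    stem = re.sub(r"[^A-Za-z0-9-]+", "_", description).strip("_")
    if not stem:
        raise ValueError("description has no usable characters")
    suffix = extension.lstrip(".")
    return f"{when:%Y%m%d}_{stem}_{sequence:0{width}d}.{suffix}"


name = make_file_name("soil samples (site #2)", date(2017, 4, 27), 7, ".csv")
print(name)

# Zero-padded numbers sort correctly as text; unpadded ones do not.
print(sorted(["10", "2", "1"]))
print(sorted(["010", "002", "001"]))
```

Padding matters because file browsers usually sort names as text, so "10" lands before "2" unless both are written with the same number of digits.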
Tips for Organizing Spreadsheets
Be consistent
✓ Use consistent codes for categorical variables
Fill in all of the cells
✓ Use “NA” or “-” to fill the blank cells for missing data
Create a data dictionary
✓ Use a separate file to describe the data
No calculations in the raw data files
✗ Use calculations and graphs in the raw data file
Don’t use font color or highlighting as data
✓ Use an additional column that indicates the outliers
Make backups
✓ Make a copy of the file with a new version number, e.g. file_v1.xlsx, file_v2.xlsx
✓ Write-protect the file when finished entering the data
For more details: http://kbroman.org/dataorg/
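Two of the tips above, filling in every cell and using consistent codes, are easy to check automatically once the spreadsheet is saved as CSV. The following is a minimal sketch using only Python's standard library; the function names and the sample data (including the column names `id`, `sex`, and `weight_kg`) are invented for illustration.

```python
import csv
import io


def find_blank_cells(csv_text):
    """Return (row_number, column_name) for each empty cell.

    The header is counted as row 1, so the first data row is row 2.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    blanks = []
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header; ignored here
            if value is None or not value.strip():
                blanks.append((row_number, column))
    return blanks


def distinct_codes(csv_text, column):
    """List the distinct values in one column to spot inconsistent codes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[column] for row in reader})


# Invented sample: row 3 has a blank weight and an inconsistent sex code.
sample = "id,sex,weight_kg\n1,M,70.5\n2,male,\n3,F,NA\n"

print(find_blank_cells(sample))
print(distinct_codes(sample, "sex"))
```

Here the blank-cell check flags the missing weight in row 3, and the list of distinct codes shows both "M" and "male", which should be merged into one code.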
Data Cleaning Tools: OpenRefine
“A free, open source, powerful tool for working with messy data”
http://openrefine.org/
https://github.com/OpenRefine
https://github.com/OpenRefine/OpenRefine/wiki/Sample-Datasets
Network and Graph Visualization Tools: Gephi
“Interactive visualization and exploration platform for all kinds of
networks and complex systems, dynamic and hierarchical graphs.”
https://gephi.org/
https://gephi.org/images/screenshots/preview2.png
Data Visualization Tools: Silk
“Create interactive data visualizations, publish websites, and tell
interactive stories.”
https://www.silk.co/home
https://www.silk.co/help/charts-tutorial/
5.
Which Data Formats
to Use?
cc The Wolf Law Library - https://www.flickr.com/photos/wolflawlibrary/8747894458/
Forgotten Technologies...
Tabular data
Recommended:
● SPSS portable format (.por)
● comma-separated values (.csv)
Acceptable:
● SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)
● MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)
Geospatial data
Recommended:
● ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional)
● CAD data (.dwg)
Acceptable:
● ESRI Geodatabase format (.mdb)
● Adobe Illustrator (.ai), CAD data (.dxf or .svg)
Textual data
Recommended:
● Rich Text Format (.rtf)
● plain text, ASCII (.txt)
● eXtensible Mark-up Language (.xml)
Acceptable:
● Hypertext Mark-up Language (.html)
● MS Word (.doc/.docx)
Image data
Recommended:
● TIFF 6.0 uncompressed (.tif)
Acceptable:
● JPEG (.jpeg, .jpg, .jp2)
● GIF (.gif)
● TIFF other versions (.tiff)
● RAW image format (.raw)
● Photoshop files (.psd)
● BMP (.bmp)
● PNG (.png)
Audio data
Recommended:
● Free Lossless Audio Codec (FLAC) (.flac)
Acceptable:
● MPEG-1 Audio Layer 3 (.mp3)
● Audio Interchange File Format (.aif)
● Waveform Audio Format (.wav)
Video data
Recommended:
● MPEG-4 (.mp4)
● OGG video (.ogv, .ogg)
● motion JPEG 2000 (.mj2)
Acceptable:
● AVCHD video (.avchd)
Documentation and scripts
Recommended:
● Rich Text Format (.rtf)
● PDF (.pdf)
Acceptable:
● plain text (.txt)
● MS Word (.doc/.docx)
https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
Better!
Recommended format for preservation,
reuse and sharing
5 ★ OPEN DATA
★ Any format available on the web but with an open licence, to be Open Data
★★ Available as machine-readable structured data
★★★ As (2) + non-proprietary format
★★★★ All the above + use URIs to identify things, so that people can point at your stuff
★★★★★ All the above + link your data to other data to provide context
For more details: http://5stardata.info/en/
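Because each star level builds on the one before it, the rating can be read as a chain of yes/no questions that stops at the first "no". The sketch below illustrates that cumulative logic in Python; the function and parameter names are invented for this example and are not part of the 5-star scheme itself.

```python
def open_data_stars(open_licence_on_web, machine_readable,
                    non_proprietary, uses_uris, linked_to_other_data):
    """Return the 5-star open data rating (0 to 5) for a dataset.

    Hypothetical helper: the levels are cumulative, so counting stops
    at the first criterion that is not met.
    """
    criteria = [
        open_licence_on_web,    # 1 star: on the web with an open licence
        machine_readable,       # 2 stars: structured, machine-readable data
        non_proprietary,        # 3 stars: non-proprietary format
        uses_uris,              # 4 stars: URIs identify things
        linked_to_other_data,   # 5 stars: linked to other data for context
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A scanned table published as an image with an open licence.
print(open_data_stars(True, False, False, False, False))
# The same table as an Excel file (structured but proprietary).
print(open_data_stars(True, True, False, False, False))
# The same table as CSV (structured and non-proprietary).
print(open_data_stars(True, True, True, False, False))
```

Note that a dataset without an open licence scores zero under this reading, however good its format is.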
6.
Where to preserve
and share data?
Institutional Repository
● HKU Scholars Hub (http://hub.hku.hk/)
● Enhances the visibility of HKU authors and their research
● Creates opportunities for collaboration
● ~325 datasets
Disciplinary Repository
● Global online archiving platforms for particular subjects
● Some provide free storage
● GitHub (https://github.com): open source code and software
● figshare (https://figshare.com): reserve a DOI for publication
● Dryad (http://datadryad.org): research data in science and medicine
● GigaDB (http://gigadb.org/site/index): research data in biology and biomedicine
● Dataverse (http://dataverse.org): largest collection of science datasets
REFERENCES
Mallery, M. (2014). DMPTool: Guidance and Resources for Your Data Management Plan; https://dmp.cdlib.org. Technical Services Quarterly, 31(2), 197-199.
Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. doi: 10.1038/sdata.2016.18
THANKS!
Any questions?
You can find me at lernest@hku.hk
CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival

HKU Data Curation MLIM7350 Student Project: Data Curation Workshop

  • 1. MLIM7350 PROJECT DATA CURATION WORKSHOP The University of Hong Kong Ernest LAM Apr 27, 2017
  • 2. Outline 1. What is Data Curation? 2. Why Data Curation? 3. How to Start Data Curation? 4. How to Organize Data? 5. Which Data Formats to Use? 6. Where to Preserve and Share Data?
  • 3. 1. What is Data Curation?
  • 4. “Data Curation is maintaining and adding value to, a trusted body of digital information for current and future use; It encompasses the active management of data throughout the research lifecycle. Digital Curation Centre (DCC) http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles What is Data Curation?
  • 5. DCC Lifecycle Model DataOne Model http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf http://www.dataone.org/sites/all/documents/L02_DataSharing.ppt x Data Curation Model A process of Creation, Preservation, Reuse What is Data Curation?
  • 7. 80% Data are Unavailableafter 20 years http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416 Why Data Curation?
  • 8. New York University, Health Sciences Library https://youtu.be/N2zK3sAtr-4 Why Data Curation? A story about data sharing request that may happen to researchers...
  • 9. http://www.sciencemag.org/careers/2014/04/chasing-down-data-you-need Why Data Curation? To ensure the use and reuse of data ● Case Study: An ecologist failed to collect the useful data of an agricultural researcher after his death
  • 10. http://www.rss.hku.hk/integrity/research-data-records-management To meet the local requirement and policy ● HKU’s Policy on the Management of Research Data and Records Why Data Curation?
  • 11. 3. How to Start Data Curation?
  • 12. How to Start Data Curation? Data Management Planning Tool A tool for Researchers to start with managing the data or writing a proposal for funding https://dmp.cdlib.org/
  • 13. Data Management Planning Tool A list of templates to choose How to Start Data Curation?
  • 14. Data Management Planning Tool Visibility Setting: Public, Institutional, Private Co-worker to edit, view and download How to Start Data Curation?
  • 15. Data Management Planning Tool Guidance to help the planning How to Start Data Curation?
  • 16. How to Start Data Curation? Data Management Planning Tool Preview Export to PDF / DOCX / Print
  • 18. How to Organize Data? Metadata Standard: Dublin Core 15 standard elements for describing data resources http://wiki.dublincore.org/index.php/User_Guide http://seopressor.com/wp-content/uploads/2015/11/dublin-core-elements-2.jpg
  • 19. How to Organize Data? Tips for File Renaming https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
      ✓ Date format - YYYYMMDD or YYMMDD
      ✗ Use too long File names
      ✗ Use Special characters, e.g. ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |
      Use leading “0” for clarity and to ensure files sort in sequential order
        ✓ "001, 002, ...010, 011 ... 100, 101, etc."
        ✗ "1, 2, ...10, 11 ... 100, 101, etc."
      File names with spaces must be enclosed in quotes
        ✓ Underscores, e.g. file_name.xxx
        ✓ Dashes, e.g. file-name.xxx
        ✓ No separation, e.g. filename.xxx
        ✓ Camel case, e.g. FileName.xxx
        ✗ Use spaces, e.g. file name.xxx
      Tools (OS; Free?): Bulk Rename Utility (Windows; Yes), Renamer 4 (Mac), PSRenamer (Linux, Mac, or Windows; Yes)
  • 20. How to Organize Data? Tips for Organizing Spreadsheet
      Be consistent: ✓ Use consistent codes for categorical variables
      Fill in all of the cells: ✓ Use “NA” or “-” to fill the blank cells for missing data
      Create a data dictionary: ✓ Use a separate file to describe the data
      No calculations in the raw data files: ✗ Use calculations and graphs in the raw data file
      Don’t use font color or highlighting as data: ✓ Use an additional column that indicates the outliers
      Make backups: ✓ Make a copy of the file with a new version number, e.g. file_v1.xlsx, file_v2.xlsx ✓ Write-protect the file when finished entering the data
      For more details: http://kbroman.org/dataorg/
  • 21. How to Organize Data? Data Cleaning Tools: Open Refine “A free, open source, powerful tool for working with messy data” http://openrefine.org/ https://github.com/OpenRefine https://github.com/OpenRefine/OpenRefine/wiki/Sample-Datasets
  • 22. How to Organize Data? Network and Graphic Visualization Tools: Gephi “Interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.” https://gephi.org/ https://gephi.org/images/screenshots/preview2.png
  • 23. How to Organize Data? Data Visualization Tools: Silk “Create interactive data visualizations, publish websites, and tell interactive stories.” ● https://www.silk.co/home https://www.silk.co/help/charts-tutorial/
  • 25. cc The Wolf Law Library - https://www.flickr.com/photos/wolflawlibrary/8747894458/ Forgotten Technologies...
  • 26. Which Data Formats to use? https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
      Tabular data: SPSS portable format (.por); comma-separated values (.csv); SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb); MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)
      Geospatial data: ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional); CAD data (.dwg); ESRI Geodatabase format (.mdb); Adobe Illustrator (.ai), CAD data (.dxf or .svg)
      Textual data: Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml); Hypertext Mark-up Language (.html); MS Word (.doc/.docx)
      Image data: TIFF 6.0 uncompressed (.tif); JPEG (.jpeg, .jpg, .jp2); GIF (.gif); TIFF other versions (.tiff); RAW image format (.raw); Photoshop files (.psd); BMP (.bmp); PNG (.png)
      Audio data: Free Lossless Audio Codec (FLAC) (.flac); MPEG-1 Audio Layer 3 (.mp3); Audio Interchange File Format (.aif); Waveform Audio Format (.wav)
      Video data: MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2); AVCHD video (.avchd)
      Documentation and scripts: Rich Text Format (.rtf); PDF (.pdf); plain text (.txt); MS Word (.doc/.docx)
      Better! Recommended format for preservation, reuse and sharing
  • 27. Which Data Formats to use? 5 ★ OPEN DATA
      ★ Any format available on the web but with an open licence, to be Open Data
      ★★ Available as machine-readable structured data
      ★★★ As (2) + non-proprietary format
      ★★★★ All the above + use URIs to identify things, so that people can point at your stuff
      ★★★★★ All the above + link your data to other data to provide context
      For more details: http://5stardata.info/en/
  • 29. Where to Preserve and Share Data? Institutional Repository ● HKU Scholars Hub ● enhance visibility of HKU authors and their research ● opportunities for collaboration ● ~325 Datasets ● http://hub.hku.hk/
  • 30. Where to Preserve and Share Data? Disciplinary Repository
      ● Global online archiving platforms for particular subjects
      ● Some provide free storage
      ● Open source code and software: https://github.com
      ● Reserve DOI for publication: https://figshare.com
      ● Research data with science and medicine: http://datadryad.org
      ● Research data with biology and biomedical: http://gigadb.org/site/index
      ● Largest collection of science dataset: http://dataverse.org
  • 31. REFERENCE
      Mallery, M. (2014). DMPTool: Guidance and Resources for Your Data Management Plan; https://dmp.cdlib.org. Technical Services Quarterly, 31(2), 197-199.
      Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018. doi: 10.1038/sdata.2016.18
  • 32. THANKS! Any questions? You can find me at lernest@hku.hk CREDITS Special thanks to all the people who made and released these awesome resources for free: ▸ Presentation template by SlidesCarnival
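
The file-naming tips on slide 19 (YYYYMMDD dates, zero-padded sequence numbers, no spaces or special characters) can be illustrated with a minimal Python sketch. The function name `safe_name`, its parameters, and the sample label are assumptions made for illustration only; they do not come from the slides or from any of the renaming tools listed there.

```python
import re
from datetime import date


def safe_name(label: str, when: date, seq: int, ext: str, width: int = 3) -> str:
    """Build a file name from a YYYYMMDD date, a cleaned label, and a zero-padded number.

    Hypothetical helper for illustration; not part of any tool named on the slides.
    """
    # Replace each run of characters other than ASCII letters and digits
    # (spaces, #, &, ...) with a single underscore, then trim stray underscores.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    # Zero-pad the sequence number so that names sort in sequential order.
    return f"{when:%Y%m%d}_{slug}_{seq:0{width}d}.{ext}"


print(safe_name("field survey #2", date(2017, 4, 27), 7, "csv"))
```

Because the sequence number is padded to a fixed width, a plain alphabetical sort of the resulting names keeps file 2 ahead of file 10, which is the point of the leading-zero tip.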

Editor's Notes

  1. This data curation workshop aims to introduce the fundamental concepts of data curation and practical tools for data management. To help researchers follow the workshop, it is organized around the following 6 questions: What is Data Curation? Why Data Curation? How to Start Data Curation? How to Organize Data? Which Data Formats to Use? Where to Preserve and Share Data?
  2. Keywords: adding value; current and future use; active management; lifecycle
  3. Two examples of data curation models: the DCC lifecycle and DataOne. The main idea of data curation is that it is a continuous process of creation, preservation, and reuse. Data curation is more than just preservation: it organizes the data through metadata and enhances the reusability of the data.
  4. A few examples are used to show that data management is important for preservation of the data and re-use of data
  5. A statistic showing that 80% of data are unavailable after 20 years; scientists are losing their data at a rapid rate.
  6. An interesting cartoon video showing a researcher who cannot use the data because of poor data management, such as a data format that no longer works and poorly organized data names.
  7. A case study in which an ecologist could not obtain the useful data of an agricultural researcher after the researcher's death.
  8. Data curation is also needed to satisfy the local requirement or policy by the institution or government. In HK, there is only institutional policy.
  9. Data Management Planning Tool - a very simple and useful tool for Researchers to start with managing the data or writing a proposal for funding
  10. A wide variety of useful templates to choose from
  11. Visibility setting; co-worker editing function
  12. Guideline for the planning
  13. Preview and export of data
  14. Dublin Core for metadata standard
  15. File renaming tools and tips (in particular, do not use spaces in file names)
  16. Open Refine - for data cleaning. A sample dataset is used to demonstrate that it is a handy tool when there is a lot of data and the same word appears in different forms that need to be combined, such as different spacing, capital letters, or articles (a/an/the).
  17. Gephi - for social network analysis and visualization
  18. Silk - for the data publishing and visualization
  19. Some data storage technologies such as floppy disk and cassette tape are already out of date.
  20. Some formats are better choices for preservation, and a list of recommended formats is provided. For example, CSV is better than XLS/XLSX; TXT is better than DOCX/DOC; TIFF is better than JPG.
  21. The recommendation for data sharing is similar: the 5-star open data scheme is a simple indicator of which formats are better for sharing. Most researchers use PDF and XLS for sharing; however, CSV is a better option. RDF (Resource Description Framework): a globally accepted framework for data and knowledge representation that is intended to be read and interpreted by machines (http://www.nature.com/articles/sdata201618#ref1). LOD (Linked Open Data): linked data released under an open licence, which does not impede its reuse for free (https://www.w3.org/DesignIssues/LinkedData.html).
  22. A data repository is suggested for preserving and sharing the data. HKU has an institutional repository, the HKU Scholars Hub, which holds approximately 325 datasets at the moment.
  23. Disciplinary repositories are online platforms for archiving data on particular subjects, and most of them are free. GitHub is a repository for open source code and software, for example Open Refine. Figshare allows a DOI to be reserved for a publication (a DOI is a persistent link to the publication). Dryad is a repository for research data in science and medicine. Dataverse Network is a repository containing all kinds of scientific data; it has one of the largest collections of social science data.
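
Note 16 describes combining entries that are the same word written with different spacing, capital letters, or articles. A minimal Python sketch of that idea follows. It groups values by a normalized key and is only loosely in the spirit of key-based clustering; it is not Open Refine's implementation, and the names `normalize_key`, `cluster`, and the sample values are hypothetical.

```python
import re
from collections import defaultdict

# Articles to ignore when comparing values (assumption taken from note 16).
ARTICLES = {"a", "an", "the"}


def normalize_key(value: str) -> str:
    """Lower-case, drop punctuation, collapse spacing, and remove articles."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(t for t in tokens if t not in ARTICLES)


def cluster(values):
    """Return groups of differently written values that share a normalized key."""
    groups = defaultdict(list)
    for v in values:
        groups[normalize_key(v)].append(v)
    # Keep only groups that contain more than one distinct spelling.
    return [g for g in groups.values() if len(set(g)) > 1]


samples = [
    "The University of Hong Kong",
    "university of  hong kong",
    "University of Hong Kong ",
    "HKU",
]
for group in cluster(samples):
    print(group)
```

Each group returned is a candidate for merging into one canonical value; a person should still review the groups, since a normalized key can also match entries that are not really the same.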