HKU Data Curation course MLIM7350 student final project: a 30-minute data curation workshop for researchers. Topics covered: the concept of data curation, tools for data management, and data repository options.
2. Outline
1. What is Data Curation?
2. Why Data Curation?
3. How to Start Data Curation?
4. How to Organize Data?
5. Which Data Formats to Use?
6. Where to Preserve and Share Data?
4. What is Data Curation?
“Data Curation is maintaining, and adding value to, a trusted body of digital information for current and future use; it encompasses the active management of data throughout the research lifecycle.”
Digital Curation Centre (DCC)
http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles
5. What is Data Curation?
Two lifecycle models: the DCC Curation Lifecycle Model and the DataONE model
http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf
http://www.dataone.org/sites/all/documents/L02_DataSharing.ppt
Data Curation Model: a process of Creation, Preservation, and Reuse
7. Why Data Curation?
80% of data are unavailable after 20 years
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
8. Why Data Curation?
New York University, Health Sciences Library
https://youtu.be/N2zK3sAtr-4
A story about a data-sharing request that may happen to researchers...
12. How to Start Data Curation?
Data Management Planning Tool
A tool for researchers to start managing their data or to write a data management plan for a funding proposal
https://dmp.cdlib.org/
18. How to Organize Data?
Metadata Standard: Dublin Core
15 standard elements for describing data resources
http://wiki.dublincore.org/index.php/User_Guide
http://seopressor.com/wp-content/uploads/2015/11/dublin-core-elements-2.jpg
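The element set is simple enough to apply with nothing beyond the standard library. The sketch below builds a small Dublin Core record (a subset of the 15 elements) and serializes it as XML; all field values, including the DOI, are hypothetical illustrations, not taken from the slides:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical example values for a subset of the 15 DC elements.
record = {
    "title": "Rainfall Measurements, Hong Kong 2015",
    "creator": "Lee, E.",
    "subject": "hydrology",
    "description": "Daily rainfall readings from 12 stations.",
    "date": "2015-12-31",
    "format": "text/csv",
    "identifier": "doi:10.0000/example",  # hypothetical DOI
    "language": "en",
}

def to_dc_xml(fields):
    """Serialize a field dict as a simple Dublin Core XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        el = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = to_dc_xml(record)
```

A record like this can accompany a dataset as a separate metadata file, making the data discoverable without opening the data itself.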
19. How to Organize Data?
Tips for File Naming and Renaming
https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
✓ Date format: YYYYMMDD or YYMMDD
✗ Overly long file names
✗ Special characters, e.g. ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |
Use leading “0”s for clarity and to ensure files sort in sequential order:
✓ 001, 002, ... 010, 011 ... 100, 101, etc.
✗ 1, 2, ... 10, 11 ... 100, 101, etc.
Avoid spaces (file names with spaces must be enclosed in quotes on the command line):
✓ Underscores, e.g. file_name.xxx
✓ Dashes, e.g. file-name.xxx
✓ No separation, e.g. filename.xxx
✓ Camel case, e.g. FileName.xxx
✗ Spaces, e.g. file name.xxx
Renaming tools:
● Bulk Rename Utility (Windows; free)
● Renamer 4 (Mac)
● PSRenamer (Linux, Mac, or Windows; free)
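The naming tips above can also be enforced in a few lines of code rather than by hand; here is a minimal sketch (the function name and its parameters are hypothetical, not from any of the tools listed):

```python
import re
from datetime import date

# Allow only letters, digits, dot, underscore, and dash in file names.
SPECIAL = re.compile(r"[^A-Za-z0-9._-]")

def safe_name(description, seq, ext, when=None, width=3):
    """Build a file name following the slide's tips:
    YYYYMMDD date prefix, underscores instead of spaces,
    no special characters, zero-padded sequence number."""
    when = when or date.today()
    stem = SPECIAL.sub("", description.replace(" ", "_"))
    return f"{when:%Y%m%d}_{stem}_{seq:0{width}d}.{ext}"

print(safe_name("lab results (final)", 7, "csv", date(2016, 3, 1)))
# → 20160301_lab_results_final_007.csv
```

The zero-padded sequence (`007`) is what keeps files sorting in true numeric order in any file browser.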
20. How to Organize Data?
Tips for Organizing Spreadsheets
Be consistent
✓ Use consistent codes for categorical variables
Fill in all of the cells
✓ Use “NA” or “-” to fill blank cells for missing data
Create a data dictionary
✓ Use a separate file to describe the data
No calculations in the raw data files
✗ Calculations and graphs in the raw data file
Don’t use font color or highlighting as data
✓ Use an additional column to indicate outliers
Make backups
✓ Keep a copy of the file with a new version number, e.g. file_v1.xlsx, file_v2.xlsx
✓ Write-protect the file when finished entering the data
For more details: http://kbroman.org/dataorg/
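Two of these tips (explicit “NA” markers and a flag column instead of cell highlighting) can be sketched with the standard library; the column names and the plausible-range check are hypothetical examples:

```python
import csv, io

# A tiny messy sheet: one missing cell, one implausible value.
raw = """subject,weight_kg
s01,61.2
s02,
s03,598.0
"""

rows = list(csv.DictReader(io.StringIO(raw)))

for r in rows:
    if not r["weight_kg"]:
        r["weight_kg"] = "NA"   # explicit marker instead of an empty cell
        r["plausible"] = "NA"
    else:
        w = float(r["weight_kg"])
        # Flag questionable values in a dedicated column,
        # rather than with font color or cell highlighting.
        r["plausible"] = "no" if not (30 <= w <= 200) else "yes"
```

A flag column survives export to CSV and is machine-readable, whereas cell colouring is lost the moment the file leaves the spreadsheet program.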
21. How to Organize Data?
Data Cleaning Tool: OpenRefine
“A free, open source, powerful tool for working with messy data”
http://openrefine.org/
https://github.com/OpenRefine
https://github.com/OpenRefine/OpenRefine/wiki/Sample-Datasets
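The idea behind OpenRefine's key-collision clustering, which merges variants of the same value that differ in spacing, capitalization, or punctuation, can be sketched in a few lines. This is a simplified re-implementation of the “fingerprint” keying method, not OpenRefine's own code:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalise a messy string roughly the way OpenRefine's
    key-collision 'fingerprint' method does: trim, lowercase,
    strip punctuation, then sort and de-duplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

messy = ["The University of Hong Kong",
         "university of hong kong, the",
         "THE UNIVERSITY OF HONG  KONG"]

clusters = defaultdict(list)
for v in messy:
    clusters[fingerprint(v)].append(v)  # variants collapse to one key
```

All three spellings collapse into a single cluster, which can then be replaced by one canonical form, exactly the workflow OpenRefine offers interactively.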
22. How to Organize Data?
Network and Graph Visualization Tool: Gephi
“Interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”
https://gephi.org/
https://gephi.org/images/screenshots/preview2.png
23. How to Organize Data?
Data Visualization Tool: Silk
“Create interactive data visualizations, publish websites, and tell interactive stories.”
https://www.silk.co/home
https://www.silk.co/help/charts-tutorial/
25. Forgotten Technologies...
Photo (CC): The Wolf Law Library - https://www.flickr.com/photos/wolflawlibrary/8747894458/
26. Which Data Formats to Use?
Recommended formats (better for preservation, reuse, and sharing) versus other acceptable formats:

Tabular data
Recommended: SPSS portable format (.por); comma-separated values (.csv)
Acceptable: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb), MS Excel (.xls/.xlsx), dBase (.dbf), OpenDocument Spreadsheet (.ods)

Geospatial data
Recommended: ESRI Shapefile (.shp, .shx, .dbf; .prj, .sbx, .sbn optional); CAD data (.dwg)
Acceptable: ESRI Geodatabase format (.mdb), Adobe Illustrator (.ai), CAD data (.dxf or .svg)

Textual data
Recommended: Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml)
Acceptable: Hypertext Mark-up Language (.html), MS Word (.doc/.docx)

Image data
Recommended: TIFF 6.0 uncompressed (.tif)
Acceptable: JPEG (.jpeg, .jpg, .jp2), GIF (.gif), TIFF other versions (.tiff), RAW image format (.raw), Photoshop files (.psd), BMP (.bmp), PNG (.png)

Audio data
Recommended: Free Lossless Audio Codec (FLAC) (.flac)
Acceptable: MPEG-1 Audio Layer 3 (.mp3), Audio Interchange File Format (.aif), Waveform Audio Format (.wav)

Video data
Recommended: MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2)
Acceptable: AVCHD video (.avchd)

Documentation and scripts
Recommended: Rich Text Format (.rtf); PDF (.pdf); plain text (.txt)
Acceptable: MS Word (.doc/.docx)

https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
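A quick way to act on a list like this is to scan a project folder for files not yet in a recommended format; a minimal sketch, where the function name and the extension subset are hypothetical:

```python
from pathlib import Path

# A subset of extensions from the "recommended" column above.
RECOMMENDED = {".csv", ".por", ".txt", ".xml", ".rtf",
               ".tif", ".flac", ".mp4", ".pdf", ".shp"}

def flag_for_conversion(folder):
    """Return names of files whose extension is not on the
    recommended preservation list, sorted for stable output."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() not in RECOMMENDED)
```

Running this over a data folder before deposit gives a to-do list of files to export to an open format (e.g. .xlsx to .csv, .docx to .txt or .pdf).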
27. Which Data Formats to Use?
5 ★ OPEN DATA
★ Available on the web in any format, but with an open licence, to be Open Data
★★ Available as machine-readable structured data
★★★ As (2), plus in a non-proprietary format
★★★★ All the above, plus use URIs to identify things, so that people can point at your stuff
★★★★★ All the above, plus link your data to other data to provide context
For more details: http://5stardata.info/en/
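The jump from 3-star (open-format CSV) to 4- and 5-star data can be sketched in plain Python: mint URIs for the things in each row, then link out to other datasets. The base URI, column names, and the DBpedia link below are hypothetical illustrations:

```python
import csv, io

# Hypothetical base URI; in practice use a namespace you control.
BASE = "http://example.org/rainfall/"

raw = "station,rainfall_mm\nHKO,2398.5\nTKL,1881.2\n"

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = BASE + row["station"]  # ★4: URIs identify things
    triples.append((subject, BASE + "rainfall_mm", row["rainfall_mm"]))

# ★5: link your data to other data for context (hypothetical link).
triples.append((BASE + "HKO",
                "http://www.w3.org/2002/07/owl#sameAs",
                "http://dbpedia.org/resource/Hong_Kong_Observatory"))
```

Once every row has a URI, anyone on the web can point at an individual record, and the `sameAs` link lets consumers pull in context from the linked dataset.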
29. Where to Preserve and Share Data?
Institutional Repository: HKU Scholars Hub
● Enhances visibility of HKU authors and their research
● Opportunities for collaboration
● ~325 datasets
● http://hub.hku.hk/
30. Where to Preserve and Share Data?
Disciplinary Repository
● Global online archiving platforms for particular subjects; some provide free storage
● GitHub - open source code and software: https://github.com
● Figshare - reserve a DOI for publication: https://figshare.com
● Dryad - research data in science and medicine: http://datadryad.org
● GigaDB - research data in biology and biomedicine: http://gigadb.org/site/index
● Dataverse - one of the largest collections of science datasets: http://dataverse.org
31. REFERENCES
Mallery, M. (2014). DMPTool: Guidance and resources for your data management plan; https://dmp.cdlib.org. Technical Services Quarterly, 31(2), 197-199.
Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018. doi:10.1038/sdata.2016.18
32. THANKS!
Any questions?
You can find me at lernest@hku.hk
CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival
Editor's Notes
This data curation workshop aims to provide the fundamental concepts of data curation and practical tools for data management. To help researchers follow the workshop, it is structured around the following 6 questions:
What is Data Curation?
Why Data Curation?
How to Start Data Curation?
How to Organize Data?
Which Data Formats to Use?
Where to Preserve and Share Data?
Keywords: adding value; current and future use; active management; lifecycle
Two examples of data curation models: the DCC lifecycle and the DataONE model.
The main idea of data curation is that it is a continuous process of creation, preservation, and reuse.
Data curation is more than just preservation: it organizes the data through metadata and enhances the re-usability of the data.
A few examples are used to show that data management is important for the preservation and re-use of data.
A statistic showing that 80% of data are unavailable after 20 years; scientists are losing their data at a rapid rate.
An entertaining cartoon video explaining why a researcher cannot use the data because of poor data management, such as broken data formats and poorly organized file names.
A case study in which the useful data of an agricultural researcher could not be recovered after his death.
Data curation is also needed to satisfy local requirements or policies set by the institution or government. In HK, there is only institutional policy.
Data Management Planning Tool: a very simple and useful tool for researchers to start managing their data or writing a funding proposal.
A wide variety of useful templates to choose from.
Visibility settings; co-worker editing function.
Guidelines for the planning.
Preview and export of the plan.
Dublin Core as the metadata standard.
File renaming tools and tips (in particular, do not use spaces in file names).
OpenRefine, for data cleaning: a sample dataset is used to demonstrate that it is a handy tool when there is a lot of data and we need to merge variants of the same value, such as differences in spacing, capitalization, and articles (a/an/the).
Gephi, for social network analysis and visualization.
Silk, for data publishing and visualization.
Some data storage technologies, such as the floppy disk and the cassette tape, are already obsolete.
Some formats are a better choice for preservation, and a list of recommended formats is provided. For example, CSV is better than XLS/XLSX, TXT is better than DOC/DOCX, and TIFF is better than JPG.
The recommendation is similar for data sharing; the 5-star open data scheme is a simple indicator of which formats are better for sharing. Most researchers use PDF and XLS for sharing; however, CSV is a better option.
RDF - Resource Description Framework; a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines. (http://www.nature.com/articles/sdata201618#ref1)
LOD - Linked Open Data; linked data which is released under an open licence that does not impede its free reuse (https://www.w3.org/DesignIssues/LinkedData.html)
A data repository is suggested for the preservation and sharing of data. In the case of HKU, there is an institutional repository, the HKU Scholars Hub, which holds approximately 325 datasets at the moment.
Disciplinary repositories are online platforms for archiving data on a particular subject, and most of them are free:
GitHub is a repository for open source code and software, for example OpenRefine.
Figshare enables reserving a DOI for a publication. (A DOI is a persistent link for a publication.)
Dryad is a repository for research data in science and medicine.
The Dataverse Network is a repository containing all kinds of scientific data. It holds one of the largest collections of social science data.