This document provides an introduction to data management. It discusses why data should be managed, including benefits such as enabling verification, new research, and cost savings. It also covers data entry and manipulation, quality control, backup, metadata, and data sharing. Effective data management yields high-quality, accessible data that can be cited and reused, and it helps researchers gain recognition.
Data Management Best Practices
1. Introduction to Data Management
CC image by University of Maryland Press Releases on Flickr
Adapted from curriculum developed by DataONE
2. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
3. Introduction to Data Management
• Organization
• Reproducibility
• Version control
• Quality control
• Valuable asset
• Accuracy
• Integrity
• Data sharing
• Sustainability & accessibility
4. Introduction to Data Management
• If data are:
o Well-organized
o Documented
o Preserved
o Accessible
o Verified as to accuracy and validity
• Result is:
o High quality data
o Easy to share and re-use in science
o Citation and credibility to the researcher
o Cost-savings to science
6. Introduction to Data Management
Data sharing requires effort, resources, and faith in others.
Why do it?
For the benefit of:
o the public
o the research sponsor
o the research community
o the researcher
CC image by Jessica Lucia on Flickr
7. Introduction to Data Management
A better-informed public yields better decision making with regard to:
o Environmental and economic planning
o Federal, state, and local policies
o Social choices, such as use of tax dollars and education options
o Personal lifestyle and health, such as nutrition and recreation
CC image by falonyates on Flickr
8. Introduction to Data Management
• Organizations that sponsor research must maximize the
value of research dollars
• Data sharing enhances the value of research investments by
enabling:
o verification of performance metrics and outcomes
o new research and increased return on investment
o advancement of the science
o reduced data duplication expenditures
9. Introduction to Data Management
Access to related research enables community members to:
o build upon the work of others
o perform meta analyses
o share resources and perspectives
CC image by Lawrence Berkeley National Laboratory on Flickr
10. Introduction to Data Management
Access to related research enables community members to
(cont’d):
o increase transparency, reproducibility and comparability of results
o expand methodology assessment, recommendations and
improvement
o educate new researchers as to the most current and significant
findings
11. Introduction to Data Management
Scientists who share data gain the benefit of:
o Recognition
o Improved data quality
o Greater opportunity for data exchange
o Improved connections
CC image by SLU Madrid Campus on Flickr
12. Introduction to Data Management
Step One:
Create robust metadata that is discoverable
o Geographic and temporal coverage
o Discipline specific metadata schema
o Discipline specific vocabulary
o Describe attributes
13. Introduction to Data Management
Step Two:
Include archival and reference information
o Include a data citation
o Include a persistent identifier (e.g., DOI)
Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of
morphological diversification in the absence of a detailed phylogeny: a case study from
characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
14. Introduction to Data Management
Step Three:
Have data contributors review your metadata to ensure validity
and organizational ‘correctness’
o are the processes described accurately?
o are all contributions adequately identified?
o has management reviewed the product and documentation?
o is the funding organization properly recognized?
15. Introduction to Data Management
Step Four:
Publish your data and metadata via:
Data Repositories/Clearinghouses
• Discipline-specific
◦ Sciences
• Knowledge Network for Biodiversity (KNB) Data Portal
• Long Term Ecological Research (LTER) Network Data Portal
◦ Social Sciences
• ICPSR
• Institutional
◦ Trace
16. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
17. Introduction to Data Management
• Create data sets that are:
o Valid
o Organized to support ease of use
CC image by Travis S on Flickr
18. Introduction to Data Management
• Inconsistency between data collection events
– Location of Date information
– Inconsistent Date format
– Column names
– Order of columns
19. Introduction to Data Management
• Inconsistency between data collection events
– Different site spellings, capitalization, spaces
in site names—hard to filter
– Codes used for site names for some data, but
spelled out for others
– Mean1 value is in Weight column
– Text and numbers in same column – what is
the mean of 12, “escaped < 15”, and 91?
20. Introduction to Data Management
• Columns of data are consistent:
only numbers, dates, or text
• Consistent Names, Codes, Formats (date) used in each column
• Data are all in one table, which is much easier for a statistical program to work
with than multiple small tables which each require human intervention
21. Introduction to Data Management
• Descriptive column names
◦ Soil T30 Soil_Temp_30cm
◦ Species-Code Species_Code (avoid using -,+,*,^ in column names.
Some software may interpret these symbols as an operator)
• Descriptive file names
◦ Mammal data-.csv FieldVisit1_SmallMammalData_2010-04-11.csv
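A small illustrative sketch of applying these naming rules in code, assuming pandas; the raw file and the terse column names are hypothetical:

import pandas as pd

# Hypothetical raw file whose columns use spaces and operator characters
df = pd.read_csv("FieldVisit1_SmallMammalData_2010-04-11.csv")

# Map terse or unsafe names to descriptive, underscore-only names
df = df.rename(columns={
    "Soil T30": "Soil_Temp_30cm",
    "Species-Code": "Species_Code",
})

# Flag any remaining names containing characters some software treats as operators
unsafe = [c for c in df.columns if any(ch in c for ch in "-+*^ ")]
print("Column names still needing attention:", unsafe)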
22. Introduction to Data Management
• Enter complete lines of data
Sorting an Excel file with empty cells is not a good idea!
23. Introduction to Data Management
• Missing data
o Preferably leave field empty (NULL = no value)
o In numeric fields, use a distinct value such as 9999 to indicate a missing
value
o In text fields, use NA (“Not Applicable” or “Not Available”)
o Use Data flags in a separate column to qualify missing value
Date      Time  NO3_N_Conc  NO3_N_Conc_Flag
20081011  1300  0.013
20081011  1330  0.016
20081011  1400              M1
20081011  1430  0.018
20081011  1500  0.001       E1
M1 = missing; no sample collected
E1 = estimated from grab sample
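A rough sketch of honoring this convention when the data are read in, assuming pandas; the file name is hypothetical and the columns follow the table above:

import pandas as pd

df = pd.read_csv(
    "nitrate_2008-10-11.csv",
    dtype={"NO3_N_Conc_Flag": "string"},
    na_values=["9999"],            # also treat the 9999 sentinel as missing
)

# Empty cells load as NULL (NaN); the flag column explains why a value is absent or estimated
missing = df[df["NO3_N_Conc_Flag"] == "M1"]      # M1 = missing; no sample collected
estimated = df[df["NO3_N_Conc_Flag"] == "E1"]    # E1 = estimated from grab sample

print(f"{len(missing)} missing and {len(estimated)} estimated values out of {len(df)} readings")
print("Mean concentration (missing values excluded):", df["NO3_N_Conc"].mean())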
26. Introduction to Data Management
Spreadsheets:
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity: a column can be sorted independently of all others
• Easy to use, but harder to maintain as the complexity and size of the data grow
Databases:
• Easy to query to select portions of data
• Data fields are typed: for example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
27. Introduction to Data Management
• A set of tables
• Relationships
• A command language
Sample sites table: *siteID, site_name, latitude, longitude, description
Species table: *speciesID, species_name, common_name, family, order
Samples table: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
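A minimal sketch of the same three-table structure using Python's built-in sqlite3 module; the field names follow the diagram above, while the column types and key declarations are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for illustration
conn.executescript("""
CREATE TABLE sample_sites (
    siteID      TEXT PRIMARY KEY,
    site_name   TEXT,
    latitude    REAL,
    longitude   REAL,
    description TEXT
);
CREATE TABLE species (
    speciesID    TEXT PRIMARY KEY,
    species_name TEXT,
    common_name  TEXT,
    family       TEXT,
    "order"      TEXT
);
CREATE TABLE samples (
    sampleID    INTEGER PRIMARY KEY,
    siteID      TEXT REFERENCES sample_sites(siteID),
    sample_date TEXT,
    speciesID   TEXT REFERENCES species(speciesID),
    height      REAL,
    flowering   TEXT,
    flag        TEXT,
    comments    TEXT
);
""")
print("Tables created:", [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")])

The relationships between the samples table and the two lookup tables are what the command language (such as the SQL shown on a later slide) uses to combine data.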
28. Introduction to Data Management
Date Site Height Flowering
<dates only> <text only> <real numbers only> <‘y’ and ‘n’ only>
Advantages
• quality control
• performance
29. Introduction to Data Management
Date Site Species Flowering?
2/13/2010 A BOGR2 y
2/13/2010 B HODR y
4/15/2010 B BOER4 y
4/15/2010 C PLJA n
Site Latitude Longitude
A 34.1 -109.3
B 35.2 -108.6
C 32.6 -107.5
Date Site Species Flowering? Latitude Longitude
2/13/2010 A BOGR2 y 34.1 -109.3
2/13/2010 B HODR y 35.2 -108.6
4/15/2010 B BOER4 y 35.2 -108.6
4/15/2010 C PLJA n 32.6 -107.5
Mix and match data on the fly
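The same mix-and-match idea can be sketched with pandas; the values below are copied from the tables on this slide, and only the merge call is the point:

import pandas as pd

observations = pd.DataFrame({
    "Date": ["2/13/2010", "2/13/2010", "4/15/2010", "4/15/2010"],
    "Site": ["A", "B", "B", "C"],
    "Species": ["BOGR2", "HODR", "BOER4", "PLJA"],
    "Flowering": ["y", "y", "y", "n"],
})
sites = pd.DataFrame({
    "Site": ["A", "B", "C"],
    "Latitude": [34.1, 35.2, 32.6],
    "Longitude": [-109.3, -108.6, -107.5],
})

# Join observations to site coordinates on the shared Site column
combined = observations.merge(sites, on="Site", how="left")
print(combined)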
30. Introduction to Data Management
This table is called SoilTemp:
Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
2010-02-02  C     R          0            6.3
2010-02-02  A     N          0            15.1
SQL examples:
Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = '2010-02-01'
Result:
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
Select * from SoilTemp where Treatment = 'N' and SensorDepth = '0'
Result:
2010-02-02  A     N          0            15.1
32. Introduction to Data Management
• Be aware of Best Practices when designing data file structures
• Choose a data entry method that allows some validation of
data as it is entered
• Consider investing time in learning how to use a database if
datasets are large or complex
CC image by fo.ol on Flickr
33. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
34. Introduction to Data Management
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
35. Introduction to Data Management
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure the assigned person is educated in QA/QC
36. Introduction to Data Management
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
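One hedged illustration of the computer-verification step for double-keyed data; the file names and the RecordID column are hypothetical:

import pandas as pd

# The same data sheet keyed in independently by two people
entry_a = pd.read_csv("survey_entry_keyer1.csv").sort_values("RecordID").reset_index(drop=True)
entry_b = pd.read_csv("survey_entry_keyer2.csv").sort_values("RecordID").reset_index(drop=True)

# compare() reports only the cells where the two versions disagree
disagreements = entry_a.compare(entry_b)

if disagreements.empty:
    print("Both entries agree; data pass this check.")
else:
    print("Cells to re-check against the original data sheets:")
    print(disagreements)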
37. Introduction to Data Management
• Design data storage well
◦ Minimize the number of times the same item must be entered
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
38. Introduction to Data Management
• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
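A small sketch of these checks, using pandas; the file, the column name, and the plausible range are made-up examples:

import pandas as pd

df = pd.read_csv("soil_temperature.csv")

# Statistical summary: quick look at counts, means, minimums, and maximums
print(df.describe())

# Missing values per column
print(df.isna().sum())

# Impossible or anomalous values, judged against a plausible range for this variable
suspect = df[(df["Soil_Temperature"] < -40) | (df["Soil_Temperature"] > 60)]
print(f"{len(suspect)} readings fall outside the plausible range")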
39. Introduction to Data Management
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
[Scatter plot of example data used to illustrate an outlying value]
40. Introduction to Data Management
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Plotting on maps
◦ Deviation
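Alongside the graphical methods, a simple numerical screen can flag candidate outliers for inspection. This sketch uses the common 1.5 x IQR rule, an illustrative choice rather than one prescribed here, and the file and column names are hypothetical:

import pandas as pd

df = pd.read_csv("stream_chemistry.csv")
values = df["NO3_N_Conc"]

# Interquartile-range screen: points far beyond the middle 50% of the data are suspects
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_suspect = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Do not delete: flag potential contamination for follow-up, per the goal stated above
df["Outlier_Flag"] = is_suspect.map({True: "check", False: ""})
print(df.loc[is_suspect])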
41. Introduction to Data Management
• Beware of errors of commission and omission
• Execute quality assurance and quality control strategies
◦ Data entry
◦ Data visualization
42. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
43. Introduction to Data Management
• Backups vs. Archives
◦ Backups: a copy (or copies) of the original file is made before the
original is overwritten
o Archives: preservation of the file
• Data Preservation
o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, and metadata
44. Introduction to Data Management
• Backups
o periodic snapshots
o usually copies of files
o performed on regular schedule
• Archiving
o preserve data
o usually the final version
o performed at the end of a project or milestones
It is a good idea to have multiple copies of your backups and
archives, in case one copy fails.
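A bare-bones sketch of taking one backup snapshot in Python; the paths are placeholders, and a real plan would add scheduling, off-site copies, and retention rules as described in the backup plan slide below:

import hashlib
import shutil
from datetime import datetime
from pathlib import Path

source = Path("data/FieldVisit1_SmallMammalData_2010-04-11.csv")   # placeholder path
backup_dir = Path("backups")
backup_dir.mkdir(exist_ok=True)

# Timestamped copy so earlier snapshots are never overwritten
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
shutil.copy2(source, target)

# Record a checksum so silent corruption of the copy can be detected later
digest = hashlib.sha256(target.read_bytes()).hexdigest()
(backup_dir / (target.name + ".sha256")).write_text(f"{digest}  {target.name}\n")
print(f"Backed up {source} -> {target}")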
45. Introduction to Data Management
• Limit or negate loss of data
• Save time, money, productivity
• Help prepare for disasters
o Accidental deletions
o Fires, natural disasters
o Software bugs, hardware failures
• Reproduce results
• Respond to data requests
• Limit liability
CC image courtesy of Brian J Matis on Flickr
46. Introduction to Data Management
• Includes backups and archiving
• Also includes
◦ data conversion
◦ data reformatting
◦ data rescue
47. Introduction to Data Management
• Data Conversions and Formats
o Use non-proprietary, standard formats
o Textual documents: .txt
o Spreadsheets: .csv
o Digital images: .tiff
• Versioning
• File Naming
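A tiny sketch of generating descriptive, versioned file names for a non-proprietary format; the naming pattern is an illustration, not a mandated convention:

from datetime import date

def versioned_name(project: str, description: str, version: int, ext: str = "csv") -> str:
    # Builds a name like "FieldVisit1_SmallMammalData_<YYYY-MM-DD>_v2.csv"
    return f"{project}_{description}_{date.today().isoformat()}_v{version}.{ext}"

print(versioned_name("FieldVisit1", "SmallMammalData", version=2))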
48. Introduction to Data Management
• Create a backup plan that clearly identifies:
o roles,
o responsibilities,
o where the data is backed up,
o how often the files are backed up,
o how to access the files,
o recommended file formats to be used, and
o policies for migrating data to assure data are not lost due to media
degradation or changing formats or programs
• Review your backup plan regularly
• Update as needed
49. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
50. Introduction to Data Management
• When you provide data to someone else, what types of
information would you want to include with the data?
• When you receive a dataset from an external source, what
types of details do you want to know about the data?
51. Introduction to Data Management
• Why were the data created?
• What limitations, if any, do the data have?
• What do the data mean?
• How should the data be cited if they are re-used in a new study?
52. Introduction to Data Management
• What are the data gaps?
• What processes were used for creating the
data?
• Are there any fees associated with the data?
• In what scale were the data created?
• What do the values in the tables mean?
• What software do I need in order to read the
data?
• What projection are the data in?
• Can I give these data to someone else?
53. Introduction to Data Management
Metadata is: Data ‘reporting’
• WHO created the data?
• WHAT is the content of the data?
• WHEN were the data created?
• WHERE is it geographically?
• HOW were the data developed?
• WHY were the data developed?
Photo by Michelle Chang. All Rights Reserved.
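As a loose, informal illustration only (the field names below are ad hoc rather than a formal metadata standard, and all values are invented), the who/what/when/where/how/why questions can travel with the data in a small machine-readable record:

import json

metadata = {
    "who":   "A. Researcher, Example University",                  # who created the data
    "what":  "Small mammal trapping results, field visit 1",       # what the data contain
    "when":  "2010-04-11",                                         # when the data were created
    "where": {"latitude": 34.1, "longitude": -109.3},              # where, geographically
    "how":   "Field observations entered from paper data sheets",  # how the data were developed
    "why":   "Baseline survey for long-term monitoring",           # why the data were developed
}

with open("FieldVisit1_SmallMammalData_2010-04-11.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)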
54. Introduction to Data Management
Author(s) Boullosa, Carmen.
Title(s) They're cows, we're pigs /
by Carmen Boullosa
Place New York : Grove Press, 1997.
Physical Descr viii, 180 p ; 22 cm.
Subject(s) Pirates Caribbean Area Fiction.
Format Fiction
CC image by USDAgov on Flickr
CC image by Mskadu on Flickr
55. Introduction to Data Management
[Figure: data details (y-axis) decline over time (x-axis) from the moment of data development (from Michener et al. 1997)]
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about data sets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to "mental storage" difficult or unlikely
• Loss of the data developer leads to loss of the remaining information
56. Introduction to Data Management
• A Standard provides a structure to describe data with:
◦ Common terms to allow consistency between records
◦ Common definitions for easier interpretation
◦ Common language for ease of communication
◦ Common structure to quickly locate information
• In search and retrieval, standards provide:
◦ Documentation structure in a reliable and predictable format for
computer interpretation
◦ A uniform summary description of the dataset
CC image by ccarlstead on Flickr
58. Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
59. Introduction to Data Management
• Similar to citing a published
article or book
o Provide information necessary to
identify and locate the work cited
• No standards yet
• Use format recommended by
journal, repository, or
professional organization
CC image by Paxsimius on Flickr
60. Introduction to Data Management
Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB,
Trice M, Clark V (2015) Data from: Landscape-level variation in
disease susceptibility related to shallow-water hypoxia. PLOS
ONE. http://dx.doi.org/10.5061/dryad.9k231
66. Introduction to Data Management
A persistent identifier should be included in the citation:
• DOI (Digital Object Identifier)
◦ Globally unique, alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location
◦ May be assigned to any item of intellectual property that is defined by structured metadata
◦ Examples: 10.1234/NP5678; 10.5678/ISBN-0-7645-4889-4; 10.2224/2004-10-ISO-DOI
• The UT Libraries can assign a DOI to your data set.
67. Introduction to Data Management
Chris Eaker
Data Curation Librarian
Hodges Library, Room 236
865-974-4404
chris@utk.edu
Available to help with…
• Data management plan support
• Data repositories
• Metadata
• Data management consulting
Editor's Notes
Manage your data for yourself:
Keep yourself organized – be able to find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc)
Track your science processes for reproducibility – be able to match up your outputs with exact inputs and transformations that produced them
Better control versions of data – identify easily versions that can be periodically purged
Quality control your data more efficiently
Data is a valuable asset – it is expensive and time consuming to collect
Data should be managed to:
maximize the effective use and value of data and information assets
continually improve data quality, including: data accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness
ensure appropriate use of data and information
facilitate data sharing
ensure sustainability and accessibility in long term for re-use in science
To summarize, the goal of effective DM is data that are well organized – both within the files themselves and across groups of files – documented adequately with metadata, preserved for future reuse, accessible to others for that reuse, and accurate and valid.
If data are all those things, then you will have high-quality data that are easy to share and reuse in science. The data can then be cited by other researchers, which adds credibility to the researcher who prepared them. Overall, it saves money and advances science.
One of the main goals of effective data management is to facilitate sharing data with other researchers. Grant funding agencies want to see a greater return on their investment. Let’s talk a bit about data sharing and why it’s important.
Why expend the extra effort to share data? Because it benefits the public, the research sponsor, the research community and, perhaps most importantly, the researcher.
How does the public benefit from shared research?
The more informed the public is, the better they are able to understand and contribute toward effective public and personal decisions:
The public needs data to help with environmental and economic planning
Data help inform federal, state and local policies
The public can use data to help with social choices such as who they will vote for, how they want their tax dollars to be used, and where they will send their children to school.
It can even help them make personal lifestyle and health choices such as exercise, smoking, and nutrition.
Why do research sponsors encourage data sharing? Because sponsors have an obligation to maximize the investment of research dollars.
Data sharing enhances the value of the research investment by enabling external reviewers to verify the project performance metrics and outcomes. This not only increases the credibility of the data but also spurs new research that can build upon the initial investment and advance the science rather than duplicate expenditures.
The scientific community as a whole also benefits from sharing among researchers. Data sharing allows researchers to build upon one another’s work and to further, rather than duplicate, the science by exploring new findings or combining findings into meta analyses that cannot be performed with individual data. In sharing data, the scientific community expands both individual perspectives and the collective comprehension.
Access to related research enables members of the scientific community to better reproduce, compare and assess methods and results. Scientists are able to learn from one another and educate new researchers as to the most current and significant findings.
And finally, how does the independent researcher benefit from data sharing? When scientists share their data, they gain recognition as an authoritative source and respect as a wise investment for research dollars. When data are exposed, feedback from the broader community can be used to improve the quality and presentation of the data. Shared data also allows for greater opportunity for data exchange and networking opportunities with peers and potential collaborators.
I’m going to go through 4 steps to make data shareable. Then we’ll go through the nuts and bolts of executing on those four steps.
The more robust your metadata, the easier your data will be discovered and the more appropriately it will be used.
Specifically, when creating metadata, be specific in regards to the geographic and time period coverage of your data. For example….
Use a discipline specific metadata schema, if possible. For example…
Use discipline specific themes, place names, and keywords.
Describe any attributes (variable names, specimen names, etc) thoroughly
Step 2: Be sure to include archival and reference information with properly formatted data citations for sources and content. Include persistent data identifiers with the data citation. We’re going to go into data citation in more depth later, but this is an example of a citation for a data set on the Dryad Data Repository. As you can see, there is a DOI added to the end, which allows the data to be located easily.
Step 3: Be sure to have data contributors review their metadata to ensure validity and organizational correctness. Are the processes correct? Is your contribution adequately represented and reflected? Is your organization properly recognized and is the funding organization properly recognized? Be sure to get management and sponsor approval on the data publication including the content, presentation, and manner in which contributors are identified.
The last step to make your data Shareable: Step 4: Publish your metadata in data portals and clearinghouses. Seek out relevant government portals and portals developed by specific communities of practice. The nice thing about these two science data repositories is that the metadata in them is harvested by the DataONE search portal and so people just have to search one interface to find data in these two, among many other, data repositories.
Now, one of the most important things one can do when managing their research data is to take steps to make sure data entry, processing, and analysis are done in such a way to minimize error. We’ll talk now of ways to do that during data entry and manipulation.
The goals of data entry are to create data that are valid, or have gone through a process to assure quality, and are organized to support use of the data or for ease of archiving.
Many researchers like to manage their data in Excel, and Excel makes it easy to use poor data entry practices. Excel doesn’t enforce any rules on your data entry unless you tell it to.
These are data entered into Excel from a small mammal trapping study. Each block of data represents a different trapping period (2/13, 3/15, and 4/10/2010). Inconsistencies in how the data were entered for each sampling period make the data difficult to analyze and difficult for anyone but the data collector to understand. Note that the date is listed in different places in each block. Date is a column in the first block, but listed in the header in the block on the right. Inconsistent date formats were also used. In one place the date is formatted as day-month-year, with the first three letters of the month spelled out, while elsewhere the format is mm/dd/yyyy. Note also that the order of the columns is inconsistent: Site, Date in the first block, and Site, Plot in the bottom block. Even the columns are named differently. Species is called Species in the first block, and RodentSp in the block on the right. This can be confusing to any user who must try to make sense of these data! And it would be a nightmare to try to write metadata for this spreadsheet.
There are other problems with how these data were entered. Naming of sites is also inconsistent. For instance, Deep Well is used in the first block vs. DW in the block on the right. The file also contains several typos, such as rioSalado vs. rioSlado. A human can figure out what each of these site names refers to, but the names would have to be harmonized before a statistical program could use them. It would be easier to filter for just Deep Well (with a space) than to have to know you also need to filter for DeepWell (no space). Similarly, in one place a species code is capitalized PERO, and lowercase elsewhere.
Further, in the first block of data, a mean was calculated for the weight of the rodents. The value for that mean, called Mean1, is in the same column as the weights of the individual animals. In later manipulations of these data, it would be easy to copy that value as though it represented the weight of a single animal. It is bad practice to mix types of information in one column: raw data are best maintained in one file, with calculations done elsewhere.
In addition, there is text data mixed with numeric data in the Weight column in the block on the right – it says “escaped < 15” (presumably indicating that a rodent less than 15 grams escaped). A statistical program will not know how to deal with text data mixed with numeric data. What is the mean of 12, 91, and “escaped < 15”?
To analyze all these data using statistical software, and to make them much easier for any user to understand, these data will need to be organized into a column for each variable. Therefore it is essential that only one type of information be entered into each column, and that spellings, codes, formats, etc. be consistent.
This shows the same data entered in a way that would make it easy to understand and analyze.
The data are not entered in separate blocks arrayed in a single worksheet. They are entered in one table, with columns defined by the variables Date, Site, Plot, Species, Weight, Adult, and Comments that are recorded for each sampling event.
The columns of data have consistent types. Each column contains only numbers, dates, or text.
There are consistent names, codes, and formats used in each column. For instance, all dates are in the same format (mm/dd/yyyy), and there are no typos in the Site Names. Species are all referred to by standard codes. Therefore, if the user wanted to subset the data for species = ‘PERO’, they could easily filter the file for just those data. Additionally, there are only numeric data in the Weight column, so a statistical program or Excel could readily calculate statistics on this column. Preparing metadata for this file would also be straightforward.
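As a hedged illustration of why this one-column-per-variable layout pays off (not part of the original module), the short sketch below assumes the hypothetical file and column names from this example and uses pandas; any comparable tool would do.

```python
# Minimal sketch, assuming the example file and columns (Date, Site, Plot,
# Species, Weight) and that pandas is installed.
import pandas as pd

mammals = pd.read_csv("FieldVisit1_SmallMammalData_2010-04-11.csv",
                      parse_dates=["Date"])

# Subsetting for species = 'PERO' is a one-line filter on a tidy table.
pero = mammals[mammals["Species"] == "PERO"]

# Because Weight contains only numbers, summary statistics are straightforward.
print(pero["Weight"].mean(), pero["Weight"].std())
```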
Descriptive column names:
A best practice in data entry is to create descriptive column names without spaces or special characters. Sometimes statistical programs have special uses for some characters, so you should avoid using them in your data file.
Descriptive file names: For instance, a file named FieldVisit1_SmallMammalData_2010-04-11.csv indicates that this file contains data collected during Field Visit 1, contains small mammal data, and is version dated April 11, 2010. We know it’s April 11, and not November 4, because it uses the standard ISO date format: four-digit year, followed by two-digit month, and two-digit day. This name is much more helpful than a file named Mammal data.csv.
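As a small, hypothetical illustration of that convention (the visit number and date are taken from the example above, not prescribed by the module), such names can be built programmatically so the ISO date is never mistyped:

```python
# Sketch: construct a descriptive file name with an ISO (YYYY-MM-DD) date.
from datetime import date

visit = 1                       # hypothetical field visit number
collected = date(2010, 4, 11)   # collection date from the example above
filename = f"FieldVisit{visit}_SmallMammalData_{collected.isoformat()}.csv"
print(filename)                 # FieldVisit1_SmallMammalData_2010-04-11.csv
```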
There are a lot of great things about spreadsheets, but one must be wary of problems that can arise from their use. Spreadsheets, for instance, can sort one column independently of all others. The data entry person for the upper spreadsheet elected to leave empty cells for site, treat, web, plot, quad. It’s obvious why and doesn’t cause the human reader any problems. But if someone happens to decide to sort on Species, it is no longer clear which species maps to which time period or to which measurements. This could make the spreadsheet unusable. It is good practice to fill in all cells when using a spreadsheet for data entry.
A best practice is to enter complete lines of data, so that the data can be sorted on any one column without loss of information.
When you collect data, you’re bound to have some data points with no values. There are ways to ensure that those values are consistent and don’t throw off your analysis.
A preferred way to identify missing data is with an empty field. If for some reason an empty cell is not possible, for example, the software you’re using requires something to be there, then use an impossible value such as 9999 in numeric fields and in text fields use NA.
If you need to explain something, you can use data flags in a separate column to qualify empty cells. For instance, in this example of stream chemistry data, the flag M1 indicates that the sample was not collected at that interval. The flag E1 indicates that value was estimated.
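A hedged sketch of handling these conventions programmatically is shown below; the 9999 sentinel and the M1/E1 flag codes follow the examples above, while the file and column names are assumptions for illustration.

```python
# Sketch, assuming pandas: treat the sentinel 9999 and the text "NA" as missing
# on read, and keep qualifying flags (M1 = not collected, E1 = estimated value)
# in their own column rather than mixed into the data column.
import pandas as pd

chem = pd.read_csv("stream_chemistry.csv", na_values=[9999, "NA"])

not_collected = chem[chem["Flag"] == "M1"]   # samples never collected
estimated = chem[chem["Flag"] == "E1"]       # values that were estimated
```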
Excel is a very popular data entry tool. It also allows you to enforce data validation rules. Here, a dropdown list has been generated that allows the user to only select entries from this list. In this way, only defined species codes get entered, and the data is consistent.
Here is another example of data validation using Excel. Height has been defined to contain values between 11 and 15. When 20 is entered, the user is told that they have entered an illegal value.
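The same kinds of rules can be enforced outside Excel. The sketch below is only an illustration: the 11-to-15 height range mirrors the slide example, while the allowed species codes and column names are assumptions.

```python
# Sketch of rule-based validation: species codes must come from a defined list,
# and Height must fall between 11 and 15 (per the slide example).
import pandas as pd

VALID_SPECIES = {"PERO"}   # extend with the project's defined code list

def invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that break the entry rules, for review before analysis."""
    bad_species = ~df["Species"].isin(VALID_SPECIES)
    bad_height = ~df["Height"].between(11, 15)
    return df[bad_species | bad_height]
```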
Some researchers are turning to database software instead of spreadsheets for their data management needs. Databases are a powerful option for storing and manipulating datasets. Here, we list some of the pros and cons of spreadsheets vs. databases (which include software such as Oracle, MySQL, SQL Server, and Microsoft Access). Spreadsheets are good at making charts and graphs, and doing calculations. They are easy to use, but they become unwieldy as the number of records grows and a dataset becomes complex. Databases, on the other hand, work well with high volumes of data, and they are much easier to query in order to select data having particular characteristics. They also maintain data integrity – that is, one column cannot be sorted independently of all others, as it can in a spreadsheet. Databases also enforce data typing, which is a best practice. This means that only data of type ‘text’, for example, can be entered into a column of type ‘text’. This helps prevent data entry errors. Databases do have a steeper learning curve than a spreadsheet such as Excel, but there are many benefits.
A relational database matches data stored in tables by using common characteristics found within the data set. This helps preserve data integrity and also makes it possible to flexibly mix and match data to get different combinations of information. A database consists of a set of tables and each table has a defined relationship with another table or tables using a common piece of information called the Primary Key. Databases also have a powerful command language for querying and manipulating information called Structured Query Language, or SQL.
Here, a dataset for plant phenology has been divided into three tables, one describing site information, one describing characteristics of each sample, and one describing the plant species found.
Relational databases are currently the predominant choice in storing data like financial records, medical records, personal information and manufacturing and logistical data.
Database features include explicit control over data types, which has advantages for quality control and performance. Here, in the plant phenology table, only dates are allowed in the Date column, only text is allowed in the Site column, and only real numbers are allowed in the Height column. If a user tries to enter a ‘?’ under Flowering, the database will reject the entry. This is useful for defining how data are to be entered.
Relationships can be defined between two sets of data or in this example between two tables. Suppose that you have two tables used in the plant phenology study, one for observations and one for sites, and you want a table that contains both observations and the latitude and longitude of your sites. Because both tables contain Site info, they can be joined to create a table containing the info you want.
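A minimal sketch of that join is shown below, using SQLite from Python purely so the example runs; the simplified two-table layout, column names, and coordinate values are invented here for illustration, not taken from the module.

```python
# Two tables share a Site column (the key); one query combines observations
# with each site's latitude and longitude.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sites (Site TEXT PRIMARY KEY, Latitude REAL, Longitude REAL);
    CREATE TABLE observations (Date TEXT, Site TEXT, Species TEXT, Height REAL);
    INSERT INTO sites VALUES ('Deep Well', 34.35, -106.69);          -- invented values
    INSERT INTO observations VALUES ('2010-02-01', 'Deep Well', 'PERO', 12.0);
""")

rows = con.execute("""
    SELECT o.Date, o.Species, o.Height, s.Latitude, s.Longitude
    FROM observations AS o JOIN sites AS s ON o.Site = s.Site
""").fetchall()
print(rows)
```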
Database features also include a powerful command language called Structured Query Language (SQL).
These are just a couple of examples of what you can do with SQL. The table at the top of this slide is named SoilTemp in the database. The first example SQL command returns all records collected on 2010-02-01.
The second select statement returns all records from the table SoilTemp where Treatment is N and SensorDepth is 0. From this example you can get a sense of how easy it is to use SQL to subset data based on different criteria. This is only very simple SQL; there is much, much more that can be done with it.
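The two statements described above might look like the following sketch; SQLite is used only to make it runnable, and any SoilTemp columns beyond Date, Treatment, and SensorDepth (here, Temp) are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SoilTemp (Date TEXT, Treatment TEXT, SensorDepth REAL, Temp REAL);
    INSERT INTO SoilTemp VALUES ('2010-02-01', 'N', 0, 3.2);
    INSERT INTO SoilTemp VALUES ('2010-02-02', 'C', 10, 4.1);
""")

# All records collected on 2010-02-01:
print(con.execute("SELECT * FROM SoilTemp WHERE Date = '2010-02-01'").fetchall())

# All records where Treatment is N and SensorDepth is 0:
print(con.execute(
    "SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0").fetchall())
```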
Forms can be created that make entering data into a relational database as easy as entering it into Excel. The screenshot below shows embedded forms that were quickly generated in MS Access for adding data to three tables in a database of plant cover measurements.
Be aware of best practices when designing data file structures. Choose a data entry method that allows validation of data entered, and be sure to invest time in learning how to use a database, especially if the datasets are large or complex.
Ok, now you’ve taken steps during your data entry process to minimize the possibility of data error, but that doesn’t mean you need to stop there. There are other ways to help with quality control, as well.
In general, there are two types of errors that can occur in a data set.
First, errors of commission are the result of incorrect or inaccurate data being included in the data set. This may happen because of a malfunctioning instrument that produces faulty results, data that are mistyped during entry, or other problems.
Errors of omission are the second type of errors. These result from data or metadata being omitted. Situations that result in omission errors are when data are inadequately documented, when there are human errors during data collection or entry, or when there are anomalies in the field that affect the data.
The remainder of the module will cover best practices for quality control and quality assurance for the different stages of a research project.
First, before data collection, a researcher should think about defining and enforcing standards that will be used during the project. Consider formats that will be used for the data tables or data entry forms. Also, if abbreviations or codes are used, they should be defined up front. Measurement units should also be specified and relevant metadata should be identified before collection.
Second, you should assign responsibility for data quality before collection begins. Ideally, the person responsible for data quality assurance and data quality control is the person collecting the data, and is educated in quality control and assurance methods.
Consider using techniques that help eliminate mistakes during data entry. One example is double data entry, where non-digital data are keyed in by two people independently. Differences in the entries can then be detected via computer programs and examined further for mistakes. Another way to reduce data entry error is to record yourself reading off the data, and then transcribe it from the recording. You may also use a text-to-speech program that reads the data to you while you type it into the computer.
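As a hedged sketch of the double-entry check (file names and the assumption that both copies share the same rows and columns are hypothetical), the two independently keyed copies can be compared automatically and only the disagreements reviewed:

```python
# Compare two independent entries of the same data sheet cell by cell.
import pandas as pd

entry_a = pd.read_csv("datasheet_keyed_by_A.csv")
entry_b = pd.read_csv("datasheet_keyed_by_B.csv")

# compare() returns only the cells where the two people disagree; those cells
# are then checked against the original paper data sheet.
differences = entry_a.compare(entry_b)
print(differences)
```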
If you are using spreadsheets or databases, you should carefully consider their design before and during data entry. Use consistent terminology within the database, and atomize data. This means only one piece of information is in each cell of the spreadsheet -- multiple pieces of information embedded in a single data cell will be problematic during data analysis. If you are using a database, restrict what can be entered into the database; for example, set up a field to accept only text or only numerical values, choose a maximum number of characters or a range of values a field will accept, or set a field to accept only unique values.
Finally, document any changes made to data. It saves time if good records of data editing are kept since multiple users are less likely to spend time on error-checking with old versions of data. If mistakes in data editing or cleaning are made, good data records will allow these mistakes to be undone. Documenting data changes may be as simple as creating a text file to accompany the data set, or it may involve using a scripted program for correcting errors so that each step taken is clearly documented.
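One way to make that edit record explicit (a sketch, not the module's prescribed workflow) is to apply every correction in a script that reads the raw file and writes a cleaned copy, so the script itself documents each change and can be re-run or undone; file names and the specific corrections below are hypothetical.

```python
# Each correction is a commented, repeatable step; the raw file is never edited.
import pandas as pd

raw = pd.read_csv("mammals_raw.csv")
clean = raw.copy()

# Correction 1: harmonize inconsistent site names noted during review.
clean["Site"] = clean["Site"].replace({"DW": "Deep Well", "rioSlado": "rioSalado"})

# Correction 2: species codes entered in lowercase are standardized to uppercase.
clean["Species"] = clean["Species"].str.upper()

clean.to_csv("mammals_clean.csv", index=False)
```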
Once data are entered, basic quality assurance measures can be taken. First, if data are in spreadsheets or databases, be sure they line up in their proper columns. Also check for any missing, impossible, or anomalous values. One way to check for these problems is to sort data fields and check for discrepancies. It is often also useful to perform basic statistical summaries, such as means and standard errors. If data transformation was performed for analysis, compare the statistical summaries before and after transformation to ensure no mistakes were made during transformation.
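A brief sketch of those post-entry checks, with hypothetical file and column names:

```python
# Sort to surface anomalous values, summarize, and compare summaries before
# and after a transformation.
import numpy as np
import pandas as pd

df = pd.read_csv("stream_chemistry.csv")

print(df.sort_values("Nitrate").head())   # extreme or misplaced values sort to the ends
print(df["Nitrate"].describe())           # count, mean, std, min, max

log_nitrate = np.log10(df["Nitrate"])
print(log_nitrate.describe())             # re-check the summary after transforming
```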
Another strategy for quality control after data are entered is to look for outliers. Outliers are extreme values for a variable. Extreme values are those that lie outside of the statistical model being used to describe the data. Keep in mind that the goal is not to eliminate outliers but to identify potential data contamination. An easy way to do this is to plot your data on a graph. In this graph, most of the data fall along the black line. The outlier identified with the red arrow should be flagged for further investigation. You may find that that data point is legitimately correct, in which case you would keep it. But you may find it’s the result of a typo when you entered the data, in which case you have the opportunity to correct it before it messes up your analysis.
One common strategy for identifying outliers is using graphical methods, for instance normal probability plots, regression (as in the previous slide), or scatter plots.
If data are geographical, mapping the points can be used to ensure latitude and longitude were correctly entered. This map shows an example of an error that is the result of mis-entering latitude data.
Another method for identifying outliers is using statistics. You can look at the deviation of a value, which is the difference between the observed value and the mean of that variable. By subtracting values from the mean of the data set, the presence of outliers or faulty data points can become apparent.
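A hedged sketch of that deviation check follows; the three-standard-deviation threshold is an assumption, and flagged values are candidates for review rather than confirmed errors.

```python
# Flag values whose deviation from the mean is unusually large.
import pandas as pd

weights = pd.read_csv("FieldVisit1_SmallMammalData_2010-04-11.csv")["Weight"]

deviation = weights - weights.mean()     # difference from the mean
z_scores = deviation / weights.std()     # deviation in standard-deviation units

suspects = weights[z_scores.abs() > 3]
print(suspects)   # investigate these; correct typos, keep legitimate extremes
```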
During this tutorial we first defined several concepts important for understanding quality assurance and quality control. This included data contamination and the types of data errors that can result in poor quality data.
We then covered best practices for quality assurance and quality control. These strategies prevent errors from entering a dataset, or identify those errors if they are already present in the data.
It is important to define and enforce quality assurance and quality control standards before, during, and after the collection and entry of data.
Now you’ve made sure your data were entered properly and you’ve cleaned up the data from errors of commission and omission. All that hard work could be for naught if you don’t have a good backup plan.
The terms data backups and data archiving are often used interchangeably as they both relate to saving a specific version of a file, but they do convey different processes. The term “backup” is used specifically when making copies of various files with the knowledge that the files may change. Backups are kept for a certain amount of time, but can be discarded after a specified time has passed. Archiving is used when a file is to be preserved as-is, often at the end of a project and acts as a static (and usually final) record.
Data preservation encompasses many of these same methodologies, but can also include things like data rescue, reformatting of files, converting data, and the creation of metadata. We’ll talk about backups, archives, and preservation in this section. Metadata has its own section next.
The main difference between data backups and archiving is that backups deal with data that is copied elsewhere and potentially can be overwritten again as the data change. Archiving makes a record of data that is usually in its final state.
When a user performs a backup, they are in essence taking a snapshot of the data at that moment in time. This allows the user to restore the file as needed, such as when the current version of the file is corrupted, lost, or somehow destroyed or altered. Backups are often used for short-term storage or near long-term storage, depending upon the user’s backup needs and procedures. Backups are usually scheduled on a frequent basis.
Archiving deals more with records that are no longer in use and is used to create a historical snapshot of the data. This provides for preservation of the data for future needs. Usually, archives are made when a project ends, or when appropriate.
Regardless of whether you are dealing with backups or archives, you should have multiple copies in case one (or many) versions fail.
There are many reasons to perform backups including:
- mitigate or prohibit the loss of data, which may or may not be reproducible
- save time, money, and productivity as little to none of the data will have to be reproduced
- having a backup already in place means you are prepared for when the unexpected happens, such as human error, disasters, or computer failures
- allows you to go back to earlier versions and see what your results were. For example, if you are creating models and used data from an earlier model run, the most recent file you have on your computer may not have the same data as when you first created the model output.
- provides for the ability to send older files to others, regardless of the current version or state (for example, if the current version has been corrupted)
- may allow you to respond during times when questioned results were based on older versions of files. For example, you may find that you will have to justify your results in court or to other scientists. By having access to older files, you may be able to respond to their requests for information. Or, you may not be able to reproduce the data, and the original copy may be the only evidence of the data collection.
Our last topic covers Data Preservation. Data preservation is a comprehensive topic, which includes things such as backups, archives, data conversion, reformatting, and rescue.
“Data rescue” deals directly with older files that may no longer be in a format that is easily accessible and that will require some “rescuing” before they can be used again.
Data rescue becomes more and more important as projects end. Even if data was being preserved through the lifetime of the project, often files go untouched or orphaned. Frequently, data has not been managed properly, and requires that some form of data rescue be performed so that the data isn’t a total loss.
When preserving your data, you need to consider many things:
Data formats: it is best to use non-proprietary and standardized formats. This will better ensure readability in the future. Be sure to check files after converting them, as data, metadata, and formatting loss can occur.
Versioning: make sure to use some sort of naming convention (such as incremental letters or numbers) to help keep track of file edits and revisions. This will also help you more quickly locate the correct version of a specific file.
Naming: use file names that are consistent, descriptive, and concise. Many software programs use a generic file name as their default file output and usually these names are too general to be useful.
If data is well-preserved, then data rescue may not be necessary. With proper file naming (which can keep the file from getting lost in the system), use of proper file formats (which lets you open the file without having to convert it), backups (which limit loss of files), and appropriate media types (which limit degradation of files), you may limit or prevent the need for data rescue. A good data management plan, which is discussed in another lesson, is another important tool in limiting the need for data rescue.
The first best practice is to create a backup plan. This may be a physical document, a web page, or listed as someone’s job description. Regardless of how you create it, it’s important to address any of the previously mentioned issues and concerns. For example, who do I contact when I need to get a file off of backup? I’ve included a data backup and security checklist in your handouts for you to use to create that policy or plan.
Once you have a backup plan in place, it is good practice to review it periodically to ensure the information still has value and is applicable. Hardware, software, projects, and staff can change over time.
We have two more sections: metadata and data citation.
The two questions you need to consider when you think about metadata are
What information would you need to provide someone else so they would be able to work with your data?
And two, what kind of information would you want to receive from someone else to be able to work with their data?
The answers to these two questions will indicate to you what metadata you need to provide with your data. Just to define the term, metadata is descriptive information that describes your data, the project, the people involved, etc. It’s important for someone else to make sense of your data, and even for you to make sense of it if you go back to it after not having worked on it for a while.
When sharing data, some considerations include:
- why the data was created;
- what limitations, if any, the data have;
- what the data mean; and
- who should be cited if someone publishes something that uses the data.
When receiving data from an alternative source, consider:
What are the data gaps?
What processes were used for creating the current data?
Are there any fees associated with the data?
In what scale were the data created?
What do the values in the tables mean?
What software do I need in order to read the data?
What projection is the data in?
Can I give this data to someone else?
Metadata is data about data. It describes the content, quality, condition, and other characteristics of a dataset.
Metadata records answer questions such as:
Why was the data set created?
What processes were used to create the data set?
What projection is the data in?
When was the data last updated?
Who created the data?
What scale was used?
What fields are in the table?
What do the values in those fields mean?
Who do I contact about getting more information about the data?
How do I obtain a copy of the data?
Do the data cost anything?
Are there any limitations to the data?
Metadata is a valuable tool. Metadata records preserve the usefulness of data over time by detailing methods for data collection and data set creation. Metadata greatly minimizes duplication of effort in the collection of expensive digital data and fosters the sharing of digital data resources.
Metadata is all around us. . .from Mp3 players, to nutrition labels, to library card catalogues.
For example, a card catalogue tells us more information than just the title of the book; it also tells the user:
Who is the author?
Who published the book?
What subject area does the book fall in?
And finally, where is it located in the library?
Another example of metadata that we see in our daily lives is the nutrition and ingredient information on food labels.
Nutrition labels answer questions such as:
What ingredients were used?
Who made the food?
How many calories per serving?
How many servings in the can?
What percentage of daily vitamins are in each serving?
This graph illustrates the phenomenon of “information entropy” associated with research. At the time of the research project, a scientist's memory is fresh. Details about the development of the dataset are easily recalled, and it is a good time to document information about the process. Over time, memory of the details begins to fade. A variety of circumstances can intervene, and eventually detailed knowledge about the dataset fades. Without a metadata record, the data might be unusable. A dataset is not considered complete without a metadata record to accompany it.
An established standard provides common terms, definitions, and structure that allow for consistent communication. The use of standards also supports search and retrieval in automated systems.
This is an example of a metadata record using the Federal Geographic Data Committee (FGDC) standard. The benefit is that if every geographic data set used this standard metadata schema, then everyone would be able to understand the metadata and computers would be able to process it.