2. Background
Sometimes manual editing of an XML document is necessary. XML editors provide a
human-friendly interface for understanding and modifying EML metadata.
3. Objectives
Become familiar with a few XML editors and the most commonly used features
Navigate the EML schema within an XML editor
4. XML (what is it?)
A set of hierarchical custom elements for a community’s use
● Platform-independent, human and machine readable
● Common exchange format for information, like metadata
● Relatively verbose, requires more storage space than some other formats
● Does not have strong data typing or access control, so schemas have to supply
these
● Schema language requires training
5. XML basics
XML Schema
● Describes the structure of an XML document
● Also referred to as “XML Schema Definition” or XSD
So XML is (kind of) a language, and EML is a dialect. A community will write up its
own specification for a standard in XML Schema.
15. Data Table metadata
Some components can be automated
Attribute list: your knowledge of the data is essential
16. Attributes & Units
● Attribute: a “property” of an object (data table)
○ In databases, a table column is called an “attribute”
○ Often referred to as “variables”, “parameters”, “columns” or “field names”
● Unit: a particular physical quantity
○ Defined and adopted by convention
○ Comparable
To describe a data table you need a moderate understanding of
A. how to define the table’s attributes,
B. when and how to define a unit, and
C. the relationship between the two.
17. EML Attribute components
1 Attribute Name:
Usually the name you would give that column in a script
2 Attribute Label:
Longer, for display. Use whole words, capitals, etc.
3 Attribute Definition:
As complete and unambiguous as it needs to be for the data to be understood
4 Measurement Scale:
● Nominal - attribute can be considered a category
● Ordinal - categories that have a logical or ordered relationship to one another
● Interval - the magnitude between the steps is known; equidistant points
● Ratio - has a meaningful zero, which allows ratios between values to have meaning
● Datetime - Gregorian dates and times
5 Unit:
Interval and Ratio measurements only
Choose from: TBD
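The rule that only interval and ratio attributes take a unit can be sketched as a small check. This is a minimal illustration, not part of EML or of any EML library: the `Attribute` class, its field names, and the example values are hypothetical stand-ins for the five components listed on this slide.

```python
from dataclasses import dataclass
from typing import List, Optional

# The five measurement scales named on the slide.
SCALES = {"nominal", "ordinal", "interval", "ratio", "dateTime"}
# Per the slide, a unit applies to interval and ratio measurements only.
SCALES_WITH_UNIT = {"interval", "ratio"}


@dataclass
class Attribute:
    """Hypothetical in-memory stand-in for one attribute (not an EML API)."""
    name: str
    label: str
    definition: str
    scale: str
    unit: Optional[str] = None

    def problems(self) -> List[str]:
        """Return human-readable problems with the scale/unit combination."""
        if self.scale not in SCALES:
            return [f"{self.name}: unknown measurement scale {self.scale!r}"]
        if self.scale in SCALES_WITH_UNIT and not self.unit:
            return [f"{self.name}: {self.scale} attributes need a unit"]
        if self.scale not in SCALES_WITH_UNIT and self.unit:
            return [f"{self.name}: {self.scale} attributes take no unit"]
        return []


# Illustrative (made-up) attributes.
temp = Attribute("temp_c", "Water temperature", "Water temperature at the site",
                 "interval", unit="celsius")
site = Attribute("site", "Site code", "Site where the observation was made",
                 "nominal", unit="meter")
print(temp.problems())
print(site.problems())
```

The check only looks at whether a unit is present; choosing the right unit still depends on your knowledge of the data.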
18. EML Attributes - Measurement Scale
Scale | Description | Type | Examples
Nominal | Values are members of a category | string | Place and taxon names, coded values (e.g., 1=male, 2=female), text comments
Ordinal | Nominal categories that have a logical or ordered relationship to one another | string | Academic grades, quality rankings (e.g., 1=high, 2=medium, 3=low)
Interval | Ordinal, but the magnitude between the steps is known; equidistant points | numeric | Celsius scale, pH
Ratio | Interval, with a meaningful zero, so ratios between values have meaning | numeric | Temperature in Kelvin, lengths, concentrations, organism densities
Datetime | Gregorian dates and times | datetime | Points in time, e.g., with formats like YYYY-MM-DD, hh:mm:ss.s
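One way to use this typology when preparing metadata is a simple lookup from measurement scale to value type. The sketch below transcribes the rows above; the function name and the example column names are hypothetical.

```python
# Value type for each measurement scale, transcribed from the rows above.
VALUE_TYPE = {
    "nominal": "string",
    "ordinal": "string",
    "interval": "numeric",
    "ratio": "numeric",
    "dateTime": "datetime",
}


def value_type(scale: str) -> str:
    """Return the value type that goes with a measurement scale name."""
    try:
        return VALUE_TYPE[scale]
    except KeyError:
        raise ValueError(f"not a measurement scale: {scale!r}") from None


# Hypothetical columns of a data table and the scale chosen for each.
columns = {
    "site": "nominal",
    "quality_rank": "ordinal",
    "temp_c": "interval",
    "length_m": "ratio",
    "sample_date": "dateTime",
}
for column, scale in columns.items():
    print(f"{column}: {scale} -> {value_type(scale)}")
```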
19. EML Attributes - code lists
Nominal: Values are members of a category
Examples: place and taxon names, coded values (e.g., 1=male, 2=female)
<codeDefinition>
<code>ABUR</code>
<definition>Arroyo Burro Reef</definition>
</codeDefinition>
<codeDefinition>
<code>NAPL</code>
<definition>Naples Reef</definition>
</codeDefinition>
...
Ordinal: Nominal categories that have a logical or ordered relationship to one another
Examples: academic grades, quality rankings (e.g., 1=high, 2=medium, 3=low)
<codeDefinition>
<code>A</code>
<definition>scored higher than 90%</definition>
</codeDefinition>
<codeDefinition>
<code>B</code>
<definition>score 80 - 89%</definition>
</codeDefinition>
<codeDefinition>
<code>C</code>
<definition>score 70 - 79%</definition>
</codeDefinition>
...
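A code list like the ones above can be read into a lookup table and used to check the values in a data column. The sketch below uses Python's standard `xml.etree.ElementTree`. The two code definitions are copied from the slide; the `enumeratedDomain` wrapper gives the fragment a single root element, and the function names and the `XXXX` code are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Code definitions copied from the slide, wrapped in one root element so the
# fragment parses as a well-formed XML document.
FRAGMENT = """
<enumeratedDomain>
  <codeDefinition>
    <code>ABUR</code>
    <definition>Arroyo Burro Reef</definition>
  </codeDefinition>
  <codeDefinition>
    <code>NAPL</code>
    <definition>Naples Reef</definition>
  </codeDefinition>
</enumeratedDomain>
"""


def read_code_list(xml_text):
    """Return {code: definition} for every codeDefinition in the fragment."""
    root = ET.fromstring(xml_text)
    return {
        item.findtext("code"): item.findtext("definition")
        for item in root.iter("codeDefinition")
    }


def undefined_codes(values, codes):
    """Return the sorted data values that have no entry in the code list."""
    return sorted(set(values) - set(codes))


codes = read_code_list(FRAGMENT)
print(codes["ABUR"])
# "XXXX" is a made-up value standing in for an undocumented code in the data.
print(undefined_codes(["ABUR", "NAPL", "XXXX", "ABUR"], codes))
```

Running a check like this before publishing a data package is one way to catch codes that appear in the data but were never defined.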
Editor's Notes
A high-level view of the EML schema, so you know where to look
Describe how to work with XML editors
The deep end… only if you need this to answer questions.
If needed, could replace image, and map columns to EML. ?? to do??
In general use, the term ‘attribute’ defines a property of an object, element or (in computer science) a file. In database vocabulary, a table column is called an ‘attribute’. If data were arranged in rows instead, then a row name could also be called an attribute. In ecology and environmental sciences where data are often arranged in tables or arrays, attributes may be referred to as variables, parameters, columns or field names.
A ‘unit’ is “a particular physical quantity, defined and adopted by convention, with which other particular quantities of the same kind are compared to express their value.” (quoted from eml-docs, find a ref).
There is often a blending or overlap between units and attributes in local laboratory conventions. But on a structural level and for an unambiguous comparison of measurements, the attribute and unit must be distinguished.
Units may be one of the most problematic categories of metadata. For instance, there are many attributes that clearly have no unit, such as named places and letter grades. There are other attributes for which a unit is difficult to identify, despite a suspicion that one should exist (e.g., pH, dates, times). In still other cases, a unit may be meaningful, but apparently absent due to dimensional analysis (e.g., grams of carbon per gram of soil).
Anyone describing a data table will need a moderate understanding of a) how to define the table’s attributes, b) when and how to define a unit, and c) the relationship between the two.
5 basic parts to an attribute:
Name: best practice is to make this match the table header, and to keep the headers clean (ASCII alphanumerics only, please; no wonky characters).
Label and definition: pretty self-explanatory (see next slide).
The measurement scales: a typology. This is where EML differs from other specs. DC has a variable-value model (which is by design uncontrolled, for flexibility). ISO-19115 generally relies on external lists.
EML wanted to encode at least some of the attribute info so that metadata could have some control over data values, and data packages could be self-contained.
Measurement scales range from simple to complex, and build on each other. This measurement scale model comes from statistics, and has been around since the 1940s. It’s not perfect, and there are others. But it works pretty well for the kinds of measurements we have in environmental data.
Nominal: values that can be considered categories. Values are assigned to distinguish them from other observations. Simple strings.
Ordinal: values are categories that have a logical or ordered relationship to one another, but the magnitude of the difference between values is irregular or is not defined. Scores, like high/med/low.
Interval: used for data that consist of equidistant points on a scale, i.e., it is ordinal, but now the magnitude between the steps is known and quantified. This is the first one of the series that is numeric.
Ratio: builds further - now those equidistant points also have a meaningful zero point, which allows ratios between values to have meaning.
The 5th is dateTime. Not part of the original measurement scale model, but essential for environmental data. These are labels for points in time, and adhere to a convention: the Gregorian calendar. Datetimes have characteristics of both the ordinal type (in that they are ordered categories) and interval type (equidistant points on a scale). By making dateTime a separate category and providing a mechanism for describing date formats, datasets contain the information needed to parse date values into their appropriate components (e.g., days, months, years).
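To see how a described date format lets a program parse values into their components, here is a minimal sketch that converts a format string like YYYY-MM-DD into Python `strptime` directives. It rests on assumptions: only the six tokens in the table are handled, fractional seconds (the trailing `.s` in hh:mm:ss.s) are not, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed token subset; other format tokens are left untouched.
TOKENS = {
    "YYYY": "%Y",  # four-digit year
    "MM": "%m",    # month
    "DD": "%d",    # day of month
    "hh": "%H",    # hour (24-hour clock)
    "mm": "%M",    # minute
    "ss": "%S",    # second
}
# A single regex pass avoids re-matching text that was already converted.
_TOKEN_PATTERN = re.compile("|".join(TOKENS))


def to_strptime(format_string):
    """Translate a format string such as 'YYYY-MM-DD' to strptime directives."""
    return _TOKEN_PATTERN.sub(lambda match: TOKENS[match.group()], format_string)


def parse_value(value, format_string):
    """Parse one data value using the format described in the metadata."""
    return datetime.strptime(value, to_strptime(format_string))


stamp = parse_value("2019-07-04 13:45:10", "YYYY-MM-DD hh:mm:ss")
print(stamp.year, stamp.month, stamp.day)
```

The token matching is case-sensitive, which is what keeps month (MM) and minute (mm) apart.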
Unit: assigned only for the two numeric types, interval and ratio. Cannot quite use this for EML 2.2: http://unit.lternet.edu
So having this typology means that values can be typed differently: e.g., the first two are strings, while interval and ratio are numeric.
Here are some examples of the way you would categorize measurements in this typology:
Nominal: values that can be considered categories. Values are assigned to distinguish them from other observations. Examples: using the number 1 for male and 2 for female, a species code or binomial, or the name of the site where the observation was made. Columns that contain strings or simple text are nominal type.
Ordinal: values are categories that have a logical or ordered relationship to one another, but the magnitude of the difference between values is irregular or is not defined. Examples: academic grades: A, B, C, D, F, or ranking quality 1=high, 2=medium, 3=low.
Interval: is used for data which consist of equidistant points on a scale, i.e., it is ordinal but the steps are equidistant. Examples: the Celsius scale is an interval scale, since degrees are equally spaced but there is no natural zero point. Since the ‘0’ of the Celsius scale is tied to a property of water, 20 C is not twice as hot as 10 C. Another example is pH.
Ratio: is used for data which consist of equidistant points that also have a meaningful zero point, which allows ratios between values to have meaning. Examples of a ratio scale include the Kelvin temperature scale (200 K is half as hot as 400 K) and length in meters (e.g., 10 meters is twice as long as 5 meters). Concentrations are of ratio type.
dateTime: A label for a point in time. Not a duration.
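The Celsius and Kelvin examples above can be checked with a little arithmetic. This sketch shows that differences survive the change of scale while ratios do not; the variable names are illustrative.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0

# Dividing Celsius values gives 2.0, but the zero of the Celsius scale is
# arbitrary, so this number says nothing about "twice as hot".
naive_ratio = warm_c / cool_c

# On the Kelvin (ratio) scale the same two temperatures differ by about 3.5%.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree: 10 degrees either way.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(naive_ratio, round(kelvin_ratio, 3), difference_c, round(difference_k, 2))
```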
In EML, nominal and ordinal types can be either free text or code lists, where you define the meaning of the categories. Code lists are used if the incoming data were “controlled”, e.g., were part of a foreign key (FK) constraint in a database, or if you know what values are allowed and you want to keep them “controlled” in the data package.
Even for data that have never been explicitly controlled (i.e., did not come out of a database), listing the allowable codes and their definitions will help a user (or even you) later on. It’s a good way to keep track of what your codes mean.
Ordinal types: (side note: I have seen very few datasets that use ordinal type, but when they do, they have code lists. Almost all my datasets have had at least one nominal attribute with a code list)
In R, things that have code lists are typed as “factor”.
If you have a lot of codes, or they are reused a lot, or if the definition is longer than simple text, you could put the codes and definitions into a separate table (getting into best practices here; talk to Kristin).