This document discusses generating training data for machine learning models from noisy measurements of land cover classifications. It describes a workflow that uses Sentinel-2 satellite imagery and GlobeLand30 land cover labels to train a random forests model for land cover classification. Key points include:
- Sentinel-2 and GlobeLand30 data are used as input, with GlobeLand30 labels filtered and resampled to the Sentinel-2 grid to create reference labels.
- A random forests model is trained separately for each Sentinel-2 scene using stratified samples of pixels.
- Initial results show 88.75% average accuracy across scenes, with some classes (e.g., water) predicted well and others (e.g., wetlands) proving more difficult.
1. Generating Training Data from Noisy Measurements
HAMED ALEMOHAMMAD
LEAD GEOSPATIAL DATA SCIENTIST
2. ML Hub Earth
Machine Learning commons for EO
Training data
Models
Standards and best practices
3. Global Land Cover Training Dataset
Human-verified training dataset
Using open-source Sentinel-2 imagery
10 m spatial resolution
Global and geo-diverse
5. Data
Input Data:
10 Sentinel-2 bands: Red, Green, Blue, Red Edge 1–3, NIR, Narrow NIR, SWIR 1–2
20 m bands scaled to 10 m using bicubic interpolation
Reference/Label Data:
GlobeLand30 labels for 2010 used as a source
Classes mapped to REF Land Cover Taxonomy
Labels re-gridded to Sentinel-2 grid using nearest neighbor
Labels filtered by agreement with classes from Sentinel-2’s 20m scene classification
(produced as part of atmospheric correction)
Filtered labels used as reference labels for training
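The label-preparation steps above can be sketched numerically. This is a toy illustration, not the actual pipeline: the array sizes, class codes, and the use of scipy.ndimage.zoom are assumptions for demonstration; the real workflow operates on full Sentinel-2 and GlobeLand30 rasters.

```python
import numpy as np
from scipy.ndimage import zoom

# Toy 20 m band (4x4) upsampled to the 10 m grid (8x8) with bicubic
# interpolation (order=3), as the slides describe for the 20 m bands.
band_20m = np.arange(16, dtype=float).reshape(4, 4)
band_10m = zoom(band_20m, 2, order=3)

# Toy 30 m labels re-gridded to a 10 m grid with nearest neighbor
# (order=0) so class codes are never blended by interpolation.
labels_30m = np.array([[1, 2], [3, 4]])
labels_10m = zoom(labels_30m, 3, order=0)

# Keep a label only where it agrees with the scene classification from
# Level-2A atmospheric correction; disagreements become 0 (no label).
scene_class_10m = np.full_like(labels_10m, 1)  # hypothetical agreement map
reference = np.where(labels_10m == scene_class_10m, labels_10m, 0)
```

Nearest-neighbor resampling is the key choice for the labels: unlike bicubic, it cannot invent class codes that do not exist in the source raster.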
7. Methodology
A pixel-based supervised Random Forests model trained for each scene.
Pixels without valid reflectance are excluded from training.
Training uses class-stratified samples of half the pixels in a scene, pairing one
10 m Sentinel-2 pixel with each 30 m label pixel.
Predictions are made on all pixels marked with usable classes during Level-2A
processing, including pixels labeled as unclassified.
Annual labels will be generated by aggregating time series of predictions and
probabilities from the same tile throughout the year.
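A minimal per-scene version of the training loop above can be sketched as follows. The use of scikit-learn's RandomForestClassifier and all data shapes are assumptions; the slides do not name an implementation, and the synthetic features stand in for the 10 Sentinel-2 bands.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic "scene": 10 band values per pixel, 3 classes with shifted
# means so they are separable; ~10% of pixels lack valid reflectance.
n = 600
X = rng.normal(size=(n, 10))
y = rng.integers(1, 4, size=n)
X += y[:, None]
valid = rng.random(n) > 0.1

# Class-stratified sample: half of the valid pixels of each class.
train_idx = []
for c in np.unique(y[valid]):
    idx = np.flatnonzero(valid & (y == c))
    train_idx.extend(rng.choice(idx, size=len(idx) // 2, replace=False))
train_idx = np.array(train_idx)

# One model per scene, trained only on the stratified sample.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X[train_idx], y[train_idx])

# Predict on all valid pixels, including those left out of training.
pred = rf.predict(X[valid])
acc = (pred == y[valid]).mean()
```

Stratifying by class keeps rare classes (e.g., wetland) represented in the training sample even when they cover a small fraction of the scene.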
8. Results
88.75% average model accuracy across 4 diverse scenes.
Some classes, like water and snow/ice, predicted with high accuracy and high
confidence across all scenes.
Other classes, like wetland and (semi) natural vegetation, are subtler and were
expected to be more difficult to classify.
Woody vegetation and cultivated vegetation were predicted relatively
accurately and not confused with each other, as a result of including 20 m red
edge bands, resampled to 10 m.
Artificial bare ground tended to be predicted in regions left unclassified in the
reference data, taking over areas of natural bare ground and cultivated
vegetation; this suggests that traces of human activity lead pixels to be
classified as artificial bare ground during the off-vegetation season.
11. What about non-categorical variables?
True values of categorical variables vs. true values of continuous variables:
Crop Yield
Soil Moisture
Temperature
Precipitation
All measurements of continuous variables are prone to uncertainty (noise and
bias).
How to reduce/eliminate these uncertainties in training data?
13. Generating Training Dataset
Triple collocation (TC) is a technique for estimating the unknown error standard
deviations (or RMSEs) of three mutually independent measurement systems,
without treating any one system as zero-error “truth”.
$$Q_{ij} \equiv \mathrm{Cov}(X_i, X_j), \qquad \sigma_{\varepsilon_i}^2 = Q_{ii} - \frac{Q_{ij}\, Q_{ik}}{Q_{jk}}, \quad (i, j, k) \text{ distinct}$$

TC-based RMSE estimates at each pixel are used to compute an a priori probability
$P_i$ of selecting a particular dataset:

$$P_i = \frac{1/\sigma_{\varepsilon_i}^2}{\sum_{j=1}^{3} 1/\sigma_{\varepsilon_j}^2}$$
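As a numerical check of the triple-collocation estimator above, the following sketch builds three synthetic measurement systems as truth plus independent noise (the additive error model TC assumes) and recovers their error variances from sample covariances alone. All data here are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
truth = rng.normal(size=n)
sigmas = np.array([0.2, 0.4, 0.6])        # true (unknown) error std devs
X = truth + sigmas[:, None] * rng.normal(size=(3, n))

Q = np.cov(X)                             # Q[i, j] = Cov(X_i, X_j)

# sigma_eps_i^2 = Q_ii - Q_ij * Q_ik / Q_jk, with (i, j, k) a
# permutation of the three systems; no system is treated as truth.
err_var = np.empty(3)
for i in range(3):
    j, k = [m for m in range(3) if m != i]
    err_var[i] = Q[i, i] - Q[i, j] * Q[i, k] / Q[j, k]

# A priori selection probabilities weighted by inverse error variance:
# the least noisy system is picked most often.
P = (1.0 / err_var) / np.sum(1.0 / err_var)
```

The cross-covariance ratio cancels the variance of the shared signal, leaving only each system's own error variance; this is why the method needs three mutually independent systems rather than two.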