We take a closer look at the generation process of the SDOBenchmark.
For more information about the machine learning dataset "SDOBenchmark", go check out http://i4ds.github.io/SDOBenchmark and https://www.kaggle.com/fhnw-i4ds/sdobenchmark
3. At its core, SDOBenchmark is an image dataset tailored towards Machine Learning
[Example images: AIA 171, AIA 1700, HMI magnetogram]
Can we predict that a large X9 flare will happen within the next 24h?
4. At its core, SDOBenchmark is an image dataset tailored towards Machine Learning
Up to 40 images for a single prediction
5. At its core, SDOBenchmark is an image dataset tailored towards Machine Learning
Input time steps: 12h, 5h, 1.5h and 10min before the 24h prediction period
6. At its core, SDOBenchmark is an image dataset tailored towards Machine Learning
Instruments: AIA, HMI
7. What makes this dataset special?
1. Built for Machine Learners without a solar physics background
2. High accessibility and high scientific quality
3. Specifically engineered to avoid common overfitting issues
4. Public, open source: Website, Kaggle
9. «First competitive model» (TSS 0.34)
Would the model have predicted the large flare of September 2017?
Peak class   Predicted
X2           M5
X9           M5
Result: No. It would have predicted M instead of a strong X.
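The score on this slide can be made concrete with a small sketch. It assumes that TSS stands for the True Skill Statistic (recall minus false alarm rate) computed on a binary flare / no-flare decision. The function name and the counts below are illustrative only; the slides do not state how predictions were thresholded.

```python
def true_skill_statistic(tp: int, fn: int, fp: int, tn: int) -> float:
    """True Skill Statistic: recall minus false alarm rate, in [-1, 1]."""
    recall = tp / (tp + fn)            # fraction of real flares that were caught
    false_alarm_rate = fp / (fp + tn)  # fraction of quiet periods flagged as flares
    return recall - false_alarm_rate


# Hypothetical confusion-matrix counts, not results from the dataset
example_tss = true_skill_statistic(tp=30, fn=20, fp=10, tn=40)
print(f"TSS = {example_tss:.2f}")
```

A score of 0 corresponds to a forecast without skill, 1 to a perfect forecast.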
13. Raw data
3 different sources of raw data:
• SWPC and SSW Latest Events from HEK
-> sample selection
• GOES profiles (X-ray measurements)
-> prediction label (peak_flux), sample selection
• AIA and HMI FITS
-> images
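To relate the peak_flux label to the flare classes used on the slides (X9, M5, ...), here is a minimal sketch based on the standard GOES classification, where each letter is a decade of peak X-ray flux in W/m². The function names are illustrative and not taken from the SDOBenchmark code.

```python
# GOES class letter -> lower bound of the class in W/m^2
CLASS_BASE_FLUX = {"A": 1e-8, "B": 1e-7, "C": 1e-6, "M": 1e-5, "X": 1e-4}


def class_to_flux(label: str) -> float:
    """Convert a label such as 'X9' or 'M5.2' to a peak flux in W/m^2."""
    return CLASS_BASE_FLUX[label[0].upper()] * float(label[1:])


def flux_to_class(peak_flux: float) -> str:
    """Convert a peak flux in W/m^2 to a GOES class label."""
    for letter in ("X", "M", "C", "B"):
        base = CLASS_BASE_FLUX[letter]
        if peak_flux >= base:
            return f"{letter}{peak_flux / base:.1f}"
    # Everything below the B threshold is reported as class A
    return f"A{peak_flux / CLASS_BASE_FLUX['A']:.1f}"


print(flux_to_class(9e-4), flux_to_class(5e-5))
```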
Pipeline: Raw data download -> Sample selection -> Sample creation
16. Raw data: FITS files
Not yet:
Once we have the samples, we can download the FITS raw data.
17. Sample selection: Definition
4 time steps -> 24h prediction period
Inputs at 12h, 5h, 1.5h and 10min -> peak flux?
Constraints
• No overlaps
• Avoid Active Region overfitting
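The sample definition can be sketched as simple time arithmetic. This assumes the four offsets are measured backwards from the start of the prediction period, which is how the editor's notes describe them. All names are illustrative.

```python
from datetime import datetime, timedelta

# Offsets of the four input time steps before the prediction period starts
INPUT_OFFSETS = [
    timedelta(hours=12),
    timedelta(hours=5),
    timedelta(hours=1, minutes=30),
    timedelta(minutes=10),
]
PREDICTION_PERIOD = timedelta(hours=24)


def sample_times(prediction_start: datetime):
    """Return the four input timestamps and the end of the prediction period."""
    input_times = [prediction_start - offset for offset in INPUT_OFFSETS]
    return input_times, prediction_start + PREDICTION_PERIOD


# Illustrative date only
inputs, prediction_end = sample_times(datetime(2017, 9, 5, 12, 0))
for t in inputs:
    print("input image at", t)
print("prediction period ends at", prediction_end)
```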
18. Sample selection: Event Processing
[Timeline figure: 01.09. to 13.09.]
Each Active Region is split into ranges.
Range:
• start time
• end time
• largest flare (if any)
[Figure labels, example ranges: X8 (40h), X1 (48h), X2 (48h), X9 (48h), M1-9]
19. Sample selection: Ranges -> Samples
[Timeline figure: 01.09. to 13.09.]
From each range 1 or 2 samples are created
Sample:
• start time
• end time
• largest flare in 24h (if any)
[Figure labels, example samples: X8 (40h), X2 (24h), X9 (24h), X1 (24h); legend: yellow = sample input, gray = prediction period]
20. Sample selection: Test/Training
Active Regions are split into a test and a training set
Lastly, samples go through plenty of verification and validation.
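A minimal sketch of a split by Active Region follows. Hashing the NOAA number is an assumption made here so that every sample of one region lands in the same set; the slides do not say how the actual split was drawn. The `forced_test` parameter models the special rule that keeps a given region in the test set. All names and numbers are hypothetical.

```python
import hashlib


def assign_split(noaa_number: int, test_fraction: float = 0.2,
                 forced_test: frozenset = frozenset()) -> str:
    """Assign a whole Active Region to either 'test' or 'training'."""
    if noaa_number in forced_test:
        return "test"
    # Deterministic pseudo-random bucket in [0, 1) derived from the region number
    digest = hashlib.sha256(str(noaa_number).encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "training"


# Hypothetical region numbers: every sample of a region shares its split
for region in (11111, 11112, 11113):
    print(region, assign_split(region, forced_test=frozenset({11113})))
```

Because the assignment depends only on the region number, no Active Region can leak from the training set into the test set.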
21. Output creation
Finally, the actual samples are created in three steps:
1. Request FITS data URLs
2. Download FITS raw files
3. Process FITS files to create output images
22. Output creation: FITS download
FITS = Image raw data
AIA and HMI FITS files are downloaded from JSOC with their Python REST client «Drms»
query = f"hmi.Ic_45s[{qt:%Y.%m.%d_%H:%M:%S_TAI}]{magnetogram}"
client.export(query, method="url_quick", protocol="as-is")
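The query string above follows the pattern series[time]{segment}. The sketch below only builds such a string and makes no network request. The helper name, the series "hmi.M_45s" and the segment "magnetogram" are assumptions chosen for illustration; the resulting string would then be passed to client.export as shown on the slide.

```python
from datetime import datetime


def build_record_query(series: str, query_time: datetime, segment: str) -> str:
    """Build a record-set query of the form series[time]{segment}."""
    # Doubled braces in an f-string emit literal { and } around the segment name
    return f"{series}[{query_time:%Y.%m.%d_%H:%M:%S_TAI}]{{{segment}}}"


# Illustrative series, time and segment
query = build_record_query("hmi.M_45s", datetime(2017, 9, 6, 12, 0, 0), "magnetogram")
print(query)
```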
23. Output creation: Processing
For each sample, time step, wavelength:
1. Load the FITS file with Sunpy
2. Run aiaprep / hmiprep:
Rotates, scales and translates the image
3. Find the Active Region center
4. Crop out a square around it
5. Replace NaNs with 0
6. Clip and rescale image values to predefined ranges
(similar to helioviewer.org ranges)
7. Flag images whose FITS raw files are flagged (eclipses, maintenance, etc.)
8. Save resulting JPEG in the sample folder
(8-bit, 256px from a 512px crop)
FITS -> Image
24. Output creation: Processing
For each sample, time step, wavelength:
1. Load the FITS file with Sunpy
current_map = sunpy.map.Map(fits_file)
25. Output creation: Processing
For each sample, time step, wavelength:
2. Run aiaprep / hmiprep:
Rotates, scales and translates the image
if isinstance(current_map, sunpy.map.sources.AIAMap):
    current_map = sunpy.instr.aia.aiaprep(current_map)
else:
    hmi_scale_factor = current_map.scale.axis1 / (0.6 * u.arcsec)
    current_map = current_map.rotate(recenter=True,
                                     scale=hmi_scale_factor.value, missing=0.0)
26. Output creation: Processing
For each sample, time step, wavelength:
3. Find the Active Region center
a) Take the position of the closest Active Region event from HEK
region_position = astropy.coordinates.SkyCoord(
    float(closest_region_event["hpc_x"]) * u.arcsec,
    float(closest_region_event["hpc_y"]) * u.arcsec,
    frame="helioprojective",
    obstime=closest_region_event["starttime"]
)
b) Apply solar rotation to the sample date
region_position_rotated = sunpy.physics.differential_rotation.solar_rotate_coordinate(
    region_position,
    observation_date
)
c) Convert Hpc to pixel coordinates
center_x, center_y = current_map.world_to_pixel(region_position_rotated)
27. Output creation: Processing
For each sample, time step, wavelength:
4. Crop out a square around the center
(3144, 1536)
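The cropping step could look roughly like the NumPy sketch below. Padding the image with zeros, so that regions near the border still yield a full square, is an assumption made here. The function name is illustrative and the center is expected to lie inside the image.

```python
import numpy as np


def crop_square(image: np.ndarray, center_x: float, center_y: float,
                size: int = 512) -> np.ndarray:
    """Cut a size x size square around (center_x, center_y), zero-padded at borders."""
    half = size // 2
    padded = np.pad(image, half, mode="constant", constant_values=0.0)
    # x is the column index and y the row index. After padding by `half`,
    # the window [c - half, c + half) of the original starts at index c.
    row = int(round(center_y))
    col = int(round(center_x))
    return padded[row:row + size, col:col + size]


# Tiny illustrative image instead of a full-resolution solar image
tiny = np.arange(1.0, 17.0).reshape(4, 4)
print(crop_square(tiny, center_x=1, center_y=1, size=2))
```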
28. Output creation: Processing
For each sample, time step, wavelength:
5. Replace NaNs with 0
6. Clip and rescale image values to predefined ranges
(similar to helioviewer.org ranges)
…
},
"171": {
    'dataMin': 5,
    'dataMax': 3500,
    'dataScalingType': 3  # 0 - linear, 1 - sqrt, 3 - log10
},
"193": {
…
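Steps 5 and 6 could be applied with a range entry like the one above roughly as follows. This is a sketch under assumptions: the mapping of the scaled range onto 0..255 and the function name are choices made here, not taken from the SDOBenchmark source. Log scaling requires a positive minimum and sqrt scaling a non-negative one.

```python
import numpy as np


def scale_to_8bit(data, data_min: float, data_max: float, scaling_type: int) -> np.ndarray:
    """Replace NaNs with 0, clip to [data_min, data_max], rescale to 0..255."""
    values = np.asarray(data, dtype=np.float64)
    values = np.where(np.isnan(values), 0.0, values)
    values = np.clip(values, data_min, data_max)
    if scaling_type == 3:    # log10
        transform = np.log10
    elif scaling_type == 1:  # sqrt
        transform = np.sqrt
    else:                    # linear
        def transform(x):
            return x
    low, high = transform(data_min), transform(data_max)
    scaled = (transform(values) - low) / (high - low)
    return np.round(255 * scaled).astype(np.uint8)


# Range values of the "171" entry shown above
pixels = scale_to_8bit([np.nan, 5, 3500, 1e6], data_min=5, data_max=3500, scaling_type=3)
print(pixels)
```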
29. Output creation: Processing
For each sample, time step, wavelength:
7. If FITS file is flagged (eclipses, maintenance, etc.):
Add “flagged” to image meta data
8. Save resulting JPEG in the sample folder
(8-bit, 256px from a 512px crop out)
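Going from the 512px crop to the 256px output could be done in several ways. The sketch below averages 2x2 pixel blocks, which is an assumption; the slides do not name the resampling method. The function name is illustrative and the image sides must be even.

```python
import numpy as np


def downscale_by_two(image: np.ndarray) -> np.ndarray:
    """Halve the resolution of an 8-bit image by averaging 2x2 pixel blocks."""
    height, width = image.shape
    blocks = image.reshape(height // 2, 2, width // 2, 2).astype(np.float64)
    return np.round(blocks.mean(axis=(1, 3))).astype(np.uint8)


# Illustrative empty crop
crop = np.zeros((512, 512), dtype=np.uint8)
print(downscale_by_two(crop).shape)
```

The resulting array could then be written as a JPEG with an imaging library such as Pillow.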
31. Output creation: Result
40 images
(4 time steps, 10 channels)
Sample folder: Noaa_num / RangeStart_SampleNr
Example sample folder name: 2012_01_01_19_06_00_0
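The folder naming can be sketched with the standard library. The timestamp format is inferred from the example name on this slide; the root folder, the NOAA number and the function name are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def sample_directory(root: str, noaa_num: int, range_start: datetime,
                     sample_nr: int) -> PurePosixPath:
    """Build <root>/<Noaa_num>/<RangeStart>_<SampleNr> for one sample."""
    sample_id = f"{range_start:%Y_%m_%d_%H_%M_%S}_{sample_nr}"
    return PurePosixPath(root) / str(noaa_num) / sample_id


# Hypothetical root folder and NOAA number
print(sample_directory("training", 11390, datetime(2012, 1, 1, 19, 6, 0), 0))
```

Grouping samples under one directory per Active Region keeps the number of entries per directory level small.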
32. Want to know more?
• SDOBenchmark website at i4ds.github.io/SDOBenchmark
• GitHub repository at https://github.com/i4Ds/SDOBenchmark,
with publicly available source code and more information about the dataset creation process at
https://github.com/i4Ds/SDOBenchmark/blob/master/STRUCTURE.md
• Kaggle SDOBenchmark dataset at https://www.kaggle.com/fhnw-i4ds/sdobenchmark
Editor's Notes
Solar flares can disrupt the power grids of a continent, shut down the GPS system or irradiate people exposed in space.
The sun is constantly observed in various wavelengths.
Here we see 3 example images of the sun. They show an Active Region in 3 different wavelengths, 24h before an X9 flare (06 September 2017).
4 time steps: 12h, 5h, … before the prediction period.
24h prediction: «Peak X-ray flux to be expected within 24h»
10 images for each time step: 8 from AIA, 2 from HMI
Two instruments on the SDO satellite
(AIA 94, 131, 171, 193, 211, 304, 335, 1700. HMI continuum & magnetogram)
X2 first flare, X9 second flare a day later
A featured dataset at kaggle.com, amongst StackOverflow, X-ray research, Data Science for Good, etc.!
HEK is the «Helio Events Knowledgebase», a catalog containing a large variety of solar events.
For this dataset, we chose to use 4 input time steps in a 12h window for a 24h window prediction period.
AR overfitting: AR shapes change only slowly in time. We try to avoid having Nets just recognize ARs.
For each sample (yellow), the following prediction period (24h, gray) contains the range’s peak value
Samples are split by Active Regions
The X flare of September 2017 has a special rule and will always be in the test set.
FITS data URLs for the sample’s input duration are requested from JSOC
The FITS raw files of a completed request are downloaded
Downloaded FITS files are processed to create output images
For each sample, we download the required FITS raw data files
(The NOAA number directory exists primarily for performance reasons: some file systems struggle with tens of thousands of directories on a single level.)