ESA UNCLASSIFIED - For ESA Official Use Only
Solving Large-scale Data Challenges with ESA Datalabs
Pablo Gómez
Data Science Section, SCI-SAS
24/11/2023
ESA ESAC
Background – Data Science Section
• Part of the Data Science and Archives Division
• Focused on science data exploitation
• Interdisciplinary, working with different missions
“Big” Data – Where we are and what is coming
Euclid First Images
“Big” Data – Where we are and what is coming
Gaia Data Release 3
Importance of archival data – Hubble Space Telescope
HST publications by type
https://archive.stsci.edu/hst/bibliography/pubstat.html
de Marchi & Merín, presented at EAS 2023
Categories shown: Not assigned, Partly Archival, Archival, General Observer
“Big” Data – Where we are and what is coming
ESAC Science Data Center
ESA Datalabs – datalabs.esa.int (in beta mode)
ESA Datalabs main functionality
System/Core
Discovery
Pipelines
Datalabs Catalogue
Example: JWST Data Analysis Tools Notebooks
A Platform Designed to Boost Science Collaboration
Web-Based & Desktop-Based Datalabs
A Platform Designed to Boost Research Productivity
SaaS
PaaS
IaaS
System Development
IT Development
Science Development
You can start HERE!
A Platform Designed to Boost Access to Data
SCI
… …
ESA
Leveraging ESA’s Digital Ecosystem of Platforms
• datalabs.esa.int
• gssc.esa.int
Data Discovery Portal / Volume Catalogue
Computing & Data Colocation – Data Volume Catalog
Datalab & Volume Integration
Pipelines Catalogue
Pipelines: Integrated Development Environment
Common Workflow Language - CWL
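As a rough illustration of what a Common Workflow Language description looks like, the sketch below builds a minimal CWL `CommandLineTool` document in Python and serialises it as JSON (CWL documents can be written in YAML or JSON). The tool simply wraps `echo`; the input name `message` and the helper `make_echo_tool` are illustrative assumptions and are not taken from any ESA Datalabs pipeline.

```python
import json


def make_echo_tool() -> dict:
    """Return a minimal CWL CommandLineTool description as a plain dict.

    Hypothetical example tool: it wraps the shell command `echo` and
    takes a single string input named `message`.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            # The input is passed as the first positional argument.
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # This tool declares no output files.
        "outputs": [],
    }


# Serialise the description; the resulting text could be saved to a file
# and handed to a CWL runner.
cwl_text = json.dumps(make_echo_tool(), indent=2)
print(cwl_text)
```

Describing each step this way keeps the command, its inputs, and its outputs explicit, which is what lets a workflow engine chain steps together and rerun them elsewhere.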
Upcoming in 0.10.0 – Datalabs Marketplace (like App Store)
Recent Events
• Euclid Consortium meeting June 2023
• 200+ new users
• Stress test
• Lots of feedback
• Focus on user experience
• With ESA missions
• Experimental onboarding of external projects
ideas for new use-cases; UI improvements
JWST @ ESA Datalabs: baseline JWST area
JWST area @ ESA Datalabs
• JWST calibration pipeline
• Astroquery (inc. ESA JWST module)
• pyESASky
• JDAVIZ
• astropy
• matplotlib
• ….
Access to JWST NFS volume:
• JWST calibration files
• Example notebooks for eJWST
• Example notebooks from STScI
The ESA Space Science Exploitation Platform
High-level messages
Increase Space Science Operations Efficiency
• SCI data made easily available for researchers to work on
• Reusable for fast implementation of Scientific Processing Pipelines
• Reusable for fast implementation of Scientific Analysis and Visualisation Tools
Enable Collaboration and Open Science
• Share complex processing tools and data with your team
• Share your contributions with the community in SCI's AppStore
Catalogue of interacting galaxies in HST archives
One example use case of ESA Datalabs
Harnessing the Hubble Space Telescope Archives: A Catalogue of 21,926 Interacting Galaxies
O’Ryan et al. 2023, arXiv:2303.00366
➢ Direct access to data (opening a large FITS file takes a few seconds; 100k cutouts are created on the order of minutes)
➢ 92 million cutouts produced (2.5 TB)
➢ Fine-tuned Zoobot on a sample of mergers from CANDELS & COSMOS
➢ Predicted interacting galaxies in HST archives: 21,926 interacting galaxies found with high confidence (p>0.95)
➢ Other gems: strong lenses, protoplanetary disks
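The cutout and thresholding steps above can be sketched in a few lines. This is a minimal pure-Python illustration, not the code used for the catalogue: the functions `cutout` and `select_confident`, the toy image, and the scores are all hypothetical. The last lines do the back-of-the-envelope arithmetic for the mean cutout size, assuming 2.5 TB means 2.5 × 10^12 bytes.

```python
from typing import Dict, List, Sequence


def cutout(image: Sequence[Sequence[float]], row: int, col: int,
           half: int) -> List[List[float]]:
    """Return a square cutout centred on (row, col), clipped at the edges."""
    r0, c0 = max(0, row - half), max(0, col - half)
    return [list(r[c0:col + half + 1]) for r in image[r0:row + half + 1]]


def select_confident(predictions: Dict[str, float],
                     threshold: float = 0.95) -> List[str]:
    """Return IDs with predicted probability above the threshold, best first."""
    keep = [(p, obj_id) for obj_id, p in predictions.items() if p > threshold]
    return [obj_id for _, obj_id in sorted(keep, reverse=True)]


# Toy 5x5 "image" whose pixel value encodes its position.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
centre_patch = cutout(image, row=2, col=2, half=1)

# Hypothetical classifier scores for four cutouts.
scores = {"cutout_a": 0.99, "cutout_b": 0.50, "cutout_c": 0.96, "cutout_d": 0.95}
candidates = select_confident(scores)

# Mean size per cutout implied by the quoted totals (decimal terabytes assumed).
mean_kb = 2.5e12 / 92_000_000 / 1e3
print(centre_patch, candidates, round(mean_kb, 1))
```

With these assumptions the quoted totals correspond to roughly 27 kB per cutout, and the strict `>` comparison means a score of exactly 0.95 is not selected.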
ESA Datalabs for Euclid pilot studies
• Detecting Solar System Objects
• Preserving Low-Surface Brightness
• Detecting Transients
• Cosmology Likelihood for Observables in Euclid
Perspective – A typical ML project
1 - Setup
• Tools & Frameworks
• Local folders etc.
• Getting the data
2 - Data Prep
• I/O
• Data Cleaning
• Data Labeling
3 - Models
• Training
• Inference
• Clustering
• …
[Figure labels: Gaia Data Release 3; Bing]
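The stages listed above (setup, data prep, models) can be sketched end to end on toy data. This is only an illustration using the Python standard library: the data values are invented, and the one-dimensional k-means is a stand-in for whatever model a real project would use.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


def clean(values: Sequence[Optional[float]]) -> List[float]:
    """Data prep: drop missing entries (None or NaN)."""
    return [v for v in values if v is not None and not math.isnan(v)]


def standardise(values: Sequence[float]) -> List[float]:
    """Data prep: rescale to zero mean and unit variance (needs spread > 0)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def kmeans_1d(values: Sequence[float], k: int = 2,
              n_iter: int = 20) -> Tuple[List[float], List[int]]:
    """Model: Lloyd's k-means in 1-D with deterministic quantile starts."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
    return centres, labels


# 1 - Setup / getting the data: invented measurements with two gaps.
raw = [1.0, 1.2, None, 0.9, 8.0, float("nan"), 8.3, 7.9]
# 2 - Data prep.
data = standardise(clean(raw))
# 3 - Models: cluster into two groups.
centres, labels = kmeans_1d(data, k=2)
print(labels)
```

Even in a toy like this, most of the lines deal with setup and data preparation rather than the model itself.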
Perspective – What we can build
Datalabs – Quo vadis?
Anomaly Detection
• Finding interesting things
• Dealing with the flood
Etseneth et al. 2023
Learning with Few Labels
• Get a few labels
• Train a semi-supervised model
Different Downstream Tasks
• Roughly sort unlabeled data
• Find other instances
• Incremental improvements
Standardized ML Data Preprocessing
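Two of the ideas above can be illustrated with deliberately simple stand-ins. The sketch below is not the method of the cited work and not a semi-supervised model. It scores anomalies with a robust z-score (distance from the median in units of the median absolute deviation), and it ranks unlabelled items by distance to the centroid of a few labelled examples as a crude way to "find other instances". All function names and data values are hypothetical.

```python
import math
from statistics import mean, median
from typing import List, Sequence, Tuple


def robust_scores(values: Sequence[float]) -> List[float]:
    """Anomaly score |x - median| / MAD; assumes the MAD is non-zero."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) / mad for v in values]


def rank_by_similarity(labelled: Sequence[Tuple[float, ...]],
                       unlabelled: Sequence[Tuple[float, ...]]) -> List[int]:
    """Return indices of unlabelled items, closest to the labelled centroid first."""
    centroid = [mean(column) for column in zip(*labelled)]
    return sorted(range(len(unlabelled)),
                  key=lambda i: math.dist(unlabelled[i], centroid))


# Anomaly detection: one value clearly stands out from the rest.
fluxes = [10.1, 9.8, 10.0, 10.3, 9.9, 25.0, 10.2]
scores = robust_scores(fluxes)
anomalies = [i for i, s in enumerate(scores) if s > 5.0]

# Few labels: two labelled examples of the class of interest (toy features).
labelled = [(1.0, 1.0), (1.2, 0.8)]
unlabelled = [(5.0, 5.0), (1.1, 1.0), (0.0, 3.0)]
ranking = rank_by_similarity(labelled, unlabelled)
print(anomalies, ranking)
```

The ranking gives a rough ordering of the unlabelled data, which can then be inspected from the top to collect more labels incrementally.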
Thanks!
Questions?
