Julien Peloton, CNRS
Accelerating Astronomical Discoveries with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
21st-century astronomy
How can we get different data?
~1/100,000 of sky
Large but shallow
Hubble FoV
Deep but small
Large Synoptic Survey Telescope
2022-2032: Deep & large survey
Non-profit corporation
Site: Chile (Cerro Pachón)
US-led international collaboration (1000+)
A million-piece puzzle
• LSST will deliver a nearly full sky map every 3 nights
– 3.2-gigapixel camera (car-sized!)
– 15 TB/night of raw image data collected
– 1 TB/night of alerts streamed
We would like to be able to do at scale:
• Exploring large catalogs of data
• Cross-matching large catalogs
• Processing telescope images
• Classifying light-curves
• Processing telescope alerts
• ...
Apache Spark for astronomy?
FITS: astronomical data format
• First (last) release: 1981 (2016).
• Endorsed by NASA and the International Astronomical Union.
• Multi-purpose: vectors, images, tables, ...
• Backward compatible
• Set of blocks. One block: ASCII header + binary data arrays of arbitrary dimension
• Support for C, C++, C#, Fortran, IDL, Java, Julia, MATLAB, Perl, Python, R, and more…
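To make the "ASCII header + binary data" layout concrete, here is a minimal sketch of reading a FITS header in plain Python. It relies on two facts from the FITS standard: headers are made of 80-character records ("cards"), and they are padded to a multiple of 2880 bytes. The helper names (make_header, parse_header) are hypothetical, and the value parsing is deliberately simplified (it would mishandle string values that contain a "/").

```python
BLOCK_SIZE = 2880  # FITS files are organised in 2880-byte blocks
CARD_SIZE = 80     # each header record ("card") is 80 ASCII characters


def make_header(cards):
    """Build a toy header: pad each card to 80 chars, add END, pad to a full block."""
    text = "".join(card.ljust(CARD_SIZE) for card in cards + ["END"])
    padding = (-len(text)) % BLOCK_SIZE
    return (text + " " * padding).encode("ascii")


def parse_header(raw):
    """Return {keyword: value string} for each 'KEYWORD = value' card before END."""
    header = {}
    for start in range(0, len(raw), CARD_SIZE):
        card = raw[start:start + CARD_SIZE].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # comment, history or blank card: no value to read
        # Simplification: drop an optional trailing "/ comment" and string quotes.
        value = card[10:].split("/", 1)[0].strip().strip("'").strip()
        header[keyword] = value
    return header


# Toy header describing a 100 x 200 image of 32-bit floats.
raw = make_header([
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                  -32 / 32-bit floating point",
    "NAXIS   =                    2",
    "NAXIS1  =                  100",
    "NAXIS2  =                  200",
])
header = parse_header(raw)
```

The binary data arrays start right after the padded header, which is why a reader only needs the header to know the type and shape of what follows.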
spark-fits
• FITS data source for Spark SQL and DataFrames.
• Data Source V1 API.
• Images + tables available.
• Schema automatically inferred from the FITS header.
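As an illustration of schema inference from a header, the sketch below maps the TTYPEn/TFORMn keywords of a FITS binary table to Spark SQL type names. The TFORM letter codes come from the FITS standard, but the mapping table, the function name infer_schema and the example column names are assumptions made for this sketch; this is not the actual spark-fits code, and only a subset of TFORM codes is covered.

```python
import re

# Illustrative subset of FITS binary-table TFORM codes -> Spark SQL type names.
# Assumption: this table is for illustration only, not spark-fits' real mapping.
TFORM_TO_SPARK = {
    "L": "boolean",   # logical
    "I": "smallint",  # 16-bit integer
    "J": "int",       # 32-bit integer
    "K": "bigint",    # 64-bit integer
    "E": "float",     # 32-bit float
    "D": "double",    # 64-bit float
    "A": "string",    # characters
}


def infer_schema(header):
    """Return [(column name, Spark SQL type name)] from a parsed table header."""
    schema = []
    for i in range(1, int(header["TFIELDS"]) + 1):
        name = header[f"TTYPE{i}"]
        tform = header[f"TFORM{i}"]
        match = re.fullmatch(r"(\d*)([A-Z])", tform)
        if match is None or match.group(2) not in TFORM_TO_SPARK:
            raise ValueError(f"unsupported TFORM: {tform!r}")
        repeat = int(match.group(1) or 1)
        code = match.group(2)
        base = TFORM_TO_SPARK[code]
        # For "A" the repeat count is the string length; otherwise a repeat
        # count above 1 means the cell holds a vector.
        spark_type = base if code == "A" or repeat == 1 else f"array<{base}>"
        schema.append((name, spark_type))
    return schema


# Hypothetical header of a small catalog table.
table_header = {
    "TFIELDS": "4",
    "TTYPE1": "RA", "TFORM1": "D",
    "TTYPE2": "DEC", "TFORM2": "1D",
    "TTYPE3": "NAME", "TFORM3": "16A",
    "TTYPE4": "FLUX", "TFORM4": "5E",
}
schema = infer_schema(table_header)
```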
spark-fits in practice
• Spark 2.3.1 / Hadoop 2.8.4
• 1.1 billion rows, 153 cores
• Run 100 times (no cache).
• Performance (I/O throughput) comparable to built-in Spark connectors (with no attempt to optimise anything anywhere…)
Current limitations
Some limitations remain, though…
• Need to migrate to Apache Spark DSv2.
• No column pruning, no filters at the level of the connector.
• (De)Compression is not handled yet.
• The Scala FITS library lacks many features.
We live in a 3D world
• Manipulating 2D data with Spark: GeoTrellis, Magellan, GeoSpark, GeoMesa, …
• Very little about 3D!
• Needed in e.g. astronomy, particle physics, meteorology.
Manipulating 3D spatial data: spark3D
• 3D distributed partitioning
– KDTree, Octree, shells, ...
• Distributed spatial queries & data mining
– KNN, join, DBSCAN, …
– Typical usage on millions to billions of rows
• Visualisation
– Client/server architecture
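The idea behind octree-style partitioning can be sketched in a few lines: each 3D point is assigned to one of the eight octants around a centre, and points in the same octant go to the same partition. This is a single-level toy version in plain Python, not the spark3D implementation; the function names and the sample points are hypothetical.

```python
from collections import defaultdict


def octant_index(point, center):
    """Return 0..7: one bit per axis, set when the point is on the upper side."""
    x, y, z = point
    cx, cy, cz = center
    return int(x >= cx) | (int(y >= cy) << 1) | (int(z >= cz) << 2)


def partition_by_octant(points, center):
    """Group points by octant, mimicking one level of an octree partitioner."""
    partitions = defaultdict(list)
    for point in points:
        partitions[octant_index(point, center)].append(point)
    return dict(partitions)


points = [(1, 1, 1), (-1, 1, 1), (-1, -1, -1), (2, 3, 4)]
partitions = partition_by_octant(points, center=(0, 0, 0))
```

A real octree would recurse into crowded octants so that partitions stay balanced; keeping neighbours in the same partition is what makes queries such as KNN cheaper after the initial shuffle.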
Student: Mayur Bhosale (now at Qubole)
On the repartitioning...
Repartitioning is frequent because data arrives unstructured, but
• Repartitioning implies a heavy shuffle between executors.
• Complex UDFs in Spark are often inefficient.
Need for (efficient) streaming
• We explored the static sky - namely what has been observed.
• But what about what is happening right now? E.g.
– Supernovae (star explosion)
– Black hole merger counterparts (multi-messenger astronomy)
– Micro-lensing (extrasolar planet search)
– Earth killers!
– Anomaly detection (unforeseen astronomical sources)
• Correlation past/present/future?
• Timescales range from seconds to months...
Desiderata & solution
We would like
• To work efficiently at scale
• Multi-modal analytics capability (streaming & batch)
• Good integration with the current ecosystem
Solution: Structured Streaming
Introducing Fink
Fink is
• A broker system for sky alerts
• Based on Apache Spark
Fink does
• Collect, enrich & distribute sky alerts
Pipeline stages: 01 Collect, 02 Enrich, 03 Distribute
On a quiet night...
• 10,000 Avro alerts every 30 seconds
• 1 TB of alerts per night
• Parquet database
Figure panels: Observation, Template, Difference (credits: E. Bellm)
Who’s who
Add value to the raw alerts
• Stream-static join
• Classification (BNN)
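A stream-static join can be pictured as follows: every incoming micro-batch of alerts is joined against a catalog that does not change during the stream. The plain-Python sketch below mimics a left outer join on a key; the field names (object_id, magnitude, known_type) and catalog contents are hypothetical, and a real cross-match would typically use sky coordinates rather than an exact identifier.

```python
def enrich_batch(alerts, catalog):
    """Left outer join of one micro-batch of alerts against a static catalog."""
    enriched = []
    for alert in alerts:
        match = catalog.get(alert["object_id"])
        # Keep every alert; unmatched ones get None, as in a left outer join.
        known_type = match["type"] if match else None
        enriched.append({**alert, "known_type": known_type})
    return enriched


# Static side: loaded once, reused for every micro-batch.
catalog = {
    "obj-1": {"type": "variable star"},
    "obj-2": {"type": "galaxy"},
}

# Streaming side: alerts arrive in successive micro-batches.
micro_batches = [
    [{"object_id": "obj-1", "magnitude": 18.2},
     {"object_id": "obj-9", "magnitude": 20.1}],
    [{"object_id": "obj-2", "magnitude": 19.5}],
]

results = [enrich_batch(batch, catalog) for batch in micro_batches]
```

Alerts with no catalog counterpart (here obj-9) are kept rather than dropped, since an unknown object may be exactly the interesting case.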
Diagram labels: Alert stream, Internal catalogs, Structured Streaming, Alert database
Joining external information
Diagram labels: neutrino, gamma-ray, optical, and gravitational-wave alert streams; Structured Streaming join; output
Spark does all the hard work
• Small delays
• Record throughput
• Stream position recovery
But it cannot do everything...
• Large delays
• False positives
Humans are still needed to make decisions
The Hero’s Return
Processing based on Adaptive Learning (PoC)
• Ranking of promising candidates
• Improved classification over time
Diagram labels: New candidates, Follow-up & discovery, Training
Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)
The fear of the shutdown!
What if we miss a night?
• 14 million alerts, 830 GB of data
• Let Spark do the hard work again (offsets, updates...)
Figure annotations: Broker shutdown… Collect alerts (cache), then collect & write: 100 minutes on 3 machines
Limiting factors
• Number of machines
• Network
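The figures quoted on the slides allow a quick back-of-the-envelope check of the catch-up rate. The sketch below assumes decimal units (1 GB = 1000 MB) and reads "100 minutes on 3 machines" as the time needed to process the full missed night (14 million alerts, 830 GB); both are assumptions about how to interpret the slide. The live rate uses the 10,000 alerts every 30 seconds quoted earlier.

```python
# Numbers taken from the slides.
alerts_missed = 14_000_000
data_gb = 830
catchup_minutes = 100
machines = 3
live_alerts_per_batch = 10_000
live_batch_seconds = 30

catchup_seconds = catchup_minutes * 60

# Aggregate and per-machine I/O rate during catch-up (decimal MB assumed).
aggregate_mb_per_s = data_gb * 1000 / catchup_seconds
per_machine_mb_per_s = aggregate_mb_per_s / machines

# Alert rate during catch-up versus the live stream.
catchup_alerts_per_s = alerts_missed / catchup_seconds
live_alerts_per_s = live_alerts_per_batch / live_batch_seconds
speedup = catchup_alerts_per_s / live_alerts_per_s

# Mean alert size implied by the totals, in kB.
mean_alert_kb = data_gb * 1e6 / alerts_missed

print(f"aggregate: {aggregate_mb_per_s:.0f} MB/s, per machine: {per_machine_mb_per_s:.0f} MB/s")
print(f"catch-up: {catchup_alerts_per_s:.0f} alerts/s, live: {live_alerts_per_s:.0f} alerts/s")
print(f"speed-up: {speedup:.1f}x, mean alert size: {mean_alert_kb:.0f} kB")
```

Under these assumptions the catch-up runs at roughly seven times the live alert rate, at under 50 MB/s per machine, which is consistent with the slide naming machine count and network as the limiting factors.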
Some lessons learned
Handling stream offsets
• Manual or not? Still not obvious...
Schema evolution
• User needs change often… Database choice is crucial
Dynamic filtering
• Need to adapt quickly to new situations
Handling watermarks
• How long should we wait for data? Switch to post-processing.
Communication
• Using common communication protocols & data formats...
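The watermark question ("how long shall we wait for data?") can be illustrated with a small plain-Python sketch. A watermark trails the latest event time seen so far by a fixed delay; events older than the watermark are considered too late for the live pipeline and are set aside, for example for post-processing. This per-event version is a simplification (Spark Structured Streaming updates its watermark between micro-batches), and the function name and sample events are hypothetical.

```python
def apply_watermark(events, max_delay):
    """Split (event_time, payload) pairs, given in arrival order, into on-time and late."""
    max_event_time = float("-inf")
    accepted, late = [], []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - max_delay
        if event_time >= watermark:
            accepted.append((event_time, payload))
        else:
            late.append((event_time, payload))  # candidate for post-processing
    return accepted, late


# Event times in seconds; "e" arrives after the stream has moved on to t=20.
events = [(10, "a"), (12, "b"), (11, "c"), (20, "d"), (13, "e"), (19, "f")]
accepted, late = apply_watermark(events, max_delay=5)
```

A longer delay catches more stragglers but keeps more state in memory and delays results, which is the trade-off behind the question on the slide.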
Thanks!
Do you have a public or private project in mind? Do you want to contribute to astronomy?
Come talk to me!
