Accelerating Astronomical Discoveries with Apache Spark

Julien Peloton, CNRS
#UnifiedDataAnalytics #SparkAISummit

How can we get different data?
• Large but shallow
• Deep but small
[Figure labels: Hubble FoV, ~1/100,000 of sky]

Large Synoptic Survey Telescope
2022-2032: Deep & large survey
Non-profit corporation
Site: Chile (Cerro Pachón)
US led, international collaboration (1000+)

Million-piece puzzle
• LSST will deliver a ~full sky map every 3 nights
– 3.2-gigapixel camera (car-sized!)
– 15 TB/night of raw image data collected
– 1 TB/night of alerts streamed

We would like to be able to do at scale:
• Exploring large catalogs of data
• Cross-matching large catalogs
• Processing telescope images
• Classifying light-curves
• Processing telescope alerts
• ...
Apache Spark for astronomy?

FITS: astronomical data format
• First (last) release: 1981 (2016).
• Endorsed by NASA and the International Astronomical Union.
• Multi-purpose: vectors, images, tables, ...
• Backward compatible
• Set of blocks. 1 block: ASCII header + binary data arrays of arbitrary dimension
• Support for C, C++, C#, Fortran, IDL, Java,
Julia, MATLAB, Perl, Python, R, and more…
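A minimal sketch of the "ASCII header" part of a block, in plain Python. It relies on two facts from the FITS standard: header records ("cards") are 80 ASCII characters long, and blocks are padded to 2880 bytes. The helper names (make_header, parse_header) are illustrative only, and value parsing is deliberately simplified (a "/" inside a quoted string value would be mishandled).

```python
BLOCK_SIZE = 2880  # FITS blocks are 2880 bytes long
CARD_SIZE = 80     # each header record ("card") is 80 ASCII characters


def make_header(cards):
    """Build a padded header block from (keyword, value) pairs (toy helper)."""
    text = "".join(f"{key:<8}= {value:>20}".ljust(CARD_SIZE) for key, value in cards)
    text += "END".ljust(CARD_SIZE)
    padding = -len(text) % BLOCK_SIZE  # pad with spaces up to a full block
    return (text + " " * padding).encode("ascii")


def parse_header(raw):
    """Return {keyword: value string} for the cards that precede END."""
    header = {}
    for start in range(0, len(raw), CARD_SIZE):
        card = raw[start:start + CARD_SIZE].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Simplification: drop anything after "/" (the comment separator).
            header[keyword] = card[10:].split("/", 1)[0].strip()
    return header


raw = make_header([("SIMPLE", "T"), ("BITPIX", "-32"), ("NAXIS", "2"),
                   ("NAXIS1", "100"), ("NAXIS2", "200")])
header = parse_header(raw)
print(len(raw), header["BITPIX"], header["NAXIS1"])  # 2880 -32 100
```

The binary data arrays would follow the header, also padded to a multiple of 2880 bytes.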

spark-fits
• FITS data source for Spark SQL and DataFrames.
• Data Source V1 API.
• Images + tables available.
• Schema automatically inferred from the FITS header.
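A hedged sketch of what "schema inferred from the FITS header" can look like for a binary table. This is not the spark-fits implementation: the TFIELDS / TTYPEn / TFORMn keywords and the TFORM letter codes come from the FITS standard, while the mapping to Spark SQL type names and the function name infer_schema are assumptions made for illustration. Only a few common codes are covered.

```python
import re

# FITS binary-table TFORM letter codes mapped to Spark SQL type names.
# The choice of Spark type for each code is an assumption of this sketch.
TFORM_TO_SPARK = {
    "L": "boolean",  # logical
    "I": "short",    # 16-bit integer
    "J": "integer",  # 32-bit integer
    "K": "long",     # 64-bit integer
    "E": "float",    # single precision
    "D": "double",   # double precision
    "A": "string",   # character
}


def infer_schema(header):
    """Return [(column name, type name)] from a dict of header keywords."""
    schema = []
    for i in range(1, int(header["TFIELDS"]) + 1):
        match = re.fullmatch(r"(\d*)([A-Z])", header[f"TFORM{i}"])
        if match is None or match.group(2) not in TFORM_TO_SPARK:
            raise ValueError(f"unsupported TFORM{i}: {header[f'TFORM{i}']}")
        repeat, code = match.groups()
        spark_type = TFORM_TO_SPARK[code]
        # A repeat count > 1 means a vector column (for "A" it is the string width).
        if code != "A" and repeat not in ("", "1"):
            spark_type = f"array<{spark_type}>"
        schema.append((header[f"TTYPE{i}"], spark_type))
    return schema


table_header = {
    "TFIELDS": "3",
    "TTYPE1": "RA", "TFORM1": "D",
    "TTYPE2": "Dec", "TFORM2": "D",
    "TTYPE3": "objectId", "TFORM3": "K",
}
schema = infer_schema(table_header)
print(schema)
```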

spark-fits in practice
• Spark 2.3.1 / Hadoop 2.8.4
• 1.1 billion rows, 153 cores
• Run it 100 times (no cache).
• Performance (I/O throughput) comparable to built-in Spark connectors (no attempt to optimise anything anywhere…)

Current limitations
Some limitations currently though…
• Need to migrate to the Apache Spark Data Source V2 (DSv2) API.
• No column pruning, no filter pushdown at the level of the connector.
• (De)compression is not handled yet.
• The Scala FITS library lacks many features.

We live in a 3D world
• Manipulating 2D data with Spark: GeoTrellis, Magellan, GeoSpark, GeoMesa, …
• Very little about 3D!
• Needed for e.g. astronomy, particle physics, meteorology.

Manipulating 3D spatial data: spark3D
• 3D distributed partitioning
– KDTree, Octree, shells, ...
• Distributed spatial queries & data mining
– KNN, join, DBSCAN, …
– Typical usage on millions to billions of rows
• Visualisation
– Client/server architecture
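To illustrate the idea behind 3D partitioning, here is a toy single-level octree split in plain Python. It is not the spark3D API: the function names and the sample points are made up for this sketch. Each point gets a partition id between 0 and 7 depending on which side of the box centre it falls along x, y and z, so that nearby points tend to land in the same partition.

```python
from collections import defaultdict


def octant_id(point, center):
    """Partition id in [0, 7]: one bit per axis (x -> 1, y -> 2, z -> 4)."""
    x, y, z = point
    cx, cy, cz = center
    return int(x >= cx) + 2 * int(y >= cy) + 4 * int(z >= cz)


def partition(points, center):
    """Group points by octant, as a stand-in for distributing them to executors."""
    parts = defaultdict(list)
    for p in points:
        parts[octant_id(p, center)].append(p)
    return dict(parts)


points = [(0.1, 0.2, 0.3), (0.9, 0.1, 0.2), (0.8, 0.9, 0.7), (0.2, 0.1, 0.4)]
parts = partition(points, center=(0.5, 0.5, 0.5))
for pid, members in sorted(parts.items()):
    print(pid, members)
```

A real octree would split crowded octants recursively; a spatial query such as KNN can then look at one partition and its neighbours instead of the whole data set.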
Student: Mayur Bhosale (now at Qubole)

On the repartitioning...
Repartitioning is frequent because data comes unstructured, but
• Repartitioning implies heavy shuffle between executors.
• Complex UDFs in Spark are often inefficient.

Need for (efficient) streaming
• We explored the static sky - namely what has been observed.
• But what about what is happening right now? E.g.
– Supernovae (star explosion)
– Black hole merger counterparts (multi-messenger astronomy)
– Micro-lensing (extrasolar planet search)
– Earth killers!
– Anomaly detection (unforeseen astronomical sources)
• Correlation past/present/future?
• Timescales range from seconds to months...

Desiderata & solution
We would like
• To work efficiently at scale
• Multi-modal analytics capability (streaming & batch)
• Good integration with the current ecosystem
Solution: Structured Streaming

Introducing Fink
Fink is
• A broker system for sky alerts
• Based on Apache Spark
Fink does
• Collect, enrich & distribute sky alerts
[Figure: pipeline stages Collect, Enrich, Distribute]

On a quiet night...
• 10,000 Avro alerts every 30 seconds
• 1 TB of alerts per night
• Parquet database
[Figure panels: Observation, Template, Difference. Credits: E. Bellm]

Who’s who
Add value to the raw alerts
• Stream-static join
• Classification (BNN)
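The stream-static join can be sketched without Spark: each incoming micro-batch of alerts is joined against a fixed catalog. Everything here is a toy assumption (the object_id key, the field names, the catalog content); a real cross-match would more likely use sky coordinates, and Fink itself does this with Structured Streaming.

```python
# Static side: a small in-memory catalog keyed by a hypothetical object id.
catalog = {"obj1": "variable star", "obj3": "galaxy"}


def enrich(batch, catalog):
    """Left join: keep every alert, attach catalog info when the key matches."""
    return [{**alert, "known_type": catalog.get(alert["object_id"])} for alert in batch]


# Streaming side: alerts arrive in micro-batches.
micro_batches = [
    [{"object_id": "obj1", "magnitude": 18.2}, {"object_id": "obj2", "magnitude": 19.5}],
    [{"object_id": "obj3", "magnitude": 17.1}],
]

enriched = [enrich(batch, catalog) for batch in micro_batches]
for batch in enriched:
    print(batch)
```

Alerts without a catalog match keep known_type set to None, which is what makes them candidates for further classification.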
[Figure: alert stream, internal catalogs, Structured Streaming, alert database]

Joining external information
[Figure: neutrino, gamma ray, optical and gravitational wave alert streams joined with Structured Streaming into one output]
Spark does all the hard work
• Small delays
• Record throughput
• Stream position recovery
But it cannot do everything...
• Large delays
• False positives
Still need humans to take decisions
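Joining several alert streams usually means pairing events that are close in time. The sketch below shows that idea in plain Python for two streams; the event layout, the identifiers and the 10-second tolerance are assumptions for illustration, and the quadratic loop is only acceptable for toy inputs.

```python
def join_within(left, right, max_delay):
    """Pair events whose timestamps (in seconds) differ by at most max_delay."""
    return [
        (l["id"], r["id"])
        for l in left
        for r in right
        if abs(l["time"] - r["time"]) <= max_delay
    ]


optical = [{"id": "opt1", "time": 100.0}, {"id": "opt2", "time": 250.0}]
gravitational = [{"id": "gw1", "time": 103.0}, {"id": "gw2", "time": 900.0}]

matches = join_within(optical, gravitational, max_delay=10.0)
print(matches)  # [('opt1', 'gw1')]
```

Small delays fit inside the tolerance; a counterpart arriving much later falls outside it, which is one reason large delays still need separate handling.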

The Hero’s Return
Processing based on Adaptive Learning (PoC)
• Ranking of promising candidates
• Improved classification over time
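One possible way to rank candidates, shown only as an assumption and not as the method used in the PoC, is uncertainty sampling: candidates whose classifier score is closest to 0.5 are sent for follow-up first, because their labels would teach the classifier the most. The scores below are made up.

```python
def rank_by_uncertainty(scores):
    """Sort candidate ids so that scores closest to 0.5 come first."""
    return sorted(scores, key=lambda cid: abs(scores[cid] - 0.5))


# Hypothetical classifier scores (probability of being an interesting source).
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.40}

ranking = rank_by_uncertainty(scores)
print(ranking)  # ['b', 'd', 'c', 'a']
```

After follow-up, the newly labelled candidates would be added to the training set and the scores recomputed, which is where the improvement over time would come from.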
[Figure: New Candidates, Follow-up & Discovery, Training]
Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)

The fear of the shutdown!
What if we miss a night?
• 14 million alerts, 830 GB of data
• Let Spark do the hard work again (offsets, updates...)
[Figure: broker shutdown, then collect & write: 100 minutes on 3 machines. Collect alerts (cache).]
Limiting factors
• Number of machines
• Network
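A quick back-of-the-envelope calculation with the numbers above, assuming the 100 minutes cover the full night of 14 million alerts (830 GB) and using decimal units (1 GB = 1000 MB):

```python
alerts = 14_000_000
volume_gb = 830
minutes = 100
machines = 3

seconds = minutes * 60                             # 6000 s
alerts_per_second = alerts / seconds               # sustained alert rate
mb_per_second = volume_gb * 1000 / seconds         # total throughput
mb_per_second_per_machine = mb_per_second / machines
kb_per_alert = volume_gb * 1e6 / alerts            # average alert size

print(round(alerts_per_second, 1))          # 2333.3 alerts/s
print(round(mb_per_second, 1))              # 138.3 MB/s in total
print(round(mb_per_second_per_machine, 1))  # 46.1 MB/s per machine
print(round(kb_per_alert, 1))               # 59.3 KB per alert
```

Under these assumptions each machine sustains roughly 46 MB/s, which is consistent with the number of machines and the network being the limiting factors.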

Some lessons learned
Handling stream offsets
• Manual or not? Still not obvious...
Schema evolution
• User needs change often… Database choice is crucial
Dynamic filtering
• Need to adapt quickly to new situations
Handling watermarks
• How long shall we wait for data? Switch to post-processing.
Communication
• Using common communication protocols & data format...
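The watermark question ("how long shall we wait for data?") can be sketched in plain Python. This is a simplified model, not Spark's exact semantics: the watermark is the largest event time seen so far minus an allowed lateness, and events older than the watermark are set aside for post-processing. The event times and the 10-second lateness are made up.

```python
def apply_watermark(events, max_lateness):
    """Split (event_time, payload) pairs, in arrival order, into kept and late."""
    max_event_time = float("-inf")
    kept, late = [], []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - max_lateness
        if event_time >= watermark:
            kept.append(payload)
        else:
            late.append(payload)  # too old for the stream: handle in post-processing
    return kept, late


# Arrival order differs from event-time order: "e" arrives very late.
events = [(100, "a"), (105, "b"), (98, "c"), (120, "d"), (101, "e")]
kept, late = apply_watermark(events, max_lateness=10)
print(kept, late)  # ['a', 'b', 'c', 'd'] ['e']
```

A larger max_lateness keeps more late data in the stream at the cost of holding state longer.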

Thanks!
You have a public/private project in mind? You want to contribute to astronomy?
Come talk to me!
