
Accelerating Astronomical Discoveries with Apache Spark

Our research group is investigating how to leverage Apache Spark (batch, streaming & real-time) to analyse current and future data sets in astronomy. Among the future large experiments, the Large Synoptic Survey Telescope (LSST) will soon start collecting terabytes of data per observation night, and the efficient processing and analysis of both real-time and historical data remains a major challenge. In this talk we will present the main challenges and explore the latest developments tailored for big data problems in astronomy.

On the one hand, we designed a new Data Source API extension to natively manipulate telescope images and astronomical tables within Apache Spark. We then extended the functionality of the Apache Spark SQL module to ease the manipulation of 3D data sets and perform efficient queries: partitioning, data set joins and cross-matches, nearest-neighbor search, spatial queries, and more.
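To make the idea of reading a schema from a FITS header concrete, here is a minimal pure-Python sketch. It relies only on the FITS conventions that headers are made of 80-character ASCII cards padded to 2880-byte blocks, and that binary tables describe their columns with `TTYPEn` and `TFORMn` keywords. The helper names (`make_card`, `build_header`, `parse_header`, `infer_schema`), the toy header, and the readable type names are assumptions for illustration; this is not how spark-fits itself is implemented, and the toy header omits mandatory cards such as `BITPIX` and `NAXIS`.

```python
CARD = 80      # a FITS header card is 80 ASCII characters
BLOCK = 2880   # header units are padded to a multiple of 2880 bytes

# Partial mapping from TFORM type codes to readable type names (names are illustrative).
TFORM_TYPES = {
    "L": "boolean", "B": "byte", "I": "short", "J": "int",
    "K": "long", "E": "float", "D": "double", "A": "string",
}


def make_card(keyword, value):
    """Build one 80-character 'KEYWORD = value' card (hypothetical helper)."""
    if isinstance(value, str):
        field = ("'" + value.ljust(8) + "'").ljust(20)
    else:
        field = str(value).rjust(20)
    return (keyword.ljust(8) + "= " + field).ljust(CARD)


def build_header(cards):
    """Join cards, append END, and pad with spaces to a whole number of blocks."""
    raw = "".join(cards) + "END".ljust(CARD)
    padded_len = -(-len(raw) // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    return raw.ljust(padded_len).encode("ascii")


def parse_header(raw):
    """Read 80-character cards until END and return {keyword: value}."""
    header = {}
    for start in range(0, len(raw), CARD):
        card = raw[start:start + CARD].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank card: no value to read
        text = card[10:].strip()
        if text.startswith("'"):
            # Quoted string value (escaped quotes are ignored in this sketch).
            header[keyword] = text[1:text.index("'", 1)].rstrip()
        else:
            text = text.split("/")[0].strip()  # drop an inline comment
            header[keyword] = int(text) if text.lstrip("-").isdigit() else text
    return header


def infer_schema(header):
    """Pair each TTYPEn column name with the type implied by TFORMn."""
    schema = []
    for n in range(1, header["TFIELDS"] + 1):
        tform = header[f"TFORM{n}"]
        code = tform.lstrip("0123456789")[:1]  # skip the optional repeat count
        schema.append((header[f"TTYPE{n}"], TFORM_TYPES.get(code, "unknown")))
    return schema


# Toy binary-table header with three columns (made-up column names).
cards = [
    make_card("XTENSION", "BINTABLE"),
    make_card("TFIELDS", 3),
    make_card("TTYPE1", "RA"), make_card("TFORM1", "1D"),
    make_card("TTYPE2", "DEC"), make_card("TFORM2", "1D"),
    make_card("TTYPE3", "OBJID"), make_card("TFORM3", "1K"),
]
raw_header = build_header(cards)
schema = infer_schema(parse_header(raw_header))
print(schema)
```

Because the header is plain ASCII at a known position, a reader can learn the column names and types before touching the binary payload, which is what makes automatic schema inference possible for this format.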

On the other hand, we are using the new possibilities offered by the Structured Streaming APIs in recent Apache Spark versions to enable real-time decisions by rapidly accessing and analysing the alerts sent by telescopes every night. Given the unprecedented precision of the next generation of telescopes, the alert streams will contain millions of alerts per night, and relying on Structured Streaming is a guarantee of not missing the latest black hole event in a sea of data! We will also share the active learning developments built on top to improve real-time event selection and classification for the LSST telescope.
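The enrichment step behind such real-time decisions can be pictured as a stream-static join: each incoming alert is matched against a catalog that does not change during the run. The sketch below illustrates only the shape of that operation in plain Python, with small lists standing in for micro-batches. It does not use the Spark API, and every name and value in it (`CATALOG`, `micro_batches`, `enrich`, the alert fields) is made up for illustration. In Structured Streaming the same shape is expressed as a join between a streaming DataFrame and a static one.

```python
# Static side: a small catalog keyed by object identifier (made-up entries).
CATALOG = {
    "obj-1": {"known_type": "variable star"},
    "obj-2": {"known_type": "galaxy"},
}

# Fields attached to an alert that has no counterpart in the catalog.
NO_MATCH = {"known_type": None}


def micro_batches(alerts, batch_size):
    """Group an iterable of alerts into lists of at most batch_size items."""
    batch = []
    for alert in alerts:
        batch.append(alert)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def enrich(batch, catalog):
    """Left join: keep every alert, add catalog fields when the id matches."""
    return [
        {**alert, **catalog.get(alert["object_id"], NO_MATCH)}
        for alert in batch
    ]


# Streaming side: three made-up alerts, processed two at a time.
incoming = [
    {"alert_id": 1, "object_id": "obj-1", "magnitude": 18.2},
    {"alert_id": 2, "object_id": "obj-2", "magnitude": 19.7},
    {"alert_id": 3, "object_id": "obj-9", "magnitude": 17.1},
]

enriched = [
    row
    for batch in micro_batches(incoming, batch_size=2)
    for row in enrich(batch, CATALOG)
]

# Alerts without a catalog counterpart are the ones worth a closer look.
unmatched = [row for row in enriched if row["known_type"] is None]
print(unmatched)
```

A left join is used here so that alerts with no known counterpart are kept rather than dropped, since those are exactly the candidates for previously unseen sources.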

You will walk away with an understanding of modern challenges in astronomy, an appreciation of some beautiful night skies, and a sense of how Apache Spark can help push the frontiers of science further!

Accelerating Astronomical Discoveries with Apache Spark

  1. Julien Peloton, CNRS. Accelerating Astronomical Discoveries with Apache Spark. #UnifiedDataAnalytics #SparkAISummit
  2. 21st-century astronomy
  3. How can we get different data?
      Figure labels: ~1/100,000 of sky; large but shallow; Hubble FoV; deep but small
  4. Large Synoptic Survey Telescope
      • 2022-2032: deep & large survey
      • Non-profit corporation
      • Site: Chile (Cerro Pachón)
      • US-led international collaboration (1000+)
  5. Million-piece puzzle
      • LSST will deliver a ~full sky map every 3 nights
        – 3.2-gigapixel camera (car size!)
        – 15 TB/night of raw image data collected
        – 1 TB/night of alerts streamed
  6. We would like to be able to do at scale:
      • Exploring large catalogs of data
      • Cross-matching large catalogs
      • Processing telescope images
      • Classifying light-curves
      • Processing telescope alerts
      • ...
      Apache Spark for astronomy?
  7. FITS: astronomical data format
      • First (last) release: 1981 (2016).
      • Endorsed by NASA and the International Astronomical Union.
      • Multi-purpose: vectors, images, tables, ...
      • Backward compatible.
      • Set of blocks. 1 block: ASCII header + binary data arrays of arbitrary dimension.
      • Support for C, C++, C#, Fortran, IDL, Java, Julia, MATLAB, Perl, Python, R, and more…
  8. spark-fits
      • FITS data source for Spark SQL and DataFrames.
      • Data Source V1 API.
      • Images + tables available.
      • Schema automatically inferred from the FITS header.
  9. spark-fits in practice
      • Spark 2.3.1 / Hadoop 2.8.4
      • 1.1 billion rows, 153 cores
      • Run it 100 times (no cache).
      • Performance (I/O throughput) comparable to other built-in Spark connectors (no attempt to optimise anything anywhere…)
  10. Current limitations
      Some limitations currently, though…
      • Need to migrate to Apache Spark DSv2.
      • No column pruning, no filters at the level of the connector.
      • (De)compression is not handled yet.
      • Scala FITS library lacks many features.
  11. We live in a 3D world
      • Manipulating 2D data with Spark: Geotrellis, Magellan, Geospark, GeoMesa, …
      • Very little about 3D!
      • Needed in e.g. astronomy, particle physics, meteorology.
  12. Manipulating 3D spatial data: spark3D
      • 3D distributed partitioning
        – KDTree, Octree, shells, ...
      • Distributed spatial queries & data mining
        – KNN, join, dbscan, …
        – Typical usage on million/billion rows
      • Visualisation
        – Client/server architecture
      Student: Mayur Bhosale (now at Qubole)
  13. On the repartitioning...
      Frequent as data comes unstructured, but
      • Repartitioning implies heavy shuffle between executors.
      • Complex UDFs in Spark are often inefficient.
  14. Need for (efficient) streaming
      • We explored the static sky, namely what has been observed.
      • But what about what is happening right now? E.g.
        – Supernovae (star explosion)
        – Black hole merger counterparts (multi-messenger astronomy)
        – Micro-lensing (extrasolar planet search)
        – Earth killers!
        – Anomaly detection (unforeseen astronomical sources)
      • Correlation past/present/future?
      • Timescales range from seconds to months...
  15. Desiderata & solution
      We would like
      • To work efficiently at scale
      • Multi-modal analytics capability (streaming & batch)
      • Good integration with the current ecosystem
      Solution: Structured Streaming
  16. Introducing Fink
      Fink is
      • A broker system for sky alerts
      • Based on Apache Spark
      Fink does
      • Collect, enrich & distribute sky alerts
      Pipeline stages: 01 Collect, 02 Enrich, 03 Distribute
  17. On a quiet night...
      • 10,000 Avro alerts every 30 seconds
      • 1 TB of alerts per night
      • Parquet Database
      Figure labels: Observation, Template, Difference (credits: E. Bellm)
  18. Who’s who
      Add values to the raw alerts
      • Stream-static join
      • Classification (BNN)
      Diagram labels: Alert stream, Internal catalogs, Structured Streaming, Alert database
  19. Joining external information
      Diagram labels: Neutrino alert stream, Gamma ray alert stream, Optical alert stream, Gravitational wave alert stream, Structured Streaming, Join output
      Spark does all the hard work
      • Small delays
      • Record throughput
      • Stream position recovery
      But it cannot do everything...
      • Large delays
      • False positives
      Still need humans to take decisions
  20. The Hero’s Return
      Processing based on Adaptive Learning (PoC)
      • Ranking of promising candidates
      • Improved classification over time
      Diagram labels: New Candidates, Follow-up & Discovery, Training
      Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)
  21. The fear of the shutdown!
      What if we miss a night?
      • 14 million alerts, 830 GB of data
      • Let Spark do the hard work again (offsets, updates...)
      Diagram labels: Broker shutdown…; Collect alerts (cache); Collect & write: 100 minutes on 3 machines
      Limiting factors
      • Number of machines
      • Network
  22. Some lessons learned
      Handling stream offsets
      • Manual or not? Still not obvious...
      Schema evolution
      • User needs change often… Database choice is crucial
      Dynamic filtering
      • Need to adapt quickly to new situations
      Handling watermarks
      • How long shall we wait for data? Switch to post-processing.
      Communication
      • Using common communication protocols & data format...
  23. Thanks! You have a public/private project in mind? You want to contribute to astronomy? Come talk to me!
  24. DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
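The figures quoted on slides 17 and 21 allow a quick back-of-the-envelope check of what the catch-up run implies. The sketch below only rearranges those numbers: 10,000 alerts every 30 seconds for the nominal stream, and 14 million alerts totalling 830 GB reprocessed in 100 minutes on 3 machines. Treating 1 GB as 10^9 bytes and assuming the load is spread evenly across the machines are assumptions made here, not statements from the slides.

```python
# Figures quoted on the slides.
NOMINAL_ALERTS_PER_WINDOW = 10_000
WINDOW_SECONDS = 30
MISSED_ALERTS = 14_000_000
MISSED_BYTES = 830e9            # assumption: 1 GB = 10**9 bytes
CATCH_UP_SECONDS = 100 * 60     # 100 minutes
MACHINES = 3

# Alert rates, in alerts per second.
nominal_rate = NOMINAL_ALERTS_PER_WINDOW / WINDOW_SECONDS
catch_up_rate = MISSED_ALERTS / CATCH_UP_SECONDS

# How much faster than the nominal stream the catch-up run has to go.
speed_up = catch_up_rate / nominal_rate

# Data throughput, assuming the work is evenly spread over the machines.
aggregate_mb_per_s = MISSED_BYTES / CATCH_UP_SECONDS / 1e6
per_machine_mb_per_s = aggregate_mb_per_s / MACHINES

# Mean size of one alert, in kilobytes.
mean_alert_kb = MISSED_BYTES / MISSED_ALERTS / 1e3

print(f"nominal rate:      {nominal_rate:.0f} alerts/s")
print(f"catch-up rate:     {catch_up_rate:.0f} alerts/s")
print(f"speed-up factor:   {speed_up:.1f}")
print(f"aggregate I/O:     {aggregate_mb_per_s:.0f} MB/s")
print(f"per-machine I/O:   {per_machine_mb_per_s:.0f} MB/s")
print(f"mean alert size:   {mean_alert_kb:.0f} kB")
```

Under these assumptions the catch-up run processes alerts about seven times faster than the nominal stream, at roughly 46 MB/s per machine, which fits the slide's remark that the number of machines and the network are the limiting factors.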