SlideShare a Scribd company logo
Julien Peloton, CNRS
Accelerating
Astronomical Discoveries
with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
1
XXIst century astronomy
2
How we can get different data?
3
~1/100,000 of sky
Large
butshallow
Hubble FoV
D
eep
butsm
all
#UnifiedDataAnalytics #SparkAISummit
Large Synoptic Survey Telescope
2022-2032: Deep & large survey
Non-profit corporation
Site: Chile (Cerro Pachón)
US led, international
collaboration (1000+)
4#UnifiedDataAnalytics #SparkAISummit
Million pieces puzzle
• LSST will deliver ~full sky map every 3 nights
– 3.2 Gigapixels camera (car size!)
– 15 TB/night of raw image data collected
– 1 TB/night of alerts streamed
5#UnifiedDataAnalytics #SparkAISummit
?
We would like to be able to do at scale:
• Exploring large catalogs of data
• Cross-matching large catalogs
• Processing telescope images
• Classifying light-curves
• Processing telescope alerts
• ...
Apache Spark for astronomy?
6#UnifiedDataAnalytics #SparkAISummit
FITS: astronomical data format
• First (last) release: 1981 (2016).
• Endorsed by NASA and the International
Astronomical Union.
• Multi-purposes: vectors, images, tables, ...
• Backward compatible
• Set of blocks.1 block: ASCII header+binary
data arrays of arbitrary dimension
• Support for C, C++, C#, Fortran, IDL, Java,
Julia, MATLAB, Perl, Python, R, and more…
7#UnifiedDataAnalytics #SparkAISummit
spark-fits
• FITS data source for Spark SQL and DataFrames.
• Data Source V1 API.
• Images + tables available.
• Schema automatically inferred from the FITS header.
8#UnifiedDataAnalytics #SparkAISummit
spark-fits in practice
• Spark 2.3.1 / Hadoop 2.8.4
• 1.1 billion rows, 153 cores
• Run it 100 times (no cache).
• Performances (IO throughput)
comparable to other built-in
Spark connectors (no attempt
to optimise anything
anywhere…)
9#UnifiedDataAnalytics #SparkAISummit
Current limitations
Some limitations currently though…
• Need to migrate to Apache Spark DSv2.
• No column pruning, no filters at the level of the connector.
• (De)Compression is not handled yet.
• Scala FITS library lacks of many features.
10#UnifiedDataAnalytics #SparkAISummit
We live in a 3D world
• Manipulating 2D data with Spark:
Geotrellis, Magellan, Geospark,
GeoMesa, …
• Very little about 3D!
• Need for e.g. astronomy, particle
physics, meteorology.
11#UnifiedDataAnalytics #SparkAISummit
Manipulating 3D spatial data: spark3D
• 3D distributed partitioning
– KDTree, Octree, shells, ...
• Distributed spatial queries & data mining
– KNN, join, dbscan, …
– Typical usage on million/billion rows
• Visualisation
– Client/server architecture
12
Student:
Mayur
Bhosale (now at Qubole)
#UnifiedDataAnalytics #SparkAISummit
On the repartitioning...
Frequent as data comes unstructured, but
• Repartitioning implies heavy shuffle
between executors.
• Complex UDF in Spark are often
inefficient.
13#UnifiedDataAnalytics #SparkAISummit
Need for (efficient) streaming
• We explored the static sky - namely what has been observed.
• But what about what is happening right now? E.g.
– Supernovae (star explosion)
– Black hole merger counterparts (multi-messenger astronomy)
– Micro-lensing (extrasolar planet search)
– Earth killers!
– Anomaly detection (unforeseen astronomical sources)
• Correlation past/present/future?
• Timescales range from seconds to months...
14#UnifiedDataAnalytics #SparkAISummit
Desiderata & solution
We would like
• To work efficiently at scale
• Multi-modals analytics
capability (streaming & batch)
• Good integration with the
current ecosystem
15
Structured
Streaming
#UnifiedDataAnalytics #SparkAISummit
Introducing Fink
Fink is
• A broker system for sky alerts
• Based on Apache Spark
Fink does
• Collect, enrich & distribute sky
alerts
16#UnifiedDataAnalytics #SparkAISummit
03
01
02
Distribute
Enrich
Collect
On a quiet night...
17#UnifiedDataAnalytics #SparkAISummit
• 10,000 Avro alerts every 30 seconds
• 1TB alerts per night
• Parquet Database
Observation
Template
Difference
Credits: E. Bellm
03
01
Distribute
Enrich
Collect
02
Who’s who
18#UnifiedDataAnalytics #SparkAISummit
Add values to the raw alerts
• Stream-static join
• Classification (BNN)
Structured
Streaming
Alert
stream
Internal
catalogs
Alert
database
03
01
02
Distribute
Enrich
Collect
Alert database
Alert database
Structured
Streaming
Joining external information
19#UnifiedDataAnalytics #SparkAISummit
Structured
Streaming
Neutrino
alert stream
Gamma ray
alert stream
Optical
alert stream
Gravitational
wave
alert stream
Join
output
03
01
02
Distribute
Enrich
Collect
Spark does all the hard work
• Small delays
• Record throughput
• Stream position recovery
But it cannot do everything...
• Large delays
• False positives
Still need humans to take decisions
The Hero’s Return
Processing based on Adaptive Learning (PoC)
• Ranking of promising candidates
• Improved classification over time
20
New Candidates
Follow-up &
DiscoveryTraining
Streaming infrastructure by:
Abhishek Chauhan
(now at Morgan Stanley)
03
01
02
Distribute
Enrich
Collect
The fear of the shutdown!
What if we miss a
night?
• 14 million alerts, 830
GB of data
• Let Spark do the
hard work again
(offsets, updates...)
21
Broker shutdown…. Collect & write
100 minutes on 3 machines
Collect alerts
(cache)
Limiting factors
• Number of machines
• Network
Some lessons learned
Handling stream offsets
• Manual or not? Still not obvious...
Schema evolution
• User needs change often… Database choice is crucial
Dynamic filtering
• Need to adapt quickly to new situations
Handling watermarks
• How long shall we wait for data? Switch to post-processing.
Communication
• Using common communication protocols & data format...
22#UnifiedDataAnalytics #SparkAISummit
Thanks!
You have a public/private project in
mind? You want to contribute to
astronomy?
Come talk to me!
23#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
24

More Related Content

What's hot

Detecting solar farms with deep learning
Detecting solar farms with deep learningDetecting solar farms with deep learning
Detecting solar farms with deep learning
Jason Brown
 
Fundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication CompetitionFundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication Competition
MarcoFasanella
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
Igor Sfiligoi
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of Grapes
Derek Groen
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud run
Igor Sfiligoi
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
Deltares
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
Igor Sfiligoi
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster package
Alberto Labarga
 
NASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud ComputingNASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud Computing
Software & Information Industry Association
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DC
CCRinc
 
Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data
Eirini Ntoutsi
 
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and DisplacementTraining and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Emmanuel Mathot
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud Clustering
Hannaneh Najdataei
 
13.00 o7 j adams
13.00 o7 j adams13.00 o7 j adams
13.00 o7 j adams
NZIP
 
Using Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite ImageryUsing Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite Imagery
Astraea, Inc.
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
Mario Juric
 
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman PrasadGeospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
datascienceiqss
 
ExoNetLoopFinalShort
ExoNetLoopFinalShortExoNetLoopFinalShort
ExoNetLoopFinalShort
Greg McNeill
 
Power Point Project
Power Point ProjectPower Point Project
Power Point Project
Haganl
 

What's hot (19)

Detecting solar farms with deep learning
Detecting solar farms with deep learningDetecting solar farms with deep learning
Detecting solar farms with deep learning
 
Fundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication CompetitionFundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication Competition
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of Grapes
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud run
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster package
 
NASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud ComputingNASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud Computing
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DC
 
Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data
 
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and DisplacementTraining and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud Clustering
 
13.00 o7 j adams
13.00 o7 j adams13.00 o7 j adams
13.00 o7 j adams
 
Using Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite ImageryUsing Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite Imagery
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
 
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman PrasadGeospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
 
ExoNetLoopFinalShort
ExoNetLoopFinalShortExoNetLoopFinalShort
ExoNetLoopFinalShort
 
Power Point Project
Power Point ProjectPower Point Project
Power Point Project
 

Similar to Accelerating Astronomical Discoveries with Apache Spark

Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
InfluxData
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"
Frank Wuerthwein
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Big Data Spain
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
Databricks
 
Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19
ExtremeEarth
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Larry Smarr
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Frank Wuerthwein
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the Cloud
Frank Wuerthwein
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 
Emc 2013 Big Data in Astronomy
Emc 2013 Big Data in AstronomyEmc 2013 Big Data in Astronomy
Emc 2013 Big Data in Astronomy
Fabio Porto
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana University
inside-BigData.com
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
Larry Smarr
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
LucaCinquini
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
Larry Smarr
 
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
inside-BigData.com
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
Jérôme Kehrli
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
Robert Grossman
 

Similar to Accelerating Astronomical Discoveries with Apache Spark (20)

Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
 
Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the Cloud
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 
Emc 2013 Big Data in Astronomy
Emc 2013 Big Data in AstronomyEmc 2013 Big Data in Astronomy
Emc 2013 Big Data in Astronomy
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana University
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
 
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 

Recently uploaded (20)

一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 

Accelerating Astronomical Discoveries with Apache Spark

  • 1. Julien Peloton, CNRS Accelerating Astronomical Discoveries with Apache Spark #UnifiedDataAnalytics #SparkAISummit 1
  • 3. How we can get different data? 3 ~1/100,000 of sky Large butshallow Hubble FoV D eep butsm all #UnifiedDataAnalytics #SparkAISummit
  • 4. Large Synoptic Survey Telescope 2022-2032: Deep & large survey Non-profit corporation Site: Chile (Cerro Pachón) US led, international collaboration (1000+) 4#UnifiedDataAnalytics #SparkAISummit
  • 5. Million pieces puzzle • LSST will deliver ~full sky map every 3 nights – 3.2 Gigapixels camera (car size!) – 15 TB/night of raw image data collected – 1 TB/night of alerts streamed 5#UnifiedDataAnalytics #SparkAISummit ?
  • 6. We would like to be able to do at scale: • Exploring large catalogs of data • Cross-matching large catalogs • Processing telescope images • Classifying light-curves • Processing telescope alerts • ... Apache Spark for astronomy? 6#UnifiedDataAnalytics #SparkAISummit
  • 7. FITS: astronomical data format • First (last) release: 1981 (2016). • Endorsed by NASA and the International Astronomical Union. • Multi-purposes: vectors, images, tables, ... • Backward compatible • Set of blocks.1 block: ASCII header+binary data arrays of arbitrary dimension • Support for C, C++, C#, Fortran, IDL, Java, Julia, MATLAB, Perl, Python, R, and more… 7#UnifiedDataAnalytics #SparkAISummit
  • 8. spark-fits • FITS data source for Spark SQL and DataFrames. • Data Source V1 API. • Images + tables available. • Schema automatically inferred from the FITS header. 8#UnifiedDataAnalytics #SparkAISummit
  • 9. spark-fits in practice • Spark 2.3.1 / Hadoop 2.8.4 • 1.1 billion rows, 153 cores • Run it 100 times (no cache). • Performances (IO throughput) comparable to other built-in Spark connectors (no attempt to optimise anything anywhere…) 9#UnifiedDataAnalytics #SparkAISummit
  • 10. Current limitations Some limitations currently though… • Need to migrate to Apache Spark DSv2. • No column pruning, no filters at the level of the connector. • (De)Compression is not handled yet. • Scala FITS library lacks of many features. 10#UnifiedDataAnalytics #SparkAISummit
  • 11. We live in a 3D world • Manipulating 2D data with Spark: Geotrellis, Magellan, Geospark, GeoMesa, … • Very little about 3D! • Need for e.g. astronomy, particle physics, meteorology. 11#UnifiedDataAnalytics #SparkAISummit
  • 12. Manipulating 3D spatial data: spark3D • 3D distributed partitioning – KDTree, Octree, shells, ... • Distributed spatial queries & data mining – KNN, join, dbscan, … – Typical usage on million/billion rows • Visualisation – Client/server architecture 12 Student: Mayur Bhosale (now at Qubole) #UnifiedDataAnalytics #SparkAISummit
  • 13. On the repartitioning... Frequent as data comes unstructured, but • Repartitioning implies heavy shuffle between executors. • Complex UDF in Spark are often inefficient. 13#UnifiedDataAnalytics #SparkAISummit
  • 14. Need for (efficient) streaming • We explored the static sky - namely what has been observed. • But what about what is happening right now? E.g. – Supernovae (star explosion) – Black hole merger counterparts (multi-messenger astronomy) – Micro-lensing (extrasolar planet search) – Earth killers! – Anomaly detection (unforeseen astronomical sources) • Correlation past/present/future? • Timescales range from seconds to months... 14#UnifiedDataAnalytics #SparkAISummit
  • 15. Desiderata & solution We would like • To work efficiently at scale • Multi-modals analytics capability (streaming & batch) • Good integration with the current ecosystem 15 Structured Streaming #UnifiedDataAnalytics #SparkAISummit
  • 16. Introducing Fink Fink is • A broker system for sky alerts • Based on Apache Spark Fink does • Collect, enrich & distribute sky alerts 16#UnifiedDataAnalytics #SparkAISummit 03 01 02 Distribute Enrich Collect
  • 17. On a quiet night... 17#UnifiedDataAnalytics #SparkAISummit • 10,000 Avro alerts every 30 seconds • 1TB alerts per night • Parquet Database Observation Template Difference Credits: E. Bellm 03 01 Distribute Enrich Collect 02
  • 18. Who’s who 18#UnifiedDataAnalytics #SparkAISummit Add values to the raw alerts • Stream-static join • Classification (BNN) Structured Streaming Alert stream Internal catalogs Alert database 03 01 02 Distribute Enrich Collect Alert database Alert database Structured Streaming
  • 19. Joining external information 19#UnifiedDataAnalytics #SparkAISummit Structured Streaming Neutrino alert stream Gamma ray alert stream Optical alert stream Gravitational wave alert stream Join output 03 01 02 Distribute Enrich Collect Spark does all the hard work • Small delays • Record throughput • Stream position recovery But it cannot do everything... • Large delays • False positives Still need humans to take decisions
  • 20. The Hero’s Return Processing based on Adaptive Learning (PoC) • Ranking of promising candidates • Improved classification over time 20 New Candidates Follow-up & DiscoveryTraining Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley) 03 01 02 Distribute Enrich Collect
  • 21. The fear of the shutdown! What if we miss a night? • 14 million alerts, 830 GB of data • Let Spark do the hard work again (offsets, updates...) 21 Broker shutdown…. Collect & write 100 minutes on 3 machines Collect alerts (cache) Limiting factors • Number of machines • Network
  • 22. Some lessons learned Handling stream offsets • Manual or not? Still not obvious... Schema evolution • User needs change often… Database choice is crucial Dynamic filtering • Need to adapt quickly to new situations Handling watermarks • How long shall we wait for data? Switch to post-processing. Communication • Using common communication protocols & data format... 22#UnifiedDataAnalytics #SparkAISummit
  • 23. Thanks! You have a public/private project in mind? You want to contribute to astronomy? Come talk to me! 23#UnifiedDataAnalytics #SparkAISummit
  • 24. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT 24