Petar Zečević, SV Group, University of Zagreb
Mario Jurić, DIRAC Institute, University of Washington
AXS - Astronomical Data Processing on the LSST Scale with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
About us
Mario Jurić
• Prof. of Astronomy at the University of Washington
• Founding faculty of DIRAC & eScience Institute Fellow
• Former lead of LSST Data Management
Petar Zečević
• CTO at SV Group, Croatia
• CS PhD student at University of Zagreb
• Visiting Fellow at DiRAC institute @ UW
• Author of “Spark in Action”
Context: The Large Survey Revolution in Astronomy
Hipparchus of Rhodes (180-125 BC)
In 129 BC, constructed one of the first star
catalogs, containing about 850 stars.
Galileo Galilei (1564-1642)
Researched a variety of topics in physics, but is called out here for
introducing the Galilean telescope.
For the first time, Galileo’s telescope allowed us to zoom in on the
cosmos and study individual objects in great detail.
The Astrophysics Two-Step
• Surveys
– Construct catalogs and maps of objects in the sky. Focus on coarse
classification and discovering targets for further follow-up.
• Large telescopes
– Acquire detailed observations of a few representative objects.
Understand the details of astrophysical processes that govern them,
and extrapolate that understanding to the entire class.
The Story of Astronomy: 2000 Years of Being Data Poor
Sloan Digital Sky Survey
• 2.5 m telescope
• >14,500 deg² of sky
• 0.1” astrometry
• r < 22.5 flux limit
• 5-band, 1% photometry for over 900M stars
• Over 3M R=2000 spectra
• 10 years of operations: ~10 TB of imaging
• 1,231,051,050 rows (SDSS DR10, PhotoObjAll table), ~500 columns
Facilitated the development of large databases, data-driven discovery, and
the move towards what we recognize as Data Science today.
Panoramic Survey Telescope and Rapid Response System
• 1.8 m telescope
• 30,000 deg² of sky
• 50 mas astrometry
• r < 23 flux limit
• 5-band, better than 1% photometry (goal)
• ~700 GB/night
https://sci.esa.int/s/wV6oG5w
Gaia DR2: 1.7 billion stars
The Large Synoptic Survey Telescope
A Public, Deep, Wide and Fast, Optical Sky Survey
• First light: 2020; operations: 2022
• Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds)
• Largest astronomical camera in the world
• Will repeatedly observe the night sky over 10 years
• 10 million alerts each night (60-second latency)
• 37 billion astronomical sources, with time series
• 30 trillion measurements
Overview
LSST’s mission is to build a well-understood system that
provides a vast astronomical dataset for unprecedented
discovery of the deep and dynamic universe.
The Scale of Things to Come
Metric                       Amount
Number of detections         7 trillion rows
Number of objects            37 billion rows
Nightly alert rate           10 million
Nightly data rate            >15 TB
Alert latency                60 seconds
Total images after 10 yrs    50 PB
Total data after 10 yrs      83 PB
Objects detected, measured, and stored in queryable catalogs (tables)
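A rough sanity check of the figures in the table can be sketched in a few lines of Python. The detection, object, and nightly data numbers come from the table; the assumption of observing on all 365 nights of every year is ours and overstates real uptime, so the result is only an order-of-magnitude comparison with the quoted 50 PB of images.

```python
# Figures quoted in the table.
detections = 7e12        # rows
objects = 37e9           # rows
nightly_data_tb = 15     # TB per night (the table gives this as a lower bound)
survey_years = 10

# Average number of detections stored per catalogued object.
detections_per_object = detections / objects

# Assumption (not from the slides): data is taken every night of the year.
NIGHTS_PER_YEAR = 365
raw_volume_pb = nightly_data_tb * NIGHTS_PER_YEAR * survey_years / 1000  # 1 PB = 1000 TB

print(f"about {detections_per_object:.0f} detections per object")
print(f"about {raw_volume_pb:.1f} PB of nightly data over {survey_years} years")
```

Under these assumptions each object has a time series of a couple of hundred detections, and the accumulated nightly data lands in the same range as the quoted image volume.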
Catalog-driven Science
• Once a catalog is available, astronomers “ask” all kinds of questions
• The traditional paradigm:
– Subset (filter data using a catalog SQL interface online)
– Download data locally
– Analyze (usually Python)
Challenges (part 0)
Dataset Size
(keeping ~PBs of data in RDBMSes is not easy, or cheap)
What do you do when the dataset subset is a few ~TBs?
Challenges (part 1)
Dataset Size
(keeping ~TBs of data in RDBMSes is not easy)
Better Together
(joining datasets is powerful)
I Want it All
(interesting science with whole dataset operations)
Challenges (part 2)
Scalability
(how do I write analysis code that will scale to petabytes of data?)
Resources
(where are the resources to run this code?)
How do you scale exploratory data analysis to ~PB-sized datasets
and thousands of simultaneous users?
Enter Spark, AXS
• AXS: Astronomy eXtensions for Spark
• The main idea:
– Spark is a proven, scalable, cloud-ready and widely supported analytics
framework with full SQL support (useful for legacy code).
– Extend it to exploratory data analysis.
– Add a scalable positional cross-match operator
– Add a domain-specific Python API layer to PySpark
– Couple to S3 API for storage, Kubernetes for orchestration…
• … A scalable platform supporting an arbitrarily sized dataset and a
large number of users, deployable on either public or private cloud.
Key Issue: Scalable Cross-matching
DEC and RA coordinates
Search perimeter
(can also use similarity)
A match
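The match test pictured on this slide can be sketched in plain Python: two catalog entries match when their angular separation on the sky is within the search radius. The function names and the 1-arcsecond default radius below are illustrative assumptions, not part of AXS.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions
    given in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above 1 for antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def is_match(obj_a, obj_b, radius_arcsec=1.0):
    """True if two (ra, dec) tuples lie within radius_arcsec of each other.
    The default radius is a hypothetical choice."""
    return angular_separation_deg(*obj_a, *obj_b) * 3600.0 <= radius_arcsec

print(is_match((10.0, 20.0), (10.0, 20.0 + 0.5 / 3600)))  # half an arcsecond apart
print(is_match((10.0, 20.0), (10.0, 20.0 + 2.0 / 3600)))  # two arcseconds apart
```

Running this test on every pair of rows from two billion-row catalogs is infeasible, which is why the partitioning scheme matters.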
AXS data partitioning
• Data partitioning is at the root of AXS' efficient cross-matching
• Based on the late Jim Gray's “zones algorithm” (Microsoft Research)
• Sky divided into horizontal “zones” of a certain height
• Adapted for distributed architectures
• Data stored in Parquet files
– bucketed by zone
– sorted by zone and ra columns
– data from zone borders duplicated to the zone below
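The zone assignment described above can be sketched in plain Python. The zone height, the margin, and the function names are hypothetical, and the choice of duplicating rows that lie close to a zone's lower border into the zone below is our reading of the last bullet, not a statement of how AXS implements it.

```python
import math

ZONE_HEIGHT_DEG = 1.0 / 60.0  # hypothetical zone height (one arcminute)

def zone_of(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Index of the horizontal strip that contains a declination in [-90, 90]."""
    return int(math.floor((dec_deg + 90.0) / zone_height))

def zones_for_row(dec_deg, margin_deg, zone_height=ZONE_HEIGHT_DEG):
    """Zones in which a row is stored: its home zone, plus the zone below
    when the row lies within margin_deg of the home zone's lower border."""
    home = zone_of(dec_deg, zone_height)
    zones = [home]
    lower_border = home * zone_height - 90.0
    if home > 0 and dec_deg - lower_border < margin_deg:
        zones.append(home - 1)
    return zones

# With 1-degree zones and a 0.1-degree margin:
print(zones_for_row(10.5, 0.1, zone_height=1.0))   # well inside its zone
print(zones_for_row(10.01, 0.1, zone_height=1.0))  # near the lower border
```

With the border rows duplicated, every candidate match for a row can be found inside a single zone, so a join on equal zone numbers never has to look at neighbouring buckets.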
AXS - optimal joins
Epsilon join
SELECT ... FROM TA, TB
WHERE TA.zone = TB.zone
AND TA.ra BETWEEN TB.ra - e
AND TB.ra + e
SPARK-24020: Sort-merge join “inner range optimization”
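The idea behind the range condition in the epsilon join can be illustrated with a small pure-Python merge over two lists of ra values from the same zone, both sorted ascending. This is a sketch of the general sliding-window technique, not the code from SPARK-24020; the function name and sample values are made up for illustration.

```python
def epsilon_join(left, right, eps):
    """Return all pairs (a, b) with a - eps <= b <= a + eps.
    Both inputs must be sorted in ascending order."""
    pairs = []
    start = 0  # first index in `right` that can still match the current or a later `a`
    for a in left:
        # Values below a - eps are also too small for every later (larger) a.
        while start < len(right) and right[start] < a - eps:
            start += 1
        # Scan forward only while values stay inside the window.
        j = start
        while j < len(right) and right[j] <= a + eps:
            pairs.append((a, right[j]))
            j += 1
    return pairs

ta_ra = [1.0, 2.0, 5.0]
tb_ra = [0.5, 1.4, 2.2, 4.0, 5.3]
print(epsilon_join(ta_ra, tb_ra, eps=0.5))
```

Because the window start only moves forward, each row is compared with the handful of rows near it in ra rather than with every row that shares its zone.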
Other approaches
Other systems use HEALPix or Hierarchical Triangular Mesh (HTM)
AXS performance results
Gaia (1.7 B) x SDSS (800 M)
37s warm (148s cold)
Gaia (1.7 B) x ZTF (2.9 B)
39s warm (315s cold)
Left: tests on a single large
machine. An AWS deployment
scales out nearly linearly, as
long as there are sufficient
partitions in the dataset.
AXS API
AXS - other functionalities
• crossmatch (return all or the first crossmatch candidate)
• region queries
• cone queries
• histogram
• histogram2d
• Spark array functions for handling lightcurve data
• All other Spark functions
Astronomy Example: Computing Light Curve Features with Python UDFs
This works on arbitrarily large datasets!
Cesium (Naul, 2016), Astronomy eXtensions for Spark (Zecevic+ 2018)
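The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of the kind of per-object function that could be registered as a Python UDF over an array column of magnitudes. The feature set and names are illustrative assumptions and do not reproduce Cesium's features or the AXS API.

```python
import math

def light_curve_features(mags):
    """Simple summary statistics for one light curve, given as a list of
    magnitudes. Returns None for an empty light curve."""
    n = len(mags)
    if n == 0:
        return None
    mean = sum(mags) / n
    variance = sum((m - mean) ** 2 for m in mags) / n  # population variance
    return {
        "n_obs": n,
        "mean": mean,
        "std": math.sqrt(variance),
        "amplitude": (max(mags) - min(mags)) / 2,  # half the peak-to-peak range
    }

print(light_curve_features([15.0, 15.5, 14.5, 15.0]))
```

Since the function only sees one object's light curve at a time, it can be applied independently to every row, which is what lets this pattern scale with the number of partitions.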
Observations and experiences
• Spark scales really well!
• SQL support is fantastic for supporting legacy code
• Efficient data exchange with Python is key to having reasonable
performance (Arrow and friends)
• The language barrier is non-trivial: astronomy is in Python, little
experience with JVM/Scala
• Pushing Spark into exploratory data analysis – the challenge of
converting a batch system to support more dynamic workflows.
“Astronomy 2025”
Towards a scalable
astronomical analysis
platform
DATA INTENSIVE RESEARCH IN
ASTROPHYSICS AND COSMOLOGY
DIRAC Data Engineering Group
We’re a collaborative incubator that supports people and communities
researching and building next generations of software technologies for
astronomy.
We emphasize cross-pollination with other fields and with industry, and
delivering usable, community-supported projects.
Backups
http://astro.washington.edu
EPSC-DPS Meeting 2019 • Geneva, Switzerland • September 16, 2019
LSST Science Drivers
Cataloging the Solar System
• Potentially Hazardous Asteroids
• Main Belt Asteroids
• Census of small bodies in the Solar System
Exploring the Transient sky
• Variable stars, Supernovae
• Fill in the variability phase-space
• Discovery of new classes of transients
Dark Matter, Dark Energy
• Weak Lensing
• Baryon acoustic oscillations
• Supernovae, Quasars
Milky Way Structure & Formation
• Structure and evolutionary history
• Spatial maps of stellar characteristics
• Reach well into the halo
Solar System Science with LSST
Animation: SDSS Asteroids
(Alex Parker, SwRI)
About 0.7 million are known
Will grow to >5 million in the next 5 years
Estimates: Lynne Jones et al.
Whole Dataset Operations
• Galactic structure: density/proper motion maps of the Galaxy
– => forall stars, compute distance, bin, create 5D map
• Galactic structure: dust distribution
– => forall stars, compute g-r color, bin, find blue tip edge,
infer dust distribution
• Near-field cosmology: MW satellite searches
– => forall stars, compute colors, convolve with spatial
filters, report any satellite-like peaks
• Variability: Bayesian classification of transients and
discovery of variables
– => forall stars, get light curves, compute likelihoods,
alert if interesting
• …
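Each of these operations follows the same pattern: visit every star, compute a quantity, and aggregate. A minimal single-machine sketch of the "compute g-r color, bin" step is shown below; the record layout (dictionaries with "g" and "r" magnitudes) and the bin width are assumptions made for illustration.

```python
import math
from collections import Counter

def color_histogram(stars, bin_width=0.1):
    """For all stars, compute the g-r color and count stars per color bin.
    Returns a dict mapping bin index to count, where a bin index k covers
    colors in [k * bin_width, (k + 1) * bin_width)."""
    counts = Counter()
    for star in stars:
        color = star["g"] - star["r"]
        counts[math.floor(color / bin_width)] += 1
    return dict(counts)

stars = [
    {"g": 15.25, "r": 15.0},
    {"g": 16.5, "r": 16.25},
    {"g": 17.0, "r": 16.25},
    {"g": 14.0, "r": 14.125},
]
print(color_histogram(stars))
```

On a distributed engine the per-star computation would run within each partition and the per-bin counts would be summed across partitions, so only the small histogram ever needs to be collected.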
Astronomical catalogs
• Just (big!) databases
• Each row corresponds to a detection or an object
(star/galaxy/asteroid)
• Producing catalogs from images is not trivial - non-exhaustive list of
problems (for software to solve):
– background estimation
– PSF estimation
– object detection
– image co-addition
– deblending
AXS history: LSD (Large Survey Database) by Mario Jurić
• Tool for querying, cross-matching and analysis of positionally or
temporally indexed datasets
• Inspired by Google's BigTable and MapReduce papers
• However it has some shortcomings:
– Fixed data partitioning (significant data skew)
– Time-partitioning problematic (most queries do not slice by
time)
– Not resilient to worker failures
– Contains a lot of custom solutions for functionalities that are
common today
Enter Spark and AXS
• Astronomy eXtensions for Spark
• DiRAC institute @ UW saw the need for a next-generation
astronomical analysis tool
• Efficient cross-matching
• Based on industry standards (Apache Spark)
• Provides simple (but powerful) astronomical API
extensions
• Easy to use on-premises or in the cloud
Scaling with Spark
https://www.toptal.com/spark/introduction-to-apache-spark
+ government-sponsored private clouds (e.g., JetStream)
Meeting the Challenges
Resources
Dataset Storage
Scalable
Analysis Code
Interface

More Related Content

What's hot

Hubble Telescope
Hubble TelescopeHubble Telescope
Hubble Telescope
Yatish Bathla
 
Agn presentation
Agn presentationAgn presentation
Agn presentationITAES
 
Astrochemistry
AstrochemistryAstrochemistry
Astrochemistry
mert baki
 
Black hole ppt
Black hole pptBlack hole ppt
Black hole ppt
Shashank Karamballi
 
Astrochemistry
AstrochemistryAstrochemistry
Astrochemistry
Prince Tiwari
 
Astronomical distance meassurements.pptx
Astronomical distance meassurements.pptxAstronomical distance meassurements.pptx
Astronomical distance meassurements.pptx
KSSuresh6
 
Brighton Astro - Neutron Star Presentation
Brighton Astro - Neutron Star PresentationBrighton Astro - Neutron Star Presentation
Brighton Astro - Neutron Star Presentation
Gareth Jenkins
 
Black Hole By Pranita & Priyanka
Black Hole By Pranita & PriyankaBlack Hole By Pranita & Priyanka
Black Hole By Pranita & Priyankasubzero64
 
Astronomical Spectroscopy
Astronomical SpectroscopyAstronomical Spectroscopy
Astronomical Spectroscopy
apoorvumang
 
Dark matter
Dark matterDark matter
Dark matter
nereasilviaangela
 
Chandryaan 2
Chandryaan 2Chandryaan 2
Chandryaan 2
AhmadAlImaad
 
Satellite fundamentals
Satellite fundamentals  Satellite fundamentals
Satellite fundamentals
Ghassan Hadi
 
What is Astrochemistry?
What is Astrochemistry?What is Astrochemistry?
What is Astrochemistry?
JoannaThompsonYezek
 
Chandrayaan-2
Chandrayaan-2Chandrayaan-2
Chandrayaan-2
ShahinShaik13
 
Sarita chauhan seminar on black hole
Sarita chauhan seminar on black holeSarita chauhan seminar on black hole
Sarita chauhan seminar on black hole
vishakhasarita
 
Cyclotron
CyclotronCyclotron
International space station
International space stationInternational space station
International space stationSultana Parwin
 
TOPIC 1: HISTORY OF RADIATION
TOPIC 1: HISTORY OF RADIATIONTOPIC 1: HISTORY OF RADIATION
TOPIC 1: HISTORY OF RADIATION
Nik Noor Ashikin Nik Ab Razak
 
The blackhole origins........
The blackhole origins........The blackhole origins........
The blackhole origins........
Jahnavi jaanu
 

What's hot (20)

Hubble Telescope
Hubble TelescopeHubble Telescope
Hubble Telescope
 
Agn presentation
Agn presentationAgn presentation
Agn presentation
 
Astrochemistry
AstrochemistryAstrochemistry
Astrochemistry
 
Black hole ppt
Black hole pptBlack hole ppt
Black hole ppt
 
Astrochemistry
AstrochemistryAstrochemistry
Astrochemistry
 
Astronomical distance meassurements.pptx
Astronomical distance meassurements.pptxAstronomical distance meassurements.pptx
Astronomical distance meassurements.pptx
 
Brighton Astro - Neutron Star Presentation
Brighton Astro - Neutron Star PresentationBrighton Astro - Neutron Star Presentation
Brighton Astro - Neutron Star Presentation
 
Black Hole By Pranita & Priyanka
Black Hole By Pranita & PriyankaBlack Hole By Pranita & Priyanka
Black Hole By Pranita & Priyanka
 
Astronomical Spectroscopy
Astronomical SpectroscopyAstronomical Spectroscopy
Astronomical Spectroscopy
 
Telescopes
TelescopesTelescopes
Telescopes
 
Dark matter
Dark matterDark matter
Dark matter
 
Chandryaan 2
Chandryaan 2Chandryaan 2
Chandryaan 2
 
Satellite fundamentals
Satellite fundamentals  Satellite fundamentals
Satellite fundamentals
 
What is Astrochemistry?
What is Astrochemistry?What is Astrochemistry?
What is Astrochemistry?
 
Chandrayaan-2
Chandrayaan-2Chandrayaan-2
Chandrayaan-2
 
Sarita chauhan seminar on black hole
Sarita chauhan seminar on black holeSarita chauhan seminar on black hole
Sarita chauhan seminar on black hole
 
Cyclotron
CyclotronCyclotron
Cyclotron
 
International space station
International space stationInternational space station
International space station
 
TOPIC 1: HISTORY OF RADIATION
TOPIC 1: HISTORY OF RADIATIONTOPIC 1: HISTORY OF RADIATION
TOPIC 1: HISTORY OF RADIATION
 
The blackhole origins........
The blackhole origins........The blackhole origins........
The blackhole origins........
 

Similar to Astronomical Data Processing on the LSST Scale with Apache Spark

AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
Mario Juric
 
Pablo Gomez - Solving Large-scale Challenges with ESA Datalabs
Pablo Gomez - Solving Large-scale Challenges with ESA DatalabsPablo Gomez - Solving Large-scale Challenges with ESA Datalabs
Pablo Gomez - Solving Large-scale Challenges with ESA Datalabs
Advanced-Concepts-Team
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
Mario Juric
 
Round Table Introduction: Analytics on 100 TB+ catalogs
Round Table Introduction: Analytics on 100 TB+ catalogsRound Table Introduction: Analytics on 100 TB+ catalogs
Round Table Introduction: Analytics on 100 TB+ catalogs
Mario Juric
 
Computational Training and Data Literacy for Domain Scientists
Computational Training and Data Literacy for Domain ScientistsComputational Training and Data Literacy for Domain Scientists
Computational Training and Data Literacy for Domain Scientists
Joshua Bloom
 
AstroCV: A computer vision library for Astronomy
AstroCV: A computer vision library for AstronomyAstroCV: A computer vision library for Astronomy
AstroCV: A computer vision library for Astronomy
Roberto Muñoz
 
The Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years InThe Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years In
Larry Smarr
 
LSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your QuestionsLSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your Questions
Mario Juric
 
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
Larry Smarr
 
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
Mario Juric
 
SKA Regional Sciences Centres - A Platform for Global Astronomy
SKA Regional Sciences Centres - A Platform for Global AstronomySKA Regional Sciences Centres - A Platform for Global Astronomy
SKA Regional Sciences Centres - A Platform for Global Astronomy
EUDAT
 
Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014William Comaskey
 
Cyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean ObservatoriesCyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean Observatories
Larry Smarr
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
inside-BigData.com
 
Toward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing CyberinfrastructureToward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing Cyberinfrastructure
Larry Smarr
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
LucaCinquini
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
University of Illinois at Urbana-Champaign
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 

Similar to Astronomical Data Processing on the LSST Scale with Apache Spark (20)

Presentation
PresentationPresentation
Presentation
 
AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
AstroInformatics 2015: Large Sky Surveys: Entering the Era of Software-Bound ...
 
Pablo Gomez - Solving Large-scale Challenges with ESA Datalabs
Pablo Gomez - Solving Large-scale Challenges with ESA DatalabsPablo Gomez - Solving Large-scale Challenges with ESA Datalabs
Pablo Gomez - Solving Large-scale Challenges with ESA Datalabs
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
 
Round Table Introduction: Analytics on 100 TB+ catalogs
Round Table Introduction: Analytics on 100 TB+ catalogsRound Table Introduction: Analytics on 100 TB+ catalogs
Round Table Introduction: Analytics on 100 TB+ catalogs
 
Computational Training and Data Literacy for Domain Scientists
Computational Training and Data Literacy for Domain ScientistsComputational Training and Data Literacy for Domain Scientists
Computational Training and Data Literacy for Domain Scientists
 
AstroCV: A computer vision library for Astronomy
AstroCV: A computer vision library for AstronomyAstroCV: A computer vision library for Astronomy
AstroCV: A computer vision library for Astronomy
 
The Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years InThe Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years In
 
LSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your QuestionsLSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your Questions
 
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a...
 
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica...
 
SKA Regional Sciences Centres - A Platform for Global Astronomy
SKA Regional Sciences Centres - A Platform for Global AstronomySKA Regional Sciences Centres - A Platform for Global Astronomy
SKA Regional Sciences Centres - A Platform for Global Astronomy
 
Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014
 
Cyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean ObservatoriesCyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean Observatories
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
 
Toward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing CyberinfrastructureToward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing Cyberinfrastructure
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 



Astronomical Data Processing on the LSST Scale with Apache Spark

  • 2. AXS - Astronomical Data Processing on the LSST Scale with Apache Spark. Petar Zečević (SV Group, University of Zagreb); Mario Jurić (DIRAC Institute, University of Washington). #UnifiedDataAnalytics #SparkAISummit
  • 3. About us Mario Jurić • Prof. of Astronomy at the University of Washington • Founding faculty of DIRAC & eScience Institute Fellow • Fmr. lead of LSST Data Management Petar Zečević • CTO at SV Group, Croatia • CS PhD student at University of Zagreb • Visiting Fellow at DiRAC institute @ UW • Author of “Spark in Action”
  • 5. Context: The Large Survey Revolution in Astronomy
  • 7. Hipparchus of Rhodes (180-125 BC) In 129 BC, constructed one of the first star catalogs, containing about 850 stars.
  • 8. Galileo Galilei (1564-1642) Researched a variety of topics in physics, but called out here for the introduction of the Galilean telescope. Galileo’s telescope allowed us for the first time to zoom in on the cosmos, and study the individual objects in great detail.
  • 9. The Astrophysics Two-Step • Surveys – Construct catalogs and maps of objects in the sky. Focus on coarse classification and discovering targets for further follow-up. • Large telescopes – Acquire detailed observations of a few representative objects. Understand the details of astrophysical processes that govern them, and extrapolate that understanding to the entire class.
  • 10. The Story of Astronomy: 2000 Years of being Data Poor
  • 11. Sloan Digital Sky Survey: 2.5m telescope; >14,500 deg²; 0.1″ astrometry; r<22.5 flux limit; 5 band, 1%, photometry for over 900M stars; over 3M R=2000 spectra; 10 years of ops: ~10 TB of imaging
  • 12. 1,231,051,050 rows (SDSS DR10, PhotoObjAll table), ~500 columns. Facilitated the development of large databases, data-driven discovery, motion towards what we recognize as Data Science today.
  • 13. Panoramic Survey Telescope and Rapid Response System: 1.8m telescope; 30,000 deg²; 50mas astrometry; r<23 flux limit; 5 band, better than 1% photometry (goal); ~700 GB/night
  • 15. The Large Synoptic Survey Telescope: A Public, Deep, Wide and Fast, Optical Sky Survey. First Light: 2020. Operations: 2022. Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds). Largest astronomical camera in the world. Will repeatedly observe the night sky over 10 years. 10 million alerts each night (60 seconds). 37 billion astronomical sources, with time series. 30 trillion measurements.
  • 16. Overview LSST’s mission is to build a well-understood system that provides a vast astronomical dataset for unprecedented discovery of the deep and dynamic universe.
  • 17. The Scale of Things to Come. Objects detected, measured, and stored in queryable catalogs (tables). Metric and amount:
      – Number of detections: 7 trillion rows
      – Number of objects: 37 billion rows
      – Nightly alert rate: 10 million
      – Nightly data rate: >15 TB
      – Alert latency: 60 seconds
      – Total images after 10 yrs: 50 PB
      – Total data after 10 yrs: 83 PB
  • 18. Catalog-driven Science • Once a catalog is available, astronomers “ask” all kinds of questions • The traditional paradigm: – Subset (filter data using a catalog SQL interface online) – Download data locally – Analyze (usually Python)
  • 19. Challenges (part 0) Dataset Size (keeping ~PBs of data in RDBMSes is not easy, or cheap). What do you do when the dataset subset is a few ~TBs?
  • 20. Challenges (part 1) I Want it All (interesting science w. whole dataset operations). Better Together (joining datasets is powerful). Dataset Size (keeping ~TBs of data in RDBMSes is not easy).
  • 21. Challenges (part 2) Scalability (how do I write an analysis code that will scale to petabytes of data?). Resources (where are the resources to run this code?). How do you scale exploratory data analysis to ~PB-sized datasets and thousands of simultaneous users?
  • 22. Enter Spark, AXS • AXS: Astronomy eXtensions for Spark • The main idea: – Spark is a proven, scalable, cloud-ready and widely-supported analytics framework with full SQL support (legacy support). – Extend it to exploratory data analysis. – Add a scalable positional cross-match operator. – Add a domain-specific Python API layer to PySpark. – Couple to S3 API for storage, Kubernetes for orchestration… • … A scalable platform supporting an arbitrarily sized dataset and a large number of users, deployable on either public or private cloud.
  • 23. Key Issue: Scalable Cross-matching. [Figure labels: DEC and RA coordinates; search perimeter (can also use similarity); a match]
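The cross-matching problem named on this slide can be illustrated with a brute-force sketch: two objects match when their angular separation on the sky is within a search radius. This is not AXS code; the function names, the tuple layout `(id, ra, dec)`, and the sample coordinates are all assumptions made for illustration. Its quadratic cost is the reason a scalable operator is needed.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def naive_crossmatch(cat_a, cat_b, radius_deg):
    """Return (id_a, id_b) pairs within radius_deg.

    Hypothetical row layout: (id, ra_deg, dec_deg).
    Compares every pair, so the cost is O(len(cat_a) * len(cat_b)).
    """
    return [
        (id_a, id_b)
        for id_a, ra_a, dec_a in cat_a
        for id_b, ra_b, dec_b in cat_b
        if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg
    ]

# Made-up sample positions, matched with a 2 arcsecond search radius.
cat_a = [("a1", 10.0, 20.0), ("a2", 200.0, -5.0)]
cat_b = [("b1", 10.0002, 20.0001), ("b2", 50.0, 50.0)]
pairs = naive_crossmatch(cat_a, cat_b, radius_deg=2.0 / 3600.0)
```

At billions of rows per catalog, comparing every pair is infeasible, which motivates the partitioning scheme on the next slide.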
  • 24. AXS data partitioning • Data partitioning is at the root of AXS' efficient cross-matching • Based on the late Jim Gray's “zones algorithm” (Microsoft Research) • Sky divided into horizontal “zones” of a certain height • Adapted for distributed architectures • Data stored in Parquet files – bucketed by zone – sorted by zone and ra columns – data from zone borders duplicated to the zone below
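The layout described on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the AXS implementation: the zone height, the border margin, the formula `floor((dec + 90) / height)`, and the reading that rows near the lower edge of a zone are copied into the zone beneath it are all assumptions, as are the function names.

```python
import math

ZONE_HEIGHT_DEG = 1.0 / 60.0  # hypothetical zone height (one arcminute)
MARGIN_DEG = 10.0 / 3600.0    # hypothetical border margin (ten arcseconds)

def zone_of(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Index of the horizontal zone containing a declination (zone 0 starts at dec = -90)."""
    return int(math.floor((dec_deg + 90.0) / zone_height))

def zones_for_row(dec_deg, zone_height=ZONE_HEIGHT_DEG, margin=MARGIN_DEG):
    """Return [(zone, is_duplicate)]: the home zone, plus the zone below for border rows."""
    zone = zone_of(dec_deg, zone_height)
    placements = [(zone, False)]
    lower_edge = zone * zone_height - 90.0
    if zone > 0 and dec_deg - lower_edge < margin:
        placements.append((zone - 1, True))
    return placements

def layout(rows, zone_height=ZONE_HEIGHT_DEG, margin=MARGIN_DEG):
    """Expand (id, ra, dec) rows into (zone, ra, dec, id, is_duplicate), sorted by zone, then ra."""
    out = []
    for obj_id, ra, dec in rows:
        for zone, dup in zones_for_row(dec, zone_height, margin):
            out.append((zone, ra, dec, obj_id, dup))
    out.sort(key=lambda r: (r[0], r[1]))
    return out

# Coarse one-degree zones make the made-up example easy to follow.
rows = [("s1", 30.0, 0.05), ("s2", 10.0, 0.5), ("s3", 20.0, -0.5)]
table = layout(rows, zone_height=1.0, margin=0.1)
```

With this layout, a pair that straddles a zone border can be found without leaving a single zone, so each zone can be joined independently.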
  • 26. AXS - optimal joins
  • 27. AXS - optimal joins
  • 28. Epsilon join: SELECT ... FROM TA, TB WHERE TA.zone = TB.zone AND TA.ra BETWEEN TB.ra - e AND TB.ra + e. SPARK-24020: Sort-merge join “inner range optimization”
  • 29. Other approaches: other systems use HEALPix or Hierarchical Triangular Mesh (HTM)
  • 30. AXS performance results. Gaia (1.7 B) x SDSS (800 M): 37s warm (148s cold). Gaia (1.7 B) x ZTF (2.9 B): 39s warm (315s cold). Left: tests on a single large machine. An AWS deployment scales out nearly linearly, as long as there are sufficient partitions in the dataset.
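The epsilon join on this slide can be sketched as a merge over two inputs that are grouped by zone and sorted by `ra`. The code below only illustrates the idea of a moving window over sorted data; it is not the Spark implementation behind SPARK-24020. The function name and the row layout `(zone, ra, id)` are assumptions, and, like the SQL on the slide, it ignores RA wrap-around at 0/360 degrees.

```python
from collections import defaultdict

def epsilon_join(table_a, table_b, eps):
    """Emulate: WHERE a.zone = b.zone AND a.ra BETWEEN b.ra - eps AND b.ra + eps.

    Hypothetical row layout: (zone, ra, id). Returns (id_a, id_b) pairs.
    """
    by_zone_a, by_zone_b = defaultdict(list), defaultdict(list)
    for zone, ra, obj_id in table_a:
        by_zone_a[zone].append((ra, obj_id))
    for zone, ra, obj_id in table_b:
        by_zone_b[zone].append((ra, obj_id))

    matches = []
    for zone in sorted(by_zone_a.keys() & by_zone_b.keys()):
        a_rows = sorted(by_zone_a[zone])
        b_rows = sorted(by_zone_b[zone])
        start = 0  # first B row that can still match the current or any later A row
        for ra_a, id_a in a_rows:
            # A is sorted, so B rows left of (ra_a - eps) can never match again.
            while start < len(b_rows) and b_rows[start][0] < ra_a - eps:
                start += 1
            # Scan only the B rows inside [ra_a - eps, ra_a + eps].
            j = start
            while j < len(b_rows) and b_rows[j][0] <= ra_a + eps:
                matches.append((id_a, b_rows[j][1]))
                j += 1
    return matches

# Made-up rows: only zone 1 is present in both tables.
ta = [(1, 10.0, "a1"), (1, 10.5, "a2"), (2, 10.0, "a3")]
tb = [(1, 10.1, "b1"), (1, 10.45, "b2"), (1, 12.0, "b3"), (3, 10.0, "b4")]
result = epsilon_join(ta, tb, eps=0.2)
```

Because the window start only moves forward, each zone is processed in roughly linear time plus the number of candidate pairs, instead of comparing every row of one table with every row of the other.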
  • 32. AXS - other functionalities • crossmatch (return all or the first crossmatch candidate) • region queries • cone queries • histogram • histogram2d • Spark array functions for handling lightcurve data • All other Spark functions
  • 33. Astronomy Example: Computing Light Curve Features with Python UDFs This works on arbitrarily large datasets! Cesium (Naul, 2016), Astronomy eXtensions for Spark (Zecevic+ 2018)
  • 34. Observations and experiences • Spark scales really well! • SQL support is fantastic for supporting legacy code • Efficient data exchange with Python is key to having reasonable performance (Arrow and friends) • The language barrier is non-trivial: astronomy is in Python, little experience with JVM/Scala • Pushing Spark into exploratory data analysis – the challenge of converting a batch system to support more dynamic workflows.
  • 35. “Astronomy 2025” Towards a scalable astronomical analysis platform
  • 36. DATA INTENSIVE RESEARCH IN ASTROPHYSICS AND COSMOLOGY DIRAC Data Engineering Group We’re a collaborative incubator that supports people and communities researching and building next generations of software technologies for astronomy. We emphasize cross-pollination with other fields, the industry, and delivering usable, community supported, projects.
  • 40. LSST Science Drivers • Cataloging the Solar System: Potentially Hazardous Asteroids; Main Belt Asteroids; Census of small bodies in the Solar System • Exploring the Transient sky: Variable stars, Supernovae; Fill in the variability phase-space; Discovery of new classes of transients • Dark Matter, Dark Energy: Weak Lensing; Baryon acoustic oscillations; Supernovae, Quasars • Milky Way Structure & Formation: Structure and evolutionary history; Spatial maps of stellar characteristics; Reach well into the halo (EPSC-DPS Meeting 2019, Geneva, Switzerland, September 16, 2019)
  • 41. Solar System Science with LSST. Animation: SDSS Asteroids (Alex Parker, SwRI). ~0.7 million are known. Will grow to >5 million in the next 5 years. Estimates: Lynne Jones et al.
  • 43. Whole Dataset Operations • Galactic structure: density/proper motion maps of the Galaxy – => forall stars, compute distance, bin, create 5D map • Galactic structure: dust distribution – => forall stars, compute g-r color, bin, find blue tip edge, infer dust distribution • Near-field cosmology: MW satellite searches – => forall stars, compute colors, convolve with spatial filters, report any satellite-like peaks • Variability: Bayesian classification of transients and discovery of variables – => forall stars, get light curves, compute likelihoods, alert if interesting • …
  • 44. Astronomical catalogs • Just (big!) databases • Each row corresponds to a detection or an object (star/galaxy/asteroid) • Producing catalogs from images is not trivial - non-exhaustive list of problems (for software to solve): – background estimation – PSF estimation – object detection – image co-addition – deblending
  • 45. AXS history: LSD by Mario Jurić • Tool for querying, cross-matching and analysis of positionally or temporally indexed datasets • Inspired by Google's BigTable and MapReduce papers • However, it has some shortcomings: – Fixed data partitioning (significant data skew) – Time-partitioning problematic (most queries do not slice by time) – Not resilient to worker failures – Contains a lot of custom solutions for functionalities that are common today
  • 46. Enter Spark and AXS • Astronomy eXtensions for Spark • DiRAC institute @ UW saw the need for a next-generation astronomical analysis tool • Efficient cross-matching • Based on industry standards (Apache Spark) • Provides simple (but powerful) astronomical API extensions • Easy to use on-premises or in the cloud
  • 48. + government-sponsored private clouds (e.g., JetStream)
  • 49. Meeting the Challenges Resources Dataset Storage Scalable Analysis Code Interface