How to interactively visualise and
explore a billion objects (with vaex)
Maarten Breddels
&
Amina Helmi
PyData Paris 2016
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in December 2013
• determines positions, velocities and astrophysical parameters of >10^9 stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech
Image: Devin Powell
Motivation
• We will soon have the Gaia catalogue
• > 10^9 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neural network
• Problem
• Scatter plots do not work well for 10^9 rows/objects (like Gaia)
• Work with densities/averages in 1, 2 and 3d
• Interactive?
• Zoom, pan etc.
• Explore
• selections/queries
• Subspace ranking
• Python
• rich set of libraries, becoming the default in science
Situation
• TOPCAT comes close, but is not fast enough, works with individual rows/particles, has no exploratory tools, and is written in Java.
• Your own IDL/Python code: a lot to consider to do it optimally (multicore, efficient storage, efficient algorithms; interactivity becomes complex)
• DataShader?
• We want something that visualizes 10^9 rows/objects in ~1 second
• Do we need to resort to Big Data solutions?
Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’?
Interactive? How?
• Interactive?
• 10^9 * 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 GHz (but multicore, say 4): 12 * 10^9 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/is cached in memory, otherwise SSD/HDD speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
• Aquarius A-2: pure dark matter N-body simulation of a Milky Way-like halo
• 6 * 10^8 particles
• < 1 second
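The estimate on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: the bandwidth, clock rate and core count are the round numbers quoted on the slide, not measurements, and the variable names are made up for illustration.

```python
# Back-of-envelope estimate using the round numbers quoted on the slide.
n_rows = 10**9          # rows/objects
n_columns = 2           # e.g. x and y for a 2d histogram
bytes_per_value = 8     # double precision

data_bytes = n_rows * n_columns * bytes_per_value
data_gib = data_bytes / 2**30

# Time to stream the data once at the assumed memory bandwidth (GiB/s).
time_at_10 = data_gib / 10.0
time_at_20 = data_gib / 20.0

# CPU budget: clock rate times number of cores, spread over all rows
# when the whole pass has to finish in about one second.
cycles_per_second = 3e9 * 4
cycles_per_row = cycles_per_second / n_rows

print(f"data size: {data_gib:.1f} GiB")
print(f"streaming time: {time_at_20:.2f}-{time_at_10:.2f} s")
print(f"cycle budget: {cycles_per_row:.0f} cycles per row")
```

The small cycle budget per row is why only very simple per-row work, such as incrementing a bin in a grid, fits in an interactive time frame.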
How to store and read the data
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB of data requires 30 GB if you want to use the file system cache
• Memory mapping:
• get direct access to the OS memory cache: no copy, no setup (apart from the kernel setting up the pages)
• avoids memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.75-1.5 seconds, at 10-20 GB/s
• Can be 2-3x slower (CPU cache helps a bit)
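The difference between reading and memory mapping can be sketched with NumPy. This is a minimal illustration using a small temporary file of raw doubles as a stand-in for one column of a large catalogue; it is not how vaex reads hdf5 or fits files, and the file name is made up.

```python
import os
import tempfile

import numpy as np

# Write a small column of doubles to disk (stand-in for one column of a large file).
n_rows = 1_000_000
rng = np.random.default_rng(42)
path = os.path.join(tempfile.mkdtemp(), "x.f8")
rng.normal(size=n_rows).astype(np.float64).tofile(path)

# POSIX-style read: allocates a new buffer and copies the bytes into it.
x_copy = np.fromfile(path, dtype=np.float64)

# Memory mapping: the array is backed directly by the OS page cache.
x_mapped = np.memmap(path, dtype=np.float64, mode="r", shape=(n_rows,))

# Both arrays hold the same values and can be binned in the same way.
counts_copy, _ = np.histogram(x_copy, bins=64, range=(-5, 5))
counts_mapped, _ = np.histogram(x_mapped, bins=64, range=(-5, 5))
```

With the mapped array, pages are loaded on first access and stay in the file system cache, so a second pass over the same column does not need a second buffer.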
Translate to something visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map to a color
• 3d: volume rendering
How to get good performance
• Pure Python
• slow, GIL
• NumPy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, C performance
• (nogil didn’t exist)
600 times faster
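A binning loop of the kind discussed here could look as follows with Numba. This is a hedged sketch, not the code used in vaex: the function name `histogram2d` and its arguments are made up, and the `ImportError` fallback is an assumption added so the example still runs (slowly) without Numba installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback (assumption for portability): without Numba the loop runs as plain Python.
    def jit(*args, **kwargs):
        return lambda func: func


@jit(nopython=True, nogil=True)
def histogram2d(x, y, counts, xmin, xmax, ymin, ymax):
    """Add one count per row to the 2d grid `counts`, in place."""
    nx, ny = counts.shape
    scale_x = nx / (xmax - xmin)
    scale_y = ny / (ymax - ymin)
    for i in range(x.shape[0]):
        # Skip rows that fall outside the requested range.
        if x[i] >= xmin and x[i] < xmax and y[i] >= ymin and y[i] < ymax:
            # Clamp to guard against rounding just below the upper edge.
            ix = min(int((x[i] - xmin) * scale_x), nx - 1)
            iy = min(int((y[i] - ymin) * scale_y), ny - 1)
            counts[ix, iy] += 1
    return counts


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)
counts = np.zeros((32, 32), dtype=np.int64)
histogram2d(x, y, counts, -4.0, 4.0, -4.0, 4.0)
```

Because the compiled loop releases the GIL, several threads could each bin a different chunk of rows into their own grid and the grids could be summed afterwards.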
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sum 1’s
• Sum values
[Figure: columns x, y and v; each row adds +1 to its (x, y) bin, or +v or v^2 for a weighted histogram]
• Possibilities
• Total: flux, mass
• Mean: velocity, metallicity
• Dispersions: velocity…
• Basically statistics on a grid
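The idea of statistics on a grid can be sketched with NumPy's weighted histograms: summing 1, v and v^2 per bin gives the count, mean and dispersion. The data below are made up for the demonstration (a toy velocity `v` that depends on `x`); this is not the vaex implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)
v = rng.normal(loc=x, scale=0.5)   # toy velocity that depends on x (made up for the demo)

bins = 16
limits = [[-1, 1], [-1, 1]]
count, _, _ = np.histogram2d(x, y, bins=bins, range=limits)                  # sum of 1's
sum_v, _, _ = np.histogram2d(x, y, bins=bins, range=limits, weights=v)       # sum of v
sum_v2, _, _ = np.histogram2d(x, y, bins=bins, range=limits, weights=v**2)   # sum of v^2

mean_v = sum_v / count                               # mean velocity per bin
dispersion_v = np.sqrt(sum_v2 / count - mean_v**2)   # velocity dispersion per bin
```

Only three small grids need to be kept in memory, regardless of the number of rows, and each of them can be shown with a colormap like an ordinary 2d histogram.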
Examples
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom, pan
• interactive selection (lasso, rectangle)
• client
• undo/redo
Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build-up of a MW (stellar) halo from 33 satellites
• 3.3 * 10^6 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
Exploration: Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
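Mutual information
• Measure the information loss between p(x,y) and p(x)p(y)
• information loss measured using the KL divergence:
$$I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$
• In short:
• How much does breaking up the correlation (not just linear) change the density distribution?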
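On binned data the mutual information becomes a sum over grid cells of p(x,y) log(p(x,y) / (p(x)p(y))), which is cheap to compute from a 2d histogram. The function below is a minimal sketch under that assumption; the name `mutual_information`, the number of bins and the use of natural logarithms (nats) are choices made for this example, not taken from the talk.

```python
import numpy as np


def mutual_information(x, y, bins=64):
    """Estimate I(X;Y) in nats from a 2d histogram of x and y."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()              # joint p(x,y) on the grid
    px = pxy.sum(axis=1, keepdims=True)      # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)      # marginal p(y)
    mask = pxy > 0                           # empty bins contribute nothing
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask])))


rng = np.random.default_rng(2)
a = rng.normal(size=200_000)
b = rng.normal(size=200_000)
mi_independent = mutual_information(a, b)             # close to zero
mi_correlated = mutual_information(a, a + 0.1 * b)    # clearly larger
```

Ranking all pairs of columns by this number puts subspaces with structure at the top and featureless ones at the bottom, so the most interesting plots can be inspected first.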
Ranking by Mutual information
Subspace | I(X;Y)
x,y | 0.003
Lz,L | 0.837
[Figure: panels showing p(x,y) and p(x)p(y) for each subspace]
Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic coordinates
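A virtual column can be thought of as an expression stored as text and evaluated on the fly for each chunk of rows, so it costs no extra storage. The sketch below illustrates that idea with plain NumPy and `eval`; the helper `evaluate`, the dictionaries and the sample values are hypothetical and do not reflect how vaex implements expressions.

```python
import numpy as np

# Real columns (tiny made-up sample) and one virtual column defined as text.
columns = {"x": np.array([3.0, 6.0, 5.0]), "y": np.array([4.0, 8.0, 12.0])}
virtual_columns = {"r": "sqrt(x**2 + y**2)"}   # stored as an expression, not as data
functions = {"sqrt": np.sqrt, "log": np.log}


def evaluate(expression, start, stop):
    """Evaluate an expression on rows [start, stop) without storing new columns."""
    namespace = dict(functions)
    for name, values in columns.items():
        namespace[name] = values[start:stop]
    # Virtual columns are computed only for the requested chunk of rows.
    for name, definition in virtual_columns.items():
        namespace[name] = eval(definition, {"__builtins__": {}}, namespace)
    return eval(expression, {"__builtins__": {}}, namespace)


r = evaluate("r", 0, 3)
log_r = evaluate("log(r)", 0, 3)
```

Because only a chunk of rows is evaluated at a time, the result can be fed straight into the binning step without ever materialising a full extra column.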
How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
Client/Server
• 2d histogram example
• raw data: 15 GB
• binned 256x256 grid
• 500 KB, or 10-100 KB compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
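The size argument for a client/server setup can be checked on a small scale: the binned grid has a fixed size no matter how many rows went into it. The sketch below uses made-up Gaussian data and zlib as an example compressor; the talk does not say which compression is used.

```python
import zlib

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

raw_bytes = x.nbytes + y.nbytes    # grows with the number of rows

grid, _, _ = np.histogram2d(x, y, bins=256, range=[[-5, 5], [-5, 5]])
grid_bytes = grid.nbytes           # 256 * 256 * 8 bytes, independent of the row count
compressed_bytes = len(zlib.compress(grid.tobytes()))

print(f"raw: {raw_bytes} bytes, grid: {grid_bytes} bytes, compressed: {compressed_bytes} bytes")
```

Only the grid has to travel over the network, so the server can hold the full dataset while the client receives a few hundred kilobytes per view.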
Random subset / shuffled storage
…
dask?
wendelin?
Get vaex
• standalone binary (OS X, Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install --user --pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
Summary
• vaex: visualisation and exploration
• of large datasets, 10^6-10^9 objects
• fast: > 10^9 objects/second
• 1, 2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections + linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is the Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N-body)
Vaex talk-pydata-paris

More Related Content

Viewers also liked

Python Ireland Feb '11 Talks: Introduction to Python
Python Ireland Feb '11 Talks: Introduction to PythonPython Ireland Feb '11 Talks: Introduction to Python
Python Ireland Feb '11 Talks: Introduction to Python
Python Ireland
 
Kez`s story board
Kez`s story boardKez`s story board
Kez`s story board
Stephendodd64
 
Baabtra.com little coder chapter - 3
Baabtra.com little coder   chapter - 3Baabtra.com little coder   chapter - 3
Baabtra.com little coder chapter - 3
baabtra.com - No. 1 supplier of quality freshers
 
Python lecture 12
Python lecture 12Python lecture 12
Python lecture 12
Tanwir Zaman
 
Practical idioms in Python - Pycon india 2014 proposal
Practical idioms in Python -  Pycon india 2014 proposalPractical idioms in Python -  Pycon india 2014 proposal
Practical idioms in Python - Pycon india 2014 proposal
Pratik M
 
Python a practical perspective
Python   a practical perspectivePython   a practical perspective
Python a practical perspective
Iulian-Nicu Şerbănoiu
 
Mastering python lesson2
Mastering python lesson2Mastering python lesson2
Mastering python lesson2
Ruth Marvin
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
Pragati Singh
 
Mastering Python lesson 3a
Mastering Python lesson 3aMastering Python lesson 3a
Mastering Python lesson 3a
Ruth Marvin
 
Python Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & GeneratorsPython Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & Generators
P3 InfoTech Solutions Pvt. Ltd.
 
Teach your kids how to program with Python and the Raspberry Pi
Teach your kids how to program with Python and the Raspberry PiTeach your kids how to program with Python and the Raspberry Pi
Teach your kids how to program with Python and the Raspberry Pi
Juan Gomez
 
web programming Unit VIII complete about python by Bhavsingh Maloth
web programming Unit VIII complete about python  by Bhavsingh Malothweb programming Unit VIII complete about python  by Bhavsingh Maloth
web programming Unit VIII complete about python by Bhavsingh Maloth
Bhavsingh Maloth
 
R Programming
R ProgrammingR Programming
R Programming
Abhishek Pratap Singh
 
Programming Basics
Programming BasicsProgramming Basics
Programming Basics
Abhishek Pratap Singh
 
Python programming - Everyday(ish) Examples
Python programming - Everyday(ish) ExamplesPython programming - Everyday(ish) Examples
Python programming - Everyday(ish) Examples
Ashish Sharma
 
Python-List comprehension
Python-List comprehensionPython-List comprehension
Python-List comprehension
Colin Su
 
Thinking hard about_python
Thinking hard about_pythonThinking hard about_python
Thinking hard about_python
Daniel Greenfeld
 
Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)
David Beazley (Dabeaz LLC)
 
Begin with Python
Begin with PythonBegin with Python
Begin with Python
Narong Intiruk
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live Without
Audrey Roy
 

Viewers also liked (20)

Python Ireland Feb '11 Talks: Introduction to Python
Python Ireland Feb '11 Talks: Introduction to PythonPython Ireland Feb '11 Talks: Introduction to Python
Python Ireland Feb '11 Talks: Introduction to Python
 
Kez`s story board
Kez`s story boardKez`s story board
Kez`s story board
 
Baabtra.com little coder chapter - 3
Baabtra.com little coder   chapter - 3Baabtra.com little coder   chapter - 3
Baabtra.com little coder chapter - 3
 
Python lecture 12
Python lecture 12Python lecture 12
Python lecture 12
 
Practical idioms in Python - Pycon india 2014 proposal
Practical idioms in Python -  Pycon india 2014 proposalPractical idioms in Python -  Pycon india 2014 proposal
Practical idioms in Python - Pycon india 2014 proposal
 
Python a practical perspective
Python   a practical perspectivePython   a practical perspective
Python a practical perspective
 
Mastering python lesson2
Mastering python lesson2Mastering python lesson2
Mastering python lesson2
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Mastering Python lesson 3a
Mastering Python lesson 3aMastering Python lesson 3a
Mastering Python lesson 3a
 
Python Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & GeneratorsPython Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & Generators
 
Teach your kids how to program with Python and the Raspberry Pi
Teach your kids how to program with Python and the Raspberry PiTeach your kids how to program with Python and the Raspberry Pi
Teach your kids how to program with Python and the Raspberry Pi
 
web programming Unit VIII complete about python by Bhavsingh Maloth
web programming Unit VIII complete about python  by Bhavsingh Malothweb programming Unit VIII complete about python  by Bhavsingh Maloth
web programming Unit VIII complete about python by Bhavsingh Maloth
 
R Programming
R ProgrammingR Programming
R Programming
 
Programming Basics
Programming BasicsProgramming Basics
Programming Basics
 
Python programming - Everyday(ish) Examples
Python programming - Everyday(ish) ExamplesPython programming - Everyday(ish) Examples
Python programming - Everyday(ish) Examples
 
Python-List comprehension
Python-List comprehensionPython-List comprehension
Python-List comprehension
 
Thinking hard about_python
Thinking hard about_pythonThinking hard about_python
Thinking hard about_python
 
Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)
 
Begin with Python
Begin with PythonBegin with Python
Begin with Python
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live Without
 

Similar to Vaex talk-pydata-paris

Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
Maarten Breddels
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
Csaba Toth
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
Guerrilla
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
huguk
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
c.titus.brown
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
Christos Charmatzis
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
DataStax Academy
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
Srinath Perera
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
c.titus.brown
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
Doug Needham
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
Srinath Perera
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Docker, Inc.
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
bodaceacat
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
Sara-Jayne Terp
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 

Similar to Vaex talk-pydata-paris (20)

Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 

Recently uploaded

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
Vaex talk-pydata-paris

  • 1. How to interactively visualise and explore a billion objects (with vaex) Maarten Breddels & Amina Helmi PyData Paris 2016
  • 3. Copyright ESA/ATG medialab; background: ESO/S. Brunier • Gaia satellite • launched by ESA in December 2013 • determines positions, velocities and astrophysical parameters of >10^9 stars of the Milky Way • First catalogue in September 2016 • Final catalogue ~2020
  • 12. Motivation • We will soon have the Gaia catalogue • > 10^9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neural network • Problem • Scatter plots do not work well for 10^9 rows/objects (like Gaia) • Work with densities/averages in 1, 2 and 3d • Interactive? • Zoom, pan etc. • Explore • selections/queries • Subspace ranking • Python • rich set of libraries, becoming the default in science
  • 13. Situation • TOPCAT comes close, but is not fast enough, works with individual rows/particles, has no exploratory tools, and is written in Java. • Your own IDL/Python code: a lot to consider to do it optimally (multicore, efficient storage, efficient algorithms, interactive becomes complex) • DataShader? • We want something to visualize 10^9 rows/objects in ~1 second • Do we need to resort to Big Data solutions?
  • 14. Big data? • Big data • definitions vary • Practical question to ask • Do you need Big Data solutions? • Many computers / grid • Invest in software • Or • Can we do it on a single computer, ‘old style’?
  • 17. Interactive? How? • Interactive? • 10^9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 GHz (but multicore, say 4): 12 × 10^9 cycles/second, i.e. ~12 cycles per row in 1 second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise SSD/HDD speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning • Aquarius A-2: pure dark matter N-body simulation of a Milky Way-like halo • 6 × 10^8 particles • < 1 second
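The budget on this slide can be checked with a few lines of arithmetic. This is only a sketch: the bandwidth, clock speed and core count are the round figures quoted on the slide, not measurements.

```python
# Back-of-envelope budget for binning 1e9 rows of two float64 columns.
rows = 10**9
columns = 2
bytes_per_value = 8                       # a double is 8 bytes

total_bytes = rows * columns * bytes_per_value
total_gib = total_bytes / 2**30

# Assumed hardware figures, taken from the slide.
bandwidth_gib_per_s = (10, 20)
clock_hz = 3e9
cores = 4

# Time to stream the data once from memory at each bandwidth.
read_seconds = [total_gib / b for b in bandwidth_gib_per_s]
# CPU cycles available per row if the whole pass must finish in one second.
cycles_per_row = clock_hz * cores / rows

print(f"data size: {total_gib:.1f} GiB")
print(f"read time: {read_seconds[1]:.2f}-{read_seconds[0]:.2f} s")
print(f"cycles per row in 1 s: {cycles_per_row:.0f}")
```

With roughly a dozen cycles per row, only very simple per-row work such as incrementing a histogram bin fits in the one-second budget.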
  • 20. How to store and read the data • Storage: native, column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data requires 30 GB if you want to use the file system cache • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.75-1.5 seconds at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
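The difference between a conventional read and memory mapping can be sketched with NumPy. The file name and the raw float64 layout are assumptions made for this illustration; the slide names hdf5 and fits as the actual storage formats.

```python
import os
import tempfile

import numpy as np

# Write a small column of float64 values to disk as raw native bytes.
n = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "x.f64")   # hypothetical file name
np.arange(n, dtype=np.float64).tofile(path)

# Conventional read: allocates a new buffer and copies the file contents into it.
x_copy = np.fromfile(path, dtype=np.float64)

# Memory mapping: the array is backed directly by the OS page cache, no extra copy.
x_map = np.memmap(path, dtype=np.float64, mode="r", shape=(n,))

# Both behave like ordinary arrays for read-only computations.
total_copy = float(x_copy.sum())
total_map = float(x_map.sum())
print(total_copy == total_map)
```

The mapped array costs no extra memory beyond the page cache, which is the point made on the slide about leaving more cache available.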
  • 23. Translate to something visual • 1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 27. How to get good performance • Pure Python • slow, GIL • NumPy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, C performance • (nogil didn’t exist) • 600 times faster
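A minimal sketch of the Numba route mentioned on this slide: a plain Python loop that adds +1 to a 2d grid, compiled when Numba is available. The function name `bin2d` and its signature are made up for this illustration and are not vaex's actual binning routine; the fallback decorator only exists so the sketch still runs without Numba.

```python
import numpy as np

try:
    from numba import njit            # optional dependency
except ImportError:                   # fall back to plain (slow) Python
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(nogil=True)
def bin2d(x, y, xmin, xmax, ymin, ymax, counts):
    """Add +1 to the grid cell of every (x, y) row inside the limits."""
    nx, ny = counts.shape
    for i in range(x.shape[0]):
        # Scale each coordinate to [0, 1) and truncate to a bin index.
        fx = (x[i] - xmin) / (xmax - xmin)
        fy = (y[i] - ymin) / (ymax - ymin)
        if 0.0 <= fx < 1.0 and 0.0 <= fy < 1.0:
            counts[int(fx * nx), int(fy * ny)] += 1
    return counts


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)
counts = bin2d(x, y, -5.0, 5.0, -5.0, 5.0, np.zeros((64, 64), dtype=np.int64))
print(counts.sum())
```

Because the compiled loop releases the GIL, several threads could each bin a slice of the rows into their own grid and the grids could be summed afterwards.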
  • 33. Great… 2d histograms… • There is more • Weighted histogram • Don’t sum 1’s • Sum values • [Figure: rows of the x and y columns are mapped to a grid cell, which is incremented by +1; with a value column v the cell is incremented by +v or v^2 instead] • Possibilities • Total: flux, mass • Mean: velocity, metallicity • Dispersions: velocity… • Basically statistics on a grid
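The "statistics on a grid" idea can be sketched with `numpy.bincount`: summing 1, v and v^2 per cell gives the count, total, mean and dispersion. The helper name `grid_statistics` is hypothetical and this is a NumPy illustration, not the implementation used in vaex.

```python
import numpy as np


def grid_statistics(x, y, v, limits, shape):
    """Count, sum, mean and dispersion of v on a 2d grid over (x, y)."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    # Bin index of every row along each axis.
    ix = np.floor((x - xmin) / (xmax - xmin) * nx).astype(np.int64)
    iy = np.floor((y - ymin) / (ymax - ymin) * ny).astype(np.int64)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[inside] * ny + iy[inside]
    w = v[inside]
    # Three weighted histograms: sum of 1, sum of v, sum of v^2.
    count = np.bincount(flat, minlength=nx * ny).reshape(shape)
    total = np.bincount(flat, weights=w, minlength=nx * ny).reshape(shape)
    total2 = np.bincount(flat, weights=w**2, minlength=nx * ny).reshape(shape)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = total / count                      # NaN in empty cells
        dispersion = np.sqrt(np.maximum(total2 / count - mean**2, 0.0))
    return count, total, mean, dispersion


x = np.array([0.1, 0.2, 0.9])
y = np.array([0.1, 0.2, 0.9])
v = np.array([1.0, 3.0, 5.0])
count, total, mean, dispersion = grid_statistics(x, y, v, ((0, 1), (0, 1)), (2, 2))
print(count)
print(mean)
```

Empty cells come out as NaN in the mean and dispersion grids, which a plotting layer would typically mask.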
  • 40. Exploration • Selections • expressions • visual • linked views • Subspace ranking
  • 42. Vaex: Visualization And EXploration • A library • python package • ‘import vaex’ • reading of data • multithreading • binning (1,2,3, Nd) • selections/queries • server/client • integrates with IPython notebook • A GUI program • Gives interactive navigation, zoom, pan • interactive selection (lasso, rectangle) • client • undo/redo
  • 43. Example data / vaex.example() • Helmi and de Zeeuw 2000 • build-up of a MW (stellar) halo from 33 satellites • 3.3 × 10^6 particles • Almost smooth in configuration space (x,y,z) • Structure visible in E-Lz-L space
  • 48. Exploration: Automated • High dimensionality → many subspaces • E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz • Can we automate this / at least help? • Ranking subspaces using Mutual Information
  • 49. Mutual information • Measure the information loss between p(x,y) and p(x)p(y) • information loss measured using the KL divergence: $I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$ • In short: • How much does breaking up the correlation (not just linear) change the density distribution?
  • 51. Ranking by Mutual information • [Figure: panels comparing p(x,y) with p(x)p(y)] • I(X;Y) = 0.003 for the (x, y) subspace • I(X;Y) = 0.837 for the (Lz, L) subspace
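A rough way to reproduce this kind of ranking is to estimate the mutual information from a normalised 2d histogram. This is a plug-in estimate on binned data; the function name, bin count and synthetic test data are assumptions for illustration and not the procedure used to obtain the numbers on the slide.

```python
import numpy as np


def mutual_information(x, y, bins=32):
    """Plug-in estimate of I(X;Y) in nats from a 2d histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()                  # joint density p(x, y)
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = pxy > 0                               # empty cells contribute nothing
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask])))


rng = np.random.default_rng(42)
a = rng.normal(size=100_000)
b = rng.normal(size=100_000)
mi_independent = mutual_information(a, b)            # unrelated columns
mi_dependent = mutual_information(a, a + 0.1 * b)    # strongly related columns
print(mi_independent, mi_dependent)
```

Computing this value for every pair of columns and sorting gives a ranked list of subspaces, with the most structured ones first.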
  • 52. Expressions/Virtual columns • Not just visualize existing columns • mathematical operations • sqrt(x**2+y**2) • log(r) • Virtual columns • r=sqrt(x**2+y**2) • Suggest common ones • equatorial to galactic coordinates
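The idea of a virtual column can be sketched as an expression that is stored by name and only evaluated, chunk by chunk, when it is needed. The `Table` class below is a toy written for this illustration, not vaex's API or implementation, and it uses `eval`, which is only acceptable for trusted expressions.

```python
import numpy as np


class Table:
    """Toy table: real columns are arrays, virtual columns are expressions."""

    def __init__(self, **columns):
        self.columns = columns
        self.virtual = {}

    def add_virtual_column(self, name, expression):
        # Nothing is computed or stored here, only the expression string.
        self.virtual[name] = expression

    def evaluate(self, expression, start, stop):
        # Expose a few functions plus the requested slice of each real column.
        namespace = {"sqrt": np.sqrt, "log": np.log}
        for name, data in self.columns.items():
            namespace[name] = data[start:stop]
        # Virtual columns are computed on the fly for this chunk only.
        for name, expr in self.virtual.items():
            namespace[name] = eval(expr, {"__builtins__": {}}, namespace)
        return eval(expression, {"__builtins__": {}}, namespace)


table = Table(x=np.array([3.0, 0.0]), y=np.array([4.0, 1.0]))
table.add_virtual_column("r", "sqrt(x**2+y**2)")
result = table.evaluate("log(r)", 0, 2)
print(result)
```

Since only one chunk of `r` exists at a time, a virtual column costs no additional disk space or memory for the full dataset.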
  • 57. How to get the data • Gaia: ~1-2 TB • download • torrent • Do you need to? • server/client?
  • 58. Client/Server • 2d histogram example • raw data: 15 GB • binned 256x256 grid • 500 KB, or 10-100 KB compressed • Proof of concept • over http / websocket • vaex webserver [file …]
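The size argument for the client/server split can be illustrated as follows: the server bins the raw columns and only the small grid travels over the network. The synthetic data and the use of zlib are assumptions for this sketch; the transcript does not say which compression the proof of concept uses.

```python
import zlib

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

# Size of the raw columns that stay on the server.
raw_bytes = x.nbytes + y.nbytes

# The server bins the data; the 256x256 float64 grid is all the client needs.
grid, _, _ = np.histogram2d(x, y, bins=256, range=[[-5, 5], [-5, 5]])
grid_bytes = grid.nbytes
compressed = zlib.compress(grid.tobytes())

# The client restores the grid from the compressed payload.
restored = np.frombuffer(zlib.decompress(compressed), dtype=grid.dtype).reshape(grid.shape)

print(raw_bytes, grid_bytes, len(compressed))
```

The grid size is fixed by the resolution, so the payload stays the same whether the server holds a million rows or a billion.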
  • 65. Random subset / shuffled storage
  • 67. Get vaex • standalone binary (OS X, Linux) (just download and start) • www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’) • www.github.com/maartenbreddels/vaex • In Python • Vanilla Python • pip install --user --pre vaex • Anaconda • conda install -c maartenbreddels vaex
  • 73. Summary • vaex: visualisation and exploration • of large datasets, 10^6-10^9 objects • fast: > 10^9 objects/second • 1, 2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections + linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N-body)