With datasets becoming larger, new visualization techniques are needed. One problem is that the number of data points is so large that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few CPU cycles are available per object if one wants to process them within a second, which makes traditional methods unviable.
I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer using memory mapping of HDF5 files together with a simple binning algorithm in Python. Density fields in 1, 2 and 3d are computed, and statistics of any properties of the objects, or any mathematical operation on them, can be calculated on the fly. This enables efficient interactive exploration of large datasets, making science exploration feasible.
This idea is implemented in a Python library called vaex, which integrates well with the Jupyter/Numpy/Astropy stack. Built on top of this is the vaex application, which allows for interactive exploration and visualization.
The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues or any tabular data.
https://www.astro.rug.nl/~breddels/vaex
https://github.com/maartenbreddels/vaex
3. Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in December 2013
• determines positions, velocities and astrophysical parameters of >10^9 stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
6. Motivation
• We will soon have the Gaia catalogue
• > 10^9 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neural network
7. Motivation
• Problem
• Scatter plots do not work well for 10^9 rows/objects (like Gaia)
• Work with densities/averages in 1, 2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
12. Motivation
• Python
• rich set of libraries, becoming the default in science
13. Situation
• TOPCAT comes close, but is not fast enough, works with individual rows/particles, has no exploratory tools, and is written in Java.
• Your own IDL/Python code: a lot to consider to do it optimally (multicore, efficient storage, efficient algorithms, interactivity becomes complex)
• DataShader?
• We want something to visualize 10^9 rows/objects in ~1 second
• Do we need to resort to Big Data solutions?
14. Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’?
16. Interactive? How?
• Interactive?
• 10^9 * 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 GHz (but multicore, say 4): 12 * 10^9 cycles/second, i.e. ~12 cycles per row/object
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise ssd/hdd speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
• Aquarius A-2: pure dark matter N-body simulation of Milky Way like Halo
• 6 * 10^8 particles
• < 1 second
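The budget on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers (10^9 rows, two double-precision columns, 10-20 GiB/s memory bandwidth, 4 cores at 3 GHz); the variable names are illustrative and not part of vaex.

```python
# Back-of-envelope budget for binning 10^9 rows in two columns.
n_rows = 10**9
n_columns = 2          # e.g. x and y for a 2d density
bytes_per_value = 8    # double precision

data_bytes = n_rows * n_columns * bytes_per_value
data_gib = data_bytes / 2**30
print(f"data size: {data_gib:.1f} GiB")

# Time needed just to stream the data through memory once.
for bandwidth_gib_s in (10, 20):
    print(f"at {bandwidth_gib_s} GiB/s: {data_gib / bandwidth_gib_s:.2f} s")

# CPU budget: total cycles per second divided by the number of rows.
clock_hz = 3e9
cores = 4
cycles_per_row = clock_hz * cores / n_rows
print(f"cycles available per row in one second: {cycles_per_row:.0f}")
```

With roughly a dozen cycles per row, only a very simple per-row operation such as incrementing a histogram bin fits in the budget.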
18. How to store and read the data
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB of data requires 30 GB if you want to use the file system cache
19. How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no setup (apart from the kernel setting up the pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.75-1.5 seconds, at 10-20 GB/s
• Can be 2-3x slower (CPU cache helps a bit)
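The difference between the two access methods can be sketched with NumPy. This is an illustration under assumptions, not vaex's reader: a small raw binary file of doubles stands in for a contiguous, uncompressed column, and the file name is made up. `numpy.fromfile` allocates an array and copies the file into it, while `numpy.memmap` exposes the pages of the OS cache directly.

```python
import os
import tempfile

import numpy as np

# Write a small column of doubles to a raw binary file (stand-in for a
# contiguous column inside a larger file; the path is hypothetical).
n = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "column_x.f8")
np.arange(n, dtype=np.float64).tofile(path)

# "Normal" read: allocate memory and copy the file contents into it.
copied = np.fromfile(path, dtype=np.float64)

# Memory mapping: no explicit allocation or copy; pages are faulted in
# from the OS cache on first access.
mapped = np.memmap(path, dtype=np.float64, mode="r")

# Both behave like ordinary arrays for computation.
print(float(copied.sum()), float(mapped.sum()))
```

The memory-mapped array does not need a second copy of the data, which leaves more RAM available for the file system cache.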
24. How to get good performance
• Pure Python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, C performance
• (nogil didn’t exist)
600 times faster
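The "simple and fast algorithm for binning" from the earlier slides comes down to one multiply-add per coordinate to find the bin index, followed by an increment of that bin. A minimal NumPy sketch of this idea follows. It is not vaex's implementation and makes no claim about the speed-up quoted above; the function name `bin2d` and its parameters are assumptions made for illustration.

```python
import numpy as np

def bin2d(x, y, limits, shape=(256, 256)):
    """Count points on a regular 2d grid; points outside the limits are dropped."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    # Scale each coordinate to a fractional bin position and truncate.
    ix = np.floor((x - xmin) / (xmax - xmin) * nx).astype(np.int64)
    iy = np.floor((y - ymin) / (ymax - ymin) * ny).astype(np.int64)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    # Flatten the 2d index so a single bincount does all the increments.
    flat = ix[inside] * ny + iy[inside]
    return np.bincount(flat, minlength=nx * ny).reshape(nx, ny)

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
grid = bin2d(x, y, limits=((-5, 5), (-5, 5)), shape=(64, 64))
print(grid.shape, grid.sum())
```

The same loop body (compute index, increment bin) is what a C extension or a Numba-compiled function would run per row, without the temporary arrays that NumPy creates here.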
42. Vaex: Visualization And EXploration
• A library
• Python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom, pan
• interactive selection (lasso, rectangle)
• client
• undo/redo
43. Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build-up of a MW (stellar) halo from 33 satellites
• 3.3 * 10^6 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
48. Exploration: Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
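The list of 2d subspaces above is every unordered pair of the nine columns. A short sketch with the standard library reproduces it; the column order is taken from the slide.

```python
from itertools import combinations

# The nine columns of the example dataset, in the order used on the slide.
columns = ["E", "L", "y", "x", "vx", "vy", "vz", "z", "Lz"]

# All unordered pairs: n * (n - 1) / 2 two-dimensional subspaces.
subspaces = list(combinations(columns, 2))
print(len(subspaces))
print(", ".join(f"{a}-{b}" for a, b in subspaces))
```

The number of pairs grows quadratically with the number of columns, which is why an automated ranking helps.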
49. Mutual information
• Measure the information loss between p(x,y) and p(x)p(y)
• information loss measured using the KL divergence: I(X;Y) = ∫∫ p(x,y) log[ p(x,y) / (p(x) p(y)) ] dx dy
• In short:
• How much does breaking up the correlation (not just linear) change the density distribution?
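On a binned density, the mutual information becomes a sum over grid cells. The sketch below estimates it from a 2d grid of counts. The function name is hypothetical, the natural logarithm is an assumption (the slides do not state the units), and this is not necessarily how vaex computes it.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in nats) of a 2d grid of counts."""
    pxy = counts / counts.sum()            # joint distribution p(x,y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y)
    product = px * py                      # p(x)p(y), same shape as pxy
    mask = pxy > 0                         # empty cells contribute nothing
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / product[mask])))

# Independent variables: the joint equals the product of the marginals.
independent = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
# Perfectly correlated variables: all counts on the diagonal.
correlated = np.diag([1.0, 1.0])

print(mutual_information(independent))
print(mutual_information(correlated))
```

The independent grid gives a value of essentially zero and the diagonal grid gives log(2), so sorting subspaces by this number puts the most structured ones first.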
50. Ranking by Mutual information
• Panels per subspace: p(x,y), p(x)p(y) and the resulting I(X;Y)
• x,y: I(X;Y) = 0.003
• Lz,L: I(X;Y) = 0.837
52. Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic coordinates
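The idea of a virtual column is that only the expression is stored, and values are computed when a chunk of rows is needed. The toy class below illustrates this with NumPy; `Table`, `add_virtual_column` and `evaluate` as written here are a hypothetical sketch, not vaex's API or implementation.

```python
import numpy as np

class Table:
    """Toy table: real columns are arrays, virtual columns are expressions."""

    def __init__(self, columns):
        self.columns = dict(columns)
        self.virtual = {}  # name -> expression string, no data stored

    def add_virtual_column(self, name, expression):
        self.virtual[name] = expression

    def evaluate(self, expression, start=0, stop=None):
        # Names visible to expressions: a few functions plus column slices.
        namespace = {"sqrt": np.sqrt, "log": np.log}
        for name, data in self.columns.items():
            namespace[name] = data[start:stop]
        # Virtual columns are computed on demand, in definition order.
        # eval is used for brevity; do not use it on untrusted input.
        for name, expr in self.virtual.items():
            namespace[name] = eval(expr, {"__builtins__": {}}, namespace)
        return eval(expression, {"__builtins__": {}}, namespace)

t = Table({"x": np.array([3.0, 6.0]), "y": np.array([4.0, 8.0])})
t.add_virtual_column("r", "sqrt(x**2+y**2)")
print(t.evaluate("r"))
print(t.evaluate("log(r)"))
```

Because evaluation takes a row range, a virtual column never has to exist in memory for all rows at once.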
57. How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
58. Client/Server
• 2d histogram example
• raw data: 15 GB
• binned 256x256 grid
• 500 KB, or 10-100 KB compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
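The sizes in the 2d histogram example follow from the grid shape: 256 x 256 cells of 8 bytes each is 512 KiB, regardless of how many rows were binned. The sketch below checks this and compresses the grid with zlib. The random test data and the choice of zlib are assumptions for illustration; the slides do not say which compression the server uses.

```python
import zlib

import numpy as np

# Bin some synthetic points on a 256x256 grid (stand-in for the server side).
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
y = rng.normal(size=200_000)
counts, _, _ = np.histogram2d(x, y, bins=256, range=((-4, 4), (-4, 4)))

# What would be sent to the client: the grid, not the rows.
raw = counts.astype(np.float64).tobytes()
compressed = zlib.compress(raw)
print(len(raw), len(compressed))

# The client can restore the grid exactly.
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float64).reshape(256, 256)
```

Only the binned grid crosses the network, so the transfer size depends on the resolution of the plot and not on the size of the catalogue.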
67. Get vaex
• standalone binary (OS X, Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install --user --pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
68. Summary
• vaex: visualisation and exploration
• of large datasets 10^6-10^9 objects
• fast: > 10^9 objects/second
• 1, 2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)