How to interactively visualise and
explore a billion objects (with vaex)
Maarten Breddels
&
Amina Helmi
PyData Paris 2016
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in december 2013
• determines positions, velocities and
astrophysical parameters of >109
stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech
image:Devin Powell
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
• Python
• rich set of libraries, becoming the default in
science
Situation
• TOPCAT comes close, not fast enough, works with
individual rows/particles, no exploratory tools,
written in Java.
• Your own IDL/Python code: a lot to consider to do it
optimal (multicore, efficient storage, efficient
algorithms, interactive becomes complex)
• DataShader?
• We want something to visualize 109 rows/objects in
~1 second
• Do we need to resort to Big Data solutions?
Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
600 times faster
Great… 2d histograms…
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
2 1 4 .. 8 3v
+v, or v2
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
2 1 4 .. 8 3v
+v, or v2
• Possibilities
• Total: flux, mass
• Mean: velocity, metallicity
• Dispersions: velocity…
• Basically statistics on a grid
Examples
Examples
Examples
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom,
pan
• interactive selection (lasso,
rectangle)
• client
• undo/redo
Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build up of a MW (stellar) halo from 33 satellites
• 3.3 * 106 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
Exploration:Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-
x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-
vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy,
vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-
Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
Mutual information
• Measure the information loss between
p(x,y) and p(x)p(y)
• information loss measured using the KL
divergence:
• In short:
• How much does breaking up correlation
(not just linear) change density
distribution?
Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic
coordinates
How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
Client/Server
• 2d histogram example
• raw data:15 GB
• binned 256x256 grid
• 500KB, or 10-100kb compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
…
dask?
wendelin?
Get vaex
• standalone binary (OS X , Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex
visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install —user —pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Vaex talk-pydata-paris

Vaex talk-pydata-paris

  • 1.
    How to interactivelyvisualise and explore a billion objects (with vaex) Maarten Breddels & Amina Helmi PyData Paris 2016
  • 2.
  • 3.
    Copyright ESA/ATG medialab;background: ESO/S. Brunier • Gaia satellite • launched by ESA in december 2013 • determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way • First catalogue in September 2016 • Final catalogue ~2020
  • 4.
  • 5.
  • 6.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network
  • 7.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 8.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 9.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 10.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 11.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 12.
    Motivation • We willsoon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking • Python • rich set of libraries, becoming the default in science
  • 13.
    Situation • TOPCAT comesclose, not fast enough, works with individual rows/particles, no exploratory tools, written in Java. • Your own IDL/Python code: a lot to consider to do it optimal (multicore, efficient storage, efficient algorithms, interactive becomes complex) • DataShader? • We want something to visualize 109 rows/objects in ~1 second • Do we need to resort to Big Data solutions?
  • 14.
    Big data? • Bigdata • definitions vary • Practical question to ask • Do you need Big Data solutions? • Many computers / grid • Invest in software • Or • Can we do it on a single computer ‘old style’
  • 15.
  • 16.
    Interactive? How? • Interactive? •10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second
  • 17.
    Interactive? How? • Interactive? •10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second
  • 18.
    • Storage: native,column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data
  • 19.
    • Storage: native,column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
  • 20.
    • Storage: native,column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
  • 21.
    Translate to something visual •1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 22.
    Translate to something visual •1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 23.
    Translate to something visual •1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 24.
    How to getgood performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 25.
    How to getgood performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 26.
    How to getgood performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 27.
    How to getgood performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist) 600 times faster
  • 30.
    Great… 2d histograms… 132 4 .. 9 11x 4 7 41 .. 91 61 y +1
  • 31.
    Great… 2d histograms… •There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1
  • 32.
    Great… 2d histograms… •There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2
  • 33.
    Great… 2d histograms… •There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2 • Possibilities • Total: flux, mass • Mean: velocity, metallicity • Dispersions: velocity… • Basically statistics on a grid
  • 34.
  • 35.
  • 36.
  • 37.
    Exploration • Selections • expressions •visual • linked views • Subspace ranking
  • 38.
    Exploration • Selections • expressions •visual • linked views • Subspace ranking
  • 39.
    Exploration • Selections • expressions •visual • linked views • Subspace ranking
  • 40.
    Exploration • Selections • expressions •visual • linked views • Subspace ranking
  • 41.
  • 42.
    Vaex: Visualization AndEXploration • A library • python package • ‘import vaex’ • reading of data • multithreading • binning (1,2,3, Nd) • selections/queries • server/client • integrates with IPython notebook • A GUI program • Gives interactive navigation, zoom, pan • interactive selection (lasso, rectangle) • client • undo/redo
  • 43.
    Example data /vaex.example() • Helmi and de Zeeuw 2000 • build up of a MW (stellar) halo from 33 satellites • 3.3 * 106 particles • Almost smooth in configuration space (x,y,z) • Structure visible in E-Lz-L space
  • 48.
    Exploration:Automated • High dimensionality→ many subspaces • E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L- x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y- vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz- Lz, z-Lz • Can we automate this / at least help? • Ranking subspaces using Mutual Information
  • 49.
    Mutual information • Measurethe information loss between p(x,y) and p(x)p(y) • information loss measured using the KL divergence: • In short: • How much does breaking up correlation (not just linear) change density distribution?
  • 50.
    Ranking by Mutualinformation p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L
  • 51.
    Ranking by Mutualinformation p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L
  • 52.
    Expressions/Virtual columns • Notjust visualize existing columns • mathematical operations • sqrt(x**2+y**2) • log(r) • Virtual columns • r=sqrt(x**2+y**2) • Suggest common ones • equatorial to galactic coordinates
  • 57.
    How to getthe data • Gaia: ~1-2 TB • download • torrent • Do you need to? • server/client?
  • 58.
    Client/Server • 2d histogramexample • raw data:15 GB • binned 256x256 grid • 500KB, or 10-100kb compressed • Proof of concept • over http / websocket • vaex webserver [file …]
  • 61.
    Random subset /shuffled storage
  • 62.
    Random subset /shuffled storage
  • 63.
    Random subset /shuffled storage
  • 64.
    Random subset /shuffled storage
  • 65.
    Random subset /shuffled storage
  • 66.
  • 67.
    Get vaex • standalonebinary (OS X , Linux) (just download and start) • www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’) • www.github.com/maartenbreddels/vaex • In Python • Vanilla Python • pip install —user —pre vaex • Anaconda • conda install -c maartenbreddels vaex
  • 68.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 69.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 70.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 71.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 72.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 73.
    Summary • vaex: visualisationand exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)