Vaex talk-pydata-paris

1. How to interactively visualise and explore a billion objects (with vaex) Maarten Breddels & Amina Helmi PyData Paris 2016

2. Outline • Motivation • Technical • Vaex • Conclusions

3. Copyright ESA/ATG medialab; background: ESO/S. Brunier • Gaia satellite • launched by ESA in december 2013 • determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way • First catalogue in September 2016 • Final catalogue ~2020

4. Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech

5. image:Devin Powell

6. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network

7. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking

12. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking • Python • rich set of libraries, becoming the default in science

13. Situation • TOPCAT comes close, not fast enough, works with individual rows/particles, no exploratory tools, written in Java. • Your own IDL/Python code: a lot to consider to do it optimal (multicore, efﬁcient storage, efﬁcient algorithms, interactive becomes complex) • DataShader? • We want something to visualize 109 rows/objects in ~1 second • Do we need to resort to Big Data solutions?

14. Big data? • Big data • deﬁnitions vary • Practical question to ask • Do you need Big Data solutions? • Many computers / grid • Invest in software • Or • Can we do it on a single computer ‘old style’

16. Interactive? How? • Interactive? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it ﬁts/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second

17. Interactive? How? • Interactive? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it ﬁts/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second

18. • Storage: native, column based (hdf5, ﬁts) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the ﬁle system cache cache How to store and read the data

19. • Storage: native, column based (hdf5, ﬁts) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the ﬁle system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)

20. • Storage: native, column based (hdf5, ﬁts) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the ﬁle system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)

21. Translate to something visual • 1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering

24. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cfﬁ • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)

27. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cfﬁ • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist) 600 times faster

30. Great… 2d histograms… 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1

31. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1

32. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2

33. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2 • Possibilities • Total: ﬂux, mass • Mean: velocity, metallicity • Dispersions: velocity… • Basically statistics on a grid

34. Examples

35. Examples

36. Examples

37. Exploration • Selections • expressions • visual • linked views • Subspace ranking

42. Vaex: Visualization And EXploration • A library • python package • ‘import vaex’ • reading of data • multithreading • binning (1,2,3, Nd) • selections/queries • server/client • integrates with IPython notebook • A GUI program • Gives interactive navigation, zoom, pan • interactive selection (lasso, rectangle) • client • undo/redo

43. Example data / vaex.example() • Helmi and de Zeeuw 2000 • build up of a MW (stellar) halo from 33 satellites • 3.3 * 106 particles • Almost smooth in conﬁguration space (x,y,z) • Structure visible in E-Lz-L space

48. Exploration:Automated • High dimensionality → many subspaces • E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L- x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y- vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz- Lz, z-Lz • Can we automate this / at least help? • Ranking subspaces using Mutual Information

49. Mutual information • Measure the information loss between p(x,y) and p(x)p(y) • information loss measured using the KL divergence: • In short: • How much does breaking up correlation (not just linear) change density distribution?

50. Ranking by Mutual information p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L

51. Ranking by Mutual information p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L

52. Expressions/Virtual columns • Not just visualize existing columns • mathematical operations • sqrt(x**2+y**2) • log(r) • Virtual columns • r=sqrt(x**2+y**2) • Suggest common ones • equatorial to galactic coordinates

57. How to get the data • Gaia: ~1-2 TB • download • torrent • Do you need to? • server/client?

58. Client/Server • 2d histogram example • raw data:15 GB • binned 256x256 grid • 500KB, or 10-100kb compressed • Proof of concept • over http / websocket • vaex webserver [ﬁle …]

61. Random subset / shufﬂed storage

66. … dask? wendelin?

67. Get vaex • standalone binary (OS X , Linux) (just download and start) • www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’) • www.github.com/maartenbreddels/vaex • In Python • Vanilla Python • pip install —user —pre vaex • Anaconda • conda install -c maartenbreddels vaex

68. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector ﬁeld overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)

Vaex talk-pydata-paris

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Vaex talk-pydata-paris

Similar to Vaex talk-pydata-paris (20)

Recently uploaded

Recently uploaded (20)

Vaex talk-pydata-paris