SlideShare a Scribd company logo
A Billion stars in the Jupyter notebook
Maarten Breddels
Groningen - 2017
@MaartenBreddels
Outline
• Motivation/Problem
• Solution (vaex)
• Demo
• vaex
• ipwidgets / ipyvolume
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in december 2013
• determine for >109
stars/objects in our
Milky Way
• positions
• velocities
• astrophysical parameters
• First catalogue (DR1) is out
• sky positions, G magnitude
Motivation
Motivation• We have Gaia DR1, soon DR2
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data/Quality checks
• Science/Discovery
• Trends, relations, clustering
• You are the (biological) neutral network
Motivation• We have Gaia DR1, soon DR2
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data/Quality checks
• Science/Discovery
• Trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects
• Work with densities/statistics in 0,1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore: selections/queries
Motivation• We have Gaia DR1, soon DR2
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data/Quality checks
• Science/Discovery
• Trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects
• Work with densities/statistics in 0,1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore: selections/queries
Motivation• We have Gaia DR1, soon DR2
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data/Quality checks
• Science/Discovery
• Trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects
• Work with densities/statistics in 0,1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore: selections/queries
How fast can it be processed?
• What can be done?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds)
• proper storage and reading of data (no .csv!)
• simple and fast algorithm for binning
• ~1 second
How fast can it be processed?
• What can be done?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds)
• proper storage and reading of data (no .csv!)
• simple and fast algorithm for binning
• ~1 second
How fast can it be processed?
• What can be done?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds)
• proper storage and reading of data (no .csv!)
• simple and fast algorithm for binning
• ~1 second
How to store and read the data
cache
• Storage: native, column based (hdf5)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory
• Wastes memory/cache
• 15 GB data , requires 30 GB is you want to use the file system cache
How to store and read the data
cache
• Storage: native, column based (hdf5)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory
• Wastes memory/cache
• 15 GB data , requires 30 GB is you want to use the file system cache
How to store and read the data
cache
• Storage: native, column based (hdf5)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory
• Wastes memory/cache
• 15 GB data , requires 30 GB is you want to use the file system cache
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about ~1.0 second, at 10-20
GB/s
• Can be 2-3x slower (cpu cache helps a bit)
2d
1d
2d
1d
3d
2d
1d
3d
0d 330,000 rows
2d
1d
3d
0d 330,000 rows
2d
1d
3d
0d 330,000 rows
2d
1d
3d
0d 330,000 rows mean: -0.083
2d
1d
3d
0d 330,000 rows mean: -0.083
vaex
• Python library (conda/pip installable)
• Think pandas-like but for large datasets
• simple api: 1 class (Dataset), the rest are strings/dicts/tuples/lists/ndarray
• From/to pandas dataframes, plain numpy arrays, astropy tables
• Focusses mostly on statistics on N-d grids (count/mean/max/std/…)
• >1 billion rows / sec on a `decent` desktop (quad core 3Gz)
• >50x faster than scipy.stats.binned_statistic_2d
• 0.1-0.2 billion rows on this 2yr old Macbook Air
• CPU bound
• Does visualisation / matplotlib / bqplot / ipyvolume / ipyleaflet
For what?
• Astronomical catalogues
• Gaia > 1 billion
• SDSS > 150 million
• PANSTARRS > 1 billion
• LSST > 10 billion
• N-body simulations (0.1-100 billion)
• columns: x,y,z,vx,vy,vz
• SPH: density, temperature, metallicity…
• rows: particle
• Any tabular data
• rows and columns
For what?
• Astronomical catalogues
• Gaia > 1 billion
• SDSS > 150 million
• PANSTARRS > 1 billion
• LSST > 10 billion
• N-body simulations (0.1-100 billion)
• columns: x,y,z,vx,vy,vz
• SPH: density, temperature, metallicity…
• rows: particle
• Any tabular data
• rows and columns
For what?
• Astronomical catalogues
• Gaia > 1 billion
• SDSS > 150 million
• PANSTARRS > 1 billion
• LSST > 10 billion
• N-body simulations (0.1-100 billion)
• columns: x,y,z,vx,vy,vz
• SPH: density, temperature, metallicity…
• rows: particle
• Any tabular data
• rows and columns
For what?
• Astronomical catalogues
• Gaia > 1 billion
• SDSS > 150 million
• PANSTARRS > 1 billion
• LSST > 10 billion
• N-body simulations (0.1-100 billion)
• columns: x,y,z,vx,vy,vz
• SPH: density, temperature, metallicity…
• rows: particle
• Any tabular data
• rows and columns
Transformation of the data
• Not ‘z = x + y’
• for 1 billion rows: 8GB of RAM wasted
• Everything is an expression: dataset.mean(‘x + y’)
• Computed in chunks not to waste RAM
• Virtual columns:
• dataset.add_virtual_column(‘r’, ‘sqrt(x**2+y**2)’)
• dataset.mean(‘r’)
Selections/subsets
• You don’t use all data / filtering
• Efficient subsets (do NOT copy the data)
• vaex
• Selections result in a boolean mask
• dataset.select(‘x > 0’, name=‘xpos’)
• Use in any statistic
• dataset.mean(‘x + y’, selection=‘xpos’)
Demo library
Conclusions
• vaex
• N-d statistics on regular grids
• ~1 billion rows/sec
• efficient transformations/selections
• regular grids can be used for visualisation
• Interactive in the Notebook due to ipywidgets/bqplot/ipyleaflet
• ipyvolume
• adds interactive 3d plotting to the notebook, (as an ipywidget)
• can be used as backend for vaex
• works outside the notebook, in the browser
• tablet (paperless office)
• sharing, outreach, press release material
Questions
• maartenbreddels@gmail.com
• twitter @maartenbreddels
• vaex
• http://vaex.astro.rug.nl
• https://github.com/maartenbreddels/vaex
• pip install —pre vaex / conda install -c conda-forge vaex
• ipyvolume
• https://ipywidgets.readthedocs.io
• https://github.com/maartenbreddels/ipyvolume
• pip install ipyvolume / conda install -c conda-forge ipyvolume
• if pip: jupyter nbextension enable --py --user ipyvolume (but really, use anaconda)
• Notebook:
• https://github.com/maartenbreddels/pygrunn-2017

More Related Content

What's hot

Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Jihoon Son
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
Tyler Treat
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
Bill GU
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
Jimmy Angelakos
 
Performance evaluation of apache tajo
Performance evaluation of apache tajoPerformance evaluation of apache tajo
Performance evaluation of apache tajo
Jihoon Son
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
Eugene Dvorkin
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
Frens Jan Rumph
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Jimmy Angelakos
 
MongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: ShardingMongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: Sharding
MongoDB
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
Bahadir Cambel
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
DataPlato, Crossing the line
 
Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
robertlz
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
Duyhai Doan
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
Hakan Ilter
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
Cloudera, Inc.
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
Nathan Bijnens
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
Maria Stylianou
 
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
HappyDev
 

What's hot (20)

Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
 
Performance evaluation of apache tajo
Performance evaluation of apache tajoPerformance evaluation of apache tajo
Performance evaluation of apache tajo
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
MongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: ShardingMongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: Sharding
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
2015-12-05 Александр Коротков, Иван Панченко - Слабо-структурированные данные...
 

Similar to Vaex pygrunn

Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
Maarten Breddels
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
Ali-ziane Myriam
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
Csaba Toth
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
Csaba Toth
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
Guerrilla
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
bodaceacat
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
Sara-Jayne Terp
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Cloudera, Inc.
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
Hassan Islamov
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
huguk
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
"Mikhail "Misha"" Dmitriev
 
PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018
Quentin Adam
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
Joe Alex
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
srisatish ambati
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHP
Andrew Goodwin
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Alexey Zinoviev
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Ontico
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
Shyam Raj
 

Similar to Vaex pygrunn (20)

Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
 
PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHP
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 

Recently uploaded

GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills MN
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
PirithiRaju
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
PirithiRaju
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
RDhivya6
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 

Recently uploaded (20)

GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 

Vaex pygrunn

  • 1. A Billion stars in the Jupyter notebook Maarten Breddels Groningen - 2017 @MaartenBreddels
  • 2. Outline • Motivation/Problem • Solution (vaex) • Demo • vaex • ipwidgets / ipyvolume
  • 3. Copyright ESA/ATG medialab; background: ESO/S. Brunier • Gaia satellite • launched by ESA in december 2013 • determine for >109 stars/objects in our Milky Way • positions • velocities • astrophysical parameters • First catalogue (DR1) is out • sky positions, G magnitude Motivation
  • 4. Motivation• We have Gaia DR1, soon DR2 • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data/Quality checks • Science/Discovery • Trends, relations, clustering • You are the (biological) neutral network
  • 5. Motivation• We have Gaia DR1, soon DR2 • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data/Quality checks • Science/Discovery • Trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 109 rows/objects • Work with densities/statistics in 0,1,2 and 3d • Interactive? • Zoom, pan etc • Explore: selections/queries
  • 6. Motivation• We have Gaia DR1, soon DR2 • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data/Quality checks • Science/Discovery • Trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 109 rows/objects • Work with densities/statistics in 0,1,2 and 3d • Interactive? • Zoom, pan etc • Explore: selections/queries
  • 7. Motivation• We have Gaia DR1, soon DR2 • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data/Quality checks • Science/Discovery • Trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 109 rows/objects • Work with densities/statistics in 0,1,2 and 3d • Interactive? • Zoom, pan etc • Explore: selections/queries
  • 8. How fast can it be processed? • What can be done? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds) • proper storage and reading of data (no .csv!) • simple and fast algorithm for binning • ~1 second
  • 9. How fast can it be processed? • What can be done? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds) • proper storage and reading of data (no .csv!) • simple and fast algorithm for binning • ~1 second
  • 10. How fast can it be processed? • What can be done? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4-8): 12-24 cycles/second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise ssd/hdd speeds (10-100 seconds) • proper storage and reading of data (no .csv!) • simple and fast algorithm for binning • ~1 second
  • 11. How to store and read the data cache • Storage: native, column based (hdf5) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory • Wastes memory/cache • 15 GB data , requires 30 GB is you want to use the file system cache
  • 12. How to store and read the data cache • Storage: native, column based (hdf5) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory • Wastes memory/cache • 15 GB data , requires 30 GB is you want to use the file system cache
  • 13. How to store and read the data cache • Storage: native, column based (hdf5) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory • Wastes memory/cache • 15 GB data , requires 30 GB is you want to use the file system cache • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about ~1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
  • 14.
  • 15. 2d 1d
  • 20. 2d 1d 3d 0d 330,000 rows mean: -0.083
  • 21. 2d 1d 3d 0d 330,000 rows mean: -0.083
  • 22. vaex • Python library (conda/pip installable) • Think pandas-like but for large datasets • simple api: 1 class (Dataset), the rest are strings/dicts/tuples/lists/ndarray • From/to pandas dataframes, plain numpy arrays, astropy tables • Focusses mostly on statistics on N-d grids (count/mean/max/std/…) • >1 billion rows / sec on a `decent` desktop (quad core 3Gz) • >50x faster than scipy.stats.binned_statistic_2d • 0.1-0.2 billion rows on this 2yr old Macbook Air • CPU bound • Does visualisation / matplotlib / bqplot / ipyvolume / ipyleaflet
  • 23. For what? • Astronomical catalogues • Gaia > 1 billion • SDSS > 150 million • PANSTARRS > 1 billion • LSST > 10 billion • N-body simulations (0.1-100 billion) • columns: x,y,z,vx,vy,vz • SPH: density, temperature, metallicity… • rows: particle • Any tabular data • rows and columns
  • 24. For what? • Astronomical catalogues • Gaia > 1 billion • SDSS > 150 million • PANSTARRS > 1 billion • LSST > 10 billion • N-body simulations (0.1-100 billion) • columns: x,y,z,vx,vy,vz • SPH: density, temperature, metallicity… • rows: particle • Any tabular data • rows and columns
  • 25. For what? • Astronomical catalogues • Gaia > 1 billion • SDSS > 150 million • PANSTARRS > 1 billion • LSST > 10 billion • N-body simulations (0.1-100 billion) • columns: x,y,z,vx,vy,vz • SPH: density, temperature, metallicity… • rows: particle • Any tabular data • rows and columns
  • 26. For what? • Astronomical catalogues • Gaia > 1 billion • SDSS > 150 million • PANSTARRS > 1 billion • LSST > 10 billion • N-body simulations (0.1-100 billion) • columns: x,y,z,vx,vy,vz • SPH: density, temperature, metallicity… • rows: particle • Any tabular data • rows and columns
  • 27. Transformation of the data • Not ‘z = x + y’ • for 1 billion rows: 8GB of RAM wasted • Everything is an expression: dataset.mean(‘x + y’) • Computed in chunks not to waste RAM • Virtual columns: • dataset.add_virtual_column(‘r’, ‘sqrt(x**2+y**2)’) • dataset.mean(‘r’)
  • 28. Selections/subsets • You don’t use all data / filtering • Efficient subsets (do NOT copy the data) • vaex • Selections result in a boolean mask • dataset.select(‘x > 0’, name=‘xpos’) • Use in any statistic • dataset.mean(‘x + y’, selection=‘xpos’)
  • 30. Conclusions • vaex • N-d statistics on regular grids • ~1 billion rows/sec • efficient transformations/selections • regular grids can be used for visualisation • Interactive in the Notebook due to ipywidgets/bqplot/ipyleaflet • ipyvolume • adds interactive 3d plotting to the notebook, (as an ipywidget) • can be used as backend for vaex • works outside the notebook, in the browser • tablet (paperless office) • sharing, outreach, press release material
  • 32. • maartenbreddels@gmail.com • twitter @maartenbreddels • vaex • http://vaex.astro.rug.nl • https://github.com/maartenbreddels/vaex • pip install —pre vaex / conda install -c conda-forge vaex • ipyvolume • https://ipywidgets.readthedocs.io • https://github.com/maartenbreddels/ipyvolume • pip install ipyvolume / conda install -c conda-forge ipyvolume • if pip: jupyter nbextension enable --py --user ipyvolume (but really, use anaconda) • Notebook: • https://github.com/maartenbreddels/pygrunn-2017