With datasets becoming larger, new visualizations techniques are needed. One problem is that the number of datapoints is so large, that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few cpu cycles are available per object if one wants to process them within a second, making traditional methods not viable.
I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer using memory mapping of hdf5 files together with a simple binning algorithm in Python. Density fields in 1, 2 and 3d are computed and statistics of any properties of the objects, or any mathematical operation on them can be done on the fly. This enables efficient exploration or large datasets interactively, making science exploration feasible.
This idea is implemented in a Python library called vaex, which integrates well in the Jupyter/Numpy/Astropy stack. Build on top of this is the vaex application, which allows for interactive exploration and visualization.
The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues or any tabular data.
https://www.astro.rug.nl/~breddels/vaex
https://github.com/maartenbreddels/vaex
This document discusses machine learning techniques for recommendations and clustering. It introduces recommendation algorithms that analyze user-item interaction data to find items users who interacted with one item also interacted with another. It also discusses techniques for fast, scalable clustering of large datasets including using a surrogate to quickly cluster data before applying a higher quality algorithm to cluster centroids. The document emphasizes that simple techniques like logging, counting and session analysis often work best at large scale and provides examples of using recommendations for queries, videos and music.
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted at Youtube https://www.youtube.com/watch?v=oTOgaMZkBKQ
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Databricks
In the past years, the increasing computational power has made possible larger scientific experiments that have high computing demands, such as brain tissue simulations. In general, larger simulations imply generating larger amounts of data that need to be then analyzed by neuroscientists. Currently, simulation reports are analyzed by neuroscientists with the help of Python scripts, thanks to its programming simplicity and its performance of the NumPy library.
However, this analysis workflow will become unfeasible in the near future, as we foresee a 10x increase of the dataset size in the next year. Therefore, we are exploring how to accelerate data analysis of brain activity simulations with big data technologies, like Spark. In this talk, we will present how we address this challenge: from building RDDs/DataFrames from custom binary files to data queries and transformations to achieve the desired scientific analyses. In order to reach our goals, we have implemented our workflow in five different ways, combining RDDs, DataFrames, different data structures and representations and different data partitioning.
After significant engineering and programming efforts, we would like to share with the community our lessons learned: how Spark features can leverage data analysis in our neuroscience research area and what type of decisions can negatively impact performance. Moreover, we would also like to open a discussion with some critical limitations we have found in Spark applied to our use cases, and how to address them in the future as a joint community effort. In brief, as takeaway messages, we will highlight the suitability of Spark for our data analysis, how data generation can highly impact subsequent data analysis and how the decision of data types and formats can have a significant impact in Spark performance. We will present our experiments run on Cooley, the Argonne National Laboratory (ANL) data analysis cluster.
This document summarizes a presentation given by Diane Mueller from ActiveState and Dr. Mike Müller from Python Academy. It compares MATLAB and Python capabilities for scientific computing. Python has many libraries like NumPy, SciPy, IPython and matplotlib that provide similar functionality to MATLAB. Together these are often called "Pylab". The presentation provides an overview of Python, NumPy arrays, visualization with matplotlib, and integrating Python with other languages.
A Jeff Young presentation for co-asis&t on the emerging topic of Linked Data and how libraries of all kinds can extend their interaction with the Web.
For more information visit our website:http://www.asis.org/Chapters/coasis
This document discusses machine learning techniques for recommendations and clustering. It introduces recommendation algorithms that analyze user-item interaction data to find items users who interacted with one item also interacted with another. It also discusses techniques for fast, scalable clustering of large datasets including using a surrogate to quickly cluster data before applying a higher quality algorithm to cluster centroids. The document emphasizes that simple techniques like logging, counting and session analysis often work best at large scale and provides examples of using recommendations for queries, videos and music.
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted at Youtube https://www.youtube.com/watch?v=oTOgaMZkBKQ
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Databricks
In the past years, the increasing computational power has made possible larger scientific experiments that have high computing demands, such as brain tissue simulations. In general, larger simulations imply generating larger amounts of data that need to be then analyzed by neuroscientists. Currently, simulation reports are analyzed by neuroscientists with the help of Python scripts, thanks to its programming simplicity and its performance of the NumPy library.
However, this analysis workflow will become unfeasible in the near future, as we foresee a 10x increase of the dataset size in the next year. Therefore, we are exploring how to accelerate data analysis of brain activity simulations with big data technologies, like Spark. In this talk, we will present how we address this challenge: from building RDDs/DataFrames from custom binary files to data queries and transformations to achieve the desired scientific analyses. In order to reach our goals, we have implemented our workflow in five different ways, combining RDDs, DataFrames, different data structures and representations and different data partitioning.
After significant engineering and programming efforts, we would like to share with the community our lessons learned: how Spark features can leverage data analysis in our neuroscience research area and what type of decisions can negatively impact performance. Moreover, we would also like to open a discussion with some critical limitations we have found in Spark applied to our use cases, and how to address them in the future as a joint community effort. In brief, as takeaway messages, we will highlight the suitability of Spark for our data analysis, how data generation can highly impact subsequent data analysis and how the decision of data types and formats can have a significant impact in Spark performance. We will present our experiments run on Cooley, the Argonne National Laboratory (ANL) data analysis cluster.
This document summarizes a presentation given by Diane Mueller from ActiveState and Dr. Mike Müller from Python Academy. It compares MATLAB and Python capabilities for scientific computing. Python has many libraries like NumPy, SciPy, IPython and matplotlib that provide similar functionality to MATLAB. Together these are often called "Pylab". The presentation provides an overview of Python, NumPy arrays, visualization with matplotlib, and integrating Python with other languages.
A Jeff Young presentation for co-asis&t on the emerging topic of Linked Data and how libraries of all kinds can extend their interaction with the Web.
For more information visit our website:http://www.asis.org/Chapters/coasis
Python Ireland Feb '11 Talks: Introduction to PythonPython Ireland
"Introduction to Python" by Sean O'Donnell
Level: Beginner
Abstract:
The content is an introduction to python, by means of comparing it to C, Java and Ruby.
Video:
http://vimeo.com/groups/pythonireland/videos/20239008
Thanks to all who came along. Approx. 30 people turned up.
Thanks to Science Gallery for being our host.
sciencegallery.com/
Kez created a story board. The story board contained visual representations and notes used to plan out scenes, shots, progression and elements of a story or video project. The document refers to a story board but provides no other details about its content or purpose.
This document provides a summary of key concepts from a Python programming tutorial, including strings, lists, tuples, and maps (dictionaries). It defines each concept and provides examples of how to create, print, add/remove elements from each data type. It also includes exercises asking the reader to apply the concepts by creating and manipulating variables of each data type. The exercises test creating and modifying strings, lists, tuples, and dictionaries with different operations like printing, adding, removing elements.
This document provides an overview of iterators, generators, and the Python Imaging Library (PIL) basics in Python. It discusses how iterators allow iteration over objects using the iterator protocol with __iter__ and next() methods. Generators are lazy functions that yield values and can be created via generator factories or comprehensions. The PIL allows loading, saving, and modifying image files in Python through functions like Image.new(), getpixel(), putpixel(), and ImageDraw for drawing primitives.
Practical idioms in Python - Pycon india 2014 proposalPratik M
This document discusses common Python idioms and patterns including iterators, comprehensions, lazy evaluation, decorators, context managers, object-oriented programming, and functional programming concepts like map-reduce. It explores how to use decorators to build caches and multi-threaded systems, how context managers work with the __enter__ and __exit__ methods, and how functional idioms can be applied in Python for tasks like analytics.
Python presentation with examples available on github. Main concepts can be found by exploring the small and concise code samples available.
It presents some of Python's capabilities and characteristics. The code samples (end of June 2014) are tested with latest Python 2.7 but should be easily convertable to Python 3.
LESSON 2. CONDITIONAL LOGIC, IF ELSE STATEMENTS, SELECTION, DEBUGGING
Introduction to, with examples, conditional logic and the use of IF and ELSE statements. Look at SELECTION in game design. Learn about Debugging and Error Checking. Analyse the use of a flow chart and how to design before implementation. Discuss: Video gaming addiction! Create a password checker and a username and password (login) app. Learn about the use of ELIF. Learn about Boolean variables and their use. Learn about Multiple comparisons using and/or. Includes a suggested videos, ‘Big ideas’ discussion, and HW/research projects section. Discussion on Artificial Intelligence and Robotics.
Dear readers, these Python Programming Language Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Python Programming Language. As per my experience good interviewers hardly plan to ask any particular question during your interview, normally questions start with some basic concept of the subject and later they continue based on further discussion and what you answer −
LESSON 3A. INTRODUCTION TO ITERATION: LOOPS, TRACE TABLES, WHILE LOOPS
Introduction to Iteration and loops. The theory behind loops and how they work. Create and adapt programs using loops. Intro to the random number generator. Learn about trace tabling (white box testing). Example of a trace table and dry run. Wonders of the Fibonacci sequence. Examples of Iteration in game design. Focus on While loops. Challenges, tasks (with solutions), suggested videos, big ideas discussion and research and HW included. Introducing Ada Lovelace and Charles Babbage.
The document discusses iterators and generators in Python. It explains that iterators are objects that can be used in for loops over iterable objects like lists, strings, and dictionaries. Generators are functions that produce a sequence of results using yield statements, making them iterators as well. The document provides examples of using built-in iterators, implementing custom iterators as classes, and creating generator functions and expressions.
Teach your kids how to program with Python and the Raspberry PiJuan Gomez
RaspberryPis are the new frontier in enabling kids (and curious adults) to get access to an affordable and easy-to-program platform to build cool things. Over a million of these nifty little devices have been sold in less than a year and part of their popularity has been due to how easy it is to start programming on them.
In this session you'll learn how to get started with the Raspberry PI, initial set-up, configuration and some tips and tricks. Then we'll have a brief introduction to basic Python and we'll write a few simple programs that run on the RaspberryPI. The last section of the session will be dedicated to PyGame, we'll learn about surfaces, events, inputs, sprites, etc and demonstrate how to build very simple games that are as much fun for kids to write, than to play!
web programming Unit VIII complete about python by Bhavsingh MalothBhavsingh Maloth
The document provides an introduction to Python programming. It discusses key Python concepts like functions, scopes, arguments, and iterators. The introduction covers defining functions with def statements, the LEGB rule for scopes, passing arguments by assignment, and using lambda expressions for inline functions. Iterators and list comprehensions are also introduced as ways to iterate over objects in Python.
It is one of the Best Presentation on the topic "R Programming" having interesting Slides consisting of Amazing Images & Very Useful Information. It also have Transitions & Animation which makes the Presentation more Interesting & Attractive.
Created By - Abhishek Pratap Singh (Aps)
This document summarizes a presentation on programming basics given by Abhishek Pratap Singh. The presentation covered an introduction to the topic, motivation for learning programming basics, an abstract of what would be covered, applications of programming, the evolution of the C programming language, and some interesting facts about C and C++. It provided examples of writing C code without using loops or semicolons to demonstrate programming techniques.
Learn about how to use the Python language and its constructs in different situations. With classification of scenarios done already you can use this material as a reference guide too.
Learn the language and its idioms by trying out the sample code or by looking at code snippet.
This presentation was created for Next Craft JMP Student Program to help students jump start with Python programming.
This document discusses list comprehensions in Python. It provides examples of using list comprehensions to generate lists based on conditions. It describes generating a list of squares of numbers from 1 to 20 and generating a list of letters from a dictionary whose values are greater than or equal to 3. It then discusses using list comprehensions to solve practice problems involving loading height data from a file, summarizing statistics, looking up heights, finding people above a certain height, and printing a report of heights.
My keynote speech from EuroPython, this talk explores what it is like being a developer in a community filled with experts from around the world. The goal of the talk is to provide useful content for beginners and topics of discussion for more advanced developers, while also focusing on Python’s strengths. Video of this talk is at http://www.youtube.com/watch?v=7TImWbnUDeI
This document provides an overview of a tutorial on mastering Python 3 I/O. The tutorial will cover the reimplemented I/O system in Python 3, including text handling, formatting, binary data handling, the new I/O stack, system interfaces, and library design issues. It assumes some familiarity with how I/O works in Python 2 and will take a detailed tour of the changes in Python 3.
Here are the steps to solve this problem:
1. Convert both lists of numbers to sets:
set1 = {11, 2, 3, 4, 15, 6, 7, 8, 9, 10}
set2 = {15, 2, 3, 4, 15, 6}
2. Find the intersection of the two sets:
intersection = set1.intersection(set2)
3. The number of elements in the intersection is the number of similar elements:
similarity = len(intersection)
4. Print the result:
print(similarity)
The similarity between the two sets is 4, since they both contain the elements {2, 3, 4, 15}.
Python Tricks That You Can't Live WithoutAudrey Roy
Audrey Roy gave a presentation on Python tricks for code readability and reuse at PyCon Philippines 2012. She discussed writing clean, understandable code by following PEP8 style guidelines and using linters. She also explained how to find and install reusable Python libraries from the standard library and PyPI, and how to write packages and modules to create reusable code.
This talk will show what is possible huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this and the 3d visualization in the Jupyter notebook, the by now almost standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing for methods to visualize and explore these large datasets. Data volumes of this size requires different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3d. The calculation of statistics on N-dimensional grids is handled by Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce for instance the mean of a quantity on a regular grid. This statistics can be calculated for any mathematical expression on the data (numpy style) and can be on the full dataset or subsets, specified by queries/selections.
However, to visualize higher dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3d volumes and up to a million glyphs (scatter plots and quiver) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3d plots can also be embedded in static html files and renders on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head mountables. Overlaying 3d quiver plots on a 3d volume rendering allows exploring a 6d (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular data set, or separately to calculate statistics, and render 3d plots in the notebook and outside.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This document provides guidance on diagnosing problems in Cassandra production systems. It recommends first using OpsCenter to identify issues, then monitoring servers, applications, and logs. Common problems discussed include incorrect timestamps, tombstones slowing queries, not using a snitch, version mismatches, and disk space not being reclaimed. Diagnostic tools like htop, iostat, and nodetool are presented. The document also covers JVM garbage collection profiling to identify issues like early object promotion and long minor GCs slowing the system.
Python Ireland Feb '11 Talks: Introduction to PythonPython Ireland
"Introduction to Python" by Sean O'Donnell
Level: Beginner
Abstract:
The content is an introduction to python, by means of comparing it to C, Java and Ruby.
Video:
http://vimeo.com/groups/pythonireland/videos/20239008
Thanks to all who came along. Approx. 30 people turned up.
Thanks to Science Gallery for being our host.
sciencegallery.com/
Kez created a story board. The story board contained visual representations and notes used to plan out scenes, shots, progression and elements of a story or video project. The document refers to a story board but provides no other details about its content or purpose.
This document provides a summary of key concepts from a Python programming tutorial, including strings, lists, tuples, and maps (dictionaries). It defines each concept and provides examples of how to create, print, add/remove elements from each data type. It also includes exercises asking the reader to apply the concepts by creating and manipulating variables of each data type. The exercises test creating and modifying strings, lists, tuples, and dictionaries with different operations like printing, adding, removing elements.
This document provides an overview of iterators, generators, and the Python Imaging Library (PIL) basics in Python. It discusses how iterators allow iteration over objects using the iterator protocol with __iter__ and next() methods. Generators are lazy functions that yield values and can be created via generator factories or comprehensions. The PIL allows loading, saving, and modifying image files in Python through functions like Image.new(), getpixel(), putpixel(), and ImageDraw for drawing primitives.
Practical idioms in Python - Pycon india 2014 proposalPratik M
This document discusses common Python idioms and patterns including iterators, comprehensions, lazy evaluation, decorators, context managers, object-oriented programming, and functional programming concepts like map-reduce. It explores how to use decorators to build caches and multi-threaded systems, how context managers work with the __enter__ and __exit__ methods, and how functional idioms can be applied in Python for tasks like analytics.
Python presentation with examples available on github. Main concepts can be found by exploring the small and concise code samples available.
It presents some of Python's capabilities and characteristics. The code samples (end of June 2014) are tested with latest Python 2.7 but should be easily convertable to Python 3.
LESSON 2. CONDITIONAL LOGIC, IF ELSE STATEMENTS, SELECTION, DEBUGGING
Introduction to, with examples, conditional logic and the use of IF and ELSE statements. Look at SELECTION in game design. Learn about Debugging and Error Checking. Analyse the use of a flow chart and how to design before implementation. Discuss: Video gaming addiction! Create a password checker and a username and password (login) app. Learn about the use of ELIF. Learn about Boolean variables and their use. Learn about Multiple comparisons using and/or. Includes a suggested videos, ‘Big ideas’ discussion, and HW/research projects section. Discussion on Artificial Intelligence and Robotics.
Dear readers, these Python Programming Language Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Python Programming Language. As per my experience good interviewers hardly plan to ask any particular question during your interview, normally questions start with some basic concept of the subject and later they continue based on further discussion and what you answer −
LESSON 3A. INTRODUCTION TO ITERATION: LOOPS, TRACE TABLES, WHILE LOOPS
Introduction to Iteration and loops. The theory behind loops and how they work. Create and adapt programs using loops. Intro to the random number generator. Learn about trace tabling (white box testing). Example of a trace table and dry run. Wonders of the Fibonacci sequence. Examples of Iteration in game design. Focus on While loops. Challenges, tasks (with solutions), suggested videos, big ideas discussion and research and HW included. Introducing Ada Lovelace and Charles Babbage.
The document discusses iterators and generators in Python. It explains that iterators are objects that can be used in for loops over iterable objects like lists, strings, and dictionaries. Generators are functions that produce a sequence of results using yield statements, making them iterators as well. The document provides examples of using built-in iterators, implementing custom iterators as classes, and creating generator functions and expressions.
Teach your kids how to program with Python and the Raspberry PiJuan Gomez
RaspberryPis are the new frontier in enabling kids (and curious adults) to get access to an affordable and easy-to-program platform to build cool things. Over a million of these nifty little devices have been sold in less than a year and part of their popularity has been due to how easy it is to start programming on them.
In this session you'll learn how to get started with the Raspberry PI, initial set-up, configuration and some tips and tricks. Then we'll have a brief introduction to basic Python and we'll write a few simple programs that run on the RaspberryPI. The last section of the session will be dedicated to PyGame, we'll learn about surfaces, events, inputs, sprites, etc and demonstrate how to build very simple games that are as much fun for kids to write, than to play!
web programming Unit VIII complete about python by Bhavsingh MalothBhavsingh Maloth
The document provides an introduction to Python programming. It discusses key Python concepts like functions, scopes, arguments, and iterators. The introduction covers defining functions with def statements, the LEGB rule for scopes, passing arguments by assignment, and using lambda expressions for inline functions. Iterators and list comprehensions are also introduced as ways to iterate over objects in Python.
It is one of the Best Presentation on the topic "R Programming" having interesting Slides consisting of Amazing Images & Very Useful Information. It also have Transitions & Animation which makes the Presentation more Interesting & Attractive.
Created By - Abhishek Pratap Singh (Aps)
This document summarizes a presentation on programming basics given by Abhishek Pratap Singh. The presentation covered an introduction to the topic, motivation for learning programming basics, an abstract of what would be covered, applications of programming, the evolution of the C programming language, and some interesting facts about C and C++. It provided examples of writing C code without using loops or semicolons to demonstrate programming techniques.
Learn about how to use the Python language and its constructs in different situations. With classification of scenarios done already you can use this material as a reference guide too.
Learn the language and its idioms by trying out the sample code or by looking at code snippet.
This presentation was created for Next Craft JMP Student Program to help students jump start with Python programming.
This document discusses list comprehensions in Python. It provides examples of using list comprehensions to generate lists based on conditions. It describes generating a list of squares of numbers from 1 to 20 and generating a list of letters from a dictionary whose values are greater than or equal to 3. It then discusses using list comprehensions to solve practice problems involving loading height data from a file, summarizing statistics, looking up heights, finding people above a certain height, and printing a report of heights.
My keynote speech from EuroPython, this talk explores what it is like being a developer in a community filled with experts from around the world. The goal of the talk is to provide useful content for beginners and topics of discussion for more advanced developers, while also focusing on Python’s strengths. Video of this talk is at http://www.youtube.com/watch?v=7TImWbnUDeI
This document provides an overview of a tutorial on mastering Python 3 I/O. The tutorial will cover the reimplemented I/O system in Python 3, including text handling, formatting, binary data handling, the new I/O stack, system interfaces, and library design issues. It assumes some familiarity with how I/O works in Python 2 and will take a detailed tour of the changes in Python 3.
Here are the steps to solve this problem:
1. Convert both lists of numbers to sets:
set1 = {11, 2, 3, 4, 15, 6, 7, 8, 9, 10}
set2 = {15, 2, 3, 4, 15, 6}
2. Find the intersection of the two sets:
intersection = set1.intersection(set2)
3. The number of elements in the intersection is the number of similar elements:
similarity = len(intersection)
4. Print the result:
print(similarity)
The similarity between the two sets is 4, since they both contain the elements {2, 3, 4, 15}.
Python Tricks That You Can't Live WithoutAudrey Roy
Audrey Roy gave a presentation on Python tricks for code readability and reuse at PyCon Philippines 2012. She discussed writing clean, understandable code by following PEP8 style guidelines and using linters. She also explained how to find and install reusable Python libraries from the standard library and PyPI, and how to write packages and modules to create reusable code.
This talk will show what is possible huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this and the 3d visualization in the Jupyter notebook, the by now almost standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing for methods to visualize and explore these large datasets. Data volumes of this size requires different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3d. The calculation of statistics on N-dimensional grids is handled by Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce for instance the mean of a quantity on a regular grid. This statistics can be calculated for any mathematical expression on the data (numpy style) and can be on the full dataset or subsets, specified by queries/selections.
However, to visualize higher dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3d volumes and up to a million glyphs (scatter plots and quiver) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3d plots can also be embedded in static html files and renders on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head mountables. Overlaying 3d quiver plots on a 3d volume rendering allows exploring a 6d (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular data set, or separately to calculate statistics, and render 3d plots in the notebook and outside.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This document provides guidance on diagnosing problems in Cassandra production systems. It recommends first using OpsCenter to identify issues, then monitoring servers, applications, and logs. Common problems discussed include incorrect timestamps, tombstones slowing queries, not using a snitch, version mismatches, and disk space not being reclaimed. Diagnostic tools like htop, iostat, and nodetool are presented. The document also covers JVM garbage collection profiling to identify issues like early object promotion and long minor GCs slowing the system.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
This document provides an introduction to Google BigQuery, a cloud-based data warehouse that allows users to interactively query and analyze massive datasets. It begins with background on big data and technologies like Hadoop, Hive, and Spark. It then explains the differences between row-based and column-based data stores, with BigQuery using a columnar approach. The rest of the document demonstrates BigQuery through an example query on public datasets and provides pricing and resource information.
Decima Engine: Visibility in Horizon Zero DawnGuerrilla
Download the full presentation here: http://www.guerrilla-games.com/read/decima-engine-visibility-in-horizon-zero-dawn
Abstract: Horizon Zero Dawn presented the Decima engine with new challenges in rendering large and dense environments. In particular, we needed to be able to quickly query a very large set of potential objects to find which should be visible. This talk looks at the problems we faced moving from more constrained Killzone levels to Horizon's open world, and our approach to fast visibility queries using the PS4's asynchronous compute hardware. It also covers our recent work on efficiently collecting batches of object instances during the query to reduce load on the entire rendering pipeline. A basic familiarity with GPU compute will be helpful to get the most out of this talk.
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap(OSM). OSM is an “open-source” map of the world, growing at a large rate, currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, but also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
This document summarizes challenges in assembling large DNA sequence data sets and strategies to address them.
1. The cost to generate DNA sequence data is decreasing rapidly, creating data sets too large for most computers to assemble. Hundreds to thousands of such data sets are generated each year.
2. Techniques like streaming compression and low-memory probabilistic data structures allow assembly memory usage to scale linearly with the sample size rather than the total data, enabling assembly of larger datasets.
3. Benchmarking different computational platforms revealed that while some platforms have faster processors, the ability to store large amounts of data locally is also important for assembly tasks. Scaling algorithms, rather than just optimizing code, is key to addressing
Big Data Analytics: Finding diamonds in the rough with AzureChristos Charmatzis
In this session it will presented main workflows and technologies of getting value from Big Data stored in our Enterprise using Azure.
- When we have a Big Data problem
- Finding the best solution for our Big Data
- Working inside the Data Team
- Extract the true value of our data.
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionDataStax Academy
Diagnosing Problems in Production involves first preparing monitoring tools like OpsCenter, Server monitoring, Application metrics, and Log aggregation. Common issues include incorrect server times causing data inconsistencies, tombstone overhead slowing queries, not using the right snitch, version mismatches breaking functionality, and disk space not being reclaimed properly. Diagnostic tools like htop, iostat, vmstat, dstat, strace, jstack, tcpdump and nodetool can help narrow down issues like performance bottlenecks, garbage collection problems, and compaction issues.
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionDataStax Academy
Speaker(s): Jon Haddad, Apache Cassandra Evangelist at DataStax
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Cassandra Day London 2015: Diagnosing Problems in ProductionDataStax Academy
Speaker(s): Jon Haddad, Apache Cassandra Evangelist at DataStax
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Big Data Analysis : Deciphering the haystack Srinath Perera
A primary outcome of Bigdata is to derive useful and actionable insights from large or challenges data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like Mean, Max, and Median, to derive overall understanding about data by building models, and finally to derive predictions from data. Some cases we can afford to wait to collect and processes them, while in other cases we need to know the outputs right away. MapReduce has been the defacto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill graining ground, and also realtime processing technologies like Stream Processing and Complex Event Processing. Finally there are lot of work on porting decision technologies like Machine learning into big data landscape. This talk discusses big data processing in general and look at each of those different technologies comparing and contrasting them.
2013 py con awesome big data algorithmsc.titus.brown
This document provides an overview of algorithms for analyzing large datasets, referred to as "big data". It discusses skip lists, HyperLogLog counting, and Bloom filters as examples of probabilistic data structures that can be used for problems involving big data. These algorithms provide approximate answers but are more scalable and memory efficient than exact algorithms. The document also describes applications of these algorithms to analyzing shotgun DNA sequencing data from metagenomics studies.
This document provides an overview of big data and how it can be used to forecast and predict outcomes. It discusses how large amounts of data are now being collected from various sources like the internet, sensors, and real-world transactions. This data is stored and processed using technologies like MapReduce, Hadoop, stream processing, and complex event processing to discover patterns, build models, and make predictions. Examples of current predictions include weather forecasts, traffic patterns, and targeted marketing recommendations. The document outlines challenges in big data like processing speed, security, and privacy, but argues that with the right techniques big data can help further human goals of understanding, explaining, and anticipating what will happen in the future.
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardDocker, Inc.
This document discusses using containers and supercomputers to enable open science. It describes how supercomputers are used for diverse scientific research in many fields. Containers can help address issues with portability and scalability on supercomputers by replicating software environments. Shifter enables the use of Docker containers on supercomputers while addressing security and performance concerns. Examples are given of scientific projects using containers, such as astronomy, particle physics, and biology projects. Ensuring reproducibility of results through containerization is also discussed.
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
This document discusses handling larger datasets and moving to distributed systems. It begins by explaining different storage sizes from gigabytes to exabytes and yottabytes. For too big data, it recommends reading data in chunks, using parallel processing libraries like Dask, and compiled Python. It then discusses distributed file systems, MapReduce frameworks, and distributed programming platforms like Hadoop and Spark. The document also covers SQL and NoSQL databases, data warehouses, data lakes, and typical big data science team roles including data scientists, engineers, and analysts. It provides examples of distributed systems and concludes with exercises and suggestions for further reading.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
3. Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in december 2013
• determines positions, velocities and
astrophysical parameters of >109
stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
6. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
7. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
8. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
9. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
10. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
11. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
12. Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
• Python
• rich set of libraries, becoming the default in
science
13. Situation
• TOPCAT comes close, not fast enough, works with
individual rows/particles, no exploratory tools,
written in Java.
• Your own IDL/Python code: a lot to consider to do it
optimal (multicore, efficient storage, efficient
algorithms, interactive becomes complex)
• DataShader?
• We want something to visualize 109 rows/objects in
~1 second
• Do we need to resort to Big Data solutions?
14. Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’
16. Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
17. Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
18. • Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
19. • Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
20. • Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
24. How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
25. How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
26. How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
27. How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
600 times faster
42. Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom,
pan
• interactive selection (lasso,
rectangle)
• client
• undo/redo
43. Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build up of a MW (stellar) halo from 33 satellites
• 3.3 * 106 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
44.
45.
46.
47.
48. Exploration:Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-
x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-
vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy,
vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-
Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
49. Mutual information
• Measure the information loss between
p(x,y) and p(x)p(y)
• information loss measured using the KL
divergence:
• In short:
• How much does breaking up correlation
(not just linear) change density
distribution?
50. Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
51. Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
52. Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic
coordinates
53.
54.
55.
56.
57. How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
58. Client/Server
• 2d histogram example
• raw data:15 GB
• binned 256x256 grid
• 500KB, or 10-100kb compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
67. Get vaex
• standalone binary (OS X , Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex
visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install —user —pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
68. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
69. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
70. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
71. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
72. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
73. Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)