The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two variables correlated and following the assumption of normality. The second graph (top right) is not distributed normally; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate. In the third graph (bottom left), the distribution is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth graph (bottom right) shows an example where one outlier is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables. The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. The datasets are as follows. The x values are the same for the first three datasets.
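The near-identical summaries can be checked without any plotting library. The sketch below uses only the standard library; the values are the published Anscombe quartet, and the helper names `pearson` and `summarize` are illustrative, not taken from any package.

```python
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Descriptive statistics that are (almost) identical across the quartet."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four rows print nearly the same numbers, which is exactly why the plots are needed to tell the datasets apart.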
It's possible to generate bivariate data with a given mean, median, and correlation in any shape you like, even a dinosaur. The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to the fixed target values. The shapes include a star, a cross, and the "DataSaurus".
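A rough sketch of the perturbation idea, under stated assumptions: it is not the paper's algorithm. It swaps in a plain greedy acceptance rule, uses a circle as a toy target shape, and keeps the mean, standard deviation and correlation fixed to two decimals. The names `summary`, `distance_to_circle` and `perturb` are hypothetical.

```python
import math
import random


def summary(points, ndigits=2):
    """Means, standard deviations and Pearson correlation, rounded for comparison."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    values = (
        mean_x,
        mean_y,
        math.sqrt(sxx / (n - 1)),
        math.sqrt(syy / (n - 1)),
        sxy / math.sqrt(sxx * syy),
    )
    return tuple(round(v, ndigits) for v in values)


def distance_to_circle(point, radius=2.0):
    """Distance from a point to a circle centred on the origin (toy target shape)."""
    return abs(math.hypot(point[0], point[1]) - radius)


def perturb(points, steps=5000, jitter=0.1, seed=0):
    """Nudge random points toward the circle while the rounded summaries stay fixed."""
    rng = random.Random(seed)
    current = list(points)
    target = summary(current)
    for _ in range(steps):
        i = rng.randrange(len(current))
        old = current[i]
        new = (old[0] + rng.gauss(0, jitter), old[1] + rng.gauss(0, jitter))
        if distance_to_circle(new) >= distance_to_circle(old):
            continue  # greedy rule: only consider moves toward the shape
        current[i] = new
        if summary(current) != target:
            current[i] = old  # undo moves that change the rounded summaries
    return current


rng = random.Random(42)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
shaped = perturb(cloud)
print(summary(cloud), summary(shaped))
```

Both printed tuples are identical by construction, while the points themselves have drifted toward the ring.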
- designed like MATLAB
- many output formats:
  A lot of documentation on the website and in the mailing lists refers to the “backend” and many new users are confused by this term. matplotlib targets many different use cases and output formats. Some people use matplotlib interactively from the Python shell and have plotting windows pop up when they type commands. Some people embed matplotlib into graphical user interfaces like wxpython or pygtk to build rich applications. Others use matplotlib in batch scripts to generate PostScript images from some numerical simulations, and still others in web application servers to dynamically serve up graphs. To support all of these use cases, matplotlib can target different outputs, and each of these capabilities is called a backend; the “frontend” is the user-facing code, i.e., the plotting code, whereas the “backend” does all the hard work behind the scenes to make the figure. There are two types of backends: user interface backends (for use in pygtk, wxpython, tkinter, qt4, or macosx; also referred to as “interactive backends”) and hardcopy backends to make image files (PNG, SVG, PDF, PS; also referred to as “non-interactive backends”).
- can reproduce any plot
- well tested; a standard tool for 14 years
I want population vs area coloured by Region
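With plain matplotlib, that request might look roughly like the sketch below. The rows are made-up placeholder values (not real country statistics), and the hardcopy `Agg` backend is chosen only so the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # hardcopy (non-interactive) backend: renders images, opens no window
import matplotlib.pyplot as plt

# Placeholder rows (name, region, area, population); illustrative values only.
countries = [
    ("A", "North", 120.0, 8.0),
    ("B", "North", 300.0, 25.0),
    ("C", "South", 45.0, 3.5),
    ("D", "South", 900.0, 60.0),
    ("E", "East", 510.0, 41.0),
]

fig, ax = plt.subplots()
regions = sorted({row[1] for row in countries})

# matplotlib has no "colour by column" shortcut here: one scatter call per region.
for region in regions:
    rows = [row for row in countries if row[1] == region]
    ax.scatter([r[2] for r in rows], [r[3] for r in rows], label=region)

ax.set_xlabel("Area")
ax.set_ylabel("Population")
ax.legend(title="Region")

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

The explicit loop over regions is the kind of boilerplate that the higher-level libraries discussed later hide behind a single argument.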
- imperative and overly verbose API
- styles are sometimes poor
- poor support for web views and interactions
- often slow for large and complicated data
keep matplotlib as a backend and provide domain-specific APIs
- pandas: dataframe object with plotting methods
- seaborn: focus on statistical visualization. Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. (more than 5 years)
- ggplot: a Python implementation of the grammar of graphics. It is not intended to be a feature-for-feature port of ggplot2 for R, though there is much greatness in ggplot2 and the Python world could stand to benefit from it. So there will be feature overlap, but not necessarily mimicry (after all, R is a little weird).
- cartopy: some of the key features of cartopy are:
  - object-oriented projection definitions
  - point, line, polygon and image transformations between projections
  - integration to expose advanced mapping in matplotlib with a simple and intuitive interface
  - powerful vector data handling by integrating shapefile reading with Shapely capabilities
  http://proj4.org/ http://trac.osgeo.org/geos/
- networkx: NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Features:
  - Data structures for graphs, digraphs, and multigraphs
  - Many standard graph algorithms
  - Network structure and analysis measures
  - Generators for classic graphs, random graphs, and synthetic networks
  - Nodes can be "anything" (e.g., text, images, XML records)
  - Edges can hold arbitrary data (e.g., weights, time-series)
  - Open source 3-clause BSD license
  - Well tested with over 90% code coverage
  - Additional benefits from Python include fast prototyping, easy to teach, and multi-platform
- scikit-plot: Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought. Gaining insights is simply a lot easier when you're looking at a colored heatmap of a confusion matrix complete with class labels rather than a single-line dump of numbers enclosed in brackets. Besides, if you ever need to present your results to someone (virtually any time anybody hires you to do data science), you show them visualizations, not a bunch of numbers in Excel. That said, there are a number of visualizations that frequently pop up in machine learning. Scikit-plot is a humble attempt to provide aesthetically-challenged programmers (such as myself) the opportunity to generate quick and beautiful graphs and plots with as little boilerplate as possible.
build an API that serializes the plot (usually to JSON) so that it can be displayed in a browser.
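A minimal sketch of that pattern, assuming a made-up `Figure` class: it is not the API of Bokeh, Plotly or any other real library, only an illustration of "Python object in, JSON out".

```python
import json


class Figure:
    """Hypothetical figure object that only records what should be drawn."""

    def __init__(self, title):
        self.spec = {"title": title, "layers": []}

    def scatter(self, x, y, color=None):
        """Record a scatter layer; nothing is rendered on the Python side."""
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        self.spec["layers"].append(
            {"type": "scatter", "x": list(x), "y": list(y), "color": color}
        )
        return self  # allow call chaining

    def to_json(self):
        """Serialize the plot description for a browser-side renderer."""
        return json.dumps(self.spec, sort_keys=True)


payload = (
    Figure("population vs area")
    .scatter([1, 2, 3], [2, 4, 8], color="steelblue")
    .to_json()
)
restored = json.loads(payload)  # what a JavaScript client would receive
print(payload)
```

The Python side never draws anything; a JavaScript library in the browser would read the JSON and render it.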
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
Plotly's Python graphing library makes interactive, publication-quality graphs online. It has examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
toyplot:
- Plot types: bar plots, filled region plots, graph visualizations, image visualizations, line plots, matrix plots, numberline plots, scatter plots, tabular plots, text plots.
- Styling: standard CSS, rich text with HTML markup.
- Integrates with Jupyter without any need for plugins, magics, etc.
- Interaction types: display interactive mouse coordinates, export figure data to CSV.
- Interactive output formats: embeddable, self-contained HTML.
- Static output formats: SVG, PDF, PNG, MP4, WEBM.
- Portability: single code base for Python 2.7 / Python 3.6.
- Testing: greater-than-95% regression test coverage.
- Main feature: easy animations
Cufflinks: This library binds the power of plotly with the flexibility of pandas for easy plotting.
ipyvolume: 3d plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.
Ipyvolume currently can:
- Do volume rendering.
- Create scatter plots (up to ~1 million glyphs).
- Create quiver plots (like scatter, but with an arrow pointing in a particular direction).
- Render in the Jupyter notebook, or create a standalone html page (or snippet to embed in your page).
- Render in stereo, for virtual reality with Google Cardboard.
- Animate in d3 style, for instance if the x coordinates or color of a scatter plot changes.
- Animations / sequences: all scatter/quiver plot properties can be a list of arrays, which can represent time snapshots.
- Stylable (although still basic)
- Integrates with:
  - ipywidgets for adding gui controls (sliders, button etc), see an example at the documentation homepage
  - bokeh by linking the selection
  - bqplot by linking the selection
Ipyvolume will probably, but not yet:
- Render labels in latex.
- Do isosurface rendering.
- Do selections using mouse or touch.
- Show a custom popup on hovering over a glyph.
Python, R, MATLAB, JS
chart, dashboard, slides
- Every chart that matplotlib or MATLAB graphics can do.
- Interactive charts and maps out-of-the-box.
- Get started working offline.
- Optional hosted sharing platform through Plotly On-Premises or Plotly Cloud.
- Built on top of d3.js
Streaming API (paid)
- community, chat, email, phone support (depends on plan)
- public/private charts, dashboards, slides (depends on plan)
- png, jpeg, pdf, svg, eps, html export (depends on plan)
- connect to 7-18 sources (depends on plan)
Python, R, Scala, Julia
Bokeh, a Python interactive visualization library, enables beautiful and meaningful visual presentation of data in modern web browsers. With Bokeh, you can quickly and easily create interactive plots, dashboards, and data applications. Bokeh helps provide elegant, concise construction of novel graphics in the style of D3.js, while also delivering high-performance interactivity over very large or streaming datasets.
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data. Datashader breaks the creation of images of data into 3 main steps:
- Projection: each record is projected into zero or more bins of a nominal plotting grid shape, based on a specified glyph.
- Aggregation: reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate array.
- Transformation: these aggregates are then further processed, eventually creating an image.
Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way. Using highly optimized rendering routines written in Python but compiled to machine code using Numba, datashader makes it practical to work with extremely large datasets even on standard hardware.
https://datashader.readthedocs.io/en/latest/
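The three steps can be imitated in plain Python to show the idea. This is a sketch, not Datashader's API or implementation: the function names `project`, `aggregate` and `transform`, the count reduction and the log scaling are all assumptions made for illustration.

```python
import math
import random


def project(points, x_range, y_range, width, height):
    """Projection: map each (x, y) record to a (row, col) bin of the plotting grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:  # records outside the grid fall into zero bins
            col = min(int((x - x0) / (x1 - x0) * width), width - 1)
            row = min(int((y - y0) / (y1 - y0) * height), height - 1)
            yield row, col


def aggregate(bins, width, height):
    """Aggregation: reduce the records in each bin to a count."""
    grid = [[0] * width for _ in range(height)]
    for row, col in bins:
        grid[row][col] += 1
    return grid


def transform(grid):
    """Transformation: log-scale the counts into 0..255 grey levels."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(count) * scale) for count in row] for row in grid]


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]
grid = aggregate(project(points, (-4, 4), (-4, 4), 32, 32), 32, 32)
image = transform(grid)
print(max(max(row) for row in grid), "records in the densest bin")
```

However many records go in, only the 32 x 32 aggregate reaches the rendering step, which is what keeps this kind of pipeline practical for large data.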
Altair is a declarative statistical visualization library for Python, based on Vega-Lite. With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code. Note: Altair and the underlying Vega-Lite library are under active development; new plot types and streamlined plotting interfaces will be added in future releases. Please stay tuned for developments in the coming months! – October 2016
The key idea is that you are declaring links between data columns and encoding channels, such as the x-axis, y-axis, color, etc., and the rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising number of useful plots and visualizations can be created.
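To make the column-to-channel idea concrete, here is a sketch that builds a Vega-Lite-shaped dictionary by hand. It does not use Altair; `chart` and `infer_type` are hypothetical helpers, the type guessing is deliberately simplistic, and the rows are placeholder values.

```python
import json


def infer_type(values):
    """Guess a Vega-Lite field type from plain Python values (simplified)."""
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    return "nominal"


def chart(rows, mark="point", **channels):
    """Declare which column feeds which encoding channel; the rest is derived."""
    encoding = {}
    for channel, field in channels.items():
        column = [row[field] for row in rows]
        encoding[channel] = {"field": field, "type": infer_type(column)}
    return {"data": {"values": rows}, "mark": mark, "encoding": encoding}


# Placeholder rows; illustrative values only.
rows = [
    {"area": 120.0, "population": 8.0, "region": "North"},
    {"area": 900.0, "population": 60.0, "region": "South"},
    {"area": 510.0, "population": 41.0, "region": "East"},
]

spec = chart(rows, x="area", y="population", color="region")
print(json.dumps(spec["encoding"], indent=2))
```

Nothing in the call says how to draw axes, legends or colours; only the mapping from columns to channels is declared.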
Plotting data in the Python ecosystem is a good news/bad news story. The good news is that there are a lot of options. The bad news is that there are a lot of options. Figuring out which ones work for you will depend on what you’re trying to accomplish. To some degree, you need to play with the tools to figure out if they will work for you. I don’t see one clear winner or clear loser. Here are a few of my closing thoughts:
- Pandas is handy for simple plots but you need to be willing to learn matplotlib to customize.
- Seaborn can support some more complex visualization approaches but still requires matplotlib knowledge to tweak. The color schemes are a nice bonus.
- ggplot has a lot of promise but is still going through growing pains.
- bokeh is a robust tool if you want to set up your own visualization server but may be overkill for the simple scenarios.
- pygal stands alone by being able to generate interactive svg graphs and png files. It is not as flexible as the matplotlib based solutions.
- Plotly generates the most interactive graphs. You can save them offline and create very rich web-based visualizations.
As it stands now, I’ll continue to watch progress on the ggplot landscape and use pygal and plotly where interactivity is needed.
The power of machine learning comes from its ability to learn patterns from large amounts of data. Understanding your data is critical to building a powerful machine learning system.
Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
Explore Facets Overview and Facets Dive on the UCI Census Income dataset, used for predicting whether an individual’s income exceeds $50K/yr based on their census data. The census data contains features such as age, education level and occupation for each individual.
Overview takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis. Overview gives users a quick understanding of the distribution of values across the features of their dataset(s). It helps uncover both common and uncommon issues such as unexpected feature values, missing feature values for a large number of observations, training/serving skew and train/test/validation set skew.
Facets Overview summarizes statistics for each feature and compares the training and test datasets. It becomes easy to learn the distribution of values across the 6 numeric and 9 categorical features for both datasets. Use the “Sort by” dropdown to sort features by “Distribution distance”. This sort order brings the features that differ most between the two datasets to the top of the tables. “Target” becomes the first feature in the table of categorical features. The chart for this feature shows that the training and test datasets actually use slightly different labels (“>50K” for the training data and “>50K.” for test data; notice the trailing period). This helps us uncover an unexpected difference between the training data and the test data.
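The trailing-period mismatch is the kind of skew a few lines of plain Python can also surface. The sketch below is not how Facets computes its "Distribution distance"; the L-infinity distance between category proportions is an assumption, and the label lists are toy values.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def distribution_distance(train, test):
    """Largest absolute difference in category share (assumed metric)."""
    p, q = proportions(train), proportions(test)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


def unseen_values(train, test):
    """Categories that occur in the test set but never in the training set."""
    return sorted(set(test) - set(train))


# Toy labels mimicking the mismatch described above.
train_target = ["<=50K", "<=50K", "<=50K", ">50K"]
test_target = ["<=50K", "<=50K", "<=50K", ">50K."]

print(distribution_distance(train_target, test_target))
print(unseen_values(train_target, test_target))
```

Sorting features by such a distance puts "Target" near the top, and the unseen-value check points straight at the stray period.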
Dive is a tool for interactively exploring large numbers of data points at once. Dive provides an interactive interface for exploring the relationship between data points across all of the different features of a dataset. Each individual item in the visualization represents a data point. Position items by "faceting" or bucketing them in multiple dimensions by their feature values. Success stories of Dive include the detection of classifier failure, identification of systematic errors, evaluating ground truth and potential new signals for ranking.
The Dive visualization shows each individual item in the training dataset. Clicking on an individual item reveals key/value pairs that represent the features of that record; values may be strings or numbers. Using the menus on the left, you can change how the data is organized in order to gain insight into the dataset. Use the “Faceting” menu to do “Row-based faceting” by “Education-num”. Use the “Color” menu to color by “Target”. This will show how higher levels of education are related to whether or not an individual earns more than $50K/yr.
Overview gives a high-level view of one or more data sets. It produces a visual feature-by-feature statistical analysis, and can also be used to compare statistics across two or more data sets. The tool can process both numeric and string features, including multiple instances of a number or string per feature. Overview can help uncover issues with datasets, including the following:
- Unexpected feature values
- Missing feature values for a large number of examples
- Training/serving skew
- Training/test/validation set skew
Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red. Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.
Dive is a tool for interactively exploring up to tens of thousands of multidimensional data points, allowing users to seamlessly switch between a high-level overview and low-level details. Each example is represented as a single item in the visualization and the points can be positioned by faceting/bucketing in multiple dimensions by their feature values. Combining smooth animation and zooming with faceting and filtering, Dive makes it easy to spot patterns and outliers in complex data sets.
The Facets visualizations currently work only in Chrome (Issue 9).
Disclaimer: This is not an official Google product.
Note: When visualizing a large amount of data, as is done in the Dive demo Jupyter notebook, you will need to start the notebook server with an increased IOPub data rate. This can be done with the command:
    jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000
Fun Fact: In large datasets, such as the CIFAR-10 dataset, a small human labelling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and were able to catch a frog-cat – an image of a frog that had been incorrectly labelled as a cat!
Exploration of the CIFAR-10 dataset using Facets Dive. Here we facet the ground truth labels by row and the predicted labels by column. This produces a confusion matrix view, allowing us to drill into particular kinds of misclassifications. In this particular case, the ML model incorrectly labels some small percentage of true cats as frogs. The interesting thing we find by putting the real images in the confusion matrix is that one of these "true cats" that the model predicted was a frog is actually a frog from visual inspection. With Facets Dive, we can determine that this one misclassification wasn't a true misclassification of the model, but instead incorrectly labeled data in the dataset.
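Faceting ground truth by row and prediction by column amounts to building a confusion matrix and then opening one of its cells. A small sketch with toy labels (not CIFAR-10 data; the helper names are hypothetical):

```python
from collections import Counter


def confusion_cells(true_labels, predicted_labels):
    """Count items per (ground truth, prediction) cell."""
    return Counter(zip(true_labels, predicted_labels))


def items_in_cell(true_labels, predicted_labels, true_value, predicted_value):
    """Indices of the items in one cell, ready for visual inspection."""
    return [
        index
        for index, (truth, predicted) in enumerate(zip(true_labels, predicted_labels))
        if truth == true_value and predicted == predicted_value
    ]


# Toy labels; item 1 is labelled "cat" but predicted "frog".
truth = ["cat", "cat", "frog", "cat", "frog"]
predicted = ["cat", "frog", "frog", "cat", "frog"]

cells = confusion_cells(truth, predicted)
suspects = items_in_cell(truth, predicted, "cat", "frog")
print(cells[("cat", "frog")], suspects)
```

Looking at the actual images behind `suspects` is the step that reveals whether the model or the label is wrong.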
We’ve gotten great value out of Facets inside of Google and are excited to share the visualizations with the world. We hope they can help you discover new and interesting things about your data that lead you to create more powerful and accurate machine learning models. And since they are open source, you can customize the visualizations for your specific needs or contribute to the project to help us all better understand our data. If you have feedback about your experience with Facets, please let us know what you think.
folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.
More than 1.5 million Instagram posts have been gathered to create this interactive infographic. All of the posts are geo-tagged, so mapping them out was possible. The colors on the map show the density and sentiment of Instagram posts across Hong Kong.
Apache Superset is a data exploration and visualization web application. Superset provides:
- An intuitive interface to explore and visualize datasets, and create interactive dashboards.
- A wide array of beautiful visualizations to showcase your data.
- Easy, code-free user flows to drill down and slice and dice the data underlying exposed dashboards. The dashboards and charts act as a starting point for deeper analysis.
- A state-of-the-art SQL editor/IDE exposing a rich metadata browser, and an easy workflow to create visualizations out of any result set.
- An extensible, high-granularity security model allowing intricate rules on who can access which product features and datasets.
- Integration with major authentication backends (database, OpenID, LDAP, OAuth, REMOTE_USER, ...)
- A lightweight semantic layer, allowing you to control how data sources are exposed to the user by defining dimensions and metrics
- Out-of-the-box support for most SQL-speaking databases
- Deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, realtime datasets
- Fast-loading dashboards with configurable caching
On top of having the ability to query your relational databases, Superset ships with deep integration with Druid (a real-time distributed column store). When querying Druid, Superset can query humongous amounts of data on top of real-time datasets. Note that Superset does not require Druid in any way to function; it's simply another database backend that it can query.
MySQL, Postgres, Vertica, Oracle, Microsoft SQL Server, SQLite, Greenplum, Firebird, MariaDB, Sybase, IBM DB2, Exasol, MonetDB, Snowflake, Redshift, and more! Look for the availability of a SQLAlchemy dialect for your database to find out whether it will work with Superset.
Data Visualization Tools in Python
Data Scientist at InData Labs
- why dataviz is important
- dataviz libraries in python
- facets tool
- interactive maps
- Apache Superset
- EDA & understanding the data
- fix data
- show insights
- model validation
- analytics & reporting
Plots vs descriptive statistics
Property           Value               Accuracy
Mean of x          9                   exact
Variance of x      11                  exact
Mean of y          7.5
Variance of y      4.125               ± 0.003
Regression line    y = 3.00 + 0.500x
Determ. coef.      0.67
Visualization of the week according to InsideBigData
if SQLAlchemy dialect is available for your DB
Airbnb Amino Brilliant.org Clark.de Digit Game Studios Douban
Endress+Hauser FBK - ICT center Faasos GfK Data Lab InData Labs
Maieutical Labs Qunar Shopkick Tails.com Tobii Tooploox Udemy Yahoo!
Panoramix -> Caravel -> Superset
Article on Superset benefits
Roaring Elephant podcast
Thanks for your attention!
some examples shown are available here