
Visualizing Billions of Data Points: Doing it Right


DOWNLOAD the video for this webinar here: http://go.continuum.io/visualizing-billions-data-points/

Use Intelligent Downsampling to Extract Signal from Noise in Big Data

Visualization is the best way to explore and communicate insights about data. Whether you are dealing with geospatial, time series, or tabular data, interactive graphics allow everyone on your team, from analysts to executives, to understand the patterns in your data.

But as data grows to include millions and billions of points, traditional visualization techniques break down. Whether you're loading the data into limited memory, or separating the signal from the noise when thousands of data points occupy each pixel, as data gets big, visualization gets challenging.

We are here to help. Continuum Analytics' data scientists and engineers have developed a new Big Data visualization approach called datashading. Check out the slides from our Datashading webinar, led by Continuum Analytics CTO Peter Wang and Solutions Architect Dr. Jim Bednar.

Published in: Data & Analytics


  1. Visualizing Billions of Data Points: Doing It Right
  2. About Us
     Peter Wang is the CTO and Co-founder of Continuum Analytics and the creator of Bokeh. He has been developing commercial scientific computing and visualization software for over 15 years. As a creator of the PyData conference, he devotes time and energy to growing the Python data community, and advocating and teaching Python at conferences worldwide.
     Jim Bednar is a Solutions Architect at Continuum Analytics, and an Honorary Fellow at the University of Edinburgh, Scotland. He holds a Ph.D. in Computer Science from the University of Texas, as well as undergraduate degrees in Electrical Engineering and Philosophy. He has published more than 50 papers and books about the visual system and software development, and manages the open-source Python projects HoloViews, ImaGen, Param, and Datashader.
  3. Overview
     • Visualizing big data: what is the problem?
     • Datashading
     • Dataset 1: 10 million points of NYC Taxi data
     • Dataset 2: 3 billion points of OpenStreetMap data
     • Dataset 3: 300 million points of US Census data
     • Sneak preview: Data explorer web interface
     • Q&A
  4. Visualizing big data: What is the problem?
  5. Billions and billions…
  6. Big data magnifies small problems
     • Of course, big data presents storage and computation problems
     • More importantly, standard plotting tools have problems that are magnified by big data:
        • Overdrawing/Overplotting
        • Saturation
        • Undersaturation
        • Binning issues
     • We’ll first explain these problems, and then present a new technique called datashading to address them head-on.
  7. Overdrawing
     • For a scatterplot, the order in which points are drawn is very important
     • The same distribution can look entirely different depending on plotting order
     • The data plotted last overplots everything drawn before it
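The plotting-order effect on the slide above can be illustrated with a minimal sketch. It assumes opaque points and a one-dimensional strip of pixels; the categories `A` and `B`, the `paint` helper, and the pixel positions are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical category codes; 0 marks a pixel that nothing was drawn on.
EMPTY, A, B = 0, 1, 2

def paint(order, pixels_by_category, width=8):
    """Draw opaque points category by category; later draws overwrite earlier ones."""
    canvas = np.full(width, EMPTY)
    for category in order:
        canvas[pixels_by_category[category]] = category
    return canvas

# Two categories whose points land on partly overlapping pixels.
pixels_by_category = {A: np.array([1, 2, 3, 4]), B: np.array([3, 4, 5, 6])}

a_then_b = paint([A, B], pixels_by_category)
b_then_a = paint([B, A], pixels_by_category)

print(a_then_b)  # pixels 3 and 4 show B, because B was drawn last
print(b_then_a)  # pixels 3 and 4 show A, because A was drawn last
```

The underlying data are identical in both calls; only the draw order changes which category appears to own the overlapping pixels.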
  8. Overdrawing
     • Underlying issue is just occlusion
     • Same problem happens with one category, but less obvious
     • Can prevent occlusion using transparency
  9. Saturation
     • E.g. for alpha = 0.1, up to 10 points can overlap before saturating the available brightness
     • Now the order of plotting matters less
     • After 10 points, first-plotted data still lost
     • For one category, 10, 20, or 2000 points overlapping will look identical
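The arithmetic behind the saturation slide can be sketched as follows. The "10 points at alpha = 0.1" figure corresponds to a simple additive model in which each point contributes `alpha` of the available brightness. Many renderers instead use "over" compositing, where opacity approaches 1 asymptotically. Both functions below are illustrative assumptions, not the behaviour of any particular plotting library.

```python
def additive_opacity(n_points, alpha):
    """Additive model: each overlapping point adds alpha, clipped at full brightness."""
    return min(1.0, n_points * alpha)

def over_opacity(n_points, alpha):
    """'Over' compositing: each point covers a fraction alpha of what is still visible."""
    return 1.0 - (1.0 - alpha) ** n_points

for n in (1, 10, 20, 2000):
    print(n, additive_opacity(n, 0.1), round(over_opacity(n, 0.1), 4))
```

Under the additive model, 10, 20, and 2000 overlapping points all yield full opacity. Under "over" compositing the values keep creeping toward 1, but once quantized to 8 bits per channel large counts become indistinguishable as well.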
  10. Saturation
     • Same alpha value, more points:
     • Now the plot is highly misleading
     • The right alpha value depends on the size and overlap of the dataset
     • Difficult-to-set parameter, hard to know when data is misrepresented
  11. Saturation
     • Can try to reduce point size to reduce overplotting and saturation
     • Now points are hard to see, with no guarantee of avoiding problems
     • Another difficult-to-set parameter
     • For really big data, scatterplots start to become very inefficient, because there are many datapoints per pixel, so one may as well be binning by pixel
  12. Binning issues
     • Can use heatmap instead of scatter
     • Avoids saturation by auto-ranging on bins
     • Result independent of data size
     • Here two merged normal distributions look very different at different binning
     • Another difficult-to-set parameter
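The binning issue can be reproduced with a short NumPy sketch. The sample sizes, means, standard deviations, and bin counts below are assumptions picked for illustration; they are not taken from the slide's figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two merged normal distributions (hypothetical parameters).
data = np.concatenate([
    rng.normal(-1.0, 0.5, 5000),
    rng.normal(1.0, 0.5, 5000),
])

# The same data, binned coarsely and finely over the same range.
coarse, _ = np.histogram(data, bins=3, range=(-4, 4))
fine, fine_edges = np.histogram(data, bins=40, range=(-4, 4))

print(coarse)  # the middle bin dominates, so the data look like one central peak
print(fine)    # two separate peaks with a dip between them
```

With three bins the two modes collapse into a single central bump, while forty bins reveal both peaks. Neither bin count is knowable in advance without already understanding the data.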
  13. Plotting big data
     • When exploring really big data, the visualization is all you have: there’s no way to look at each of the individual data points
     • Common plotting problems can lead to completely incorrect conclusions based on misleading visualizations
     • Slow processing makes a trial-and-error approach ineffective
     When data is large, you don’t know when the viz is lying.
  14. Datashading
  15. Datashading
     • Flexible, configurable pipeline for automatic plotting
     • Provides flexible plugins for viz stages, like in graphics shaders
     • Completely prevents overplotting, saturation, and undersaturation
     • Mitigates binning issues by providing fully interactive exploration in web browsers, even of very large datasets on ordinary machines
     • Statistical transformations of data are a first-class aspect of the visualization
     • Allows rapid iteration of visual styles & configs, interactive selections and filtering, to support data exploration
  16. Datashading Pipeline: Projection
     [Diagram: Data → Project / Synthesize → Scene]
     • Stage 1: select variables (columns) to project onto the screen
     • Data often filtered at this stage
  17. Datashading Pipeline: Aggregation
     [Diagram: Data → Project / Synthesize → Scene → Sample / Raster → Aggregates]
     • Stage 2: Aggregate data into a fixed set of bins
     • Each bin yields one or more scalars (total count, mean, stddev, etc.)
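The first two stages can be sketched in plain NumPy. This is an illustration of the idea under stated assumptions, not Datashader's implementation or API: the table, its column names (`x`, `y`, `fare`), the filter threshold, and the grid size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical table with two position columns and one value column.
n = 100_000
table = {
    "x": rng.uniform(0, 10, n),
    "y": rng.uniform(0, 5, n),
    "fare": rng.exponential(12.0, n),
}

# Stage 1 (projection): choose the columns that map to screen axes,
# filtering rows along the way.
keep = table["fare"] < 100
x, y, fare = table["x"][keep], table["y"][keep], table["fare"][keep]

# Stage 2 (aggregation): reduce the points to a fixed grid of bins,
# here one bin per output pixel.
width, height = 200, 100
bounds = [[0, 10], [0, 5]]
count, _, _ = np.histogram2d(x, y, bins=(width, height), range=bounds)
fare_sum, _, _ = np.histogram2d(x, y, bins=(width, height), range=bounds, weights=fare)

# Each bin can yield more than one scalar: a count and a mean, for example.
with np.errstate(invalid="ignore", divide="ignore"):
    mean_fare = fare_sum / count  # NaN where a bin received no points

print(count.shape, int(count.sum()))
```

The aggregate has a fixed size set by the grid, no matter how many rows the table holds, which is what makes the later stages independent of data size.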
  18. Datashading Pipeline: Transfer
     [Diagram: Data → Project / Synthesize → Scene → Sample / Raster → Aggregates → Transfer → Image]
     • Stage 3: Transform data using one or more transfer functions, culminating in a function that yields a visible image
     • Each stage can be replaced and configured separately
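A transfer function in the sense of Stage 3 maps an array of aggregates to pixel values. The sketch below compares two simple choices on a hypothetical grid of counts; the function names are illustrative and are not part of Datashader.

```python
import numpy as np

def linear_transfer(agg):
    """Map counts linearly to grey levels 0..255."""
    agg = np.asarray(agg, dtype=float)
    top = agg.max()
    if top == 0:
        return np.zeros(agg.shape, dtype=np.uint8)
    return np.round(255 * agg / top).astype(np.uint8)

def log_transfer(agg):
    """Map counts to grey levels 0..255 on a log scale; empty bins stay at 0."""
    scaled = np.log1p(np.asarray(agg, dtype=float))
    top = scaled.max()
    if top == 0:
        return np.zeros(scaled.shape, dtype=np.uint8)
    return np.round(255 * scaled / top).astype(np.uint8)

# Hypothetical aggregate whose counts span five orders of magnitude.
agg = np.array([[0, 1, 10],
                [100, 1000, 100000]])

print(linear_transfer(agg))  # most occupied bins round down to black (undersaturation)
print(log_transfer(agg))     # every occupied bin gets a visible, distinct level
```

Because the transfer function operates on the small aggregate array rather than the raw data, swapping one function for another is cheap, which fits the slide's point that each stage can be replaced and configured separately.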
  19. Datashader software architecture
     • Open-source BSD-licensed library, written in Python
     • Download from https://github.com/bokeh/datashader
     • conda install -c bokeh datashader
     • Works well with the Bokeh library, for interactive visualization of large data in the browser, but is fully independent
     • Supports large parallel machines, but works well on desktops and laptops; all demos shown here will be on a standard MacBook Pro
  20. Dataset 1: NYC Taxi data
  21. Dataset 1: Overview
     This demo shows how traditional plotting tools break down for large datasets, and how to use datashading to make even large datasets practical interactively.
     • Data for 10 million New York City taxi trips
     • Even 100,000 points get slow in a scatterplot
     • Parameters usually need adjusting for every zoom
     • True relationships within data not visible in a standard plot
     Datashading automatically reveals the entire dataset, including outliers, hot spots, and missing data
  22. Dataset 2: 3 billion OSM points
  23. Dataset 2: Overview
     Because Datashader decouples the data-processing from the visualization, it can handle arbitrarily large data
     • E.g. OpenStreetMap data:
        • About 3 billion GPS coordinates
        • https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
     • This image was rendered in one minute on a standard MacBook with 16 GB RAM
     • Renders in 7 seconds on a 128GB Amazon EC2 instance
  24. Out of core operation
     • OSM database is 40GB, larger than the 16GB RAM on this machine
     • Fast out-of-core operation powered by:
        • Numba: generates fast C and GPU code from Python source
        • Dask: Parallelizes tasks
     • datashader source code is all Python
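One reason aggregation suits out-of-core and parallel execution is that per-bin counts are additive: aggregating chunks separately and summing the results gives the same grid as aggregating everything at once. The sketch below shows this with plain NumPy and in-memory slices standing in for chunks read from disk. It does not use Dask or Numba, and the helper names are hypothetical.

```python
import numpy as np

def aggregate(x, y, shape, bounds):
    """Count points per bin on a fixed grid."""
    counts, _, _ = np.histogram2d(x, y, bins=shape, range=bounds)
    return counts

def aggregate_in_chunks(chunks, shape, bounds):
    """Aggregate one chunk at a time, so only one chunk needs to be in memory."""
    total = np.zeros(shape)
    for x, y in chunks:
        total += aggregate(x, y, shape, bounds)
    return total

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 50_000)
y = rng.normal(0, 1, 50_000)
shape, bounds = (64, 64), [[-4, 4], [-4, 4]]

# Slices stand in for chunks that would be streamed from disk.
chunk_size = 10_000
chunks = ((x[i:i + chunk_size], y[i:i + chunk_size]) for i in range(0, x.size, chunk_size))

chunked = aggregate_in_chunks(chunks, shape, bounds)
whole = aggregate(x, y, shape, bounds)
print(np.array_equal(chunked, whole))
```

The running total is only as large as the output grid, so memory use is bounded by the chunk size plus the grid, not by the dataset. The same additivity lets chunks be processed on separate cores or machines and merged afterwards.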
  25. Dataset 3: 300 million points Census data
  26. Categorical data: 2010 US Census
     • One point per person
     • 300 million total
     • Categorized by race
     • Datashading shows faithful distribution per pixel
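One way to obtain a per-pixel categorical distribution is to keep a separate count per category in every bin and then blend category colors in proportion to those counts. The sketch below illustrates that idea under assumptions: the helper names, the two-color palette, and the tiny dataset are hypothetical, and this is not presented as Datashader's own color-mixing code.

```python
import numpy as np

def categorical_aggregate(x, y, category, n_categories, shape, bounds):
    """Return per-category counts with shape (n_categories, width, height)."""
    return np.stack([
        np.histogram2d(x[category == c], y[category == c], bins=shape, range=bounds)[0]
        for c in range(n_categories)
    ])

def mix_colors(counts, palette):
    """Blend palette colors per pixel, weighted by each category's share of the count."""
    total = counts.sum(axis=0)
    weights = np.divide(counts, total, out=np.zeros_like(counts), where=total > 0)
    return np.einsum("cwh,ck->whk", weights, palette)

# Tiny hypothetical dataset: three points in the left pixel, one in the right.
x = np.array([0.5, 0.5, 0.5, 1.5])
y = np.array([0.5, 0.5, 0.5, 0.5])
category = np.array([0, 0, 1, 1])
palette = np.array([[255.0, 0.0, 0.0],   # category 0: red
                    [0.0, 0.0, 255.0]])  # category 1: blue

counts = categorical_aggregate(x, y, category, 2, shape=(2, 1), bounds=[[0, 2], [0, 1]])
rgb = mix_colors(counts, palette)
print(rgb[0, 0])  # two parts red, one part blue
print(rgb[1, 0])  # pure blue
```

Because every point contributes to its category's count, no category is hidden by draw order; the pixel color reflects the mixture rather than whichever point was drawn last.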
  27. Categorical data: Illinois Census
     • Zooming in doesn’t require new parameter settings
     • All ranges, etc. update automatically
  28. Categorical data: Chicago Census
     • Chicago area is highly racially segregated, on average
     • Population density differences still visible, automatically
  29. Categorical data: Chinatown Census
     • At neighborhood level, the full racial distribution is clear
     • Size of dots increases automatically for visibility
  30. Categorical data: Chinatown Census
     • Bokeh supports geographic data tiles
     • Borders between race-based neighborhoods are related to geographic features
  31. Go Further with Datashader
  32. © 2016 Continuum Analytics- Confidential & Proprietary
     Use DataShader Today!
     • Go to https://github.com/bokeh/datashader
     • Please Go Try It Out!
     • Star It, Fork It, Open Issues, Let Us Know What You Think!
  33. Support & Indemnification
     • We offer support & indemnification for DataShader and many other Python packages through our Anaconda Pro offering.
     • You can learn more at continuum.io/anaconda-subscriptions
  34. Anaconda Mosaic
     • Create PORTABLE Transformations
     • Interactively EXPLORE Heterogeneous Data Visually with Datashader
     • Easily ANALYZE Large Flat File Repositories
     • ELIMINATE Data Movement and Redundant Storage
     • CATALOG datasets and transformations
     • Learn More About Anaconda Mosaic in Our Webinar - https://go.continuum.io/visualize-explore-transform/
     • Get a Personalized Demo of Anaconda Mosaic - https://www.continuum.io/anaconda-subscriptions
  35. Training & Consulting
     • We also provide:
        • Training for your team on DataShader, Bokeh, matplotlib & Many Other Visualization Packages
        • Consulting & Custom Engineering Help Taking Visualizations to Scale
  36. Contact Information and Additional Details
     • Contact sales@continuum.io for more information about Anaconda subscriptions and about becoming an early adopter for Data Explorer. Help make sure our product fits your needs!
     • View documentation and examples at github.com/bokeh/datashader and bokeh.pydata.org
     • View demo notebooks on Anaconda Cloud: notebooks.anaconda.org/jbednar/
  37. Thank you
     Email: sales@continuum.io
     Twitter: @ContinuumIO
     Peter Wang, Twitter: @pwang
     Jim Bednar, Twitter: @JamesABednar
