tabplotd3, interactive inspection of large data
Edwin de Jonge (e.dejonge@cbs.nl), Martijn Tennekes
UseR 2012
Tableplots?
Why?
Statistical offices typically have data with:
- Millions of objects
- Dozens of variables
Challenge
- Explore data quality
- Check the distribution of missing values
< Enter tableplot >
A tableplot is a multivariate plotting technique for large data sets.
The data is:
- Sorted on a variable $v_s$ of choice
- Divided into equal-sized bins $b_j$
- Plotted using the mean $\mu_{ij}$ for each numerical variable $v_i$ and bin $b_j$
- Plotted using the level fraction $l_{ijk}$ for each level $l_k$ of categorical variable $v_i$ and bin $b_j$
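The four steps above can be sketched in a few lines. This is a minimal Python illustration, not the tabplot implementation: the function name `tableplot_bins`, its parameters, and the list-of-dicts input are assumptions made for the example, and it assumes there are no missing values.

```python
from collections import Counter
from statistics import mean


def tableplot_bins(rows, sort_col, n_bins=100, decreasing=True):
    """Aggregate rows (a list of dicts) into tableplot bins.

    Returns one dict per bin: numerical columns hold the bin mean,
    categorical columns hold a {level: fraction} dict.
    Hypothetical helper for illustration only.
    """
    if not rows:
        return []
    # Step 1: sort on the chosen variable v_s.
    ordered = sorted(rows, key=lambda r: r[sort_col], reverse=decreasing)
    n = len(ordered)
    n_bins = min(n_bins, n)
    # Decide once per column whether it is numerical or categorical.
    numeric = {col for col in ordered[0]
               if all(isinstance(r[col], (int, float)) for r in ordered)}
    bins = []
    for j in range(n_bins):
        # Step 2: divide into (near-)equal sized bins b_j.
        chunk = ordered[j * n // n_bins:(j + 1) * n // n_bins]
        summary = {}
        for col in ordered[0]:
            values = [r[col] for r in chunk]
            if col in numeric:
                # Step 3: mean mu_ij of numerical v_i in bin b_j.
                summary[col] = mean(values)
            else:
                # Step 4: fraction l_ijk of each level l_k in bin b_j.
                counts = Counter(values)
                summary[col] = {k: c / len(values) for k, c in counts.items()}
        bins.append(summary)
    return bins
```

Each returned dict corresponds to one row bin of the plot; a renderer would draw a bar of length $\mu_{ij}$ for numerical columns and a stacked bar of the level fractions for categorical ones.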
< Enter tableplot >
[Figure: schematic tableplot with columns var1, var2, var3, ..., var8; 16.4 million records aggregated into 100 row bins]
Value shown per bin:
- numerical: mean
- categorical: frequency of each category
Dutch (virtual) census
Implementation in R (2011)
- Package tabplot: command line interface
- Package tabplotGTK: graphical user interface
- Available on the internet (CRAN)
- Supports very large datasets (up to 2 billion records)
Great, but not interactive enough...
Interactivity
Interactive
A tableplot is a summary method for large data sets, so it should allow for:
- zooming
- inspecting values
- sorting values
With tabplot, most of this can be done the standard R REPL way, but
zooming is a very important feature for detecting errors and
missing values in observations.
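To see why zooming matters, it helps to look at the arithmetic: with 16.4 million records in 100 bins, each bin averages 164,000 records, so a handful of erroneous values is easily smoothed away. Zooming re-bins a slice of the sorted rows so that each bin covers far fewer records. The sketch below is a Python illustration under assumed names (`zoom_bounds`, `bin_edges`, and percentage-based `from_pct`/`to_pct` are hypothetical), not tabplot's own code.

```python
import math


def zoom_bounds(n_rows, from_pct, to_pct):
    """Row index range [start, stop) covering from_pct..to_pct of the sorted rows."""
    if not 0 <= from_pct < to_pct <= 100:
        raise ValueError("need 0 <= from_pct < to_pct <= 100")
    start = math.floor(n_rows * from_pct / 100)
    stop = math.ceil(n_rows * to_pct / 100)
    return start, stop


def bin_edges(start, stop, n_bins):
    """Split [start, stop) into at most n_bins near-equal index ranges."""
    size = stop - start
    n_bins = min(n_bins, size)
    return [(start + j * size // n_bins, start + (j + 1) * size // n_bins)
            for j in range(n_bins)]


# Full view: 100 bins over all 16.4 million sorted rows.
full = bin_edges(*zoom_bounds(16_400_000, 0, 100), n_bins=100)
# Zoomed view: the same 100 bins, but only over the top 1% of rows.
zoomed = bin_edges(*zoom_bounds(16_400_000, 0, 1), n_bins=100)
```

In the zoomed view each bin spans 1,640 rows instead of 164,000, so local anomalies and clusters of missing values stand out much more clearly.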
<Enter tabplotd3>
tabplotd3
Interactive
- Zooming: scroll with the mouse, or select a part of the plot
- Inspecting values: hover with the mouse
- Sorting values: click on a variable
Technique
tabplotd3 is a web application:
- Rook serves the web pages and data.
- RJSONIO transfers all parameters and aggregated data as JSON.
- tabplot calculates the tableplots and color palettes.
- ffbase adds fast aggregation for large data (> 10^7).
- d3.js renders the tableplots and adds interactivity.
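The division of labour above (the server aggregates, JSON carries parameters and bin summaries, the browser renders) can be illustrated with a small request handler. The sketch below is a Python WSGI analogue and not tabplotd3's actual R code: the query parameter names `sortCol`, `from`, and `to`, the `aggregate` stand-in, and its canned payload are all assumptions made for the example.

```python
import json
from urllib.parse import parse_qs


def aggregate(sort_col, from_pct, to_pct):
    """Hypothetical stand-in for the server-side tableplot aggregation.

    A real server would sort, zoom, and bin the data here; this stub
    returns a fixed two-bin payload so the example stays self-contained.
    """
    return {"sortCol": sort_col, "from": from_pct, "to": to_pct,
            "bins": [{"age": 6.5}, {"age": 2.5}]}


def app(environ, start_response):
    """WSGI app: read plot parameters from the query string, reply with JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    sort_col = query.get("sortCol", ["age"])[0]
    try:
        from_pct = float(query.get("from", ["0"])[0])
        to_pct = float(query.get("to", ["100"])[0])
    except ValueError:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"from and to must be numbers"]
    # Only the aggregated bins cross the wire, never the raw records.
    body = json.dumps(aggregate(sort_col, from_pct, to_pct)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it locally, the app can be served with the standard library via `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The point of the design is that every zoom or sort in the browser triggers a small request and a small response, since the payload scales with the number of bins rather than with the number of records.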
Conclusion
tabplotd3 will be available in your local CRAN shortly!