SlideShare a Scribd company logo
1 of 67
Download to read offline
Alex Priem (@_alex_priem_) 
Edwin de Jonge (@edwindjonge) 
Strata, 21 nov 2014, Barcelona 
Patterns and meta patterns 
in Income Tax Data
Age vs mortgage debt (men)
Who are we? 
Statistical consultants / Data scientists 
working @ R&D department of Statistics Netherlands 
Statistics Netherlands (SN): 
-Government agency 
-Produces all official statistics of The Netherlands 
3
Income statistics based on Tax data 
4
Income Tax data 
–Contains all income tax records for the Netherlands 
–Approx 17M records with 550 variables. 
–Used to produce income statistics! 
Analysis is not trivial 
–Income Tax is complex (at least in the Netherlands) 
‐stages of progressive tax 
‐Complex Tax deductions (mortgage, flex workers) 
‐Complex Tax benefits (child care, social benefits) 
5
Tax data (2) 
-550 variables (for each person in NL): 
-15 identificators/unique keys 
-Dwelling, person id, etc. 
-70 categorical 
-250 numerical variables from the income tax form 
->200 derived variables (useful for analysis) 
-E.g. expandable income, income of dwelling/household 
6
Income/tax distributions 
Income (re)distribution hot topic since Piketty 
So how are income/tax/benefits distributed? 
-Look at 1D distributions: histograms 
-Look at 2D distributions: heatmaps 
-Problem: potentially 0.5 n(n-1) > 100k heatmaps! 
-even more when categorical included 
7
Let look at Patterns… 
8
Heatmap Patterns 
–What defines a pattern in heatmap? 
‐Peak/Spike? (mode, 0D point) 
‐Stripe (1D): 
•Horizontal Line? 
•Vertical Line? 
•Band? 
•Ridge? 
‐Blob (2D) 
‐Similarity between distributions (2D) 
9
Meta pattern? 
Meta patterns constitutes of repeating pattern in: 
‐different subpopulations 
•E.g. Male/female, Social economic status, Works in branch of Industry 
‐different pairs of variables 
•Income x age 
•Benefits x age 
•Etc. 
So patterns that are generic over different heatmaps. 
10
Looking for patterns 
Subpopulations: 
– Generate heatmap per category e.g. Age x Gross Income per social economic status 
–Automatic cluster heatmaps on distribution simularity 
Pairs of variables: 
-Generate heatmaps for all pairs 
-Prune: remove heatmaps with low support 
1. Use image classification to cluster them 
2. Or Cluster on extracted mode/line (wip) 
You will still need to look at the result! 
11
Why Visualization?
Anscombes quartet… 
13 
DS1 x 
y 
DS2 
x 
y 
DS3 
x 
y 
DS4 
x 
y 
10 
8.04 
10 
9.14 
10 
7.46 
8 
6.58 
8 
6.95 
8 
8.14 
8 
6.77 
8 
5.76 
13 
7.58 
13 
8.74 
13 
12.74 
8 
7.71 
9 
8.81 
9 
8.77 
9 
7.11 
8 
8.84 
11 
8.33 
11 
9.26 
11 
7.81 
8 
8.47 
14 
9.96 
14 
8.1 
14 
8.84 
8 
7.04 
6 
7.24 
6 
6.13 
6 
6.08 
8 
5.25 
4 
4.26 
4 
3.1 
4 
5.39 
19 
12.5 
12 
10.84 
12 
9.13 
12 
8.15 
8 
5.56 
7 
4.82 
7 
7.26 
7 
6.42 
8 
7.91 
5 
5.68 
5 
4.74 
5 
5.73 
8 
6.89
Anscombe’s quartet 
Property 
Value 
Mean of x1, x2, x3, x4 
All equal: 9 
Variance of x1, x2, x3, x4 
All equal: 11 
Mean of y1, y2, y3, y4 
All equal: 7.50 
Variance of y1, y2, y3, y4 
All equal: 4.1 
Correlation for ds1, ds2, ds3, ds4 
All equal 0.816 
Linear regression for ds1, ds2, ds3, ds4 
All equal: y = 3.00 + 0.500x 
Looks the same, right?
Lets plot!
Machine learning 
So clustering (machine learning) different? 
16
17
Visualization helps to … 
–Test your (hidden model) assumptions! 
– To find structure in data, e.g. 
“How is my data distributed?” 
–Visually explore patterns: 
‐Are there clusters? 
‐Are there outliers? 
18
19 
Heatmap recipe
20 
1. Take two numerical variables x and y 
2. Determine range 푟푥=[min푥,max⁡(푥)] 
3. Chop 푟푥 in 푛푥 equal pieces 
4. Repeat for y 
5. We now have 푛푥⁡.푛푦 bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
Easy as pie? 
Best practices and problems with heatmaps: 
-Resolution 
-Rescaling 
-Zooming 
-Outliers 
-Color scales 
21
22
23 
1. Take two numerical variables x and y 
2. Determine range 퐫퐱=[퐦퐢퐧퐱,퐦퐚퐱⁡(퐱)] 
3. Chop 푟푥 in 푛푥 equal pieces 
4. Repeat for y 
5. We now have 푛푥⁡.푛푦 bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
Range: Outliers? (1D) 
24 
+5M€ 
-1M€ 
Gross Income
Range: outliers removed (1% removed) 
25 
Gross Income 
+150k€
Range: outliers… 
Does your data contain outliers? 
-If so: most pixels are empty 
-Most cases: outliers have low mass and are barely visible 
Truncate range: in x or y direction: e.g. 99% quantile 
-Interactively: allow for zoom and pan. 
26
Range: data skewed? 
27 
–Many variables are not normal distributed: 
‐Power law: 풙훼 
‐Exponential: 푒푎풙+푏 
So rescale x or y or both
28 
1. Take two numerical variables x and y 
2. Determine range rx=[minx,max⁡(x)] 
3. Chop 풓풙 in 풏풙 equal pieces 
4. Repeat for y 
5. We now have 푛푥⁡.푛푦 bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
Chop: AKA “Binning” 
29
30 
Chop: resolution 
Resolution matters
31 
25 x 25
32 
50 x 50
33 
100 x 100
34 
250 x 250
35 
500 x 500
Chop: Too small / Too big 
If #bins too small: 
- patterns are hidden 
If #bins too large: 
- heatmap is noisy (signal vs noise) 
Optimal nr bins depends on data. 
(kernel based approx), but always play with bin size / resolution! 
36
Chop: integers… 
37
38 
1. Take two numerical variables x and y 
2. Determine range rx=[minx,max⁡(x)] 
3. Chop 푟푥 in 푛푥 equal pieces 
4. Repeat for y 
5. We now have 푛푥⁡.푛푦 bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
Count: zero counts 
Not every variable is relevant for each person! 
39
Count: exclude zero values 
40
Assign colors! 
41
42 
1. Take two numerical variables x and y 
2. Determine range rx=[minx,max⁡(x)] 
3. Chop 푟푥 in 푛푥 equal pieces 
4. Repeat for y 
5. We now have 푛푥⁡.푛푦 bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
43
Colors: scales 
–Color ‘intensity’ implies value 
–Percieved response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff) 
–Different models for color response: 
‐RGB (models computer monitor) 
‐HSV 
‐HCL 
‐CIELAB (models human eye) 
–Gradient generator: http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker 
44
Colors 
–Color has two functions in heatmap: 
‐Show ‘counts’ in your data 
‐Show ‘patterns’ 
At least, use a perceptually uniform gradient 
-Libs: chroma.js, colorbrewer (R) 
…but patterns need distinct colors 
45
Color scales 
–Range of color scale depends on distribution of data. 
–Often have multiple populations/distributions in data 
–Severe spikes/stripes drown the smaller distributions: 
‐We suggest log scale 
‐Sometimes log scale is not enough 
–In practice, linear scale with low maximum cut-off works well 
–Effect is best understood in 3D (!). 
46
Peaks are best cut-off 
47
Example: Linear gradient 
48
Log-gradient 
49
Linear gradient with cut-off 
50
Perceptually uniform gradient 
51
Colors: background/missings matters 
52
Heatmaps side-by-side: gross income, men vs women 
53 
men
Meta pattern 
Meta patterns constitutes of repeating pattern in: 
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
54
Heatmaps decomposed in subpopulations: 
55
Gross income by socioeconomic status 
56
Gross income, men, categorized by socioeconomic status 
57
Patterns 
–Stripes are real, not outliers: 
–Corresponds with benefits, tax breaks 
–Needs paradigm shift: data is not normally distributed (but we knew that). 
58
Meta pattern 
Meta patterns constitutes of repeating pattern in: 
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
59
Image classification of heatmaps 
60
No Domain knowledge required? 
61
62
Salary pay structure 
63
Domain knowledge, take II 
64
Pattern removal: Effect of weighting 
65
Summary 
Heatmaps: 
–ideal tool for analyzing big datasets 
–Be aware of perceptual and data biases! 
66
Questions? 
Thank you for your attention! 
More info? 
ah.priem@cbs.nl / @_alex_priem 
e.dejonge@cbs.nl / @edwindjonge 
Heatmapping code available at 
https://github.com/alexpriem/heatmapr 
67

More Related Content

What's hot

Multiply by 10, 100, 1000, etc...
Multiply by 10, 100, 1000, etc...Multiply by 10, 100, 1000, etc...
Multiply by 10, 100, 1000, etc...Brooke Young
 
Hexadecimal (Calculations and Explanations)
Hexadecimal (Calculations and Explanations)Hexadecimal (Calculations and Explanations)
Hexadecimal (Calculations and Explanations)Project Student
 
8 4 scientific notation - day 1
8 4 scientific notation - day 18 4 scientific notation - day 1
8 4 scientific notation - day 1bweldon
 
Stem-and-Leaf Plot and Line Plot
Stem-and-Leaf Plot and Line PlotStem-and-Leaf Plot and Line Plot
Stem-and-Leaf Plot and Line Plotsheisirenebkm
 
Teoria y problemas de tabla de frecuencias tf221 ccesa007
Teoria y problemas de tabla de frecuencias  tf221  ccesa007Teoria y problemas de tabla de frecuencias  tf221  ccesa007
Teoria y problemas de tabla de frecuencias tf221 ccesa007Demetrio Ccesa Rayme
 
4.5 multiplying and dividng by powers of 10
4.5 multiplying and dividng by powers of 104.5 multiplying and dividng by powers of 10
4.5 multiplying and dividng by powers of 10Rachel
 

What's hot (7)

Multiply by 10, 100, 1000, etc...
Multiply by 10, 100, 1000, etc...Multiply by 10, 100, 1000, etc...
Multiply by 10, 100, 1000, etc...
 
Cours Stats 5E
Cours Stats 5ECours Stats 5E
Cours Stats 5E
 
Hexadecimal (Calculations and Explanations)
Hexadecimal (Calculations and Explanations)Hexadecimal (Calculations and Explanations)
Hexadecimal (Calculations and Explanations)
 
8 4 scientific notation - day 1
8 4 scientific notation - day 18 4 scientific notation - day 1
8 4 scientific notation - day 1
 
Stem-and-Leaf Plot and Line Plot
Stem-and-Leaf Plot and Line PlotStem-and-Leaf Plot and Line Plot
Stem-and-Leaf Plot and Line Plot
 
Teoria y problemas de tabla de frecuencias tf221 ccesa007
Teoria y problemas de tabla de frecuencias  tf221  ccesa007Teoria y problemas de tabla de frecuencias  tf221  ccesa007
Teoria y problemas de tabla de frecuencias tf221 ccesa007
 
4.5 multiplying and dividng by powers of 10
4.5 multiplying and dividng by powers of 104.5 multiplying and dividng by powers of 10
4.5 multiplying and dividng by powers of 10
 

Viewers also liked

Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data VisualizationEdwin de Jonge
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsAjay Ohri
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text filesEdwin de Jonge
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasDataWorks Summit/Hadoop Summit
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Ryan Rosario
 

Viewers also liked (8)

Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
 

Similar to Heatmaps best practices Strata Hadoop

Data Analysis_6 oct 2016.ppt
Data Analysis_6 oct 2016.pptData Analysis_6 oct 2016.ppt
Data Analysis_6 oct 2016.pptSATYAJIT58
 
Displaying quantitative data
Displaying quantitative dataDisplaying quantitative data
Displaying quantitative dataUlster BOCES
 
Lecture 10.1 10.2 bt
Lecture 10.1 10.2 btLecture 10.1 10.2 bt
Lecture 10.1 10.2 btbtmathematics
 
counting techniques
counting techniquescounting techniques
counting techniquesUnsa Shakir
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
How to manually equalize the histograms of two (or more) subvolumes, measured...
How to manually equalize the histograms of two (or more) subvolumes, measured...How to manually equalize the histograms of two (or more) subvolumes, measured...
How to manually equalize the histograms of two (or more) subvolumes, measured...Javier García Molleja
 
3 3 polynomial inequalities in two variables
3 3 polynomial inequalities in two variables3 3 polynomial inequalities in two variables
3 3 polynomial inequalities in two variableshisema01
 
Presentation Math Workshop#May 25th New Help our teachers understa...
Presentation Math Workshop#May 25th New            Help our teachers understa...Presentation Math Workshop#May 25th New            Help our teachers understa...
Presentation Math Workshop#May 25th New Help our teachers understa...guest80c0981
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reductionMarco Quartulli
 
New approaches in linear inequalities
New approaches in linear inequalitiesNew approaches in linear inequalities
New approaches in linear inequalitiesTarun Gehlot
 
Do's and Don'ts of using t-SNE.pdf
Do's and Don'ts of using t-SNE.pdfDo's and Don'ts of using t-SNE.pdf
Do's and Don'ts of using t-SNE.pdfFrankClat
 
Matematica 1 ro sec segundo trimestre
Matematica 1 ro sec segundo trimestreMatematica 1 ro sec segundo trimestre
Matematica 1 ro sec segundo trimestreErickM20
 
dynamic programming Rod cutting class
dynamic programming Rod cutting classdynamic programming Rod cutting class
dynamic programming Rod cutting classgiridaroori
 
[Maths] arithmetic
[Maths] arithmetic[Maths] arithmetic
[Maths] arithmeticOurutopy
 

Similar to Heatmaps best practices Strata Hadoop (20)

Data Analysis_6 oct 2016.ppt
Data Analysis_6 oct 2016.pptData Analysis_6 oct 2016.ppt
Data Analysis_6 oct 2016.ppt
 
Displaying quantitative data
Displaying quantitative dataDisplaying quantitative data
Displaying quantitative data
 
Lecture 10.1 10.2 bt
Lecture 10.1 10.2 btLecture 10.1 10.2 bt
Lecture 10.1 10.2 bt
 
counting techniques
counting techniquescounting techniques
counting techniques
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
How to manually equalize the histograms of two (or more) subvolumes, measured...
How to manually equalize the histograms of two (or more) subvolumes, measured...How to manually equalize the histograms of two (or more) subvolumes, measured...
How to manually equalize the histograms of two (or more) subvolumes, measured...
 
3 3 polynomial inequalities in two variables
3 3 polynomial inequalities in two variables3 3 polynomial inequalities in two variables
3 3 polynomial inequalities in two variables
 
Presentation Math Workshop#May 25th New Help our teachers understa...
Presentation Math Workshop#May 25th New            Help our teachers understa...Presentation Math Workshop#May 25th New            Help our teachers understa...
Presentation Math Workshop#May 25th New Help our teachers understa...
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
New approaches in linear inequalities
New approaches in linear inequalitiesNew approaches in linear inequalities
New approaches in linear inequalities
 
Binary.pdf
Binary.pdfBinary.pdf
Binary.pdf
 
Seaborn visualization.pptx
Seaborn visualization.pptxSeaborn visualization.pptx
Seaborn visualization.pptx
 
Do's and Don'ts of using t-SNE.pdf
Do's and Don'ts of using t-SNE.pdfDo's and Don'ts of using t-SNE.pdf
Do's and Don'ts of using t-SNE.pdf
 
Statisics task fian's group
Statisics task fian's groupStatisics task fian's group
Statisics task fian's group
 
Statiska
StatiskaStatiska
Statiska
 
Matematica 1 ro sec segundo trimestre
Matematica 1 ro sec segundo trimestreMatematica 1 ro sec segundo trimestre
Matematica 1 ro sec segundo trimestre
 
dynamic programming Rod cutting class
dynamic programming Rod cutting classdynamic programming Rod cutting class
dynamic programming Rod cutting class
 
Counting
CountingCounting
Counting
 
[Maths] arithmetic
[Maths] arithmetic[Maths] arithmetic
[Maths] arithmetic
 

More from Edwin de Jonge

Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesEdwin de Jonge
 
Data error! But where?
Data error! But where?Data error! But where?
Data error! But where?Edwin de Jonge
 
Daff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frameDaff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frameEdwin de Jonge
 
Uncertainty visualisation
Uncertainty visualisationUncertainty visualisation
Uncertainty visualisationEdwin de Jonge
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
Tabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large dataTabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large dataEdwin de Jonge
 
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statisticsEdwin de Jonge
 
Statmine, Visuele dataexploratie
Statmine, Visuele dataexploratieStatmine, Visuele dataexploratie
Statmine, Visuele dataexploratieEdwin de Jonge
 
StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)Edwin de Jonge
 
StatMine, visual exploration of output data
StatMine, visual exploration of output dataStatMine, visual exploration of output data
StatMine, visual exploration of output dataEdwin de Jonge
 

More from Edwin de Jonge (13)

sdcSpatial user!2019
sdcSpatial user!2019sdcSpatial user!2019
sdcSpatial user!2019
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Data error! But where?
Data error! But where?Data error! But where?
Data error! But where?
 
Daff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frameDaff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frame
 
Uncertainty visualisation
Uncertainty visualisationUncertainty visualisation
Uncertainty visualisation
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Big data experiments
Big data experimentsBig data experiments
Big data experiments
 
StatMine
StatMineStatMine
StatMine
 
Tabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large dataTabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large data
 
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
 
Statmine, Visuele dataexploratie
Statmine, Visuele dataexploratieStatmine, Visuele dataexploratie
Statmine, Visuele dataexploratie
 
StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)
 
StatMine, visual exploration of output data
StatMine, visual exploration of output dataStatMine, visual exploration of output data
StatMine, visual exploration of output data
 

Recently uploaded

Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Heatmaps best practices Strata Hadoop

  • 1. Alex Priem (@_alex_priem_) Edwin de Jonge (@edwindjonge) Strata, 21 nov 2014, Barcelona Patterns and meta patterns in Income Tax Data
  • 2. Age vs mortgage debt (men)
  • 3. Who are we? Statistical consultants / Data scientists working @ R&D department of Statistics Netherlands Statistics Netherlands (SN): -Government agency -Produces all official statistics of The Netherlands 3
  • 4. Income statistics based on Tax data 4
  • 5. Income Tax data –Contains all income tax records for the Netherlands –Approx 17M records with 550 variables. –Used to produce income statistics! Analysis is not trivial –Income Tax is complex (at least in the Netherlands) ‐stages of progressive tax ‐Complex Tax deductions (mortgage, flex workers) ‐Complex Tax benefits (child care, social benefits) 5
  • 6. Tax data (2) -550 variables (for each person in NL): -15 identificators/unique keys -Dwelling, person id, etc. -70 categorical -250 numerical variables from the income tax form ->200 derived variables (useful for analysis) -E.g. expandable income, income of dwelling/household 6
  • 7. Income/tax distributions Income (re)distribution hot topic since Piketty So how are income/tax/benefits distributed? -Look at 1D distributions: histograms -Look at 2D distributions: heatmaps -Problem: potentially 0.5 n(n-1) > 100k heatmaps! -even more when categorical included 7
  • 8. Let look at Patterns… 8
  • 9. Heatmap Patterns –What defines a pattern in heatmap? ‐Peak/Spike? (mode, 0D point) ‐Stripe (1D): •Horizontal Line? •Vertical Line? •Band? •Ridge? ‐Blob (2D) ‐Similarity between distributions (2D) 9
  • 10. Meta pattern? Meta patterns constitutes of repeating pattern in: ‐different subpopulations •E.g. Male/female, Social economic status, Works in branch of Industry ‐different pairs of variables •Income x age •Benefits x age •Etc. So patterns that are generic over different heatmaps. 10
  • 11. Looking for patterns Subpopulations: – Generate heatmap per category e.g. Age x Gross Income per social economic status –Automatic cluster heatmaps on distribution simularity Pairs of variables: -Generate heatmaps for all pairs -Prune: remove heatmaps with low support 1. Use image classification to cluster them 2. Or Cluster on extracted mode/line (wip) You will still need to look at the result! 11
  • 13. Anscombes quartet… 13 DS1 x y DS2 x y DS3 x y DS4 x y 10 8.04 10 9.14 10 7.46 8 6.58 8 6.95 8 8.14 8 6.77 8 5.76 13 7.58 13 8.74 13 12.74 8 7.71 9 8.81 9 8.77 9 7.11 8 8.84 11 8.33 11 9.26 11 7.81 8 8.47 14 9.96 14 8.1 14 8.84 8 7.04 6 7.24 6 6.13 6 6.08 8 5.25 4 4.26 4 3.1 4 5.39 19 12.5 12 10.84 12 9.13 12 8.15 8 5.56 7 4.82 7 7.26 7 6.42 8 7.91 5 5.68 5 4.74 5 5.73 8 6.89
  • 14. Anscombe’s quartet Property Value Mean of x1, x2, x3, x4 All equal: 9 Variance of x1, x2, x3, x4 All equal: 11 Mean of y1, y2, y3, y4 All equal: 7.50 Variance of y1, y2, y3, y4 All equal: 4.1 Correlation for ds1, ds2, ds3, ds4 All equal 0.816 Linear regression for ds1, ds2, ds3, ds4 All equal: y = 3.00 + 0.500x Looks the same, right?
  • 16. Machine learning So clustering (machine learning) different? 16
  • 17. 17
  • 18. Visualization helps to … –Test your (hidden model) assumptions! – To find structure in data, e.g. “How is my data distributed?” –Visually explore patterns: ‐Are there clusters? ‐Are there outliers? 18
  • 20. 20 1. Take two numerical variables x and y 2. Determine range 푟푥=[min푥,max⁡(푥)] 3. Chop 푟푥 in 푛푥 equal pieces 4. Repeat for y 5. We now have 푛푥⁡.푛푦 bins 6. Count # records in each bin 7. Assign colors to counts 8. Plot matrix 9. Enjoy!
  • 21. Easy as pie? Best practices and problems with heatmaps: -Resolution -Rescaling -Zooming -Outliers -Color scales 21
  • 22. 22
  • 23. 23 1. Take two numerical variables x and y 2. Determine range 퐫퐱=[퐦퐢퐧퐱,퐦퐚퐱⁡(퐱)] 3. Chop 푟푥 in 푛푥 equal pieces 4. Repeat for y 5. We now have 푛푥⁡.푛푦 bins 6. Count # records in each bin 7. Assign colors to counts 8. Plot matrix 9. Enjoy!
  • 24. Range: Outliers? (1D) 24 +5M€ -1M€ Gross Income
  • 25. Range: outliers removed (1% removed) 25 Gross Income +150k€
  • 26. Range: outliers… Does your data contain outliers? -If so: most pixels are empty -Most cases: outliers have low mass and are barely visible Truncate range: in x or y direction: e.g. 99% quantile -Interactively: allow for zoom and pan. 26
  • 27. Range: data skewed? 27 –Many variables are not normal distributed: ‐Power law: 풙훼 ‐Exponential: 푒푎풙+푏 So rescale x or y or both
  • 28. 28 1. Take two numerical variables x and y 2. Determine range rx=[minx,max⁡(x)] 3. Chop 풓풙 in 풏풙 equal pieces 4. Repeat for y 5. We now have 푛푥⁡.푛푦 bins 6. Count # records in each bin 7. Assign colors to counts 8. Plot matrix 9. Enjoy!
  • 30. 30 Chop: resolution Resolution matters
  • 31. 31 25 x 25
  • 32. 32 50 x 50
  • 33. 33 100 x 100
  • 34. 34 250 x 250
  • 35. 35 500 x 500
  • 36. Chop: Too small / Too big If #bins too small: - patterns are hidden If #bins too large: - heatmap is noisy (signal vs noise) Optimal nr bins depends on data. (kernel based approx), but always play with bin size / resolution! 36
  • 38. 38 1. Take two numerical variables x and y 2. Determine range rx=[minx,max⁡(x)] 3. Chop 푟푥 in 푛푥 equal pieces 4. Repeat for y 5. We now have 푛푥⁡.푛푦 bins 6. Count # records in each bin 7. Assign colors to counts 8. Plot matrix 9. Enjoy!
  • 39. Count: zero counts Not every variable is relevant for each person! 39
  • 40. Count: exclude zero values 40
  • 42. 42 1. Take two numerical variables x and y 2. Determine range rx=[minx,max⁡(x)] 3. Chop 푟푥 in 푛푥 equal pieces 4. Repeat for y 5. We now have 푛푥⁡.푛푦 bins 6. Count # records in each bin 7. Assign colors to counts 8. Plot matrix 9. Enjoy!
  • 43. 43
  • 44. Colors: scales –Color ‘intensity’ implies value –Percieved response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff) –Different models for color response: ‐RGB (models computer monitor) ‐HSV ‐HCL ‐CIELAB (models human eye) –Gradient generator: http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker 44
  • 45. Colors –Color has two functions in heatmap: ‐Show ‘counts’ in your data ‐Show ‘patterns’ At least, use a perceptually uniform gradient -Libs: chroma.js, colorbrewer (R) …but patterns need distinct colors 45
  • 46. Color scales –Range of color scale depends on distribution of data. –Often have multiple populations/distributions in data –Severe spikes/stripes drown the smaller distributions: ‐We suggest log scale ‐Sometimes log scale is not enough –In practice, linear scale with low maximum cut-off works well –Effect is best understood in 3D (!). 46
  • 47. Peaks are best cut-off 47
  • 50. Linear gradient with cut-off 50
  • 53. Heatmaps side-by-side: gross income, men vs women 53 men
  • 54. Meta pattern Meta patterns constitutes of repeating pattern in: ‐different subpopulations ‐different pairs of variables So patterns that are generic over different heatmaps. 54
  • 55. Heatmaps decomposed in subpopulations: 55
  • 56. Gross income by socioeconomic status 56
  • 57. Gross income, men, categorized by socioeconomic status 57
  • 58. Patterns –Stripes are real, not outliers: –Corresponds with benefits, tax breaks –Needs paradigm shift: data is not normally distributed (but we knew that). 58
  • 59. Meta pattern Meta patterns constitutes of repeating pattern in: ‐different subpopulations ‐different pairs of variables So patterns that are generic over different heatmaps. 59
  • 60. Image classification of heatmaps 60
  • 61. No Domain knowledge required? 61
  • 62. 62
  • 65. Pattern removal: Effect of weighting 65
  • 66. Summary Heatmaps: –ideal tool for analyzing big datasets –Be aware of perceptual and data biases! 66
  • 67. Questions? Thank you for your attention! More info? ah.priem@cbs.nl / @_alex_priem e.dejonge@cbs.nl / @edwindjonge Heatmapping code available at https://github.com/alexpriem/heatmapr 67