Alex Priem (@_alex_priem_) 
Edwin de Jonge (@edwindjonge) 
Strata, 21 nov 2014, Barcelona 
Patterns and meta patterns 
in Income Tax Data
Age vs mortgage debt (men)
Who are we? 
Statistical consultants / Data scientists 
working @ R&D department of Statistics Netherlands 
Statistics Netherlands (SN): 
-Government agency 
-Produces all official statistics of The Netherlands 
Income statistics based on Tax data 
Income Tax data 
–Contains all income tax records for the Netherlands 
–Approx 17M records with 550 variables. 
–Used to produce income statistics! 
Analysis is not trivial 
–Income Tax is complex (at least in the Netherlands) 
‐stages of progressive tax 
‐Complex Tax deductions (mortgage, flex workers) 
‐Complex Tax benefits (child care, social benefits) 
Tax data (2) 
-550 variables (for each person in NL): 
-15 identifiers/unique keys 
-Dwelling, person id, etc. 
-70 categorical 
-250 numerical variables from the income tax form 
->200 derived variables (useful for analysis) 
-E.g. spendable (disposable) income, income of dwelling/household 
Income/tax distributions 
Income (re)distribution hot topic since Piketty 
So how are income/tax/benefits distributed? 
-Look at 1D distributions: histograms 
-Look at 2D distributions: heatmaps 
-Problem: potentially 0.5 n(n-1) > 100k heatmaps! 
-even more when categorical included 
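The pair count above can be checked with a few lines of Python. This is only a sketch: the slides say ">200" derived variables, so the figure of 200 used below is an assumed lower bound, and `n_heatmaps` is a name made up for this example.

```python
from math import comb


def n_heatmaps(n_variables: int) -> int:
    """Number of unordered pairs of distinct variables: 0.5 * n * (n - 1)."""
    return comb(n_variables, 2)


# 250 tax-form variables plus an assumed 200 derived ones (the slide says ">200")
n_numerical = 250 + 200
print(n_heatmaps(n_numerical))  # 101025
print(n_heatmaps(550))          # 150975 if every one of the 550 variables were paired
```

Either way the number of candidate heatmaps is above 100k, far too many to inspect by hand.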
Let's look at patterns… 
Heatmap Patterns 
–What defines a pattern in heatmap? 
‐Peak/Spike? (mode, 0D point) 
‐Stripe (1D): 
•Horizontal Line? 
•Vertical Line? 
•Band? 
•Ridge? 
‐Blob (2D) 
‐Similarity between distributions (2D) 
Meta pattern? 
A meta pattern consists of a pattern that repeats in: 
‐different subpopulations 
•E.g. male/female, socioeconomic status, branch of industry worked in 
‐different pairs of variables 
•Income x age 
•Benefits x age 
•Etc. 
So patterns that are generic over different heatmaps. 
Looking for patterns 
Subpopulations: 
– Generate a heatmap per category, e.g. Age x Gross Income per socioeconomic status 
–Automatically cluster heatmaps on distribution similarity 
Pairs of variables: 
-Generate heatmaps for all pairs 
-Prune: remove heatmaps with low support 
1. Use image classification to cluster them 
2. Or cluster on extracted mode/line (work in progress) 
You will still need to look at the result! 
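A minimal sketch of the pruning and similarity steps. Assumptions: "support" is taken to be the number of records in a heatmap, and similarity is measured with the total variation distance between normalised count matrices. The slides do not say which measure was actually used, and the function names here are hypothetical.

```python
def support(counts):
    """Total number of records that landed in the heatmap."""
    return sum(sum(row) for row in counts)


def prune(heatmaps, min_support):
    """Keep only heatmaps (name -> count matrix) with at least min_support records."""
    return {name: c for name, c in heatmaps.items() if support(c) >= min_support}


def distance(a, b):
    """Total variation distance between two non-empty count matrices of equal shape.

    Each matrix is normalised to sum to 1 first, so only the shape of the
    distribution matters: 0 means identical shape, 1 means no overlap at all.
    """
    sa, sb = support(a), support(b)
    return 0.5 * sum(abs(x / sa - y / sb)
                     for row_a, row_b in zip(a, b)
                     for x, y in zip(row_a, row_b))


heatmaps = {
    "men": [[10, 0], [0, 10]],
    "women": [[5, 0], [0, 5]],
    "rare": [[1, 0], [0, 0]],
}
kept = prune(heatmaps, min_support=5)
print(sorted(kept))                                # 'rare' is dropped
print(distance(kept["men"], kept["women"]))        # same shape, different size
print(distance(kept["men"], [[0, 10], [10, 0]]))   # no overlap
```

A pairwise distance table like this could feed any standard clustering method; the result still has to be inspected visually.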
Why Visualization?
Anscombe's quartet… 
     DS1            DS2            DS3            DS4
   x      y       x      y       x      y       x      y
  10   8.04      10   9.14      10   7.46       8   6.58
   8   6.95       8   8.14       8   6.77       8   5.76
  13   7.58      13   8.74      13  12.74       8   7.71
   9   8.81       9   8.77       9   7.11       8   8.84
  11   8.33      11   9.26      11   7.81       8   8.47
  14   9.96      14   8.10      14   8.84       8   7.04
   6   7.24       6   6.13       6   6.08       8   5.25
   4   4.26       4   3.10       4   5.39      19  12.50
  12  10.84      12   9.13      12   8.15       8   5.56
   7   4.82       7   7.26       7   6.42       8   7.91
   5   5.68       5   4.74       5   5.73       8   6.89
Anscombe’s quartet 
Property                                    Value
Mean of x1, x2, x3, x4                      All equal: 9
Variance of x1, x2, x3, x4                  All equal: 11
Mean of y1, y2, y3, y4                      All equal: 7.50
Variance of y1, y2, y3, y4                  All equal: 4.1
Correlation for ds1, ds2, ds3, ds4          All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4    All equal: y = 3.00 + 0.500x
Looks the same, right?
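The summary statistics can be reproduced from the data table with the Python standard library. A small sketch: the data are copied from the quartet table, and `summarize` is a helper name invented for this example.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # DS1, DS2 and DS3 share the same x
quartet = {
    "DS1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, correlation and least-squares line for one dataset."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    # sample covariance, same n - 1 denominator as statistics.variance
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "corr": cov / sqrt(vx * vy),
        "slope": slope, "intercept": my - slope * mx,
    }


for name, (x, y) in quartet.items():
    s = summarize(x, y)
    print(name, {k: round(v, 3) for k, v in s.items()})
```

All four lines of output agree to two or three significant digits, which is exactly why the numbers alone are not enough.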
Let's plot!
Machine learning 
So is clustering (machine learning) different? 
Visualization helps to … 
–Test your (hidden model) assumptions! 
– To find structure in data, e.g. 
“How is my data distributed?” 
–Visually explore patterns: 
‐Are there clusters? 
‐Are there outliers? 
Heatmap recipe
1. Take two numerical variables x and y 
2. Determine range r_x = [min(x), max(x)] 
3. Chop r_x in n_x equal pieces 
4. Repeat for y 
5. We now have n_x · n_y bins 
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
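The recipe can be sketched in plain Python. This is an illustration under stated assumptions, not the heatmapr implementation: the function names are made up here, the maximum value is placed in the last bin, and "plotting" is reduced to text characters so the example has no dependencies.

```python
def bin_index(value, lo, hi, n):
    """Index of the equal-width bin that holds value; the maximum falls in the last bin."""
    if hi == lo:                                  # degenerate range: everything in bin 0
        return 0
    return min(int((value - lo) / (hi - lo) * n), n - 1)


def heatmap_counts(xs, ys, nx, ny):
    """Steps 1-6: count records on an nx-by-ny grid; counts[row][col], row indexes y."""
    x_min, x_max = min(xs), max(xs)               # step 2: range r_x
    y_min, y_max = min(ys), max(ys)               # step 4: same for y
    counts = [[0] * nx for _ in range(ny)]        # step 5: n_x * n_y bins
    for x, y in zip(xs, ys):                      # step 6: count records per bin
        col = bin_index(x, x_min, x_max, nx)      # step 3: chop r_x in n_x pieces
        row = bin_index(y, y_min, y_max, ny)
        counts[row][col] += 1
    return counts


def render(counts, shades=" .:*#"):
    """Steps 7-8: map counts linearly to characters and draw the matrix as text."""
    top = max(max(row) for row in counts) or 1    # avoid dividing by zero on empty grids
    rows = []
    for row in reversed(counts):                  # largest y at the top, like a plot
        rows.append("".join(shades[round(c / top * (len(shades) - 1))] for c in row))
    return "\n".join(rows)


xs = [0, 1, 1, 2, 4]
ys = [0, 1, 1, 3, 4]
counts = heatmap_counts(xs, ys, nx=2, ny=2)
print(counts)
print(render(counts))
```

Every choice hidden in this sketch (range, number of bins, colour mapping) is one of the problem areas discussed in the following slides.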
Easy as pie? 
Best practices and problems with heatmaps: 
-Resolution 
-Rescaling 
-Zooming 
-Outliers 
-Color scales 
Range: Outliers? (1D) 
[Figure: 1D distribution of Gross Income; axis labels +5M€ and -1M€]
Range: outliers removed (1% removed) 
[Figure: 1D distribution of Gross Income; axis label +150k€]
Range: outliers… 
Does your data contain outliers? 
-If so: most pixels are empty 
-Most cases: outliers have low mass and are barely visible 
Truncate the range in x or y direction, e.g. at the 99% quantile 
-Interactively: allow for zoom and pan. 
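Truncating at a quantile could look like the sketch below. It assumes a nearest-rank percentile and truncates on x only; the helper names, the integer `percent` parameter and the toy income data are all invented for this example.

```python
def percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least percent % of the data at or below it."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100    # ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]


def truncate(xs, ys, low=0, high=99):
    """Keep only records whose x lies between the low and high percentiles of x."""
    lo, hi = percentile(xs, low), percentile(xs, high)
    kept = [(x, y) for x, y in zip(xs, ys) if lo <= x <= hi]
    return [x for x, _ in kept], [y for _, y in kept]


# toy data: 99 ordinary incomes and a single extreme one
gross = list(range(1, 100)) + [5_000_000]
age = [40] * 100
gross_t, age_t = truncate(gross, age, high=99)
print(len(gross_t), max(gross_t))
```

Without the truncation the single extreme record would stretch the x range so far that all other records share one bin. Setting `low` above 0 trims the negative tail in the same way.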
Range: data skewed? 
–Many variables are not normally distributed: 
‐Power law: x^α 
‐Exponential: e^(ax+b) 
So rescale x or y or both
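One way to rescale before binning, as a sketch. The slides do not say which transform was used; the symmetric log below is an assumption chosen because amounts such as gross income can be zero or negative, where a plain log10 is undefined.

```python
import math


def signed_log(value):
    """Symmetric log transform: sign(v) * log10(1 + |v|), defined for all real v."""
    return math.copysign(math.log10(1 + abs(value)), value)


incomes = [-999_999, -9, 0, 9, 99, 999_999]
print([round(signed_log(v), 3) for v in incomes])
```

The transformed values are then binned with equal-width bins as in the recipe. For strictly positive variables `math.log10` can be applied directly.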
Chop: AKA “Binning” 
Chop: resolution 
Resolution matters 
[Figures: heatmaps at 25 x 25, 50 x 50, 100 x 100, 250 x 250 and 500 x 500 bins]
Chop: Too small / Too big 
If the number of bins is too small: 
- patterns are hidden 
If the number of bins is too large: 
- heatmap is noisy (signal vs noise) 
The optimal number of bins depends on the data (kernel-based approximations exist), but always play with bin size / resolution! 
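A cheap way to play with resolution is to count once on a fine grid and merge neighbouring bins afterwards. The sketch below is one possible approach, not taken from the slides; `coarsen` is a hypothetical helper and requires the grid size to be a multiple of the merge factor.

```python
def coarsen(counts, factor):
    """Merge factor-by-factor blocks of bins by summing their counts."""
    ny, nx = len(counts), len(counts[0])
    if ny % factor or nx % factor:
        raise ValueError("grid size must be a multiple of factor")
    return [
        [
            sum(counts[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, nx, factor)
        ]
        for r in range(0, ny, factor)
    ]


fine = [
    [1, 0, 2, 0],
    [0, 1, 0, 2],
    [3, 0, 0, 0],
    [0, 3, 0, 4],
]
print(coarsen(fine, 2))   # [[2, 4], [6, 4]]
print(coarsen(fine, 4))   # [[16]]
```

Starting from a 500 x 500 grid, factors 2, 5, 10 and 20 give the 250, 100, 50 and 25 resolutions without rereading the data.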
Chop: integers… 
Count: zero counts 
Not every variable is relevant for each person! 
Count: exclude zero values 
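Excluding zero values before counting might look like this. It is an assumption that a record is dropped when either variable is zero; the slides only state that zero values are excluded, and the variable names are illustrative.

```python
def exclude_zeros(xs, ys):
    """Drop records where x or y is zero, i.e. where a variable does not apply to the person."""
    kept = [(x, y) for x, y in zip(xs, ys) if x != 0 and y != 0]
    return [x for x, _ in kept], [y for _, y in kept]


gross_income = [0, 30_000, 45_000, 0]
mortgage_debt = [0, 0, 120_000, 5_000]
print(exclude_zeros(gross_income, mortgage_debt))   # ([45000], [120000])
```

Without this step the bins along the zero axes hold most of the records and dominate the colour scale.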
Assign colors! 
Colors: scales 
–Color ‘intensity’ implies value 
–Perceived response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff) 
–Different models for color response: 
‐RGB (models computer monitor) 
‐HSV 
‐HCL 
‐CIELAB (models human eye) 
–Gradient generator: http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker 
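The difference between #00ff00 and #0000ff can be made concrete with relative luminance. A sketch: relative luminance computed from sRGB is used here as a simple stand-in for perceived lightness; it is not the full CIELAB model named above.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB colour such as '#00ff00'."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))

    def linear(c):
        # undo the sRGB transfer curve to get linear light
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b)


print(round(relative_luminance("#00ff00"), 4))   # 0.7152
print(round(relative_luminance("#0000ff"), 4))   # 0.0722
```

Both colours have the same RGB "intensity" (one channel at full value), yet pure green is roughly ten times as luminous as pure blue. A gradient that is linear in RGB is therefore not linear to the eye.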
Colors 
–Color has two functions in heatmap: 
‐Show ‘counts’ in your data 
‐Show ‘patterns’ 
At least, use a perceptually uniform gradient 
-Libs: chroma.js, colorbrewer (R) 
…but patterns need distinct colors 
Color scales 
–Range of color scale depends on distribution of data. 
–Often have multiple populations/distributions in data 
–Severe spikes/stripes drown the smaller distributions: 
‐We suggest log scale 
‐Sometimes log scale is not enough 
–In practice, linear scale with low maximum cut-off works well 
–Effect is best understood in 3D (!). 
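The three scalings can be compared on a toy example. A sketch with made-up counts and an arbitrary cut-off of 100; each function maps a bin count to a position between 0 and 1 on the colour gradient.

```python
import math


def linear_scale(count, maximum):
    """Linear position on the gradient; counts at or above maximum saturate at 1."""
    return min(count / maximum, 1.0)


def log_scale(count, maximum):
    """Logarithmic position on the gradient; log1p keeps empty bins at 0."""
    return min(math.log1p(count) / math.log1p(maximum), 1.0)


counts = [1, 20, 50, 100_000]        # three ordinary bins and one severe spike
top = max(counts)

print([round(linear_scale(c, top), 4) for c in counts])   # spike uses the whole range
print([round(log_scale(c, top), 4) for c in counts])      # small bins become visible
print([round(linear_scale(c, 100), 4) for c in counts])   # linear with low cut-off
```

With the plain linear scale the ordinary bins are indistinguishable from zero. The log scale lifts them, and the linear scale with a low cut-off spreads them over the gradient while the spike simply saturates.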
Peaks are best cut off 
Example: Linear gradient 
Log-gradient 
Linear gradient with cut-off 
Perceptually uniform gradient 
Colors: background/missing values matter 
Heatmaps side-by-side: gross income, men vs women 
Meta pattern 
A meta pattern consists of a pattern that repeats in: 
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
Heatmaps decomposed in subpopulations: 
Gross income by socioeconomic status 
Gross income, men, categorized by socioeconomic status 
Patterns 
–Stripes are real, not outliers: 
–They correspond to benefits and tax breaks 
–Needs a paradigm shift: data is not normally distributed (but we knew that). 
Meta pattern 
A meta pattern consists of a pattern that repeats in: 
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
Image classification of heatmaps 
No domain knowledge required? 
Salary pay structure 
Domain knowledge, take II 
Pattern removal: Effect of weighting 
Summary 
Heatmaps: 
–ideal tool for analyzing big datasets 
–Be aware of perceptual and data biases! 
Questions? 
Thank you for your attention! 
More info? 
ah.priem@cbs.nl / @_alex_priem 
e.dejonge@cbs.nl / @edwindjonge 
Heatmapping code available at 
https://github.com/alexpriem/heatmapr 
