-
1.
1
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
A TOOL AGNOSTIC APPROACH
-
2.
2
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE
ANALYSIS
BIVARIATE
ANALYSIS
-
3.
LET’S TAKE A DATASET
3
Each row has details about an employee who has left the organization.
Just “reading” the dataset is quite informative.
-
4.
DESCRIBE THE DATA IN A STRUCTURED WAY
4
-
5.
5
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE
ANALYSIS
BIVARIATE
ANALYSIS
-
6.
CATEGORICAL COLUMNS YIELD VERY LITTLE DATA
6
There’s not much information in one column.
The values are not quantitative,
so a distribution is not meaningful.
The values are not even ordered.
In fact, the only thing we have is the list of values
and their count.
... or is there more to this?
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
-
7.
... BUT RANK FREQUENCY IS STILL POSSIBLE
7
The rank of the row provides additional information.
With this, we can explore the distribution of the rank against the count.
These distributions are called rank-frequency distributions.
Rank Region Count
1 India 10780
2 Headstrong 1554
3 China 1130
4 Philippines 1030
5 US 792
6 Romania 788
7 Mexico 324
8 Guatemala 233
9 Poland 124
10 Brazil 45
11 Hungary 41
12 Colombia 38
13 Netherlands 33
14 South Africa 30
15 UK 18
16 UAE 15
17 GMS India 15
18 Japan 11
19 CZECH Republic 10
20 Kenya 9
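As a minimal sketch of how the rank column can be derived, the snippet below sorts the Region counts from the slide and numbers them. The function name `rank_frequency` is an illustrative choice, not part of any tool mentioned in the deck.

```python
# Region counts copied from the slide (the 20 values shown).
region_counts = {
    "India": 10780, "Headstrong": 1554, "China": 1130, "Philippines": 1030,
    "US": 792, "Romania": 788, "Mexico": 324, "Guatemala": 233,
    "Poland": 124, "Brazil": 45, "Hungary": 41, "Colombia": 38,
    "Netherlands": 33, "South Africa": 30, "UK": 18, "UAE": 15,
    "GMS India": 15, "Japan": 11, "CZECH Republic": 10, "Kenya": 9,
}


def rank_frequency(counts):
    """Return (rank, value, count) tuples, most frequent value first."""
    # sorted() is stable, so tied values keep their original relative order.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, value, count)
            for rank, (value, count) in enumerate(ordered, start=1)]


table = rank_frequency(region_counts)
for rank, value, count in table[:5]:
    print(rank, value, count)
```

The same function works for any categorical column once its values have been counted.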
-
8.
REGION SHOWS A POWER LAW DISTRIBUTION
8
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
Rank on a log scale
Frequency on a log scale
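A roughly straight line on log-log axes is what suggests a power law. One rough way to check this without a plot is to fit a least-squares line to log(count) against log(rank). The sketch below does that for the Region counts; it is a quick diagnostic under that assumption, not a rigorous statistical test, and `loglog_slope` is a hypothetical helper name.

```python
import math

# Counts from the Region slide, already sorted from most to least frequent.
counts = [10780, 1554, 1130, 1030, 792, 788, 324, 233, 124, 45,
          41, 38, 33, 30, 18, 15, 15, 11, 10, 9]


def loglog_slope(sorted_counts):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(sorted_counts) + 1)]
    ys = [math.log(count) for count in sorted_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


slope = loglog_slope(counts)
print(f"slope on log-log axes: {slope:.2f}")
```

A negative slope with points close to the fitted line is consistent with a power law; the slope estimates the exponent.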
-
9.
COST CODE SHOWS A POWER LAW DISTRIBUTION
9
Cost Code Count
105 9542
121 1757
125 875
122 796
3001 654
3310 635
124 435
131 415
115 336
nan 207
101 205
127 173
109 148
116 91
126 66
...
-
10.
LE SHOWS A POWER LAW DISTRIBUTION
10
LE Count
D84 11487
GPL 853
RM1 789
LC2 565
GMR 323
D95 247
GUT 233
ML1 223
CTK 184
AXE 127
A38 98
A21 79
EMP 61
BRL 45
A66 43
...
-
11.
11
WHAT CAUSES
POWER LAW DISTRIBUTIONS?
PREFERENTIAL
ATTACHMENT
EXPONENTIAL
GROWTH
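To illustrate how preferential attachment can produce a heavy-tailed distribution, here is a small simulation sketch. The model is a simple "rich get richer" process chosen for illustration: each arrival either starts a new item or joins an existing item with probability proportional to its size. The parameter names and the 10% new-item probability are assumptions, not values from the deck.

```python
import random
from collections import Counter


def simulate_preferential_attachment(steps, new_item_probability=0.1, seed=0):
    """Simulate arrivals that prefer to join already-large items."""
    rng = random.Random(seed)
    memberships = [0]   # one entry per arrival, holding the item it joined
    item_count = 1
    for _ in range(steps):
        if rng.random() < new_item_probability:
            memberships.append(item_count)   # start a brand-new item
            item_count += 1
        else:
            # Picking a past arrival uniformly at random selects an item
            # with probability proportional to its current size.
            memberships.append(rng.choice(memberships))
    return Counter(memberships)


sizes = simulate_preferential_attachment(steps=10_000)
top = sizes.most_common(5)
print(top)
```

Sorting the resulting sizes and plotting rank against size on log-log axes is one way to compare the simulation with the tables in the slides that follow.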
-
12.
NO. OF FOLLOWERS ON GITHUB
12
Username Count
slidenerd 1700
astaxie 1320
MugunthKumar 1081
honcheng 870
arunoda 827
csjaba 670
cheeaun 658
timoxley 600
karlseguin 600
hemanth 514
arvindr21 400
yuvipanda 335
mbrochh 330
anandology 330
sayanee 314
zz85 314
sanand0 309
captn3m0 300
sameersbn 300
...
-
13.
NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE
13
Person Count
Lata Mangeshkar 824
Asha Bhosle 810
Shakti Kapoor 589
Kishore Kumar 585
Mohammed Rafi 527
Sunidhi Chauhan 515
Alka Yagnik 451
Udit Narayan 435
Kader Khan 430
Sonu Nigam 405
Sameer 398
Asrani 397
Helen 395
Shaan 377
Aruna Irani 375
Anupam Kher 367
Shreya Ghoshal 357
Gulshan Grover 341
...
-
14.
PARTIES IN PARLIAMENT ELECTIONS
14
Name Count
IND 44704
INC 7213
BJP 3354
BSP 2628
SP 1311
CPI 1102
JD 943
CPM 914
DDP 716
JNP 676
BJS 657
JP 563
NOTA 543
PSP 538
INC(I) 492
SHS 467
AAP 432
SWA 410
...
-
15.
CANDIDATE NAMES IN ASSEMBLY ELECTIONS
15
Name Count
NONE OF THE ABOVE 629
OM PRAKASH 478
ASHOK KUMAR 411
RAM SINGH 362
RAJ KUMAR 294
ANIL KUMAR 271
AMAR SINGH 248
MOHAN LAL 235
RAM KUMAR 224
BABU LAL 218
RAM PRASAD 213
JAGDISH 210
VIJAY KUMAR 207
RAJENDRA SINGH 196
VINOD KUMAR 195
SHYAM LAL 193
RAJESH KUMAR 186
SITA RAM 186
RAM LAL 171
...
-
16.
STUDENT NAMES IN SSA SURVEY
16
Name Count
M.MANIKANDAN 99
S.PAVITHRA 84
S.MANIKANDAN 84
R.RAMYA 82
S.SANGEETHA 70
R.MANIKANDAN 69
S.DIVYA 68
M.PAVITHRA 68
S.SANTHIYA 67
S.VIGNESH 67
M.PRIYA 67
M.MAHALAKSHMI 64
S.SARANYA 63
S.SURYA 60
K.MANIKANDAN 60
P.PAVITHRA 56
S.GAYATHRI 56
P.MANIKANDAN 55
...
-
17.
Jain
Harini
Shweta
Sneha Pooja
Ashwin
Shah
Deepti
Sanjana
Varshini
Ezhumalai
Venkatesan
Silambarasan
Pandiyan
Kumaresan
Manikandan
Thirupathi
Agarwal
Kumar
Priya
-
18.
NOT EVERYTHING IS POWER-LAW, THOUGH
18
Need to understand what drives these distributions from their behaviours
-
19.
ORDERED CATEGORICALS HAVE MORE INFORMATION
19
-
20.
CORPORATE BAND
20
Band Count
5 12247
4 4449
3 205
2 63
Not Mapped 24
1 22
SVP 10
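One thing ordering makes possible is a cumulative distribution, and therefore a median category. The sketch below uses the Corporate Band counts from this slide. The seniority order (5 as most junior, SVP as most senior) is an assumption made for illustration, and "Not Mapped" is left out because it has no position in the order.

```python
# Counts from the Corporate Band slide, without "Not Mapped".
band_counts = {"5": 12247, "4": 4449, "3": 205, "2": 63, "1": 22, "SVP": 10}

# ASSUMPTION: bands run from 5 (most junior) up to SVP (most senior).
band_order = ["5", "4", "3", "2", "1", "SVP"]


def cumulative_share(counts, order):
    """Return (category, cumulative share) pairs following the given order."""
    total = sum(counts[category] for category in order)
    running = 0
    shares = []
    for category in order:
        running += counts[category]
        shares.append((category, running / total))
    return shares


def median_category(counts, order):
    """First category at which the cumulative share reaches one half."""
    for category, share in cumulative_share(counts, order):
        if share >= 0.5:
            return category


shares = cumulative_share(band_counts, band_order)
print(median_category(band_counts, band_order))
```

Neither quantity is meaningful for an unordered column such as Region, which is the extra information the ordering provides.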
-
21.
LOCAL BAND
21
Band Count
5A 7483
5B 4764
4A 1683
4B 1612
4C 747
4D 407
3 205
2 63
Not Mapped 24
1 22
SVP 10
-
22.
QUANTITIES HAVE EVEN MORE INFORMATION
22
-
23.
AGE DISTRIBUTION IS LOG-NORMAL
23
-
24.
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
24
-
25.
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
This happens with specific customers, not randomly. Here are such customers’ meter readings.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections, and time … with some explainable anomalies. Why would these happen?
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
Callouts on the table: New section manager arrives … and is transferred out
25
-
26.
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
EDUCATION
26
-
27.
TN CLASS X: ENGLISH
[Chart: horizontal axis from 0 to 100 in steps of 5, vertical axis from 0 to 40,000 in steps of 5,000]
27
-
28.
TN CLASS X: SOCIAL SCIENCE
[Chart: horizontal axis from 0 to 100 in steps of 5, vertical axis from 0 to 40,000 in steps of 5,000]
28
-
29.
TN CLASS X: LANGUAGE
[Chart: horizontal axis from 0 to 100 in steps of 5, vertical axis from 0 to 40,000 in steps of 5,000]
29
-
30.
TN CLASS X: SCIENCE
[Chart: horizontal axis from 0 to 100 in steps of 5, vertical axis from 0 to 40,000 in steps of 5,000]
30
-
31.
TN CLASS X: MATHEMATICS
[Chart: horizontal axis from 0 to 100 in steps of 5, vertical axis from 0 to 40,000 in steps of 5,000]
31
-
32.
ICSE 2013 CLASS XII: TOTAL MARKS
32
-
33.
CBSE 2013 CLASS XII: ENGLISH MARKS
33
-
34.
CBSE 2013 CLASS XII: PHYSICS MARKS
34
-
35.
35
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE
ANALYSIS
BIVARIATE
ANALYSIS
-
36.
LET’S TAKE ONE DAY CRICKET DATA
Country Player Runs ScoreRate MatchDate Ground Versus
Australia Michael J Clarke 99* 93.39 30-06-2010 The Oval England
Australia Dean M Jones 99* 128.57 28-01-1985 Adelaide Oval Sri Lanka
Australia Bradley J Hodge 99* 115.11 04-02-2007 Melbourne Cricket Ground New Zealand
India Virender Sehwag 99* 99 16-08-2010 Rangiri Dambulla International Stad. Sri Lanka
New Zealand Bruce A Edgar 99* 72.79 14-02-1981 Eden Park India
Pakistan Mohammad Yousuf 99* 95.19 15-11-2007 Captain Roop Singh Stadium India
West Indies Richard B Richardson 99* 70.21 15-11-1985 Sharjah CA Stadium Pakistan
West Indies Ramnaresh R Sarwan 99* 95.19 15-11-2002 Sardar Patel Stadium India
Zimbabwe Andrew Flower 99* 89.18 24-10-1999 Harare Sports Club Australia
Zimbabwe Alistair D R Campbell 99* 79.83 01-10-2000 Queens Sports Club New Zealand
Zimbabwe Malcolm N Waller 99* 133.78 25-10-2011 Queens Sports Club New Zealand
Australia David C Boon 98* 82.35 08-12-1994 Bellerive Oval Zimbabwe
Australia Graeme M Wood 98* 63.22 11-01-1981 Melbourne Cricket Ground India
England Ian J L Trott 98* 84.48 20-10-2011 Punjab Cricket Association Stadium India
India Yuvraj Singh 98* 89.09 01-08-2001 Sinhalese Sports Club Ground Sri Lanka
Ireland Kevin J O'Brien 98* 94.23 10-07-2010 VRA Ground Scotland
Kenya Collins O Obuya 98* 75.96 13-03-2011 M.Chinnaswamy Stadium Australia
Netherlands Ryan N ten Doeschate 98* 73.68 01-09-2009 VRA Ground Afghanistan
New Zealand James E C Franklin 98* 142.02 07-12-2010 M.Chinnaswamy Stadium India
Pakistan Ijaz Ahmed 98* 112.64 28-10-1994 Iqbal Stadium South Africa
South Africa Jacques H Kallis 98* 74.24 06-02-2000 St George's Park Zimbabwe
36
-
37.
Against which countries are
higher averages scored?
Which countries’ players
score more per match?
37
-
38.
Which player scores the
most per ball?
The player with the highest strike
rate is an obscure South African
whose name most of us have never
heard of.
In fact, this list is filled with players
we have never heard of.
38
-
39.
Most analysis answers the question
“Which are the top 10 X?”
Which are my top products?
Which are my top branches?
Who are my best sales people?
Which vendors have the highest cost per unit?
Which divisions are spending the most money?
In which hours does the under-12 segment watch TV most?
Which customer segment has the highest revenue per user?
39
-
40.
THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY
Country Player Runs ScoreRate MatchDate Ground Versus
Australia Michael J Clarke 99* 93.39 30-06-2010 The Oval England
Australia Dean M Jones 99* 128.57 28-01-1985 Adelaide Oval Sri Lanka
Australia Bradley J Hodge 99* 115.11 04-02-2007 Melbourne Cricket Ground New Zealand
India Virender Sehwag 99* 99 16-08-2010 Rangiri Dambulla International Stad. Sri Lanka
New Zealand Bruce A Edgar 99* 72.79 14-02-1981 Eden Park India
Pakistan Mohammad Yousuf 99* 95.19 15-11-2007 Captain Roop Singh Stadium India
West Indies Richard B Richardson 99* 70.21 15-11-1985 Sharjah CA Stadium Pakistan
West Indies Ramnaresh R Sarwan 99* 95.19 15-11-2002 Sardar Patel Stadium India
Zimbabwe Andrew Flower 99* 89.18 24-10-1999 Harare Sports Club Australia
Zimbabwe Alistair D R Campbell 99* 79.83 01-10-2000 Queens Sports Club New Zealand
Zimbabwe Malcolm N Waller 99* 133.78 25-10-2011 Queens Sports Club New Zealand
Australia David C Boon 98* 82.35 08-12-1994 Bellerive Oval Zimbabwe
Australia Graeme M Wood 98* 63.22 11-01-1981 Melbourne Cricket Ground India
England Ian J L Trott 98* 84.48 20-10-2011 Punjab Cricket Association Stadium India
India Yuvraj Singh 98* 89.09 01-08-2001 Sinhalese Sports Club Ground Sri Lanka
Ireland Kevin J O'Brien 98* 94.23 10-07-2010 VRA Ground Scotland
Kenya Collins O Obuya 98* 75.96 13-03-2011 M.Chinnaswamy Stadium Australia
Netherlands Ryan N ten Doeschate 98* 73.68 01-09-2009 VRA Ground Afghanistan
New Zealand James E C Franklin 98* 142.02 07-12-2010 M.Chinnaswamy Stadium India
Pakistan Ijaz Ahmed 98* 112.64 28-10-1994 Iqbal Stadium South Africa
South Africa Jacques H Kallis 98* 74.24 06-02-2000 St George's Park Zimbabwe
Take every column in the data
Find the top value by that column
Country: South Africa has the highest strike rate of 76%
Player: Johann Louw has the highest strike rate of 329%
Runs: 164 runs has the highest strike rate of 156%
MatchDate: 12-03-2006 has the highest strike rate of 136%
Ground: AC-VDCA Stadium has the highest strike rate of 98%
Versus: United States has the highest strike rate of 104%
40
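The two steps on this slide can be sketched as a loop over columns with a group-by inside it. The snippet below uses six rows copied from the table, so its answers differ from the full-dataset results on the slide. Taking the mean of ScoreRate per group is a simplifying assumption, since a true strike rate would be computed from total runs and balls. The name `top_value_by_column` is hypothetical.

```python
from collections import defaultdict

# A few rows copied from the slide; the full dataset has many more.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "ScoreRate": 93.39, "Versus": "England"},
    {"Country": "Australia", "Player": "Dean M Jones", "ScoreRate": 128.57, "Versus": "Sri Lanka"},
    {"Country": "India", "Player": "Virender Sehwag", "ScoreRate": 99.0, "Versus": "Sri Lanka"},
    {"Country": "New Zealand", "Player": "Bruce A Edgar", "ScoreRate": 72.79, "Versus": "India"},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "ScoreRate": 133.78, "Versus": "New Zealand"},
    {"Country": "New Zealand", "Player": "James E C Franklin", "ScoreRate": 142.02, "Versus": "India"},
]


def top_value_by_column(records, metric, columns):
    """For each column, find the value whose rows have the highest mean metric."""
    result = {}
    for column in columns:
        groups = defaultdict(list)
        for record in records:
            groups[record[column]].append(record[metric])
        means = {value: sum(vals) / len(vals) for value, vals in groups.items()}
        best = max(means, key=means.get)
        result[column] = (best, means[best])
    return result


summary = top_value_by_column(rows, "ScoreRate", ["Country", "Player", "Versus"])
for column, (value, mean) in summary.items():
    print(f"{column}: {value} has the highest mean ScoreRate of {mean:.2f}")
```

Because the loop does not depend on what the columns mean, the same code answers the "top 10 X" question for any dataset with categorical columns and a numeric metric.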
-
41.
What do children in schools know, and what can they do, at
different stages of elementary education?
Have the inputs made into the elementary education
system had a beneficial effect or not?
41
-
42.
HAVING BOOKS IMPROVES READING ABILITY
Having more books at home improves the performance of children when it
comes to reading. (But children typically have only 1-10 books at home.)
Number of students sampled
What is the impact? How many more marks
can having more books fetch?
Circle size indicates number of students with
this response. Few students have no books.
Is this response (“25+ books”) good or bad?
Small red bars indicate low marks. Large
green bars indicate high marks. Students
having 25+ books tend to score high marks.
The most common response is marked in
blue. This is also the circle.
The graphic is summarized in words
Indicates whether the best response is the
most popular. Blue means that it is not.
Green means that it is. Red means that the
worst level is the most popular response.
42
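The colour rule described in the last callout can be expressed as a small function. This is a sketch of that rule only, with hypothetical response counts and mean marks that are not taken from the survey; the function name and the dictionary layout are illustrative assumptions.

```python
def classify_response_pattern(counts, mean_marks):
    """Apply the slide's colour rule to one survey question.

    counts: response -> number of students giving that response
    mean_marks: response -> mean marks of those students
    """
    most_popular = max(counts, key=counts.get)
    best = max(mean_marks, key=mean_marks.get)
    worst = min(mean_marks, key=mean_marks.get)
    if most_popular == best:
        return "green"   # the best response is also the most popular
    if most_popular == worst:
        return "red"     # the worst response is the most popular
    return "blue"        # the most popular response is neither best nor worst


# HYPOTHETICAL numbers for illustration only, not taken from the survey.
counts = {"No books": 500, "1-10 books": 6000, "11-25 books": 2000, "25+ books": 1500}
mean_marks = {"No books": 40.0, "1-10 books": 48.0, "11-25 books": 52.0, "25+ books": 55.0}

colour = classify_response_pattern(counts, mean_marks)
print(colour)
```

Running the same function over every question in a survey gives a quick scan of where the common behaviour and the best-performing behaviour diverge.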
-
43.
CHILDREN LIKE GAMES, AND THEY’RE GOOD
… but playing daily hurts reading ability
43
-
44.
WATCHING TV OCCASIONALLY IS GOOD
Children who watch TV
every day don’t do as well
as children who watch TV
only once a week.
But children who never
watch TV fare the worst.
Watching TV every day
helps improve children’s
reading ability a little bit
more…
… but mathematical
abilities fall dramatically at
that point
44
-
45.
WE HAVE A WEBSITE THAT YOU CAN EXPLORE
GRAMENER.COM/NAS
45
-
46.
46
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE
ANALYSIS
BIVARIATE
ANALYSIS
We did the simplest possible thing – plot the number of customers who had meter readings of 0, 1, 2, 3, etc. – all the way up to 300 and beyond. (Effectively, we drew a histogram.)
As expected, it was log-normal. Relatively few users with low meter readings, and few with high meter readings. But what was striking were the spikes – at 50 units, 100 units, 200 units and 300 units – precisely at the slab boundaries.
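Spikes like these can also be flagged without looking at a plot. The following is a minimal sketch that counts each reading and compares it with the counts of nearby readings. The window size, the threshold factor and the sample readings are all hypothetical choices for illustration.

```python
from collections import Counter


def find_spikes(readings, window=5, factor=3.0):
    """Flag readings whose frequency is far above that of nearby readings.

    A value is a spike when its count exceeds `factor` times the mean count
    of the other values within `window` units on either side.
    """
    histogram = Counter(readings)
    spikes = []
    for value in sorted(histogram):
        neighbours = [histogram.get(value + offset, 0)
                      for offset in range(-window, window + 1) if offset != 0]
        baseline = sum(neighbours) / len(neighbours)
        # Floor the baseline at 1 so isolated rare values are not flagged.
        if histogram[value] > factor * max(baseline, 1.0):
            spikes.append(value)
    return spikes


# HYPOTHETICAL readings: a flat spread from 90 to 110 units plus a pile-up at 100.
readings = [r for r in range(90, 111) for _ in range(4)] + [100] * 40
print(find_spikes(readings))
```

On real data the baseline would follow the overall shape of the distribution, so a local comparison like this is only a first pass.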
Given the metering system, there is a strong economic incentive to stay at or within a slab boundary. Exceeding it increases the unit rate. However, there are two ways this could happen. Either the consumer watches their meter carefully, and the instant it hits 100, stops using their lights and fans – or a certain amount of money changes hands.
It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 vs 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff that never visited the premises, and were cooking up numbers.
When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)
The other question is: what is the nature of this fraudulent contract? Is it monthly, with the meter reader appearing and charging a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are at 200 for 9 of the 12 months. However, there’s a spike to 350 units in Jan-11. This indicated a monthly contract with a failure to pay in just one month.
However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those in the highest slab (above 300), a more likely explanation was collusion between the lineman and the customer to place her in the highest slab just for this month, to avoid scrutiny.
Lastly, we were examining the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec – Feb) when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10.
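The "extent of fraud" measure can be written as a percentage excess of the observed count over an expected count. The sketch below shows one way to compute it. Estimating the expected count from neighbouring readings is an assumption made here for illustration, as is the sample histogram; the notes do not say how the expected value was obtained.

```python
def extent_of_fraud(observed, expected):
    """Percentage by which the observed count at a slab boundary exceeds the expected count."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 100.0 * (observed - expected) / expected


def expected_from_neighbours(histogram, boundary, window=5):
    """ASSUMPTION: estimate the expected count as the mean count of nearby readings."""
    neighbours = [histogram.get(boundary + offset, 0)
                  for offset in range(-window, window + 1) if offset != 0]
    return sum(neighbours) / len(neighbours)


# HYPOTHETICAL histogram around the 100-unit boundary: reading -> number of customers.
histogram = {95: 40, 96: 42, 97: 38, 98: 40, 99: 41, 100: 70,
             101: 39, 102: 40, 103: 41, 104: 39, 105: 40}

expected = expected_from_neighbours(histogram, 100)
excess = extent_of_fraud(histogram[100], expected)
print(f"{excess:.0f}% excess at the 100-unit boundary")
```

A negative result, like the -27% for Section 5 in Jun-10, means fewer readings sat on the boundary than expected.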
We learnt that this coincided with the appointment and transfer of a new section manager, under whose “regime” fraud seems to have been dramatically controlled. It appears that a good organisational level at which to control fraud is the section manager level, which is 5,000 people strong, rather than the staff level, which is 100,000 people strong.