Automating Data Exploration SciPy 2016

  1. AUTOMATING DATA EXPLORATION: A structured approach to analysing data. A TOOL AGNOSTIC APPROACH
  2. AUTOMATING DATA EXPLORATION: A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  3. LET’S TAKE A DATASET Each row has details about an employee who has left the organization. Just “reading” the dataset is quite informative.
  4. DESCRIBE THE DATA IN A STRUCTURED WAY
  5. AUTOMATING DATA EXPLORATION: A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  6. CATEGORICAL COLUMNS YIELD VERY LITTLE DATA There’s not much information in one column. The values are not quantitative, so a distribution is not meaningful. The values are not even ordered. In fact, the only thing we have is the list of values and their count. ... or is there more to this? Region: Count. India: 10780; Headstrong: 1554; China: 1130; Philippines: 1030; US: 792; Romania: 788; Mexico: 324; Guatemala: 233; Poland: 124; Brazil: 45; Hungary: 41; Colombia: 38; Netherlands: 33; South Africa: 30; UK: 18; UAE: 15; GMS India: 15; Japan: 11; CZECH Republic: 10; Kenya: 9
  7. ... BUT RANK FREQUENCY IS STILL POSSIBLE The rank of the row provides additional information. With this, we can explore the distribution of the rank against the count. These distributions are called rank-frequency distributions. Rank. Region: Count. 1. India: 10780; 2. Headstrong: 1554; 3. China: 1130; 4. Philippines: 1030; 5. US: 792; 6. Romania: 788; 7. Mexico: 324; 8. Guatemala: 233; 9. Poland: 124; 10. Brazil: 45; 11. Hungary: 41; 12. Colombia: 38; 13. Netherlands: 33; 14. South Africa: 30; 15. UK: 18; 16. UAE: 15; 17. GMS India: 15; 18. Japan: 11; 19. CZECH Republic: 10; 20. Kenya: 9
  8. REGION SHOWS A POWER LAW DISTRIBUTION [Plot of the Region counts listed on slide 6. Axes: rank on a log scale, frequency on a log scale.]
  9. COST CODE SHOWS A POWER LAW DISTRIBUTION Cost Code: Count. 105: 9542; 121: 1757; 125: 875; 122: 796; 3001: 654; 3310: 635; 124: 435; 131: 415; 115: 336; nan: 207; 101: 205; 127: 173; 109: 148; 116: 91; 126: 66; ...
  10. LE SHOWS A POWER LAW DISTRIBUTION LE: Count. D84: 11487; GPL: 853; RM1: 789; LC2: 565; GMR: 323; D95: 247; GUT: 233; ML1: 223; CTK: 184; AXE: 127; A38: 98; A21: 79; EMP: 61; BRL: 45; A66: 43; ...
  11. WHAT CAUSES POWER LAW DISTRIBUTIONS? PREFERENTIAL ATTACHMENT, EXPONENTIAL GROWTH
  12. NO. OF FOLLOWERS ON GITHUB Username: Count. slidenerd: 1700; astaxie: 1320; MugunthKumar: 1081; honcheng: 870; arunoda: 827; csjaba: 670; cheeaun: 658; timoxley: 600; karlseguin: 600; hemanth: 514; arvindr21: 400; yuvipanda: 335; mbrochh: 330; anandology: 330; sayanee: 314; zz85: 314; sanand0: 309; captn3m0: 300; sameersbn: 300; ...
  13. NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE Person: Count. Lata Mangeshkar: 824; Asha Bhosle: 810; Shakti Kapoor: 589; Kishore Kumar: 585; Mohammed Rafi: 527; Sunidhi Chauhan: 515; Alka Yagnik: 451; Udit Narayan: 435; Kader Khan: 430; Sonu Nigam: 405; Sameer: 398; Asrani: 397; Helen: 395; Shaan: 377; Aruna Irani: 375; Anupam Kher: 367; Shreya Ghoshal: 357; Gulshan Grover: 341; ...
  14. PARTIES IN PARLIAMENT ELECTIONS Name: Count. IND: 44704; INC: 7213; BJP: 3354; BSP: 2628; SP: 1311; CPI: 1102; JD: 943; CPM: 914; DDP: 716; JNP: 676; BJS: 657; JP: 563; NOTA: 543; PSP: 538; INC(I): 492; SHS: 467; AAP: 432; SWA: 410; ...
  15. CANDIDATE NAMES IN ASSEMBLY ELECTIONS Name: Count. NONE OF THE ABOVE: 629; OM PRAKASH: 478; ASHOK KUMAR: 411; RAM SINGH: 362; RAJ KUMAR: 294; ANIL KUMAR: 271; AMAR SINGH: 248; MOHAN LAL: 235; RAM KUMAR: 224; BABU LAL: 218; RAM PRASAD: 213; JAGDISH: 210; VIJAY KUMAR: 207; RAJENDRA SINGH: 196; VINOD KUMAR: 195; SHYAM LAL: 193; RAJESH KUMAR: 186; SITA RAM: 186; RAM LAL: 171; ...
  16. STUDENT NAMES IN SSA SURVEY Name: Count. M.MANIKANDAN: 99; S.PAVITHRA: 84; S.MANIKANDAN: 84; R.RAMYA: 82; S.SANGEETHA: 70; R.MANIKANDAN: 69; S.DIVYA: 68; M.PAVITHRA: 68; S.SANTHIYA: 67; S.VIGNESH: 67; M.PRIYA: 67; M.MAHALAKSHMI: 64; S.SARANYA: 63; S.SURYA: 60; K.MANIKANDAN: 60; P.PAVITHRA: 56; S.GAYATHRI: 56; P.MANIKANDAN: 55; ...
  17. Jain Harini Shweta Sneha Pooja Ashwin Shah Deepti Sanjana Varshini Ezhumalai Venkatesan Silambarasan Pandiyan Kumaresan Manikandan Thirupathi Agarwal Kumar Priya
  18. NOT EVERYTHING IS POWER-LAW, THOUGH Need to understand what drives these distributions from their behaviours
  19. ORDERED CATEGORICALS HAVE MORE INFORMATION
  20. CORPORATE BAND LE: Count. 5: 12247; 4: 4449; 3: 205; 2: 63; Not Mapped: 24; 1: 22; SVP: 10
  21. LOCAL BAND LE: Count. 5A: 7483; 5B: 4764; 4A: 1683; 4B: 1612; 4C: 747; 4D: 407; 3: 205; 2: 63; Not Mapped: 24; 1: 22; SVP: 10
  22. QUANTITIES HAVE EVEN MORE INFORMATION
  23. AGE DISTRIBUTION IS LOG-NORMAL
  24. DETECTING FRAUD “We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.” ENERGY UTILITY
  25. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries. This clearly shows collusion of some form with the customers. This happens with specific customers, not randomly. Here are such customers’ meter readings (one customer per row).
      Apr-10 | May-10 | Jun-10 | Jul-10 | Aug-10 | Sep-10 | Oct-10 | Nov-10 | Dec-10 | Jan-11 | Feb-11 | Mar-11
      217 | 219 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 350 | 200 | 200
      250 | 200 | 200 | 200 | 201 | 200 | 200 | 200 | 250 | 200 | 200 | 150
      250 | 150 | 150 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 150
      150 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 50
      200 | 200 | 200 | 150 | 180 | 150 | 50 | 100 | 50 | 70 | 100 | 100
      100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 110 | 100
      100 | 150 | 123 | 123 | 50 | 100 | 50 | 100 | 100 | 100 | 100 | 100
      0 | 111 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50
      0 | 100 | 27 | 100 | 50 | 100 | 100 | 100 | 100 | 100 | 70 | 100
      1 | 1 | 1 | 100 | 99 | 50 | 100 | 100 | 100 | 100 | 100 | 100
      If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections, and time.
      Section | Apr-10 | May-10 | Jun-10 | Jul-10 | Aug-10 | Sep-10 | Oct-10 | Nov-10 | Dec-10 | Jan-11 | Feb-11 | Mar-11
      Section 1 | 70% | 97% | 136% | 65% | 110% | 116% | 121% | 107% | 114% | 88% | 74% | 109%
      Section 2 | 66% | 92% | 66% | 87% | 70% | 64% | 63% | 50% | 58% | 38% | 41% | 54%
      Section 3 | 90% | 46% | 47% | 43% | 28% | 31% | 50% | 32% | 19% | 38% | 8% | 34%
      Section 4 | 44% | 24% | 36% | 39% | 21% | 18% | 24% | 49% | 56% | 44% | 31% | 14%
      Section 5 | 4% | 63% | -27% | 20% | 41% | 82% | 26% | 34% | 43% | 2% | 37% | 15%
      Section 6 | 18% | 23% | 30% | 21% | 28% | 33% | 39% | 41% | 39% | 18% | 0% | 33%
      Section 7 | 36% | 51% | 33% | 33% | 27% | 35% | 10% | 39% | 12% | 5% | 15% | 14%
      Section 8 | 22% | 21% | 28% | 12% | 24% | 27% | 10% | 31% | 13% | 11% | 22% | 17%
      Section 9 | 19% | 35% | 14% | 9% | 16% | 32% | 37% | 12% | 9% | 5% | -3% | 11%
      Callouts on the table: New section manager arrives … and is transferred out … with some explainable anomalies. Why would these happen?
  26. PREDICTING MARKS “What determines a child’s marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter?” EDUCATION
  27. TN CLASS X: ENGLISH [Chart: one axis runs 0 to 100 in steps of 5, the other 0 to 40,000 in steps of 5,000.]
  28. TN CLASS X: SOCIAL SCIENCE [Chart: one axis runs 0 to 100 in steps of 5, the other 0 to 40,000 in steps of 5,000.]
  29. TN CLASS X: LANGUAGE [Chart: one axis runs 0 to 100 in steps of 5, the other 0 to 40,000 in steps of 5,000.]
  30. TN CLASS X: SCIENCE [Chart: one axis runs 0 to 100 in steps of 5, the other 0 to 40,000 in steps of 5,000.]
  31. TN CLASS X: MATHEMATICS [Chart: one axis runs 0 to 100 in steps of 5, the other 0 to 40,000 in steps of 5,000.]
  32. ICSE 2013 CLASS XII: TOTAL MARKS
  33. CBSE 2013 CLASS XII: ENGLISH MARKS
  34. CBSE 2013 CLASS XII: PHYSICS MARKS
  35. AUTOMATING DATA EXPLORATION: A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  36. LET’S TAKE ONE DAY CRICKET DATA
      Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
      Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
      Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
      Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
      India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
      New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
      Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
      West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
      West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
      Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
      Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
      Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
      Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
      Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
      England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
      India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
      Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
      Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
      Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
      New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
      Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
      South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe
  37. Against which countries are higher averages scored? Which countries’ players score more per match?
  38. Which player scores the most per ball? The player with the highest strike rate is an obscure South African whose name most of us have never heard of. In fact, this list is filled with players we have never heard of.
  39. Most analysis answers the question “Which are the top 10 X?” Which are my top products? Which are my top branches? Who are my best sales people? Which vendors have the highest cost per unit? Which divisions are spending the most money? In which hours does the under 12 segment watch TV most? Which customer segment has the highest revenue per user?
  40. THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY [The slide repeats the one-day cricket table from slide 36.] Take every column in the data. Find the top value by that column. Country: South Africa has the highest strike rate of 76%. Player: Johann Louw has the highest strike rate of 329%. Runs: 164 runs has the highest strike rate of 156%. MatchDate: 12-03-2006 has the highest strike rate of 136%. Ground: AC-VDCA Stadium has the highest strike rate of 98%. Versus: United States has the highest strike rate of 104%.
  41. What do children in schools know, and what can they do, at different stages of elementary education? Have the inputs made into the elementary education system had a beneficial effect or not?
  42. HAVING BOOKS IMPROVES READING ABILITY Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.) Chart annotations: Number of students sampled. What is the impact? How many more marks can having more books fetch? Circle size indicates number of students with this response. Few students have no books. Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks. The most common response is marked in blue. This is also the largest circle. The graphic is summarized in words. Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
  43. CHILDREN LIKE GAMES, AND THEY’RE GOOD … but playing daily hurts reading ability
  44. WATCHING TV OCCASIONALLY IS GOOD Children who watch TV every day don’t do as well as children who watch TV only once a week. But children who never watch TV fare the worst. Watching TV every day helps improve children’s reading ability a little bit more … but mathematical abilities fall dramatically at that point
  45. WE HAVE A WEBSITE THAT YOU CAN EXPLORE GRAMENER.COM/NAS
  46. AUTOMATING DATA EXPLORATION: A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
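
The rank-frequency idea from slides 7 to 10 can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: `rank_frequency` and `log_log_slope` are hypothetical helper names, and the least-squares fit on log-log axes is an assumed way of checking for a roughly straight line. A steady negative slope is consistent with a power law but does not prove one. The counts are the Region values from slide 6.

```python
import math
from collections import Counter


def rank_frequency(values):
    """Return (rank, value, count) tuples, most frequent value first.

    `values` may be an iterable of raw category labels or a mapping of
    label -> count that has already been aggregated.
    """
    ordered = Counter(values).most_common()
    return [(rank, value, count)
            for rank, (value, count) in enumerate(ordered, start=1)]


def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` must be positive and sorted in descending order, so that
    position i (1-based) is the rank.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Region counts as listed on slide 6.
region_counts = {
    "India": 10780, "Headstrong": 1554, "China": 1130, "Philippines": 1030,
    "US": 792, "Romania": 788, "Mexico": 324, "Guatemala": 233,
    "Poland": 124, "Brazil": 45, "Hungary": 41, "Colombia": 38,
    "Netherlands": 33, "South Africa": 30, "UK": 18, "UAE": 15,
    "GMS India": 15, "Japan": 11, "CZECH Republic": 10, "Kenya": 9,
}

table = rank_frequency(region_counts)
slope = log_log_slope([count for _, _, count in table])
print(table[:3])
print(f"log-log slope: {slope:.2f}")
```

Because the function only needs labels and counts, the same call works unchanged on the cost code, LE, follower, and name columns shown on the other slides.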

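Slide 40's recipe (take every column, find the top value by that column) can be sketched as below. This is an illustration under stated assumptions, not the speaker's implementation: `top_value_by_column` and its `min_rows` parameter are hypothetical, groups are ranked by the simple mean of ScoreRate (the slides do not say how strike rate was aggregated), and the sample rows are four innings copied from the slide 36 table.

```python
from collections import defaultdict
from statistics import mean


def top_value_by_column(rows, metric, min_rows=1):
    """For each non-metric column, find the value with the highest mean metric.

    rows     -- list of dicts sharing the same keys
    metric   -- name of the numeric column to average
    min_rows -- ignore groups with fewer rows than this (hypothetical guard
                against tiny groups dominating the ranking)
    Returns {column: (best_value, mean_metric)}; a column is omitted when
    none of its groups has at least `min_rows` rows.
    """
    result = {}
    columns = [name for name in rows[0] if name != metric]
    for column in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row[metric])
        averages = {value: mean(scores)
                    for value, scores in groups.items()
                    if len(scores) >= min_rows}
        if averages:
            best = max(averages, key=averages.get)
            result[column] = (best, averages[best])
    return result


# Four innings copied from the cricket table on slide 36.
innings = [
    {"Country": "Australia", "Player": "Michael J Clarke",
     "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones",
     "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag",
     "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller",
     "Versus": "New Zealand", "ScoreRate": 133.78},
]

for column, (value, score) in top_value_by_column(innings, "ScoreRate").items():
    print(f"{column}: {value} has the highest mean ScoreRate of {score:.2f}")
```

Raising `min_rows` is one possible way to handle the effect noted on slide 38, where the top of the list is filled with little-known players: a group with a single row can top the ranking on one good innings.
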
Editor's Notes

  1. We did the simplest possible thing: plot the number of customers who had meter readings of 0, 1, 2, 3, etc., all the way up to 300 and beyond. (Effectively, we drew a histogram.) As expected, it was log-normal: relatively few users with low meter readings, and few with high meter readings. But what was striking were the spikes at 50 units, 100 units, 200 units and 300 units, precisely at the slab boundaries.

     Given the metering system, there is a strong economic incentive to stay at or within a slab boundary. Exceeding it increases the unit rate. However, there are two ways this could happen. Either the consumer watches their meter carefully and, the instant it hits 100, stops using their lights and fans, or a certain amount of money changes hands.

     It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 vs 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff that never visited the premises, and were cooking up numbers. When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)

     The other question is: what is the nature of this fraudulent contract? Is it monthly, where the meter reading guy appears and charges a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are consistently at 200 for 9 of the 12 months. However, there’s a spike in Jan-11 to 350 units. This indicated a monthly contract with a failure to pay in just one month. However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those that were in the highest slab (above 300), a more likely explanation was a collusion of the lineman with the customer to place her in the highest slab just this month, to avoid scrutiny.

     Lastly, we were examining the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec to Feb) when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10. We learnt that this coincided with the appointment and transfer of a new section manager, under whose “regime” fraud seems to have been dramatically controlled. It appears that a good organisation level to control fraud is at the 5,000 people strong section manager level, rather than the 100,000 people strong staff level.
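
The "extent of fraud" measure described above (the height of a spike relative to the expected value) can be sketched as follows. This is a hedged illustration rather than the analysis actually run: `excess_at` is a hypothetical helper, the expected count is estimated as the mean count of the neighbouring readings (the notes do not say how the expectation was computed), and the readings are synthetic, made up purely to show the arithmetic.

```python
from collections import Counter


def excess_at(readings, boundary, window=5):
    """Percentage excess of the count at `boundary` over its neighbours.

    The expected count is taken as the mean count of the `window` integer
    readings on each side of the boundary (an assumption). Returns None when
    the neighbours hold no data. A negative result means fewer readings at
    the boundary than the neighbours suggest.
    """
    counts = Counter(readings)
    neighbours = [counts[boundary + offset]
                  for offset in range(-window, window + 1)
                  if offset != 0]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        return None
    return 100.0 * (counts[boundary] - expected) / expected


# Synthetic data: every reading from 90 to 110 occurs 10 times,
# and the 100-unit slab boundary occurs 10 extra times.
synthetic = [value for value in range(90, 111) for _ in range(10)]
synthetic += [100] * 10

print(excess_at(synthetic, 100))
```

Computing this per section and per month would give a table shaped like the one on slide 25, including negative values such as the one noted for Section 5.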