1
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
A TOOL AGNOSTIC APPROACH
2
METADATA
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
LET’S TAKE A DATASET
3
Each row has details about an employee who has left the organization.
Just “reading” the dataset is quite informative.
DESCRIBE THE DATA IN A STRUCTURED WAY
4
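A structured description can start with per-column metadata: how many values are missing, how many are distinct, and whether the values look numeric. The sketch below is one minimal way to do this with the Python standard library. The `profile` function and the sample `rows` are hypothetical and are not taken from the deck.

```python
def looks_numeric(value):
    """Return True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def profile(rows):
    """Summarise each column of a list of dicts whose values are strings."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # Treat empty strings and the literal "nan" as missing.
        present = [v for v in values if v not in ("", "nan")]
        is_number = bool(present) and all(looks_numeric(v) for v in present)
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "kind": "number" if is_number else "category",
        }
    return summary

# Hypothetical sample rows, for illustration only.
rows = [
    {"Region": "India", "Band": "5", "Age": "27"},
    {"Region": "China", "Band": "4", "Age": "31"},
    {"Region": "India", "Band": "5", "Age": ""},
]
metadata = profile(rows)
for column, info in metadata.items():
    print(column, info)
```

Note that a coded column such as a band number is reported as a number here; deciding whether it is really an ordered category still needs a human.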

5
METADATA
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
CATEGORICAL COLUMNS YIELD VERY LITTLE DATA
6
There’s not much information in one column.
The values are not quantitative, so a distribution is not meaningful.
The values are not even ordered.
In fact, the only thing we have is the list of values and their counts.
... or is there more to this?
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
... BUT RANK FREQUENCY IS STILL POSSIBLE
7
The rank of each row provides additional information.
With it, we can explore how the count varies with the rank.
Such distributions are called rank-frequency distributions.
Rank Region Count
1 India 10780
2 Headstrong 1554
3 China 1130
4 Philippines 1030
5 US 792
6 Romania 788
7 Mexico 324
8 Guatemala 233
9 Poland 124
10 Brazil 45
11 Hungary 41
12 Colombia 38
13 Netherlands 33
14 South Africa 30
15 UK 18
16 UAE 15
17 GMS India 15
18 Japan 11
19 CZECH Republic 10
20 Kenya 9
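The ranking step can be sketched in a few lines of Python. The counts are the first ten rows of the table above; the function name `rank_frequency` is an illustrative choice, not something from the deck.

```python
# Region counts copied from the first ten rows of the table above.
region_counts = {
    "India": 10780, "Headstrong": 1554, "China": 1130, "Philippines": 1030,
    "US": 792, "Romania": 788, "Mexico": 324, "Guatemala": 233,
    "Poland": 124, "Brazil": 45,
}

def rank_frequency(counts):
    """Return (rank, value, count) tuples, most frequent value first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, value, count)
            for rank, (value, count) in enumerate(ordered, start=1)]

ranked = rank_frequency(region_counts)
for rank, region, count in ranked:
    print(rank, region, count)
```

The same function works for any categorical column once its values have been counted.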
REGION SHOWS A POWER LAW DISTRIBUTION
8
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
[Plot: rank on a log scale against frequency on a log scale]
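A power law appears as a roughly straight line when both rank and frequency are on log scales. As a rough check, the sketch below fits a least-squares line to log(count) against log(rank) for the Region counts. The function name `loglog_slope` is an assumption made for illustration.

```python
import math

# Region counts from the slide, already in rank order.
counts = [10780, 1554, 1130, 1030, 792, 788, 324, 233, 124, 45,
          41, 38, 33, 30, 18, 15, 15, 11, 10, 9]

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

slope = loglog_slope(counts)
print(f"log-log slope: {slope:.2f}")
```

A straight-line fit on log-log axes is only a quick screen. It is consistent with a power law but does not prove one.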

COST CODE SHOWS A POWER LAW DISTRIBUTION
9
Cost Code Count
105 9542
121 1757
125 875
122 796
3001 654
3310 635
124 435
131 415
115 336
nan 207
101 205
127 173
109 148
116 91
126 66
...
LE SHOWS A POWER LAW DISTRIBUTION
10
LE Count
D84 11487
GPL 853
RM1 789
LC2 565
GMR 323
D95 247
GUT 233
ML1 223
CTK 184
AXE 127
A38 98
A21 79
EMP 61
BRL 45
A66 43
...
11
WHAT CAUSES POWER LAW DISTRIBUTIONS?
PREFERENTIAL ATTACHMENT
EXPONENTIAL GROWTH
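Preferential attachment can be illustrated with a small simulation. The sketch below uses a simple rich-get-richer process chosen for illustration (the deck does not specify a model): at each step a new item appears with a small probability, otherwise an existing item grows with a chance proportional to its current size. All names and parameter values are assumptions.

```python
import random
from collections import Counter

def simulate_preferential_attachment(steps, new_item_probability=0.1, seed=0):
    """Grow items one unit at a time, favouring items that are already large."""
    rng = random.Random(seed)
    history = [0]   # one entry per unit; an item's id repeats as it grows
    next_id = 1
    for _ in range(steps):
        if rng.random() < new_item_probability:
            history.append(next_id)   # a brand-new item of size 1
            next_id += 1
        else:
            # A uniformly random past unit belongs to item i with
            # probability proportional to the current size of item i.
            history.append(rng.choice(history))
    return Counter(history)

sizes = simulate_preferential_attachment(steps=20000)
largest = [count for _, count in sizes.most_common(5)]
print("largest items:", largest)
print("number of items:", len(sizes))
```

A few early items end up very large while most stay small, which is the skewed shape seen in the tables above.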
NO. OF FOLLOWERS ON GITHUB
12
Username Count
slidenerd 1700
astaxie 1320
MugunthKumar 1081
honcheng 870
arunoda 827
csjaba 670
cheeaun 658
timoxley 600
karlseguin 600
hemanth 514
arvindr21 400
yuvipanda 335
mbrochh 330
anandology 330
sayanee 314
zz85 314
sanand0 309
captn3m0 300
sameersbn 300
...

NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE
13
Person Count
Lata Mangeshkar 824
Asha Bhosle 810
Shakti Kapoor 589
Kishore Kumar 585
Mohammed Rafi 527
Sunidhi Chauhan 515
Alka Yagnik 451
Udit Narayan 435
Kader Khan 430
Sonu Nigam 405
Sameer 398
Asrani 397
Helen 395
Shaan 377
Aruna Irani 375
Anupam Kher 367
Shreya Ghoshal 357
Gulshan Grover 341
...
PARTIES IN PARLIAMENT ELECTIONS
14
Name Count
IND 44704
INC 7213
BJP 3354
BSP 2628
SP 1311
CPI 1102
JD 943
CPM 914
DDP 716
JNP 676
BJS 657
JP 563
NOTA 543
PSP 538
INC(I) 492
SHS 467
AAP 432
SWA 410
...
CANDIDATE NAMES IN ASSEMBLY ELECTIONS
15
Name Count
NONE OF THE ABOVE 629
OM PRAKASH 478
ASHOK KUMAR 411
RAM SINGH 362
RAJ KUMAR 294
ANIL KUMAR 271
AMAR SINGH 248
MOHAN LAL 235
RAM KUMAR 224
BABU LAL 218
RAM PRASAD 213
JAGDISH 210
VIJAY KUMAR 207
RAJENDRA SINGH 196
VINOD KUMAR 195
SHYAM LAL 193
RAJESH KUMAR 186
SITA RAM 186
RAM LAL 171
...
STUDENT NAMES IN SSA SURVEY
16
Name Count
M.MANIKANDAN 99
S.PAVITHRA 84
S.MANIKANDAN 84
R.RAMYA 82
S.SANGEETHA 70
R.MANIKANDAN 69
S.DIVYA 68
M.PAVITHRA 68
S.SANTHIYA 67
S.VIGNESH 67
M.PRIYA 67
M.MAHALAKSHMI 64
S.SARANYA 63
S.SURYA 60
K.MANIKANDAN 60
P.PAVITHRA 56
S.GAYATHRI 56
P.MANIKANDAN 55
...

[Figure labels: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]
NOT EVERYTHING IS POWER-LAW, THOUGH
18
We need to understand what drives these distributions from their behaviour.
ORDERED CATEGORICALS HAVE MORE INFORMATION
19
CORPORATE BAND
20
Band Count
5 12247
4 4449
3 205
2 63
Not Mapped 24
1 22
SVP 10
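Because bands are ordered, we can go beyond counts and compute cumulative shares. The sketch below uses the counts from the table above. The band order (5 first, SVP last) is an assumption, since the deck does not state it, and "Not Mapped" is left out because it has no place in the order.

```python
band_counts = {"5": 12247, "4": 4449, "3": 205, "2": 63,
               "Not Mapped": 24, "1": 22, "SVP": 10}

# Assumed order, not stated in the deck: band 5 first, SVP last.
band_order = ["5", "4", "3", "2", "1", "SVP"]

def cumulative_share(counts, order):
    """Return (band, cumulative share) pairs following the given order."""
    total = sum(counts[band] for band in order)
    running = 0
    shares = []
    for band in order:
        running += counts[band]
        shares.append((band, running / total))
    return shares

shares = cumulative_share(band_counts, band_order)
for band, share in shares:
    print(f"{band:>3} {share:.1%}")
```

Cumulative shares answer questions such as "what fraction of people are at band 4 or below", which an unordered category cannot.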

LOCAL BAND
21
Band Count
5A 7483
5B 4764
4A 1683
4B 1612
4C 747
4D 407
3 205
2 63
Not Mapped 24
1 22
SVP 10
QUANTITIES HAVE EVEN MORE INFORMATION
22
AGE DISTRIBUTION IS LOG-NORMAL
23
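One quick way to check whether a quantity is roughly log-normal is to compare the skewness of the raw values with the skewness of their logarithms: for a log-normal column the logs should be close to symmetric. The sketch below uses synthetic data as a stand-in, since the deck's age column is not available here. The parameters are illustrative assumptions.

```python
import math
import random

def skewness(values):
    """Skewness as the third standardised moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / variance ** 1.5

# Synthetic stand-in data (not the deck's age column), drawn from a log-normal.
rng = random.Random(0)
ages = [rng.lognormvariate(math.log(30), 0.25) for _ in range(5000)]

raw_skew = skewness(ages)
log_skew = skewness([math.log(age) for age in ages])
print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

If the skewness drops towards zero after taking logs, a log-normal shape is plausible. This is a screen, not a formal test.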
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
24

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific customers, not randomly.
Here are such customers’ meter readings.
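The pattern in the readings above can be quantified by counting how many land exactly on a slab boundary. The sketch below uses the ten rows shown. The boundary spacing of 50 units is an assumption made for illustration, since the deck does not list the actual tariff slabs.

```python
from collections import Counter

# Readings copied from the table above: one row per customer, one value per month.
readings = [
    [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200],
    [250, 200, 200, 200, 201, 200, 200, 200, 250, 200, 200, 150],
    [250, 150, 150, 200, 200, 200, 200, 200, 200, 200, 200, 150],
    [150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 50],
    [200, 200, 200, 150, 180, 150, 50, 100, 50, 70, 100, 100],
    [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100],
    [100, 150, 123, 123, 50, 100, 50, 100, 100, 100, 100, 100],
    [0, 111, 100, 100, 100, 100, 100, 100, 100, 100, 50, 50],
    [0, 100, 27, 100, 50, 100, 100, 100, 100, 100, 70, 100],
    [1, 1, 1, 100, 99, 50, 100, 100, 100, 100, 100, 100],
]

SLAB_WIDTH = 50  # assumed spacing of slab boundaries; not stated in the deck

flat = [value for row in readings for value in row]
on_boundary = [v for v in flat if v > 0 and v % SLAB_WIDTH == 0]
share_on_boundary = len(on_boundary) / len(flat)

print(f"share of readings on a boundary: {share_on_boundary:.0%}")
print("most common readings:", Counter(flat).most_common(3))
```

Genuine consumption would rarely land on round numbers this often, so a high share for a customer is a reason to look closer.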
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of 100-unit meter readings, the value varies considerably across sections and over time.
New section manager arrives …
… and is transferred out …
… with some explainable anomalies.
Why would these happen?
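The "extent of fraud" measure can be sketched as follows. The deck does not say what the excess is measured against, so this sketch assumes a baseline: the average count of the readings just below and just above 100 units. The function name, the window size and the sample data are all hypothetical.

```python
from collections import Counter

def excess_at(readings, boundary=100, window=5):
    """Fractional excess of readings at `boundary` over the average
    count of its neighbours within `window` units (assumed baseline)."""
    counts = Counter(readings)
    neighbours = [counts[boundary + offset]
                  for offset in range(-window, window + 1) if offset != 0]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        return None  # no baseline to compare against
    return (counts[boundary] - expected) / expected

# Hypothetical readings for one section and month: ten readings at each
# value from 95 to 105, plus seven extra readings at exactly 100.
sample = [value for value in range(95, 106) for _ in range(10)] + [100] * 7
extent = excess_at(sample)
print(f"extent of fraud: {extent:.0%}")
```

Computing this per section and per month gives a grid like the table above, where changes over time can be lined up against events such as a change of section manager.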
25
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
EDUCATION
26
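Each of these questions has the same shape: group the students by one factor and compare the average marks of the groups. The sketch below shows that pattern. The student records are made up for illustration and say nothing about real results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student records, for illustration only.
students = [
    {"gender": "F", "medium": "Tamil", "marks": 78},
    {"gender": "F", "medium": "English", "marks": 64},
    {"gender": "M", "medium": "Tamil", "marks": 55},
    {"gender": "M", "medium": "English", "marks": 71},
]

def mean_by(records, factor, metric="marks"):
    """Average of `metric` for each distinct value of `factor`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record[metric])
    return {group: mean(values) for group, values in groups.items()}

for factor in ("gender", "medium"):
    print(factor, mean_by(students, factor))
```

Because the factor is just a parameter, the same loop can run over every column in the data. A difference in averages is a starting point for investigation, not proof that the factor causes it.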
TN CLASS X: ENGLISH
[Histogram: marks from 0 to 100 on one axis, counts from 0 to 40,000 on the other]
27
TN CLASS X: SOCIAL SCIENCE
[Histogram: marks from 0 to 100 on one axis, counts from 0 to 40,000 on the other]
28

TN CLASS X: LANGUAGE
[Histogram: marks from 0 to 100 on one axis, counts from 0 to 40,000 on the other]
29
TN CLASS X: SCIENCE
[Histogram: marks from 0 to 100 on one axis, counts from 0 to 40,000 on the other]
30
TN CLASS X: MATHEMATICS
[Histogram: marks from 0 to 100 on one axis, counts from 0 to 40,000 on the other]
31
ICSE 2013 CLASS XII: TOTAL MARKS
32

CBSE 2013 CLASS XII: ENGLISH MARKS
33
CBSE 2013 CLASS XII: PHYSICS MARKS
34
35
METADATA
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
LET’S TAKE ONE DAY CRICKET DATA
Country Player Runs ScoreRate MatchDate Ground Versus
Australia Michael J Clarke 99* 93.39 30-06-2010 The Oval England
Australia Dean M Jones 99* 128.57 28-01-1985 Adelaide Oval Sri Lanka
Australia Bradley J Hodge 99* 115.11 04-02-2007 Melbourne Cricket Ground New Zealand
India Virender Sehwag 99* 99 16-08-2010 Rangiri Dambulla International Stad. Sri Lanka
New Zealand Bruce A Edgar 99* 72.79 14-02-1981 Eden Park India
Pakistan Mohammad Yousuf 99* 95.19 15-11-2007 Captain Roop Singh Stadium India
West Indies Richard B Richardson 99* 70.21 15-11-1985 Sharjah CA Stadium Pakistan
West Indies Ramnaresh R Sarwan 99* 95.19 15-11-2002 Sardar Patel Stadium India
Zimbabwe Andrew Flower 99* 89.18 24-10-1999 Harare Sports Club Australia
Zimbabwe Alistair D R Campbell 99* 79.83 01-10-2000 Queens Sports Club New Zealand
Zimbabwe Malcolm N Waller 99* 133.78 25-10-2011 Queens Sports Club New Zealand
Australia David C Boon 98* 82.35 08-12-1994 Bellerive Oval Zimbabwe
Australia Graeme M Wood 98* 63.22 11-01-1981 Melbourne Cricket Ground India
England Ian J L Trott 98* 84.48 20-10-2011 Punjab Cricket Association Stadium India
India Yuvraj Singh 98* 89.09 01-08-2001 Sinhalese Sports Club Ground Sri Lanka
Ireland Kevin J O'Brien 98* 94.23 10-07-2010 VRA Ground Scotland
Kenya Collins O Obuya 98* 75.96 13-03-2011 M.Chinnaswamy Stadium Australia
Netherlands Ryan N ten Doeschate 98* 73.68 01-09-2009 VRA Ground Afghanistan
New Zealand James E C Franklin 98* 142.02 07-12-2010 M.Chinnaswamy Stadium India
Pakistan Ijaz Ahmed 98* 112.64 28-10-1994 Iqbal Stadium South Africa
South Africa Jacques H Kallis 98* 74.24 06-02-2000 St George's Park Zimbabwe
36

Against which countries are higher averages scored?
Which countries’ players score more per match?
37
Which player scores the
most per ball?
The player with the highest strike
rate is an obscure South African
whose name most of us have never
heard of.
In fact, this list is filled with players
we have never heard of.
38
Most analysis answers the question
“Which are the top 10 X?”
Which are my top products?
Which are my top branches?
Who are my best sales people?
Which vendors have the highest cost per unit?
Which divisions are spending the most money?
In which hours does the under 12 segment watch TV most?
Which customer segment has the highest revenue per user?
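Each of the questions above has the same shape: group the records by one field, aggregate a measure, and keep the largest few groups. A minimal sketch of that pattern follows, assuming a made-up list of (division, amount) records; the names SPEND and top_n are illustrative and do not come from the deck.

```python
from collections import defaultdict

# Hypothetical records: (division, amount spent). Not from the original deck.
SPEND = [
    ("Marketing", 120.0),
    ("Engineering", 300.0),
    ("Sales", 80.0),
    ("Marketing", 60.0),
    ("Engineering", 50.0),
    ("Support", 40.0),
]


def top_n(records, n):
    """Sum the amounts per group and return the n largest (group, total) pairs."""
    totals = defaultdict(float)
    for group, amount in records:
        totals[group] += amount
    # Sort groups by their total, largest first, and keep the first n.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


TOP_DIVISIONS = top_n(SPEND, 2)
print(TOP_DIVISIONS)
```

Swapping the sum for a mean or a ratio would cover questions such as "highest cost per unit" or "highest revenue per user".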
39
THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY
Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe
Take every column in the data
Find the top value by that column
Country: South Africa has the highest strike rate of 76%
Player: Johann Louw has the highest strike rate of 329%
Runs: 164 runs has the highest strike rate of 156%
MatchDate: 12-03-2006 has the highest strike rate of 136%
Ground: AC-VDCA Stadium has the highest strike rate of 98%
Versus: United States has the highest strike rate of 104%
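The two steps on this slide (take every column, then find the top value by that column) can be sketched in a few lines. This is only an illustration under assumptions: the sample rows are a handful copied from the table on the slide, the per-group aggregate is taken to be the mean of ScoreRate (the deck does not say how the rate is aggregated), and the function name top_value_by_column is made up.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the table on the slide (subset of columns only).
ROWS = [
    {"Country": "Australia", "Player": "Michael J Clarke", "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones", "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag", "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "New Zealand", "Player": "James E C Franklin", "Versus": "India", "ScoreRate": 142.02},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "Versus": "New Zealand", "ScoreRate": 133.78},
]


def top_value_by_column(rows, metric):
    """For every column except the metric, find the value whose rows have the
    highest mean metric. Using the mean is an assumption, not the deck's method."""
    result = {}
    columns = [name for name in rows[0] if name != metric]
    for column in columns:
        # Collect the metric values seen for each distinct value of this column.
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row[metric])
        averages = {value: mean(values) for value, values in groups.items()}
        best = max(averages, key=averages.get)
        result[column] = (best, averages[best])
    return result


TOP = top_value_by_column(ROWS, "ScoreRate")
for column, (value, rate) in TOP.items():
    print(f"{column}: {value} has the highest strike rate of {rate:.2f}")
```

Because the loop runs over whatever columns are present, the same function applies unchanged to any table with one numeric measure, which is the point of answering the question systematically.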
40

What do children in schools know, and what can they do, at
different stages of elementary education?
Have the inputs made into the elementary education
system had a beneficial effect or not?
41
HAVING BOOKS IMPROVES READING ABILITY
Having more books at home improves the performance of children when it
comes to reading. (But children typically have only 1-10 books at home.)
Number of students sampled
What is the impact? How many more marks
can having more books fetch?
Circle size indicates number of students with
this response. Few students have no books.
Is this response (“25+ books”) good or bad?
Small red bars indicate low marks. Large
green bars indicate high marks. Students
having 25+ books tend to score high marks.
The most common response is marked in
blue. This is also the circle.
The graphic is summarized in words
Indicates whether the best response is the
most popular. Blue means that it is not.
Green means that it is. Red means that the
worst level is the most popular response.
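The colour rule in the last callout (green when the best response is also the most popular, red when the worst one is, blue otherwise) can be written down directly. The sketch below assumes each response is a (label, number of students, average marks) tuple; the function name and the figures in BOOKS are hypothetical and are not taken from the survey.

```python
def popularity_colour(responses):
    """Return the indicator colour for a list of
    (label, student_count, average_marks) tuples."""
    most_popular = max(responses, key=lambda r: r[1])  # largest circle
    best = max(responses, key=lambda r: r[2])          # highest marks
    worst = min(responses, key=lambda r: r[2])         # lowest marks
    if most_popular == best:
        return "green"
    if most_popular == worst:
        return "red"
    return "blue"


# Hypothetical numbers, shaped like the "books at home" question.
BOOKS = [
    ("No books", 50, 40.0),
    ("1-10 books", 600, 45.0),
    ("11-25 books", 200, 50.0),
    ("25+ books", 150, 55.0),
]
print(popularity_colour(BOOKS))
```

With these made-up figures the most popular response is neither the best nor the worst, so the indicator comes out blue.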
42
CHILDREN LIKE GAMES, AND THEY’RE GOOD
… but playing daily hurts reading ability
43
WATCHING TV OCCASIONALLY IS GOOD
Children who watch TV
every day don’t do as well
as children who watch TV
only once a week.
But children who never
watch TV fare the worst.
Watching TV every day
helps improve children’s
reading ability a little bit
more…
… but mathematical
abilities fall dramatically at
that point
44

WE HAVE A WEBSITE THAT YOU CAN EXPLORE
GRAMENER.COM/NAS
45
46
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE
ANALYSIS
BIVARIATE
ANALYSIS

Universidad Politécnica de Madrid  degree offer diploma TranscriptUniversidad Politécnica de Madrid  degree offer diploma Transcript
Universidad Politécnica de Madrid degree offer diploma Transcript
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
Universidad de Alcalá degree offer diploma Transcript
Universidad de Alcalá  degree offer diploma TranscriptUniversidad de Alcalá  degree offer diploma Transcript
Universidad de Alcalá degree offer diploma Transcript
 
Universitat Oberta de Catalunya degree offer diploma Transcript
Universitat Oberta de Catalunya  degree offer diploma TranscriptUniversitat Oberta de Catalunya  degree offer diploma Transcript
Universitat Oberta de Catalunya degree offer diploma Transcript
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Victoria University degree offer diploma Transcript
Victoria University  degree offer diploma TranscriptVictoria University  degree offer diploma Transcript
Victoria University degree offer diploma Transcript
 

Automating Data Exploration SciPy 2016

  • 1. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. A TOOL AGNOSTIC APPROACH
  • 2. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 3. LET’S TAKE A DATASET. Each row has details about an employee who has left the organization. Just “reading” the dataset is quite informative.
  • 4. DESCRIBE THE DATA IN A STRUCTURED WAY
  • 5. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 6. CATEGORICAL COLUMNS YIELD VERY LITTLE DATA. There’s not much information in one column. The values are not quantitative, so a distribution is not meaningful. The values are not even ordered. In fact, the only thing we have is the list of values and their count. ... or is there more to this? Region (Count): India 10780; Headstrong 1554; China 1130; Philippines 1030; US 792; Romania 788; Mexico 324; Guatemala 233; Poland 124; Brazil 45; Hungary 41; Colombia 38; Netherlands 33; South Africa 30; UK 18; UAE 15; GMS India 15; Japan 11; CZECH Republic 10; Kenya 9
  • 7. ... BUT RANK FREQUENCY IS STILL POSSIBLE. The rank of the row provides additional information. With this, we can explore the distribution of the rank against the count. These distributions are called rank-frequency distributions. Rank, Region (Count): 1 India 10780; 2 Headstrong 1554; 3 China 1130; 4 Philippines 1030; 5 US 792; 6 Romania 788; 7 Mexico 324; 8 Guatemala 233; 9 Poland 124; 10 Brazil 45; 11 Hungary 41; 12 Colombia 38; 13 Netherlands 33; 14 South Africa 30; 15 UK 18; 16 UAE 15; 17 GMS India 15; 18 Japan 11; 19 CZECH Republic 10; 20 Kenya 9
  • 8. REGION SHOWS A POWER LAW DISTRIBUTION. Region (Count): India 10780; Headstrong 1554; China 1130; Philippines 1030; US 792; Romania 788; Mexico 324; Guatemala 233; Poland 124; Brazil 45; Hungary 41; Colombia 38; Netherlands 33; South Africa 30; UK 18; UAE 15; GMS India 15; Japan 11; CZECH Republic 10; Kenya 9. [Chart axes: rank on a log scale, frequency on a log scale]
  • 9. COST CODE SHOWS A POWER LAW DISTRIBUTION. Cost Code (Count): 105 9542; 121 1757; 125 875; 122 796; 3001 654; 3310 635; 124 435; 131 415; 115 336; nan 207; 101 205; 127 173; 109 148; 116 91; 126 66; ...
  • 10. LE SHOWS A POWER LAW DISTRIBUTION. LE (Count): D84 11487; GPL 853; RM1 789; LC2 565; GMR 323; D95 247; GUT 233; ML1 223; CTK 184; AXE 127; A38 98; A21 79; EMP 61; BRL 45; A66 43; ...
  • 11. WHAT CAUSES POWER LAW DISTRIBUTIONS? PREFERENTIAL ATTACHMENT, EXPONENTIAL GROWTH
  • 12. NO. OF FOLLOWERS ON GITHUB. Username (Count): slidenerd 1700; astaxie 1320; MugunthKumar 1081; honcheng 870; arunoda 827; csjaba 670; cheeaun 658; timoxley 600; karlseguin 600; hemanth 514; arvindr21 400; yuvipanda 335; mbrochh 330; anandology 330; sayanee 314; zz85 314; sanand0 309; captn3m0 300; sameersbn 300; ...
  • 13. NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE. Person (Count): Lata Mangeshkar 824; Asha Bhosle 810; Shakti Kapoor 589; Kishore Kumar 585; Mohammed Rafi 527; Sunidhi Chauhan 515; Alka Yagnik 451; Udit Narayan 435; Kader Khan 430; Sonu Nigam 405; Sameer 398; Asrani 397; Helen 395; Shaan 377; Aruna Irani 375; Anupam Kher 367; Shreya Ghoshal 357; Gulshan Grover 341; ...
  • 14. PARTIES IN PARLIAMENT ELECTIONS. Name (Count): IND 44704; INC 7213; BJP 3354; BSP 2628; SP 1311; CPI 1102; JD 943; CPM 914; DDP 716; JNP 676; BJS 657; JP 563; NOTA 543; PSP 538; INC(I) 492; SHS 467; AAP 432; SWA 410; ...
  • 15. CANDIDATE NAMES IN ASSEMBLY ELECTIONS. Name (Count): NONE OF THE ABOVE 629; OM PRAKASH 478; ASHOK KUMAR 411; RAM SINGH 362; RAJ KUMAR 294; ANIL KUMAR 271; AMAR SINGH 248; MOHAN LAL 235; RAM KUMAR 224; BABU LAL 218; RAM PRASAD 213; JAGDISH 210; VIJAY KUMAR 207; RAJENDRA SINGH 196; VINOD KUMAR 195; SHYAM LAL 193; RAJESH KUMAR 186; SITA RAM 186; RAM LAL 171; ...
  • 16. STUDENT NAMES IN SSA SURVEY. Name (Count): M.MANIKANDAN 99; S.PAVITHRA 84; S.MANIKANDAN 84; R.RAMYA 82; S.SANGEETHA 70; R.MANIKANDAN 69; S.DIVYA 68; M.PAVITHRA 68; S.SANTHIYA 67; S.VIGNESH 67; M.PRIYA 67; M.MAHALAKSHMI 64; S.SARANYA 63; S.SURYA 60; K.MANIKANDAN 60; P.PAVITHRA 56; S.GAYATHRI 56; P.MANIKANDAN 55; ...
  • 18. NOT EVERYTHING IS POWER-LAW, THOUGH. Need to understand what drives these distributions from their behaviours
  • 19. ORDERED CATEGORICALS HAVE MORE INFORMATION
  • 20. CORPORATE BAND. LE (Count): 5 12247; 4 4449; 3 205; 2 63; Not Mapped 24; 1 22; SVP 10
  • 21. LOCAL BAND. LE (Count): 5A 7483; 5B 4764; 4A 1683; 4B 1612; 4C 747; 4D 407; 3 205; 2 63; Not Mapped 24; 1 22; SVP 10
  • 22. QUANTITIES HAVE EVEN MORE INFORMATION
  • 23. AGE DISTRIBUTION IS LOG-NORMAL
  • 24. DETECTING FRAUD. “We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.” (ENERGY UTILITY)
  • 25. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries. This clearly shows collusion of some form with the customers.
      This happens with specific customers, not randomly. Here are such customers’ meter readings (one customer per row):
      Apr-10 | May-10 | Jun-10 | Jul-10 | Aug-10 | Sep-10 | Oct-10 | Nov-10 | Dec-10 | Jan-11 | Feb-11 | Mar-11
      217 | 219 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 350 | 200 | 200
      250 | 200 | 200 | 200 | 201 | 200 | 200 | 200 | 250 | 200 | 200 | 150
      250 | 150 | 150 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 150
      150 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 50
      200 | 200 | 200 | 150 | 180 | 150 | 50 | 100 | 50 | 70 | 100 | 100
      100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 110 | 100
      100 | 150 | 123 | 123 | 50 | 100 | 50 | 100 | 100 | 100 | 100 | 100
      0 | 111 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50
      0 | 100 | 27 | 100 | 50 | 100 | 100 | 100 | 100 | 100 | 70 | 100
      1 | 1 | 1 | 100 | 99 | 50 | 100 | 100 | 100 | 100 | 100 | 100
      If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections, and time:
      Section | Apr-10 | May-10 | Jun-10 | Jul-10 | Aug-10 | Sep-10 | Oct-10 | Nov-10 | Dec-10 | Jan-11 | Feb-11 | Mar-11
      Section 1 | 70% | 97% | 136% | 65% | 110% | 116% | 121% | 107% | 114% | 88% | 74% | 109%
      Section 2 | 66% | 92% | 66% | 87% | 70% | 64% | 63% | 50% | 58% | 38% | 41% | 54%
      Section 3 | 90% | 46% | 47% | 43% | 28% | 31% | 50% | 32% | 19% | 38% | 8% | 34%
      Section 4 | 44% | 24% | 36% | 39% | 21% | 18% | 24% | 49% | 56% | 44% | 31% | 14%
      Section 5 | 4% | 63% | -27% | 20% | 41% | 82% | 26% | 34% | 43% | 2% | 37% | 15%
      Section 6 | 18% | 23% | 30% | 21% | 28% | 33% | 39% | 41% | 39% | 18% | 0% | 33%
      Section 7 | 36% | 51% | 33% | 33% | 27% | 35% | 10% | 39% | 12% | 5% | 15% | 14%
      Section 8 | 22% | 21% | 28% | 12% | 24% | 27% | 10% | 31% | 13% | 11% | 22% | 17%
      Section 9 | 19% | 35% | 14% | 9% | 16% | 32% | 37% | 12% | 9% | 5% | -3% | 11%
      Callouts: New section manager arrives … and is transferred out … with some explainable anomalies. Why would these happen?
  • 26. PREDICTING MARKS. “What determines a child’s marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter?” (EDUCATION)
  • 27. TN CLASS X: ENGLISH [Chart: marks 0 to 100 against frequency 0 to 40,000]
  • 28. TN CLASS X: SOCIAL SCIENCE [Chart: marks 0 to 100 against frequency 0 to 40,000]
  • 29. TN CLASS X: LANGUAGE [Chart: marks 0 to 100 against frequency 0 to 40,000]
  • 30. TN CLASS X: SCIENCE [Chart: marks 0 to 100 against frequency 0 to 40,000]
  • 31. TN CLASS X: MATHEMATICS [Chart: marks 0 to 100 against frequency 0 to 40,000]
  • 32. ICSE 2013 CLASS XII: TOTAL MARKS
  • 33. CBSE 2013 CLASS XII: ENGLISH MARKS
  • 34. CBSE 2013 CLASS XII: PHYSICS MARKS
  • 35. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 36. LET’S TAKE ONE DAY CRICKET DATA
      Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
      Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
      Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
      Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
      India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
      New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
      Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
      West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
      West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
      Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
      Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
      Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
      Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
      Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
      England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
      India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
      Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
      Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
      Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
      New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
      Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
      South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe
  • 37. Against which countries are higher averages scored? Which countries’ players score more per match?
  • 38. Which player scores the most per ball? The player with the highest strike rate is an obscure South African whose name most of us have never heard of. In fact, this list is filled with players we have never heard of.
  • 39. Most analysis answers the question “Which are the top 10 X?” Which are my top products? Which are my top branches? Who are my best sales people? Which vendors have the highest cost per unit? Which divisions are spending the most money? In which hours does the under 12 segment watch TV most? Which customer segment has the highest revenue per user?
  • 40. THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY. [The one-day cricket table from slide 36 is repeated on this slide.] Take every column in the data. Find the top value by that column.
      Country: South Africa has the highest strike rate of 76%
      Player: Johann Louw has the highest strike rate of 329%
      Runs: 164 runs has the highest strike rate of 156%
      MatchDate: 12-03-2006 has the highest strike rate of 136%
      Ground: AC-VDCA Stadium has the highest strike rate of 98%
      Versus: United States has the highest strike rate of 104%
  • 41. What do children in schools know, and what can they do, at different stages of elementary education? Have the inputs made into the elementary education system had a beneficial effect or not?
  • 42. HAVING BOOKS IMPROVES READING ABILITY. Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.) Chart annotations: Number of students sampled. What is the impact? How many more marks can having more books fetch? Circle size indicates number of students with this response. Few students have no books. Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks. The most common response is marked in blue. This is also the circle. The graphic is summarized in words. Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
  • 43. CHILDREN LIKE GAMES, AND THEY’RE GOOD … but playing daily hurts reading ability
  • 44. WATCHING TV OCCASIONALLY IS GOOD. Children who watch TV every day don’t do as well as children who watch TV only once a week. But children who never watch TV fare the worst. Watching TV every day helps improve children’s reading ability a little bit more… … but mathematical abilities fall dramatically at that point
  • 45. WE HAVE A WEBSITE THAT YOU CAN EXPLORE: GRAMENER.COM/NAS
  • 46. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
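
Slides 7 and 8 describe ranking a categorical column by count and plotting rank against frequency on log scales. A minimal Python sketch of that step follows, using the Region counts from slide 6. The function names `rank_frequency` and `loglog_slope` are illustrative choices, not something the deck defines, and the least-squares slope is only one rough way to summarise the log-log plot. A roughly straight line is consistent with a power law but does not prove one.

```python
import math

# Region counts copied from slide 6 of the deck.
region_counts = {
    "India": 10780, "Headstrong": 1554, "China": 1130, "Philippines": 1030,
    "US": 792, "Romania": 788, "Mexico": 324, "Guatemala": 233,
    "Poland": 124, "Brazil": 45, "Hungary": 41, "Colombia": 38,
    "Netherlands": 33, "South Africa": 30, "UK": 18, "UAE": 15,
    "GMS India": 15, "Japan": 11, "CZECH Republic": 10, "Kenya": 9,
}


def rank_frequency(counts):
    """Return (rank, value, count) tuples; rank 1 is the most frequent value."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, value, count)
            for rank, (value, count) in enumerate(ordered, start=1)]


def loglog_slope(ranked):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(count) for _, _, count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


ranked = rank_frequency(region_counts)
slope = loglog_slope(ranked)
print(ranked[:3])
print(f"log-log slope: {slope:.2f}")
```

The same two functions apply unchanged to any of the other count tables in the deck (cost codes, GitHub followers, party names), which is what makes the step easy to automate.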

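Slide 40's recipe ("take every column in the data, find the top value by that column") can be sketched in a few lines of Python. The rows below are a small subset of the slide 36 table, so the output will not reproduce the figures on slide 40, which come from the full dataset. Averaging `ScoreRate` per group is an assumption; the deck does not say how the strike rate was aggregated. The name `top_by_column` is hypothetical.

```python
from collections import defaultdict

# A few rows copied from the one-day cricket table on slide 36.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones", "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag", "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "New Zealand", "Player": "Bruce A Edgar", "Versus": "India", "ScoreRate": 72.79},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "Versus": "New Zealand", "ScoreRate": 133.78},
    {"Country": "New Zealand", "Player": "James E C Franklin", "Versus": "India", "ScoreRate": 142.02},
]


def top_by_column(rows, metric):
    """For every column except `metric`, find the value with the highest mean metric."""
    result = {}
    columns = [col for col in rows[0] if col != metric]
    for col in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[metric])
        # Assumption: aggregate each group by its simple mean.
        means = {value: sum(vals) / len(vals) for value, vals in groups.items()}
        best = max(means, key=means.get)
        result[col] = (best, means[best])
    return result


top = top_by_column(rows, "ScoreRate")
for column, (value, mean_rate) in top.items():
    print(f"{column}: {value} has the highest mean ScoreRate of {mean_rate:.2f}")
```

Because the loop never names a specific column, the same function answers the "top X" questions listed on slide 39 for any table and any numeric metric.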
Editor's Notes

  1. We did the simplest possible thing: plot the number of customers who had meter readings of 0, 1, 2, 3, etc., all the way up to 300 and beyond. (Effectively, we drew a histogram.) As expected, it was log-normal: relatively few users with low meter readings, and few with high meter readings. But what was striking were the spikes at 50 units, 100 units, 200 units and 300 units, precisely at the slab boundaries.

     Given the metering system, there is a strong economic incentive to stay at or within a slab boundary, because exceeding it increases the unit rate. There are two ways this could happen. Either the consumer watches their meter carefully and, the instant it hits 100, stops using their lights and fans, or a certain amount of money changes hands.

     It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 and one of 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff who never visited the premises and were cooking up numbers. When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)

     The other question is: what is the nature of this fraudulent contract? Is it monthly, where the meter reader appears and charges a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are at 200 for 9 of the 12 months. However, there’s a spike in Jan-11 to 350 units. This indicated a monthly contract with a failure to pay in just one month. However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those in the highest slab (above 300), a more likely explanation was collusion between the lineman and the customer to place her in the highest slab just this month, to avoid scrutiny.

     Lastly, we examined the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec to Feb), when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10. We learnt that this coincided with the appointment and transfer of a new section manager, under whose “regime” fraud seems to have been dramatically controlled. It appears that a good organisation level at which to control fraud is the 5,000-people-strong section manager level, rather than the 100,000-people-strong staff level.
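
The "extent of fraud" measure described above, the height of a spike relative to the expected value, can be sketched as follows. This is a minimal illustration under stated assumptions: the expected count at a slab boundary is estimated as the average count of the neighbouring readings within a small window, which is a guess at the method rather than the authors' documented calculation. The function name `extent_of_fraud`, the `window` parameter and the sample readings are all made up for illustration.

```python
from collections import Counter


def extent_of_fraud(readings, boundary=100, window=5):
    """Percentage excess of readings at `boundary` over the neighbour average.

    Assumption: the "expected" count is the mean count of the readings
    within `window` units on either side of the boundary.
    """
    counts = Counter(readings)
    neighbours = [counts[boundary + d]
                  for d in range(-window, window + 1) if d != 0]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        raise ValueError("no neighbouring readings to estimate the expected count")
    return 100.0 * (counts[boundary] - expected) / expected


# Made-up readings, NOT from the utility's data: 10 customers at each
# reading from 95 to 105, plus 12 extra customers exactly at 100.
spiky = [r for r in range(95, 106) for _ in range(10)] + [100] * 12
print(extent_of_fraud(spiky))

# A dip at the boundary gives a negative value, like Section 5 in Jun-10.
dipped = [r for r in range(95, 106) if r != 100 for _ in range(10)] + [100] * 5
print(extent_of_fraud(dipped))
```

Running the function once per section and month would yield a grid shaped like the section table on slide 25, with positive values where readings pile up at the boundary and negative values where they fall short of their neighbours.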