Exploratory Data Analysis
Kathirmani Sukumar
Data Scientist @ Gramener
How do I start doing analysis?
Exploratory Data Analysis might help you!
CASE STUDIES
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
-- ENERGY UTILITY
AN ENERGY UTILITY DETECTED BILLING FRAUD
An energy utility (with over 50 million subscribers) had 10 years' worth of customer billing data available. Most fraud detection software failed to load the data, and sampled data revealed little or no insight.
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
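The slab-boundary check described above can be sketched in a few lines of Python. This is only an illustration, not the utility's actual analysis: the readings are made up, and the slab boundaries (100, 200, 300 units) and the `boundary_spike_ratio` helper are assumptions chosen for the example. The idea is to compare how many readings sit exactly on a boundary with how many sit on the values just around it.

```python
from collections import Counter

# Hypothetical slab boundaries in units (kWh); real tariff slabs will differ.
SLAB_BOUNDARIES = [100, 200, 300]


def boundary_spike_ratio(readings, boundary, window=5):
    """Compare the count of readings exactly at `boundary` with the
    average count of the `window` values on either side of it."""
    counts = Counter(readings)
    neighbours = [
        counts[boundary + offset]
        for offset in range(-window, window + 1)
        if offset != 0
    ]
    neighbour_mean = sum(neighbours) / len(neighbours)
    if neighbour_mean == 0:
        return float("inf") if counts[boundary] else 0.0
    return counts[boundary] / neighbour_mean


# Made-up readings: a flat spread from 90 to 110 plus a pile-up at exactly 100.
readings = list(range(90, 111)) * 10 + [100] * 90

for boundary in SLAB_BOUNDARIES:
    ratio = boundary_spike_ratio(readings, boundary)
    print(f"boundary {boundary}: {ratio:.1f}x the neighbouring frequency")
```

A ratio well above 1 at a boundary is a prompt for a closer look, not proof of fraud on its own.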
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
-- EDUCATION
TN CLASS X: ENGLISH
TN CLASS X: SOCIAL SCIENCE
TN CLASS X: LANGUAGE
TN CLASS X: SCIENCE
TN CLASS X: MATHEMATICS
[Figure: one histogram per subject; x-axis 0 to 100, y-axis 0 to 40,000]
ICSE 2013 CLASS XII: TOTAL MARKS
CBSE 2013 CLASS XII: ENGLISH MARKS
Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June-borns score the lowest
The marks shoot up for Aug-borns
… and peak for Sep-borns
120 marks out of 1,200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
… and across districts, gender, subjects, and Class X & XII.
“It’s simply that in Canada the eligibility cut-
off for age-class hockey is January 1. A boy
who turns ten on January 2, then, could be
playing alongside someone who doesn’t turn
ten until the end of the year—and at that age,
in preadolescence, a twelve-month gap in age
represents an enormous difference in physical
maturity.”
-- Malcolm Gladwell, Outliers
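The month-of-birth comparison above comes down to grouping total marks by birth month and comparing the group means. A minimal sketch follows, assuming records of the form (birth month, total marks out of 1,200). The records and the `mean_marks_by_month` helper are invented for illustration and are not the Tamil Nadu data.

```python
from collections import defaultdict
from statistics import mean

# Made-up (birth_month, total_marks_out_of_1200) records for illustration only.
records = [
    (6, 700), (6, 720),
    (8, 790), (8, 810),
    (9, 820), (9, 840),
]


def mean_marks_by_month(rows):
    """Group total marks by birth month and return {month: mean marks}."""
    by_month = defaultdict(list)
    for month, marks in rows:
        by_month[month].append(marks)
    return {month: mean(values) for month, values in sorted(by_month.items())}


averages = mean_marks_by_month(records)
gap = max(averages.values()) - min(averages.values())
for month, value in averages.items():
    print(f"month {month:2d}: mean {value:.0f}")
print(f"gap between best and worst month: {gap:.0f} marks")
```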
LET’S LOOK AT 15 YEARS OF US BIRTH DATA
This is a dataset (1975 to 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.
Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.
Most people prefer not to have children on the 13th of any month, given that it’s considered an unlucky day.
Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.
Legend: more births / fewer births, on average, for each day of the year (from 1975 to 1990)
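The "on average, for each day of the year" view of the US birth data can be computed by averaging counts across years for each (month, day) pair and dividing by the overall mean. The sketch below assumes a mapping from date to birth count. The numbers are made up and `average_by_calendar_day` is a hypothetical helper, not the code behind the original chart.

```python
from collections import defaultdict
from datetime import date

# Made-up daily birth counts keyed by date; a real dataset would cover 1975 to 1990.
daily_births = {
    date(1975, 9, 12): 11000,
    date(1976, 9, 12): 11400,
    date(1975, 9, 13): 9800,
    date(1976, 9, 13): 10000,
    date(1975, 12, 25): 7000,
    date(1976, 12, 25): 7200,
}


def average_by_calendar_day(births):
    """Average the counts over all years for each (month, day) pair."""
    grouped = defaultdict(list)
    for day, count in births.items():
        grouped[(day.month, day.day)].append(count)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


averages = average_by_calendar_day(daily_births)
overall = sum(averages.values()) / len(averages)

# Ratios above 1.0 could be shaded as "more births", below 1.0 as "fewer births".
relative = {key: avg / overall for key, avg in averages.items()}
for (month, day), ratio in sorted(relative.items()):
    print(f"{month:02d}-{day:02d}: {ratio:.2f}")
```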
THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born in the month of August, and thereafter. Most births are concentrated in the first half of the year.
We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month (that is, round-numbered dates).
Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.
Legend: more births / fewer births, on average, for each day of the year (from 2007 to 2013)
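One simple way to quantify the round-numbered-date pattern is to compare the share of birthdays falling on the 5th, 10th, 15th, 20th and 25th with the share expected if days were evenly spread. The sketch below uses an invented sample and a hypothetical `round_day_share` helper. The 30-day month used for the expected share is a simplifying assumption.

```python
from collections import Counter

ROUND_DAYS = {5, 10, 15, 20, 25}


def round_day_share(days_of_month):
    """Return the fraction of birthdays that fall on a round-numbered day."""
    counts = Counter(days_of_month)
    on_round = sum(counts[d] for d in ROUND_DAYS)
    return on_round / len(days_of_month)


# Made-up sample: every day 1..30 once, plus extra entries on round days.
sample_days = list(range(1, 31)) + [5, 10, 15, 20, 25] * 4

observed = round_day_share(sample_days)
expected = len(ROUND_DAYS) / 30  # share if days were spread evenly over a 30-day month
print(f"observed {observed:.2f} vs expected {expected:.2f}")
```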
EDA PROCESS
UNDERSTAND
• Identify relevant data & sources
• Map context
• Prepare metadata
• Label & clean data
DERIVE
• New metrics from business
• Metrics from patterns (binning, comparison, ratios, attributes, transformation)
QUESTION
• Inputs from stakeholders who would benefit from the analysis
• Based on patterns (top groups by a metric, maximise a metric, bivariate relationships)
INTERACT
• Filter by a group value
• Compare against a value or a derived metric
• Sort by a dimension
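The DERIVE and INTERACT steps above can be illustrated on a handful of rows. This is a sketch under stated assumptions: the rows, the field names (`region`, `units`, `bill`) and the `usage_bin` helper are hypothetical, and plain Python is used so that nothing beyond the standard library is needed.

```python
# Made-up customer rows; the field names are hypothetical.
rows = [
    {"region": "North", "units": 95, "bill": 190.0},
    {"region": "North", "units": 250, "bill": 750.0},
    {"region": "South", "units": 100, "bill": 200.0},
    {"region": "South", "units": 310, "bill": 1240.0},
]


def usage_bin(units, width=100):
    """Derive: bin usage into ranges such as '0-99' or '100-199'."""
    low = (units // width) * width
    return f"{low}-{low + width - 1}"


# Derive: add a binned attribute and a ratio metric to each row.
for row in rows:
    row["usage_bin"] = usage_bin(row["units"])
    row["rate"] = row["bill"] / row["units"]

# Interact: filter by a group value.
south = [row for row in rows if row["region"] == "South"]

# Interact: compare each row against a derived metric (the mean rate).
mean_rate = sum(row["rate"] for row in rows) / len(rows)
above_mean = [row for row in rows if row["rate"] > mean_rate]

# Interact: sort by a dimension.
by_units = sorted(rows, key=lambda row: row["units"], reverse=True)

print([row["usage_bin"] for row in rows])
print([row["units"] for row in by_units])
```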
LIVE DEMO
THANK YOU
Reaching out…
Kathirmani Sukumar
Email: kathir.mani@gramener.com
Twitter: @skathirmani
LinkedIn: https://in.linkedin.com/in/skathirmani

More Related Content

What's hot

Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisVishwas N
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisEva Durall
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analyticsUmasree Raghunath
 
Data Wrangling
Data WranglingData Wrangling
Data WranglingGramener
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Edureka!
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.Megha Sharma
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science ProcessVishal Patel
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Edureka!
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
data mining
data miningdata mining
data mininguoitc
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 

What's hot (20)

Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
data mining
data miningdata mining
data mining
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 

Similar to Exploratory data analysis

'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club
'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club
'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics ClubGramener
 
Humanizing Data Storytelling for Greater Business Impact
Humanizing Data Storytelling for Greater Business ImpactHumanizing Data Storytelling for Greater Business Impact
Humanizing Data Storytelling for Greater Business ImpactGramener
 
Data visualization for social problems
Data visualization for social problemsData visualization for social problems
Data visualization for social problemsGramener
 
The Art of Data Visualization
The Art of Data VisualizationThe Art of Data Visualization
The Art of Data VisualizationGramener
 
Data visualization
Data visualizationData visualization
Data visualizationMukul Taneja
 
New Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationNew Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationGanes Kesari
 
New Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationNew Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationGramener
 
The value of storytelling through data
The value of storytelling through dataThe value of storytelling through data
The value of storytelling through dataGramener
 
Data monetization
Data monetizationData monetization
Data monetizationGramener
 
Prisoners of birth: TEDx Whitefield
Prisoners of birth: TEDx WhitefieldPrisoners of birth: TEDx Whitefield
Prisoners of birth: TEDx WhitefieldGramener
 
Telling the story presentation
Telling the story presentationTelling the story presentation
Telling the story presentationteachfirst
 
Storyfying your Data: How to go from Data to Insights to Stories
Storyfying your Data: How to go from Data to Insights to StoriesStoryfying your Data: How to go from Data to Insights to Stories
Storyfying your Data: How to go from Data to Insights to StoriesGramener
 
Teaching Students How (Not) to Lie with Statistics
Teaching Students How (Not) to Lie with StatisticsTeaching Students How (Not) to Lie with Statistics
Teaching Students How (Not) to Lie with StatisticsLynette Hoelter
 
The ultimate guide to data storytelling | Materclass
The ultimate guide to data storytelling | MaterclassThe ultimate guide to data storytelling | Materclass
The ultimate guide to data storytelling | MaterclassGramener
 
Argumentative Essay Topics On Health
Argumentative Essay Topics On HealthArgumentative Essay Topics On Health
Argumentative Essay Topics On HealthAbbe Schoch
 
Language Barriers in the United States
Language Barriers in the United StatesLanguage Barriers in the United States
Language Barriers in the United StatesKyle Downey
 
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...SPPTAP
 
EYHC 2011: Don't Count Me Out
EYHC 2011: Don't Count Me OutEYHC 2011: Don't Count Me Out
EYHC 2011: Don't Count Me OutYfoundations
 

Similar to Exploratory data analysis (20)

'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club
'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club
'Visual Intelligence' by Ganes Kesari, at Hyderabad Analytics Club
 
Humanizing Data Storytelling for Greater Business Impact
Humanizing Data Storytelling for Greater Business ImpactHumanizing Data Storytelling for Greater Business Impact
Humanizing Data Storytelling for Greater Business Impact
 
Data visualization for social problems
Data visualization for social problemsData visualization for social problems
Data visualization for social problems
 
The Art of Data Visualization
The Art of Data VisualizationThe Art of Data Visualization
The Art of Data Visualization
 
Data visualization
Data visualizationData visualization
Data visualization
 
New Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationNew Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & Visualization
 
New Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & VisualizationNew Age Tools in Data Journalism - Analytics & Visualization
New Age Tools in Data Journalism - Analytics & Visualization
 
The value of storytelling through data
The value of storytelling through dataThe value of storytelling through data
The value of storytelling through data
 
Data monetization
Data monetizationData monetization
Data monetization
 
Prisoners of birth: TEDx Whitefield
Prisoners of birth: TEDx WhitefieldPrisoners of birth: TEDx Whitefield
Prisoners of birth: TEDx Whitefield
 
Telling the story presentation
Telling the story presentationTelling the story presentation
Telling the story presentation
 
Storyfying your Data: How to go from Data to Insights to Stories
Storyfying your Data: How to go from Data to Insights to StoriesStoryfying your Data: How to go from Data to Insights to Stories
Storyfying your Data: How to go from Data to Insights to Stories
 
Teaching Students How (Not) to Lie with Statistics
Teaching Students How (Not) to Lie with StatisticsTeaching Students How (Not) to Lie with Statistics
Teaching Students How (Not) to Lie with Statistics
 
The ultimate guide to data storytelling | Materclass
The ultimate guide to data storytelling | MaterclassThe ultimate guide to data storytelling | Materclass
The ultimate guide to data storytelling | Materclass
 
Argumentative Essay Topics On Health
Argumentative Essay Topics On HealthArgumentative Essay Topics On Health
Argumentative Essay Topics On Health
 
Data Collection Activity
Data Collection ActivityData Collection Activity
Data Collection Activity
 
Language Barriers in the United States
Language Barriers in the United StatesLanguage Barriers in the United States
Language Barriers in the United States
 
Telling Stories with Data
Telling Stories with DataTelling Stories with Data
Telling Stories with Data
 
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...
Using the “Checklist” to Respond to Racial Disproportionality in Special Educ...
 
EYHC 2011: Don't Count Me Out
EYHC 2011: Don't Count Me OutEYHC 2011: Don't Count Me Out
EYHC 2011: Don't Count Me Out
 

More from Gramener

6 Methods to Improve Your Manufacturing Process with Computer Vision
6 Methods to Improve Your Manufacturing Process with Computer Vision6 Methods to Improve Your Manufacturing Process with Computer Vision
6 Methods to Improve Your Manufacturing Process with Computer VisionGramener
 
Detecting Manufacturing Defects with Computer Vision
Detecting Manufacturing Defects with Computer VisionDetecting Manufacturing Defects with Computer Vision
Detecting Manufacturing Defects with Computer VisionGramener
 
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma & Healthcare
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma  & HealthcareHow to Identify the Right Key Opinion Leaders (KOLs) in Pharma  & Healthcare
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma & HealthcareGramener
 
Automated Barcode Generation System in Manufacturing
Automated Barcode Generation System in ManufacturingAutomated Barcode Generation System in Manufacturing
Automated Barcode Generation System in ManufacturingGramener
 
The Role of Technology to Save Biodiversity
The Role of Technology to Save BiodiversityThe Role of Technology to Save Biodiversity
The Role of Technology to Save BiodiversityGramener
 
Enable Storytelling with Power BI & Comicgen Plugin
Enable Storytelling with Power BI  & Comicgen PluginEnable Storytelling with Power BI  & Comicgen Plugin
Enable Storytelling with Power BI & Comicgen PluginGramener
 
The Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsThe Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsGramener
 
Low Code Platform To Build Data & AI Products
Low Code Platform To Build Data & AI ProductsLow Code Platform To Build Data & AI Products
Low Code Platform To Build Data & AI ProductsGramener
 
5 Key Foundations To Build An Effective CX Program
5 Key Foundations To Build An Effective CX Program5 Key Foundations To Build An Effective CX Program
5 Key Foundations To Build An Effective CX ProgramGramener
 
Using Power BI To Improve Media Buying & Ad Performance
Using Power BI To Improve Media Buying & Ad PerformanceUsing Power BI To Improve Media Buying & Ad Performance
Using Power BI To Improve Media Buying & Ad PerformanceGramener
 
Recession Proofing With Data : Webinar
Recession Proofing With Data : WebinarRecession Proofing With Data : Webinar
Recession Proofing With Data : WebinarGramener
 
Engage Your Audience With PowerPoint Decks: Webinar
Engage Your Audience With PowerPoint Decks: WebinarEngage Your Audience With PowerPoint Decks: Webinar
Engage Your Audience With PowerPoint Decks: WebinarGramener
 
Structure Your Data Science Teams For Best Outcomes
Structure Your Data Science Teams For Best OutcomesStructure Your Data Science Teams For Best Outcomes
Structure Your Data Science Teams For Best OutcomesGramener
 
Dawn Of Geospatial AI - Webinar
Dawn Of Geospatial AI - WebinarDawn Of Geospatial AI - Webinar
Dawn Of Geospatial AI - WebinarGramener
 
5 Steps To Become A Data-Driven Organization : Webinar
5 Steps To Become A Data-Driven Organization : Webinar5 Steps To Become A Data-Driven Organization : Webinar
5 Steps To Become A Data-Driven Organization : WebinarGramener
 
5 Steps To Measure ROI On Your Data Science Initiatives - Webinar
 5 Steps To Measure ROI On Your Data Science Initiatives - Webinar 5 Steps To Measure ROI On Your Data Science Initiatives - Webinar
5 Steps To Measure ROI On Your Data Science Initiatives - WebinarGramener
 
Saving Lives with Geospatial AI - Pycon Indonesia 2020
Saving Lives with Geospatial AI - Pycon Indonesia 2020Saving Lives with Geospatial AI - Pycon Indonesia 2020
Saving Lives with Geospatial AI - Pycon Indonesia 2020Gramener
 
Driving Transformation in Industries with Artificial Intelligence (AI)
Driving Transformation in Industries with Artificial Intelligence (AI)Driving Transformation in Industries with Artificial Intelligence (AI)
Driving Transformation in Industries with Artificial Intelligence (AI)Gramener
 
The Art of Storytelling Using Data Science
The Art of Storytelling Using Data ScienceThe Art of Storytelling Using Data Science
The Art of Storytelling Using Data ScienceGramener
 
Data and Storytelling | What Now?
Data and Storytelling | What Now?Data and Storytelling | What Now?
Data and Storytelling | What Now?Gramener
 

More from Gramener (20)

6 Methods to Improve Your Manufacturing Process with Computer Vision
6 Methods to Improve Your Manufacturing Process with Computer Vision6 Methods to Improve Your Manufacturing Process with Computer Vision
6 Methods to Improve Your Manufacturing Process with Computer Vision
 
Detecting Manufacturing Defects with Computer Vision
Detecting Manufacturing Defects with Computer VisionDetecting Manufacturing Defects with Computer Vision
Detecting Manufacturing Defects with Computer Vision
 
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma & Healthcare
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma  & HealthcareHow to Identify the Right Key Opinion Leaders (KOLs) in Pharma  & Healthcare
How to Identify the Right Key Opinion Leaders (KOLs) in Pharma & Healthcare
 
Automated Barcode Generation System in Manufacturing
Automated Barcode Generation System in ManufacturingAutomated Barcode Generation System in Manufacturing
Automated Barcode Generation System in Manufacturing
 
The Role of Technology to Save Biodiversity
The Role of Technology to Save BiodiversityThe Role of Technology to Save Biodiversity
The Role of Technology to Save Biodiversity
 
Enable Storytelling with Power BI & Comicgen Plugin
Enable Storytelling with Power BI  & Comicgen PluginEnable Storytelling with Power BI  & Comicgen Plugin
Enable Storytelling with Power BI & Comicgen Plugin
 
The Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsThe Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science Projects
 
Low Code Platform To Build Data & AI Products
Low Code Platform To Build Data & AI ProductsLow Code Platform To Build Data & AI Products
Low Code Platform To Build Data & AI Products
 
5 Key Foundations To Build An Effective CX Program
5 Key Foundations To Build An Effective CX Program5 Key Foundations To Build An Effective CX Program
5 Key Foundations To Build An Effective CX Program
 
Using Power BI To Improve Media Buying & Ad Performance
Using Power BI To Improve Media Buying & Ad PerformanceUsing Power BI To Improve Media Buying & Ad Performance
Using Power BI To Improve Media Buying & Ad Performance
 
Recession Proofing With Data : Webinar
Recession Proofing With Data : WebinarRecession Proofing With Data : Webinar
Recession Proofing With Data : Webinar
 
Engage Your Audience With PowerPoint Decks: Webinar
Engage Your Audience With PowerPoint Decks: WebinarEngage Your Audience With PowerPoint Decks: Webinar
Engage Your Audience With PowerPoint Decks: Webinar
 
Structure Your Data Science Teams For Best Outcomes
Structure Your Data Science Teams For Best OutcomesStructure Your Data Science Teams For Best Outcomes
Structure Your Data Science Teams For Best Outcomes
 
Dawn Of Geospatial AI - Webinar
Dawn Of Geospatial AI - WebinarDawn Of Geospatial AI - Webinar
Dawn Of Geospatial AI - Webinar
 
5 Steps To Become A Data-Driven Organization : Webinar
5 Steps To Become A Data-Driven Organization : Webinar5 Steps To Become A Data-Driven Organization : Webinar
5 Steps To Become A Data-Driven Organization : Webinar
 
5 Steps To Measure ROI On Your Data Science Initiatives - Webinar
 5 Steps To Measure ROI On Your Data Science Initiatives - Webinar 5 Steps To Measure ROI On Your Data Science Initiatives - Webinar
5 Steps To Measure ROI On Your Data Science Initiatives - Webinar
 
Saving Lives with Geospatial AI - Pycon Indonesia 2020
Saving Lives with Geospatial AI - Pycon Indonesia 2020Saving Lives with Geospatial AI - Pycon Indonesia 2020
Saving Lives with Geospatial AI - Pycon Indonesia 2020
 
Driving Transformation in Industries with Artificial Intelligence (AI)
Driving Transformation in Industries with Artificial Intelligence (AI)Driving Transformation in Industries with Artificial Intelligence (AI)
Driving Transformation in Industries with Artificial Intelligence (AI)
 
The Art of Storytelling Using Data Science
The Art of Storytelling Using Data ScienceThe Art of Storytelling Using Data Science
The Art of Storytelling Using Data Science
 
Data and Storytelling | What Now?
Data and Storytelling | What Now?Data and Storytelling | What Now?
Data and Storytelling | What Now?
 

Recently uploaded

MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Exploratory data analysis

  • 1. Exploratory Data Analysis. Kathirmani Sukumar, Data Scientist @ Gramener
  • 2. How do I start doing analysis?
  • 3. Exploratory Data Analysis might help you!
  • 5. DETECTING FRAUD “We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is our inexperience with the tools or analyses needed to identify such patterns.” ENERGY UTILITY
  • 6. AN ENERGY UTILITY DETECTED BILLING FRAUD An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available. Most fraud detection software failed to load the data, and sampled data revealed little or no insight. The chart is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh). It shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries. Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary. This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turning off their lights and fans the instant their usage hits the slab boundary. Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
  • 7. PREDICTING MARKS “What determines a child’s marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter?” EDUCATION
  • 8. TN CLASS X: ENGLISH [Chart; axis ticks run from 0 to 40,000 and from 0 to 100]
  • 9. TN CLASS X: SOCIAL SCIENCE [Chart; axis ticks run from 0 to 40,000 and from 0 to 100]
  • 10. TN CLASS X: LANGUAGE [Chart; axis ticks run from 0 to 40,000 and from 0 to 100]
  • 11. TN CLASS X: SCIENCE [Chart; axis ticks run from 0 to 40,000 and from 0 to 100]
  • 12. TN CLASS X: MATHEMATICS [Chart; axis ticks run from 0 to 40,000 and from 0 to 100]
  • 13. ICSE 2013 CLASS XII: TOTAL MARKS
  • 14. CBSE 2013 CLASS XII: ENGLISH MARKS
  • 15. Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200. June-borns score the lowest. The marks shoot up for Aug-borns … and peak for Sep-borns. 120 marks out of 1,200 are explainable by month of birth. An identical pattern was observed in 2009 and 2010 … and across districts, gender, subjects, and Class X & XII. “It’s simply that in Canada the eligibility cut-off for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.” -- Malcolm Gladwell, Outliers
  • 16. LET’S LOOK AT 15 YEARS OF US BIRTH DATA This is a dataset (1975 to 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known. For example: Are birthdays uniformly distributed? Do doctors or parents exercise the C-section option to move dates? Is there any day of the month that has unusually high or low births? Are there any months with relatively high or low births? Observations: Very high births in September, but this is fairly well known; most conceptions happen during the winter holiday season. Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day. Most people prefer not to have children on the 13th of any month, given that it’s considered an unlucky day. Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular. [Chart legend: more births to fewer births, on average, for each day of the year (from 1975 to 1990)]
  • 17. THE PATTERN IN INDIA IS QUITE DIFFERENT This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns. For example: Is there an aversion to the 13th, or is there a local cultural nuance? Are holidays avoided for births? Which months have a higher propensity for births, and why? Are there any patterns not found in the US data? Observations: Very few children are born in the month of August and thereafter; most births are concentrated in the first half of the year. We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, round-numbered dates. Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission. [Chart legend: more births to fewer births, on average, for each day of the year (from 2007 to 2013)]
  • 18. EDA PROCESS: UNDERSTAND, DERIVE, QUESTION, INTERACT. UNDERSTAND: identify relevant data and sources; map context; prepare metadata; label and clean data. DERIVE: new metrics from the business; metrics from patterns (binning, comparison, ratios, attributes, transformation). QUESTION: inputs from stakeholders who would benefit from the analysis; questions based on patterns (top groups by a metric, maximise a metric, bivariate relationships). INTERACT: filter by a group value; compare against a value or a derived metric; sort by a dimension.
  • 21. Reaching out… Kathirmani Sukumar. Email: kathir.mani@gramener.com Twitter: @skathirmani LinkedIn: https://in.linkedin.com/in/skathirmani
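
The slab-boundary pattern described on slide 6 can be checked with a simple frequency count. The sketch below is only an illustration, not the utility's actual method: the function name `boundary_spikes`, the `window` parameter, and the toy readings are all assumptions. It compares the number of readings exactly at a boundary with the average count of nearby values.

```python
from collections import Counter

def boundary_spikes(readings, boundaries, window=3):
    """For each slab boundary, return (count at the boundary,
    mean count over the `window` values on either side)."""
    counts = Counter(readings)
    result = {}
    for b in boundaries:
        # Counts for values just below and just above the boundary.
        neighbours = [counts[b + d] for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        result[b] = (counts[b], baseline)
    return result

# Hypothetical toy data: readings in units, with a slab boundary at 100.
readings = [97, 98, 99, 100, 100, 100, 100, 100, 100, 101, 102, 103]
spikes = boundary_spikes(readings, boundaries=[100])
print(spikes)
```

A boundary count far above its neighbourhood baseline is a prompt for further investigation, not proof of fraud on its own.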
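
The month-of-birth comparison on slide 15 amounts to a group-by average. A minimal sketch, assuming each record is a `(birth_month, total_marks)` pair; the function name and the numbers are invented for illustration and are not taken from the Tamil Nadu data.

```python
from collections import defaultdict

def mean_marks_by_month(records):
    """Average total marks for each birth month (1 = January)."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of marks, count]
    for month, marks in records:
        totals[month][0] += marks
        totals[month][1] += 1
    return {month: s / n for month, (s, n) in totals.items()}

# Hypothetical records: (birth month, marks out of 1,200).
records = [(6, 700), (6, 720), (9, 760), (9, 780)]
means = mean_marks_by_month(records)
gap = max(means.values()) - min(means.values())
print(means, gap)
```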
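
The round-numbered dates noted on slide 17 can be quantified as the share of births falling on the 5th, 10th, 15th, 20th or 25th. The sketch below assumes birth dates are available as `datetime.date` values; `round_day_share` is a hypothetical helper and the dates are made up. If births were spread evenly over the days of a month, the share would be roughly 5 in 30, or about 0.16.

```python
from collections import Counter
from datetime import date

def round_day_share(birth_dates, round_days=(5, 10, 15, 20, 25)):
    """Fraction of birth dates whose day of month is a round number."""
    days = Counter(d.day for d in birth_dates)
    total = sum(days.values())
    if total == 0:
        return 0.0
    return sum(days[d] for d in round_days) / total

# Hypothetical toy data.
dates = [date(2010, 1, 5), date(2010, 2, 10), date(2010, 3, 15), date(2010, 4, 7)]
share = round_day_share(dates)
print(share)
```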
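
The INTERACT step on slide 18 (filter by a group value, compare against a derived metric, sort by a dimension) can be sketched over plain dictionaries. The `explore` function, the column names, and the rows are assumptions chosen for illustration.

```python
def explore(rows, group_key, group_value, metric):
    """Filter rows to one group, compare each row's metric with the
    group mean, and sort by the metric in descending order."""
    subset = [r for r in rows if r[group_key] == group_value]
    if not subset:
        return []
    mean = sum(r[metric] for r in subset) / len(subset)
    # Add the difference from the group mean as a derived column.
    compared = [{**r, "vs_mean": r[metric] - mean} for r in subset]
    return sorted(compared, key=lambda r: r[metric], reverse=True)

# Hypothetical toy data.
rows = [
    {"district": "North", "subject": "Maths", "marks": 60},
    {"district": "South", "subject": "Maths", "marks": 80},
    {"district": "North", "subject": "English", "marks": 70},
]
result = explore(rows, "subject", "Maths", "marks")
for r in result:
    print(r)
```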