SJB Institute of Technology
(An Autonomous Institute under Visvesvaraya Technological University, Belagavi.
Approved by AICTE, New Delhi. Recognized by UGC, New Delhi with 2(f) and 12(B).
Accredited by NAAC with ‘A+’ Grade, Accredited by National Board of Accreditation)
No. 67, BGS Health & Education City, Dr. Vishnuvardhan Road, Kengeri, Bengaluru-560060
II Jai Sri Gurudev II
Sri Adichunchanagiri Shikshana Trust ®
Exploratory Data Analysis for Business
22MBABA304
Prepared by
Dr. Harshitha S
Assistant Professor
MBA Department
SJBIT
Module 1 (8 HOURS)
Introduction to Data Mining:
Applications - Nature of the Problem - Classification Problems in Real Life -
Email Spam, Handwritten Digit Recognition, Image segmentation, Speech
Recognition, DNA Expression Microarray, DNA Sequence Classification.
Exploratory Data Analysis (EDA)- What is Data- Numerical Summarization -
Measures of Similarity and Dissimilarity, Proximity Distance- Euclidean Distance,
Minkowski Distance, Mahalanobis Distance. Visualization- Tools for Displaying
Single Variables - Tools for Displaying Relationships Between Two Variables -
Tools for Displaying More Than Two Variables.
R Scripts- R Library: ggplot2-R Markdown
MODULE 2: (8 HOURS)
Statistical Learning and Model Selection:
Prediction Accuracy - Prediction Error, Training and Test Error as a
Function of Model Complexity, Overfitting a Model, Bias-Variance
Trade-off, Cross-Validation - Holdout Sample: Training and Test Data,
Three-way Split: Training, Validation and Test Data, Cross-Validation,
Random Subsampling, K-fold Cross-Validation, Leave-One-Out Cross-
Validation with examples for each.
MODULE 3: (8 HOURS)
Linear Regression and Variable Selection:
Meaning- Review Expectation, Variance, Frequentist Basics,
Parameter Estimation, Linear Methods, Point Estimate,
Example Results, Theoretical Justification, R Scripts.
Variable Selection- Variable Selection for the Linear Model,
R Scripts.
MODULE 4: (9 HOURS)
Regression Shrinkage Methods and Tree-Based Methods:
Meaning, Types- Ridge Regression, Compare Squared Loss for
Ridge Regression, More on Coefficient Shrinkage, The Lasso. Tree
Based Methods- Construct the Tree, The Impurity Function, Estimate
the Posterior Probabilities of Classes in Each Node, Advantages of
the Tree-Structured Approach, Variable Combinations, Missing
Values, Right Sized Tree via Pruning, Bagging and Random Forests,
R Scripts, Bagging, From Bagging to Random Forests, Boosting
MODULE 5: (10 HOURS)
Principal Components Analysis and Classification:
Singular Value Decomposition (SVD), Principal Components,
Principal Components Analysis (PCA), Geometric
Interpretation, Acquire Data, Classification - Classification
Error Rate, Bayes Classification Rule, Linear Methods for
Classification, Logistic Regression - Assumptions,
Comparison with Linear Regression on Indicators- Fitting
based on Optimization Criterion, Binary Classification,
Multiclass Case (K ≥ 3), Discriminant Analysis - Class Density
Estimation, Linear Discriminant Analysis, Optimal
Classification
MODULE 6: (7 HOURS)
Support Vector Machines:
Overview, When Data is Linearly Separable, Support
Vector Classifier, When Data is NOT Linearly Separable,
Kernel Functions, Multiclass SVM.
Module 1:
Introduction to Data Mining:
Data refers to raw facts, information, or observations that can be
collected, stored, and processed. It can take various forms, such as
numbers, text, images, audio, or video. Data is typically unorganized and
lacks meaning on its own, but it becomes valuable when it is processed
and interpreted to generate insights or support decision-making.
Structured Data: This type of data is organized in a specific format,
often in rows and columns. Structured data is easy to query and analyze
because it follows a predefined model.
Examples include databases, spreadsheets, and tables.
Unstructured Data: This type of data lacks a predefined data model or
structure. Unstructured data can include text documents, images, videos,
and other formats that don't fit neatly into a table.
Data mining is the process of searching and analyzing large
batches of raw data to identify patterns and relationships
that can help solve business problems.
It refers to a set of methods applicable to large and complex
databases to eliminate randomness and discover hidden
patterns.
Data mining is about tools, methodologies, and theories for
revealing patterns in data — which is a critical step in
knowledge discovery. Data mining techniques and tools
enable enterprises to predict future trends and make more-
informed business decisions.
Data Mining Steps
1. Understand Business
What is the company’s current situation, the project’s objectives, and what
defines success?
2. Understand the Data
Figure out what kind of data is needed to solve the issue, and then collect it
from the proper sources.
3. Prepare the Data
Resolve data quality problems like duplicate, missing, or corrupted data, then
prepare the data in a format suitable to resolve the business problem.
4. Model the Data
Employ algorithms to ascertain data patterns. Data scientists create, test, and
evaluate the model.
Examples of Data Mining
The following are a few real-world examples of data mining:
Shopping Market Analysis
In the shopping market, there is a large quantity of data, and the user
must manage enormous amounts of data by looking for various patterns.
Market basket analysis is a modeling approach used to do this study.
Market basket analysis is basically a modeling approach that is
based on the notion that if you purchase one set of products, you're
more likely to purchase another set of items. This strategy may
help a retailer understand a buyer's purchasing habits. Using
differential analysis, data from different businesses and consumers
from different demographic groups may be compared.
Weather Forecasting Analysis
For prediction, weather forecasting systems rely on massive
amounts of historical data. Because massive amounts of data
are being processed, the appropriate data mining approach must
be used.
Stock Market Analysis
In the stock market, there is a massive amount of data to be
analyzed. As a result, data mining techniques are utilized to
model such data in order to do the analysis.
Fraud Detection
Traditional techniques of fraud detection are time-
consuming and difficult due to the amount of data. Data
mining aids in the discovery of relevant patterns and the
transformation of data into information.
Surveillance
Video surveillance is used practically everywhere
in everyday life for security purposes. Because we must
deal with a huge volume of acquired data, data mining is
employed in video surveillance.
Benefits of Data Mining
Data mining provides us with the means of resolving problems and issues in
this challenging information age.
• It helps companies to gather reliable information.
• It’s an efficient, cost-effective solution compared to other data
applications.
• It helps businesses make profitable production and operational
adjustments
• Data mining uses both new and legacy systems
• It helps detect credit risks and fraud
• It helps data scientists analyze enormous amounts of data quickly
• Data scientists can use the information to detect fraud, build risk models,
and improve product safety.
• It helps data scientists quickly initiate automated predictions of behaviors
and trends and discover hidden patterns.
Applications of Data Mining
Data mining is a useful and versatile tool for today’s
competitive businesses.
Banks
Data mining helps banks work with credit ratings and anti-
fraud systems, analyzing customer financial data,
purchasing transactions, and card transactions. Data mining
also helps banks better understand their customers’ online
habits and preferences, which helps when designing a new
marketing campaign.
Healthcare
Data mining helps doctors create more accurate diagnoses by
bringing together every patient’s medical history, physical
examination results, medications, and treatment patterns. Mining
also helps fight fraud and waste and bring about a more cost-
effective health resource management strategy.
Retail
The world of retail and marketing go hand-in-hand, but the former
still warrants its separate listing. Retail stores and supermarkets
can use purchasing patterns to narrow down product associations
and determine which items should be stocked in the store and
where they should go. Data mining also pinpoints which
campaigns get the most response.
Marketing
If there was ever an application that benefitted from data
mining, it’s marketing!
Data mining helps bring together data on age, gender, tastes,
income level, location, and spending habits to create more
effective personalized loyalty campaigns. Data mining can
even predict which customers are more likely to unsubscribe from
a mailing list or other related service. Armed with that
information, companies can take steps to retain those
customers before they get the chance to leave!
Nature of The Problem
Data mining challenges :
• Security and Social Challenges
• Noisy and Incomplete Data
• Distributed Data
• Complex Data
• Performance
• Scalability and Efficiency of the Algorithms
• Improvement of Mining Algorithms
• Incorporation of Background Knowledge
• Data Visualization
• Data Privacy and Security
• User Interface
• Mining dependent on Level of Abstraction
• Integration of Background Knowledge
• Mining Methodology Challenges
Nature of The Problem
The nature of the problem in data mining can vary widely depending on the
specific goals and challenges of the task at hand. Data mining involves extracting
useful and previously unknown patterns or knowledge from large volumes of data.
Large Volume of Data:
One of the primary challenges in data mining is dealing with massive datasets. The
sheer volume of data can lead to issues related to storage, processing, and analysis.
Efficient algorithms and scalable techniques are required to handle large datasets.
Complexity and Dimensionality:
Data in real-world applications is often high-dimensional and complex. This
complexity can arise from various sources, such as the number of features,
interactions between features, and the nature of relationships within the data.
Dealing with high-dimensional data requires sophisticated methods for analysis
and visualization.
Data Quality and Preprocessing:
The quality of data can significantly impact the success of data mining
efforts. Issues such as missing values, outliers, noise, and inconsistencies
must be addressed through data cleaning and preprocessing techniques.
Ensuring data quality is crucial for obtaining meaningful and reliable
results.
Heterogeneity of Data:
Data in real-world scenarios often comes from diverse sources, and it
may exhibit heterogeneity in terms of formats, scales, and semantics.
Integrating and mining heterogeneous data requires specialized
techniques to handle the diversity of information.
Scalability:
As datasets grow in size, scalability becomes a critical factor. Data
mining algorithms and techniques need to scale efficiently to handle
increasing data volumes without compromising performance.
Privacy and Security Concerns:
The nature of data mining often involves analyzing sensitive
information. Ensuring the privacy and security of data is a
significant challenge, especially when dealing with personal or
confidential data. Techniques such as anonymization and
encryption may be employed to address these concerns.
Dynamic and Evolving Data:
In some applications, data is dynamic and evolves over time.
This introduces challenges related to handling streaming data,
adapting models to changes, and maintaining the relevance of
mined patterns as the data distribution shifts.
Interpretability and Explainability:
Many data mining models, especially those based on
complex machine learning algorithms, can be challenging
to interpret. Ensuring the interpretability and explainability
of models is crucial, particularly in applications where
stakeholders need to understand and trust the results.
Domain-specific Challenges:
The nature of data mining problems is often domain-
specific. Understanding the characteristics and
requirements of a particular domain is essential for
designing effective data mining solutions. Different
industries and applications may have unique challenges and
nuances.
Classification Problems in Real Life
1. Email Spam
The goal is to predict whether an email is spam and should be
delivered to the Junk folder.
The raw data comprises only the text part but ignores all images.
Text is a simple sequence of words, which is the input (X). The goal
is to predict the binary response Y: spam or not.
The first step is to process the raw data into a vector, which can be
done in several ways. The method followed here is based on the
relative frequencies of most common words and punctuation
marks in e-mail messages.
A set of 57 such words and punctuation marks are pre-selected by
researchers.
Given these 57 most commonly occurring words and punctuation
marks, then, in every e-mail message we would compute a relative
frequency for each word, i.e., the percentage of times this word
appears with respect to the total number of words in the email
message.
In the current example, 4000 email messages are considered in the
training sample. These e-mail messages are identified as either a
good e-mail or spam after reading the emails and assuming
implicitly that the human decision is perfect (an arguable point!).
Relative frequency of the 57 most commonly used words and
punctuation based on this set of emails was constructed. This is
an example of supervised learning as in the training data the
response Y is known.
In the future when a new email message is received, the
algorithm will analyze the text sequence and compute the
relative frequency for these 57 identified words. This is the new
input vector to be classified into spam or not through the
learning algorithm.
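To make the idea concrete, here is a minimal R sketch with a made-up message and a tiny stand-in vocabulary (the actual study uses the 57 pre-selected words and punctuation marks); it computes the relative frequency of each term in one message.
# Sketch: relative word frequencies for one hypothetical e-mail message
vocab   <- c("free", "win", "money", "meeting", "report")   # stand-in for the 57 pre-selected terms
message <- "win free money now and win a free prize"
words   <- strsplit(tolower(message), "\\s+")[[1]]
rel_freq <- sapply(vocab, function(w) sum(words == w) / length(words))
rel_freq   # this vector plays the role of the input x fed to the classifier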
Handwritten Digit Recognition
Handwritten digit recognition is the process to provide the
ability to machines to recognize human handwritten digits.
Image recognition for handwriting is more challenging
because everyone's handwriting is different, so handwritten
digits are harder to detect than text produced by computers,
which already has a definite standard form.
The goal is to identify images of single digits 0 - 9 correctly.
The raw data comprises images that are scaled segments from five-
digit ZIP codes, where each segment is the image of a single digit.
Every image is to be identified as 0 or 1 or 2 ... or 9. Since the numbers
are handwritten, the task is not trivial. For instance, a '5' sometimes can
very much look like a '6', and '7' is sometimes confused with '1'.
To the computer, an image is a matrix, and every pixel in the
image corresponds to one entry in the matrix. Every entry is an
integer ranging from a pixel intensity of 0 (black) to 255
(white). Hence the raw data can be submitted to the computer
directly without any feature extraction. The image matrix was
scanned row by row and then arranged into a large 256-
dimensional vector. This is used as the input to train the
classifier. Note that this is also a supervised learning algorithm
where Y, the response, is multi-level and can take 10 values.
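A minimal R sketch of the flattening step described above, using a random matrix as a stand-in for a scanned 16 x 16 digit image:
# Sketch: flatten a 16 x 16 grey-level image matrix into a 256-dimensional input vector
set.seed(1)
img <- matrix(sample(0:255, 256, replace = TRUE), nrow = 16)  # stand-in for a scanned digit
x <- as.vector(t(img))   # scan the matrix row by row
length(x)                # 256, the dimension of the input given to the classifier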
DNA Expression Microarray
A DNA expression microarray is a powerful tool used in
molecular biology and genetics to analyze the expression
levels of thousands of genes simultaneously.
Principle:
DNA microarrays consist of tiny spots, called probes,
attached to a solid surface (often a glass slide or silicon
chip).
Each spot contains a known DNA sequence that
corresponds to a specific gene or a portion of a gene.
Sample Preparation:
RNA is extracted from cells or tissues of interest. Since RNA
represents the actively expressed genes, it provides information
about gene expression levels.
The extracted RNA is converted into complementary DNA
(cDNA) using reverse transcription.
Labeling:
The cDNA is labeled with a fluorescent dye or another
detectable marker. Often, different samples are labeled with
different colors to enable comparison.
Hybridization:
The labeled cDNA is then hybridized to the microarray. The
cDNA will bind to its complementary DNA sequence on the
microarray.
Detection:
The microarray is scanned to measure the fluorescence intensity
at each spot.
The intensity of the fluorescence signal indicates the amount of
gene expression in the sample.
Data Analysis:
Bioinformatics tools are used to analyze the massive amount of
data generated. This involves comparing expression levels
between different samples or conditions.
Applications:
1. Gene Expression Profiling: Microarrays are widely used to study how genes are
expressed under different conditions or in different tissues.
2. Disease Research: Microarrays help identify genes associated with diseases, enabling
researchers to understand the molecular mechanisms underlying various conditions.
3. Pharmacogenomics: Microarrays can be used to study how individuals respond to drugs
based on their genetic makeup, leading to personalized medicine approaches.
4. Cancer Research: Microarrays are used to classify tumors based on gene expression
patterns, helping in cancer diagnosis and treatment decisions.
5. Functional Genomics: Microarrays aid in understanding the function of genes by
analyzing their expression in different biological contexts.
6. Toxicology Studies: Microarrays are used to assess the impact of drugs, chemicals, or
environmental factors on gene expression.
DNA Sequence Classification
DNA sequence classification involves categorizing DNA sequences into
different classes or groups based on certain features or patterns. This
process is crucial in various fields, including bioinformatics, genomics, and
molecular biology.
Data Collection:
Collect DNA sequences from relevant sources. These sequences may
represent genes, genomes, or other functional elements.
Feature Extraction:
Extract relevant features from the DNA sequences. Features can
include nucleotide composition, sequence motifs, structural properties,
or any other characteristic that distinguishes one class from another.
Data Preprocessing:
Clean and preprocess the data to remove noise, handle missing
values, and standardize the format. This step ensures that the data
is in a suitable form for analysis.
Labeling:
Assign labels or classes to the DNA sequences based on the
biological context or the problem at hand. For example, classes
could represent different species, functional elements, or disease
states.
Training Data and Model Selection:
Split the dataset into training and testing sets. The training set
is used to train a classification model. Choose an appropriate
classification algorithm based on the nature of the data and
the problem, such as decision trees, support vector machines,
neural networks, or others.
Feature Representation:
Represent the DNA sequences in a format suitable for the chosen
classification algorithm. This may involve encoding sequences
into numerical vectors using methods like one-hot encoding or k-
mer counting.
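As a small illustration, the R sketch below counts overlapping k-mers (here k = 2) in a short hypothetical sequence; the resulting counts could serve as one possible numeric feature vector.
# Sketch: k-mer counting (k = 2) for a hypothetical DNA sequence
dna <- "ATGCGATACG"
k   <- 2
kmers <- substring(dna, 1:(nchar(dna) - k + 1), k:nchar(dna))   # all overlapping 2-mers
table(kmers)   # frequency of each 2-mer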
Model Training:
Train the selected classification model using the
training dataset. The model learns to recognize patterns
or features that distinguish between different classes.
Model Evaluation:
Evaluate the trained model using the testing dataset to
assess its performance. Common evaluation metrics
include accuracy, precision, recall, F1 score, and area
under the receiver operating characteristic (ROC)
curve.
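As a quick illustration with made-up predicted and true labels, some of these metrics can be computed directly in R:
# Sketch: accuracy, precision, recall and F1 from hypothetical predictions
truth <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0)    # 1 = positive class, 0 = negative class
pred  <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)    # labels predicted by some classifier
tp <- sum(pred == 1 & truth == 1)           # true positives
fp <- sum(pred == 1 & truth == 0)           # false positives
fn <- sum(pred == 0 & truth == 1)           # false negatives
accuracy  <- mean(pred == truth)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)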
Optimization:
Fine-tune the model or optimize hyperparameters to
improve performance if necessary.
Prediction:
Use the trained model to predict the class labels of new,
unseen DNA sequences.
Interpretation:
Interpret the results and gain insights into the biological
significance of the classification. Understand which features
contribute most to the classification decision.
Applications of DNA sequence classification include:
• Species Identification: Classifying DNA sequences to identify
different species.
• Functional Annotation: Assigning functions to genes or
genomic regions based on their sequences.
• Disease Prediction: Predicting disease states or susceptibility
based on genetic variations.
• Drug Target Identification: Identifying potential drug targets
by classifying sequences associated with disease pathways.
Each genome is made up of DNA sequences and each DNA
segment has specific biological functions. However there are
DNA segments which are non-coding, i.e. they do not have any
biological function (or their functionalities are not yet known).
One problem in DNA sequencing is to label the sampled
segments as coding or non-coding (with a biological function or
without).
The raw DNA data comprises sequences of letters, e.g., A, C, G,
T for each of the DNA sequences. One method of classification
assumes the sequences to be realizations of random processes.
Different random processes are assumed for different classes of
sequences.
IMAGE SEGMENTATION
Image segmentation is a computer vision task that involves
dividing an image into multiple segments or regions based
on certain criteria. The goal is to simplify the representation
of an image or to make it more meaningful for analysis. Each
segment typically represents a specific object or region in the
image. Image segmentation is a crucial step in various
applications, such as object recognition, scene
understanding, medical imaging, and autonomous vehicles.
Feature Extraction:
1. Image segmentation helps in breaking down an image
into smaller, meaningful segments or regions.
2. Each segment or region can be treated as a feature,
and the properties of these regions can be analyzed for
further data mining tasks.
3. Extracted features can include color histograms,
texture information, and spatial distribution of objects.
Object Recognition:
1. Segmentation is essential for identifying and
delineating objects or entities within an image.
2. Once objects are segmented, data mining techniques
can be applied to recognize patterns, classify objects,
or discover associations among them.
Pattern Recognition:
1. Segmentation aids in identifying patterns within images by
isolating distinct regions of interest.
2. Data mining algorithms can then be applied to discover and
analyze patterns, such as trends, clusters, or anomalies, within
these segmented regions.
Image Classification:
1. Segmentation can be a preprocessing step for image
classification tasks.
2. By segmenting an image into meaningful regions, the
subsequent classification process can be more focused and
accurate, as it operates on smaller, more homogeneous portions
of the image.
Data Preprocessing:
1. Image segmentation is often used as a preprocessing step to
simplify complex images and reduce the amount of data that
needs to be processed.
2. This can help improve the efficiency and effectiveness of
subsequent data mining algorithms.
Anomaly Detection:
1. Segmentation can be used to identify anomalies or
outliers within an image.
2. Data mining techniques can then be applied to detect
unusual patterns or behaviors within these segmented
regions, indicating potential issues or interesting
phenomena.
Speech Recognition
Speech recognition, also known as automatic speech
recognition (ASR) or speech-to-text (STT), is a
technology that converts spoken language into written
text. The primary goal of speech recognition systems is to
accurately transcribe spoken words into a text format that
can be processed, analyzed, or used for various
applications.
Breaks down audio into individual sounds
Converts these sounds into a digital format
Uses algorithms and models to find out the most probable
word fit in the language.
1. Acoustic Signal Capture:
Speech recognition systems start by capturing the acoustic
signal, which is the spoken language, using a microphone or
other audio input devices.
2. Pre-processing:
The captured audio signal undergoes pre-processing to remove
noise, filter out irrelevant information, and enhance the quality
of the signal.
3. Feature Extraction:
The system extracts relevant features from the audio signal,
such as spectral features, MFCCs (Mel-Frequency Cepstral
Coefficients), or other time-frequency representations.
4. Acoustic Modeling:
Acoustic models are trained using machine learning algorithms, often
based on Hidden Markov Models (HMMs) or more recently, deep neural
networks (DNNs) in the case of deep learning-based ASR.
5. Language Modeling:
Language models incorporate linguistic context and probabilities of word
sequences. This helps improve the accuracy of recognizing words and
phrases within a given context.
6. Decoding:
The speech recognition system decodes the acoustic and language
models to produce a sequence of words that best matches the input audio.
7. Post-processing:
Post-processing steps may involve refining the output, handling errors,
and adapting to specific applications or domains.
For instance, if you call the University Park Airport,
the system might ask you your flight number, or your
origin and destination cities. The system does a very
good job recognizing city names. This is a
classification problem, in which each city name is a
class. The number of classes is very big but finite.
The raw data involves voice amplitude sampled at discrete
time points (a time sequence), which may be represented as
waveforms. In speech recognition, a
very popular method is the Hidden Markov Model. At every
time point, one or more features, such as frequencies, are
computed. The speech signal essentially becomes a
sequence of frequency vectors. This sequence is assumed to
be an instance of a hidden Markov model (HMM). An
HMM can be estimated using multiple sample sequences
under the same class (e.g., city name).
Hidden Markov Model (HMM) Methodology:
HMM captures the time dependence of the feature vectors. The HMM
has unspecified parameters that need to be estimated. Based on the
sample sequences, model estimation takes place and an HMM is
obtained. This HMM is like a mathematical signature for each word.
Each city name, for example, will have a different signature; the
signatures for "State College" and "San Francisco", say, can then be
compared. It is possible that several models are
constructed for one word or phrase. For instance, there may be a
model for a female voice as opposed to another for a male voice.
When a customer calls in for information and utters origin or
destination city pairs, the system computes the likelihood of what the
customer uttered under possibly thousands of models. The system
finds the HMM that yields the maximum likelihood and identifies the
word as the one associated with that HMM.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an analysis
approach that identifies general patterns in the data.
These patterns include outliers and features of the data
that might be unexpected. EDA is an important first step
in any data analysis.
It is an approach to analyzing datasets to
summarize their main characteristics, often with
visual methods. EDA is used for seeing what the data
can tell us before the modeling task.
Exploratory data analysis has been promoted by John
Tukey since 1970 to encourage statisticians to explore
the data, and possibly formulate hypotheses that could
lead to new data collection and experiments.
What is Data
Data are units of information.
Qualitative data is descriptive data.
It is non-numerical and is also known as categorical data.
Quantitative data is numerical information, and answers questions like
'how many', 'how much' and 'how often'.
Exploratory Data Analysis
 Exploratory Data Analysis refers to the critical process of
performing initial investigations on data so as to discover
patterns, to spot anomalies(anomalies refer to unusual or
unexpected patterns, observations, or values in a dataset),
to test hypothesis and to check assumptions with the help
of summary statistics and graphical representations.
 Data Visualization:
Data visualization is the presentation of data in a
graphical or pictorial format to help people
understand and interpret patterns, trends, and
insights within the data.
Effective data visualization is a crucial component
of data analysis, as it allows for the
communication of complex information in a clear
and accessible manner. Here are key points about
data visualization:
 Types of Visualizations:
 Charts and Graphs: Bar charts, line
charts, pie charts, scatter plots, and area
charts.
 Maps: Geographic information system
(GIS) maps and choropleth maps.
 Tables: Tabular representations of data.
 Dashboards: Combined visualizations
providing an overview of key metrics.
 Infographics: Visual representations that
convey information and data through
graphics.
 Descriptive Statistics:
Compute basic descriptive statistics (mean,
median, standard deviation, etc.) to summarize
numerical variables
Key components of EDA: Data Cleaning, Data Visualization, Feature
Engineering, Correlation & Regression, Data Segmentation, Hypothesis
Generation, and Data Quality Assessment.
The Importance of EDA
1. Data Cleaning: EDA involves examining the data for errors, missing
values, and inconsistencies. It includes techniques such as data
imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the
central tendency, variability, and distribution of variables. Measures
like mean, median, mode, standard deviation, range, and percentiles are
commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data
graphically. Visualizations such as histograms, box plots, scatter plots,
line plots, heatmaps, and bar charts help identify patterns, trends, and
relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables and
their transformations to create new features or derive meaningful
insights. Feature engineering can involve scaling, normalization,
binning, encoding categorical variables, and creating interaction or
derived variables.
5. Correlation and Relationships: EDA helps discover relationships and
dependencies between variables. Techniques such as correlation analysis,
scatter plots, and cross-tabulations offer insights into the strength and
direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful
segments based on certain criteria or characteristics. This segmentation
helps gain insights into specific subgroups within the data and can lead
to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research
questions based on the preliminary exploration of the data. It forms the
foundation for further analysis and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and
reliability of the data. It involves checking for data integrity,
consistency, and accuracy to ensure the data is suitable for analysis.
Data
 Data refers to facts, information, or values that are
collected, stored, and analyzed for various purposes
 Data denotes a collection of objects and their attributes
 Attribute(feature, variable, or field) is a property or
characteristic of an object
Types of Attributes
Nominal: Qualitative variables that do not have a natural order, e.g.
Hair color, Religion, Residence zipcode of a student.
Ordinal: Qualitative variables that have a natural order, e.g. Grades,
Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is
excellent), Street numbers in New York City.
Interval: Measurements where the difference between two values is
meaningful, e.g. Calendar dates, Temperature in Celsius or
Fahrenheit.
Ratio: Measurements where both difference and ratio are
meaningful, e.g. Temperature in Kelvin, Length, Counts.
Discrete and Continuous Attributes
Discrete Attribute
A variable or attribute is discrete if it can take a finite
or a countably infinite set of values. A discrete variable
is often represented as an integer-valued variable. A
binary variable is a special case where the attribute can
assume only two values, usually represented by 0 and
1. Examples of a discrete variable are the number of
birds in a flock; the number of heads realized when a
coin is flipped 10 times, etc.
Continuous Attribute
A variable or attribute is continuous if it can take any
value in a given range; possibly the range being
infinite. Examples of continuous variables are weights
and heights of birds, temperature of a day, etc.
Numerical summarization
 Numerical summarization involves calculating various statistical
measures to capture different aspects of the dataset. Numerical
data are usually summarized and presented by distribution,
measures of central tendency and dispersion. For normally
distributed data, arithmetic mean and standard deviation are used.
 Numerical summarization, in the context of data analysis and
statistics, refers to the process of using quantitative measures to
describe and summarize the main characteristics of a dataset.
 The goal is to distill complex information into key numerical
indicators that provide insights into the distribution, central
tendency, spread, and relationships within the data.
Measures of Location
They are single numbers representing a set of observations. Measures of
location also include measures of central tendency. Measures of central
tendency can also be taken as the most representative values of the set of
observations. The most common measures of location are the Mean, the
Median, the Mode, and the Quartiles.
Mean
The arithmetic average of all the observations. The mean equals the
sum of all observations divided by the sample size
Median
The middle-most value of the ranked set of observations so that half
the observations are greater than the median and the other half is less.
Median is a robust measure of central tendency
Mode
The most frequently occurring value in the data set. This
makes more sense when attributes are not continuous
Quartiles
Division points which split the rank-ordered data into four
equal parts.
The division points are called Q1 (the first quartile), Q2
(the second quartile or median), and Q3 (the third
quartile).
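These measures of location can be computed directly in R; the sample below is a small hypothetical one.
# Sketch: measures of location for a small example sample
x <- c(9, 10, 12, 14, 14, 15, 23, 42)
mean(x)                                        # arithmetic mean
median(x)                                      # middle-most value
quantile(x, c(0.25, 0.50, 0.75))               # Q1, Q2 (median) and Q3
names(sort(table(x), decreasing = TRUE))[1]    # mode: the most frequent value (14 here)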
Measures of Spread
Measures of location are not enough to capture all aspects of the
attributes. Measures of dispersion are necessary to understand the
variability of the data. The most common measures of dispersion are the
Variance, the Standard Deviation, the Interquartile Range, and the Range.
Variance
Measures how far a set of data
(numbers) are spread out from
their mean (average) value.
It is defined as the average of the
squared differences between the
mean and the individual data values.
Standard Deviation
Is the square root of the variance. It can be interpreted, roughly, as the
typical distance between the mean and the individual
data values.
Interquartile range (IQR)
is the difference between Q3 and Q1. IQR contains the
middle 50% of data
Range
is the difference between the maximum and minimum
values in the sample
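The corresponding measures of spread can be computed in R for the same hypothetical sample:
# Sketch: measures of spread for the same example sample
x <- c(9, 10, 12, 14, 14, 15, 23, 42)
var(x)            # variance (note that R divides by n - 1)
sd(x)             # standard deviation, the square root of the variance
IQR(x)            # interquartile range, Q3 - Q1
diff(range(x))    # range: maximum minus minimum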
Measures of Skewness
In addition to the measures of location and dispersion, the arrangement of
data or the shape of the data distribution is also of considerable interest.
The most 'well-behaved' distribution is a symmetric distribution where
the mean and the median are coincident. The symmetry is lost if there
exists a tail in either direction. Skewness is a measurement of the
distortion of symmetrical distribution or asymmetry in a data set.
Skewness is demonstrated on a bell curve when data points are not
distributed symmetrically to the left and right sides of the median on a
bell curve.
Skewness measures whether or not a distribution has a single long tail.
Skewness is measured as:
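A commonly used definition of sample skewness is:

g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}

A value near zero indicates an approximately symmetric distribution; positive values indicate a longer right tail and negative values a longer left tail.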
The figure below gives examples of symmetric and skewed
distributions. Note that these diagrams are generated from theoretical
distributions and in practice one is likely to see only approximations.
In a positively skewed distribution, the mean is greater than the median because most of
the data lie towards the lower side while the long right tail pulls the mean upward.
Measures of Correlation
Correlation describes the degree of the linear relationship between two
attributes, X and Y.
With X taking the values x(1), … , x(n) and Y taking the values y(1), … ,
y(n), the sample correlation coefficient is defined as:
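In standard notation, the sample (Pearson) correlation coefficient is:

r = \frac{\sum_{i=1}^{n}\bigl(x(i)-\bar{x}\bigr)\bigl(y(i)-\bar{y}\bigr)}{\sqrt{\sum_{i=1}^{n}\bigl(x(i)-\bar{x}\bigr)^{2}}\,\sqrt{\sum_{i=1}^{n}\bigl(y(i)-\bar{y}\bigr)^{2}}}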
The correlation coefficient is always between -1 (perfect negative linear
relationship) and +1 (perfect positive linear relationship). If the
correlation coefficient is 0, then there is no linear relationship between
X and Y.
In the figure below a set of representative plots are shown for various values of the population
correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values, the relation is a
perfectly straight line. As the value of ρ approaches 0, the elliptical shape becomes round and
then it moves again towards an elliptical shape with the principal axis in the opposite direction.
 Proximity measures are mainly mathematical techniques
that calculate the similarity/dissimilarity of data points
Manhattan distance
 The Manhattan distance, also called the Taxicab distance
or the City Block distance, calculates the distance between
two real-valued vectors.
 It is perhaps more useful for vectors that describe objects
on a uniform grid, like a chessboard or city blocks. The
taxicab name for the measure refers to the intuition for
what the measure calculates: the shortest path that a
taxicab would take between city blocks (coordinates on
the grid).
Minkowski Distance
The Minkowski distance generalizes the Euclidean and Manhattan distances. It is
named after the German mathematician Hermann Minkowski.
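In standard notation, the Minkowski distance between two points P and Q in n dimensions is:

D(P, Q) = \left( \sum_{i=1}^{n} \lvert P_i - Q_i \rvert^{p} \right)^{1/p}

where P_i and Q_i are the i-th coordinates of the two points.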
Here,
n is the number of dimensions (in this case, 2 for a two-
dimensional space),
p is a parameter that determines the order of the distance.
When p=1, it becomes the Manhattan distance, and when
p=2, it becomes the Euclidean distance.
The Minkowski distance is versatile and can be used in
various applications, including clustering, classification,
and regression. The choice of p depends on the nature of
the data and the specific requirements of the problem at
hand.
Let's consider an example of calculating the Minkowski distance
between two points in a two-dimensional space. In this example,
we'll calculate the distance between points P(1,2) and Q(4,6)
using the Minkowski distance formula.
D(P,Q) = (|1−4|² + |2−6|²)^(1/2) = √(9 + 16) = √25 = 5
So, the Minkowski distance between points P(1,2) and Q(4,6) with
p=2 is 5, which is equivalent to the Euclidean distance between
these two points. If you were to use a different value of p, you
would get a different Minkowski distance.
For example, if p=1, it would be the Manhattan distance:
D(P,Q)=∣1−4∣+∣2−6∣=3+4=7
It's important to note that the Minkowski distance generalizes to
different distance metrics depending on the value of p, and
choosing the appropriate p depends on the specific requirements of
the problem.
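These calculations can be verified in R with the built-in dist() function, which supports the Minkowski metric:
# Sketch: Minkowski distance between P(1,2) and Q(4,6) for p = 1 and p = 2
pts <- rbind(P = c(1, 2), Q = c(4, 6))
dist(pts, method = "minkowski", p = 1)   # p = 1: Manhattan distance, 7
dist(pts, method = "minkowski", p = 2)   # p = 2: Euclidean distance, 5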
Mahalanobis Distance
The Mahalanobis distance is a measure of the distance between
a point and a distribution, taking into account the correlation
between variables in the distribution.
It is a generalized form of the Euclidean distance and is
particularly useful when dealing with multivariate data.
It scales each variable by its standard deviation and adjusts for
the correlation between variables. This makes it particularly
useful in situations where variables are correlated, and it
allows for a more accurate measure of distance compared to
the Euclidean distance.
This distance metric is particularly useful in statistics
and pattern recognition, helping to identify outliers or
measure dissimilarity between observations in a
multivariate dataset.
The Mahalanobis distance accounts for the correlation
between variables in the distribution making it
applicable when variables are interrelated.
The formula for Mahalanobis distance between a point x and a
distribution with mean μ and covariance matrix Σ is given by:
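In standard notation:

D_M(x) = \sqrt{(x - \mu)^{T}\, \Sigma^{-1}\, (x - \mu)}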
The most common use for the Mahalanobis distance is to find
multivariate outliers, which indicates unusual combinations of two or
more variables. For example, it’s fairly common to find a 6′ tall
woman weighing 185 lbs, but it’s rare to find a 4′ tall woman who
weighs that much.
Example with specific values for the mean vector, covariance
matrix, and a data point.
Assume:
Mean vector μ = [65 inches, 150 pounds]
Covariance matrix S = [[25, 10], [10, 36]]
Data point x = [70 inches, 160 pounds]
1. Subtract the mean vector from the data point:
x − μ = [70 − 65, 160 − 150] = [5, 10]
2. Multiply this difference by the inverse of the covariance matrix and by the
difference again, i.e. compute (x − μ)ᵀ S⁻¹ (x − μ).
3. Take the square root of the result from the above step to get the
Mahalanobis distance.
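For the numbers above, the distance can be checked in R with the built-in mahalanobis() function, which returns the squared distance:
# Sketch: Mahalanobis distance for the worked example
x     <- c(70, 160)                            # data point (height in inches, weight in pounds)
mu    <- c(65, 150)                            # mean vector
Sigma <- matrix(c(25, 10, 10, 36), nrow = 2)   # covariance matrix
d2 <- mahalanobis(x, center = mu, cov = Sigma) # squared Mahalanobis distance
sqrt(d2)                                       # Mahalanobis distance, approximately 1.73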
Tools for Displaying Single Variables
Histograms
Histograms are the most common graphical tool to represent continuous
data. It is used to visualize the frequency or probability distribution of a
single continuous variable.
On the horizontal axis, the range of the sample is plotted. On the vertical
axis are plotted the frequencies or relative frequencies of each class. The
class width has an impact on the shape of the histogram.
Use Cases:
Histograms are commonly used in statistical analysis, quality control, and
research to understand the distribution of data.
They help identify patterns, trends, and potential outliers within a dataset.
Histograms are useful for exploring and communicating the central
tendencies and variability of a continuous variable.
Components of a Histogram:
Bins or Intervals: The entire range of the variable is divided into
intervals or bins. Each bin represents a range of values.
Frequency or Count: The height of each bar in the histogram
corresponds to the frequency or count of observations falling within the
respective bin.
Creating a Histogram:
Data Collection: Collect data on the variable of interest, ensuring it is a
continuous variable.
Data Binning: Divide the range of the data into intervals or bins. The
number and width of the bins can impact the appearance and
interpretation of the histogram.
Counting Observations: Determine how many observations fall into each bin by
counting the occurrences of data points within the specified intervals.
Plotting: For each bin, draw a rectangle or bar whose height corresponds to the
frequency of observations in that bin. The bars are typically adjacent to each other.
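A minimal R sketch of these steps, using simulated values as a stand-in for a real continuous variable:
# Sketch: histogram of a simulated continuous variable
set.seed(42)
x <- rnorm(500, mean = 50, sd = 10)   # hypothetical continuous measurements
hist(x,
     breaks = 20,                     # number of bins; changing this changes the shape
     col = "lightblue",
     main = "Histogram of a simulated variable",
     xlab = "Value", ylab = "Frequency")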
Interpreting a Histogram:
Shape: The shape of the histogram provides insights into the distribution of the
data. Common shapes include normal (bell-shaped), skewed (positively or
negatively), and uniform.
Central Tendency: The central tendency of the data, including measures like mean
or median, can be identified based on the position of the central peak or the center
of mass of the histogram.
Spread: The spread or dispersion of the data can be observed by looking at the
width of the histogram. A wider spread indicates higher variability.
Outliers: Outliers, or values significantly different from the bulk of the data, may
be visible as individual bars separated from the main distribution.
OUTLIER
An outlier is an observation or data point that significantly differs from
the rest of the data in a dataset. In other words, it is a data point that lies
an abnormal distance away from other values in a random sample from
a population. Outliers can occur in various types of data, and their
presence may indicate errors in data collection, measurement variability,
or important information about the underlying distribution of the data
Boxplot
A boxplot, also known as a box-and-whisker plot, is a graphical representation of
the distribution of a dataset. It provides a summary of key statistical measures,
including the minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
median (Q2/50th Percentile): the middle value of the dataset.
first quartile (Q1/25th Percentile): the middle number between the smallest
number (not the “minimum”) and the median of the dataset.
third quartile (Q3/75th Percentile): the middle value between the median
and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): 25th to the 75th percentile.
whiskers: the lines extending from the box out to the “maximum” and “minimum”
outliers: points plotted individually beyond the whiskers
“maximum”: Q3 + 1.5*IQR
“minimum”: Q1 − 1.5*IQR
They are useful for identifying the spread and central tendency of a variable.
Outliers, which are data points that fall significantly outside the
overall pattern of the data, are sometimes shown as individual
points beyond the whiskers.
The boxplot of the Wage distribution clearly identifies many
outliers.
Boxplots are useful for identifying the central tendency, spread,
and skewness of a dataset, as well as for detecting potential
outliers. They are particularly effective for comparing the
distribution of different groups or variables.
Example:
Find the maximum, minimum, median, first quartile, third quartile
for the given data set: 23, 42, 12, 10, 15, 14, 9.
Solution:
Given: 23, 42, 12, 10, 15, 14, 9.
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 42
Hence,
Minimum = 9
Maximum = 42
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
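The same five-number summary (and the boxplot itself) can be obtained in R:
# Sketch: five-number summary and boxplot for the example data
x <- c(23, 42, 12, 10, 15, 14, 9)
quantile(x, c(0, 0.25, 0.50, 0.75, 1), type = 1)   # 9, 10, 14, 23, 42 - matches the hand calculation
boxplot(x, main = "Boxplot of the example data")
# Note: other quartile conventions (summary() uses quantile type 7; boxplot() uses
# Tukey's hinges) can give slightly different Q1 and Q3 values for small samples.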
When it comes to representing data graphically, box plots and
histograms are two popular choices. Both are used to
summarize and display data, but they are quite different in
terms of their approach and the information they convey.
1. A histogram is a graphical representation of data that displays
the frequency of numerical data in different intervals or bins. The bars
in a histogram represent the number of observations falling into each
bin.
A Box Plot, also called a box and whisker plot, is a way of
displaying the distribution of a dataset using the five-number
summary: the minimum value, the first quartile, the median, the third
quartile, and the maximum value.
2. A histogram is a bar graph, where the height of each bar represents
the frequency or count of data points falling within a certain range. In
contrast, a box plot is a schematic that shows the range, median,
quartiles, and outliers of a dataset.
3. Histograms are commonly used to display continuous data, such as
weight, height, and temperature, and discrete data, such as counts and
scores.
Box plots are more suitable for displaying the spread and central
tendency of continuous data and comparing it across different
categories or groups.
4. In terms of the information conveyed, a histogram provides an
overview of the distribution of the data and the frequency of the
observations. It can show whether the data is normally distributed,
skewed to the left or right, or bimodal.
A box plot provides more detailed information, showing not only the
central tendency and the spread of the data but also the outliers and
the skewness of the distribution.
5. A histogram is interpreted by observing the shape of the
distribution, such as whether it is symmetric or skewed, and the
position of the center of the distribution, which is represented by
the peak of the histogram.
On the other hand, a box plot is interpreted by analyzing the
position of the whiskers, the length of the box, and the presence
of outliers. A box plot also provides information on the quartiles,
which indicate the spread of the data, and the median, which
represents the central tendency of the data.
Bar Charts:
Bar charts are effective for displaying the frequency or
proportion of categorical data. Each bar represents a category,
and the height of the bar corresponds to the frequency or
proportion of observations in that category.
Pie Charts:
Pie charts are suitable for displaying the proportions of
different categories within a dataset. Each slice represents a
category, and the size of the slice corresponds to the
proportion of the whole.
Excel or Google Sheets:
For quick and simple visualizations, spreadsheet tools like Microsoft
Excel or Google Sheets can be effective. They offer various chart
types and are easy to use for basic data visualization tasks.
R with ggplot2:
If you are comfortable with the R programming language, ggplot2 is
a powerful and flexible data visualization package that can handle
single-variable visualizations and more complex plots.
Plotly:
Plotly is a versatile graphing library that supports interactive plots. It
can be used with Python, R, and Julia. It's particularly useful if you
want to create interactive visualizations for web applications.
Tableau Public:
If you prefer a more graphical and user-friendly interface, Tableau
Public is a powerful data visualization tool. It allows you to create
interactive dashboards and share them online. Tableau can handle
single-variable visualizations and much more.
Matplotlib:
This is a widely used 2D plotting library for Python. It can
be used to create various types of plots, including
histograms, bar charts, line plots, and more. If you're
working with Python and have a single variable to visualize,
Matplotlib is a good choice.
Seaborn:
Seaborn is built on top of Matplotlib and provides a high-
level interface for drawing attractive statistical graphics. It
comes with several built-in themes and color palettes to
make your visualizations more appealing.
Tools for Displaying Relationship Between Two Variables
Scatter Plots
Scatter plots are a basic but effective way to visualize the relationship
between two continuous numerical variables. It shows the direction and
strength of association between two variables. If points generally follow a
linear pattern from the bottom-left to the top-right (positive correlation) or
vice versa (negative correlation), there is an indication of a relationship.
Outliers: Outliers, or data points that deviate significantly from the overall
pattern, can be easily identified in a scatter plot.
Matplotlib and Seaborn are Python libraries that can be used to create scatter
plots easily.
Tools: Excel, Google Sheets, Python (Matplotlib, Seaborn), R (ggplot2).
library(ISLR)                                      # load the ISLR package, which contains the Wage data set
with(Wage, plot(age, wage, pch = 19, cex = 0.6))   # scatter plot of wage against age
title(main = "Relationship between Age and Wage")  # add a title to the plot
It is clear from the scatterplot that the Wage does not seem to
depend on Age very strongly.
Contour plot
A contour plot is a graphical technique for representing a 3-
dimensional surface by plotting constant z slices, called contours,
on a 2-dimensional format.
For example, you can use a contour plot to visualize the height of
a surface in two or three dimensions.
This is useful when a continuous attribute is measured on a spatial
grid. They partition the plane into regions of similar values. The
contour lines that form the boundaries of these regions connect
points with equal values. In spatial statistics, contour plots have a
lot of applications.
 Contour plots join points of equal probability. Within the contour
lines concentration of bivariate distribution is the same. One may
think of the contour lines as slices of a bivariate density, sliced
horizontally. Contour plots are concentric; if they are perfect
circles then the random variables are independent. The more oval-
shaped they are, the farther they are from independence.
Tools for Displaying More Than Two Variables
Scatter Plot matrix
 A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize
bivariate relationships between combinations of variables. Each scatter plot
in the matrix visualizes the relationship between a pair of variables,
allowing many relationships to be explored in one chart
 A scatter plot matrix is a nonspatial tool that can be used to visualize the
relationship among up to five numeric variables.
 Scatter plot matrices are a good way to determine if linear correlations
exist between multiple variables.
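In base R, a scatter plot matrix can be produced with pairs(); the sketch below uses the built-in mtcars data set purely as an example:
# Sketch: scatter plot matrix for a few numeric variables
pairs(mtcars[, c("mpg", "disp", "hp", "wt")],
      pch = 19, cex = 0.6,
      main = "Scatter plot matrix of selected mtcars variables")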
Correlation Matrices:
Description: Displaying a correlation matrix helps to quantify and
visualize the correlations between pairs of variables.
A correlation matrix is a table that displays the correlation
coefficients between multiple variables. Each cell in the matrix
represents the correlation between two variables, and the matrix
allows for a comprehensive view of the relationships among all
variable pairs. Correlation coefficients quantify the strength and
direction of a linear relationship between two variables, ranging
from -1 (perfect negative correlation) to 1 (perfect positive
correlation), with 0 indicating no linear correlation.
Tools: Python (pandas, seaborn), R (corrplot).
Symmetric Matrix:
The correlation matrix is symmetric, meaning the correlation
between variable A and B is the same as the correlation between
B and A. This is because the correlation coefficient measures
the relationship without considering the order of the variables.
Diagonal Elements:
The diagonal elements of the matrix (where the variable is
correlated with itself) always have a correlation coefficient of 1.
This is because a variable is perfectly correlated with itself.
Range of Correlation Coefficients:
Correlation coefficients can range from -1 to 1. A coefficient of
1 indicates a perfect positive correlation, -1 indicates a perfect
negative correlation, and 0 indicates no linear correlation. The
closer the coefficient is to ±1, the stronger the correlation.
Interpretation of Correlation Values:
Positive correlation coefficients indicate a positive linear
relationship (as one variable increases, the other tends to
increase), while negative coefficients indicate a negative
linear relationship (as one variable increases, the other
tends to decrease).
Uses of Correlation Matrices:
Correlation matrices are widely used in statistics, finance,
and data analysis to explore relationships between
variables. They help identify patterns, dependencies, and
potential multicollinearity in the dataset.
Visualization:
Correlation matrices can be visualized using heatmaps. Each cell
in the heatmap is colored based on the magnitude and direction of
the correlation coefficient, providing a quick and intuitive way to
interpret the relationships.
Calculation:
The correlation coefficient (usually Pearson correlation
coefficient) between two variables X and Y is calculated as the
covariance of X and Y divided by the product of their standard
deviations. In mathematical terms, it is represented as Corr(X, Y)
= Cov(X, Y) / (σ_X * σ_Y).
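In R, such a correlation matrix can be computed with cor(); the sketch below again uses the built-in mtcars data as an example:
# Sketch: correlation matrix for a few numeric variables
round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)
# Diagonal entries are 1 (each variable with itself); off-diagonal entries lie between -1 and +1.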
Multivariate Analysis:
Correlation matrices are essential in multivariate analysis, helping
researchers and analysts understand how multiple variables are
related to each other.
Heatmaps:
A heatmap is a graphical representation of data that uses a system
of color coding to represent different values. Heatmaps are used in
various forms of analytics but are most commonly used to show
user behavior on specific web pages or webpage templates.
Heatmaps can be used to show where users have clicked on a page,
how far they have scrolled down a page, or used to display the
results of eye-tracking tests.
Tools: Python (matplotlib, seaborn), R (ggplot2).
They are particularly useful for displaying large sets of data and
identifying patterns, trends, or areas of interest. Heat maps are
commonly employed in various fields, including statistics, data
analysis, biology, finance, and geography.
The intensity of color in a heat map corresponds to the magnitude of
the values being represented. Users can quickly interpret the visual
patterns and identify areas of interest or outliers.
While heat maps are powerful tools, they have limitations.
Misinterpretation can occur if the color scale or range is not chosen
appropriately. Additionally, the effectiveness of a heat map depends
on the quality and relevance of the underlying data.
Data Collection:
Heatmaps are commonly used with large sets of data, such as
matrices or tables, where each cell contains a value.
The data points could represent various metrics, such as website
clicks, gene expression levels, financial indicators, or geographic
information.
Color Mapping:
A color gradient is chosen to represent the range of values in the
data. For example, a spectrum from cool colors (e.g., blue) to
warm colors (e.g., red) is often used.
The color scale is divided into intervals, with each interval
corresponding to a specific range of values.
Intensity Mapping:
The intensity of the color in each cell or data point represents the
magnitude of the underlying value. Higher values are typically
associated with more intense or warmer colors, while lower values
correspond to cooler colors.
Visualization:
The data is then mapped onto a visual grid or surface, with each cell
colored according to its corresponding value.
Users can observe patterns and variations in color across the grid,
making it easy to identify areas of high or low concentration.
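For matrix-style data, a simple heat map can be drawn with base R's heatmap(); the sketch below colours the column-standardized mtcars measurements as an example:
# Sketch: heat map of a numeric data matrix
heatmap(as.matrix(mtcars),
        scale = "column",   # standardize each variable so the colours are comparable
        main = "Heat map of the mtcars data")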
Interactivity (Optional):
Some heatmaps are interactive, allowing users to explore the data
further. This may involve adjusting color scales, zooming in on specific
regions, or applying filters to focus on particular aspects of the data.
Applications:
Heatmaps find applications in various fields, including
website analytics, biology, finance, geography, and more.
They help users make informed decisions by quickly
highlighting areas of interest or significance.
Customization:
Users often have the flexibility to customize heatmaps based
on their preferences. This may include choosing color
schemes, adjusting scale ranges, or applying specific
algorithms for data normalization.
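The steps above (data collection, color mapping, intensity mapping, visualization) can be sketched in R with ggplot2. This is an illustrative example only: the built-in mtcars data and the blue-to-red gradient are assumptions, not part of the original slides.

library(ggplot2)

cor_mat  <- cor(mtcars)                        # correlation matrix, values in [-1, 1]
cor_long <- as.data.frame(as.table(cor_mat))   # long format with columns Var1, Var2, Freq

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +                                       # one colored cell per variable pair
  scale_fill_gradient2(low = "blue", mid = "white",   # cool-to-warm color scale
                       high = "red", midpoint = 0,
                       name = "Correlation") +
  labs(title = "Correlation heatmap", x = NULL, y = NULL)

Here the fill color encodes both the sign and the magnitude of each correlation, so strongly related variable pairs stand out immediately.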
There are many different types of heatmaps:
Click heatmaps: These heatmaps show where users click on a webpage. Identify the
key elements on a page and see how users interact with different features.
Scroll heatmaps: See how far users scroll down a webpage with this type of heatmap.
See which parts of a page are most engaging and how users find the information they
are looking for.
Mouse movement heatmaps: These heatmaps show the path of a user's mouse as they
move the cursor around a webpage. Know where users are looking and how they
interact with different elements on the page.
Eye tracking heatmaps: This heatmap shows the path of a user's eye movements as
they look at a webpage. Understand where users are paying attention and how they
process different elements on the page.
Conversion heatmaps: Get a view of all the steps your users take to complete a
desired action, such as making a purchase, clicking a call to action (CTA), or
signing up for a newsletter. Use this information to identify bottlenecks in the
conversion process and guide users to take the desired action.
https://www.hotjar.com/heatmaps/
R Scripts:
R is an open-source programming language, available on widely used
platforms such as Windows, Linux, and Mac.
R is a programming language created by statisticians for statistics, specifically
for working with data. It is a language for statistical computing and data
visualizations used widely by business analysts, data analysts, data scientists,
and scientists.
Python is a general-purpose programming language, while R is a statistical
programming language. Python is more versatile and can be used for a wider
range of tasks, such as web development, data manipulation, and machine learning.
If you're passionate about the statistical computation and data visualization
portions of data analysis, R could be a good fit for you. If you're interested in
becoming a data scientist and working with big data, artificial intelligence, and
deep learning algorithms, Python would be the better fit.
https://biocorecrg.github.io/CRG_RIntroduction/exercise-12-ggplot2.html
R Script
An R script is simply a text file containing (almost) the same
commands that you would enter at the R command line, along with
comments. The script can be saved and used later to re-execute the
saved commands, and it can also be edited so you can execute a
modified version of the commands.
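A minimal sketch of such a script is shown below; the file name summary_stats.R and the use of R's built-in mtcars data are assumptions for illustration. Once saved, the script can be re-run with source("summary_stats.R").

# summary_stats.R -- hypothetical example script
data(mtcars)                    # load a built-in example dataset

avg_mpg <- mean(mtcars$mpg)     # measure of location
sd_mpg  <- sd(mtcars$mpg)       # measure of spread

cat("Mean mpg:", avg_mpg, "\n")
cat("SD mpg:", sd_mpg, "\n")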
In R, a class is the blueprint from which an object is created; it
defines the object's member variables (attributes).
R Library
 RStudio is a must-know tool for everyone who works
with the R programming language. It is used in data
analysis to import, access, transform, explore, plot, and
model data, and in machine learning to make predictions
on data.
ggplot2-R Markdown
Graphics packages in R
• graphics : a base R package, which means it is loaded
every time we open R
• ggplot2 : a user-contributed package from RStudio, so you
must install it the first time you use it. It is a standalone
package but also comes bundled with the tidyverse package
• lattice : a user-contributed package. It provides
the ability to display multivariate relationships, and it
improves on base-R graphics. This package supports
the creation of trellis graphs: graphs that display a
variable or the relationship between variables, conditioned
on one or more other variables.
• Markdown is a simple formatting syntax for authoring
HTML, PDF, and MS Word documents
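As a hedged sketch, a minimal R Markdown (.Rmd) file mixes Markdown text with R code chunks; the title and the chunk contents below are illustrative assumptions, not part of the original slides.

---
title: "EDA report"
output: html_document
---

Some explanatory Markdown text about the analysis.

```{r}
summary(mtcars$mpg)   # this R chunk is executed and its output is embedded in the report
```

Knitting this file from RStudio produces an HTML document containing the text, the code, and its output.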
The ggplot2 Package
 The ggplot2 package is an elegant, easy, and versatile
general graphics package in R. It implements the grammar
of graphics concept. The advantage of this concept is that it
speeds up learning graphics, and it also facilitates the
creation of complex graphics.
 To work with ggplot2, remember that, at a minimum, your R code must
 start with ggplot()
 identify which data to plot, e.g. data = Your Data
 state the variables to plot, for example aes(x = Variable on x-axis,
y = Variable on y-axis) for a bivariate plot
 choose the type of graph, for example geom_histogram() for a
histogram and geom_point() for scatterplots
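Putting those pieces together, a hedged minimal example (using R's built-in mtcars data purely as an assumed illustration) looks like this:

library(ggplot2)

# Univariate: histogram of one variable
ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Bivariate: scatterplot of two variables
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()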
  • 9. Data refers to raw facts, information, or observations that can be collected, stored, and processed. It can take various forms, such as numbers, text, images, audio, or video. Data is typically unorganized and lacks meaning on its own, but it becomes valuable when it is processed and interpreted to generate insights or support decision-making. Structured Data: This type of data is organized in a specific format, often in rows and columns. Structured data is easy to query and analyze because it follows a predefined model. Examples include databases, spreadsheets, and tables. Unstructured Data: This type of data lacks a predefined data model or structure. Unstructured data can include text documents, images, videos, and other formats that don't fit neatly into a table.
  • 10. Data mining is the process of searching and analyzing large batch of raw data sets to identify patterns and relationships that can help solve business problems through data analysis. It refers to a set of methods applicable to large and complex databases to eliminate the randomness and discover the hidden pattern. Data mining is about tools, methodologies, and theories for revealing patterns in data — which is a critical step in knowledge discovery. Data mining techniques and tools enable enterprises to predict future trends and make more- informed business decisions.
  • 11. Data Mining Steps 1. Understand Business What is the company’s current situation, the project’s objectives, and what defines success? 2. Understand the Data Figure out what kind of data is needed to solve the issue, and then collect it from the proper sources. 3. Prepare the Data Resolve data quality problems like duplicate, missing, or corrupted data, then prepare the data in a format suitable to resolve the business problem. 4. Model the Data Employ algorithms to ascertain data patterns. Data scientists create, test, and evaluate the model.
  • 12. Examples of Data Mining The following are a few real-world examples of data: Shopping Market Analysis In the shopping market, there is a big quantity of data, and the user must manage enormous amounts of data using various patterns. To do the study, market basket analysis is a modeling approach. Market basket analysis is basically a modeling approach that is based on the notion that if you purchase one set of products, you're more likely to purchase another set of items. This strategy may help a retailer understand a buyer's purchasing habits. Using differential analysis, data from different businesses and consumers from different demographic groups may be compared.
  • 13. Weather Forecasting Analysis For prediction, weather forecasting systems rely on massive amounts of historical data. Because massive amounts of data are being processed, the appropriate data mining approach must be used. Stock Market Analysis In the stock market, there is a massive amount of data to be analyzed. As a result, data mining techniques are utilized to model such data in order to do the analysis.
  • 14. Fraud Detection Traditional techniques of fraud detection are time- consuming and difficult due to the amount of data. Data mining aids in the discovery of relevant patterns and the transformation of data into information. Surveillance Well, video surveillance is utilized practically everywhere in everyday life for security perception. Because we must deal with a huge volume of acquired data, data mining is employed in video surveillance.
  • 15. Benefits of Data Mining Data mining provides us with the means of resolving problems and issues in this challenging information age. • It helps companies to gather reliable information. • It’s an efficient, cost-effective solution compared to other data applications. • It helps businesses make profitable production and operational adjustments • Data mining uses both new and legacy systems • It helps detect credit risks and fraud • It helps data scientists easily analyze enormous amounts of data quickly • Data scientists can use the information to detect fraud, build risk models, and improve product safety. • It helps data scientists quickly initiate automated predictions of behaviors and trends and discover hidden patterns.
  • 16. Applications of Data Mining Data mining is a useful and versatile tool for today’s competitive businesses. Banks Data mining helps banks work with credit ratings and anti- fraud systems, analyzing customer financial data, purchasing transactions, and card transactions. Data mining also helps banks better understand their customers’ online habits and preferences, which helps when designing a new marketing campaign.
  • 17. Healthcare Data mining helps doctors create more accurate diagnoses by bringing together every patient’s medical history, physical examination results, medications, and treatment patterns. Mining also helps fight fraud and waste and bring about a more cost- effective health resource management strategy. Retail The world of retail and marketing go hand-in-hand, but the former still warrants its separate listing. Retail stores and supermarkets can use purchasing patterns to narrow down product associations and determine which items should be stocked in the store and where they should go. Data mining also pinpoints which campaigns get the most response.
  • 18. Marketing If there was ever an application that benefitted from data mining, it’s marketing! Data mining helps bring together data on age, gender, tastes, income level, location, and spending habits to create more effective personalized loyalty campaigns. Data marketing can even predict which customers will more likely unsubscribe to a mailing list or other related service. Armed with that information, companies can take steps to retain those customers before they get the chance to leave!
  • 19. Nature of The Problem Data mining challenges : • Security and Social Challenges • Noisy and Incomplete Data • Distributed Data • Complex Data • Performance • Scalability and Efficiency of the Algorithms • Improvement of Mining Algorithms • Incorporation of Background Knowledge • Data Visualization • Data Privacy and Security • User Interface • Mining dependent on Level of Abstraction • Integration of Background Knowledge • Mining Methodology Challenges
  • 20. Nature of The Problem The nature of the problem in data mining can vary widely depending on the specific goals and challenges of the task at hand. Data mining involves extracting useful and previously unknown patterns or knowledge from large volumes of data. Large Volume of Data: One of the primary challenges in data mining is dealing with massive datasets. The sheer volume of data can lead to issues related to storage, processing, and analysis. Efficient algorithms and scalable techniques are required to handle large datasets. Complexity and Dimensionality: Data in real-world applications is often high-dimensional and complex. This complexity can arise from various sources, such as the number of features, interactions between features, and the nature of relationships within the data. Dealing with high-dimensional data requires sophisticated methods for analysis and visualization.
  • 21. Data Quality and Preprocessing: The quality of data can significantly impact the success of data mining efforts. Issues such as missing values, outliers, noise, and inconsistencies must be addressed through data cleaning and preprocessing techniques. Ensuring data quality is crucial for obtaining meaningful and reliable results. Heterogeneity of Data: Data in real-world scenarios often comes from diverse sources, and it may exhibit heterogeneity in terms of formats, scales, and semantics. Integrating and mining heterogeneous data requires specialized techniques to handle the diversity of information. Scalability: As datasets grow in size, scalability becomes a critical factor. Data mining algorithms and techniques need to scale efficiently to handle increasing data volumes without compromising performance.
  • 22. Privacy and Security Concerns: The nature of data mining often involves analyzing sensitive information. Ensuring the privacy and security of data is a significant challenge, especially when dealing with personal or confidential data. Techniques such as anonymization and encryption may be employed to address these concerns. Dynamic and Evolving Data: In some applications, data is dynamic and evolves over time. This introduces challenges related to handling streaming data, adapting models to changes, and maintaining the relevance of mined patterns as the data distribution shifts.
  • 23. Interpretability and Explainability: Many data mining models, especially those based on complex machine learning algorithms, can be challenging to interpret. Ensuring the interpretability and explainability of models is crucial, particularly in applications where stakeholders need to understand and trust the results. Domain-specific Challenges: The nature of data mining problems is often domain- specific. Understanding the characteristics and requirements of a particular domain is essential for designing effective data mining solutions. Different industries and applications may have unique challenges and nuances.
  • 24. Classification Problems in Real Life 1. Email Spam The goal is to predict whether an email is a spam and should be delivered to the Junk folder. The raw data comprises only the text part but ignores all images. Text is a simple sequence of words which is the input (X).The goal is to predict the binary response Y: spam or not. The first step is to process the raw data into a vector, which can be done in several ways. The method followed here is based on the relative frequencies of most common words and punctuation marks in e-mail messages.
  • 25. A set of 57 such words and punctuation marks are pre-selected by researchers. Given these 57 most commonly occurring words and punctuation marks, then, in every e-mail message we would compute a relative frequency for each word, i.e., the percentage of times this word appears with respect to the total number of words in the email message. In the current example, 4000 email messages are considered in the training sample. These e-mail messages are identified as either a good e-mail or spam after reading the emails and assuming implicitly that human decision is perfect(an arguable point!).
  • 26. Relative frequency of the 57 most commonly used words and punctuation based on this set of emails was constructed. This is an example of supervised learning as in the training data the response Y is known. In the future when a new email message is received, the algorithm will analyze the text sequence and compute the relative frequency for these 57 identified words. This is the new input vector to be classified into spam or not through the learning algorithm.
  • 27. Handwritten Digit Recognition Handwritten digit recognition is the process to provide the ability to machines to recognize human handwritten digits. The image recognition in handwriting is more challenging because everyone has different handwriting forms, so that on the detection also handwriting will be more difficult to detect compared writings from computers that already have a definite standard form.
  • 28. The goal is to identify images of single digits 0 - 9 correctly. The raw data comprises images that are scaled segments from five- digit ZIP codes. In the diagram below every green box is one image. Every image is to be identified as 0 or 1 or 2 ... or 9. Since the numbers are handwritten, the task is not trivial. For instance, a '5' sometimes can very much look like a '6', and '7' is sometimes confused with '1'.
  • 29. To the computer, an image is a matrix, and every pixel in the image corresponds to one entry in the matrix. Every entry is an integer ranging from a pixel intensity of 0 (black) to 255 (white). Hence the raw data can be submitted to the computer directly without any feature extraction. The image matrix was scanned row by row and then arranged into a large 256- dimensional vector. This is used as the input to train the classifier. Note that this is also a supervised learning algorithm where Y, the response, is multi-level and can take 10 values.
  • 30. DNA Expression Microarray A DNA expression microarray is a powerful tool used in molecular biology and genetics to analyze the expression levels of thousands of genes simultaneously. Principle: DNA microarrays consist of tiny spots, called probes, attached to a solid surface (often a glass slide or silicon chip). Each spot contains a known DNA sequence that corresponds to a specific gene or a portion of a gene.
  • 31. Sample Preparation: RNA is extracted from cells or tissues of interest. Since RNA represents the actively expressed genes, it provides information about gene expression levels. The extracted RNA is converted into complementary DNA (cDNA) using reverse transcription. Labeling: The cDNA is labeled with a fluorescent dye or another detectable marker. Often, different samples are labeled with different colors to enable comparison.
  • 32. Hybridization: The labeled cDNA is then hybridized to the microarray. The cDNA will bind to its complementary DNA sequence on the microarray. Detection: The microarray is scanned to measure the fluorescence intensity at each spot. The intensity of the fluorescence signal indicates the amount of gene expression in the sample. Data Analysis: Bioinformatics tools are used to analyze the massive amount of data generated. This involves comparing expression levels between different samples or conditions.
  • 33. Applications: 1.Gene Expression Profiling: 1. Microarrays are widely used to study how genes are expressed under different conditions or in different tissues. 2.Disease Research: 1. Microarrays help identify genes associated with diseases, enabling researchers to understand the molecular mechanisms underlying various conditions. 3.Pharmacogenomics: 1. Microarrays can be used to study how individuals respond to drugs based on their genetic makeup, leading to personalized medicine approaches. 4.Cancer Research: 1. Microarrays are used to classify tumors based on gene expression patterns, helping in cancer diagnosis and treatment decisions. 5.Functional Genomics: 1. Microarrays aid in understanding the function of genes by analyzing their expression in different biological contexts. 6.Toxicology Studies: 1. Microarrays are used to assess the impact of drugs, chemicals, or environmental factors on gene expression.
  • 34. DNA Sequence Classification DNA sequence classification involves categorizing DNA sequences into different classes or groups based on certain features or patterns. This process is crucial in various fields, including bioinformatics, genomics, and molecular biology. Data Collection: Collect DNA sequences from relevant sources. These sequences may represent genes, genomes, or other functional elements. Feature Extraction: Extract relevant features from the DNA sequences. Features can include nucleotide composition, sequence motifs, structural properties, or any other characteristic that distinguishes one class from another.
  • 35. . Data Preprocessing: Clean and preprocess the data to remove noise, handle missing values, and standardize the format. This step ensures that the data is in a suitable form for analysis. Labeling: Assign labels or classes to the DNA sequences based on the biological context or the problem at hand. For example, classes could represent different species, functional elements, or disease states.
  • 36. Training Data and Model Selection: Split the dataset into training and testing sets. The training set is used to train a classification model. Choose an appropriate classification algorithm based on the nature of the data and the problem, such as decision trees, support vector machines, neural networks, or others. Feature Representation: Represent the DNA sequences in a format suitable for the chosen classification algorithm. This may involve encoding sequences into numerical vectors using methods like one-hot encoding or k- mer counting.
  • 37. Model Training: Train the selected classification model using the training dataset. The model learns to recognize patterns or features that distinguish between different classes. Model Evaluation: Evaluate the trained model using the testing dataset to assess its performance. Common evaluation metrics include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve. Optimization: Fine-tune the model or optimize hyper parameters to improve performance if necessary.
  • 38. Prediction: Use the trained model to predict the class labels of new, unseen DNA sequences. Interpretation: Interpret the results and gain insights into the biological significance of the classification. Understand which features contribute most to the classification decision.
  • 39. Applications of DNA sequence classification include: •Species Identification: Classifying DNA sequences to identify different species. •Functional Annotation: Assigning functions to genes or genomic regions based on their sequences. •Disease Prediction: Predicting disease states or susceptibility based on genetic variations. •Drug Target Identification: Identifying potential drug targets by classifying sequences associated with disease pathways.
  • 40. Each genome is made up of DNA sequences and each DNA segment has specific biological functions. However there are DNA segments which are non-coding, i.e. they do not have any biological function (or their functionalities are not yet known). One problem in DNA sequencing is to label the sampled segments as coding or non-coding (with a biological function or without). The raw DNA data comprises sequences of letters, e.g., A, C, G, T for each of the DNA sequences. One method of classification assumes the sequences to be realizations of random processes. Different random processes are assumed for different classes of sequences.
  • 41. IMAGE SEGMENTATION Image segmentation is a computer vision task that involves dividing an image into multiple segments or regions based on certain criteria. The goal is to simplify the representation of an image or to make it more meaningful for analysis. Each segment typically represents a specific object or region in the image. Image segmentation is a crucial step in various applications, such as object recognition, scene understanding, medical imaging, and autonomous vehicles.
  • 42. Feature Extraction: 1. Image segmentation helps in breaking down an image into smaller, meaningful segments or regions. 2. Each segment or region can be treated as a feature, and the properties of these regions can be analyzed for further data mining tasks. 3. Extracted features can include color histograms, texture information, and spatial distribution of objects. Object Recognition: 1. Segmentation is essential for identifying and delineating objects or entities within an image. 2. Once objects are segmented, data mining techniques can be applied to recognize patterns, classify objects, or discover associations among them.
  • 43. Pattern Recognition: 1. Segmentation aids in identifying patterns within images by isolating distinct regions of interest. 2. Data mining algorithms can then be applied to discover and analyze patterns, such as trends, clusters, or anomalies, within these segmented regions. Image Classification: 1. Segmentation can be a preprocessing step for image classification tasks. 2. By segmenting an image into meaningful regions, the subsequent classification process can be more focused and accurate, as it operates on smaller, more homogeneous portions of the image. Data Preprocessing: 1. Image segmentation is often used as a preprocessing step to simplify complex images and reduce the amount of data that needs to be processed. 2. This can help improve the efficiency and effectiveness of subsequent data mining algorithms.
  • 44. Anomaly Detection: 1. Segmentation can be used to identify anomalies or outliers within an image. 2. Data mining techniques can then be applied to detect unusual patterns or behaviors within these segmented regions, indicating potential issues or interesting phenomena.
  • 45. Speech Recognition Speech recognition, also known as automatic speech recognition (ASR) or speech-to-text (STT), is a technology that converts spoken language into written text. The primary goal of speech recognition systems is to accurately transcribe spoken words into a text format that can be processed, analyzed, or used for various applications. Breaks down audio into individual sounds Converts these sounds into a digital format Uses algorithms and models to find out the most probable word fit in the language.
  • 46. 1.Acoustic Signal Capture: Speech recognition systems start by capturing the acoustic signal, which is the spoken language, using a microphone or other audio input devices. 2.Pre-processing: The captured audio signal undergoes pre-processing to remove noise, filter out irrelevant information, and enhance the quality of the signal. 3.Feature Extraction: The system extracts relevant features from the audio signal, such as spectral features, MFCCs (Mel-Frequency Cepstral Coefficients), or other time-frequency representations.
  • 47. 4. Acoustic Modeling: Acoustic models are trained using machine learning algorithms, often based on Hidden Markov Models (HMMs) or more recently, deep neural networks (DNNs) in the case of deep learning-based ASR. 5.Language Modeling: Language models incorporate linguistic context and probabilities of word sequences. This helps improve the accuracy of recognizing words and phrases within a given context. 6. Decoding: The speech recognition system decodes the acoustic and language models to produce a sequence of words that best matches the input audio. 7. Post-processing: Post-processing steps may involve refining the output, handling errors, and adapting to specific applications or domains.
  • 48. For instance, if you call the University Park Airport, the system might ask you your flight number, or your origin and destination cities. The system does a very good job recognizing city names. This is a classification problem, in which each city name is a class. The number of classes is very big but finite.
  • 49. The raw data involves voice amplitude sampled at discrete time points (a time sequence), which may be represented in the waveforms as shown above. In speech recognition, a very popular method is the Hidden Markov Model. At every time point, one or more features, such as frequencies, are computed. The speech signal essentially becomes a sequence of frequency vectors. This sequence is assumed to be an instance of a hidden Markov model (HMM). An HMM can be estimated using multiple sample sequences under the same class (e.g., city name).
  • 50. Hidden Markov Model (HMM) Methodology: HMM captures the time dependence of the feature vectors. The HMM has unspecified parameters that need to be estimated. Based on the sample sequences, model estimation takes place and an HMM is obtained. This HMM is like a mathematical signature for each word. Each city name, for example, will have a different signature. In the diagram above the signatures corresponding to State College and San Francisco are compared. It is possible that several models are constructed for one word or phrase. For instance, there may be a model for a female voice as opposed to another for a male voice. When a customer calls in for information and utters origin or destination city pairs, the system computes the likelihood of what the customer uttered under possibly thousands of models. The system finds the HMM that yields the maximum likelihood and identifies the word as the one associated with that HMM.
  • 51. Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns in the data. These patterns include outliers and features of the data that might be unexpected. EDA is an important first step in any data analysis
  • 52. It is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
  • 53. What is Data Data are units of information. Qualitative data is descriptive data. It is non-numerical and is also known as categorical data. Quantitative data is numerical information, and answers questions like 'how many', 'how much' and 'how often'.
  • 54.
  • 55. Exploratory Data Analysis  Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies(anomalies refer to unusual or unexpected patterns, observations, or values in a dataset), to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
  • 56.
  • 57.
  • 58.  Data Visualization: Data visualization is the presentation of data in a graphical or pictorial format to help people understand and interpret patterns, trends, and insights within the data. Effective data visualization is a crucial component of data analysis, as it allows for the communication of complex information in a clear and accessible manner. Here are key points about data visualization:  Types of Visualizations:  Charts and Graphs: Bar charts, line charts, pie charts, scatter plots, and area charts.  Maps: Geographic information system (GIS) maps and choropleth maps.  Tables: Tabular representations of data.  Dashboards: Combined visualizations providing an overview of key metrics.  Infographics: Visual representations that convey information and data through graphics.  Descriptive Statistics: Compute basic descriptive statistics (mean, median, standard deviation, etc.) to summarize numerical variables
  • 59.
  • 61. The Importance of EDA 1. Data Cleaning: EDA involves examining the information for errors, lacking values, and inconsistencies. It includes techniques including records imputation, managing missing statistics, and figuring out and getting rid of outliers. 2. Descriptive Statistics: EDA utilizes precise records to recognize the important tendency, variability, and distribution of variables. Measures like suggest, median, mode, preferred deviation, range, and percentiles are usually used.
  • 62.  Data Visualization: EDA employs visual techniques to represent the statistics graphically. Visualizations consisting of histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying styles, trends, and relationships within the facts.  4. Feature Engineering: EDA allows for the exploration of various variables and their adjustments to create new functions or derive meaningful insights. Feature engineering can contain scaling, normalization, binning, encoding express variables, and creating interplay or derived variables.
  • 63.  Correlation and Relationships: EDA allows discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and pass-tabulations offer insights into the power and direction of relationships between variables.  6. Data Segmentation: EDA can contain dividing the information into significant segments based totally on sure standards or traits. This segmentation allows advantage insights into unique subgroups inside the information and might cause extra focused analysis.
  • 64.  Hypothesis Generation: EDA aids in generating hypotheses or studies questions based totally on the preliminary exploration of the data. It facilitates form the inspiration for in addition evaluation and model building.  8. Data Quality Assessment: EDA permits for assessing the nice and reliability of the information. It involves checking for records integrity, consistency, and accuracy to make certain the information is suitable for analysis.
  • 65. Data  Data refers to facts, information, or values that are collected, stored, and analyzed for various purposes  Data denotes a collection of objects and their attributes  Attribute(feature, variable, or field) is a property or characteristic of an object
  • 66.
  • 67. Types of Attributes Nominal: Qualitative variables that do not have a natural order, e.g. Hair color, Religion, Residence zipcode of a student. Ordinal: Qualitative variables that have a natural order, e.g. Grades, Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is excellent), Street numbers in New York City. Interval: Measurements where the difference between two values is meaningful, e.g. Calendar dates, Temperature in Celsius or Fahrenheit. Ratio: Measurements where both difference and ratio are meaningful, e.g. Temperature in Kelvin, Length, Counts.
  • 68. Discrete and Continuous Attributes Discrete Attribute A variable or attribute is discrete if it can take a finite or a countably infinite set of values. A discrete variable is often represented as an integer-valued variable. A binary variable is a special case where the attribute can assume only two values, usually represented by 0 and 1. Examples of a discrete variable are the number of birds in a flock; the number of heads realized when a coin is flipped 10 times, etc. Continuous Attribute A variable or attribute is continuous if it can take any value in a given range; possibly the range being infinite. Examples of continuous variables are weights and heights of birds, temperature of a day, etc.
  • 69. Numerical summarization  Numerical summarization involves calculating various statistical measures to capture different aspects of the dataset. Numerical data are usually summarized and presented by distribution, measures of central tendency and dispersion. For normally distributed data, arithmetic mean and standard deviation are used.  Numerical summarization, in the context of data analysis and statistics, refers to the process of using quantitative measures to describe and summarize the main characteristics of a dataset.  The goal is to distill complex information into key numerical indicators that provide insights into the distribution, central tendency, spread, and relationships within the data.
  • 70. Measures of Location They are single numbers representing a set of observations. Measures of location also include measures of central tendency. Measures of central tendency can also be taken as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode, and the Quartiles. Mean The arithmetic average of all the observations. The mean equals the sum of all observations divided by the sample size Median The middle-most value of the ranked set of observations so that half the observations are greater than the median and the other half is less. Median is a robust measure of central tendency
  • 71. Mode The most frequently occurring value in the data set. This makes more sense when attributes are not continuous Quartiles Division points which split data into three parts after rank-ordering them. Division points are called Q1 (the first quartile), Q2 (the second quartile or median), and Q3 (the third quartile).
  • 72. Measures of Spread Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measure of dispersion is the Variance, the Standard Deviation, the Interquartile Range and Range. Variance Measures how far a set of data (numbers) are spread out from their mean (average) value. It is defined as the average of the squared differences between the mean and the individual data values.
  • 73. Standard Deviation Is the square root of the variance. It is defined as the average distance between the mean and the individual data values Interquartile range (IQR) is the difference between Q3 and Q1. IQR contains the middle 50% of data Range is the difference between the maximum and minimum values in the sample
  • 74. Measures of Skewness In addition to the measures of location and dispersion, the arrangement of data or the shape of the data distribution is also of considerable interest. The most 'well-behaved' distribution is a symmetric distribution where the mean and the median are coincident. The symmetry is lost if there exists a tail in either direction. Skewness is a measurement of the distortion of symmetrical distribution or asymmetry in a data set. Skewness is demonstrated on a bell curve when data points are not distributed symmetrically to the left and right sides of the median on a bell curve. Skewness measures whether or not a distribution has a single long tail. Skewness is measured as:
  • 75. The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.
  • 76.
  • 77.
  • 78. In a positively skewed distribution, the mean is greater than the median as the data is more towards the lower side and the mean average of all the values.
  • 79. Measures of Correlation Correlation describes the degree of the linear relationship between two attributes, X and Y. With X taking the values x(1), … , x(n) and Y taking the values y(1), … , y(n), the sample correlation coefficient is defined as: The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
  • 80. In the figure below a set of representative plots are shown for various values of the population correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values, the relation is a perfectly straight line. As the value of ρ approaches 0, the elliptical shape becomes round and then it moves again towards an elliptical shape with the principal axis in the opposite direction.
• 81. [Figure: scatter plots for values of the population correlation coefficient ρ ranging from −1 to +1]
  • 90.  Proximity measures are mainly mathematical techniques that calculate the similarity/dissimilarity of data points
• 97. Manhattan distance  The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.  It is most useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name refers to the intuition behind the measure: the shortest path a taxicab would take between city blocks (coordinates on the grid).
• 100. Minkowski distance The Minkowski distance is a generalization of both the Manhattan and Euclidean distances. It is named after the German mathematician Hermann Minkowski.
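In its usual form, the Minkowski distance between two points x = (x1, …, xn) and y = (y1, …, yn) is:

$$ D(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p} $$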
• 101. Here, n is the number of dimensions (in this case, 2 for a two-dimensional space), and p is a parameter that determines the order of the distance. When p=1, it becomes the Manhattan distance, and when p=2, it becomes the Euclidean distance. The Minkowski distance is versatile and can be used in various applications, including clustering, classification, and regression. The choice of p depends on the nature of the data and the specific requirements of the problem at hand.
  • 103. Let's consider an example of calculating the Minkowski distance between two points in a two-dimensional space. In this example, we'll calculate the distance between points P(1,2) and Q(4,6) using the Minkowski distance formula.
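Substituting into the formula with p = 2:

$$ D(P, Q) = \left( \lvert 1-4 \rvert^{2} + \lvert 2-6 \rvert^{2} \right)^{1/2} = \sqrt{9 + 16} = \sqrt{25} = 5 $$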
  • 104. So, the Minkowski distance between points P(1,2) and Q(4,6) with p=2 is 5, which is equivalent to the Euclidean distance between these two points. If you were to use a different value of p, you would get a different Minkowski distance. For example, if p=1, it would be the Manhattan distance: D(P,Q)=∣1−4∣+∣2−6∣=3+4=7 It's important to note that the Minkowski distance generalizes to different distance metrics depending on the value of p, and choosing the appropriate p depends on the specific requirements of the problem.
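The same calculations can be checked in R with the base dist() function; a minimal sketch, using the point coordinates from the example above:

p <- c(1, 2)
q <- c(4, 6)
m <- rbind(p, q)                        # one point per row
dist(m, method = "manhattan")           # p = 1: |1-4| + |2-6| = 7
dist(m, method = "euclidean")           # p = 2: 5
dist(m, method = "minkowski", p = 3)    # any other order p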
  • 105. Mahalanobis Distance The Mahalanobis distance is a measure of the distance between a point and a distribution, taking into account the correlation between variables in the distribution. It is a generalized form of the Euclidean distance and is particularly useful when dealing with multivariate data. It scales each variable by its standard deviation and adjusts for the correlation between variables. This makes it particularly useful in situations where variables are correlated, and it allows for a more accurate measure of distance compared to the Euclidean distance.
  • 106. This distance metric is particularly useful in statistics and pattern recognition, helping to identify outliers or measure dissimilarity between observations in a multivariate dataset. The Mahalanobis distance accounts for the correlation between variables in the distribution making it applicable when variables are interrelated.
• 107. The most common use for the Mahalanobis distance is to find multivariate outliers, which indicate unusual combinations of two or more variables. For example, it’s fairly common to find a 6′ tall woman weighing 185 lbs, but it’s rare to find a 4′ tall woman who weighs that much. The formula for the Mahalanobis distance between a point x and a distribution with mean μ and covariance matrix Σ is given below.
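Written out, the distance is:

$$ D_{M}(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)} $$

where Σ⁻¹ is the inverse of the covariance matrix.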
  • 108. Example with specific values for the mean vector, covariance matrix, and a data point. Assume: Mean vector μ = [65 inches, 150 pounds] Covariance matrix S = [[25, 10], [10, 36]] Data point x = [70 inches, 160 pounds] 1. Subtract the mean vector from the data point: x−μ=[70−65,160−150]=[5,10]
• 110. 2. Compute the quadratic form (x − μ)ᵀ S⁻¹ (x − μ). For these values S⁻¹ = (1/800) [[36, −10], [−10, 25]], so the quadratic form equals 3. 3. Take the square root of this result to get the Mahalanobis distance: √3 ≈ 1.73.
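A minimal R sketch of this worked example; the base stats function mahalanobis() returns the squared distance, so the square root is taken at the end:

x  <- c(70, 160)                            # data point (height, weight)
mu <- c(65, 150)                            # mean vector
S  <- matrix(c(25, 10, 10, 36), nrow = 2)   # covariance matrix
d2 <- mahalanobis(x, center = mu, cov = S)  # squared Mahalanobis distance: 3
sqrt(d2)                                    # Mahalanobis distance: about 1.73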
  • 111. Tools for Displaying Single Variables Histograms Histograms are the most common graphical tool to represent continuous data. It is used to visualize the frequency or probability distribution of a single continuous variable. On the horizontal axis, the range of the sample is plotted. On the vertical axis are plotted the frequencies or relative frequencies of each class. The class width has an impact on the shape of the histogram. Use Cases: Histograms are commonly used in statistical analysis, quality control, and research to understand the distribution of data. They help identify patterns, trends, and potential outliers within a dataset. Histograms are useful for exploring and communicating the central tendencies and variability of a continuous variable.
  • 113. Components of a Histogram: Bins or Intervals: The entire range of the variable is divided into intervals or bins. Each bin represents a range of values. Frequency or Count: The height of each bar in the histogram corresponds to the frequency or count of observations falling within the respective bin. Creating a Histogram: Data Collection: Collect data on the variable of interest, ensuring it is a continuous variable. Data Binning: Divide the range of the data into intervals or bins. The number and width of the bins can impact the appearance and interpretation of the histogram.
  • 114. Counting Observations: Determine how many observations fall into each bin by counting the occurrences of data points within the specified intervals. Plotting: For each bin, draw a rectangle or bar whose height corresponds to the frequency of observations in that bin. The bars are typically adjacent to each other. Interpreting a Histogram: Shape: The shape of the histogram provides insights into the distribution of the data. Common shapes include normal (bell-shaped), skewed (positively or negatively), and uniform. Central Tendency: The central tendency of the data, including measures like mean or median, can be identified based on the position of the central peak or the center of mass of the histogram. Spread: The spread or dispersion of the data can be observed by looking at the width of the histogram. A wider spread indicates higher variability. Outliers: Outliers, or values significantly different from the bulk of the data, may be visible as individual bars separated from the main distribution.
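A minimal ggplot2 sketch of these steps, using the built-in faithful data set (Old Faithful eruption durations) purely as an illustration:

library(ggplot2)
ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25, fill = "grey70", colour = "black") +
  labs(title = "Histogram of eruption durations",
       x = "Eruption duration (minutes)", y = "Frequency")

Changing binwidth changes the number of bins and therefore the apparent shape of the distribution, as noted above.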
  • 115. OUTLIER An outlier is an observation or data point that significantly differs from the rest of the data in a dataset. In other words, it is a data point that lies an abnormal distance away from other values in a random sample from a population. Outliers can occur in various types of data, and their presence may indicate errors in data collection, measurement variability, or important information about the underlying distribution of the data
• 116. Boxplot A boxplot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a dataset. It provides a summary of key statistical measures, including the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Median (Q2 / 50th percentile): the middle value of the dataset. First quartile (Q1 / 25th percentile): the middle number between the smallest value (not the whisker "minimum") and the median of the dataset. Third quartile (Q3 / 75th percentile): the middle value between the median and the highest value (not the whisker "maximum") of the dataset. Interquartile range (IQR): the spread from the 25th to the 75th percentile, Q3 − Q1. Whiskers: lines extending from the box to the most extreme data points within 1.5 × IQR of the quartiles. Outliers: points plotted individually beyond the whiskers. Whisker "maximum": Q3 + 1.5 × IQR. Whisker "minimum": Q1 − 1.5 × IQR.
  • 117. They are useful for identifying the spread and central tendency of a variable.
  • 118. Outliers, which are data points that fall significantly outside the overall pattern of the data, are sometimes shown as individual points beyond the whiskers. The boxplot of the Wage distribution clearly identifies many outliers. Boxplots are useful for identifying the central tendency, spread, and skewness of a dataset, as well as for detecting potential outliers. They are particularly effective for comparing the distribution of different groups or variables.
  • 119. Example: Find the maximum, minimum, median, first quartile, third quartile for the given data set: 23, 42, 12, 10, 15, 14, 9. Solution: Given: 23, 42, 12, 10, 15, 14, 9. Arrange the given dataset in ascending order. 9, 10, 12, 14, 15, 23, 42 Hence, Minimum = 9 Maximum = 42 Median = 14 First Quartile = 10 (Middle value of 9, 10, 12 is 10) Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
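This hand calculation can be reproduced in R; note that R's default quantile algorithm interpolates and may report slightly different quartiles, so type = 1 is used here because it matches the simple "median of each half" values in this particular example:

x <- c(23, 42, 12, 10, 15, 14, 9)
sort(x)                                 # 9 10 12 14 15 23 42
median(x)                               # 14
quantile(x, c(0.25, 0.75), type = 1)    # 10 and 23
boxplot(x, horizontal = TRUE)           # draws the box-and-whisker plot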
  • 120. When it comes to representing data graphically, box plots and histograms are two popular choices. Both are used to summarize and display data, but they are quite different in terms of their approach and the information they convey.
  • 124. 1. A histogram is a graphical representation of data that displays the frequency of numerical data in different intervals or bins. The bars in a histogram represent the number of observations falling into each bin. A Box Plot, also called a box and whisker plot, is a way of displaying the distribution of a dataset using the five-number summary: the minimum value, the first quartile, the median, the third quartile, and the maximum value. 2. A histogram is a bar graph, where the height of each bar represents the frequency or count of data points falling within a certain range. In contrast, a box plot is a schematic that shows the range, median, quartiles, and outliers of a dataset.
  • 125. 3. Histograms are commonly used to display continuous data, such as weight, height, and temperature, and discrete data, such as counts and scores. Box plots are more suitable for displaying the spread and central tendency of continuous data and comparing it across different categories or groups. 4. In terms of the information conveyed, a histogram provides an overview of the distribution of the data and the frequency of the observations. It can show whether the data is normally distributed, skewed to the left or right, or bimodal. A box plot provides more detailed information, showing not only the central tendency and the spread of the data but also the outliers and the skewness of the distribution.
  • 126. 5. A histogram is interpreted by observing the shape of the distribution, such as whether it is symmetric or skewed, and the position of the center of the distribution, which is represented by the peak of the histogram. On the other hand, a box plot is interpreted by analyzing the position of the whiskers, the length of the box, and the presence of outliers. A box plot also provides information on the quartiles, which indicate the spread of the data, and the median, which represents the central tendency of the data.
  • 127. Bar Charts: Bar charts are effective for displaying the frequency or proportion of categorical data. Each bar represents a category, and the height of the bar corresponds to the frequency or proportion of observations in that category. Pie Charts: Pie charts are suitable for displaying the proportions of different categories within a dataset. Each slice represents a category, and the size of the slice corresponds to the proportion of the whole.
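A minimal base-R sketch of both chart types, using a small made-up frequency table (the category names and counts are purely illustrative):

counts <- c(Online = 45, Retail = 30, Phone = 25)   # hypothetical category counts
barplot(counts, ylab = "Frequency", main = "Orders by channel")
pie(counts, main = "Share of orders by channel")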
  • 128. Excel or Google Sheets: For quick and simple visualizations, spreadsheet tools like Microsoft Excel or Google Sheets can be effective. They offer various chart types and are easy to use for basic data visualization tasks. R with ggplot2: If you are comfortable with the R programming language, ggplot2 is a powerful and flexible data visualization package that can handle single-variable visualizations and more complex plots.
  • 129. Plotly: Plotly is a versatile graphing library that supports interactive plots. It can be used with Python, R, and Julia. It's particularly useful if you want to create interactive visualizations for web applications. Tableau Public: If you prefer a more graphical and user-friendly interface, Tableau Public is a powerful data visualization tool. It allows you to create interactive dashboards and share them online. Tableau can handle single-variable visualizations and much more.
• 130. Matplotlib: This is a widely used 2D plotting library for Python. It can be used to create various types of plots, including histograms, bar charts, line plots, and more. If you're working with Python and have a single variable to visualize, Matplotlib is a good choice. Seaborn: Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics. It comes with several built-in themes and color palettes to make your visualizations more appealing.
  • 131. Tools for Displaying Relationship Between Two Variables Scatter Plots Scatter plots are a basic but effective way to visualize the relationship between two continuous numerical variables. It shows the direction and strength of association between two variables. If points generally follow a linear pattern from the bottom-left to the top-right (positive correlation) or vice versa (negative correlation), there is an indication of a relationship. Outliers: Outliers, or data points that deviate significantly from the overall pattern, can be easily identified in a scatter plot. Matplotlib and Seaborn are Python libraries that can be used to create scatter plots easily. Tools: Excel, Google Sheets, Python (Matplotlib, Seaborn), R (ggplot2).
• 138.
library(ISLR)                                       # provides the Wage data set
with(Wage, plot(age, wage, pch = 19, cex = 0.6))    # scatter plot of wage against age
title(main = "Relationship between Age and Wage")
It is clear from the scatterplot that Wage does not seem to depend on Age very strongly.
• 139. Contour plot A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. For example, you can use a contour plot to visualize the height of a surface in two or three dimensions. This is useful when a continuous attribute is measured on a spatial grid. Contour plots partition the plane into regions of similar values. The contour lines that form the boundaries of these regions connect points with equal values. In spatial statistics, contour plots have many applications.
  • 140.  Contour plots join points of equal probability. Within the contour lines concentration of bivariate distribution is the same. One may think of the contour lines as slices of a bivariate density, sliced horizontally. Contour plots are concentric; if they are perfect circles then the random variables are independent. The more oval- shaped they are, the farther they are from independence.
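A minimal base-R sketch that draws the contours of an independent bivariate standard normal density; because the two variables are independent with equal variances, the contours come out as the circles described above:

x <- seq(-3, 3, length.out = 60)
y <- seq(-3, 3, length.out = 60)
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))   # density values on a grid
contour(x, y, z, main = "Contours of a bivariate normal density")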
• 142. Tools for Displaying More Than Two Variables Scatter Plot matrix  A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate relationships between combinations of variables. Each scatter plot in the matrix visualizes the relationship between a pair of variables, allowing many relationships to be explored in one chart.  A scatter plot matrix is a nonspatial tool that can be used to visualize the relationships among up to five numeric variables.  Scatter plot matrices are a good way to determine whether linear correlations exist between multiple variables.
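A minimal R sketch using the built-in iris data set; pairs() draws one scatter plot for every pair of the four numeric measurements:

pairs(iris[, 1:4],
      col = iris$Species,   # colour the points by species
      main = "Scatter plot matrix of the iris measurements")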
  • 146. Correlation Matrices: Description: Displaying a correlation matrix helps to quantify and visualize the correlation between two variables. A correlation matrix is a table that displays the correlation coefficients between multiple variables. Each cell in the matrix represents the correlation between two variables, and the matrix allows for a comprehensive view of the relationships among all variable pairs. Correlation coefficients quantify the strength and direction of a linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation. Tools: Python (pandas, seaborn), R (corrplot).
  • 147. Symmetric Matrix: The correlation matrix is symmetric, meaning the correlation between variable A and B is the same as the correlation between B and A. This is because the correlation coefficient measures the relationship without considering the order of the variables. Diagonal Elements: The diagonal elements of the matrix (where the variable is correlated with itself) always have a correlation coefficient of 1. This is because a variable is perfectly correlated with itself. Range of Correlation Coefficients: Correlation coefficients can range from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation. The closer the coefficient is to ±1, the stronger the correlation.
  • 148. Interpretation of Correlation Values: Positive correlation coefficients indicate a positive linear relationship (as one variable increases, the other tends to increase), while negative coefficients indicate a negative linear relationship (as one variable increases, the other tends to decrease). Uses of Correlation Matrices: Correlation matrices are widely used in statistics, finance, and data analysis to explore relationships between variables. They help identify patterns, dependencies, and potential multicollinearity in the dataset.
  • 149. Visualization: Correlation matrices can be visualized using heatmaps. Each cell in the heatmap is colored based on the magnitude and direction of the correlation coefficient, providing a quick and intuitive way to interpret the relationships. Calculation: The correlation coefficient (usually Pearson correlation coefficient) between two variables X and Y is calculated as the covariance of X and Y divided by the product of their standard deviations. In mathematical terms, it is represented as Corr(X, Y) = Cov(X, Y) / (σ_X * σ_Y). Multivariate Analysis: Correlation matrices are essential in multivariate analysis, helping researchers and analysts understand how multiple variables are related to each other.
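A minimal R sketch computing a correlation matrix for a few of the built-in mtcars variables (the choice of columns is illustrative):

cm <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])   # pairwise Pearson correlations
round(cm, 2)                                        # symmetric matrix with 1s on the diagonal
# corrplot::corrplot(cm)   # optional graphical display, if the corrplot package is installed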
  • 150. Heatmaps: A heatmap is a graphical representation of data that uses a system of color coding to represent different values. Heatmaps are used in various forms of analytics but are most commonly used to show user behavior on specific web pages or webpage templates. Heatmaps can be used to show where users have clicked on a page, how far they have scrolled down a page, or used to display the results of eye-tracking tests. Tools: Python (matplotlib, seaborn), R (ggplot2).
  • 151. They are particularly useful for displaying large sets of data and identifying patterns, trends, or areas of interest. Heat maps are commonly employed in various fields, including statistics, data analysis, biology, finance, and geography. The intensity of color in a heat map corresponds to the magnitude of the values being represented. Users can quickly interpret the visual patterns and identify areas of interest or outliers. While heat maps are powerful tools, they have limitations. Misinterpretation can occur if the color scale or range is not chosen appropriately. Additionally, the effectiveness of a heat map depends on the quality and relevance of the underlying data.
  • 152. Data Collection: Heatmaps are commonly used with large sets of data, such as matrices or tables, where each cell contains a value. The data points could represent various metrics, such as website clicks, gene expression levels, financial indicators, or geographic information. Color Mapping: A color gradient is chosen to represent the range of values in the data. For example, a spectrum from cool colors (e.g., blue) to warm colors (e.g., red) is often used. The color scale is divided into intervals, with each interval corresponding to a specific range of values.
  • 153. Intensity Mapping: The intensity of the color in each cell or data point represents the magnitude of the underlying value. Higher values are typically associated with more intense or warmer colors, while lower values correspond to cooler colors. Visualization: The data is then mapped onto a visual grid or surface, with each cell colored according to its corresponding value. Users can observe patterns and variations in color across the grid, making it easy to identify areas of high or low concentration. Interactivity (Optional): Some heatmaps are interactive, allowing users to explore the data further. This may involve adjusting color scales, zooming in on specific regions, or applying filters to focus on particular aspects of the data.
  • 154. Applications: Heatmaps find applications in various fields, including website analytics, biology, finance, geography, and more. They help users make informed decisions by quickly highlighting areas of interest or significance. Customization: Users often have the flexibility to customize heatmaps based on their preferences. This may include choosing color schemes, adjusting scale ranges, or applying specific algorithms for data normalization.
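As a sketch of these steps, the correlation matrix from the earlier example can be displayed as a ggplot2 heat map; as.table()/as.data.frame() reshapes the matrix into the long format that geom_tile() expects:

library(ggplot2)
cm   <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])
long <- as.data.frame(as.table(cm))                  # columns Var1, Var2, Freq (the coefficient)
ggplot(long, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1), name = "r") +   # diverging colour scale centred at 0
  labs(title = "Correlation heat map", x = NULL, y = NULL)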
• 155. There are many different types of heatmaps: Click heatmaps: These heatmaps show where users click on a webpage. Identify the integral elements on a page and see how users interact with different features. Scroll heatmaps: See how far users scroll down a webpage with this type of heatmap. See which parts of a page are most engaging and how users find the information they are looking for. Mouse movement heatmaps: These heatmaps show the path of a user's mouse as they move the cursor around a webpage. Know where users are looking and how they interact with different elements on the page. Eye tracking heatmaps: This heatmap shows the path of a user's eye movements as they look at a webpage. Understand where users are paying attention and how they process different elements on the page. Conversion heatmaps: Get a view of all the steps your users take to complete a desired action, such as making a purchase, clicking on a call to action (CTA), or signing up for a newsletter. Use this information to identify bottlenecks in the conversion process and guide users to take the desired action. https://www.hotjar.com/heatmaps/
• 156. R Scripts: R is an open-source programming language and it is available on widely used platforms, e.g. Windows, Linux, and Mac. R was created by statisticians for statistics, specifically for working with data. It is a language for statistical computing and data visualization used widely by business analysts, data analysts, data scientists, and scientists. Python is a general-purpose programming language, while R is a statistical programming language. Python is more versatile and can be used for a wider range of tasks, such as web development, data manipulation, and machine learning. If you're passionate about the statistical calculation and data visualization portions of data analysis, R could be a good fit for you. If you're interested in becoming a data scientist and working with big data, artificial intelligence, and deep learning algorithms, Python would be the better fit.
• 158. R Script An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R, together with comments. The script can be saved and used later to re-execute the saved commands, and it can be edited so that you can execute a modified version of the commands. A class is a blueprint used to create an object; it defines the object's member variables (attributes).
• 159. R Library  RStudio is a must-know tool for everyone who works with the R programming language. It is used in data analysis to import, access, transform, explore, plot, and model data, and in machine learning to make predictions from data.
• 161. Graphics packages in R • graphics : a base R package, which means it is loaded every time we open R • ggplot2 : a user-contributed package from RStudio, so you must install it the first time you use it. It is a standalone package but also comes bundled with the tidyverse package • lattice : a user-contributed package. It provides the ability to display multivariate relationships and improves on the base-R graphics. This package supports the creation of trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables. • R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents
• 162. The ggplot2 Package  The ggplot2 package is an elegant, easy, and versatile general graphics package in R. It implements the grammar of graphics concept. The advantage of this concept is that it speeds up learning graphics and makes it easier to create complex graphics.  To work with ggplot2, remember that your R code must at least  start with ggplot()  identify which data to plot, for example data = Your Data  state the variables to plot, for example aes(x = Variable on x-axis, y = Variable on y-axis) for a bivariate plot  choose the type of graph, for example geom_histogram() for a histogram and geom_point() for a scatter plot. A short sketch follows.
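A minimal ggplot2 sketch of this template, using the built-in mtcars data set purely as an illustration:

library(ggplot2)

# Histogram of a single variable (miles per gallon)
ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# Scatter plot of two variables (weight vs. miles per gallon)
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()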