Data Mining Steps Explained in 40 Characters

Data Mining Steps
> Problem Definition
Typical application areas:
· Market Analysis: customer profiling, identifying customer requirements,
cross-market analysis, target marketing, determining customer
purchasing patterns
· Corporate Analysis and Risk Management: finance planning and asset
evaluation, resource planning, competition
· Fraud Detection
· Customer Retention
· Production Control
· Science Exploration
> Data Preparation
Data preparation is about constructing a dataset from one or
more data sources to be used for exploration and modeling. It is
good practice to start with an initial dataset to get familiar
with the data, discover first insights, and gain a good
understanding of any possible data quality issues. The datasets
provided in these projects were obtained from kaggle.com.

· Variable selection and description
o Numerical – Ratio, Interval
o Categorical – Ordinal, Nominal
· Simplifying variables: from continuous to discrete
· Formatting the data
· Basic data integrity checks: missing data, outliers
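
As a minimal sketch of the integrity checks and the continuous-to-discrete step listed above, the following uses only the Python standard library. The `ages` column, the 1.5 * IQR outlier rule, and the age-group boundaries are illustrative assumptions, not part of the original material.

```python
import statistics

# Hypothetical raw column with one missing value (None) and one extreme value.
ages = [23, 31, None, 45, 38, 29, 120, 34]

# Integrity check 1: count missing values and keep the observed ones.
missing = sum(1 for v in ages if v is None)
present = [v for v in ages if v is not None]

# Integrity check 2: flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

# Simplifying a variable: bin continuous ages into discrete groups.
def age_group(age):
    """Map a numeric age to a hypothetical ordinal category."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

groups = [age_group(v) for v in present if v not in outliers]

print("missing:", missing)
print("outliers:", outliers)
print("groups:", groups)
```

Whether a flagged value is an error or a genuine extreme observation is a judgment call that depends on the dataset.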
> Data Exploration
Data Exploration is about describing the data by means of
statistical and visualization techniques.
· Data Visualization:
o Univariate analysis explores variables (attributes) one by one. Variables
can be either categorical or numerical.
Univariate Analysis - Categorical
Statistics | Visualization | Description
Count | Bar Chart | The number of values of the specified variable.
Count% | Pie Chart | The percentage of values of the specified variable.
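
The two categorical statistics above can be sketched in a few lines of Python. The `colors` data is hypothetical, and the text-based bars only stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical categorical variable.
colors = ["red", "blue", "red", "green", "red", "blue"]

# Count: number of occurrences of each value.
counts = Counter(colors)
total = sum(counts.values())

# Count%: share of each value as a percentage of all observations.
percentages = {value: 100 * n / total for value, n in counts.items()}

# Print a rough text "bar chart", most frequent value first.
for value, n in counts.most_common():
    print(f"{value:<6} {n:>3} {percentages[value]:5.1f}%  {'#' * n}")
```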
Univariate Analysis - Numerical
Statistics | Visualization | Equation | Description
Count | Histogram | N | The number of values (observations) of the variable.
Minimum | Box Plot | Min | The smallest value of the variable.
Maximum | Box Plot | Max | The largest value of the variable.
Mean | Box Plot | - | The sum of the values divided by the count.
Median | Box Plot | - | The middle value. An equal number of values lies below and above the median.
Mode | Histogram | - | The most frequent value. There can be more than one mode.
Quantile | Box Plot | - | A set of 'cut points' that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range | Box Plot | Max-Min | The difference between maximum and minimum.
Variance | Histogram | - | A measure of data dispersion.
Standard Deviation | Histogram | - | The square root of variance.
Coefficient of Deviation (Coefficient of Variation) | Histogram | - | The standard deviation divided by the mean.
Skewness | Histogram | - | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis | Histogram | - | A measure of whether the data are peaked or flat relative to a normal distribution.
Note: There are two types of numerical variables, interval and
ratio. An interval variable has values whose differences are
interpretable, but it does not have a true zero. A good example
is temperature in degrees Celsius. Data on an interval scale
can be added and subtracted but cannot be meaningfully
multiplied or divided. For example, we cannot say that one day
is twice as hot as another day. In contrast, a ratio variable has
values with a true zero and can be added, subtracted, multiplied
or divided (e.g., weight).
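
The numerical statistics in the table above can be computed with Python's standard `statistics` module, as in the sketch below. The `values` list is hypothetical. The sketch assumes population (not sample) formulas for variance, standard deviation, skewness and kurtosis, since the original does not say which convention it uses.

```python
import statistics

# Hypothetical numerical observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = statistics.fmean(values)

summary = {
    "count": n,
    "min": min(values),
    "max": max(values),
    "range": max(values) - min(values),
    "mean": mean,
    "median": statistics.median(values),
    "modes": statistics.multimode(values),          # may hold several values
    "quartiles": statistics.quantiles(values, n=4),  # three cut points
    "variance": statistics.pvariance(values),        # population variance (assumed)
    "std_dev": statistics.pstdev(values),            # square root of the variance
}
# Standard deviation divided by the mean.
summary["coef_of_variation"] = summary["std_dev"] / mean

# Skewness and excess kurtosis from the central moments (population form).
m2 = sum((x - mean) ** 2 for x in values) / n
m3 = sum((x - mean) ** 3 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n
summary["skewness"] = m3 / m2 ** 1.5
summary["excess_kurtosis"] = m4 / m2 ** 2 - 3  # 0 for a normal distribution

for name, value in summary.items():
    print(f"{name}: {value}")
```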
o Bivariate analysis is the simultaneous analysis of two variables (attributes). It
explores the concept of relationship between two variables,
whether there exists an association and the strength of this
association.
There are three types of bivariate analysis:
1. Numerical & Numerical: Scatter Plot, Linear Correlation …
2. Categorical & Categorical: Stacked Column Chart, Combination Chart, Chi-square Test
3. Numerical & Categorical: Line Chart with Error Bars, Combination Chart, Z-test and t-test
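
To make the three cases concrete, here is a rough sketch of one statistic for each: Pearson's linear correlation, the chi-square statistic for a contingency table, and a two-sample t statistic. All data is hypothetical, the helper names are made up for illustration, and the t statistic uses Welch's unequal-variance form as an assumption. Turning the statistics into p-values is left out.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Numerical & numerical: Pearson linear correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chi_square(table):
    """Categorical & categorical: chi-square statistic of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def t_statistic(a, b):
    """Numerical & categorical: Welch's t statistic for two groups."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# Hypothetical data for each case.
r = pearson_r([1, 2, 3, 4, 5], [52, 55, 61, 64, 70])
chi2 = chi_square([[30, 10], [20, 40]])
t = t_statistic([5, 6, 7], [1, 2, 3])

print(f"r = {r:.3f}, chi-square = {chi2:.3f}, t = {t:.3f}")
```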
> Modeling
· Predictive modeling is the process by which a model is created
to predict an outcome.
o If the outcome is categorical it is called classification, and if
the outcome is numerical it is called regression.
· Descriptive modeling, or clustering, is the assignment of
observations into clusters so that observations in the same
cluster are similar.
· Finally, association rules can find interesting associations
amongst observations.
Classification algorithms:
· Frequency Table: ZeroR, OneR, Naive Bayesian, Decision Tree
· Covariance Matrix: Linear Discriminant Analysis, Logistic Regression
· Similarity Functions: K Nearest Neighbors
· Others: Artificial Neural Network, Support Vector Machine
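
Two of the classifiers listed above are simple enough to sketch directly: ZeroR, which always predicts the majority class and serves as a baseline, and K Nearest Neighbors, which votes among the closest training points. The training data, the Euclidean distance, and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def zero_r(labels):
    """ZeroR: ignore all predictors and always predict the majority class."""
    return Counter(labels).most_common(1)[0][0]

def knn_classify(train_x, train_y, query, k=3):
    """K Nearest Neighbors: majority vote among the k closest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training set: two numeric predictors and a categorical outcome.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_y = ["low", "low", "low", "high", "high"]

baseline = zero_r(train_y)
prediction = knn_classify(train_x, train_y, (3.8, 4.0), k=3)

print("ZeroR baseline:", baseline)
print("KNN prediction:", prediction)
```

A model that cannot beat the ZeroR baseline on held-out data is adding little beyond the class frequencies.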
Regression algorithms:
· Frequency Table: Decision Tree
· Covariance Matrix: Multiple Linear Regression
· Similarity Functions: K Nearest Neighbors
· Others: Artificial Neural Network, Support Vector Machine
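
As a small illustration of regression, the sketch below fits a least-squares line with a single predictor. This is the one-variable special case of the multiple linear regression listed above, chosen here only to keep the example short. The data and function name are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical numerical predictor and numerical outcome.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for an unseen x value

print(f"y = {slope:.2f} * x + {intercept:.2f}; prediction at x=6: {predicted:.2f}")
```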
Clustering algorithms:
· Hierarchical: Agglomerative, Divisive
· Partitive: K Means, Self-Organizing Map
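
K Means, one of the partitive methods above, alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a bare-bones version under stated assumptions: the first k points are used as initial centroids (real implementations usually pick them more carefully), the iteration count is fixed, and the data is hypothetical.

```python
import math

def k_means(points, k, iterations=10):
    """Plain K Means with a simplistic, deterministic initialization."""
    centroids = list(points[:k])  # assumption: first k points seed the clusters
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, clusters = k_means(points, k=2)

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```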
> Evaluation
· Model evaluation helps to find the best model that represents our
data and to estimate how well the chosen model will work in the
future. Two common methods are Hold-Out and Cross-Validation.
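
The difference between the two methods is in how the data is split. A hold-out split sets aside one test portion once, while k-fold cross-validation rotates the test portion so every row is tested exactly once. The sketch below shows only the splitting logic. The 30% test fraction, the seed, and the function names are assumptions for illustration.

```python
import random

def hold_out_split(rows, test_fraction=0.3, seed=0):
    """Hold-Out: shuffle once, then reserve a fixed fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def k_fold_splits(rows, k=5):
    """Cross-Validation: each of the k folds serves as the test set once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

rows = list(range(10))  # stand-in for ten observations

train, test = hold_out_split(rows)
print("hold-out:", len(train), "train /", len(test), "test")

splits = list(k_fold_splits(rows, k=5))
for fold_train, fold_test in splits:
    print("fold test rows:", fold_test)
```

In practice a model would be fitted on each training part and scored on the matching test part, with the cross-validation scores averaged.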
> Deployment
The concept of deployment in predictive data mining refers to
the application of a model for prediction to new data.