Data mining

What is data mining?
• Data mining is the process of analyzing data to
find hidden patterns using automatic statistical
methodologies/algorithms/models
• Let the data act as the “brain”: models are
learned from data rather than coded by hand
• Predictive analytics is a subset of data mining
• Ad hoc queries and OLAP are not suited to the
task

Statistical Models
• Decision trees
• Clustering
• Naïve Bayes
• Neural networks
• Logistic regression
• Time series
• Associations

Business Cases
• Recommendation generation
• Anomaly detection
• Churn analysis
• Risk management
• Customer segmentation
• Targeted ads
• Forecasting
• Data exploration

Common Approaches
• Classification
• Clustering
• Association
• Regression
• Forecasting
• Sequence Analysis
• Deviation Analysis

Classification
• Classification is the most common data mining task. Business problems such
as churn analysis, risk management, and targeted advertising usually involve
classification
• Classification is a supervised machine learning task
• Classification is the act of assigning a category to each case
• Each case contains a set of attributes, one of which is the class attribute
• The task requires finding a model that describes the class attribute as a
function of input attributes
• Typical classification algorithms include decision trees, neural networks, and
Naïve Bayes
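
As a rough illustration of "a model that describes the class attribute as a function of input attributes", here is a minimal Naïve Bayes sketch in plain Python. The churn cases, attribute names, and function names are hypothetical and not taken from the deck; add-one smoothing is an assumption made to avoid zero probabilities.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(cases, class_attr):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for case in cases:
        label = case[class_attr]
        class_counts[label] += 1
        for attr, value in case.items():
            if attr == class_attr:
                continue
            value_counts[(attr, label)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def predict(model, inputs):
    """Return the class with the highest log-probability for the inputs."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for attr, value in inputs.items():
            count = value_counts[(attr, label)][value]
            # Add-one (Laplace) smoothing, an assumption of this sketch
            score += math.log((count + 1) / (n + len(values_seen[attr])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training cases; "churned" is the class attribute
cases = [
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "monthly", "support_calls": "few", "churned": "no"},
    {"contract": "yearly", "support_calls": "few", "churned": "no"},
    {"contract": "yearly", "support_calls": "many", "churned": "no"},
    {"contract": "yearly", "support_calls": "few", "churned": "no"},
]
model = train_naive_bayes(cases, "churned")
print(predict(model, {"contract": "monthly", "support_calls": "many"}))  # → yes
```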

Clustering
• Clustering is also called segmentation. It is used
to identify natural groupings of cases based on a
set of attributes
• Cases within the same group have more or less
similar attribute values
• Clustering is an unsupervised machine learning task.
There is no single attribute used to guide the
training process, so all input attributes are
treated equally
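
One way to picture "natural groupings" is k-means, sketched below in plain Python. K-means is chosen here only for illustration and is not named in the deck; the points and the starting centroids are made up. Every coordinate contributes equally to the distance, which mirrors the idea that all input attributes are treated equally.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical cases with two numeric attributes each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters[0])  # the three cases near (1, 1.5)
print(clusters[1])  # the three cases near (8.5, 8.5)
```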

Association
• Association is also called market basket analysis
• A common use of association is to identify
frequent sets of items and rules for the purpose
of cross-selling
• The association task has two goals: to find those
items that appear together frequently, and from
that, to determine rules about the association
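
The two goals can be sketched as two small steps: count item pairs that appear together often (support), then derive rules from them (confidence). This is a simplified pair-only sketch in plain Python, not a full frequent-itemset algorithm; the baskets, thresholds, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_items_and_pairs(baskets):
    """Count how many baskets contain each item and each item pair."""
    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

def association_rules(baskets, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for frequent pairs."""
    item_counts, pair_counts = count_items_and_pairs(baskets)
    n = len(baskets)
    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n          # goal 1: items that co-occur often
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / item_counts[lhs]  # goal 2: rule strength
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical market baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "jam"],
]
for lhs, rhs, support, confidence in association_rules(baskets, 0.5, 0.7):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets only the pair bread and butter clears the 0.5 support threshold, so the rules that come out are bread -> butter and butter -> bread.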

Regression
• The regression task is similar to classification, except
that instead of looking for patterns that describe a
class, the goal is to find patterns to determine a
numerical value
• The most popular techniques used for regression are
linear regression and logistic regression (the latter
predicts the probability of a class rather than an
arbitrary number). SQL Server supports regression
trees (part of the Microsoft Decision Trees algorithm)
and neural networks
• These algorithms support categorical inputs as well
as numerical inputs
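
To make "find patterns to determine a numerical value" concrete, here is a minimal least-squares line fit with a single numeric input, in plain Python. The data and function name are made up for illustration; this is ordinary linear regression, not the SQL Server regression tree implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)          # → 2.0 1.0
print(slope * 5 + intercept)     # predicted value for x = 5 → 11.0
```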

Forecasting
• Forecasting takes as input sequences of numbers
indicating a series of values through time, and
then predicts future values of those series
using a variety of machine-learning and statistical
techniques that deal with seasonality, trend,
and noise in the data
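
A deliberately simple sketch of handling seasonality and trend: repeat the last season and shift it by the average season-over-season change. This baseline is an assumption chosen for illustration, not the technique used by any particular product; the series and function name are hypothetical.

```python
def seasonal_forecast(series, season_length, horizon):
    """Forecast `horizon` future values from the last two full seasons."""
    if len(series) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")
    last = series[-season_length:]
    previous = series[-2 * season_length:-season_length]
    # Average change between matching positions of the two seasons (trend)
    trend = sum(l - p for l, p in zip(last, previous)) / season_length
    forecasts = []
    for h in range(horizon):
        seasons_ahead = h // season_length + 1
        forecasts.append(last[h % season_length] + seasons_ahead * trend)
    return forecasts

# Hypothetical quarterly values: a repeating pattern that grows by 2 per year
history = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_forecast(history, season_length=4, horizon=4))
# → [14.0, 24.0, 34.0, 44.0]
```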

Sequence Analysis
• Sequence analysis is used to find patterns in a
series of events called a sequence
• Sequence and time-series data are similar
in that they contain adjacent observations that
are order-dependent. The difference is that
where a time series contains numerical data, a
sequence contains discrete states
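
Because a sequence is made of discrete states, a simple way to find patterns is to count which state tends to follow which. The sketch below builds first-order transition counts in plain Python; the page-visit sequences and function names are hypothetical, and the first-order (one step back) assumption is a simplification.

```python
from collections import Counter, defaultdict

def transition_counts(sequences):
    """Count how often each state is immediately followed by each other state."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts

def most_likely_next(counts, state):
    """Return (next_state, probability), or None if the state was never left."""
    if state not in counts:
        return None
    following, n = counts[state].most_common(1)[0]
    return following, n / sum(counts[state].values())

# Hypothetical web-visit sequences
visits = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "search"],
]
counts = transition_counts(visits)
print(most_likely_next(counts, "product"))  # "cart", seen in 2 of 3 cases
```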

Deviation Analysis
• Deviation analysis is used to find rare cases that
behave very differently from the norm
• Widely used, for example in fraud detection
• There is no standard technique for deviation
analysis; decision trees, clustering, or neural
network algorithms are usually applied to this task
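
Since no standard technique exists, the sketch below uses a simple statistical baseline instead of the algorithms listed above: it flags values that lie far from the median, measured in units of the median absolute deviation (MAD). The 0.6745 scaling and the 3.5 cut-off are conventional choices assumed here, and the amounts are made up.

```python
import statistics

def find_deviations(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [v for v in values
            if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical transaction amounts with one rare case
amounts = [52, 48, 50, 51, 49, 47, 53, 500]
print(find_deviations(amounts))  # → [500]
```

The median and MAD are used rather than the mean and standard deviation because a single extreme case inflates the standard deviation and can hide itself in a small sample.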

Demo
• Demo #1: Data Mining By Example – Building
Predictive Model Using Microsoft Decision Trees
