Data Mining
by Shaoli Lu
What is data mining?
• Data mining is the process of analyzing data to
find hidden patterns using automated statistical
methods, algorithms, and models
• Use data as the “brain”: patterns are learned from
the data rather than hand-coded as rules
• Predictive analytics is a subset of data mining
• Ad hoc queries and OLAP are not suited to the
task
Statistical Models
• Decision trees
• Clustering
• Naïve Bayes
• Neural networks
• Logistic regression
• Time series
• Associations
Business Cases
• Recommendation generation
• Anomaly detection
• Churn analysis
• Risk management
• Customer segmentation
• Targeted ads
• Forecasting
• Data exploration
Common Approaches
• Classification
• Clustering
• Association
• Regression
• Forecasting
• Sequence Analysis
• Deviation Analysis
Classification
• Classification is the most common data mining task. Business problems such
as churn analysis, risk management, and targeted advertising usually involve
classification
• Classification is a supervised machine learning task
• Classification is the act of assigning a category to each case
• Each case contains a set of attributes, one of which is the class attribute
• The task requires finding a model that describes the class attribute as a
function of input attributes
• Typical classification algorithms include decision trees, neural networks, and
Naïve Bayes
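As a minimal sketch of the idea above (a model that describes the class attribute as a function of the input attributes), the following implements a small categorical Naïve Bayes classifier with Laplace smoothing. The churn data, the attribute names, and the helper names train_naive_bayes and classify are hypothetical and chosen only for illustration; this is not the implementation used by any particular product.

```python
from collections import Counter, defaultdict

def train_naive_bayes(cases, class_attr):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(case[class_attr] for case in cases)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for case in cases:
        label = case[class_attr]
        for attr, value in case.items():
            if attr != class_attr:
                value_counts[(label, attr)][value] += 1
                values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def classify(model, inputs):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attr, value in inputs.items():
            # Laplace-smoothed estimate of P(value | class)
            numerator = value_counts[(label, attr)][value] + 1
            denominator = count + len(values_seen[attr])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy cases: "churned" is the class attribute.
cases = [
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "monthly", "support_calls": "few", "churned": "no"},
    {"contract": "annual", "support_calls": "few", "churned": "no"},
    {"contract": "annual", "support_calls": "many", "churned": "no"},
    {"contract": "annual", "support_calls": "few", "churned": "no"},
]
model = train_naive_bayes(cases, "churned")
prediction = classify(model, {"contract": "monthly", "support_calls": "many"})
print(prediction)
```

On this toy data, a monthly-contract customer with many support calls is assigned to the "yes" class, because both attribute values are more frequent among the churned cases.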
Clustering
• Clustering is also called segmentation. It is used
to identify natural groupings of cases based on a
set of attributes
• Cases within the same group have more or less
similar attribute values
• Clustering is an unsupervised machine learning task.
There is no single attribute used to guide the
training process, so all input attributes are
treated equally
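The grouping described above can be sketched with k-means, one common clustering technique (the slides do not name a specific one, so this choice is an assumption). Every attribute contributes equally to the distance, which mirrors the point that no single attribute guides training. The customer data and the starting centroids are hypothetical; in practice the attributes would usually be scaled first.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # New centroid = mean of assigned points (keep the old one if a cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by (age, monthly spend).
customers = [(22, 30), (25, 35), (24, 28), (51, 120), (55, 130), (49, 125)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[-1]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

With these inputs the six customers separate into a younger, low-spend group and an older, high-spend group of three cases each.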
Association
• Association is also called market basket analysis
• A common use of association is to identify
frequent sets of items and rules for the purpose
of cross-selling
• The association task has two goals: to find those
items that appear together frequently, and from
that, to determine rules about the association
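The two goals above can be sketched as two small functions: one counts item pairs that appear together often enough, and one turns those pairs into rules with a confidence score. This is a simplified illustration limited to pairs, not a full frequent-itemset algorithm; the baskets, thresholds, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def rules(baskets, pairs, min_confidence):
    """Derive rules 'lhs -> rhs' with confidence = count(lhs and rhs) / count(lhs)."""
    item_counts = Counter(item for basket in baskets for item in set(basket))
    found = []
    for (a, b), n in pairs.items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / item_counts[lhs]
            if confidence >= min_confidence:
                found.append((lhs, rhs, confidence))
    return found

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["beer", "chips"],
    ["bread", "butter", "chips"],
]
pairs = frequent_pairs(baskets, min_support=3)
cross_sell = rules(baskets, pairs, min_confidence=0.9)
print(pairs)
print(cross_sell)
```

Here bread and butter are the only pair bought together three times, and the only rule that clears the confidence threshold is "butter implies bread", since every basket with butter also contains bread.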
Regression
• The regression task is similar to classification, except
that instead of looking for patterns that describe a
class, the goal is to find patterns to determine a
numerical value
• The most popular techniques used for regression are
linear regression and logistic regression (the latter
predicts the probability of a class). SQL Server
supports regression trees (part of the Microsoft
Decision Trees algorithm) and neural networks
• These algorithms support categorical inputs as well
as numerical inputs
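As a minimal sketch of finding a pattern that determines a numerical value, the following fits a straight line to one numeric input by ordinary least squares. The data points and the function name fit_line are hypothetical; regression trees and neural networks, mentioned above, work differently and are not shown.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (input) and sales (numeric target).
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(spend, sales)
predicted = slope * 5 + intercept  # predicted sales for a spend of 5
print(slope, intercept, predicted)
```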
Forecasting
• As input, forecasting takes sequences of numbers
representing a series of values through time, and
then it predicts future values of those series
using a variety of machine-learning and statistical
techniques that deal with seasonality, trend,
and noise in the data
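One very simple way to handle seasonality and trend together is a seasonal naive forecast with a drift term: repeat the value from one season ago and add the average season-over-season change. This is an illustrative stand-in, not the technique any specific product uses; the quarterly figures and the function name seasonal_forecast are hypothetical.

```python
def seasonal_forecast(series, season_length, steps):
    """Repeat the last season, shifted by the average season-over-season change."""
    # Average change between each value and the value one season earlier.
    diffs = [series[i] - series[i - season_length]
             for i in range(season_length, len(series))]
    drift = sum(diffs) / len(diffs)
    history = list(series)
    for _ in range(steps):
        history.append(history[-season_length] + drift)
    return history[len(series):]

# Hypothetical quarterly sales for two years (season length 4).
quarterly_sales = [10, 20, 30, 40, 12, 22, 32, 42]
forecast = seasonal_forecast(quarterly_sales, season_length=4, steps=4)
print(forecast)
```

Each quarter grew by 2 from the first year to the second, so the sketch projects the third year as the second year plus 2 per quarter.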
Sequence Analysis
• Sequence analysis is used to find patterns in a
series of events called a sequence
• Sequence and time-series data are similar
in that they contain adjacent observations that
are order-dependent. The difference is that
where a time series contains numerical data, a
sequence contains discrete states
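Because a sequence consists of discrete states, one simple way to find patterns in it is to count how often each state follows another (a first-order Markov view). The sketch below assumes this approach purely for illustration; the click paths and function names are hypothetical.

```python
from collections import Counter, defaultdict

def transition_counts(sequences):
    """Count how often each state is immediately followed by each other state."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts

def most_likely_next(counts, state):
    """Return the state that most often follows `state`, or None if unseen."""
    followers = counts.get(state)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical web click paths; each page is a discrete state.
click_paths = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "product"],
    ["search", "product", "cart"],
]
counts = transition_counts(click_paths)
print(most_likely_next(counts, "home"))
print(most_likely_next(counts, "product"))
```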
Deviation Analysis
• Deviation analysis is used to find rare cases that
behave very differently from the norm
• Widely used, for example in fraud detection
• There is no standard technique for deviation
analysis; decision trees, clustering, or neural
network algorithms are usually applied to this task
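To make "rare cases that behave very differently from the norm" concrete, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. This is a deliberately simpler technique than the decision tree, clustering, or neural network approaches named above; the transaction amounts, the threshold of 2.0, and the function name find_deviations are assumptions for illustration.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
suspicious = find_deviations(amounts)
print(suspicious)
```

A single extreme value inflates both the mean and the standard deviation, which is one reason more robust techniques are often preferred on real data.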
Demo
• Demo #1: Data Mining By Example – Building a
Predictive Model Using Microsoft Decision Trees
