Data Mining Presentation.pptx

Group 1
INTRODUCTION
TO DATA MINING

• WHAT IS DATA MINING?
INTRODUCTION TO DATA MINING
Data mining, or data science, is about analyzing
information to understand the past and predict the
future.
Combines:
STATISTICS
MACHINE LEARNING
ARTIFICIAL INTELLIGENCE
DATABASES

STATISTICS
The science of collecting, classifying, summarizing,
organizing, analyzing, and interpreting data.
MACHINE LEARNING
The study of computer algorithms that learn in order
to improve automatically through experience.
ARTIFICIAL INTELLIGENCE
The study of computer algorithms dealing with the
simulation of intelligent behaviors in order to perform
those activities that are normally thought to
require intelligence.
DATABASES
The science and technology of collecting, storing and
managing data so users can retrieve, add, update or
remove such data.

Businesses, as well as fields like engineering and medicine,
find value in using data mining to extract useful
knowledge from their accumulated data. This helps them
attract more clients, increase sales, and make more
profits.

• HISTORY OF DATA MINING
In the 1960s, statisticians used the terms “Data
Fishing” or “Data Dredging” to refer to what they
considered the bad practice of analyzing data without
a prior hypothesis. The term “Data Mining” appeared around
1990 in the database community.

The current evolution of data mining functions and
products is the result of years of influence from many
disciplines, including databases, information retrieval,
statistics, algorithms and machine learning.

Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
• Each discipline has grown a theoretical component.
Theoretical models often motivate experiments and
generalize our understanding.
1950s-1990s, computational science
• Over the last 50 years, most disciplines have grown a
third, computational branch (e.g. empirical, theoretical,
and computational ecology, or physics, or linguistics).
• Computational science traditionally meant simulation.
It grew out of our inability to find closed-form
solutions for complex mathematical models.
1990-now, data science
• The flood of data from new scientific instruments and
simulations
• The ability to economically store and manage petabytes
of data online
• The Internet and computing Grid that make all these
archives universally accessible
• Scientific information management, acquisition,
organization, query, and visualization tasks scale almost
linearly with data volumes. Data mining is a major new
challenge!

Evolution of Database Technology
1960s:
• Data collection, database creation, IMS and network
DBMS
1970s:
• Relational data model, relational DBMS implementation
1980s:
• RDBMS, advanced data models (extended-relational, OO,
deductive, etc.)
• Application-oriented DBMS (spatial, scientific,
engineering, etc.)
2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global
information systems

Data mining involves many different algorithms to accomplish different
tasks. All of these algorithms attempt to fit a model to the data. The
algorithms examine the data and determine the model that is closest to
the characteristics of the data being examined. Data mining algorithms can
be characterized as consisting of three parts:
Model: The purpose of the algorithm is to fit a model to the data.
Preference: Some criteria must be used to fit one model over another.
Search: All algorithms require some technique to search the data.
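The three parts can be sketched in a few lines of Python. This is only an illustration, not a real data mining algorithm: the data points, the list of candidate slopes, and the function names are all made up for the example.

```python
# Model: a straight line through the origin, y = slope * x.
def model(slope, x):
    return slope * x

# Preference: prefer the model with the smaller sum of squared errors.
def squared_error(slope, data):
    return sum((y - model(slope, x)) ** 2 for x, y in data)

# Search: try every candidate slope and keep the preferred one.
def fit(data, candidates):
    return min(candidates, key=lambda slope: squared_error(slope, data))

data = [(1, 2.1), (2, 3.9), (3, 6.2)]        # hypothetical (x, y) observations
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical search space
best_slope = fit(data, candidates)
print(best_slope)  # 2.0
```

Real algorithms differ mainly in how rich the model is, which preference criterion they use, and how cleverly they search.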

Example 1
Credit card companies must determine whether to authorize credit
card purchases. Suppose that based on historical information
about purchases, each purchase is placed into one of four classes: (1)
authorize, (2) ask for further identification before authorization, (3) do
not authorize, and (4) do not authorize but contact police. The data
mining functions here are twofold. First, the historical data must be
examined to determine how the data fit into the four classes. Then the
problem is to apply this model to each new purchase. Although the
second part indeed may be stated as a simple database query, the first
part cannot be.

Example
In Example 1, we're dealing with credit card data and want to create
categories (classes) based on past purchase outcomes. We examine
factors like purchase amount and customer history to define these
categories. For instance, we might approve small purchases for loyal
customers but reject transactions on reported stolen cards. The
challenge is to figure out the right criteria for each category by looking
at the data patterns.
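Once criteria have been found, applying them to a new purchase is simple. The sketch below shows that second step only. The rules and the amount thresholds (100 and 1000) are invented for illustration; in practice they would come out of the historical data.

```python
def classify_purchase(amount, loyal_customer, card_reported_stolen):
    """Place a purchase into one of the four classes (hypothetical rules)."""
    if card_reported_stolen:
        return "do not authorize but contact police"
    if loyal_customer and amount <= 100:       # assumed threshold
        return "authorize"
    if amount <= 1000:                         # assumed threshold
        return "ask for further identification"
    return "do not authorize"

print(classify_purchase(40, loyal_customer=True, card_reported_stolen=False))
print(classify_purchase(40, loyal_customer=True, card_reported_stolen=True))
```

The hard part, finding good rules from past purchases, is the data mining step and is not shown here.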

As seen in Figure 3, the model
that is created can be either
predictive or descriptive in
nature. Under each model type,
the figure shows some of the
most common data mining tasks
that use that type of model.
[Figure 3: predictive model and descriptive model]

Predictive model
A predictive model makes a prediction about values of data
using known results found from different data. Predictive
modeling may be made based on the use of other historical
data.
Descriptive model
A descriptive model identifies patterns or
relationships in data. Unlike the
predictive model, a descriptive model
serves as a way to explore the properties
of the data examined, not to predict new
properties. Clustering, summarization,
association rules, and sequence discovery
are usually viewed as descriptive in
nature.

• BASIC DATA MINING TASKS
Classification
Regression
Time Series Analysis
Prediction
Clustering
Summarization
Association Rules
Sequence Discovery

Classification
Classification maps data into predefined groups or classes. It is often referred to
as supervised learning because the classes are determined before examining the
data. Two examples of classification applications are determining whether to
make a bank loan and identifying credit risks. Classification algorithms require
that the classes be defined based on data attribute values. They often describe
these classes by looking at the characteristics of data already known to belong to
the classes. Pattern recognition is a type of classification where an input pattern
is classified into one of several classes based on its similarity to these predefined
classes.
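A minimal sketch of similarity-based classification is shown below, using a nearest-centroid rule. This is one simple technique chosen for illustration, not the only way to classify. The two features (income and debt, in thousands) and the training records are hypothetical.

```python
from math import dist
from statistics import mean

def class_centroids(labelled):
    """Average the feature vectors of records already known to belong to each class."""
    by_class = {}
    for features, label in labelled:
        by_class.setdefault(label, []).append(features)
    return {label: tuple(mean(column) for column in zip(*rows))
            for label, rows in by_class.items()}

def classify(features, centroids):
    """Assign the class whose centroid is most similar (closest) to the input."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Hypothetical training data: (income, debt) -> predefined class
training = [((60, 5), "low risk"), ((80, 10), "low risk"),
            ((20, 30), "high risk"), ((30, 40), "high risk")]
centroids = class_centroids(training)
print(classify((70, 8), centroids))   # low risk
```

The classes are fixed before any data is examined, which is what makes this supervised learning.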

Prediction
Many real-world data mining applications can be seen as predicting future data
states based on past and current data. Prediction can be viewed as a type of
classification. (Note: This is a data mining task that is different from the prediction
model, although the prediction task is a type of prediction model.) The difference
is that prediction estimates a future state rather than a current state. Here we
are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flood forecasting, speech
recognition, machine learning, and pattern recognition. Although future values
may be predicted using time series analysis or regression techniques, other
approaches may be used as well. Example 5 illustrates the process.

Regression
Regression analysis is a statistical method used to model and examine the
relationship between a dependent variable (the outcome or response) and
one or more independent variables (predictors or features). The goal of
regression analysis is to understand how changes in the independent
variables are associated with changes in the dependent variable, allowing
for prediction or estimation of the dependent variable based on the values
of the independent variables.
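As a minimal sketch, the simplest case (one independent variable, straight-line fit by ordinary least squares) can be written as follows. The data values are made up and lie exactly on y = 2x + 1 so the result is easy to check.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4]          # independent variable (hypothetical)
ys = [3, 5, 7, 9]          # dependent variable (hypothetical)
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # estimate y for a new x = 5
print(slope, intercept, prediction)
```

With several independent variables the idea is the same, but the fit is usually done with a numerical library.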

Clustering
Clustering is similar to classification except that the groups are not predefined,
but rather defined by the data alone. Clustering is alternatively referred to as
unsupervised learning or segmentation. It can be thought of as partitioning or
segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data
on predefined attributes. The most similar data are grouped into clusters.
Example 6 provides a simple clustering example. Since the clusters are not
predefined, a domain expert is often required to interpret the meaning of the
created clusters.
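A minimal sketch of clustering on a single attribute is given below, using the k-means idea (assign each value to the nearest center, then move each center to the mean of its group). The ages and the two starting centers are assumptions chosen for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Very small k-means for one numeric attribute."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [18, 20, 22, 45, 47, 50]                  # hypothetical customer ages
centers, clusters = k_means_1d(ages, centers=[18, 50])
print(clusters)   # [[18, 20, 22], [45, 47, 50]]
```

The algorithm only returns the groups. Deciding that they mean, say, "students" and "mid-career customers" is left to the domain expert.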

Time Series Analysis
With time series analysis, the value of an attribute is examined
as it varies over time. The values usually are obtained at evenly
spaced time points (daily, weekly, hourly, etc.).
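One simple way to examine how a value varies over time is a moving average, which smooths out day-to-day fluctuation. The sketch below assumes evenly spaced daily values; the sales figures are invented.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

daily_sales = [10, 12, 11, 15, 14, 18, 17]   # hypothetical daily values
smoothed = moving_average(daily_sales, window=3)
print(smoothed)
```

A rising smoothed series suggests an upward trend even when individual days go up and down.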

Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about the
database. This may be accomplished by actually retrieving portions of the
data. Alternatively, summary type information (such as the mean of some
numeric attribute) can be derived from the data. The summarization
succinctly characterizes the contents of the database.
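A minimal sketch of deriving summary information from a numeric attribute, using Python's standard `statistics` module. The salary values are hypothetical.

```python
from statistics import mean, median

salaries = [30, 32, 35, 40, 120]   # hypothetical attribute values (in thousands)

summary = {
    "count": len(salaries),
    "mean": mean(salaries),
    "median": median(salaries),
    "min": min(salaries),
    "max": max(salaries),
}
print(summary)
```

A handful of numbers like these characterizes the attribute without retrieving every record.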

Association Rules
Link analysis, alternatively referred to as affinity analysis or
association, refers to the data mining task of uncovering
relationships among data. The best example of this type of
application is to determine association rules. An association
rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community
to identify items that are frequently purchased together.
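Association rules are commonly scored with two measures that are not named on the slide: support (how often the items appear together) and confidence (how often the rule holds when its left side appears). The sketch below computes both for a rule "bread → milk" over a made-up set of transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical market baskets
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

print(support({"bread", "milk"}, transactions))        # 0.5
print(confidence({"bread"}, {"milk"}, transactions))
```

Rules with high support and high confidence are the ones a retailer would typically act on.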

Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are based on a time sequence of actions. These
patterns are similar to associations in that data (or events) are found to be related,
but the relationship is based on time. Unlike a market basket analysis, which
requires the items to be purchased at the same time, in sequence discovery the
items are purchased over time in some order. Example 9 illustrates the discovery
of some simple patterns. A similar type of discovery can be seen in the sequence
within which data are purchased. For example, most people who purchase CD
players may be found to purchase CDs within one week. As we will see, temporal
association rules really fall into this category.
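The CD player example can be checked with a short sketch: among customers who bought the first item, what fraction bought the second item within a given number of days? The purchase log, the customer names, and the function name are hypothetical.

```python
def followed_within(purchases, first, then, max_days):
    """Fraction of buyers of `first` who bought `then` within `max_days` afterwards."""
    by_customer = {}
    for customer, day, item in purchases:
        by_customer.setdefault(customer, []).append((day, item))

    bought_first = followed = 0
    for events in by_customer.values():
        first_days = [day for day, item in events if item == first]
        if not first_days:
            continue                      # this customer never bought `first`
        bought_first += 1
        start = min(first_days)
        if any(item == then and 0 <= day - start <= max_days
               for day, item in events):
            followed += 1
    return followed / bought_first if bought_first else 0.0

# Hypothetical log of (customer, day number, item)
purchases = [("ann", 1, "CD player"), ("ann", 3, "CD"),
             ("bob", 2, "CD player"), ("bob", 20, "CD"),
             ("cat", 5, "CD"),
             ("dan", 4, "CD player"), ("dan", 9, "CD")]

rate = followed_within(purchases, "CD player", "CD", max_days=7)
print(rate)   # 2 of the 3 CD player buyers
```

Unlike the market basket case, the order and timing of the purchases matter here.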

• THE DATA MINING PROCESS
Figure 5 illustrates the phases, and
the iterative nature, of a data
mining project. The process flow
shows that a data mining project
does not stop when a particular
solution is deployed. The results of
data mining trigger new business
questions, which in turn can be
used to develop more focused
models.

Problem Definition
This initial phase of a data mining project
focuses on understanding the project
objectives and requirements. Once you have
specified the project from a business
perspective, you can formulate it as a data
mining problem and develop a preliminary
implementation plan.

Data Gathering and Preparation
The data understanding phase involves data
collection and exploration. As you take a closer
look at the data, you can determine how well it
addresses the business problem. You might decide
to remove some of the data or add additional
data. This is also the time to identify data quality
problems and to scan for patterns in the data.

Model Building and Evaluation
In this phase, you select and apply various
modeling techniques and calibrate the parameters
to optimal values. If the algorithm requires data
transformations, you will need to step back to the
previous phase to implement them.

Knowledge Deployment
Knowledge deployment is the use of data mining
within a target environment. In the deployment
phase, insight and actionable information can be
derived from data.

• DATA MINING ISSUES
There are many important implementation
issues associated with data mining:
Human Interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application

Human Interaction
Since data mining problems are often not precisely
stated, interfaces may be needed with both domain
and technical experts. Technical experts are used to
formulate the queries and assist in interpreting the
results. Users are needed to identify training data
and desired results.

Overfitting
When a model is generated that is associated with a given database
state, it is desirable that the model also fit future database states. Overfitting
occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small
size of the training database. For example, a classification model for an employee
database may be developed to classify employees as short, medium, or tall. If the
training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in
the training database under five feet eight. In this case, many future employees
would be erroneously classified as short. Overfitting can arise under other
circumstances as well, even though the data are not changing.
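The employee height example can be reproduced with a small sketch. Heights are in inches (five feet eight is 68). The naive learning rule, the training records, and the future employees are all assumptions made up to mirror the example.

```python
def learn_short_cutoff(training):
    """Naive learner: anyone below the shortest non-short training example is 'short'."""
    return min(height for height, label in training if label != "short")

def classify(height, cutoff):
    return "short" if height < cutoff else "not short"

# Tiny training database: only one entry under 68 inches
training = [(62, "short"), (68, "medium"), (70, "medium"), (75, "tall")]
cutoff = learn_short_cutoff(training)          # 68

# Future employees with their true class
future = [(66, "medium"), (67, "medium"), (61, "short")]
misclassified = [height for height, truth in future
                 if (classify(height, cutoff) == "short") != (truth == "short")]
print(cutoff, misclassified)                   # 68 [66, 67]
```

The model fits the training database perfectly but mislabels the 66 and 67 inch employees, which is the overfitting the slide describes.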

Outliers
There are often many data entries that do not fit nicely into
the derived model. This becomes even more of an issue with
very large databases. If a model is developed that includes
these outliers, then the model may not behave well for data
that are not outliers.

Interpretation of results
Currently, data mining output may require
experts to correctly interpret the results, which
might otherwise be meaningless to the average
database user.

Visualization of results
To easily view and understand the output of
data mining algorithms, visualization of the
results is helpful.

Large datasets
The massive datasets associated with data mining
create problems when applying algorithms designed
for small datasets. The cost of many modeling applications grows
exponentially with the dataset size, so they are too
inefficient for larger datasets. Sampling and
parallelization are effective tools to attack this
scalability problem.

High dimensionality
A conventional database schema may be composed
of many different attributes. The problem here is that
not all attributes may be needed to solve a given data
mining problem.

Multimedia data
Most previous data mining algorithms are targeted to
traditional data types (numeric, character, text, etc.).
The use of multimedia data such as is found in GIS
databases complicates or invalidates many proposed
algorithms.

Missing data
During the preprocessing phase of KDD, missing data
may be replaced with estimates. This and other
approaches to handling missing data can lead to
invalid results in the data mining step.
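A minimal sketch of one common estimate, replacing missing values with the mean of the known values, is shown below. The ages are hypothetical and `None` marks a missing entry. The last two lines show one way this can distort later results: the filled-in data looks less spread out than the known data.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

ages = [20, None, 40, None, 60]        # hypothetical attribute with gaps
filled = impute_mean(ages)
print(filled)                          # [20, 40, 40, 40, 60]

known = [v for v in ages if v is not None]
print(pstdev(known), pstdev(filled))   # spread shrinks after imputation
```

Any mining step that depends on the spread of the attribute would see an artificially tight distribution.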

Irrelevant data
Some attributes in the database might not be of
interest to the data mining task being developed.

Noisy data
Some attribute values might be invalid or incorrect.
These values are often corrected before running data
mining applications.

Changing data
Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static
database. This requires that the algorithm be
completely rerun anytime the database changes.

Integration
The KDD process is not currently integrated into
normal data processing activities. KDD requests may
be treated as special, unusual, or one-time needs.
This makes them inefficient, ineffective, and not
general enough to be used on an ongoing basis.
Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.

Application
Determining the intended use for the information obtained
from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is
sometimes considered the more difficult part, not the
running of the algorithms themselves. Because the data are
of a type that has not previously been known, business
practices may have to be modified to determine
how to effectively use the information uncovered.

• PROBLEM DEFINITION
Problem definition means understanding the project objectives and
requirements from a domain perspective and then converting this knowledge
into a data science problem definition with a preliminary plan
designed to achieve the objectives. Data science projects are
often structured around the specific needs of an industry
sector (as shown below) or even tailored and built for a single
organization. A successful data science project starts from a
well-defined question or need.

• DATA PREPARATION
Data preparation is about constructing a dataset from one or
more data sources to be used for exploration and modeling. It
is a solid practice to start with an initial dataset to get familiar
with the data, to discover first insights into the data and have
a good understanding of any possible data quality issues.
Data preparation is often a time-consuming process and
heavily prone to errors.

Data
Data is information, typically the result of measurement
(numerical) or counting (categorical).

Data
Variables serve as placeholders for data. There are two types
of variables, numerical and categorical.

Data
A numerical or continuous variable is one that can accept any
value within a finite or infinite interval (e.g., height, weight,
temperature, blood glucose, ...). There are two types of
numerical data, interval and ratio. Data on an interval scale
can be added and subtracted but cannot be meaningfully
multiplied or divided because there is no true zero. For
example, we cannot say that one day is twice as hot as
another day. On the other hand, data on a ratio scale has true
zero and can be added, subtracted, multiplied or divided
(e.g., weight).
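The "twice as hot" point can be checked numerically. Celsius is an interval scale, while Kelvin has a true zero. The two temperatures below are made up for illustration.

```python
def celsius_to_kelvin(celsius):
    return celsius + 273.15

day_a_c, day_b_c = 10.0, 20.0          # hypothetical temperatures in Celsius

# Differences are meaningful on an interval scale: same answer in both units.
difference_c = day_b_c - day_a_c
difference_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are not: the "twice as hot" result disappears on a scale with a true zero.
ratio_c = day_b_c / day_a_c
ratio_k = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(difference_c, round(difference_k, 6))   # 10.0 10.0
print(ratio_c, round(ratio_k, 3))             # 2.0 1.035
```

The ratio depends on where zero was placed, so it says nothing about the days themselves.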

Data
A categorical or discrete variable is one that can accept two
or more values (categories). There are two types of
categorical data, nominal and ordinal. Nominal data does not
have an intrinsic ordering in the categories. For example,
"gender" with two categories, male and female. In contrast,
ordinal data does have an intrinsic ordering in the categories.
For example, "level of energy" with three ordered categories
(low, medium and high).

Dataset
A dataset is a collection of data, usually presented in
tabular form. Each column represents a particular variable,
and each row corresponds to a given member of the data.

Dataset
In predictive modeling, the predictors or attributes are the input
variables, and the target or class attribute is the output variable,
whose value is determined by the values of the predictors and the
function of the predictive model.

Database
A database collects, stores and manages information so users can retrieve,
add, update or remove such information. It presents information in tables
with rows and columns. A table is referred to as a relation in the sense
that it is a collection of objects of the same type (rows). Data in a table can
be related according to common keys or concepts, and the ability to
retrieve related data from related tables is the basis for the term relational
database. A Database Management System (DBMS) handles the way data
is stored, maintained, and retrieved. Most data science toolboxes connect
to databases through ODBC (Open Database Connectivity) or JDBC (Java
Database Connectivity) interfaces.

Database
SQL (Structured Query Language) is a
database computer language for
managing and manipulating data in
relational database management
systems (RDBMS). SQL Data
Definition Language (DDL) permits
database tables to be created,
altered or deleted. We can also
define indexes (keys), specify links
between tables, and impose
constraints between database tables.
CREATE TABLE : creates a new table
ALTER TABLE : alters a table
DROP TABLE : deletes a table
CREATE INDEX : creates an index
DROP INDEX : deletes an index
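The DDL statements above can be tried out with Python's built-in `sqlite3` module and an in-memory database. This is a sketch only: the table, column, and index names are hypothetical, and other database systems may differ in the details of their DDL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # temporary database, nothing is written to disk
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute("ALTER TABLE customers ADD COLUMN city TEXT")
cur.execute("CREATE INDEX idx_customers_city ON customers (city)")

# Inspect the table definition: the second field of each row is the column name.
columns = [row[1] for row in cur.execute("PRAGMA table_info(customers)")]
print(columns)                         # ['id', 'name', 'city']

cur.execute("DROP INDEX idx_customers_city")
cur.execute("DROP TABLE customers")
tables = cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)                          # []
conn.close()
```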

Database
SQL Data Manipulation Language
(DML) is a language which enables
users to access and manipulate data.
SELECT : retrieval of data from the
database
INSERT INTO : insertion of new data
into the database
UPDATE : modification of data in the
database
DELETE : deletion of data in the
database
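The four DML statements can be sketched the same way with Python's built-in `sqlite3` module. The `products` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE products (name TEXT, price REAL)")

# INSERT INTO: add new rows
cur.executemany("INSERT INTO products (name, price) VALUES (?, ?)",
                [("pen", 1.5), ("book", 12.0), ("lamp", 30.0)])
# UPDATE: modify an existing row
cur.execute("UPDATE products SET price = ? WHERE name = ?", (2.0, "pen"))
# DELETE: remove a row
cur.execute("DELETE FROM products WHERE name = ?", ("lamp",))
# SELECT: retrieve what is left
rows = cur.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)                            # [('book', 12.0), ('pen', 2.0)]
conn.close()
```

The `?` placeholders pass values separately from the SQL text, which avoids quoting mistakes.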

Database
ETL (Extraction, Transformation and Loading)
ETL extracts data from data sources
and loads it into data destinations
using a set of transformation
functions.
Data extraction provides the ability to extract
data from a variety of data sources, such as flat
files, relational databases, streaming data, XML
files, and ODBC/JDBC data sources.
Data transformation provides the ability to
cleanse, convert, aggregate, merge, and split
data.
Data loading provides the ability to load data into
destination databases via update, insert or delete
statements, or in bulk.
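The three ETL steps can be sketched end to end with the Python standard library. This is a toy pipeline under stated assumptions: the source is a small CSV text standing in for a flat file, the destination is an in-memory SQLite table, and all names and values are made up.

```python
import csv
import io
import sqlite3

# Stand-in for a flat-file source (note the untidy names).
raw_csv = "name,amount\n alice ,10\nBOB,5\nalice,7\n"

def extract(text):
    """Extraction: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: cleanse names, convert amounts to integers, aggregate per name."""
    totals = {}
    for row in rows:
        name = row["name"].strip().lower()
        totals[name] = totals.get(name, 0) + int(row["amount"])
    return sorted(totals.items())

def load(records, conn):
    """Loading: bulk insert into the destination table."""
    conn.execute("CREATE TABLE totals (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO totals VALUES (?, ?)", records)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), conn)
loaded = conn.execute("SELECT name, amount FROM totals ORDER BY name").fetchall()
print(loaded)                          # [('alice', 17), ('bob', 5)]
```

Dedicated ETL tools add scheduling, error handling, and many more source types, but the flow is the same.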
