Group 1
INTRODUCTION
TO DATA MINING
• WHAT IS DATA MINING?
INTRODUCTION TO DATA MINING
Data mining, or data science, is about analyzing
information to understand the past and predict the
future.
Combines:
STATISTICS
MACHINE LEARNING
ARTIFICIAL INTELLIGENCE
DATABASES
• WHAT IS DATA MINING?
INTRODUCTION TO DATA MINING
STATISTICS
The science of collecting,
classifying, summarizing,
organizing, analyzing, and
interpreting data.
ARTIFICIAL INTELLIGENCE
The study of computer
algorithms dealing with the
simulation of intelligent
behaviors in order to perform
those activities that are
normally thought to
require intelligence.
MACHINE LEARNING
The study of computer
algorithms that learn in order
to improve automatically
through experience.
DATABASES
The science and technology
of collecting, storing and
managing data so users can
retrieve, add, update or
remove such data.
• WHAT IS DATA MINING?
INTRODUCTION TO DATA MINING
Businesses, as well as fields like engineering and medicine,
find value in using data mining to extract useful
knowledge from their accumulated data. This helps them
attract more clients, increase sales, and make more
profits.
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
In the 1960s, statisticians used the terms “data fishing”
or “data dredging” to refer to what they considered the
bad practice of analyzing data without a prior hypothesis.
The term “data mining” appeared around 1990 in the
database community.
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
The current evolution of data mining functions and
products is the result of years of influence from many
disciplines, including databases, information retrieval,
statistics, algorithms and machine learning.
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
• Each discipline has grown a theoretical component.
Theoretical models often motivate experiments and
generalize our understanding.
1950s-1990s, computational science
• Over the last 50 years, most disciplines have grown a
third, computational branch (e.g., empirical, theoretical,
and computational ecology, or physics, or linguistics).
• Computational science traditionally meant simulation.
It grew out of our inability to find closed-form solutions
for complex mathematical models.
1990-now, data science
• The flood of data from new scientific instruments and
simulations
• The ability to economically store and manage petabytes
of data online
• The Internet and computing Grid that makes all these
archives universally accessible
• Scientific information management, acquisition,
organization, query, and visualization tasks scale almost
linearly with data volumes. Data mining is a major new
challenge!
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Evolution of Database Technology
1960s:
• Data collection, database creation, IMS and network
DBMS
1970s:
• Relational data model, relational DBMS implementation
1980s:
• RDBMS, advanced data models (extended-relational, OO,
deductive, etc.)
• Application-oriented DBMS (spatial, scientific,
engineering, etc.)
2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global
information systems
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Data mining involves many different algorithms to accomplish different
tasks. All of these algorithms attempt to fit a model to the data. The
algorithms examine the data and determine the model that is closest to
the characteristics of the data being examined. Data mining algorithms can
be characterized as consisting of three parts:
Model: The purpose of the algorithm is to fit a model to the data.
Preference: Some criteria must be used to prefer one model over another.
Search: All algorithms require some technique to search the data.
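The three parts can be made concrete with a minimal sketch. The toy data, the choice of lines through the origin as the model family, squared error as the preference criterion, and a grid of candidate slopes as the search are all assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical toy data: (x, y) pairs that roughly follow y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(slope, x):
    """Model: a family of candidate descriptions, here lines y = slope * x."""
    return slope * x


def preference(slope, points):
    """Preference: sum of squared errors; a lower value means a better fit."""
    return sum((y - model(slope, x)) ** 2 for x, y in points)


def search(points, candidates):
    """Search: evaluate every candidate slope and keep the preferred one."""
    return min(candidates, key=lambda s: preference(s, points))


# Candidate slopes 0.0, 0.1, ..., 5.0 (an arbitrary grid for this sketch).
candidates = [i / 10 for i in range(0, 51)]
best_slope = search(data, candidates)
print(best_slope)
```

Real algorithms differ mainly in how rich the model family is and how cleverly the search avoids trying every candidate.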
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Example
Credit card companies must determine whether to authorize credit
card purchases. Suppose that based on past historical information
about purchases, each purchase is placed into one of four classes: (1)
authorize, (2) ask for further identification before authorization, (3) do
not authorize, and (4) do not authorize but contact police. The data
mining functions here are twofold. First, the historical data must be
examined to determine how the data fit into the four classes. Then the
problem is to apply this model to each new purchase. Although the
second part indeed may be stated as a simple database query, the first
part cannot be.
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Example
In the credit card example, we are dealing with credit card data and want
to create categories (classes) based on past purchase outcomes. We examine
factors like purchase amount and customer history to define these
categories. For instance, we might approve small purchases for loyal
customers but reject transactions on reported stolen cards. The
challenge is to figure out the right criteria for each category by looking
at the data patterns.
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
As seen in Figure 3, the model
that is created can be either
predictive or descriptive in
nature. The figure shows, under
each model type, some of the
most common data mining tasks
that use that type of model.
[Figure 3: predictive model and descriptive model, with their data mining tasks]
• HISTORY OF DATA MINING
INTRODUCTION TO DATA MINING
Predictive model
A predictive model makes a prediction about values of data
using known results found from different data. Predictive
modeling may be made based on the use of other historical
data.
Descriptive model
A descriptive model identifies patterns or
relationships in data. Unlike the
predictive model, a descriptive model
serves as a way to explore the properties
of the data examined, not to predict new
properties. Clustering, summarization,
association rules, and sequence discovery
are usually viewed as descriptive in
nature.
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Classification
Prediction
Regression
Clustering
Time Series Analysis
Summarization
Association Rules
Sequence Discovery
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Classification
Classification maps data into predefined groups or classes. It is often referred to
as supervised learning because the classes are determined before examining the
data. Two examples of classification applications are determining whether to
make a bank loan and identifying credit risks. Classification algorithms require
that the classes be defined based on data attribute values. They often describe
these classes by looking at the characteristics of data already known to belong to
the classes. Pattern recognition is a type of classification where an input pattern
is classified into one of several classes based on its similarity to these predefined
classes.
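As a rough illustration of classifying an input by its similarity to data already known to belong to each class, here is a minimal nearest-neighbor sketch. The loan records, the two attributes, and the use of Euclidean distance are assumptions made for this example only.

```python
import math

# Hypothetical labelled records: (income, debt) in thousands -> known class.
training = [
    ((80.0, 5.0), "approve"),
    ((75.0, 10.0), "approve"),
    ((30.0, 25.0), "reject"),
    ((25.0, 30.0), "reject"),
]


def classify(sample, labelled):
    """Return the class of the labelled record closest to the sample."""
    nearest = min(labelled, key=lambda item: math.dist(sample, item[0]))
    return nearest[1]


print(classify((70.0, 8.0), training))
print(classify((28.0, 27.0), training))
```

The classes ("approve", "reject") are fixed before any new record is examined, which is what makes this supervised.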
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Prediction
Many real-world data mining applications can be seen as predicting future data
states based on past and current data. Prediction can be viewed as a type of
classification. (Note: This is a data mining task that is different from the prediction
model, although the prediction task is a type of prediction model.) The difference
is that prediction is predicting a future state rather than a current state. Here we
are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flooding, speech
recognition, machine learning, and pattern recognition. Although future values
may be predicted using time series analysis or regression techniques, other
approaches may be used as well. Example 5 illustrates the process.
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Regression
Regression analysis is a statistical method used to model and examine the
relationship between a dependent variable (the outcome or response) and
one or more independent variables (predictors or features). The goal of
regression analysis is to understand how changes in the independent
variables are associated with changes in the dependent variable, allowing
for prediction or estimation of the dependent variable based on the values
of the independent variables.
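A minimal sketch of this idea for a single predictor, using the ordinary least squares formulas for slope and intercept. The data points are hypothetical and chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: independent variable xs, dependent variable ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # estimate the response for x = 5
print(slope, intercept, predicted)
```

The fitted slope describes how the dependent variable changes per unit change in the predictor; the last line uses the fitted line for prediction.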
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Clustering
Clustering is similar to classification except that the groups are not predefined,
but rather defined by the data alone. Clustering is alternatively referred to as
unsupervised learning or segmentation. It can be thought of as partitioning or
segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data
on predefined attributes. The most similar data are grouped into clusters.
Example 6 provides a simple clustering example. Since the clusters are not
predefined, a domain expert is often required to interpret the meaning of the
created clusters.
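The slides do not name a particular clustering algorithm. As one possible illustration, the sketch below uses a simple one-dimensional k-means, with hypothetical customer ages and hand-picked starting centres so the run is repeatable.

```python
def k_means_1d(values, centres, iterations=10):
    """Group values around the nearest centre, then move each centre to its group mean."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            # Similarity here is simply the absolute difference.
            idx = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[idx].append(v)
        # An empty cluster keeps its previous centre.
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters


# Hypothetical customer ages; no class labels are supplied.
ages = [18, 21, 22, 45, 48, 52]
centres, clusters = k_means_1d(ages, centres=[20.0, 50.0])
print(clusters)
```

The algorithm only produces the groups; deciding that they mean something like "students" and "mid-career customers" is the interpretation step left to a domain expert.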
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Time Series Analysis
With time series analysis, the value of an attribute is examined
as it varies over time. The values usually are obtained at evenly
spaced time points (daily, weekly, hourly, etc.).
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about the
database. This may be accomplished by actually retrieving portions of the
data. Alternatively, summary type information (such as the mean of some
numeric attribute) can be derived from the data. The summarization
succinctly characterizes the contents of the database.
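A small sketch of deriving summary information, here the count and mean of a numeric attribute for each subset of the data. The employee records and the grouping attribute are hypothetical.

```python
from statistics import mean

# Hypothetical employee records: (department, salary).
records = [
    ("sales", 40000),
    ("sales", 50000),
    ("engineering", 70000),
    ("engineering", 90000),
]


def summarize(rows):
    """Map each department to a simple description: row count and mean salary."""
    groups = {}
    for department, salary in rows:
        groups.setdefault(department, []).append(salary)
    return {
        department: {"count": len(salaries), "mean": mean(salaries)}
        for department, salaries in groups.items()
    }


summary = summarize(records)
print(summary)
```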
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Association Rules
Link analysis, alternatively referred to as affinity analysis or
association, refers to the data mining task of uncovering
relationships among data. The best example of this type of
application is to determine association rules. An association
rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community
to identify items that are frequently purchased together.
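The slides do not define how the strength of an association is measured. A common choice, assumed here, is support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). The baskets below are hypothetical.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of both sides together divided by support of the antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


s = support({"bread", "butter"}, baskets)
c = confidence({"bread"}, {"butter"}, baskets)
print(s, c)
```

Here bread and butter appear together in two of four baskets, and two of the three baskets with bread also contain butter.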
INTRODUCTION TO DATA MINING
• BASIC DATA MINING TASK
Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are based on a time sequence of actions. These
patterns are similar to associations in that data (or events) are found to be related,
but the relationship is based on time. Unlike a market basket analysis, which
requires the items to be purchased at the same time, in sequence discovery the
items are purchased over time in some order. Example 9 illustrates the discovery
of some simple patterns. A similar type of discovery can be seen in the sequence
in which items are purchased. For example, most people who purchase CD
players may be found to purchase CDs within one week. As we will see, temporal
association rules really fall into this category.
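The CD player example can be sketched as a check over a time-stamped purchase log. The log, the customer names, and the helper `followed_within` are hypothetical and only illustrate the idea of a pattern that depends on order and elapsed time.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer, item, purchase date).
purchases = [
    ("ann", "cd player", date(2024, 1, 1)),
    ("ann", "cd", date(2024, 1, 4)),
    ("bob", "cd player", date(2024, 1, 2)),
    ("bob", "cd", date(2024, 2, 20)),
    ("cat", "cd player", date(2024, 1, 5)),
    ("cat", "cd", date(2024, 1, 9)),
]


def followed_within(log, first, second, window):
    """Fraction of buyers of `first` who bought `second` within `window` afterwards."""
    first_dates = {}
    for customer, item, day in log:
        if item == first:
            # Keep the first recorded purchase of `first` per customer.
            first_dates.setdefault(customer, day)
    matched = set()
    for customer, item, day in log:
        start = first_dates.get(customer)
        if item == second and start is not None:
            if timedelta(0) <= day - start <= window:
                matched.add(customer)
    return len(matched) / len(first_dates) if first_dates else 0.0


ratio = followed_within(purchases, "cd player", "cd", timedelta(days=7))
print(ratio)
```

Two of the three CD player buyers in this log bought a CD within a week; the third bought one much later and does not count toward the pattern.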
INTRODUCTION TO DATA MINING
• THE DATA MINING PROCESS
Figure 5 illustrates the phases, and
the iterative nature, of a data
mining project. The process flow
shows that a data mining project
does not stop when a particular
solution is deployed. The results of
data mining trigger new business
questions, which in turn can be
used to develop more focused
models.
INTRODUCTION TO DATA MINING
• THE DATA MINING PROCESS
Problem Definition
This initial phase of a data mining project
focuses on understanding the project
objectives and requirements. Once you have
specified the project from a business
perspective, you can formulate it as a data
mining problem and develop a preliminary
implementation plan.
INTRODUCTION TO DATA MINING
• THE DATA MINING PROCESS
Data Gathering and Preparation
The data understanding phase involves data
collection and exploration. As you take a closer
look at the data, you can determine how well it
addresses the business problem. You might decide
to remove some of the data or add additional
data. This is also the time to identify data quality
problems and to scan for patterns in the data.
INTRODUCTION TO DATA MINING
• THE DATA MINING PROCESS
Model Building and Evaluation
In this phase, you select and apply various
modeling techniques and calibrate the parameters
to optimal values. If the algorithm requires data
transformations, you will need to step back to the
previous phase to implement them.
INTRODUCTION TO DATA MINING
• THE DATA MINING PROCESS
Knowledge Deployment
Knowledge deployment is the use of data mining
within a target environment. In the deployment
phase, insight and actionable information can be
derived from data.
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
There are many important implementation
issues associated with data mining:
Human Interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Human Interaction
Since data mining problems are often not precisely
stated, interfaces may be needed with both domain
and technical experts. Technical experts are used to
formulate the queries and assist in interpreting the
results. Users are needed to identify training data
and desired results.
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Overfitting
When a model is generated that is associated with a given database
state, it is desirable that the model also fit future database states. Overfitting
occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small
size of the training database. For example, a classification model for an employee
database may be developed to classify employees as short, medium, or tall. If the
training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in
the training database under five feet eight. In this case, many future employees
would be erroneously classified as short. Overfitting can arise under other
circumstances as well, even though the data are not changing.
INTRODUCTION TO DATA MINING
Outliers
• DATA MINING ISSUES
There are often many data entries that do not fit nicely into
the derived model. This becomes even more of an issue with
very large databases. If a model is developed that includes
these outliers, then the model may not behave well for data
that are not outliers.
Interpretation of
results
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Currently, data mining output may require
experts to correctly interpret the results, which
might otherwise be meaningless to the average
database user.
Visualization of
results
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
To easily view and understand the output of
data mining algorithms, visualization of the
results is helpful.
Large datasets
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
The massive datasets associated with data mining
create problems when applying algorithms designed
for small datasets. The cost of many modeling applications
grows exponentially with the dataset size, making them too
inefficient for larger datasets. Sampling and
parallelization are effective tools to attack this
scalability problem.
High dimensionality
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
A conventional database schema may be composed
of many different attributes. The problem here is that
not all attributes may be needed to solve a given data
mining problem.
Multimedia data
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Most previous data mining algorithms are targeted to
traditional data types (numeric, character, text, etc.).
The use of multimedia data such as is found in GIS
databases complicates or invalidates many proposed
algorithms.
Missing data
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
During the preprocessing phase of KDD (knowledge discovery in
databases), missing data may be replaced with estimates. This and other
approaches to handling missing data can lead to
invalid results in the data mining step.
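One simple way to replace missing data with estimates is mean imputation, sketched below. The choice of the mean as the estimate and the use of `None` to mark a missing value are assumptions for this example; the slides do not prescribe a method.

```python
def impute_mean(values):
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    estimate = sum(known) / len(known)
    return [estimate if v is None else v for v in values]


# Hypothetical attribute with two missing values.
ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
print(filled)
```

Every filled-in value sits exactly at the mean, so the attribute looks less spread out than it really is. This is one way such estimates can distort the results of the mining step.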
Irrelevant data
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Some attributes in the database might not be of
interest to the data mining task being developed.
Noisy data
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Some attribute values might be invalid or incorrect.
These values are often corrected before running data
mining applications.
Changing data
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static
database. This requires that the algorithm be
completely rerun anytime the database changes.
Integration
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
The KDD process is not currently integrated into
normal data processing activities. KDD requests may
be treated as special, unusual, or onetime needs.
This makes them inefficient, ineffective, and not
general enough to be used on an ongoing basis.
Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.
Application
INTRODUCTION TO DATA MINING
• DATA MINING ISSUES
Determining the intended use for the information obtained
from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is
sometimes considered the more difficult part, not the
running of the algorithms themselves. Because the data are
of a type that has not previously been known, business
practices may have to be modified to determine
how to effectively use the information uncovered.
INTRODUCTION TO DATA MINING
• PROBLEM DEFINITION
Problem definition means understanding the project objectives and
requirements from a domain perspective and then converting this
knowledge into a data science problem definition with a preliminary
plan designed to achieve the objectives. Data science projects are
often structured around the specific needs of an industry
sector or even tailored and built for a single
organization. A successful data science project starts from a
well-defined question or need.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Data preparation is about constructing a dataset from one or
more data sources to be used for exploration and modeling. It
is a solid practice to start with an initial dataset to get familiar
with the data, to discover first insights into the data and have
a good understanding of any possible data quality issues.
Data preparation is often a time-consuming process and
highly prone to errors.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Data
Data is information, typically the result of measurement
(numerical) or counting (categorical).
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Data
Variables serve as placeholders for data. There are two types
of variables, numerical and categorical.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Data
A numerical or continuous variable is one that can accept any
value within a finite or infinite interval (e.g., height, weight,
temperature, blood glucose, ...). There are two types of
numerical data, interval and ratio. Data on an interval scale
can be added and subtracted but cannot be meaningfully
multiplied or divided because there is no true zero. For
example, we cannot say that one day is twice as hot as
another day. On the other hand, data on a ratio scale has true
zero and can be added, subtracted, multiplied or divided
(e.g., weight).
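A short worked example of why ratios are not meaningful on an interval scale. The temperatures and weights are hypothetical; the conversion formulas are the standard ones, and the pounds-per-kilogram factor is rounded.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


cool, warm = 10.0, 20.0  # two days, in degrees Celsius

# Interval scale: the "ratio" depends on where the scale puts its zero.
ratio_c = warm / cool
ratio_f = celsius_to_fahrenheit(warm) / celsius_to_fahrenheit(cool)
# Kelvin has a true zero, so this ratio is the physically meaningful one.
ratio_k = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

# Ratio scale: weight has a true zero, so the ratio survives a unit change.
LB_PER_KG = 2.20462  # approximate conversion factor
ratio_kg = 80.0 / 40.0
ratio_lb = (80.0 * LB_PER_KG) / (40.0 * LB_PER_KG)

print(ratio_c, ratio_f, ratio_k)
print(ratio_kg, ratio_lb)
```

The same two days give a ratio of 2 in Celsius but 1.36 in Fahrenheit, so "twice as hot" is an artifact of the unit. The weight ratio stays at 2 in either unit.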
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Data
A categorical or discrete variable is one that can accept two
or more values (categories). There are two types of
categorical data, nominal and ordinal. Nominal data does not
have an intrinsic ordering in the categories. For example,
"gender" with two categories, male and female. In contrast,
ordinal data does have an intrinsic ordering in the categories.
For example, "level of energy" with three orderly categories
(low, medium and high).
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Dataset
A dataset is a collection of data, usually presented in
tabular form. Each column represents a particular variable,
and each row corresponds to a given member of the data.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Dataset
In predictive modeling, predictors or attributes are the input
variables, and the target or class attribute is the output variable
whose value is determined by the values of the predictors and
the function of the predictive model.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Database
A database collects, stores and manages information so users can retrieve,
add, update or remove such information. It presents information in tables
with rows and columns. A table is referred to as a relation in the sense
that it is a collection of objects of the same type (rows). Data in a table can
be related according to common keys or concepts, and the ability to
retrieve related data from related tables is the basis for the term relational
database. A Database Management System (DBMS) handles the way data
is stored, maintained, and retrieved. Most data science toolboxes connect
to databases through ODBC (Open Database Connectivity) or JDBC (Java
Database Connectivity) interfaces.
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Database
SQL (Structured Query Language) is a database computer language for
managing and manipulating data in relational database management
systems (RDBMS). SQL Data Definition Language (DDL) permits
database tables to be created, altered or deleted. We can also
define indexes (keys), specify links between tables, and impose
constraints between database tables.
CREATE TABLE: creates a new table
ALTER TABLE: alters a table
DROP TABLE: deletes a table
CREATE INDEX: creates an index
DROP INDEX: deletes an index
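The DDL statements above can be tried against an in-memory SQLite database using Python's standard `sqlite3` module. This is only a sketch: the table, column, and index names are hypothetical, and other database systems accept slightly different DDL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
cur = conn.cursor()

# CREATE TABLE, with a key, a link between tables, and a constraint.
cur.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE purchases (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL CHECK (amount >= 0)
    )
""")

# ALTER TABLE: add a column to an existing table.
cur.execute("ALTER TABLE purchases ADD COLUMN purchased_on TEXT")

# CREATE INDEX and DROP INDEX.
cur.execute("CREATE INDEX idx_purchases_customer ON purchases(customer_id)")
cur.execute("DROP INDEX idx_purchases_customer")

# DROP TABLE: delete the table and its data.
cur.execute("DROP TABLE purchases")

# List the tables that remain.
tables = [
    row[0]
    for row in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)
conn.close()
```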
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Database
SQL Data Manipulation Language (DML) is a language which enables
users to access and manipulate data.
SELECT: retrieval of data from the database
INSERT INTO: insertion of new data into the database
UPDATE: modification of data in the database
DELETE: deletion of data in the database
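A minimal sketch of the four DML statements, again using Python's standard `sqlite3` module with an in-memory database. The `purchases` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# INSERT INTO: add new rows (placeholders keep values separate from the SQL text).
cur.executemany(
    "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
    [("ann", 20.0), ("bob", 35.5), ("ann", 12.5)],
)

# UPDATE: modify existing rows.
cur.execute("UPDATE purchases SET amount = amount + 5 WHERE customer = ?", ("bob",))

# DELETE: remove rows that match a condition.
cur.execute("DELETE FROM purchases WHERE amount < ?", (15.0,))

# SELECT: retrieve the remaining rows.
rows = cur.execute(
    "SELECT customer, amount FROM purchases ORDER BY id"
).fetchall()
print(rows)

conn.commit()
conn.close()
```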
INTRODUCTION TO DATA MINING
• DATA PREPARATION
Database
ETL (Extraction, Transformation and Loading)
ETL extracts data from data sources and loads it into data
destinations using a set of transformation functions.
• Data extraction provides the ability to extract
data from a variety of data sources, such as flat
files, relational databases, streaming data, XML
files, and ODBC/JDBC data sources.
• Data transformation provides the ability to
cleanse, convert, aggregate, merge, and split
data.
• Data loading provides the ability to load data into
destination databases via update, insert or delete
statements, or in bulk.
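The three ETL steps can be sketched end to end with Python's standard library. Everything specific here is an assumption for illustration: the inline CSV text stands in for a flat-file source, the cleansing rules are arbitrary, and an in-memory SQLite table stands in for the destination database.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source, inlined so the example is self-contained.
SOURCE = """customer,amount
 Ann ,20.50
bob,
ANN,12.25
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse names, drop rows with no amount, aggregate per customer."""
    totals = {}
    for row in rows:
        name = row["customer"].strip().lower()
        amount = row["amount"].strip()
        if not amount:
            continue  # cleanse: skip incomplete records
        totals[name] = totals.get(name, 0.0) + float(amount)
    return sorted(totals.items())


def load(records, conn):
    """Load: insert the transformed records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
loaded = conn.execute("SELECT customer, total FROM customer_totals").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the description above and lets each step be swapped out, for example reading from a database instead of a flat file.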

 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Data Mining Presentation.pptx

  • 2. • WHAT IS DATA MINING? INTRODUCTION TO DATA MINING Data mining, or data science, is about analyzing information to understand the past and predict the future. Combines: STATISTICS, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE, DATABASES.
  • 3. • WHAT IS DATA MINING? INTRODUCTION TO DATA MINING STATISTICS: The science of collecting, classifying, summarizing, organizing, analyzing, and interpreting data. ARTIFICIAL INTELLIGENCE: The study of computer algorithms dealing with the simulation of intelligent behaviors in order to perform those activities that are normally thought to require intelligence. MACHINE LEARNING: The study of computer algorithms that learn in order to improve automatically through experience. DATABASES: The science and technology of collecting, storing and managing data so users can retrieve, add, update or remove such data.
  • 4. • WHAT IS DATA MINING? INTRODUCTION TO DATA MINING Businesses, as well as fields like engineering and medicine, find value in using data mining to extract useful knowledge from their accumulated data. This helps them attract more clients, increase sales, and make more profits.
  • 5. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING In the 1960s, statisticians used the terms “Data Fishing” or “Data Dredging” to refer to what they considered the bad practice of analyzing data without a prior hypothesis. The term “Data Mining” appeared around 1990 in the database community.
  • 6. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING The current evolution of data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms and machine learning.
  • 7. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING Evolution of Sciences Before 1600, empirical science. 1600-1950s, theoretical science • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. 1950s-1990s, computational science • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics). • Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. 1990-now, data science • The flood of data from new scientific instruments and simulations • The ability to economically store and manage petabytes of data online • The Internet and computing Grid that make all these archives universally accessible • Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
  • 8. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING Evolution of Database Technology 1960s: • Data collection, database creation, IMS and network DBMS 1970s: • Relational data model, relational DBMS implementation 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial, scientific, engineering, etc.) 2000s: • Stream data management and mining • Data mining and its applications • Web technology (XML, data integration) and global information systems
  • 9. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data. The algorithms examine the data and determine the model that is closest to the characteristics of the data being examined. Data mining algorithms can be characterized as consisting of three parts: Model: The purpose of the algorithm is to fit a model to the data. Preference: Some criteria must be used to prefer one model over another. Search: All algorithms require some technique to search the data.
  • 10. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING Example Credit card companies must determine whether to authorize credit card purchases. Suppose that based on past historical information about purchases, each purchase is placed into one of four classes: (1) authorize, (2) ask for further identification before authorization, (3) do not authorize, and (4) do not authorize but contact police. The data mining functions here are twofold. First, the historical data must be examined to determine how the data fit into the four classes. Then the problem is to apply this model to each new purchase. Although the second part indeed may be stated as a simple database query, the first part cannot be.
  • 11. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING Example In Example 1, we're dealing with credit card data and want to create categories (classes) based on past purchase outcomes. We examine factors like purchase amount and customer history to define these categories. For instance, we might approve small purchases for loyal customers but reject transactions on reported stolen cards. The challenge is to figure out the right criteria for each category by looking at the data patterns.
  • 12. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING As seen in Figure 3, the model that is created can be either predictive or descriptive in nature. The figure shows, under each model type (predictive model, descriptive model), some of the most common data mining tasks that use that type of model.
  • 13. • HISTORY OF DATA MINING INTRODUCTION TO DATA MINING A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be made based on the use of other historical data. A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.
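The two steps of the credit card example (fit a model to historical purchases, then apply it to each new purchase) can be sketched as follows. This is a deliberately simplified illustration, not the method a card issuer would use: it assumes only two of the four classes, a single factor (purchase amount), and made-up historical records; `fit_threshold` and `classify` are hypothetical helper names.

```python
# Hypothetical history: (purchase amount, outcome). Only two classes are used here.
history = [(20, "authorize"), (35, "authorize"), (50, "authorize"),
           (400, "deny"), (650, "deny"), (900, "deny")]


def fit_threshold(records):
    """Step 1: pick the amount cut-off that misclassifies the fewest past purchases."""
    amounts = sorted(amount for amount, _ in records)
    # Candidate cut-offs are the midpoints between consecutive amounts.
    candidates = [(lo + hi) / 2 for lo, hi in zip(amounts, amounts[1:])]

    def errors(cutoff):
        return sum((label == "authorize") != (amount <= cutoff)
                   for amount, label in records)

    return min(candidates, key=errors)


def classify(amount, cutoff):
    """Step 2: apply the fitted model to a new purchase."""
    return "authorize" if amount <= cutoff else "deny"


threshold = fit_threshold(history)
print(threshold, classify(30, threshold), classify(500, threshold))
```

Once the threshold is known, the second step is little more than a simple query on the amount, while the first step has to search the historical data for the best criterion.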
  • 14. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Summarization Association Rules Sequence Discovery Time Series Analysis Prediction Clustering Classification Regression
  • 15. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Classification Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Two examples of classification applications are determining whether to make a bank loan and identifying credit risks. Classification algorithms require that the classes be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes. Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes.
  • 16. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Prediction Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the prediction model, although the prediction task is a type of prediction model.) The difference is that prediction is predicting a future state rather than a current state. Here we are referring to a type of application rather than to a type of data mining modeling approach, as discussed earlier. Prediction applications include flooding, speech recognition, machine learning, and pattern recognition. Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well. Example 5 illustrates the process.
  • 17. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Regression Regression analysis is a statistical method used to model and examine the relationship between a dependent variable (the outcome or response) and one or more independent variables (predictors or features). The goal of regression analysis is to understand how changes in the independent variables are associated with changes in the dependent variable, allowing for prediction or estimation of the dependent variable based on the values of the independent variables.
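As an illustration of the idea, the simplest case (one independent variable, fitted by ordinary least squares) can be sketched in a few lines. The data points and the name `fit_line` are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of an independent variable x and a dependent variable y.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # estimate y for an unseen x = 5
print(slope, intercept, predicted)
```

The fitted slope describes how a change in the independent variable is associated with a change in the dependent variable, and the line can then be used to estimate the dependent variable for new inputs.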
  • 18. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Clustering Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjointed. The clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters. Example 6 provides a simple clustering example. Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.
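A minimal sketch of grouping by similarity is given below. It uses k-means on a single numeric attribute, which is one common clustering technique chosen here for illustration; the slides do not name a specific algorithm, and the data and starting centers are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Repeatedly assign each value to its nearest center, then recompute the centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


values = [1, 2, 3, 10, 11, 12]          # hypothetical attribute values
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No class labels are supplied: the two groups emerge from the data alone, and it is left to a domain expert to say what each group means.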
  • 19. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Time Series Analysis With time series analysis, the value of an attribute is examined as it varies over time. The values usually are obtained at evenly spaced time points (hourly, daily, weekly, etc.).
  • 20. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Summarization Summarization maps data into subsets with associated simple descriptions. Summarization is also called characterization or generalization. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data. Alternatively, summary type information (such as the mean of some numeric attribute) can be derived from the data. The summarization succinctly characterizes the contents of the database.
  • 21. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Association Rules Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data. The best example of this type of application is to determine association rules. An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together.
  • 22. INTRODUCTION TO DATA MINING • BASIC DATA MINING TASK Sequence Discovery Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns are based on a time sequence of actions. These patterns are similar to associations in that data (or events) are found to be related, but the relationship is based on time. Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order. Example 9 illustrates the discovery of some simple patterns. A similar type of discovery can be seen in the sequence within which data are purchased. For example, most people who purchase CD players may be found to purchase CDs within one week. As we will see, temporal association rules really fall into this category.
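One simple way to examine how an attribute varies over evenly spaced time points is a moving average, sketched below. The moving average is an illustrative choice, not a technique named in the slides, and the daily values are made up.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive, evenly spaced observations."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


daily_sales = [10, 12, 14, 13, 15]      # hypothetical daily observations
smoothed = moving_average(daily_sales, window=3)
print(smoothed)
```

Smoothing of this kind makes the underlying trend easier to see than in the raw values.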
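To make "items that are frequently purchased together" concrete, the sketch below scores a candidate rule with support and confidence. These two measures are the usual way association rules are evaluated, but they are not defined in the slides, and the transactions are invented for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule "bread implies butter" holds in half of all baskets and in two thirds of the baskets that contain bread.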
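The CD player example can be sketched as a count of customers whose second purchase follows the first within a time window. The purchase log, the day numbering, and the helper name `followed_within` are all assumptions made for illustration.

```python
def followed_within(purchases, first_item, second_item, max_days):
    """Count customers who bought `second_item` within `max_days` after `first_item`."""
    first_days, second_days = {}, {}
    for customer, item, day in purchases:
        if item == first_item:
            first_days.setdefault(customer, []).append(day)
        elif item == second_item:
            second_days.setdefault(customer, []).append(day)

    matched = 0
    for customer, starts in first_days.items():
        follow_ups = second_days.get(customer, [])
        # The second purchase must come strictly after the first, inside the window.
        if any(0 < later - start <= max_days
               for start in starts for later in follow_ups):
            matched += 1
    return matched, len(first_days)


# Hypothetical log of (customer, item, day number).
purchases = [
    ("ann", "cd_player", 1), ("ann", "cd", 4),
    ("bob", "cd_player", 2), ("bob", "cd", 20),
    ("cat", "cd_player", 5), ("cat", "cd", 6),
    ("dan", "cd", 3),
]

matched, buyers = followed_within(purchases, "cd_player", "cd", max_days=7)
print(matched, "of", buyers, "CD player buyers bought a CD within a week")
```

Unlike the market basket case, the order and timing of the purchases matter: a CD bought before the player, or long after it, does not count.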
  • 23. INTRODUCTION TO DATA MINING • THE DATA MINING PROCESS Figure 5 illustrates the phases, and the iterative nature, of a data mining project. The process flow shows that a data mining project does not stop when a particular solution is deployed. The results of data mining trigger new business questions, which in turn can be used to develop more focused models.
  • 24. INTRODUCTION TO DATA MINING • THE DATA MINING PROCESS Problem Definition This initial phase of a data mining project focuses on understanding the project objectives and requirements. Once you have specified the project from a business perspective, you can formulate it as a data mining problem and develop a preliminary implementation plan.
  • 25. INTRODUCTION TO DATA MINING • THE DATA MINING PROCESS Data Gathering and Preparation The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also the time to identify data quality problems and to scan for patterns in the data.
  • 26. INTRODUCTION TO DATA MINING • THE DATA MINING PROCESS Model Building and Evaluation In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, you will need to step back to the previous phase to implement them.
  • 27. INTRODUCTION TO DATA MINING • THE DATA MINING PROCESS Knowledge Deployment Knowledge deployment is the use of data mining within a target environment. In the deployment phase, insight and actionable information can be derived from data.
  • 28. INTRODUCTION TO DATA MINING • DATA MINING ISSUES There are many important implementation issues associated with data mining: Human Interaction, Overfitting, Outliers, Interpretation of result, Visualization of result, Large datasets, High dimensionality, Multimedia data, Missing data, Irrelevant data, Noisy data, Changing data, Integration, Application.
  • 29. INTRODUCTION TO DATA MINING • DATA MINING ISSUES Human Interaction Since data mining problems are often not precisely stated, interfaces may be needed with both domain and technical experts. Technical experts are used to formulate the queries and assist in interpreting the results. Users are needed to identify training data and desired results.
  • 30. INTRODUCTION TO DATA MINING • DATA MINING ISSUES Overfitting When a model is generated that is associated with a given database state, it is desirable that the model also fit future database states. Overfitting occurs when the model does not fit future states. This may be caused by assumptions that are made about the data or may simply be caused by the small size of the training database. For example, a classification model for an employee database may be developed to classify employees as short, medium, or tall. If the training database is quite small, the model might erroneously indicate that a short person is anyone under five feet eight inches because there is only one entry in the training database under five feet eight. In this case, many future employees would be erroneously classified as short. Overfitting can arise under other circumstances as well, even though the data are not changing.
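The employee height example can be sketched as follows. The heights (in inches), the labels, and the rule for learning the cut-off are all invented for illustration; the point is only that a cut-off learned from a tiny training set fits that set perfectly yet misclassifies many later records.

```python
def fit_short_cutoff(training):
    """Place the 'short' cut-off midway between the tallest short and the shortest non-short entry."""
    tallest_short = max(h for h, label in training if label == "short")
    shortest_other = min(h for h, label in training if label != "short")
    return (tallest_short + shortest_other) / 2


def error_rate(records, cutoff):
    """Fraction of records whose short / not-short prediction disagrees with the label."""
    wrong = sum((h < cutoff) != (label == "short") for h, label in records)
    return wrong / len(records)


# Tiny hypothetical training set: only one entry under five feet eight (68 inches).
training = [(67, "short"), (69, "medium"), (71, "medium")]
# Hypothetical future employees with their true labels.
future = [(62, "short"), (65, "medium"), (66, "medium"),
          (67, "medium"), (70, "medium")]

cutoff = fit_short_cutoff(training)
train_error = error_rate(training, cutoff)
future_error = error_rate(future, cutoff)
print(cutoff, train_error, future_error)
```

The model makes no mistakes on the training database but labels most of the future employees as short, which is the behaviour the slide describes.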
  • 31. INTRODUCTION TO DATA MINING Outliers • DATA MINING ISSUES There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue with very large databases. If a model is developed that includes these outliers, then the model may not behave well for data that are not outliers.
  • 32. Interpretation of result INTRODUCTION TO DATA MINING • DATA MINING ISSUES Currently, data mining output may require experts to correctly interpret the results, which might otherwise be meaningless to the average database user.
  • 33. Visualization of result INTRODUCTION TO DATA MINING • DATA MINING ISSUES To easily view and understand the output of data mining algorithms, visualization of the results is helpful.
  • 34. Large datasets INTRODUCTION TO DATA MINING • DATA MINING ISSUES The massive datasets associated with data mining create problems when applying algorithms designed for small datasets. The cost of many modeling applications grows exponentially with the dataset size, which makes them too inefficient for larger datasets. Sampling and parallelization are effective tools to attack this scalability problem.
  • 35. High dimensionality INTRODUCTION TO DATA MINING • DATA MINING ISSUES A conventional database schema may be composed of many different attributes. The problem here is that not all attributes may be needed to solve a given data mining problem.
  • 36. Multimedia data INTRODUCTION TO DATA MINING • DATA MINING ISSUES Most previous data mining algorithms are targeted to traditional data types (numeric, character, text, etc.). The use of multimedia data such as is found in GIS databases complicates or invalidates many proposed algorithms.
  • 37. Missing data INTRODUCTION TO DATA MINING • DATA MINING ISSUES During the preprocessing phase of KDD, missing data may be replaced with estimates. This and other approaches to handling missing data can lead to invalid results in the data mining step.
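One common way of replacing missing data with estimates is mean imputation, sketched below. It is shown only as an example of the kind of preprocessing the slide refers to; the values are invented, and `None` is assumed to mark a missing entry.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


incomes = [4.0, None, 6.0, None, 8.0]   # hypothetical attribute with gaps
filled = impute_mean(incomes)
print(filled)
```

Two of the five values are now estimates rather than measurements, so any pattern found later may partly reflect the imputation rather than the data, which is why this step can lead to invalid results.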
  • 38. Irrelevant data INTRODUCTION TO DATA MINING • DATA MINING ISSUES Some attributes in the database might not be of interest to the data mining task being developed.
  • 39. Noisy data INTRODUCTION TO DATA MINING • DATA MINING ISSUES Some attribute values might be invalid or incorrect. These values are often corrected before running data mining applications.
  • 40. Changing data INTRODUCTION TO DATA MINING • DATA MINING ISSUES Databases cannot be assumed to be static. However, most data mining algorithms do assume a static database. This requires that the algorithm be completely rerun anytime the database changes.
  • 41. Integration INTRODUCTION TO DATA MINING • DATA MINING ISSUES The KDD process is not currently integrated into normal data processing activities. KDD requests may be treated as special, unusual, or onetime needs. This makes them inefficient, ineffective, and not general enough to be used on an ongoing basis. Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.
  • 42. Application INTRODUCTION TO DATA MINING • DATA MINING ISSUES Determining the intended use for the information obtained from the data mining function is a challenge. Indeed, how business executives can effectively use the output is sometimes considered the more difficult part, not the running of the algorithms themselves. Because the data are of a type that has not previously been known, business practices may have to be modified to determine how to effectively use the information uncovered.
  • 43. INTRODUCTION TO DATA MINING • PROBLEM DEFINITION Understanding the project objectives and requirements from a domain perspective and then converting this knowledge into a data science problem definition with a preliminary plan designed to achieve the objectives. Data science projects are often structured around the specific needs of an industry sector (as shown below) or even tailored and built for a single organization. A successful data science project starts from a well-defined question or need.
  • 44. INTRODUCTION TO DATA MINING • PROBLEM DEFINITION
  • 45. INTRODUCTION TO DATA MINING • PROBLEM DEFINITION
  • 46. INTRODUCTION TO DATA MINING • PROBLEM DEFINITION
  • 47. INTRODUCTION TO DATA MINING • DATA PREPARATION Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is a solid practice to start with an initial dataset to get familiar with the data, to discover first insights into the data and have a good understanding of any possible data quality issues. Data preparation is often a time consuming process and heavily prone to errors.
  • 48. INTRODUCTION TO DATA MINING • DATA PREPARATION Data Data is information, typically the result of measurement (numerical) or counting (categorical).
  • 49. INTRODUCTION TO DATA MINING • DATA PREPARATION Data Variables serve as placeholders for data. There are two types of variables, numerical and categorical.
  • 50. INTRODUCTION TO DATA MINING • DATA PREPARATION Data A numerical or continuous variable is one that can accept any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose, ...). There are two types of numerical data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided because there is no true zero. For example, we cannot say that one day is twice as hot as another day. On the other hand, data on a ratio scale has true zero and can be added, subtracted, multiplied or divided (e.g., weight).
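A short worked example shows why ratios are not meaningful on an interval scale but are on a ratio scale. The temperatures and weights are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    return celsius + 273.15


# Interval scale: Celsius has no true zero, so the ratio depends on the unit.
day_a_c, day_b_c = 20.0, 10.0
ratio_celsius = day_a_c / day_b_c
ratio_kelvin = celsius_to_kelvin(day_a_c) / celsius_to_kelvin(day_b_c)

# Differences, however, are preserved when the unit changes.
diff_celsius = day_a_c - day_b_c
diff_kelvin = celsius_to_kelvin(day_a_c) - celsius_to_kelvin(day_b_c)

# Ratio scale: weight has a true zero, so the ratio is the same in any unit.
KG_TO_LB = 2.20462
weight_a_kg, weight_b_kg = 20.0, 10.0
ratio_kg = weight_a_kg / weight_b_kg
ratio_lb = (weight_a_kg * KG_TO_LB) / (weight_b_kg * KG_TO_LB)

print(ratio_celsius, ratio_kelvin, diff_celsius, diff_kelvin, ratio_kg, ratio_lb)
```

The "twice as hot" ratio of 2 in Celsius shrinks to roughly 1.04 in Kelvin, while "twice as heavy" stays 2 whether the weight is in kilograms or pounds.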
  • 51. INTRODUCTION TO DATA MINING • DATA PREPARATION Data A categorical or discrete variable is one that can accept two or more values (categories). There are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic ordering in the categories. For example, "gender" with two categories, male and female. In contrast, ordinal data does have an intrinsic ordering in the categories. For example, "level of energy" with three orderly categories (low, medium and high).
  • 52. INTRODUCTION TO DATA MINING • DATA PREPARATION Dataset Dataset is a collection of data, usually presented in a tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data.
  • 53. INTRODUCTION TO DATA MINING • DATA PREPARATION Dataset In predictive modeling, predictors or attributes are the input variables and target or class attribute is the output variable whose value is determined by the values of the predictors and function of the predictive model.
  • 54. INTRODUCTION TO DATA MINING • DATA PREPARATION Database Database collects, stores and manages information so users can retrieve, add, update or remove such information. It presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from related tables is the basis for the term relational database. A Database Management System (DBMS) handles the way data is stored, maintained, and retrieved. Most data science toolboxes connect to databases through ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) interfaces.
  • 55. INTRODUCTION TO DATA MINING • DATA PREPARATION Database SQL (Structured Query Language) is a database computer language for managing and manipulating data in relational database management systems (RDBMS). SQL Data Definition Language (DDL) permits database tables to be created, altered or deleted. We can also define indexes (keys), specify links between tables, and impose constraints between database tables. CREATE TABLE: creates a new table; ALTER TABLE: alters a table; DROP TABLE: deletes a table; CREATE INDEX: creates an index; DROP INDEX: deletes an index.
  • 56. INTRODUCTION TO DATA MINING • DATA PREPARATION Database SQL Data Manipulation Language (DML) is a language which enables users to access and manipulate data. SELECT: retrieval of data from the database; INSERT INTO: insertion of new data into the database; UPDATE: modification of data in the database; DELETE: deletion of data in the database.
  • 57. INTRODUCTION TO DATA MINING • DATA PREPARATION Database ETL (Extraction, Transformation and Loading) ETL extracts data from data sources and loads it into data destinations using a set of transformation functions. Data extraction provides the ability to extract data from a variety of data sources, such as flat files, relational databases, streaming data, XML files, and ODBC/JDBC data sources. Data transformation provides the ability to cleanse, convert, aggregate, merge, and split data. Data loading provides the ability to load data into destination databases via update, insert or delete statements, or in bulk.
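The DDL statements listed above can be tried with Python's built-in `sqlite3` module, as in the sketch below. SQLite is used here only because it needs no server; the table and index names are hypothetical, and other database systems differ in the details of `ALTER TABLE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
cur = conn.cursor()

# CREATE TABLE: creates a new table.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# ALTER TABLE: alters a table (here, adds a column).
cur.execute("ALTER TABLE customers ADD COLUMN city TEXT")
# CREATE INDEX: creates an index.
cur.execute("CREATE INDEX idx_customers_name ON customers (name)")

# Inspect the resulting schema (PRAGMA table_info is SQLite-specific).
columns = [row[1] for row in cur.execute("PRAGMA table_info(customers)")]

# DROP INDEX and DROP TABLE: delete the index and the table.
cur.execute("DROP INDEX idx_customers_name")
cur.execute("DROP TABLE customers")
remaining = cur.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'index')"
).fetchall()
conn.close()

print(columns, remaining)
```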
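The four DML statements can be sketched in the same way with Python's built-in `sqlite3` module. The `purchases` table and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# INSERT INTO: insertion of new data.
cur.executemany(
    "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
    [("ann", 20.0), ("bob", 400.0), ("ann", 35.0)],
)
# UPDATE: modification of existing data.
cur.execute("UPDATE purchases SET amount = 450.0 WHERE customer = 'bob'")
# DELETE: deletion of data.
cur.execute("DELETE FROM purchases WHERE amount < 30.0")
# SELECT: retrieval of data.
rows = cur.execute("SELECT customer, amount FROM purchases ORDER BY id").fetchall()
conn.close()

print(rows)
```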
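The three ETL stages can be sketched as three small functions. This is a toy pipeline under stated assumptions: the source is an in-memory CSV string standing in for a flat file, the destination is an in-memory SQLite database, and the function, table, and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a flat-file source; one row has an invalid amount.
RAW_CSV = "customer,amount\nann, 20.5 \nbob,abc\nann,30\n"


def extract(text):
    """Extraction: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: cleanse (drop bad rows), convert (text to float), aggregate (sum per customer)."""
    totals = {}
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue                     # cleanse: skip rows that cannot be converted
        name = row["customer"].strip()
        totals[name] = totals.get(name, 0.0) + amount
    return sorted(totals.items())


def load(records, conn):
    """Loading: write the transformed records to the destination via insert statements."""
    conn.execute("CREATE TABLE totals (customer TEXT PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO totals (customer, total) VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT customer, total FROM totals").fetchall()
conn.close()

print(loaded)
```

Keeping the stages as separate functions mirrors the description above: each source or destination can be swapped without touching the transformation logic.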
  • 58. INTRODUCTION TO DATA MINING • DATA PREPARATION Database ETL (Extraction, Transformation and Loading)