notes_dmdw_chap1.docx

DATA MINING
CHAPTER – 1
INTRODUCTION TO DATA MINING
What is Data Mining?
Data Mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful facts that allow a business to take data-driven decisions.
In other words, Data Mining is the process of investigating hidden patterns of information from
various perspectives and categorizing them into useful data. This data is collected and assembled
in particular areas such as data warehouses, where efficient analysis and data mining algorithms
support decision making and other data requirements, eventually cutting costs and generating
revenue.
Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical
algorithms to segment data and evaluate the probability of future events. Data Mining is also
called Knowledge Discovery from Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and
individuals extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data
cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and
knowledge presentation.
Data mining is the process of extracting useful information from large sets of data. It involves
using various techniques from statistics, machine learning, and database systems to identify
patterns, relationships, and trends in the data. This information can then be used to make data-
driven decisions, solve business problems, and uncover hidden insights. Applications of data
mining include customer profiling and segmentation, market basket analysis, anomaly detection,
and predictive modelling. Data mining tools and technologies are widely used in various
industries, including finance, healthcare, retail, and telecommunications.
In general terms, “Mining” is the process of extracting some valuable material from the earth,
e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining” is also
referred to as knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging. It is basically the process of extracting useful information
from large volumes of data or from data warehouses. One can see that the term itself is
a little confusing. In the case of coal or diamond mining, the result of the extraction process is
coal or diamond. But in the case of Data Mining, the result of the extraction process is not data.
Instead, data mining results are the patterns and knowledge that we gain at the end of the
extraction process. In that sense, we can think of Data Mining as a step in the process of
Knowledge Discovery or Knowledge Extraction.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.
However, the term ‘data mining’ became more popular in the business and press communities.
Currently, Data Mining and Knowledge Discovery are used interchangeably.
Nowadays, data mining is used in almost all places where a large amount of data is stored and
processed. For example, banks typically use data mining to find prospective customers
who could be interested in credit cards, personal loans, or insurance. Since banks have
the transaction details and detailed profiles of their customers, they analyse all this data and try
to find patterns that help them predict which customers could be interested in personal
loans, etc.
Main Purpose of Data Mining
Data mining integrates techniques from many other domains, such as statistics, machine
learning, pattern recognition, database and data warehouse systems, information retrieval, and
visualization, to gather more information about the data, to help predict hidden patterns,
future trends, and behaviours, and to allow businesses to make decisions.
Technically, data mining is the computational process of analysing data from different
perspectives, dimensions, angles and categorizing/summarizing it into meaningful information.
Data Mining can be applied to any type of data, e.g. data warehouses, transactional
databases, relational databases, multimedia databases, spatial databases, time-series
databases, and the World Wide Web.
Data mining as a whole process
The whole process of Data Mining consists of three main phases:
1. Data Pre-processing – Data cleaning, integration, selection, and transformation take place
2. Data Extraction – The actual data mining takes place
3. Data Evaluation and Presentation – The results are analysed and presented

Types of Data Mining
Data mining can be performed on the following types of data:
• Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and
columns from which data can be accessed in various ways without having to reorganize the database
tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
• Data warehouses:
A data warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from multiple
places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps
in decision-making for a business organization. The data warehouse is designed for the analysis of
data rather than transaction processing.
• Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization has kept various kinds of
information.
• Object-Relational Database:
A combination of an object-oriented database model and a relational database model is called an
object-relational model. It supports classes, objects, inheritance, etc.
One of the primary objectives of the object-relational data model is to close the gap between the
relational database and the object-oriented practices frequently used in many
programming languages, for example, C++, Java, C#, and so on.
• Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo a
database transaction if it is not performed appropriately. Although this was a unique
capability a long time ago, today most relational database systems support
transactional database activities.
There are several benefits (advantages) of data mining, including:
1. Improved decision making: Data mining can provide valuable insights that can help
organizations make better decisions by identifying patterns and trends in large data sets.
2. Increased efficiency: Data mining can automate repetitive and time-consuming tasks, such
as data cleaning and preparation, which can help organizations save time and resources.
3. Enhanced competitiveness: Data mining can help organizations gain a competitive edge by
uncovering new business opportunities and identifying areas for improvement.
4. Improved customer service: Data mining can help organizations better understand their
customers and tailor their products and services to meet their needs.
5. Fraud detection: Data mining can be used to identify fraudulent activities by detecting
unusual patterns and anomalies in data.
6. Predictive modeling: Data mining can be used to build models that can predict future
events and trends, which can be used to make proactive decisions.
7. New product development: Data mining can be used to identify new product opportunities
by analyzing customer purchase patterns and preferences.
8. Risk management: Data mining can be used to identify potential risks by analyzing data on
customer behavior, market conditions, and other factors.
Disadvantages of Data Mining:
1. Privacy concerns: Data mining can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about individuals.
2. Complexity: Data mining can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
3. Unintended consequences: Data mining can lead to unintended consequences, such as bias
or discrimination, if the data or models are not properly understood or used.
4. Data quality: The data mining process depends heavily on the quality of the data. If the data
is not accurate or consistent, the results can be misleading.
5. High cost: Data mining can be an expensive process, requiring significant investments in
hardware, software, and personnel.
Real-life examples of Data Mining
• Market Basket Analysis: It is a technique for the careful study of the purchases made
by customers in a supermarket. The concept is applied to identify the items that
are bought together by a customer. Say, if a person buys bread, what are the chances that
he/she will also purchase butter? This analysis helps companies promote offers and deals,
and it is carried out with the help of data mining.
• Protein Folding: It is a technique that carefully studies biological cells and predicts
the protein interactions and functionality within them. Applications of this
research include determining causes and possible cures for Alzheimer’s,
Parkinson’s, and cancer caused by protein misfolding.
• Fraud Detection: Nowadays, in this age of cell phones, we can use data mining to
analyze cell phone activity and identify suspicious usage. This can help us to
detect calls made on cloned phones. Similarly, with credit cards, comparing purchases
with historical purchases can detect activity with stolen cards.
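The credit-card comparison described above can be sketched as a simple outlier check. This is only a minimal illustration, assuming the purchase amount is the only feature and that a z-score above 3 (an arbitrary threshold) marks a purchase as suspicious. The function name `is_suspicious` and the sample amounts are hypothetical.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag a purchase whose amount is far from the card's historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical: anything different stands out.
        return amount != mu
    z_score = abs(amount - mu) / sigma
    return z_score > threshold

# Made-up purchase history for one card.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_suspicious(history, 44.0))   # a typical amount
print(is_suspicious(history, 900.0))  # far outside the usual range
```

Real fraud detection systems use many more attributes (location, merchant, time of day); the sketch only shows the idea of comparing a new purchase with historical ones.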
Data mining also has many successful applications, such as business intelligence, Web search,
bioinformatics, health informatics, finance, digital libraries, and digital governments.
BASIC DATA MINING TASKS
Data mining functions are used to define the trends or correlations to be found in data mining
activities. Broadly, data mining activities can be divided into two categories:
1. Predictive Data Mining:
Predictive data mining tasks build a model from the available data set that is helpful in
predicting unknown or future values of another data set of interest.
Example: A medical practitioner trying to diagnose a disease based on a patient's medical test
results can be considered a predictive data mining task.
It helps developers supply values for attributes that are not labelled. Based on previous tests,
the software estimates the characteristics that are absent. For example: judging from the
findings of a patient’s medical examinations whether he is suffering from any particular disease.
2. Descriptive Data Mining:
Descriptive data mining tasks usually find patterns that describe the data and come up with new,
significant information from the available data set.
Example: A retailer trying to identify products that are purchased together can be considered
a descriptive data mining task.
It provides knowledge to understand what is happening within the data without any
prior idea. The common features of the data are highlighted in the data set. For example: count,
average, etc.

PREDICTIVE DATA MINING:
A predictive model of a data mining task comprises classification, regression, prediction, and time
series analysis. Predictive data mining is sometimes also called statistical regression. It refers to
a supervised learning technique that involves explaining the dependency of some attributes'
values upon the values of other attributes in the same item, and developing a model that can
predict these attributes' values for new cases.
1. Classification:
In data mining, classification refers to a form of data analysis where a machine learning model
assigns a specific category to a new observation. It is based on what the model has learned
from the data sets. In other words, classification is the act of assigning objects to one of
several predefined categories.
One example of classification in the banking and financial services industry is identifying
whether transactions are fraudulent or not. In the same way, machine learning can also be
used to predict whether a loan application would be approved or not.
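The fraud example above can be sketched with one of the simplest classifiers, a 1-nearest-neighbour rule: a new transaction receives the label of the most similar labelled transaction. This is an illustrative sketch, not a production method. The features (amount, hour of day), the labels, and the function name `nearest_neighbour` are all assumptions made for the example.

```python
import math

def nearest_neighbour(training, point):
    """Return the label of the training example closest to `point`."""
    closest = min(training, key=lambda example: math.dist(example[0], point))
    return closest[1]

# Hypothetical labelled data: (amount, hour of day) -> category.
training = [
    ((25.0, 14), "legitimate"),
    ((40.0, 10), "legitimate"),
    ((950.0, 3), "fraudulent"),
    ((1200.0, 2), "fraudulent"),
]

print(nearest_neighbour(training, (30.0, 12)))
print(nearest_neighbour(training, (1000.0, 4)))
```

In practice the features would be scaled first, since the amount dominates the distance here.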
2. Regression:
Regression refers to a method that estimates the value of data as a function of other
attributes. Generally, it is used for continuous numeric data.
A linear regression model in the context of machine learning or statistics is basically a linear
approach for modelling the relationship between the dependent variable, known as the result,
and the independent variables, known as features.
If the model has only one independent variable, it is called simple linear regression; otherwise
it is called multiple linear regression.
Types of regression
a. Linear Regression: Linear regression searches for the optimal line that fits two
attributes so that, with the help of one attribute, we can predict the other.
b. Multi-linear regression: Multi-linear regression involves two or more predictor attributes,
and the data are fit in a multi-dimensional space.
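Simple linear regression, as described above, can be sketched with the ordinary least-squares formulas for the slope and intercept of the best-fitting line. The function name `fit_line` and the sample data are illustrative assumptions. The data were chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # points on the line y = 2x + 1

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # use one attribute to predict the other
print(slope, intercept, predicted)
```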
3. Prediction:
In data mining, prediction is used to identify a data value based on the description of another
corresponding data value. Prediction of numeric values in data mining is known as numeric
prediction. Generally, regression analysis is used for prediction. For example, in credit card
fraud detection, the history of a particular person's credit card usage has to be analysed. If any
abnormal pattern is detected, it should be reported as a 'fraudulent action'.
4. Time series analysis:
Time series analysis deals with data sets that are indexed by time. Time serves as the
independent variable used to predict the dependent variable.
A time series is a sequence of events where the next event is determined by one or more of the
preceding events. A time series reflects the process being measured, and there are certain
components that affect the behaviour of a process. Time series analysis includes methods to
analyse time-series data in order to extract useful patterns, trends, rules and statistics. Stock
market prediction is an important application of time-series analysis.
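The idea that the next value depends on the preceding ones can be sketched with a moving-average forecast, one of the simplest time-series techniques. This is an illustration only and not a realistic stock-market model. The function name `moving_average_forecast`, the window size, and the prices are assumptions.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Made-up closing prices, oldest first.
prices = [100.0, 102.0, 101.0, 105.0, 107.0]

forecast = moving_average_forecast(prices)  # mean of 101, 105 and 107
print(round(forecast, 2))
```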
DESCRIPTIVE DATA MINING:

A descriptive model differentiates the patterns and relationships in data. A descriptive model does
not attempt to generalize to a statistical population or random process. A predictive model attempts
to generalize to a population or random process. Predictive models should give prediction intervals
and must be cross-validated; that is, they must prove that they can be used to make predictions with
data that was not used in constructing the model.
Descriptive analytics focuses on the summarization and conversion of the data into useful
information for reporting and monitoring.
1. Clustering:
Clustering is grouping a set of objects so that objects in the same group, called a cluster, are
more similar to each other than to those in other clusters.
Clustering is used to identify data objects that are similar to one another. The similarity can
be decided based on a number of factors like purchase behaviour, responsiveness to certain
actions, geographical locations and so on. For example, an insurance company can cluster its
customers based on age, residence, income etc. This group information will be helpful to
understand the customers better and hence provide better customized services.
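The insurance example above can be sketched with k-means clustering on a single attribute (age). This is a minimal sketch, assuming two clusters and hand-picked starting centroids. The function name `kmeans_1d` and the ages are hypothetical, and a real analysis would cluster on several attributes at once.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around centroids by repeated assign-and-update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

ages = [22, 25, 27, 45, 48, 52]
centroids, clusters = kmeans_1d(ages, centroids=[20, 60])
print(clusters)
```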
2. Association rules:
Association rules determine relationships (associations, not necessarily causal ones) between
huge sets of data objects. For example, the algorithm takes the list of items you purchased at the
grocery store over the past six months and calculates the percentage of the time that items are
purchased together. For example, what are the chances of you buying milk with cereal?
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing etc. A retailer can identify the
products that customers normally purchase together or even find the customers who respond
to the promotion of the same kind of products. If a retailer finds that beer and nappies are
mostly bought together, he can put nappies on sale to promote the sale of beer.
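The "percentage at which items are purchased together" corresponds to the two standard association-rule measures, support and confidence. The sketch below computes both for the milk-and-cereal question. The function names and the four transactions are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical shopping baskets.
transactions = [
    {"milk", "cereal", "bread"},
    {"milk", "cereal"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

print(support(transactions, {"milk", "cereal"}))       # 2 of 4 baskets
print(confidence(transactions, {"cereal"}, {"milk"}))  # cereal buyers who buy milk
print(confidence(transactions, {"milk"}, {"cereal"}))  # milk buyers who buy cereal
```

A high confidence shows that items co-occur; it does not by itself show that buying one item causes the purchase of the other.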
3. Sequence:
Sequence discovery refers to finding useful patterns in data where the order of events matters,
judged against some objective measure of how interesting they are.
4. Summarization:
Summarization presents a data set in a more compact, easy-to-understand form.
Summarization is the generalization of data. A set of relevant data is summarized, which results
in a smaller set that gives aggregated information about the data. For example, the shopping
done by a customer can be summarized into total products, total spending, offers used, etc.
Such high-level summarized information can be useful to the sales or customer relationship team
for detailed customer and purchase behaviour analysis. Data can be summarized at different
abstraction levels and from different angles.
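The customer example above (total products, total spending, offers used) can be sketched as a small aggregation. The record layout, field names, and values are hypothetical.

```python
def summarize(purchases):
    """Reduce a customer's purchase records to a few aggregate figures."""
    return {
        "total_products": len(purchases),
        "total_spending": sum(p["price"] for p in purchases),
        "offers_used": sum(1 for p in purchases if p["offer_used"]),
    }

# Made-up purchase records for one customer.
purchases = [
    {"product": "bread", "price": 2.5, "offer_used": False},
    {"product": "butter", "price": 4.0, "offer_used": True},
    {"product": "milk", "price": 1.5, "offer_used": False},
]

summary = summarize(purchases)
print(summary)
```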
DATA MINING VS KNOWLEDGE DISCOVERY IN DATABASES
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two
related yet slightly different concepts.

KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the
KDD process, which deals with identifying patterns in data.
And Data Mining is only the application of a specific algorithm based on the overall goal of the KDD
process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and
new data can be integrated and transformed to get different and more appropriate results.
KNOWLEDGE DISCOVERY IN DATABASES (KDD).
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
The KDD process in data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include
tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix
or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful
information and insights. This may include tasks such as clustering, classification, association
rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include
tasks such as visualizing the results, evaluating the quality of the discovered patterns and
identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate,
and meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make
decisions.
The KDD process is an iterative process and it requires multiple iterations of the above steps to
extract accurate knowledge from the data.
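The steps listed above can be sketched as a chain of small functions, one per stage. This is a toy pipeline under stated assumptions. The records, field names, and the 100.0 threshold are invented, dropping incomplete records is only one of several ways to handle missing values, and the "mining" step is a deliberately trivial aggregation standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records; one has a missing value and untidy text.
raw_records = [
    {"customer": "a", "city": "Pune", "spend": "120"},
    {"customer": "b", "city": "pune ", "spend": None},
    {"customer": "c", "city": "Delhi", "spend": "80"},
    {"customer": "d", "city": "Pune", "spend": "200"},
]

def select(records):
    """Selection: keep only the attributes relevant to the question."""
    return [{"city": r["city"], "spend": r["spend"]} for r in records]

def preprocess(records):
    """Pre-processing: drop records with missing values and tidy the text."""
    return [{"city": r["city"].strip().title(), "spend": r["spend"]}
            for r in records if r["spend"] is not None]

def transform(records):
    """Transformation: convert to (city, numeric spend) pairs."""
    return [(r["city"], float(r["spend"])) for r in records]

def mine(pairs):
    """Data mining: total spend per city (a stand-in for a real algorithm)."""
    totals = Counter()
    for city, spend in pairs:
        totals[city] += spend
    return totals

def evaluate(totals, minimum=100.0):
    """Interpretation and evaluation: keep only totals above a threshold."""
    return {city: total for city, total in totals.items() if total >= minimum}

knowledge = evaluate(mine(transform(preprocess(select(raw_records)))))
print(knowledge)
```

Because the process is iterative, in practice one would inspect `knowledge`, adjust the selection or threshold, and run the chain again.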
Why do we need Data Mining?
The volume of information from business transactions, scientific data, sensor data, pictures,
videos, etc. is increasing every day beyond what we can handle. So, we need a system that is
capable of extracting the essence of the available information and that can automatically
generate reports, views, or summaries of data for better decision-making.
Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data
ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
5. Predictive modelling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large
amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to
implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data. If the data is not
accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware,
software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, which is a common problem in machine
learning where a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new unseen data.
| Parameter | KDD | Data Mining |
|---|---|---|
| Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets. |
| Objective | To find useful knowledge from data. | To extract useful information from data. |
| Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction. |
| Output | Structured information, such as rules and models that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding. |
| Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data. |
| Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge. |
MAJOR ISSUES IN DATA MINING:
1. Mining different kinds of knowledge in databases – The needs of different users are not
the same. Different users may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction – The data mining
process needs to be interactive because this allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
3. Incorporation of background knowledge – Background knowledge can be used to guide the
discovery process and to express discovered patterns not only in concise terms but also
at multiple levels of abstraction.
4. Data mining query languages and ad-hoc data mining – A data mining query language
that allows the user to describe ad-hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results – Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by users.
6. Handling noisy or incomplete data – Data cleaning methods are required that can
handle noise and incomplete objects while mining data regularities. Without data cleaning
methods, the accuracy of the discovered patterns will be poor.
7. Pattern evaluation – It refers to the interestingness of the discovered patterns. Discovered
patterns may be uninteresting because they either represent common knowledge or lack
novelty.
8. Efficiency and scalability of data mining algorithms – In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be
efficient and scalable.
9. Parallel, distributed, and incremental mining algorithms – Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions that are then processed in parallel. The results from
the partitions are then merged. Incremental algorithms incorporate database updates without
having to mine the data again from scratch.
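The partition, process, and merge pattern from item 9 can be sketched with item counting as the mining task. This is a structural illustration only. Threads are used to keep the example self-contained, whereas a real system would typically use separate processes or machines. The function names and transactions are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

def parallel_item_counts(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)  # Counter.update adds counts together
    return merged

def add_increment(counts, new_transactions):
    """Incremental step: fold in new data without recounting the old data."""
    counts.update(count_items(new_transactions))
    return counts

transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
counts = parallel_item_counts(transactions)
counts = add_increment(counts, [["eggs"]])
print(counts)
```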
DATA MINING METRICS
Data mining is one of the forms of artificial intelligence that uses perception models, analytical
models, and multiple algorithms to simulate the techniques of the human brain. Data mining
supports machines in making human-like decisions and choices.
The user of the data mining tools has to give the machine rules, preferences, and even
experiences in order to obtain decision support. Data mining metrics are as follows −
• Usefulness − Usefulness involves several metrics that tell us whether the model provides
useful information. For instance, a data mining model that correlates store location with sales
can be both accurate and reliable, but may not be useful, because the result cannot be
generalized by adding more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific locations have
more sales. One may also find that a model that appears successful is in fact meaningless because
it depends on cross-correlations in the data.
• Return on Investment (ROI) − Data mining tools find interesting patterns buried inside
the data and develop predictive models. These models have several measures for
denoting how well they fit the records. It is not always clear how to make a decision based on
some of the measures reported as part of data mining analyses.
• Access Financial Information during Data Mining − The simplest way to frame decisions
in financial terms is to augment the raw information that is generally mined so that it also
contains financial data. Some organizations are investing in and developing data warehouses
and data marts.
The design of a warehouse or mart involves considerations about the types of analyses and data
needed for expected queries. Designing warehouses in a way that allows access to financial
information along with access to more typical data on product attributes, user profiles, etc. can be
useful.
• Converting Data Mining Metrics into Financial Terms − A common data mining metric is
the measure of "Lift". Lift is a measure of what is achieved by using the specific model or
pattern relative to a base rate in which the model is not used. High values mean much is
achieved. It can seem then that one can simply make a decision based on Lift.
• Accuracy − Accuracy is a measure of how well the model correlates results with the attributes
in the data that has been provided. There are several measures of accuracy, but all measures
of accuracy are dependent on the information that is used. In reality, values can be missing
or approximate, or the data can have been changed by several processes.
During exploration and development, one can decide to accept a specific amount of
error in the data, especially if the data is fairly uniform in its characteristics. For example, a model
that predicts sales for a specific store based on past sales can be strongly correlated and very
accurate, even if that store consistently used the wrong accounting techniques. Thus, measurements
of accuracy should be balanced by assessments of reliability.
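Two of the metrics above can be sketched numerically. Lift is computed here as the response rate achieved with the model divided by the base rate without it, following the description above. Accuracy is taken as the fraction of predictions that match the known outcomes, which is only one of the several accuracy measures mentioned. The function names and all figures are hypothetical.

```python
def lift(model_hits, model_total, base_hits, base_total):
    """Response rate when using the model, relative to the base rate."""
    model_rate = model_hits / model_total
    base_rate = base_hits / base_total
    return model_rate / base_rate

def accuracy(predicted, actual):
    """Fraction of predictions that equal the known outcomes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up campaign: 20 of the 100 customers picked by the model responded,
# while 100 of all 1000 customers respond when no model is used.
campaign_lift = lift(20, 100, 100, 1000)

# Made-up predictions against known outcomes (1 = responded, 0 = did not).
model_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

print(campaign_lift, model_accuracy)
```

A lift above 1 means the model does better than the base rate; turning that into a decision still requires the financial data discussed above.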

SOCIAL IMPLICATIONS OF DATA MINING
There are various social implications of data mining which are as follows −
• Privacy − It is a loaded issue. In recent years privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and government
agencies amass warehouses of personal records.
The concerns that people have over the collection of this data will generally extend to any analytic
capabilities applied to the data. Users of data mining should start thinking about how their use of
this technology will be affected by legal issues associated with privacy.
• Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyse, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that are
very complex, difficult, or time-consuming to recognize.
The founder of Microsoft's Exploration Team used complex data mining algorithms to solve an issue
that had haunted astronomers for some years: the problem of reviewing, describing, and
categorizing 2 billion sky objects recorded over 3 decades. The algorithms were able to extract the
features that characterized sky objects as stars or galaxies. This developing field of data mining and
profiling has several frontiers where it can be used.
• Unauthorized Use − Trends obtained through data mining that are intended to be used for
marketing or other ethical goals can be misused. Unethical businesses or people
can use the data obtained through data mining to take advantage of vulnerable people or
discriminate against a specific group of people. Furthermore, data mining techniques are
not 100 percent accurate; thus mistakes do occur, which can have serious consequences.
APPLICATIONS OF DATA MINING
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis
7. Market Basket Analysis
8. Education
9. CRM (Customer Relationship Management)
Alias of Data Mining
• Exploratory Data Analysis
• Data Driven Analysis
• Deductive Learning
