DATA MINING
CHAPTER – 1
INTRODUCTION TO DATA MINING
What is Data Mining?
Data mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful data that allow a business to make data-driven decisions.
In other words, data mining is the process of investigating hidden patterns of information from
various perspectives and categorizing them into useful data. That data is collected and assembled
in areas such as data warehouses, where efficient analysis and data mining algorithms support
decision making and other data requirements, ultimately cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining uses complex mathematical
algorithms to segment data and evaluate the probability of future events. Data mining is also
called Knowledge Discovery from Data (KDD).
Data mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and
individuals extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data
cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and
knowledge presentation.
Data mining is the process of extracting useful information from large sets of data. It involves
using various techniques from statistics, machine learning, and database systems to identify
patterns, relationships, and trends in the data. This information can then be used to make data-
driven decisions, solve business problems, and uncover hidden insights. Applications of data
mining include customer profiling and segmentation, market basket analysis, anomaly detection,
and predictive modelling. Data mining tools and technologies are widely used in various
industries, including finance, healthcare, retail, and telecommunications.
In general terms, “mining” is the process of extracting some valuable material from the earth,
e.g. coal mining, diamond mining, etc. In the context of computer science, “data mining” is also
referred to as knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging. It is basically the process of extracting useful information
from large volumes of data or from data warehouses. One can see that the term itself is
a little confusing. In the case of coal or diamond mining, the result of the extraction process is
coal or diamond. But in the case of data mining, the result of the extraction process is not data.
Instead, data mining results are the patterns and knowledge that we gain at the end of the
extraction process. In that sense, we can think of data mining as a step in the process of
knowledge discovery or knowledge extraction.
Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.
However, the term ‘data mining’ became more popular in the business and press communities.
Currently, Data Mining and Knowledge Discovery are used interchangeably.
Nowadays, data mining is used almost everywhere that a large amount of data is stored and
processed. For example, banks typically use data mining to find prospective customers
who could be interested in credit cards, personal loans, or insurance. Since banks have
the transaction details and detailed profiles of their customers, they analyse all this data and
look for patterns that help them predict which customers could be interested in personal
loans, etc.
Main Purpose of Data Mining
Data mining integrates techniques from many other domains, such as statistics, machine
learning, pattern recognition, database and data warehouse systems, information retrieval, and
visualization. It does so to gather more information about the data, to help predict hidden
patterns, future trends, and behaviours, and to allow businesses to make decisions.
Technically, data mining is the computational process of analysing data from different
perspectives, dimensions, and angles and categorizing or summarizing it into meaningful information.
Data mining can be applied to any type of data, e.g. data warehouses, transactional
databases, relational databases, multimedia databases, spatial databases, time-series
databases, and the World Wide Web.
Data mining as a whole process
The whole process of data mining consists of three main phases:
1. Data Pre-processing – Data cleaning, integration, selection, and transformation take place
2. Data Extraction – The actual data mining takes place
3. Data Evaluation and Presentation – Results are analysed and presented
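The three phases above can be sketched as three small functions chained together. This is only an
illustrative sketch: the function names, the shopping-basket data, and the 50% threshold are
assumptions made for the example, and the "mining" step is deliberately reduced to counting how
often each item occurs.

```python
from collections import Counter

def preprocess(raw_transactions):
    """Phase 1: clean the data (drop empty baskets, normalise item names)."""
    cleaned = []
    for basket in raw_transactions:
        items = {item.strip().lower() for item in basket if item.strip()}
        if items:
            cleaned.append(items)
    return cleaned

def mine(transactions):
    """Phase 2: the mining step itself, here just counting item frequency."""
    counts = Counter()
    for items in transactions:
        counts.update(items)
    return counts

def evaluate_and_present(counts, n_transactions, min_share=0.5):
    """Phase 3: keep items found in at least min_share of baskets, most frequent first."""
    kept = [(item, c / n_transactions) for item, c in counts.items()
            if c / n_transactions >= min_share]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical raw data with an empty basket, stray spaces, and mixed case.
raw = [["Bread", " butter "], ["bread", "milk"], [], ["Milk", "bread", ""]]
clean = preprocess(raw)
report = evaluate_and_present(mine(clean), len(clean))
for item, share in report:
    print(f"{item}: found in {share:.0%} of baskets")
```

In a real project each phase is far larger, but the shape stays the same: the output of one phase
is the input of the next.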
Types of Data That Can Be Mined
Data mining can be performed on the following types of data:
• Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and
columns from which data can be accessed in various ways without having to reorganize the database
tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
• Data warehouses:
A data warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from multiple
places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps
in decision-making for a business organization. The data warehouse is designed for the analysis of
data rather than transaction processing.
• Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases in which an organization keeps various kinds of
information.
• Object-Relational Database:
A combination of an object-oriented database model and a relational database model is called an
object-relational model. It supports classes, objects, inheritance, etc.
One of the primary objectives of the object-relational data model is to close the gap between the
relational database and the object-oriented practices frequently used in many
programming languages, for example, C++, Java, C#, and so on.
• Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not performed appropriately. Although this was a unique
capability long ago, today most relational database systems support
transactional database activities.
Advantages of Data Mining:
There are several benefits of data mining, including:
1. Improved decision making: Data mining can provide valuable insights that can help
organizations make better decisions by identifying patterns and trends in large data sets.
2. Increased efficiency: Data mining can automate repetitive and time-consuming tasks, such
as data cleaning and preparation, which can help organizations save time and resources.
3. Enhanced competitiveness: Data mining can help organizations gain a competitive edge by
uncovering new business opportunities and identifying areas for improvement.
4. Improved customer service: Data mining can help organizations better understand their
customers and tailor their products and services to meet their needs.
5. Fraud detection: Data mining can be used to identify fraudulent activities by detecting
unusual patterns and anomalies in data.
6. Predictive modeling: Data mining can be used to build models that can predict future
events and trends, which can be used to make proactive decisions.
7. New product development: Data mining can be used to identify new product opportunities
by analyzing customer purchase patterns and preferences.
8. Risk management: Data mining can be used to identify potential risks by analyzing data on
customer behavior, market conditions, and other factors.
Disadvantages of Data Mining:
1. Privacy concerns: Data mining can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about individuals.
2. Complexity: Data mining can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
3. Unintended consequences: Data mining can lead to unintended consequences, such as bias
or discrimination, if the data or models are not properly understood or used.
4. Data Quality: The data mining process depends heavily on the quality of the data; if the data is
not accurate or consistent, the results can be misleading.
5. High cost: Data mining can be an expensive process, requiring significant investments in
hardware, software, and personnel.
Real-life examples of Data Mining
• Market Basket Analysis: It is a technique that carefully studies the purchases made
by customers in a supermarket. The concept is basically applied to identify the items that
are bought together by a customer. Say a person buys bread: what are the chances that
he/she will also purchase butter? This analysis helps companies promote offers and deals.
It is carried out with the help of data mining.
• Protein Folding: It is a technique that carefully studies biological cells and predicts
the protein interactions and functionality within them. Applications of this
research include determining causes and possible cures for Alzheimer’s,
Parkinson’s, and cancer caused by protein misfolding.
• Fraud Detection: Nowadays, in this age of cell phones, we can use data mining to
analyze cell phone activity and flag suspicious usage. This can help us
detect calls made on cloned phones. Similarly, with credit cards, comparing purchases
with historical purchases can detect activity with stolen cards.
Data mining also has many successful applications, such as business intelligence, Web search,
bioinformatics, health informatics, finance, digital libraries, and digital governments.
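The credit-card comparison mentioned under Fraud Detection can be illustrated with a very simple
rule: flag a purchase whose amount is far from the cardholder's historical average. The sketch
below is one possible approach under stated assumptions, not how any particular bank does it; the
function name, the sample amounts, and the threshold of three standard deviations are all
hypothetical.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag an amount lying more than `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)          # sample standard deviation; needs at least two values
    if sigma == 0:
        return amount != mu         # all past purchases identical: any deviation stands out
    return abs(amount - mu) / sigma > threshold

# Hypothetical past purchase amounts for one card.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_suspicious(history, 46.0))    # a typical purchase
print(is_suspicious(history, 900.0))   # far outside the usual range
```

Real fraud systems use many more attributes (merchant, location, time of day) and learned models,
but the underlying idea of comparing new activity with historical activity is the same.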
BASIC DATA MINING TASKS
Data mining functions specify the kinds of trends or correlations to be found in data mining
activities. Broadly, data mining activities can be divided into two categories:
1. Predictive Data Mining:
Predictive data mining tasks build a model from the available data set that is helpful in
predicting unknown or future values of another data set of interest.
Example: A medical practitioner trying to diagnose a disease based on a patient's medical test
results can be considered a predictive data mining task.
It helps developers supply values for unlabelled attributes. Based on previous observations,
the software estimates the characteristics that are missing. For example, judging from the
findings of a patient’s medical examinations whether the patient is suffering from a particular
disease.
2. Descriptive Data Mining:
Descriptive data mining tasks usually find patterns that describe the data and come up with new,
significant information from the available data set.
Example: A retailer trying to identify products that are purchased together can be considered
a descriptive data mining task.
It provides knowledge for understanding what is happening within the data without any
prior hypothesis. The common features of the data set are highlighted, for example counts and
averages.
PREDICTIVE DATA MINING:
A predictive model of a data mining task comprises classification, regression, prediction, and time
series analysis. The predictive model of data mining is also called statistical regression. It refers to
a supervised learning technique that explains the dependency of some attributes' values on the
values of other attributes of the same item, and builds a model that can
predict these attributes' values in new cases.
1. Classification:
In data mining, classification refers to a form of data analysis where a machine learning model
assigns a specific category to a new observation. It is based on what the model has learned
from the data sets. In other words, classification is the act of assigning objects to one of several
predefined categories.
One example of classification in the banking and financial services industry is identifying
whether transactions are fraudulent or not. In the same way, machine learning can also be
used to predict whether a loan application would be approved or not.
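As a rough illustration of classification, the sketch below "learns" one average point (centroid)
per category from labelled examples and assigns a new observation to the category with the
nearest centroid. Nearest-centroid is just one simple classifier chosen for brevity; the
transaction features (amount, hour of day) and the labels are made-up assumptions.

```python
from math import dist

def train_centroids(samples):
    """samples: list of (features, label) pairs. Returns {label: centroid}."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labelled transactions: (amount, hour of day) -> category.
training = [
    ((20.0, 14.0), "legit"),
    ((35.0, 10.0), "legit"),
    ((900.0, 3.0), "fraud"),
    ((1200.0, 2.0), "fraud"),
]
centroids = train_centroids(training)
print(classify(centroids, (1000.0, 4.0)))
print(classify(centroids, (25.0, 12.0)))
```

Note that the features here are not scaled, so the amount dominates the distance; a practical
classifier would normalise the attributes first.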
2. Regression:
Regression refers to a method that predicts the value of data by fitting a function to it.
Generally, it is used for numeric (continuous) data.
A linear regression model, in the context of machine learning or statistics, is basically a linear
approach for modelling the relationship between the dependent variable (the result)
and the independent variables (the features).
If your model has only one independent variable, it is called simple linear regression; otherwise
it is called multiple linear regression.
Types of regression
a. Linear Regression: Linear regression searches for the optimal line that
fits two attributes so that, with the help of one attribute, we can predict the other.
b. Multi-linear regression: Multi-linear regression involves more than two attributes,
and the data are fit in a multi-dimensional space.
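The "optimal line" of simple linear regression is usually taken to be the least-squares line. A
minimal sketch of that calculation is shown below; the function name and the sample points are
assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)   # must be non-zero: xs cannot all be equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print("prediction for x = 5:", slope * 5 + intercept)
```

Multiple linear regression follows the same least-squares idea with several independent
variables, and is normally solved with a linear-algebra library rather than by hand.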
3. Prediction:
In data mining, prediction is used to identify a data value based on the description of another
corresponding data value. Prediction in data mining is also known as numeric prediction.
Generally, regression analysis is used for prediction. For example, in credit card fraud
detection, the history of a particular person's credit card usage has to be analysed. If any
abnormal pattern is detected, it should be reported as a 'fraudulent action'.
4. Time series analysis:
Time series analysis deals with data sets that are indexed by time. Time serves as the independent
variable used to predict the dependent variable.
A time series is a sequence of events where the next event is determined by one or more of the
preceding events. A time series reflects the process being measured, and there are certain
components that affect the behaviour of a process. Time series analysis includes methods to
analyse time-series data in order to extract useful patterns, trends, rules and statistics. Stock
market prediction is an important application of time-series analysis.
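One of the simplest ways to let "preceding events" determine the next one is a moving-average
forecast: the next value is estimated as the mean of the last few observations. This is a sketch
of that one technique only; the function name, window size, and price series are illustrative
assumptions, and it is not meant as a realistic stock-market model.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window

# Hypothetical daily closing prices.
prices = [10.0, 12.0, 11.0, 13.0, 14.0]
forecast = moving_average_forecast(prices, window=3)
print(f"forecast for next day: {forecast:.2f}")
```

A larger window smooths out noise but reacts more slowly to a genuine change in trend.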
DESCRIPTIVE DATA MINING:
A descriptive model identifies the patterns and relationships in data. A descriptive model does
not attempt to generalize to a statistical population or random process. A predictive model attempts
to generalize to a population or random process. Predictive models should give prediction intervals
and must be cross-validated; that is, they must prove that they can be used to make predictions with
data that was not used in constructing the model.
Descriptive analytics focuses on the summarization and conversion of the data into useful
information for reporting and monitoring.
1. Clustering:
Clustering groups a set of objects so that objects in the same group (called a cluster) are
more similar to one another than to objects in other clusters.
Clustering is used to identify data objects that are similar to one another. The similarity can
be decided based on a number of factors like purchase behaviour, responsiveness to certain
actions, geographical locations and so on. For example, an insurance company can cluster its
customers based on age, residence, income etc. This group information will be helpful to
understand the customers better and hence provide better customized services.
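A minimal sketch of clustering is given below, using the k-means idea on a single attribute
(customer age). k-means is only one of many clustering algorithms and is used here as an
assumption for illustration; the ages and the two starting centres are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one numeric attribute."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centre.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer ages and two initial guesses for the cluster centres.
ages = [22, 25, 27, 58, 61, 65]
centers, clusters = kmeans_1d(ages, centers=[20.0, 70.0])
print(centers)
print(clusters)
```

With real customer data the same procedure is run on several attributes at once (age, income,
residence, and so on), using a distance measure over all of them.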
2. Association rules:
Association rules capture co-occurrence relationships (not necessarily causal ones) among huge
sets of data objects. The algorithm works on data such as a list of the items you purchased at
the grocery store over the past six months, and it calculates the percentage of the time that
items are purchased together. For example, what are the chances of you buying milk with cereal?
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing etc. A retailer can identify the
products that customers normally purchase together or even find the customers who respond
to the promotion of the same kind of products. If a retailer finds that beer and nappies are mostly
bought together, he can put nappies on sale to promote the sale of beer.
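The "percentage at which items are purchased together" is usually expressed with two standard
measures: support (the share of all baskets that contain the items) and confidence (the share of
baskets containing the first item that also contain the second). The sketch below computes both
for the milk-and-cereal question; the baskets are made-up sample data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated chance of `consequent` being bought given that `antecedent` was bought."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)   # antecedent must occur at least once

# Hypothetical shopping baskets.
baskets = [
    {"milk", "cereal"},
    {"milk", "bread"},
    {"milk", "cereal", "bread"},
    {"bread"},
]
print("support(milk, cereal):", support(baskets, {"milk", "cereal"}))
print("confidence(milk -> cereal):", confidence(baskets, {"milk"}, {"cereal"}))
```

Association-rule algorithms such as Apriori search for all item combinations whose support and
confidence exceed user-chosen minimums, instead of checking one hand-picked rule as done here.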
3. Sequence:
Sequence discovery refers to finding useful sequential patterns in the data, judged against some
objective measure of how interesting they are.
4. Summarization:
Summarization presents a data set in a more compact, easy-to-understand form.
Summarization is the generalization of data. A set of relevant data is summarized, which results
in a smaller set that gives aggregated information about the data. For example, the shopping
done by a customer can be summarized into total products, total spending, offers used, etc.
Such high-level summarized information can be useful for a sales or customer relationship team
for detailed customer and purchase behaviour analysis. Data can be summarized at different
abstraction levels and from different angles.
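The shopping example above can be written as a short aggregation. The record layout (item,
quantity, unit_price, offer) is a hypothetical one chosen for this sketch.

```python
def summarize(purchases):
    """Collapse individual purchase records into a few aggregate figures."""
    return {
        "total_products": sum(p["quantity"] for p in purchases),
        "total_spending": sum(p["quantity"] * p["unit_price"] for p in purchases),
        "offers_used": sum(1 for p in purchases if p["offer"]),
    }

# Hypothetical purchase records for one customer.
purchases = [
    {"item": "bread", "quantity": 2, "unit_price": 1.5, "offer": False},
    {"item": "milk", "quantity": 1, "unit_price": 1.0, "offer": True},
    {"item": "cereal", "quantity": 3, "unit_price": 4.0, "offer": True},
]
summary = summarize(purchases)
print(summary)
```

Summarizing at a different abstraction level would simply mean grouping the same records
differently, for example per month or per product category, before aggregating.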
DATA MINING VS KNOWLEDGE DISCOVERY IN DATABASES
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two
related yet slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the
KDD process, which deals with identifying patterns in data.
In other words, Data Mining is only the application of a specific algorithm chosen according to the
overall goal of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and
new data can be integrated and transformed to get different and more appropriate results.
KNOWLEDGE DISCOVERY IN DATABASES (KDD).
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
The KDD process in data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include
tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix
or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful
information and insights. This may include tasks such as clustering, classification, association
rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include
tasks such as visualizing the results, evaluating the quality of the discovered patterns and
identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate,
and meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make
decisions.
The KDD process is iterative: it usually requires multiple passes through the above steps to
extract accurate knowledge from the data.
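The first three steps (selection, pre-processing, transformation) can be sketched on a tiny
example. The choices below, filling missing values with the column mean and min-max
normalization to the range 0 to 1, are common options picked as assumptions for this sketch; the
field names and records are hypothetical.

```python
def select(records, fields):
    """Selection: keep only the attributes relevant to the analysis."""
    return [[r.get(f) for f in fields] for r in records]

def impute_missing(rows):
    """Pre-processing: replace missing values (None) with the column mean.
    Assumes every column has at least one known value."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def min_max_normalize(rows):
    """Transformation: rescale each column to [0, 1], giving a numeric matrix."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

# Hypothetical customer records; one age is missing.
records = [
    {"name": "a", "age": 20, "income": 1000},
    {"name": "b", "age": None, "income": 3000},
    {"name": "c", "age": 40, "income": 2000},
]
matrix = min_max_normalize(impute_missing(select(records, ["age", "income"])))
print(matrix)
```

The resulting matrix is the kind of input that the data mining step (step 4) would then work on.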
Why do we need Data Mining?
The volume of information from business transactions, scientific data, sensor data, pictures,
videos, etc. is increasing every day beyond what we can handle. So, we need a system capable of
extracting the essence of the available information and automatically generating reports,
views, or summaries of the data for better decision-making.
Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data
ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
5. Predictive modelling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large
amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to
implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process depends heavily on the quality of the data; if the data is not
accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware,
software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine
learning where a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new, unseen data.
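Overfitting is normally detected by testing a model on data that was not used to build it. The
sketch below uses k-fold cross-validation for that purpose and compares two deliberately extreme
toy models: one that memorizes the training data and one that always predicts the training
average. Both models, their names, and the data are assumptions made for illustration.

```python
def k_fold_mse(xs, ys, fit, predict, k=3):
    """Mean squared error measured only on held-out points, using k folds."""
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))          # every k-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        errors.extend((predict(model, xs[i]) - ys[i]) ** 2 for i in sorted(test_idx))
    return sum(errors) / len(errors)

# Toy model 1: memorizes the training pairs (zero training error), guesses 0.0 otherwise.
def fit_memorizer(xs, ys):
    return dict(zip(xs, ys))

def predict_memorizer(model, x):
    return model.get(x, 0.0)

# Toy model 2: always predicts the average of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0]
memo_mse = k_fold_mse(xs, ys, fit_memorizer, predict_memorizer)
mean_mse = k_fold_mse(xs, ys, fit_mean, predict_mean)
print("memorizer held-out MSE:", memo_mse)
print("mean model held-out MSE:", mean_mse)
```

The memorizer is perfect on the data it has seen and poor on everything else, which is exactly
the behaviour that the overfitting item above warns about.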
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
MAJOR ISSUES IN DATA MINING:
1. Mining different kinds of knowledge in databases – The needs of different users are not
the same. Different users may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction – The data mining
process needs to be interactive so that users can focus the search for patterns,
providing and refining data mining requests based on the returned results.
3. Incorporation of background knowledge – Background knowledge can be used to guide the
discovery process and to express discovered patterns
not only in concise terms but also at multiple levels of abstraction.
4. Data mining query languages and ad-hoc data mining – A data mining query language
that allows the user to describe ad-hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results – Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by users.
6. Handling noisy or incomplete data – Data cleaning methods are required that can
handle noise and incomplete objects while mining data regularities. Without data cleaning
methods, the accuracy of the discovered patterns will be poor.
7. Pattern evaluation – It refers to the interestingness of the discovered patterns. Many
discovered patterns are not interesting because they either represent common knowledge or lack
novelty.
8. Efficiency and scalability of data mining algorithms – In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be
efficient and scalable.
9. Parallel, distributed, and incremental mining algorithms – Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions that are then processed in parallel. The results from
the partitions are then merged. Incremental algorithms incorporate database updates without
mining the data again from scratch.
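The partition-then-merge idea in item 9, and the incremental update, can be sketched with simple
item counting. For brevity the partitions are processed one after another here; in a real
parallel or distributed setting each partition would be handed to a separate worker. All names
and data are illustrative assumptions.

```python
from collections import Counter

def count_items(partition):
    """Mine one partition: count in how many of its transactions each item appears."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def split(transactions, n_parts):
    """Divide the data into n_parts partitions."""
    return [transactions[i::n_parts] for i in range(n_parts)]

def merge(partial_counts):
    """Combine the per-partition results into one global result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical transactions.
transactions = [
    ["milk", "bread"], ["bread"], ["milk", "cereal"],
    ["bread", "cereal"], ["milk"],
]
merged = merge(count_items(part) for part in split(transactions, 2))
print(merged)

# Incremental update: fold in a new batch without recounting the old data.
new_batch = [["cereal"], ["milk", "bread"]]
merged.update(count_items(new_batch))
print(merged)
```

Counting merges cleanly because partial counts can simply be added; many real mining algorithms
need more elaborate merge steps.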
DATA MINING METRICS
Data mining is one of the forms of artificial intelligence that uses perception models, analytical
models, and multiple algorithms to simulate the techniques of the human brain. Data mining
enables machines to support human decisions and choices.
The user of data mining tools has to supply the machine with rules, preferences, and even
experiences in order to obtain decision support. The data mining metrics are as follows −
• Usefulness − Usefulness involves several metrics that tell us whether the model provides
useful information. For instance, a data mining model that correlates store location with sales
can be both accurate and reliable, yet still not be useful, because the result cannot be
generalized by adding more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific locations have
more sales. One may also find that a model that appears successful is meaningless because it depends
on cross-correlations in the data.
• Return on Investment (ROI) − Data mining tools find interesting patterns buried inside
the data and develop predictive models. These models come with several measures
denoting how well they fit the records. It is not always clear how to make a decision based on
some of the measures reported as part of data mining analyses.
• Access Financial Information during Data Mining − The simplest way to frame decisions
in financial terms is to augment the raw information that is generally mined so that it also
contains financial data. Some organizations are investing in and developing data warehouses and
data marts.
The design of a warehouse or mart takes into account the types of analyses and data
needed for expected queries. Designing warehouses in a way that allows access to financial
information along with access to more typical data on product attributes, user profiles, etc. can be
useful.
• Converting Data Mining Metrics into Financial Terms − A general data mining metric is
the measure of "lift". Lift is a measure of what is achieved by using the specific model or
pattern relative to a base rate in which the model is not used. High values mean much is
achieved. It may seem, then, that one can simply make a decision based on lift.
• Accuracy − Accuracy is a measure of how well the model correlates results with the attributes
in the data that has been supplied. There are several measures of accuracy, but all of them
depend on the data that is used. In reality, values can be missing
or approximate, or the data can have been changed by several processes.
During exploration and development, one can decide to accept a specific amount of
error in the data, especially if the data is fairly uniform in its characteristics. For example, a model
that predicts sales for a specific store based on past sales can be strongly correlated and very
accurate, even if that store consistently used the wrong accounting techniques. Thus, measurements
of accuracy should be balanced by assessments of reliability.
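The lift measure described above can be computed as the response rate obtained with the model
divided by the base rate obtained without it. The sketch below assumes a marketing setting; the
function name, parameter names, and campaign numbers are hypothetical.

```python
def lift(targeted_responses, targeted_size, total_responses, total_size):
    """Response rate among model-selected customers relative to the overall base rate."""
    targeted_rate = targeted_responses / targeted_size
    base_rate = total_responses / total_size
    return targeted_rate / base_rate

# Hypothetical campaign: 1000 customers in total, 100 of whom respond.
# The model selects 100 customers, and 20 of those respond.
value = lift(targeted_responses=20, targeted_size=100,
             total_responses=100, total_size=1000)
print(f"lift = {value:.2f}")
```

A lift above 1 means the model selects responders at a higher rate than random selection would;
a lift of 1 means the model adds nothing over the base rate.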
SOCIAL IMPLICATIONS OF DATA MINING
There are various social implications of data mining, which are as follows −
• Privacy − It is a loaded issue. In recent years privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and government
agencies amass warehouses containing personal records.
The concerns that people have over the collection of this data will generally extend to any analytic
capabilities applied to the data. Users of data mining should start thinking about how their use of
this technology will be affected by legal issues associated with privacy.
• Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyse, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that are
very complex, difficult, or time-consuming to recognize.
The founder of Microsoft's Exploration Team used complex data mining algorithms to solve an issue
that had haunted astronomers for some years: the problem of reviewing, describing, and
categorizing 2 billion sky objects recorded over 3 decades. The algorithms were able to extract the
relevant features that distinguish sky objects as stars or galaxies. This developing field of data
mining and profiling has several frontiers where it can be used.
• Unauthorized Use − Trends obtained through data mining that are intended to be used for
marketing or other ethical goals can be misused. Unethical businesses or people
can use the data obtained through data mining to take advantage of vulnerable people or
discriminate against a specific group of people. Furthermore, data mining techniques are
not 100 percent accurate; thus mistakes do occur, which can have serious consequences.
APPLICATIONS OF DATA MINING
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis
7. Market Basket Analysis
8. Education
9. CRM (Customer Relationship Management)
Aliases of Data Mining
• Exploratory Data Analysis
• Data Driven Analysis
• Deductive Learning
  • 1. DATA MINING CHAPTER – 1 INTRODUCTION TO DATA MINING What is Data Mining? The process of extracting information to identify patterns, trends, and useful data that would allow the business to take the data-driven decision from huge sets of data is called Data Mining. In other words, we can say that Data Mining is the process of investigating hidden patterns of information to various perspectives for categorization into useful data, which is collected and assembled in particular areas such as data warehouses, efficient analysis, data mining algorithm, helping decision making and other data requirement to eventually cost-cutting and generating revenue. Data mining is the act of automatically searching for large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms for data segments and evaluates the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD). Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information. Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals to extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Database (KDD). The knowledge discovery process includes Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation. Data mining is the process of extracting useful information from large sets of data. It involves using various techniques from statistics, machine learning, and database systems to identify patterns, relationships, and trends in the data. This information can then be used to make data- driven decisions, solve business problems, and uncover hidden insights. 
Applications of data mining include customer profiling and segmentation, market basket analysis, anomaly detection, and predictive modelling. Data mining tools and technologies are widely used in various industries, including finance, healthcare, retail, and telecommunications. In general terms, “Mining” is the process of extraction of some valuable material from the earth e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining” can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. It is basically the process carried out for the extraction of useful information from a bulk of data or data warehouses. One can see that the term itself is a little confusing. In the case of coal or diamond mining, the result of the extraction process is coal or diamond. But in the case of Data Mining, the result of the extraction process is not data!! Instead, data mining results are the patterns and knowledge that we gain at the end of the extraction process. In that sense, we can think of Data Mining as a step in the process of Knowledge Discovery or Knowledge Extraction.
  • 2. Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989. However, the term ‘data mining’ became more popular in the business and press communities. Currently, Data Mining and Knowledge Discovery are used interchangeably. Nowadays, data mining is used in almost all places where a large amount of data is stored and processed. For example, banks typically use ‘data mining’ to find out their prospective customers who could be interested in credit cards, personal loans, or insurance as well. Since banks have the transaction details and detailed profiles of their customers, they analyse all this data and try to find out patterns that help them predict that certain customers could be interested in personal loans, etc. Main Purpose of Data Mining Data Mining Basically, Data mining has been integrated with many other techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, etc. to gather more information about the data and to help predict hidden patterns, future trends, and behaviours and allows businesses to make decisions. Technically, data mining is the computational process of analysing data from different perspectives, dimensions, angles and categorizing/summarizing it into meaningful information. Data Mining can be applied to any type of data e.g. Data Warehouses, Transactional Databases, Relational Databases, Multimedia Databases, Spatial Databases, Time-series Databases, World Wide Web. Data mining as a whole process The whole process of Data Mining consists of three main phases: 1. Data Pre-processing – Data cleaning, integration, selection, and transformation takes place 2. Data Extraction – Occurrence of exact data mining 3. Data Evaluation and Presentation – Analysing and presenting results
  • 3. Types of Data Mining Data mining can be performed on the following types of data:  Relational Database: A relational database is a collection of multiple data sets formally organized by tables, records, and columns from which data can be accessed in various ways without having to recognize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.  Data warehouses: A Data Warehouse is the technology that collects the data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision- making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.  Data Repositories: The Data Repository generally refers to a destination for data storage. However, many IT professionals utilize the term more clearly to refer to a specific kind of setup within an IT structure. For example, a group of databases, where an organization has kept various kinds of information.  Object-Relational Database: A combination of an object-oriented database model and relational database model is called an object-relational model. It supports Classes, Objects, Inheritance, etc. One of the primary objectives of the Object-relational data model is to close the gap between the Relational database and the object-oriented model practices frequently utilized in many programming languages, for example, C++, Java, C#, and so on.  Transactional Database:
  • 4. A transactional database refers to a database management system (DBMS) that has the potential to undo a database transaction if it is not performed appropriately. Even though this was a unique capability a very long while back, today, most of the relational database systems support transactional database activities. There are several benefits (advantages) of data mining, including: 1. Improved decision making: Data mining can provide valuable insights that can help organizations make better decisions by identifying patterns and trends in large data sets. 2. Increased efficiency: Data mining can automate repetitive and time-consuming tasks, such as data cleaning and preparation, which can help organizations save time and resources. 3. Enhanced competitiveness: Data mining can help organizations gain a competitive edge by uncovering new business opportunities and identifying areas for improvement. 4. Improved customer service: Data mining can help organizations better understand their customers and tailor their products and services to meet their needs. 5. Fraud detection: Data mining can be used to identify fraudulent activities by detecting unusual patterns and anomalies in data. 6. Predictive modeling: Data mining can be used to build models that can predict future events and trends, which can be used to make proactive decisions. 7. New product development: Data mining can be used to identify new product opportunities by analyzing customer purchase patterns and preferences. 8. Risk management: Data mining can be used to identify potential risks by analyzing data on customer behavior, market conditions, and other factors. Disadvantages of Data Mining: 1. Privacy concerns: Data mining can raise privacy concerns as it involves collecting and analyzing large amounts of data, which can include sensitive information about individuals. 2. 
Complexity: Data mining can be a complex process that requires specialized skills and knowledge to implement and interpret the results. 3. Unintended consequences: Data mining can lead to unintended consequences, such as bias or discrimination, if the data or models are not properly understood or used. 4. Data Quality: Data mining process heavily depends on the quality of data, if data is not accurate or consistent, the results can be misleading 5. High cost: Data mining can be an expensive process, requiring significant investments in hardware, software, and personnel. Real-life examples of Data Mining  Market Basket Analysis: It is a technique that gives the careful study of purchases done by a customer in a supermarket. The concept is basically applied to identify the items that are bought together by a customer. Say, if a person buys bread, what are the chances that he/she will also purchase butter. This analysis helps in promoting offers and deals by the companies. The same is done with the help of data mining.  Protein Folding: It is a technique that carefully studies the biological cells and predicts the protein interactions and functionality within biological cells. Applications of this research include determining causes and possible cures for Alzheimer’s, Parkinson’s, and cancer caused by Protein misfolding.
  • 5.  Fraud Detection: Nowadays, in this land of cell phones, we can use data mining to analyze cell phone activities for comparing suspicious phone activity. This can help us to detects calls made on cloned phones. Similarly, with credit cards, comparing purchases with historical purchases can detect activity with stolen cards. Data mining also has many successful applications, such as business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and digital governments. BASIC DATA MINING TASKS Data Mining functions are used to define the trends or correlations contained in data mining activities. In comparison, data mining activities can be divided into 2 categories: 1. Predictive Data Mining: Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. Example :- A medical practitioner trying to diagnose a disease based on the medical test results of patient can be considered as a predictive data mining task. It helps developers to provide unlabelled definitions of attributes. Based on previous tests, the software estimates the characteristics that are absent. For example: Judging from the findings of a patient’s medical examinations is he suffering from any particular disease. 2. Descriptive Data Mining: Descriptive data mining tasks usually finds data describing patterns and comes up with new, significant information from the available data set. Example :- A retailer trying to identify products that are purchased together can be considered as a descriptive data mining task. It includes certain knowledge to understand what is happening within the data without a previous idea. The common data features are highlighted in the data set. For examples: count, average etc.
  • 6. PREDICTIVE DATA MINING: A predictive model of a data mining task comprises classification, regression, prediction, and time series analysis. The predictive model of data mining is also called statistical regression. It refers to a monitoring learning technique that includes an explication of the dependency of a few attribute's values upon the other attribute's value in the same product and the growth of a model that can predict these attribute's values in previous cases. 1. Classification: In data mining, classification refers to a form of data analysis where a machine learning model assigns a specific category to a new observation. It is based on what the model has learned from the data sets. In other words, classification is the act of assigning objects to many predefined categories. One example of classification in the banking and financial services industry is identifying whether transactions are fraudulent or not. In the same way, machine learning can also be used to predict whether a loan application would be approved or not. 2. Regression: Regression refers to a method that verifies the value of data for a function. Generally, it is used for appropriate data. A linear regression model in the context of machine learning or statistics is basically a linear approach for modelling the relationships between the dependent variable known as the result and your independent variable is known as features. If your model has only one independent variable, it is called simple linear regression, and else it is called multiple linear regression. Types of regression a. Linear Regression: Linear regression is related to the search for the optimal line which fits the two attributes so that with the help of one attribute, we can predict the other. b. Multi-linear regression: Multi-linear regression includes two or more than two attributes, and the data are fit to multi-dimensional space. 3. 
Prediction: In data mining, prediction is used to identify data value based on the description of another corresponding data value. The prediction in data mining is known as Numeric Prediction. Generally, regression analysis is used for prediction. For example, in credit card fraud detection, data history for a particular person's credit card usage has to be analysed. If any abnormal pattern was detected, it should be reported as 'fraudulent action'. 4. Time series analysis: Time series analysis refers to the data sets based on time. It serves as an independent variable to predict the dependent variable in time. Time series is a sequence of events where the next event is determined by one or more of the preceding events. Time series reflects the process being measured and there are certain components that affect the behaviour of a process. Time series analysis includes methods to analyse time-series data in order to extract useful patterns, trends, rules and statistics. Stock market prediction is an important application of time- series analysis. DESCRIPTIVE DATA MINING:
  • 7. A descriptive model differentiates the patterns and relationships in data. A descriptive model does not attempt to generalize to a statistical population or random process. A predictive model attempts to generalize to a population or random process. Predictive models should give prediction intervals and must be cross-validated; that is, they must prove that they can be used to make predictions with data that was not used in constructing the model. Descriptive analytics focuses on the summarization and conversion of the data into useful information for reporting and monitoring. 1. Clustering: Clustering is grouping a set of objects so that objects in the same group called a cluster are more similar than those in other group’s clusters. Clustering is used to identify data objects that are similar to one another. The similarity can be decided based on a number of factors like purchase behaviour, responsiveness to certain actions, geographical locations and so on. For example, an insurance company can cluster its customers based on age, residence, income etc. This group information will be helpful to understand the customers better and hence provide better customized services. 2. Association rules: Association rules determine a causal relationship between huge sets of data objects. The way the algorithm works is that you have. For example, a list of items you purchase at the grocery store for the past six months data, and it calculates a percentage at which items are purchased together. For example, what are the chances of you buying milk with cereal? Association discovers the association or connection among a set of items. Association identifies the relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing etc. A retailer can identify the products that normally customers purchase together or even find the customers who respond to the promotion of same kind of products. 
If a retailer finds that beer and nappy are bought together mostly, he can put nappies on sale to promote the sale of beer. 3. Sequence: Sequence refers to the discovery of useful patterns in the data is in relation to some objective of how it is interesting. 4. Summarization: Summarization holds a data set in more depth which is easy to understand form. Summarization is the generalization of data. A set of relevant data is summarized which result in a smaller set that gives aggregated information of the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high level summarized information can be useful for sales or customer relationship team for detailed customer and purchase behaviour analysis. Data can be summarized in different abstraction levels and from different angles. DATA MINING VS KNOWLEDGE DISCOVERY IN DATABASES Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two related yet slightly different concepts.
  • 8. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process, which deals with identifying patterns in data. And Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process. KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed to get different and more appropriate results. KNOWLEDGE DISCOVERY IN DATABASES (KDD). KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process in data mining typically involves the following steps: 1. Selection: Select a relevant subset of the data for analysis. 2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data normalization, missing value handling, and data integration. 3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph. 4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This may include tasks such as clustering, classification, association rule mining, and anomaly detection. 5. Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the results, evaluating the quality of the discovered patterns and identifying relationships and associations among the data. 6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful. 7. Deployment: Use the discovered knowledge to solve the business problem and make decisions. The KDD process is an iterative process and it requires multiple iterations of the above steps to extract accurate knowledge from the data. Why do we need Data Mining? 
The volume of information from business transactions, scientific data, sensor data, pictures, videos, etc. is increasing every day beyond what we can handle. So, we need a system that is capable of extracting the essence of the available information and that can automatically generate reports, views, or summaries of data for better decision-making.
Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers' needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that may indicate fraud.
5. Predictive modelling: KDD can be used to build predictive models that can forecast future trends and patterns.
Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns, as it involves collecting and analyzing large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and to interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or models are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data. If the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine learning in which a model learns the detail and noise in the training data to the extent that it performs worse on new, unseen data.
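As a rough illustration of how the KDD steps listed earlier (selection, pre-processing, transformation, data mining) fit together, the following Python sketch runs them over a handful of made-up store records. The data, the function names (select, preprocess, transform, mine, run_kdd), the use of mean imputation and min-max scaling, and the z-score threshold of 1.5 are all assumptions chosen for illustration; they are not prescribed by the KDD process itself.

```python
from statistics import mean, pstdev

# Hypothetical raw records: (store_id, region, daily_sales).
# None marks a missing value that pre-processing must handle.
RAW = [
    ("s1", "north", 120.0),
    ("s2", "north", None),
    ("s3", "north", 130.0),
    ("s4", "south", 125.0),
    ("s5", "north", 400.0),
    ("s6", "north", 110.0),
]


def select(records, region):
    """Step 1, Selection: keep only the subset relevant to the question."""
    return [r for r in records if r[1] == region]


def preprocess(records):
    """Step 2, Pre-processing: fill missing sales with the mean of known values."""
    known = [sales for _, _, sales in records if sales is not None]
    fill = mean(known)
    return [(sid, reg, fill if sales is None else sales)
            for sid, reg, sales in records]


def transform(records):
    """Step 3, Transformation: min-max normalise sales to the range [0, 1]."""
    values = [sales for _, _, sales in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(sid, (sales - lo) / span) for sid, _, sales in records]


def mine(rows, z_threshold=1.5):
    """Step 4, Data mining: flag rows whose z-score exceeds the threshold."""
    values = [v for _, v in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [sid for sid, v in rows if abs(v - mu) / sigma > z_threshold]


def run_kdd(records, region):
    """Chain the steps; interpretation and evaluation are left to the analyst."""
    return mine(transform(preprocess(select(records, region))))


if __name__ == "__main__":
    print(run_kdd(RAW, "north"))
```

With this toy data the pipeline flags store s5 as an anomaly in the north region. In a real project each step would be revisited several times, since KDD is iterative, and interpretation, evaluation, and deployment would follow the mining step.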
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
MAJOR ISSUES IN DATA MINING:
1. Mining different kinds of knowledge in databases – The needs of different users are not the same. Different users may be interested in different kinds of knowledge. Therefore, data mining must cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction – The data mining process needs to be interactive because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
3. Incorporation of background knowledge – Background knowledge can be used to guide the discovery process and to express discovered patterns not only in concise terms but also at multiple levels of abstraction.
4. Data mining query languages and ad-hoc data mining – A data mining query language that allows users to describe ad-hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results – Once patterns are discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable by users.
6. Handling noisy or incomplete data – Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
7. Pattern evaluation – This concerns the interestingness of the discovered patterns. Many discovered patterns may be uninteresting because they either represent common knowledge or lack novelty, so patterns need to be evaluated for interestingness.
8. Efficiency and scalability of data mining algorithms – In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
9. Parallel, distributed, and incremental mining algorithms – Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.
DATA MINING METRICS
Data mining is one of the forms of artificial intelligence that uses perception models, analytical models, and multiple algorithms to simulate the techniques of the human brain. Data mining helps machines make human-like decisions and choices. The user of data mining tools has to supply the machine with rules, preferences, and even experiences in order to obtain decision support. The data mining metrics are as follows −
• Usefulness − Usefulness involves several metrics that tell us whether the model provides useful information. For instance, a data mining model that correlates store location with sales can be both accurate and reliable, yet not useful, because the result cannot be generalized by adding more stores at the same location. Furthermore, it does not answer the fundamental business question of why specific locations have more sales. One can also find that a model that appears successful is in fact meaningless because it depends on cross-correlations in the data.
• Return on Investment (ROI) − Data mining tools find interesting patterns buried inside the data and develop predictive models. These models have several measures denoting how well they fit the records. It is not always clear how to make a decision based on some of the measures reported as part of data mining analyses.
• Access Financial Information during Data Mining − The simplest way to frame decisions in financial terms is to augment the raw information that is generally mined so that it also contains financial data. Some organizations are investing in and developing data warehouses and data marts. The design of a warehouse or mart includes considerations about the types of analyses and data needed for expected queries.
Designing warehouses in a way that allows access to financial information along with access to more typical data on product attributes, user profiles, etc. can be useful.
• Converting Data Mining Metrics into Financial Terms − A general data mining metric is the measure of "Lift". Lift is a measure of what is achieved by using a specific model or pattern relative to a base rate in which the model is not used. High values mean much is achieved. It can seem, then, that one can simply make a decision based on Lift.
• Accuracy − Accuracy is a measure of how well the model correlates results with the attributes in the data that has been provided. There are several measures of accuracy, but all of them depend on the data that is used. In reality, values can be missing or approximate, or the data can have been changed by several processes. During exploration and development, one can decide to accept a certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a specific store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting techniques. Thus, measurements of accuracy should be balanced by assessments of reliability.
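To make the Lift metric concrete, here is a minimal Python sketch that computes the lift of an association rule as the rule's confidence divided by the base rate of the consequent. The basket data and the function name lift are hypothetical, and this is only one common way of defining lift for association rules; a lift above 1 suggests the rule does better than the base rate, and a lift below 1 suggests it does worse.

```python
# Hypothetical shopping baskets, one set of items per transaction.
BASKETS = [
    {"beer", "nappies"},
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"beer", "bread"},
    {"nappies", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "crisps"},
]


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent.

    confidence = P(consequent | antecedent)
    base rate  = P(consequent)
    lift       = confidence / base rate
    """
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_consequent = sum(1 for t in transactions if consequent in t)
    with_both = sum(
        1 for t in transactions if antecedent in t and consequent in t
    )
    if n == 0 or with_antecedent == 0 or with_consequent == 0:
        raise ValueError("lift is undefined when an item never occurs")
    confidence = with_both / with_antecedent
    base_rate = with_consequent / n
    return confidence / base_rate


if __name__ == "__main__":
    print(lift(BASKETS, "nappies", "beer"))  # 1.5
    print(lift(BASKETS, "bread", "milk"))    # below 1
```

In this toy data, customers who buy nappies are 1.5 times as likely to buy beer as a customer picked at random. As the text notes, a high lift alone does not settle a business decision; it still has to be translated into financial terms.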
SOCIAL IMPLICATIONS OF DATA MINING
There are various social implications of data mining, which are as follows −
• Privacy − Privacy is a loaded issue. In recent years, privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses containing personal records. The concerns that people have over the collection of this data will generally extend to any analytic capabilities applied to the data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.
• Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyse, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to recognize. The founder of Microsoft's Exploration Team used complex data mining algorithms to solve a problem that had haunted astronomers for some years: reviewing, describing, and categorizing 2 billion sky objects recorded over 3 decades. The algorithms extracted the relevant patterns and features needed to classify the sky objects as stars or galaxies. This developing field of data mining and profiling has several frontiers where it can be used.
• Unauthorized Use − Trends obtained through data mining, intended to be used for marketing or other ethical goals, can be misused. Unethical businesses or people can use the data obtained through data mining to take advantage of vulnerable people or to discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; mistakes do occur, and they can have serious consequences.
APPLICATIONS OF DATA MINING
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis
7. Market Basket Analysis
8. Education
9. CRM (Customer Relationship Management)
Aliases of Data Mining
• Exploratory Data Analysis
• Data Driven Analysis
• Deductive Learning