Introduction to Data Mining Concepts and Techniques

Unit-1
INTRODUCTION TO
DATA MINING
By,
MRUTYUNJAYA S Y
Assistant Professor,
CSE dept,
CMREC, Hyderabad

HISTORY
• The history of Data Mining started very recently
as it is commonly considered with new
technology.
• However data is a discipline with a long history.
• It starts with the early Data Mining methods
Bayes’ Theorem (1700`s) and Regression
analysis (1800`s) which were mostly identifying
patterns in data.

Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw
data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation

What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
– Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis.
– Data mining is a process used by companies to turn
raw data into useful information. By using software to look
for patterns in large batches of data, businesses can learn
more about their customers and develop more effective
marketing strategies as well as increase sales and
decrease costs.

????Questions????
• Where exactly the DATA can be stored?
• How and from where to extract the DATA?
• From where the DATA we can be MINE?
• Answer is DATA WAREHOUSE........

• Draws ideas from machine learning/AI,
pattern recognition, statistics, and
database systems
• Must address:
– Enormity of data
– High dimensionality
of data
– Heterogeneous,
distributed nature
of data
Origins of Data Mining
AI /
Machine Learning
Statistics
Data Mining
Database
systems

Database Processing vs. Data
Mining Processing
• Query
– Well defined
– SQL
• Query
– Poorly defined
– No precise query language
 Output
– Precise
– Subset of database
 Output
– Fuzzy
– Not a subset of database

10
Query Examples
• Database
• Data Mining
– Find all customers who have purchased milk
– Find all items which are frequently purchased
with milk. (association rules)
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)

Data Mining: Classification Schemes
• Decisions in data mining
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
• Data mining tasks
– Descriptive data mining
– Predictive data mining

Decisions in Data Mining
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.

Data Mining Tasks
• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
Common data mining tasks
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]

Classification: Definition
• Given a collection of records (training set )
– Each record contains a set of attributes, one of the attributes is
the class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.

Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.

• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on its
account-holder as attributes.
– When does a customer buy, what does he buy, how often he
pays on time, etc
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.

• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!

Classifying Galaxies
Early
Intermediate
Late
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Class:
• Stages of Formation
Attributes:
• Image features,
• Characteristics of light
waves received, etc.

Data Mining: A KDD Process
– Data mining: the core of
knowledge discovery
process.
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Data Selection
Data Preprocessing
Data Mining
Pattern Evaluation

Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Business
Analyst
Data
Analyst
DBA
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP

November 13, 2022 Data Mining: Concepts and
Techniques
24
Architecture of a Typical Data
Mining System
Data
Warehouse
Data cleaning & data integration Filtering
Databases
Database or data
warehouse server
Data mining engine
Pattern evaluation
Graphical user interface
Knowledge-base

Data Mining: Confluence of Multiple
Disciplines
Data Mining
Database
Technology
Statistics
Other
Disciplines
Information
Science
Machine
Learning
Visualization

Examples of Large Datasets
• Government: IRS, NGA, …
• Large corporations
– WALMART: 20M transactions per day
– MOBIL: 100 TB geological databases
– AT&T 300 M calls per day
– Credit card companies
• Scientific
– NASA, EOS project: 50 GB per hour
– Environmental datasets

DATA MINING APPLICATIONS
• Areas of Use (Huge usage in All Fields)
– Internet – Discover needs of customers
– Economics – Predict stock prices
– Science – Predict environmental change
– Medicine – Match patients with similar problems  cure
• Credit Card Company wants to discover information about
clients from databases. Want to find:
– Clients who respond to promotions in “Junk Mail”
– Clients that are likely to change to another competitor

November 13, 2022 Data Preprocessing 28
Data Preprocessing
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
 Data warehouse needs consistent integration of quality
data
 Data extraction, cleaning, and transformation comprises
the majority of the work of building a data warehouse

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data

Forms of Data Preprocessing

Data Cleaning
 Importance
 “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data
warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 not register history or changes of the data
 Missing data may need to be inferred.

Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which requires data cleaning
 duplicate records
 incomplete data
 inconsistent data

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g.,
deal with possible outliers)

Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering
to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheels)

Introduction to Data Mining Concepts and Techniques

Recommended

Recommended

More Related Content

Similar to Introduction to Data Mining Concepts and Techniques

Similar to Introduction to Data Mining Concepts and Techniques (20)

Recently uploaded

Recently uploaded (20)

Introduction to Data Mining Concepts and Techniques