2. What is data mining?
• Data mining is the process of sorting through large data
sets to identify patterns and relationships that can help
solve business problems through data analysis. Data
mining techniques and tools enable enterprises
to predict future trends and make more-informed
business decisions.
3. KDD
• Many people treat data mining as a synonym for
another popularly used term, knowledge discovery from
data, or KDD, while others view data mining as merely
an essential step in the process of knowledge discovery.
The knowledge discovery process is shown in Figure 1
as an iterative sequence of the following steps:
5. Data Cleaning
Removal of noise, inconsistent data, and outliers
Strategies to handle missing data fields.
Data Integration
Data from multiple sources, such as databases, data warehouses, and
transactional data, are integrated and combined into a single data format.
Data Selection
Data relevant to the analysis task are retrieved from the database.
Collecting only the information necessary for the model.
Finding useful features to represent the data, depending on the goal of the task.
6. Data Transformation
Data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Transformation methods can also find invariant representations of
the data.
Data Mining
An essential process where intelligent methods are applied to
extract data patterns.
Deciding which model and parameter may be appropriate.
Pattern Evaluation
To identify the truly interesting patterns representing
knowledge, based on interestingness measures.
Knowledge Presentation
Visualization and knowledge representation techniques are
used to present mined knowledge to users.
Visualizations can be in the form of graphs, charts, or tables.
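The iterative KDD steps above can be sketched end to end in a few lines of Python. The record fields, the toy transactions, and the final "mining" step (a simple top-seller query standing in for a real mining algorithm) are illustrative assumptions, not part of the process description itself:

```python
# Minimal sketch of the KDD pipeline on toy transaction records.
raw = [
    {"item": "milk",  "qty": 2,  "price": 1.5},
    {"item": "bread", "qty": 1,  "price": None},  # missing value
    {"item": "milk",  "qty": -5, "price": 1.5},   # noisy/inconsistent entry
]

# 1. Data cleaning: drop noisy rows, fill missing prices with a default
cleaned = [r for r in raw if r["qty"] > 0]
for r in cleaned:
    if r["price"] is None:
        r["price"] = 0.0

# 2. Data selection and transformation: keep the relevant fields and
#    aggregate to total spend per item
totals = {}
for r in cleaned:
    totals[r["item"]] = totals.get(r["item"], 0.0) + r["qty"] * r["price"]

# 3. "Data mining": report the top-selling item
top = max(totals, key=totals.get)
print(top, totals[top])  # milk 3.0
```

In practice each step is far richer (pattern evaluation, knowledge presentation), but the chain of clean, select/transform, then mine is the same.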
7. What Kinds of Data Can Be Mined?
• As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful for a
target application. The most basic forms of data for
mining applications are
• Database data
• Data warehouse data, and
• Transactional data
8. What Kinds of Patterns Can Be
Mined?
• We have observed various types of data and information
repositories on which data mining can be performed. Let us
now examine the kinds of patterns that can be mined. There
are a number of data mining functionalities.
• These include characterization and discrimination. In general,
such tasks can be classified into two categories: descriptive
and predictive. Descriptive mining tasks characterize
properties of the data in a target data set. Predictive mining
tasks perform induction on the current data in order to make
predictions. Data mining functionalities specify the kinds of
patterns that can be discovered; interesting patterns represent
knowledge.
9. Class/Concept Description:
• Characterization and Discrimination. Data entries can
be associated with classes or concepts. For example, in
the AllElectronics store, classes of items for sale include
computers and printers, and concepts of customers
include bigSpenders and budgetSpenders. It can be
useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such
descriptions of a class or a concept are called
class/concept descriptions.
10. Mining Frequent Patterns,
Associations, and Correlations
• Frequent patterns, as the name suggests, are patterns that occur frequently in
data. There are many kinds of frequent patterns, including frequent itemsets,
frequent subsequences (also known as sequential patterns), and frequent
substructures. A frequent itemset typically refers to a set of items that often
appear together in a transactional data set—for example, milk and bread,
which are frequently bought together in grocery stores by many customers. A
frequently occurring subsequence, such as the pattern that customers tend to
purchase first a laptop, followed by a digital camera, and then a memory card,
is a (frequent) sequential pattern. A substructure can refer to different
structural forms (e.g., graphs, trees, or lattices) that may be combined with
itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern. Mining frequent patterns leads to the discovery
of interesting associations and correlations within data.
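A frequent itemset can be found by simple counting. The baskets and the absolute `min_support` threshold below are toy assumptions, not data from the text:

```python
from collections import Counter
from itertools import combinations

# Toy transactional data set: each basket is a set of items
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
min_support = 3  # absolute support threshold (illustrative)

# Count every 2-itemset that appears in some basket
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only the pairs meeting the support threshold
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('bread', 'milk'): 3}
```

Real algorithms such as Apriori or FP-growth avoid enumerating every candidate itemset, but the notion of support is the same.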
11. Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents an
entity—in a sales database, the objects may be customers,
store items, and sales; in a medical database, the objects
may be patients; in a university database, the objects may be
students, professors, and courses. Data objects are typically
described by attributes. Data objects can also be referred to
as samples, examples, instances, data points, or objects. If
the data objects are stored in a database, they are data
tuples. That is, the rows of a database correspond to the data
objects, and the columns correspond to the attributes. In this
section, we define attributes and look at the various attribute
types.
12. What Is an Attribute?
• An attribute is a property or characteristic of an object,
for example a person’s hair colour or the air humidity.
• An attribute set defines an object. The object is also
referred to as a record, an instance, or an entity.
• Different types of attributes or data types:
13. • Nominal Attribute:
Nominal attributes provide only enough information to distinguish
one object from another, such as a student roll number or the sex of
a person.
• Ordinal Attribute:
An ordinal attribute value provides sufficient information to
order the objects, such as rankings, grades, or height.
• Binary Attribute:
These take only the values 0 and 1, where 0 denotes the absence of a
feature and 1 denotes its presence.
• Numeric Attribute:
• It is quantitative, i.e., the quantity can be measured and
represented in integer or real values. Numeric attributes are of two types:
Interval-scaled attribute:
It is measured on a scale of equal-size units. These attributes allow
comparison, such as temperature in °C or °F, and their values
have an order.
Ratio-scaled attribute:
Both differences and ratios are meaningful, e.g., age,
length, and weight.
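The difference between interval-scaled and ratio-scaled attributes can be checked with a little arithmetic; the temperatures and weights below are made-up values:

```python
# Interval-scaled: differences are meaningful, ratios are not,
# because the zero point (0 °C) is arbitrary.
c1, c2 = 10.0, 20.0                        # temperatures in Celsius
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32  # same temperatures in Fahrenheit
print(c2 / c1)  # 2.0
print(f2 / f1)  # 1.36 -> the "ratio" changes with the unit, so it is meaningless

# Ratio-scaled: a true zero exists, so ratios are meaningful.
w1, w2 = 30.0, 60.0  # weights in kg
print(w2 / w1)       # 2.0 -> "twice as heavy" holds in any unit
```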
14. Basic Statistical Descriptions of Data
• For data preprocessing to be successful, it is essential to have
an overall picture of your data. Basic statistical descriptions
can be used to identify properties of the data and highlight
which data values should be treated as noise or outliers. This
section discusses three areas of basic statistical descriptions.
We start with measures of central tendency:
15. Measuring the Central Tendency:
• Mean, Median, and Mode. In this section, we look at various ways to
measure the central tendency of data.
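Python's standard library computes all three measures directly; the salary list is a toy example:

```python
from statistics import mean, median, mode

# Toy attribute values (e.g., salaries in thousands)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))    # arithmetic mean
print(median(salaries))  # 54.0 (average of the two middle values)
print(mode(salaries))    # 52 (first most-frequent value encountered)
```

Note that the mean is sensitive to extreme values (here, 110), while the median is not; that is why both are usually reported.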
16. Data Preprocessing:
• Data Pre-processing is a preliminary step during data mining. It is
any type of processing performed on raw data to transform data
into formats that are easier to use.
• Why Is Data Preprocessing Important?
• In the real world, data is frequently unclean – missing key values,
containing inconsistencies or displaying “noise” (containing errors
and outliers).Without data preprocessing, these data mistakes will
survive and detract from the quality of data mining
17. Data Quality: Why do we
preprocess the data?
• Many characteristics act as deciding factors for data quality, such as
incompleteness and incoherent information, which are common properties of
large real-world databases. Factors used for data quality assessment
are:
• Accuracy:
There are many possible reasons for flawed or inaccurate data, e.g., having
incorrect attribute values caused by human or computer errors.
• Completeness:
Incomplete data can occur for several reasons; attributes of interest, such as
customer information for sales and transaction data, may not always be
available.
18. Continued
• Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data
codes, or from incoherent formats in input fields. Duplicate tuples also need
cleaning.
• Timeliness:
It also affects the quality of the data. At the end of the month, several sales
representatives may fail to file their sales records on time, and several
corrections and adjustments flow in after the end of the month. Data stored
in the database are thus incomplete for a time after each month.
• Believability:
It is reflective of how much users trust the data.
• Interpretability:
It reflects how easily users can understand the data.
19. Data Cleaning
• Real-world data tend to be incomplete, noisy, and
inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the
data. In this section, you will study basic methods for
data cleaning.
20. Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales
data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data not registered
• Missing data may need to be inferred.
21. How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification);
not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious and often infeasible for large data sets
• Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree
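The "attribute mean per class" strategy can be sketched as below; the table, field names, and class labels are hypothetical:

```python
# Toy table: fill a missing income with the mean income of tuples
# sharing the same class label (smarter than one global mean).
data = [
    {"cls": "big_spender",    "income": 90.0},
    {"cls": "big_spender",    "income": None},   # missing
    {"cls": "big_spender",    "income": 110.0},
    {"cls": "budget_spender", "income": 30.0},
    {"cls": "budget_spender", "income": None},   # missing
    {"cls": "budget_spender", "income": 40.0},
]

# Mean income per class, ignoring missing values
sums = {}
for row in data:
    if row["income"] is not None:
        sums.setdefault(row["cls"], []).append(row["income"])
class_mean = {c: sum(v) / len(v) for c, v in sums.items()}

# Impute each missing value with its class mean
for row in data:
    if row["income"] is None:
        row["income"] = class_mean[row["cls"]]

print(data[1]["income"], data[4]["income"])  # 100.0 35.0
```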
22. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
23. How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
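Smoothing by bin means can be illustrated on a small sorted price list (toy data, three equal-frequency bins):

```python
# Equal-frequency binning followed by smoothing with bin means.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

n_bins = 3
bin_size = len(prices) // n_bins  # 3 values per bin
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_vals = prices[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))  # replace each value by its bin mean

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value with the closer of the bin's minimum and maximum.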
24. Simple Discretization Methods:
Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
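The contrast between the two partitioning schemes shows up clearly on skewed toy data containing two outliers:

```python
# Equal-width vs. equal-depth partitioning of the same values.
values = sorted([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
N = 3

# Equal-width: interval width W = (B - A) / N
A, B = values[0], values[-1]
W = (B - A) / N  # (215 - 5) / 3 = 70.0
width_bins = [[v for v in values if A + i * W <= v < A + (i + 1) * W]
              for i in range(N)]
width_bins[-1].append(B)  # put the maximum into the last bin

# Equal-depth: each bin holds len(values) // N samples
d = len(values) // N
depth_bins = [values[i:i + d] for i in range(0, len(values), d)]

print(width_bins)  # the outliers 204 and 215 leave the middle bin nearly empty
print(depth_bins)  # every bin holds exactly 4 values
```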
25. Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g., correlation and clustering
to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheel)
26. Data Integration
• Data mining often requires data integration—the
merging of data from multiple data stores. Careful
integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help
improve the accuracy and speed of the subsequent data
mining process.
27. Handling Redundancy in Data
Integration
• Redundant data often occur when integrating multiple databases
• Object identification: The same attribute or object may have different
names in different databases
• Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
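Correlation analysis for redundancy detection can be sketched with Pearson's coefficient; the attributes monthly_revenue and annual_revenue are hypothetical examples of a derivable pair:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # perfectly derivable

r = pearson(monthly_revenue, annual_revenue)
print(round(r, 4))  # 1.0 -> one of the two attributes is redundant
```

A |r| close to 1 flags a candidate attribute for removal; for nominal attributes a chi-square test plays the same role.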
28. Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
Attribute/feature construction
◦ New attributes constructed from the given ones
29. Data Normalization
• The ranges of attribute (feature) values differ, so one
feature might overpower another.
• Solution: Normalization
• Scaling data values into a range such as [0 … 1] or [-1 … 1] prevents
features with a large range, like ‘salary’, from outweighing
features with a smaller range, like ‘age’.
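The min-max and z-score normalization methods listed above can be sketched in a few lines; the salary values are made up:

```python
from statistics import mean, pstdev

salary = [30_000, 45_000, 60_000, 75_000, 90_000]

# Min-max normalization to [0, 1]
lo, hi = min(salary), max(salary)
minmax = [(v - lo) / (hi - lo) for v in salary]
print(minmax)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Z-score normalization: zero mean, unit (population) standard deviation
mu, sigma = mean(salary), pstdev(salary)
zscore = [(v - mu) / sigma for v in salary]
print([round(z, 3) for z in zscore])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After either transformation, ‘salary’ lives on the same scale as a small-range attribute like ‘age’, so neither dominates distance computations.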