Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for further analysis. The key tasks in data preprocessing are data cleaning to handle missing values, noise, outliers and inconsistencies; data integration of multiple data sources; data transformation through normalization, aggregation, and attribute construction; data reduction to reduce data size through methods like binning, clustering and sampling; and data discretization to reduce continuous attribute values into intervals. Descriptive statistics like mean, median and histograms can help identify noise and outliers during data cleaning.
2. Why Data Preprocessing?
• Data in the real world is dirty
• Incomplete: missing attribute values
e.g., occupation=“ ”
Incomplete data may come from human, hardware, or software problems
• Noisy: containing errors or outliers
e.g., Salary=“-10”
Noisy data may come from faulty data collection instruments
4. Major Tasks in Data Preprocessing?
• Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data transformation
Normalization and aggregation
• Data reduction
Obtains reduced representation in volume but
produces the same or similar analytical results
• Data discretization
Part of data reduction but with particular importance,
especially for numerical data
5. Descriptive Data Summarization
• Descriptive data summarization techniques can be used to identify which data values should be treated as noise or outliers.
• Measures of central tendency include
1. mean
2. median
3. mode
4. midrange
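As a quick sketch, the four measures above can be computed with Python's standard library (the salary figures here are made up for illustration):

```python
# Minimal sketch of the four central-tendency measures listed above.
from statistics import mean, median, mode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 70, 110]

avg = mean(salaries)                             # arithmetic mean
med = median(salaries)                           # middle value of the sorted data
mod = mode(salaries)                             # most frequent value
midrange = (min(salaries) + max(salaries)) / 2   # average of the two extremes

print(avg, med, mod, midrange)                   # 56.3 52 52 70.0
```

The midrange is cheap to compute but, like the mean, is sensitive to outliers such as the 110 above.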
6. Graphic Displays of Basic Descriptive Data Summaries
• Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentations, other popular displays of data summaries include:
• Histogram
• Quantile plots
• q-q plots
• scatter plots
• loess curves.
11. Data Cleaning
• Data cleaning (or data cleansing) routines attempt to:
• fill in missing values
• smooth out noise
• identify outliers
• correct inconsistencies
12. Missing Values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging
to the same class as the given tuple
6. Use the most probable value to fill in the missing
value
Method 6, however, is a popular strategy.
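Strategy 4, filling with the attribute mean, can be sketched as follows (the age values and the `None` placeholder for a missing entry are hypothetical):

```python
# Sketch of missing-value strategy 4: fill with the attribute mean.
from statistics import mean

ages = [25, 30, None, 40, None, 35]        # None marks a missing value

known = [v for v in ages if v is not None]
fill = mean(known)                         # mean over the known values: 32.5
imputed = [fill if v is None else v for v in ages]

print(imputed)                             # [25, 30, 32.5, 40, 32.5, 35]
```

Strategy 5 works the same way, except the mean is computed only over tuples of the same class.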
13. Noisy Data
1. Binning:
• The sorted values are distributed into a number of “buckets,” or bins
ex: Bin = 4, 8, 15
• Smoothing by bin means: each value is replaced by the bin mean
Bin = 9, 9, 9
• Smoothing by bin boundaries: each value is replaced by the nearest bin boundary (the bin minimum or maximum)
Bin = 4, 4, 15
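Using the slide's example bin [4, 8, 15], the two smoothing methods can be sketched as:

```python
# Smoothing the example bin two ways.
from statistics import mean

bin_vals = [4, 8, 15]

# Smoothing by bin means: every value becomes the bin mean (27 / 3 = 9).
by_means = [mean(bin_vals)] * len(bin_vals)                     # [9, 9, 9]

# Smoothing by bin boundaries: each value moves to the nearer of the
# bin's minimum and maximum (8 is closer to 4 than to 15).
lo, hi = min(bin_vals), max(bin_vals)
by_bounds = [lo if v - lo <= hi - v else hi for v in bin_vals]  # [4, 4, 15]
```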
14. 2. Regression
• Data can be smoothed by fitting the data to a function.
• Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression involving more than two attributes.
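A minimal sketch of fitting a least-squares line y = a + b·x so one attribute predicts the other (the data points here are hypothetical, chosen to lie exactly on y = 2x + 1):

```python
# Sketch: least-squares line fit for smoothing/prediction.
from statistics import mean

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                       # lies exactly on y = 2x + 1

mx, my = mean(xs), mean(ys)
# slope: covariance over variance of x
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx                             # intercept

def predict(x):
    return a + b * x

print(b, a, predict(6))                     # 2.0 1.0 13.0
```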
15. 3. Clustering
• Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
• The values that fall outside of the set of
clusters may be considered outliers
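A minimal one-dimensional sketch of this idea (the data values and the gap threshold are hypothetical): sort the values, start a new cluster whenever the gap to the next value exceeds the threshold, and flag singleton clusters as outliers.

```python
# Sketch of clustering-based outlier detection in one dimension.
def gap_clusters(values, gap=5):
    """Group sorted values into clusters separated by gaps > `gap`."""
    vals = sorted(values)
    clusters, current = [], [vals[0]]
    for v in vals[1:]:
        if v - current[-1] <= gap:
            current.append(v)          # close enough: same cluster
        else:
            clusters.append(current)   # large gap: start a new cluster
            current = [v]
    clusters.append(current)
    return clusters

data = [22, 24, 25, 27, 80, 23, 26]
clusters = gap_clusters(data)          # [[22, 23, 24, 25, 26, 27], [80]]
outliers = [v for c in clusters if len(c) == 1 for v in c]   # [80]
```

Real clustering algorithms (k-means, DBSCAN, hierarchical methods) generalize this to many dimensions.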
16. SOME RULES
• The data should also be examined regarding:
• unique rules - each value of the attribute must be different from all other values
• consecutive rules - there can be no missing values between the lowest and highest values
• null rules - a null rule specifies the use of blanks, question marks, or special characters to indicate a null value
17. Data Integration
• Data integration combines data from multiple sources into a coherent data store.
• Data integration techniques:
• schema integration
• redundancy handling
• correlation analysis
18. Data Transformation
• In data transformation, the data are transformed
or consolidated into forms appropriate for
mining.
• Data transformation can involve the following:
• Smoothing - to remove noise from the data.
• Aggregation - summary or aggregation operations
are applied to the data.
• Ex: daily sales data may be aggregated to compute monthly and annual totals.
19. Cont….
• Generalization - low-level or “primitive” (raw)
data are replaced by higher-level concepts
through the use of concept hierarchies.
• Normalization - the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction - new attributes are
constructed and added from the given set of
attributes.
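The normalization step above can be sketched with min-max scaling into [0.0, 1.0] (the income values are hypothetical):

```python
# Sketch of min-max normalization into a specified range.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [12000, 73600, 98000]
scaled = min_max(incomes)      # minimum maps to 0.0, maximum to 1.0
```

Passing `new_min=-1.0` gives the −1.0 to 1.0 range mentioned above. Note that min-max scaling breaks if a later value falls outside the original min/max.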
20. Data Reduction
• Data reduction techniques can be applied to obtain a
reduced representation of the data set that is much
smaller in volume.
1. Data cube aggregation
where aggregation operations are applied to the data
in the construction of a data cube.
2. Attribute subset selection
where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
21. Cont…
3. Dimensionality reduction
where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction
where the data are replaced or estimated by alternative, smaller representations such as
• clustering
• sampling
• histograms
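Numerosity reduction by simple random sampling can be sketched as follows (the population size, sample size, and seed are arbitrary choices for the example):

```python
# Sketch of numerosity reduction via simple random sampling
# without replacement.
import random

random.seed(42)                          # fixed seed for reproducibility
population = list(range(1000))           # hypothetical large data set
sample = random.sample(population, 50)   # much smaller representation
```

Sampling costs are proportional to the sample size rather than the data set size, which makes it attractive for very large data sets.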
22. Data Discretization
• Data discretization techniques can be used to
reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals.
• Binning
• Histogram Analysis
• Entropy-Based Discretization
• Interval Merging by χ² (Chi-Square) Analysis
• Cluster Analysis
• Discretization by Intuitive Partitioning
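The first technique, binning, can be sketched as equal-width discretization: divide the attribute's range into k intervals and map each value to its interval index (the age values and k are hypothetical):

```python
# Sketch of equal-width discretization of a continuous attribute.
def equal_width(values, k=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp so the maximum value lands in the last interval
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [18, 22, 25, 40, 45, 60, 65]
labels = equal_width(ages)     # [0, 0, 0, 1, 1, 2, 2]
```

Equal-width binning is simple but sensitive to outliers; equal-frequency binning or the entropy-based method above often give more useful intervals.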