SlideShare a Scribd company logo
Data
Lecture 2
• Last week:
– Course Motivation
– Data Mining basics
• This week:
– Data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 2
Summary – last week
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 3
Agenda
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing
What is Data?
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 4
• Collection of data objects
and their attributes
• An attribute is a property or
characteristic of an object
• Examples: eye color of a
person, temperature, etc.
• Attribute is also known as
variable, field, characteristic,
dimension, or feature
• A collection of attributes
describe an object
• Object is also known as
record, point, case, sample,
entity, or instance
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Attributes
Objects
Attribute Values
• Attribute values are numbers or symbols assigned to
an attribute for a particular object
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
– But properties of attribute can be different than
the properties of the values used to represent the
attribute
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 5
Types of Attributes
There are different types of attributes or measurement scales:
– Nominal: are used to categorise/classify objects, individuals,
groups, or even phenomena. Mutually exclusive
• Examples: Gender, Country, Religion, Ethnicity
– Ordinal: the data can be categorized and ranked. The rank is
known but the magnitude of the rank is unknown.
• Examples: Rate the smartphone brands (iPhone, Samsung, HTC)
– Interval: You can categorize, rank, and infer equal
intervals between neighbouring data points, but the origin point is
arbitrary.
• Examples: My university education prepare my well for my career.
(Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree)
– Ratio: You can categorize, rank, and infer equal intervals between
neighbouring data points, and there is a true zero point.
• Examples: Weight, income, Age, Sales volume
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 6
Properties of Attribute Values
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 7
Attribute Type Description Examples Operations
Categorical
Qualitative
Nominal Nominal attribute
values only
distinguish. (=, ¹)
Gender, Country,
Religion, Ethnicity
mode, frequency
(%, count)
Ordinal Ordinal attribute
values also order
objects.
(<, >)
Rate the smartphone
brands (iPhone,
Samsung, HTC)
median,
percentiles, rank
correlation, run
tests, sign tests
Numeric
Quantitative
Interval For interval
attributes,
differences between
values are
meaningful. (+, - )
My university
education prepare
my well for my
career.
Strongly Agree,
Agree, Neutral,
Disagree, Strongly
Disagree
mean, standard
deviation,
Pearson's
correlation, t and
F tests
Ratio For ratio variables,
both differences and
ratios are
meaningful. (*, /)
Weight, income,
Age, Sales volume
geometric mean,
harmonic mean,
percent variation
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 8
Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded
as important
• Words present in documents
• Items present in customer transactions
• If we met a friend in the grocery store would we ever
say the following?
“I see our purchases are very similar since we didn’t buy
most of the same things.”
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 9
Important Characteristics of Data
– Dimensionality (number of attributes)
• High dimensional data brings a number of
challenges
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale
– Size
• Type of analysis may depend on size of data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 10
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 11
Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 12
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
• Such a data set can be represented by an m by n
matrix, where there are m rows, one for each object,
and n columns, one for each attribute
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 13
1.1
2.2
16.22
6.25
12.65
1.2
2.7
15.22
5.27
10.23
Thickness
Load
Distance
Projection
of y load
Projection
of x Load
1.1
2.2
16.22
6.25
12.65
1.2
2.7
15.22
5.27
10.23
Thickness
Load
Distance
Projection
of y load
Projection
of x Load
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 14
Document 1
season
timeout
lost
win
game
score
ball
play
coach
team
Document 2
Document 3
3 0 5 0 2 6 0 2 0 2
0
0
7 0 2 1 0 0 3 0 0
1 0 0 1 2 2 0 3 0
Transaction Data
• A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 15
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples: Generic graph, a molecule, and webpages
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 16
5
2
1
2
5
Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 17
An element of
the sequence
Items/Events
Ordered Data
• Spatio-Temporal Data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 18
Average Monthly
Temperature of
land and ocean
Data Quality
• Poor data quality negatively affects many data
processing efforts
• Data mining example: a classification model for
detecting people who are loan risks is built using poor
data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 19
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 20
Noise
• For objects, noise is an extraneous (irrelevant) object
• For attributes, noise refers to modification of original
values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen
– The figures below show two sine waves of the same magnitude
and different frequencies, the waves combined, and the two
sine waves with random noise
• The magnitude and shape of the original signal is distorted
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 21
Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis
– Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection
• Causes?
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 22
Data Preprocessing
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 23
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
– Data reduction - reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
– More “stable” data - aggregated data tends to have less variability
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 24
Sampling
• Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
• Statisticians often sample because obtaining the entire
set of data of interest is too expensive or time
consuming.
• Sampling is typically used in data mining because
processing the entire set of data of interest is too
expensive or time consuming.
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 25
Sample Size
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data
set, if the sample is representative
– A sample is representative if it has approximately the same
properties (of interest) as the original set of data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 26
8000 points 2000 Points 500 Points
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement
• As each item is selected, it is removed from the
population
– Sampling with replacement
• Objects are not removed from the population as
they are selected for the sample.
• In sampling with replacement, the same object
can be picked up more than once
• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 27
Discretization
• Discretization is the process of converting a continuous
attribute into an ordinal attribute
– A potentially infinite number of values are mapped into a small
number of categories
– Discretization is used in both unsupervised and supervised
settings
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 28
Binarization
• Binarization maps a continuous or categorical attribute
into one or more binary variables
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 29
Attribute Transformation
• An attribute transform is a function that maps the entire
set of values of a given attribute to a new set of
replacement values such that each old value can be
identified with one of the new values
– Simple functions: xk, log(x), ex, |x|
– Normalization
• Refers to various techniques to adjust to
differences among attributes in terms of
frequency of occurrence, mean, variance, range
• Take out unwanted, common signal, e.g.,
seasonality
– In statistics, standardization refers to subtracting off the means
and dividing by the standard deviation
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 30
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory
required by data mining algorithms
– Allow data to be more easily
visualized
– May help to eliminate irrelevant
features or reduce noise
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear
techniques
• Goal is to find a projection that
captures the largest amount of
variation in data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 31
x2
x1
e
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– Duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
– Contain no information that is useful for the data mining task at
hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Many techniques developed, especially for classification
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 32
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• Three general methodologies:
– Feature extraction
• Example: extracting edges from images
– Feature construction
• Example: dividing mass by volume to get density
– Mapping data to new space
• Example: Fourier and wavelet analysis
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 33
Summary
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 34
• Association Rule Mining
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 35
Next week

More Related Content

Similar to Lecture2.pdf

© Tan,Steinbach, Kumar Introduction to Data Mining 418.docx
© Tan,Steinbach, Kumar Introduction to Data Mining        418.docx© Tan,Steinbach, Kumar Introduction to Data Mining        418.docx
© Tan,Steinbach, Kumar Introduction to Data Mining 418.docx
gerardkortney
 
Data Mining - Introduction and Data
Data Mining - Introduction and DataData Mining - Introduction and Data
Data Mining - Introduction and Data
Darío Garigliotti
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
Kamal Acharya
 
Student Affairs Assessment Committee Training
Student Affairs Assessment Committee TrainingStudent Affairs Assessment Committee Training
Student Affairs Assessment Committee Training
Stan Dura
 
APSY3206 Lecture 1.pptx
APSY3206 Lecture 1.pptxAPSY3206 Lecture 1.pptx
APSY3206 Lecture 1.pptx
MariaMalikAwan
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingextraganesh
 
Preprocessing techniques in data mining with solve examples
Preprocessing techniques in data mining with solve examplesPreprocessing techniques in data mining with solve examples
Preprocessing techniques in data mining with solve examples
MuhammadSaleemKhan26
 
Data Visualization dataviz superpower
Data Visualization dataviz superpowerData Visualization dataviz superpower
Data Visualization dataviz superpower
Jen Stirrup
 
introDM.ppt
introDM.pptintroDM.ppt
introDM.ppt
Arumugam Prakash
 
introDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.pptintroDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.ppt
DEEPAK948083
 
Data Mining Lecture_1.pptx
Data Mining Lecture_1.pptxData Mining Lecture_1.pptx
Data Mining Lecture_1.pptx
Subrata Kumer Paul
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
Sulman Ahmed
 
datamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdfdatamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdf
ssuser3e6464
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
AbhishekKumarSingh260
 
Is Your Marketing Database "Model Ready"?
Is Your Marketing Database "Model Ready"?Is Your Marketing Database "Model Ready"?
Is Your Marketing Database "Model Ready"?Vivastream
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
John Sihotang, Dr, MM, Ir
 
Personalizing the Consumer Experience with Data
Personalizing the Consumer Experience with DataPersonalizing the Consumer Experience with Data
Personalizing the Consumer Experience with Data
InfoTrust LLC
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
Roshan575917
 

Similar to Lecture2.pdf (20)

chap2_data.ppt
chap2_data.pptchap2_data.ppt
chap2_data.ppt
 
© Tan,Steinbach, Kumar Introduction to Data Mining 418.docx
© Tan,Steinbach, Kumar Introduction to Data Mining        418.docx© Tan,Steinbach, Kumar Introduction to Data Mining        418.docx
© Tan,Steinbach, Kumar Introduction to Data Mining 418.docx
 
Data Mining - Introduction and Data
Data Mining - Introduction and DataData Mining - Introduction and Data
Data Mining - Introduction and Data
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Student Affairs Assessment Committee Training
Student Affairs Assessment Committee TrainingStudent Affairs Assessment Committee Training
Student Affairs Assessment Committee Training
 
APSY3206 Lecture 1.pptx
APSY3206 Lecture 1.pptxAPSY3206 Lecture 1.pptx
APSY3206 Lecture 1.pptx
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing techniques in data mining with solve examples
Preprocessing techniques in data mining with solve examplesPreprocessing techniques in data mining with solve examples
Preprocessing techniques in data mining with solve examples
 
Data Visualization dataviz superpower
Data Visualization dataviz superpowerData Visualization dataviz superpower
Data Visualization dataviz superpower
 
introDM.ppt
introDM.pptintroDM.ppt
introDM.ppt
 
introDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.pptintroDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.ppt
 
Data Mining Lecture_1.pptx
Data Mining Lecture_1.pptxData Mining Lecture_1.pptx
Data Mining Lecture_1.pptx
 
3 module 2
3 module 23 module 2
3 module 2
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
datamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdfdatamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdf
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
Is Your Marketing Database "Model Ready"?
Is Your Marketing Database "Model Ready"?Is Your Marketing Database "Model Ready"?
Is Your Marketing Database "Model Ready"?
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
 
Personalizing the Consumer Experience with Data
Personalizing the Consumer Experience with DataPersonalizing the Consumer Experience with Data
Personalizing the Consumer Experience with Data
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 

Recently uploaded

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 

Recently uploaded (20)

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 

Lecture2.pdf

  • 2. • Last week: – Course Motivation – Data Mining basics • This week: – Data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 2 Summary – last week
  • 3. Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 3 Agenda – Attributes and Objects – Types of Data – Data Quality – Similarity and Distance – Data Preprocessing
  • 4. What is Data? Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 4 • Collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples: eye color of a person, temperature, etc. • Attribute is also known as variable, field, characteristic, dimension, or feature • A collection of attributes describe an object • Object is also known as record, point, case, sample, entity, or instance Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Attributes Objects
  • 5. Attribute Values • Attribute values are numbers or symbols assigned to an attribute for a particular object • Distinction between attributes and attribute values – Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters – Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers – But properties of attribute can be different than the properties of the values used to represent the attribute Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 5
  • 6. Types of Attributes There are different types of attributes or measurement scales: – Nominal: are used to categorise/classify objects, individuals, groups, or even phenomena. Mutually exclusive • Examples: Gender, Country, Religion, Ethnicity – Ordinal: the data can be categorized and ranked. The rank is known but the magnitude of the rank is unknown. • Examples: Rate the smartphone brands (iPhone, Samsung, HTC) – Interval: You can categorize, rank, and infer equal intervals between neighbouring data points, but the origin point is arbitrary. • Examples: My university education prepare my well for my career. (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree) – Ratio: You can categorize, rank, and infer equal intervals between neighbouring data points, and there is a true zero point. • Examples: Weight, income, Age, Sales volume Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 6
  • 7. Properties of Attribute Values Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 7 Attribute Type Description Examples Operations Categorical Qualitative Nominal Nominal attribute values only distinguish. (=, ¹) Gender, Country, Religion, Ethnicity mode, frequency (%, count) Ordinal Ordinal attribute values also order objects. (<, >) Rate the smartphone brands (iPhone, Samsung, HTC) median, percentiles, rank correlation, run tests, sign tests Numeric Quantitative Interval For interval attributes, differences between values are meaningful. (+, - ) My university education prepare my well for my career. Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree mean, standard deviation, Pearson's correlation, t and F tests Ratio For ratio variables, both differences and ratios are meaningful. (*, /) Weight, income, Age, Sales volume geometric mean, harmonic mean, percent variation
  • 8. Discrete and Continuous Attributes • Discrete Attribute – Has only a finite or countably infinite set of values – Examples: zip codes, counts, or the set of words in a collection of documents – Often represented as integer variables. – Note: binary attributes are a special case of discrete attributes • Continuous Attribute – Has real numbers as attribute values – Examples: temperature, height, or weight. – Practically, real values can only be measured and represented using a finite number of digits. – Continuous attributes are typically represented as floating- point variables. Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 8
  • 9. Asymmetric Attributes • Only presence (a non-zero attribute value) is regarded as important • Words present in documents • Items present in customer transactions • If we met a friend in the grocery store would we ever say the following? “I see our purchases are very similar since we didn’t buy most of the same things.” Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 9
  • 10. Important Characteristics of Data – Dimensionality (number of attributes) • High dimensional data brings a number of challenges – Sparsity • Only presence counts – Resolution • Patterns depend on the scale – Size • Type of analysis may depend on size of data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 10
  • 11. Types of data sets • Record – Data Matrix – Document Data – Transaction Data • Graph – World Wide Web – Molecular Structures • Ordered – Spatial Data – Temporal Data – Sequential Data – Genetic Sequence Data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 11
  • 12. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 12 Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10
  • 13. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 13 1.1 2.2 16.22 6.25 12.65 1.2 2.7 15.22 5.27 10.23 Thickness Load Distance Projection of y load Projection of x Load 1.1 2.2 16.22 6.25 12.65 1.2 2.7 15.22 5.27 10.23 Thickness Load Distance Projection of y load Projection of x Load
  • 14. Document Data • Each document becomes a ‘term’ vector – Each term is a component (attribute) of the vector – The value of each component is the number of times the corresponding term occurs in the document. Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 14 Document 1 season timeout lost win game score ball play coach team Document 2 Document 3 3 0 5 0 2 6 0 2 0 2 0 0 7 0 2 1 0 0 3 0 0 1 0 0 1 2 2 0 3 0
  • 15. Transaction Data • A special type of data, where – Each transaction involves a set of items. – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. – Can represent transaction data as record data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 15 TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk
  • 16. Graph Data • Examples: Generic graph, a molecule, and webpages Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 16 5 2 1 2 5 Benzene Molecule: C6H6
  • 17. Ordered Data • Sequences of transactions Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 17 An element of the sequence Items/Events
  • 18. Ordered Data • Spatio-Temporal Data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 18 Average Monthly Temperature of land and ocean
  • 19. Data Quality • Poor data quality negatively affects many data processing efforts • Data mining example: a classification model for detecting people who are loan risks is built using poor data – Some credit-worthy candidates are denied loans – More loans are given to individuals that default Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 19
  • 20. Data Quality • What kinds of data quality problems? • How can we detect problems with the data? • What can we do about these problems? • Examples of data quality problems: – Noise and outliers – Wrong data – Fake data – Missing values – Duplicate data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 20
  • 21. Noise • For objects, noise is an extraneous (irrelevant) object • For attributes, noise refers to modification of original values – Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen – The figures below show two sine waves of the same magnitude and different frequencies, the waves combined, and the two sine waves with random noise • The magnitude and shape of the original signal is distorted Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 21
  • 22. Outliers • Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set – Case 1: Outliers are noise that interferes with data analysis – Case 2: Outliers are the goal of our analysis • Credit card fraud • Intrusion detection • Causes? Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 22
  • 23. Data Preprocessing • Aggregation • Sampling • Discretization and Binarization • Attribute Transformation • Dimensionality Reduction • Feature subset selection • Feature creation Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 23
  • 24. Aggregation • Combining two or more attributes (or objects) into a single attribute (or object) • Purpose – Data reduction - reduce the number of attributes or objects – Change of scale • Cities aggregated into regions, states, countries, etc. • Days aggregated into weeks, months, or years – More “stable” data - aggregated data tends to have less variability Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 24
  • 25. Sampling • Sampling is the main technique employed for data reduction. – It is often used for both the preliminary investigation of the data and the final data analysis. • Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming. • Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming. Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 25
  • 26. Sample Size • The key principle for effective sampling is the following: – Using a sample will work almost as well as using the entire data set, if the sample is representative – A sample is representative if it has approximately the same properties (of interest) as the original set of data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 26 8000 points 2000 Points 500 Points
  • 27. Types of Sampling • Simple Random Sampling – There is an equal probability of selecting any particular item – Sampling without replacement • As each item is selected, it is removed from the population – Sampling with replacement • Objects are not removed from the population as they are selected for the sample. • In sampling with replacement, the same object can be picked up more than once • Stratified sampling – Split the data into several partitions; then draw random samples from each partition Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 27
  • 28. Discretization • Discretization is the process of converting a continuous attribute into an ordinal attribute – A potentially infinite number of values are mapped into a small number of categories – Discretization is used in both unsupervised and supervised settings Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 28
  • 29. Binarization • Binarization maps a continuous or categorical attribute into one or more binary variables Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 29
  • 30. Attribute Transformation • An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values – Simple functions: xk, log(x), ex, |x| – Normalization • Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, range • Take out unwanted, common signal, e.g., seasonality – In statistics, standardization refers to subtracting off the means and dividing by the standard deviation Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 30
  • 31. Dimensionality Reduction • Purpose: – Avoid curse of dimensionality – Reduce amount of time and memory required by data mining algorithms – Allow data to be more easily visualized – May help to eliminate irrelevant features or reduce noise • Techniques – Principal Components Analysis (PCA) – Singular Value Decomposition – Others: supervised and non-linear techniques • Goal is to find a projection that captures the largest amount of variation in data Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 31 x2 x1 e
  • 32. Feature Subset Selection • Another way to reduce dimensionality of data • Redundant features – Duplicate much or all of the information contained in one or more other attributes – Example: purchase price of a product and the amount of sales tax paid • Irrelevant features – Contain no information that is useful for the data mining task at hand – Example: students' ID is often irrelevant to the task of predicting students' GPA • Many techniques developed, especially for classification Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 32
  • 33. Feature Creation • Create new attributes that can capture the important information in a data set much more efficiently than the original attributes • Three general methodologies: – Feature extraction • Example: extracting edges from images – Feature construction • Example: dividing mass by volume to get density – Mapping data to new space • Example: Fourier and wavelet analysis Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 33
  • 34. Summary – Attributes and Objects – Types of Data – Data Quality – Similarity and Distance – Data Preprocessing Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 34
  • 35. • Association Rule Mining Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 35 Next week