5. 5 Dvijesh Shastri
Data: a collection of objects, each described by a collection of attributes
Examples: students, customers, movies, etc.
6. 6 Dvijesh Shastri
Data: a collection of objects (examples) and their
attributes (features), typically arranged in matrix format
Attribute: a property or characteristic of an object.
Attribute is also known as variable, field, characteristic, or feature.
Examples: eye color of a person, temperature, etc.
Object: an entity described by a collection of attributes.
Object is also known as record, data point, sample, entity, observation,
or instance.
Matrix layout: columns are attributes, rows are objects.
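The matrix view of objects and attributes can be sketched in a few lines of Python. This is only an illustration: the `students` records and their attribute names are hypothetical and do not come from the slides.

```python
# Hypothetical records: each dict is one object, each key is one attribute.
students = [
    {"name": "Ana", "eye_color": "brown", "height_cm": 165.0},
    {"name": "Ben", "eye_color": "blue", "height_cm": 180.0},
    {"name": "Cy", "eye_color": "green", "height_cm": 172.5},
]

# Column order of the matrix is the attribute order of the first record.
attributes = list(students[0].keys())

# One row per object, one column per attribute.
matrix = [[obj[a] for a in attributes] for obj in students]

n_objects = len(matrix)
n_attributes = len(attributes)
print(n_objects, "objects x", n_attributes, "attributes")
```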
7. 7 Dvijesh Shastri
Types of Attributes
Qualitative attributes:
Nominal (or Categorical)
Values can only be compared for equality (= or ≠)
Examples: ID numbers, eye color, zip codes
Ordinal
Values obey an order (<) relationship
Examples: rankings (e.g., taste of potato chips on a scale from
1-5), height in {tall, medium, short}
Quantitative attributes:
Interval
Differences between values are meaningful (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Ratio
Can do full arithmetic on them, including ratios (*, /)
Examples: length, time, counts, temperature in Kelvin
8. 8 Dvijesh Shastri
Properties of Attribute Values
The type of an attribute depends on which of the following
properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
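The cumulative relationship between the four properties and the four attribute types can be written down as a small lookup. This is a minimal sketch; the names `PROPERTIES`, `LEVELS`, `properties_of`, and `supports` are illustrative and not part of the slides.

```python
# Properties in the order they are introduced; each attribute type
# possesses its own property plus all the ones before it.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]
LEVELS = ["nominal", "ordinal", "interval", "ratio"]


def properties_of(attr_type):
    """Return the list of properties an attribute type possesses."""
    level = LEVELS.index(attr_type)
    return PROPERTIES[: level + 1]


def supports(attr_type, prop):
    """True if values of this attribute type support the given property."""
    return prop in properties_of(attr_type)


print(properties_of("interval"))
print(supports("ordinal", "addition"))
```

For example, `supports("ordinal", "addition")` is false, which is why averaging ordinal ranks needs care.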
10. 10 Dvijesh Shastri
Why do I care?
Different types of attributes may be preprocessed differently (noise cleaning,
missing-value handling, normalization)
Ex: Missing values
for qualitative attributes: use the mode
for quantitative attributes: use the median, mean, or linear interpolation
ML algorithms may work better on certain kinds of attributes
Ex: Qualitative data: Decision Tree
Quantitative data: kNN
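The missing-value rule above (mode for qualitative attributes, median for quantitative ones) can be sketched with the standard library. The `impute` helper and the two sample columns are hypothetical, chosen only for illustration; `None` is assumed to mark a missing value.

```python
from statistics import median, mode


def impute(values, kind):
    """Fill None entries: mode for qualitative, median for quantitative."""
    present = [v for v in values if v is not None]
    if kind == "qualitative":
        fill = mode(present)
    elif kind == "quantitative":
        fill = median(present)
    else:
        raise ValueError(f"unknown attribute kind: {kind}")
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing value each.
marital = ["Single", "Married", None, "Married", "Divorced"]
income = [125, 100, None, 120, 95]

marital_filled = impute(marital, "qualitative")
income_filled = impute(income, "quantitative")
print(marital_filled)
print(income_filled)
```

The mean or linear interpolation could be swapped in for the median on quantitative attributes; the median is used here because it is less sensitive to outliers.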
14. 14 Dvijesh Shastri
Types of data sets
1. Record Data
– Data Matrix
– Document Data
– Transaction Data
2. Graph-based Data
– Social Network
– World Wide Web
3. Ordered Data
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  1     0      1     0     1      1     0    1     0        1
Document 2  0     1      0     1     1      0     0    1     0        0
Document 3  0     1      0     0     1      1     1    0     1        0
(Figure: example of graph-based data)
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
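Document data, as in the document-term table on this slide, represents each document as a vector of term counts over a fixed vocabulary. The sketch below shows one way such a vector could be built; the `term_vector` helper and the sample sentence are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

# Vocabulary taken from the column headers of the document-term table.
VOCAB = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocab=VOCAB):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocab]


# Hypothetical document, not one of the documents from the slide.
doc = "team play score game team win season"
vector = term_vector(doc)
print(vector)
```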
15. 15 Dvijesh Shastri
Why do I care?
POLL 3: Why do I care to know about the types of datasets?
16. 16 Dvijesh Shastri
Why do I care?
Type of data set determines which tools and techniques can be used to
analyze the data.
Record data: classification
Transaction data: association analysis
Graph data: network analysis
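As one illustration of matching a technique to a data type, the sketch below counts how often pairs of items occur together in the five market-basket transactions from the "Types of data sets" slide. Pair counting is a basic building block of association analysis; the `pair_support` helper is illustrative and not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the transaction data example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def pair_support(transactions):
    """Count the number of transactions containing each pair of items."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


support = pair_support(transactions)
print(support[("Diaper", "Milk")])
```

The same counting would make little sense on graph data or on a numeric data matrix, which is the point of the slide: the data type constrains the technique.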
18. 18 Dvijesh Shastri
A. Structured Data
Structured data conforms to a data model or schema
and is often stored in tabular form.
It is used to capture relationships between different
entities and is therefore most often stored in a
relational database.
Because many tools and databases natively support
structured data, it rarely requires special
consideration with regard to processing or
storage.
Examples of this type of data include banking
transactions, invoices, and customer records.
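A small sketch of what "conforms to a schema" and "captures relationships between entities" can look like in practice, using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `invoices` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema fixes the columns and types every row must follow.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

# Hypothetical rows: one customer with two invoices.
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, 1, 40.0), (2, 1, 60.0)],
)

# The customer_id column relates each invoice to its customer.
rows = conn.execute(
    "SELECT c.name, SUM(i.amount) FROM customers c"
    " JOIN invoices i ON i.customer_id = c.id GROUP BY c.id"
).fetchall()
print(rows)
```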
19. 19 Dvijesh Shastri
B. Unstructured Data
Data that does not conform to a
data model or data schema is
known as unstructured data.
It is estimated that unstructured
data makes up 80% of the data
within any given enterprise.
Unstructured data has a faster
growth rate than structured data.