SlideShare a Scribd company logo
Prof. Pier Luca Lanzi
Data Representation
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
Readings
• “Data Mining and Analysis” – Chapter 1
• “Mining of Massive Datasets” – Chapter 1
2
Prof. Pier Luca Lanzi
syllabusdescribing data
Prof. Pier Luca Lanzi
Contact Lenses Data 4
NoneReducedYesHypermetropePre-presbyopic
NoneNormalYesHypermetropePre-presbyopic
NoneReducedNoMyopePresbyopic
NoneNormalNoMyopePresbyopic
NoneReducedYesMyopePresbyopic
HardNormalYesMyopePresbyopic
NoneReducedNoHypermetropePresbyopic
SoftNormalNoHypermetropePresbyopic
NoneReducedYesHypermetropePresbyopic
NoneNormalYesHypermetropePresbyopic
SoftNormalNoHypermetropePre-presbyopic
NoneReducedNoHypermetropePre-presbyopic
HardNormalYesMyopePre-presbyopic
NoneReducedYesMyopePre-presbyopic
SoftNormalNoMyopePre-presbyopic
NoneReducedNoMyopePre-presbyopic
hardNormalYesHypermetropeYoung
NoneReducedYesHypermetropeYoung
SoftNormalNoHypermetropeYoung
NoneReducedNoHypermetropeYoung
HardNormalYesMyopeYoung
NoneReducedYesMyopeYoung
SoftNormalNoMyopeYoung
NoneReducedNoMyopeYoung
Recommended lensesTear production rateAstigmatismSpectacle prescriptionAge
Prof. Pier Luca Lanzi
Data Representation
• Data are typically abstracted as a matrix, with n rows and d
columns, given as
• Rows are called instances, examples, records, transactions,
objects, points, feature-vectors, etc.
• Columns are called attributes, properties, features, dimensions,
variables, fields, etc.
5
Prof. Pier Luca Lanzi
6
Prof. Pier Luca Lanzi
Instances, Attributes, Concepts
• Instances (observations, case)
§ The atomic elements of information from a dataset
§ Also known as records, prototypes, or examples
• Attributes (variable)
§ Measures aspects of an instance
§ Also known as features or variables
§ Each instance is composed of a certain number of attributes
• Concepts
§ Special content inside the data
§ Kind of things that can be learned
§ Intelligible and operational concept description
7
Prof. Pier Luca Lanzi
Two Versions of the Weather Data 8
……………
YesFalseNormalMildRainy
YesFalseHighHotOvercast
NoTrueHighHotSunny
NoFalseHighHotSunny
PlayWindyHumidityTemperatureOutlook
……………
YesFalse8075Rainy
YesFalse8683Overcast
NoTrue9080Sunny
NoFalse8585Sunny
PlayWindyHumidityTemperatureOutlook
Prof. Pier Luca Lanzi
syllabusattribute types
Prof. Pier Luca Lanzi
Attributes
• Numeric Attributes
§Real-valued or integer-valued domain
§Interval-scaled when only differences are meaningful
(e.g., temperature)
§Ratio-scaled when differences and ratios are meaningful
(e.g., Age)
• Categorical Attributes
§Set-valued domain composed of a set of symbols
§Nominal when only equality is meaningful
(e.g., domain(Sex) = { M, F})
§Ordinal when both equality (are two values the same?) and
inequality (is one value less than another?) are meaningful
(e.g., domain(Education) = { High School, BS, MS, PhD})
10
Prof. Pier Luca Lanzi
Numerical Attributes
• Not only ordered but measured in fixed and equal units
• Examples
§Attribute “temperature” expressed in degrees
§Attribute “year”
• Characteristics
§Difference of two values makes sense
§Sum or product doesn’t make sense
§Zero point is not defined
• Sometimes they are divided into “discrete” and “continuous”
11
Prof. Pier Luca Lanzi
Nominal Attributes (or Categorical)
• Values are distinct symbols
• Values themselves serve only as labels or names
• Example
§Attribute “outlook” from weather data
§Values: “sunny”, “overcast”, and “rainy”
• Characteristics
§No relation is implied among nominal values
§No ordering
§No distance measure
§Only equality tests can be performed
12
Prof. Pier Luca Lanzi
Other Types of Attributes
• Ratio Attributes
§Numerical attributes for which the measurement scheme
defines a zero point (e.g., an attribute representing distance)
• Ordinal Attributes
§Categorical attributes with an imposed order on values
§No distance between values defined
§For instance, the attribute “temperature” in weather data
“hot” > “mild” > “cool”
13
Prof. Pier Luca Lanzi
Nominal or Ordinal?
• Attribute “age” nominal
§If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft
• Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)
§If age≤pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft
14
Prof. Pier Luca Lanzi
Why Specifying Attribute Types?
• Some algorithms fit some specific data types best
• Express the best possible patterns into data
• Make the most adequate comparisons
• Example
§Outlook > “sunny” does not make sense, while
§Temperature > “cool” or
§Humidity > 70 does
• Additional uses of attribute type
§Check for valid values
§Deal with missing values, etc.
15
Prof. Pier Luca Lanzi
syllabusmissing values
Prof. Pier Luca Lanzi
Why Missing Values Exist?
• Faulty equipment, incorrect measurements, missing cells in manual
data entry, censored/anonymous data
• Review scores for movies, books, etc.
• Very frequent in questionnaires for medical scenarios
• Censored/anonymous data
• In practice, a low rate of missing values may be suspicious
• Interview data (“Did you ever …”)
17
Prof. Pier Luca Lanzi
Missing Values
• Frequently indicated by out-of-range entries (e.g. max/min float),
Nan or special values (e.g., zero)
• Missing value may have significance in itself
§E.g. missing test in a medical examination
• Most schemes assume that is not the case
§“missing” may need to be coded as additional value
• Does absence of value have some significance?
§If it does, “missing” is a separate value
§If it does not, “missing” must be treated in a special way
18
Prof. Pier Luca Lanzi
What Types of Missing Values?
• Missing completely at random (MCAR)
§ The distribution of an example having a missing value for an attribute does not depend on
either the observed data or the missing data
§ Example: some survey questions contain a random sample of the whole questionnaire
• Missing at random (MAR)
§ The distribution of an example having a missing value for an attribute depends on the
observed data, but does not depend on the missing data
§ Missing at Random means the propensity for a data point to be missing is not related to the
missing data, but it is related to some of the observed data
§ Whether or not someone answered #13 on your survey has nothing to do with the missing
values, but it does have to do with the values of some other variable
§ For example, people who don’t declare their salary not because of the amount of it but just
because they don’t want to.
• Not missing at random (NMAR)
§ The distribution of an example having a missing depends on the missing values.
§ For example, respondents with high income less likely to report income
• Note that NMAR and MAR might be difficult to identify and often require domain knowledge
19
Prof. Pier Luca Lanzi
Dealing with Missing Values
• Use what you know
§ Why data is missing
§ Distribution of missing data
• Decide on the best strategy to yield the least biased estimates
§ Deletion Methods (listwise deletion, pairwise deletion)
§ Single Imputation Methods (mean/mode substitution, dummy variable
method, single regression)
§ Model-Based Methods (maximum Likelihood, multiple imputation
20
Prof. Pier Luca Lanzi
Strategies for missing values handling
• The handling of missing data depends on the type
• Discarding all the examples with a missing values
§ Simplest approach
§ Allows the use of unmodified data mining methods
§ Only practical if there are few examples with missing values. Otherwise, it
can introduce bias
• Fill in the missing value manually J
• Convert the missing values into a new value
§ Use a special value for it
§ Add an attribute that indicates if value is missing or not
§ Greatly increases the difficulty of the data mining process
• Imputation methods
§ Assign a value to the missing one, based on the rest of the dataset. Use
the unmodified data mining methods.
21
Prof. Pier Luca Lanzi
Listwise Deletion
(Complete Case Analysis)
• Only analyze cases with available data
on each variable
• Simple, but reduces the data
• Comparability across analyses
• Does not use all the information
• Estimates may be biased if data not
MCAR
22
Prof. Pier Luca Lanzi
Pairwise deletion
(Available Case Analysis)
• Analysis with all cases in which
the variables of interest are
present
• Advantage
§Keeps as many cases as
possible for each analysis
§Uses all information
possible with each analysis
• Disadvantage
§Can’t compare analyses
because sample different
each time
23
Prof. Pier Luca Lanzi
Imputation methods
• Extract a model from the dataset to perform the imputation
• Suitable for MCAR and, to a lesser extent, for MAR
• Not suitable for NMAR type of missing data
• For NMAR we need to go back to the source of the data to
obtain more information
• Survey of imputation methods available at
http://sci2s.ugr.es/MVDM/index.php
http://sci2s.ugr.es/MVDM/biblio.php
24
Prof. Pier Luca Lanzi
Single Imputation Methods
• Mean/mode substitution (most common value)
§ Replace missing value with sample mean or mode
§ Run analyses as if all complete cases
§ Advantages: Can use complete case analysis methods
§ Disadvantages: Reduces variability
• Dummy variable control
§ Create an indicator for missing value (1=value is missing for observation;
0=value is observed for observation)
§ Impute missing values to a constant (such as the mean)
§ Include missing indicator in the algorithm
§ Advantage: uses all available information about missing observation
§ Disadvantage: results in biased estimates, not theoretically driven
• Regression Imputation
§ Replaces missing values with predicted score from a regression equation.
25
Prof. Pier Luca Lanzi
Do Not Impute (DNI)
• Simply use the default policy of the data mining method
• Works only if the policy exists
26
Prof. Pier Luca Lanzi
syllabusinaccurate values
Prof. Pier Luca Lanzi
Inaccurate Values
• Data has not been collected for mining it
• Errors and omissions that don’t affect original purpose of data
(e.g. age of customer)
• Typographical errors in nominal attributes,
thus values need to be checked for consistency
• Typographical and measurement errors in numeric attributes,
thus outliers need to be identified
• Errors may be deliberate (e.g. wrong zip codes)
28
Prof. Pier Luca Lanzi
syllabusthe geometric view
Prof. Pier Luca Lanzi
The Geometrical View of the Data
• When the data matrix contains only numerical values
§Every row can be viewed as a point in a d-dimension space
§Every column as a point in a n-dimensional space
30
Prof. Pier Luca Lanzi
31
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
syllabusthe probabilistic view
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
syllabusdata format
Prof. Pier Luca Lanzi
Data Format
• Most commercial tools have their own proprietary format
• Most tools import excel files and comma-separated value files
37
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
Prof. Pier Luca Lanzi
Attribute-Relation File Format
(ARFF)
38
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
Prof. Pier Luca Lanzi
Additional Attribute Types
• ARFF supports string attributes:
• Similar to nominal attributes but list of values
is not pre-specified
• ARFF also supports date attributes:
• Uses the ISO-8601 combined date
and time format yyyy-MM-dd-THH:mm:ss
39
@attribute description string
@attribute today date
Prof. Pier Luca Lanzi
Additional Attribute Types
• ARFF supports sparse data, for instance the following examples,
• Can also be represented as,
40
0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A”
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”
{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}
Prof. Pier Luca Lanzi
Missing Values in ARFF 41
@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'
2,4.5,5.8, ?, ?,35,'ret_allw', ?, ?,'yes',11,'below_average', ?,'full', ?,'full','good'
?, ?, ?, ?, ?,38,'empl_contr', ?,5, ?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc', ?, ?, ?, ?,'yes', ?, ?, ?, ?,'yes', ?,'good'
Prof. Pier Luca Lanzi
Attribute Types and Interpretation
• Interpretation of attribute types in ARFF depends
on the mining scheme
• Numeric attributes are interpreted as
§Ordinal scales if less-than and greater-than are used
§Ratio scales if distance calculations are performed
(normalization/standardization may be required)
• Instance-based schemes define distance between nominal values
(0 if values are equal, 1 otherwise)
• Integers in some given data file: nominal, ordinal, or ratio scale?
42
Prof. Pier Luca Lanzi
DSPL: Dataset Publishing Language
• Open format by Google available at
http://code.google.com/apis/publicdata/
• Use existing data: add an XML metadata file existing CSV
• Read by the Google Public Data Explorer, which includes
animated bar chart, motion chart, and map visualization
• Allow linking to concepts in other datasets
• Geo-enabled: allows adding latitude and longitude data to your
concept definitions
43
Prof. Pier Luca Lanzi
syllabusmodel representation
Prof. Pier Luca Lanzi
Predictive Model Markup Language
• XML-based markup language developed by the Data Mining
Group (DMG) to provide a way for applications to define models
related to predictive analytics and data mining
• The goal is to share models between applications
• Vendor-independent method of defining models
• Allow to exchange of models between applications.
• PMML Components: data dictionary, data transformations, model,
mining schema, targets, output
45
Prof. Pier Luca Lanzi
syllabusdata repositories
Prof. Pier Luca Lanzi
Publicly Available Datasets
• UCI repository
§http://archive.ics.uci.edu/ml/
§Probably the most famous collection of datasets
• Kaggle
§http://www.kaggle.com/
§It is not a static repository of datasets, but a site that manages
Data Mining competitions
§Example of the modern concept of crowdsourcing
• KDNuggets
§http://www.kdnuggets.com/datasets/
47

More Related Content

What's hot

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
Iffat Firozy
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
hktripathy
 
2.3 Graphs that enlighten and graphs that deceive
2.3 Graphs that enlighten and graphs that deceive2.3 Graphs that enlighten and graphs that deceive
2.3 Graphs that enlighten and graphs that deceive
Long Beach City College
 
Missing Data and Causes
Missing Data and CausesMissing Data and Causes
Missing Data and Causes
akanni azeez olamide
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
Knoldus Inc.
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
Institute of Technology Telkom
 
Data reduction
Data reductionData reduction
Data reduction
GowriLatha1
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”
Dr.(Mrs).Gethsiyal Augasta
 
Missing Data imputation
Missing Data imputationMissing Data imputation
Missing Data imputation
Наталя Шаховська
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
Krish_ver2
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
guest0edcaf
 
Data discretization
Data discretizationData discretization
Data discretization
Hadi M.Abachi
 
Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018
digitalzombie
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
NTUConcepts1
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
SUBBIAH SURESH
 
Missing Data and data imputation techniques
Missing Data and data imputation techniquesMissing Data and data imputation techniques
Missing Data and data imputation techniques
Omar F. Althuwaynee
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingksamyMCA
 

What's hot (20)

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
2.3 Graphs that enlighten and graphs that deceive
2.3 Graphs that enlighten and graphs that deceive2.3 Graphs that enlighten and graphs that deceive
2.3 Graphs that enlighten and graphs that deceive
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Missing Data and Causes
Missing Data and CausesMissing Data and Causes
Missing Data and Causes
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Descriptive statistics -review(2)
Descriptive statistics -review(2)Descriptive statistics -review(2)
Descriptive statistics -review(2)
 
Data reduction
Data reductionData reduction
Data reduction
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”
 
Missing Data imputation
Missing Data imputationMissing Data imputation
Missing Data imputation
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Data discretization
Data discretizationData discretization
Data discretization
 
Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 
Data PreProcessing
Data PreProcessingData PreProcessing
Data PreProcessing
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
 
Missing Data and data imputation techniques
Missing Data and data imputation techniquesMissing Data and data imputation techniques
Missing Data and data imputation techniques
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Similar to DMTM Lecture 05 Data representation

DMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data RepresentationDMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data Representation
Pier Luca Lanzi
 
DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data Preparation
Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
Pier Luca Lanzi
 
Simple math for anomaly detection toufic boubez - metafor software - monito...
Simple math for anomaly detection   toufic boubez - metafor software - monito...Simple math for anomaly detection   toufic boubez - metafor software - monito...
Simple math for anomaly detection toufic boubez - metafor software - monito...
tboubez
 
Making sense of citizen science data: A review of methods
Making sense of citizen science data: A review of methodsMaking sense of citizen science data: A review of methods
Making sense of citizen science data: A review of methods
olivier gimenez
 
2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higginsrgveroniki
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
Quinton Anderson
 
Anomalies! You can't escape them.
Anomalies! You can't escape them.Anomalies! You can't escape them.
Anomalies! You can't escape them.
CSIRO
 
Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI
Emily Yunha Shin
 
07 Handling of Uncertainties in the Safety Case
07 Handling of Uncertainties in the Safety Case07 Handling of Uncertainties in the Safety Case
07 Handling of Uncertainties in the Safety Case
Sandia National Laboratories: Energy & Climate: Renewables
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
Library and Information Science Research Coalition
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
PrudhvirajEluri1
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
jillmitchell8778
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
jillmitchell8778
 
PSYA4 - Research methods
PSYA4 - Research methodsPSYA4 - Research methods
PSYA4 - Research methodsNicky Burt
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
tboubez
 
Data in science
Data in science Data in science
Data in science
Sreejith Aravindakshan
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
Pier Luca Lanzi
 

Similar to DMTM Lecture 05 Data representation (20)

DMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data RepresentationDMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data Representation
 
DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data Preparation
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
Simple math for anomaly detection toufic boubez - metafor software - monito...
Simple math for anomaly detection   toufic boubez - metafor software - monito...Simple math for anomaly detection   toufic boubez - metafor software - monito...
Simple math for anomaly detection toufic boubez - metafor software - monito...
 
Making sense of citizen science data: A review of methods
Making sense of citizen science data: A review of methodsMaking sense of citizen science data: A review of methods
Making sense of citizen science data: A review of methods
 
SQLDay2013_MarcinSzeliga_DataInDataMining
SQLDay2013_MarcinSzeliga_DataInDataMiningSQLDay2013_MarcinSzeliga_DataInDataMining
SQLDay2013_MarcinSzeliga_DataInDataMining
 
2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Anomalies! You can't escape them.
Anomalies! You can't escape them.Anomalies! You can't escape them.
Anomalies! You can't escape them.
 
Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI
 
07 Handling of Uncertainties in the Safety Case
07 Handling of Uncertainties in the Safety Case07 Handling of Uncertainties in the Safety Case
07 Handling of Uncertainties in the Safety Case
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
PSYA4 - Research methods
PSYA4 - Research methodsPSYA4 - Research methods
PSYA4 - Research methods
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 
Data in science
Data in science Data in science
Data in science
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
Stats chapter 5
Stats chapter 5Stats chapter 5
Stats chapter 5
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
Pier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
Pier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
Pier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
Pier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
Pier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
Pier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
Pier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
Pier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
Pier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
Pier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
Pier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
Pier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
Pier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
Pier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
Pier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
Pier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
Pier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
Pier Luca Lanzi
 

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 

Recently uploaded

PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
ak6969907
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
deeptiverma2406
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
NelTorrente
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 

Recently uploaded (20)

PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
 
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 

DMTM Lecture 05 Data representation

  • 1. Prof. Pier Luca Lanzi Data Representation Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Readings • “Data Mining and Analysis” – Chapter 1 • “Mining of Massive Datasets” – Chapter 1 2
  • 3. Prof. Pier Luca Lanzi syllabusdescribing data
  • 4. Prof. Pier Luca Lanzi Contact Lenses Data 4 NoneReducedYesHypermetropePre-presbyopic NoneNormalYesHypermetropePre-presbyopic NoneReducedNoMyopePresbyopic NoneNormalNoMyopePresbyopic NoneReducedYesMyopePresbyopic HardNormalYesMyopePresbyopic NoneReducedNoHypermetropePresbyopic SoftNormalNoHypermetropePresbyopic NoneReducedYesHypermetropePresbyopic NoneNormalYesHypermetropePresbyopic SoftNormalNoHypermetropePre-presbyopic NoneReducedNoHypermetropePre-presbyopic HardNormalYesMyopePre-presbyopic NoneReducedYesMyopePre-presbyopic SoftNormalNoMyopePre-presbyopic NoneReducedNoMyopePre-presbyopic hardNormalYesHypermetropeYoung NoneReducedYesHypermetropeYoung SoftNormalNoHypermetropeYoung NoneReducedNoHypermetropeYoung HardNormalYesMyopeYoung NoneReducedYesMyopeYoung SoftNormalNoMyopeYoung NoneReducedNoMyopeYoung Recommended lensesTear production rateAstigmatismSpectacle prescriptionAge
  • 5. Prof. Pier Luca Lanzi Data Representation • Data are typically abstracted as a matrix, with n rows and d columns, given as • Rows are called instances, examples, records, transactions, objects, points, feature-vectors, etc. • Columns are called attributes, properties, features, dimensions, variables, fields, etc. 5
  • 6. Prof. Pier Luca Lanzi 6
  • 7. Prof. Pier Luca Lanzi Instances, Attributes, Concepts • Instances (observations, case) § The atomic elements of information from a dataset § Also known as records, prototypes, or examples • Attributes (variable) § Measures aspects of an instance § Also known as features or variables § Each instance is composed of a certain number of attributes • Concepts § Special content inside the data § Kind of things that can be learned § Intelligible and operational concept description 7
  • 8. Prof. Pier Luca Lanzi Two Versions of the Weather Data 8 …………… YesFalseNormalMildRainy YesFalseHighHotOvercast NoTrueHighHotSunny NoFalseHighHotSunny PlayWindyHumidityTemperatureOutlook …………… YesFalse8075Rainy YesFalse8683Overcast NoTrue9080Sunny NoFalse8585Sunny PlayWindyHumidityTemperatureOutlook
  • 9. Prof. Pier Luca Lanzi syllabusattribute types
  • 10. Prof. Pier Luca Lanzi Attributes • Numeric Attributes §Real-valued or integer-valued domain §Interval-scaled when only differences are meaningful (e.g., temperature) §Ratio-scaled when differences and ratios are meaningful (e.g., Age) • Categorical Attributes §Set-valued domain composed of a set of symbols §Nominal when only equality is meaningful (e.g., domain(Sex) = { M, F}) §Ordinal when both equality (are two values the same?) and inequality (is one value less than another?) are meaningful (e.g., domain(Education) = { High School, BS, MS, PhD}) 10
  • 11. Prof. Pier Luca Lanzi Numerical Attributes • Not only ordered but measured in fixed and equal units • Examples §Attribute “temperature” expressed in degrees §Attribute “year” • Characteristics §Difference of two values makes sense §Sum or product doesn’t make sense §Zero point is not defined • Sometimes they are divided into “discrete” and “continuous” 11
  • 12. Prof. Pier Luca Lanzi Nominal Attributes (or Categorical) • Values are distinct symbols • Values themselves serve only as labels or names • Example §Attribute “outlook” from weather data §Values: “sunny”, “overcast”, and “rainy” • Characteristics §No relation is implied among nominal values §No ordering §No distance measure §Only equality tests can be performed 12
  • 13. Prof. Pier Luca Lanzi Other Types of Attributes • Ratio Attributes §Numerical attributes for which the measurement scheme defines a zero point (e.g., an attribute representing distance) • Ordinal Attributes §Categorical attributes with an imposed order on values §No distance between values defined §For instance, the attribute “temperature” in weather data “hot” > “mild” > “cool” 13
  • 14. Prof. Pier Luca Lanzi Nominal or Ordinal? • Attribute “age” nominal §If age = young and astigmatic = no and tear production rate = normal then recommendation = soft • Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”) §If age≤pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft 14
  • 15. Prof. Pier Luca Lanzi Why Specifying Attribute Types? • Some algorithms fit some specific data types best • Express the best possible patterns into data • Make the most adequate comparisons • Example §Outlook > “sunny” does not make sense, while §Temperature > “cool” or §Humidity > 70 does • Additional uses of attribute type §Check for valid values §Deal with missing values, etc. 15
  • 16. Prof. Pier Luca Lanzi syllabusmissing values
  • 17. Prof. Pier Luca Lanzi Why Missing Values Exist? • Faulty equipment, incorrect measurements, missing cells in manual data entry, censored/anonymous data • Review scores for movies, books, etc. • Very frequent in questionnaires for medical scenarios • Censored/anonymous data • In practice, a low rate of missing values may be suspicious • Interview data (“Did you ever …”) 17
  • 18. Prof. Pier Luca Lanzi Missing Values • Frequently indicated by out-of-range entries (e.g. max/min float), Nan or special values (e.g., zero) • Missing value may have significance in itself §E.g. missing test in a medical examination • Most schemes assume that is not the case §“missing” may need to be coded as additional value • Does absence of value have some significance? §If it does, “missing” is a separate value §If it does not, “missing” must be treated in a special way 18
  • 19. Prof. Pier Luca Lanzi What Types of Missing Values? • Missing completely at random (MCAR) § The distribution of an example having a missing value for an attribute does not depend on either the observed data or the missing data § Example: some survey questions contain a random sample of the whole questionnaire • Missing at random (MAR) § The distribution of an example having a missing value for an attribute depends on the observed data, but does not depend on the missing data § Missing at Random means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data § Whether or not someone answered #13 on your survey has nothing to do with the missing values, but it does have to do with the values of some other variable § For example, people who don’t declare their salary not because of the amount of it but just because they don’t want to. • Not missing at random (NMAR) § The distribution of an example having a missing depends on the missing values. § For example, respondents with high income less likely to report income • Note that NMAR and MAR might be difficult to identify and often require domain knowledge 19
  • 20. Prof. Pier Luca Lanzi Dealing with Missing Values • Use what you know § Why data is missing § Distribution of missing data • Decide on the best strategy to yield the least biased estimates § Deletion Methods (listwise deletion, pairwise deletion) § Single Imputation Methods (mean/mode substitution, dummy variable method, single regression) § Model-Based Methods (maximum Likelihood, multiple imputation 20
  • 21. Prof. Pier Luca Lanzi Strategies for missing values handling • The handling of missing data depends on the type • Discarding all the examples with a missing values § Simplest approach § Allows the use of unmodified data mining methods § Only practical if there are few examples with missing values. Otherwise, it can introduce bias • Fill in the missing value manually J • Convert the missing values into a new value § Use a special value for it § Add an attribute that indicates if value is missing or not § Greatly increases the difficulty of the data mining process • Imputation methods § Assign a value to the missing one, based on the rest of the dataset. Use the unmodified data mining methods. 21
  • 22. Prof. Pier Luca Lanzi Listwise Deletion (Complete Case Analysis) • Only analyze cases with available data on each variable • Simple, but reduces the data • Comparability across analyses • Does not use all the information • Estimates may be biased if data not MCAR 22
  • 23. Prof. Pier Luca Lanzi Pairwise deletion (Available Case Analysis) • Analysis with all cases in which the variables of interest are present • Advantage §Keeps as many cases as possible for each analysis §Uses all information possible with each analysis • Disadvantage §Can’t compare analyses because sample different each time 23
  • 24. Prof. Pier Luca Lanzi Imputation methods • Extract a model from the dataset to perform the imputation • Suitable for MCAR and, to a lesser extent, for MAR • Not suitable for NMAR type of missing data • For NMAR we need to go back to the source of the data to obtain more information • Survey of imputation methods available at http://sci2s.ugr.es/MVDM/index.php http://sci2s.ugr.es/MVDM/biblio.php 24
  • 25. Prof. Pier Luca Lanzi Single Imputation Methods • Mean/mode substitution (most common value) § Replace missing value with sample mean or mode § Run analyses as if all complete cases § Advantages: Can use complete case analysis methods § Disadvantages: Reduces variability • Dummy variable control § Create an indicator for missing value (1=value is missing for observation; 0=value is observed for observation) § Impute missing values to a constant (such as the mean) § Include missing indicator in the algorithm § Advantage: uses all available information about missing observation § Disadvantage: results in biased estimates, not theoretically driven • Regression Imputation § Replaces missing values with predicted score from a regression equation. 25
  • 26. Prof. Pier Luca Lanzi Do Not Impute (DNI) • Simply use the default policy of the data mining method • Works only if the policy exists 26
  • 27. Prof. Pier Luca Lanzi syllabusinaccurate values
  • 28. Prof. Pier Luca Lanzi Inaccurate Values • Data has not been collected for mining it • Errors and omissions that don’t affect original purpose of data (e.g. age of customer) • Typographical errors in nominal attributes, thus values need to be checked for consistency • Typographical and measurement errors in numeric attributes, thus outliers need to be identified • Errors may be deliberate (e.g. wrong zip codes) 28
  • 29. Prof. Pier Luca Lanzi syllabusthe geometric view
  • 30. Prof. Pier Luca Lanzi The Geometrical View of the Data • When the data matrix contains only numerical values §Every row can be viewed as a point in a d-dimension space §Every column as a point in a n-dimensional space 30
  • 31. Prof. Pier Luca Lanzi 31
  • 33. Prof. Pier Luca Lanzi syllabusthe probabilistic view
  • 36. Prof. Pier Luca Lanzi syllabusdata format
  • 37. Prof. Pier Luca Lanzi Data Format • Most commercial tools have their own proprietary format • Most tools import excel files and comma-separated value files 37 Year,Make,Model,Length 1997,Ford,E350,2.34 2000,Mercury,Cougar,2.38 Year;Make;Model;Length 1997;Ford;E350;2,34 2000;Mercury;Cougar;2,38
  • 38. Prof. Pier Luca Lanzi Attribute-Relation File Format (ARFF) 38 % % ARFF file for weather data with some numeric features % @relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature numeric @attribute humidity numeric @attribute windy {true, false} @attribute play? {yes, no} @data sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ... http://www.cs.waikato.ac.nz/~ml/weka/arff.html
  • 39. Prof. Pier Luca Lanzi Additional Attribute Types • ARFF supports string attributes: • Similar to nominal attributes but list of values is not pre-specified • ARFF also supports date attributes: • Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss 39 @attribute description string @attribute today date
  • 40. Prof. Pier Luca Lanzi Additional Attribute Types • ARFF supports sparse data, for instance the following examples, • Can also be represented as, 40 0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A” 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B” {1 26, 6 63, 10 “class A”} {3 42, 10 “class B”}
  • 41. Prof. Pier Luca Lanzi Missing Values in ARFF 41 @relation labor @attribute 'duration' real @attribute 'wage-increase-first-year' real @attribute 'wage-increase-second-year' real @attribute 'wage-increase-third-year' real @attribute 'cost-of-living-adjustment' {'none','tcf','tc'} @attribute 'working-hours' real @attribute 'pension' {'none','ret_allw','empl_contr'} @attribute 'standby-pay' real @attribute 'shift-differential' real @attribute 'education-allowance' {'yes','no'} @attribute 'statutory-holidays' real @attribute 'vacation' {'below_average','average','generous'} @attribute 'longterm-disability-assistance' {'yes','no'} @attribute 'contribution-to-dental-plan' {'none','half','full'} @attribute 'bereavement-assistance' {'yes','no'} @attribute 'contribution-to-health-plan' {'none','half','full'} @attribute 'class' {'bad','good'} @data 1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good' 2,4.5,5.8, ?, ?,35,'ret_allw', ?, ?,'yes',11,'below_average', ?,'full', ?,'full','good' ?, ?, ?, ?, ?,38,'empl_contr', ?,5, ?,11,'generous','yes','half','yes','half','good' 3,3.7,4,5,'tc', ?, ?, ?, ?,'yes', ?, ?, ?, ?,'yes', ?,'good'
  • 42. Prof. Pier Luca Lanzi Attribute Types and Interpretation • Interpretation of attribute types in ARFF depends on the mining scheme • Numeric attributes are interpreted as §Ordinal scales if less-than and greater-than are used §Ratio scales if distance calculations are performed (normalization/standardization may be required) • Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise) • Integers in some given data file: nominal, ordinal, or ratio scale? 42
  • 43. Prof. Pier Luca Lanzi DSPL: Dataset Publishing Language • Open format by Google available at http://code.google.com/apis/publicdata/ • Use existing data: add an XML metadata file existing CSV • Read by the Google Public Data Explorer, which includes animated bar chart, motion chart, and map visualization • Allow linking to concepts in other datasets • Geo-enabled: allows adding latitude and longitude data to your concept definitions 43
  • 44. Prof. Pier Luca Lanzi syllabusmodel representation
  • 45. Prof. Pier Luca Lanzi Predictive Model Markup Language • XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining • The goal is to share models between applications • Vendor-independent method of defining models • Allow to exchange of models between applications. • PMML Components: data dictionary, data transformations, model, mining schema, targets, output 45
  • 46. Prof. Pier Luca Lanzi syllabusdata repositories
  • 47. Prof. Pier Luca Lanzi Publicly Available Datasets • UCI repository §http://archive.ics.uci.edu/ml/ §Probably the most famous collection of datasets • Kaggle §http://www.kaggle.com/ §It is not a static repository of datasets, but a site that manages Data Mining competitions §Example of the modern concept of crowdsourcing • KDNuggets §http://www.kdnuggets.com/datasets/ 47