Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
DMTM 2015 - 03 Data Representation
4. Prof. Pier Luca Lanzi
Contact Lenses Data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
5. Prof. Pier Luca Lanzi
• Data are often abstracted as an n×d data matrix, with n rows and
d columns
• Rows are called instances, examples, records, transactions,
objects, points, feature-vectors, etc.
• Columns are called attributes, properties, features, dimensions,
variables, fields, etc.
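In a common notation (assumed here; the symbols are not taken from the slide), with x_ij denoting the value of attribute j for instance i, such an n×d data matrix can be written as:

```latex
D =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
```

Row i is the instance (x_{i1}, ..., x_{id}) and column j collects attribute j over all n instances.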
7. Prof. Pier Luca Lanzi
Instances, Attributes, Concepts
• Instances (observations, cases)
§ The atomic elements of information from a dataset
§ Also known as records, prototypes, or examples
• Attributes (variables)
§ Measure aspects of an instance
§ Also known as features or variables
§ Each instance is composed of a certain number of attributes
• Concepts
§ Special content inside the data
§ Kind of things that can be learned
§ Intelligible and operational concept description
9. Prof. Pier Luca Lanzi
Two Versions of the Weather Data
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …
11. Prof. Pier Luca Lanzi
Attributes
• Numeric Attributes
§ Real-valued or integer-valued domain
§ Interval-scaled when only differences are meaningful
(e.g., temperature)
§ Ratio-scaled when differences and ratios are meaningful
(e.g., Age)
• Categorical Attributes
§ Set-valued domain composed of a set of symbols
§ Nominal when only equality is meaningful
(e.g., domain(Sex) = { M, F})
§ Ordinal when both equality (are two values the same?) and
inequality (is one value less than another?) are meaningful
(e.g., domain(Education) = { High School, BS, MS, PhD})
12. Prof. Pier Luca Lanzi
Numerical Attributes
• Interval-scaled values are not only ordered but also measured in fixed and equal units
• Examples
§ Attribute “temperature” expressed in degrees
§ Attribute “year”
• Characteristics
§ Difference of two values makes sense
§ Sum or product doesn’t make sense
§ Zero point is not defined
• Numeric attributes are sometimes divided into “discrete” and “continuous”
13. Prof. Pier Luca Lanzi
Ratio Attributes
• Ratio quantities are ones for which the
measurement scheme defines a zero point
• Example
§ Attribute “distance”
• Characteristics
§ Distance between an object and itself is zero
§ Ratio quantities are treated as real numbers
§ All mathematical operations are allowed
§ Is there an “inherently” defined zero point?
§ It depends on scientific knowledge
14. Prof. Pier Luca Lanzi
Nominal Attributes (or Categorical)
• Values are distinct symbols
• Values themselves serve only as labels or names
• Example
§ Attribute “outlook” from weather data
§ Values: “sunny”, “overcast”, and “rainy”
• Characteristics
§ No relation is implied among nominal values
§ No ordering
§ No distance measure
§ Only equality tests can be performed
15. Prof. Pier Luca Lanzi
Ordinal Attributes
• Impose order on values
• No distance between values defined
• Example
§ The attribute “temperature” in weather data
§ Values: “hot” > “mild” > “cool”
• Characteristics
§ Addition and subtraction don’t make sense
§ Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
16. Prof. Pier Luca Lanzi
Nominal or Ordinal?
• Attribute “age” nominal
§ If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft
• Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)
§ If age ≤ pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft
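The difference between the two readings of “age” can be sketched in a few lines. This is only an illustration: the function names and the rank table are assumptions, and the ordering is the one given in the slide's example.

```python
# Assumed ordering for the ordinal reading of "age" (taken from the slide's example).
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]
AGE_RANK = {value: rank for rank, value in enumerate(AGE_ORDER)}


def rule_nominal(age, astigmatic, tear_rate):
    """Nominal reading: only equality tests on age are allowed."""
    return age == "young" and astigmatic == "no" and tear_rate == "normal"


def rule_ordinal(age, astigmatic, tear_rate):
    """Ordinal reading: the test age <= pre-presbyopic is meaningful."""
    return (AGE_RANK[age] <= AGE_RANK["pre-presbyopic"]
            and astigmatic == "no" and tear_rate == "normal")


# Compare which ages fire each rule (True means "recommendation = soft").
for age in AGE_ORDER:
    print(age, rule_nominal(age, "no", "normal"), rule_ordinal(age, "no", "normal"))
# young True True
# pre-presbyopic False True
# presbyopic False False
```

With the ordinal reading a single rule covers two age values, which the nominal reading would need two separate rules to express.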
17. Prof. Pier Luca Lanzi
Why Specify Attribute Types?
• Some algorithms fit some specific data types best
• Express the best possible patterns in the data
• Make the most adequate comparisons
• Example
§ Outlook > “sunny” does not make sense, while
§ Temperature > “cool” or
§ Humidity > 70 does
• Additional uses of attribute type
§ Check for valid values
§ Deal with missing values, etc.
19. Prof. Pier Luca Lanzi
Why Do Missing Values Exist?
• Faulty equipment, incorrect measurements, missing cells in manual data entry, censored/anonymous data
• Review scores for movies, books, etc.
• Very frequent in questionnaires for medical scenarios
• In practice, a low rate of missing values may be suspicious
• Interview data (“Did you ever …”)
20. Prof. Pier Luca Lanzi
Missing Values
• Frequently indicated by out-of-range entries (e.g. max/min float)
• Missing value may have significance in itself
§ E.g. missing test in a medical examination
• Most schemes assume that is not the case
§ “missing” may need to be coded as additional value
• Does absence of value have some significance?
§ If it does, “missing” is a separate value
§ If it does not, “missing” must be treated in a special way
21. Prof. Pier Luca Lanzi
What Types of Missing Values?
• Missing completely at random (MCAR)
§ The probability that an example has a missing value for an attribute does not depend on either the observed data or the missing data
§ Example: each respondent is asked only a random sample of the questions in the whole questionnaire
• Missing at random (MAR)
§ The probability that an example has a missing value for an attribute depends on the observed data, but not on the missing data
§ That is, the propensity for a data point to be missing is unrelated to the missing value itself, but is related to some of the observed data
§ Whether or not someone answered question #13 on a survey has nothing to do with the missing values, but it does have to do with the values of some other variable
§ Example: respondents in service occupations are less likely to report income
• Not missing at random (NMAR)
§ The probability that an example has a missing value for an attribute depends on the missing values themselves
§ Example: respondents with high income are less likely to report income
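The three mechanisms can be contrasted with a small simulation. This is a sketch under stated assumptions: the synthetic records, the attribute names, the income threshold, and the missingness probabilities are all made up for illustration.

```python
import random


def make_missing(rows, mechanism, rng):
    """Return a copy of rows where 'income' is set to None by the given mechanism.

    The probabilities and the 50000 threshold are arbitrary illustrative choices.
    """
    result = []
    for row in rows:
        if mechanism == "MCAR":
            p = 0.3                                              # same for every example
        elif mechanism == "MAR":
            p = 0.6 if row["occupation"] == "service" else 0.1   # depends on observed data
        elif mechanism == "NMAR":
            p = 0.6 if row["income"] > 50000 else 0.1            # depends on the missing value
        else:
            raise ValueError(mechanism)
        new_row = dict(row)
        if rng.random() < p:
            new_row["income"] = None
        result.append(new_row)
    return result


# Synthetic survey data (hypothetical).
rng = random.Random(0)
rows = [{"occupation": rng.choice(["service", "office"]),
         "income": rng.randint(20000, 90000)} for _ in range(1000)]

true_mean = sum(r["income"] for r in rows) / len(rows)
print("true mean", round(true_mean))
for mechanism in ("MCAR", "MAR", "NMAR"):
    masked = make_missing(rows, mechanism, random.Random(1))
    observed = [r["income"] for r in masked if r["income"] is not None]
    print(mechanism, len(observed), round(sum(observed) / len(observed)))
```

In this setup income is generated independently of occupation, so the mean of the observed incomes should stay close to the true mean under MCAR and MAR, while under NMAR it is pulled downward because high incomes are removed more often.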
22. Prof. Pier Luca Lanzi
Dealing with Missing Values
• Use what you know
§ Why data is missing
§ Distribution of missing data
• Decide on the best strategy to yield the least biased estimates
§ Deletion Methods (listwise deletion, pairwise deletion)
§ Single Imputation Methods (mean/mode substitution, dummy variable
method, single regression)
§ Model-Based Methods (maximum likelihood, multiple imputation)
23. Prof. Pier Luca Lanzi
Strategies for Handling Missing Values
• The handling of missing data depends on the type of missingness
• Discard all the examples with missing values
§ Simplest approach
§ Allows the use of unmodified data mining methods
§ Only practical if there are few examples with missing values. Otherwise, it can introduce bias
• Fill in the missing value manually
• Convert the missing values into a new value
§ Use a special value for it
§ Add an attribute that indicates if value is missing or not
§ Greatly increases the difficulty of the data mining process
• Imputation methods
§ Assign a value to the missing one, based on the rest of the dataset. Use
the unmodified data mining methods.
24. Prof. Pier Luca Lanzi
Listwise Deletion (Complete Case Analysis)
• Only analyze cases with available data on each variable
• Simple, but reduces the data
• Comparability across analyses
• Does not use all the information
• Estimates may be biased if the data are not MCAR
25. Prof. Pier Luca Lanzi
Pairwise deletion (Available Case Analysis)
• Analysis with all cases in which the variables of interest are present
• Advantage
§ Keeps as many cases as possible for each analysis
§ Uses all information possible with each analysis
• Disadvantage
§ Can’t compare analyses because the sample is different each time
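Listwise and pairwise deletion can be compared on a toy dataset. This is a minimal sketch: the records and the function names are hypothetical, and `None` stands for a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 25,   "income": 30000, "score": 7.0},
    {"age": 40,   "income": None,  "score": 6.0},
    {"age": None, "income": 52000, "score": 8.0},
    {"age": 33,   "income": 41000, "score": None},
    {"age": 51,   "income": 60000, "score": 5.0},
]


def listwise(records):
    """Complete case analysis: keep only records with no missing value at all."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise(records, variables):
    """Available case analysis: keep records where the variables of interest are present."""
    return [r for r in records if all(r[v] is not None for v in variables)]


print(len(listwise(records)))                        # 2
print(len(pairwise(records, ["age", "income"])))     # 3
print(len(pairwise(records, ["income", "score"])))   # 3
print(len(pairwise(records, ["age", "score"])))      # 3
```

Each pairwise analysis keeps three records, but a different three each time, which is why results from different analyses are not directly comparable.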
26. Prof. Pier Luca Lanzi
Imputation methods
• Extract a model from the dataset to perform the imputation
• Suitable for MCAR and, to a lesser extent, for MAR
• Not suitable for NMAR type of missing data
• For NMAR we need to go back to the source of the data to
obtain more information
• Survey of imputation methods available at
http://sci2s.ugr.es/MVDM/index.php
http://sci2s.ugr.es/MVDM/biblio.php
27. Prof. Pier Luca Lanzi
Single Imputation Methods
• Mean/mode substitution (most common value)
§ Replace missing value with sample mean or mode
§ Run analyses as if all cases were complete
§ Advantages: Can use complete case analysis methods
§ Disadvantages: Reduces variability
• Dummy variable control
§ Create an indicator for missing value (1=value is missing for observation;
0=value is observed for observation)
§ Impute missing values to a constant (such as the mean)
§ Include missing indicator in the algorithm
§ Advantage: uses all available information about missing observation
§ Disadvantage: results in biased estimates, not theoretically driven
• Regression Imputation
§ Replaces missing values with predicted score from a regression equation.
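Mean/mode substitution and the missing-value indicator can be sketched as follows (regression imputation is not covered here). The helper names and the sample values are assumptions made for illustration; `None` stands for a missing value.

```python
from statistics import mean, mode, pvariance


def impute_mean(values):
    """Replace None with the mean of the observed values (numeric attributes)."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None with the most common observed value (categorical attributes)."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def missing_indicator(values):
    """Dummy variable: 1 if the value is missing, 0 if it is observed."""
    return [1 if v is None else 0 for v in values]


humidity = [85.0, 90.0, None, 80.0, None, 65.0]
outlook = ["sunny", None, "rainy", "sunny", "overcast", "sunny"]

observed = [v for v in humidity if v is not None]
imputed = impute_mean(humidity)

print(imputed)                       # [85.0, 90.0, 80.0, 80.0, 80.0, 65.0]
print(impute_mode(outlook))          # missing entry replaced by "sunny"
print(missing_indicator(humidity))   # [0, 0, 1, 0, 1, 0]

# Mean substitution shrinks the variance of the attribute.
print(pvariance(observed), pvariance(imputed))
```

The last line illustrates the disadvantage listed above: the imputed column has a smaller variance than the observed values alone.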
31. Prof. Pier Luca Lanzi
Inaccurate Values
• The data were not collected with mining in mind
• Errors and omissions that don’t affect original purpose of data
(e.g. age of customer)
• Typographical errors in nominal attributes,
thus values need to be checked for consistency
• Typographical and measurement errors in numeric attributes,
thus outliers need to be identified
• Errors may be deliberate (e.g. wrong zip codes)
33. Prof. Pier Luca Lanzi
The Geometrical View of the Data
• When the data matrix contains only numerical values
§ Every row can be viewed as a point in a d-dimensional space
§ Every column can be viewed as a point in an n-dimensional space
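Both views can be illustrated on a small numeric matrix. This is a sketch with made-up values; `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical numeric data matrix with n = 3 rows and d = 2 columns.
D = [
    [85.0, 85.0],
    [80.0, 90.0],
    [83.0, 86.0],
]

# Row view: each instance is a point in d-dimensional space,
# so a distance between two instances is defined.
dist_12 = math.dist(D[0], D[1])

# Column view: each attribute is a point in n-dimensional space.
columns = [list(col) for col in zip(*D)]

print(round(dist_12, 3))   # 7.071
print(columns)             # [[85.0, 80.0, 83.0], [85.0, 90.0, 86.0]]
```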
44. Prof. Pier Luca Lanzi
Data Format
• Most commercial tools have their own proprietary format
• Most tools import Excel files and comma-separated value files
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
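The two snippets above hold the same data in two CSV dialects: comma-separated with decimal points, and semicolon-separated with decimal commas. A minimal sketch of reading both with Python's standard `csv` module follows; the helper name `read_cars` is an assumption.

```python
import csv
import io

# The two dialects from the slide, as in-memory text.
comma_text = "Year,Make,Model,Length\n1997,Ford,E350,2.34\n2000,Mercury,Cougar,2.38\n"
semicolon_text = "Year;Make;Model;Length\n1997;Ford;E350;2,34\n2000;Mercury;Cougar;2,38\n"


def read_cars(text, delimiter, decimal):
    """Parse the text into dicts, converting Year to int and Length to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        row["Year"] = int(row["Year"])
        # Normalize the decimal separator before converting to float.
        row["Length"] = float(row["Length"].replace(decimal, "."))
        rows.append(row)
    return rows


a = read_cars(comma_text, ",", ".")
b = read_cars(semicolon_text, ";", ",")
print(a == b)   # True
print(a[0])
```

The delimiter and the decimal separator have to be specified together, since a decimal comma would otherwise be read as a field boundary.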
45. Prof. Pier Luca Lanzi
Attribute-Relation File Format (ARFF)
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
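To make the file structure concrete, here is a hand-written reader for the subset of ARFF used in the weather example. It is a sketch, not a full parser: `parse_arff` is a hypothetical helper, and quoted values, sparse rows, and string or date attributes are not handled.

```python
def parse_arff(text):
    """Parse a small subset of ARFF: comments, @relation, @attribute, @data."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue                          # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            data.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):          # nominal: list of allowed symbols
                kind = [s.strip() for s in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


weather = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, data = parse_arff(weather)
print(relation)        # weather
print(attributes[0])   # ('outlook', ['sunny', 'overcast', 'rainy'])
print(data[2])
```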
46. Prof. Pier Luca Lanzi
Additional Attribute Types
• ARFF supports string attributes:
§ Similar to nominal attributes but list of values is not pre-specified
@attribute description string
• ARFF also supports date attributes:
§ Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss
@attribute today date
47. Prof. Pier Luca Lanzi
Additional Attribute Types
• ARFF supports sparse data, for instance the following examples,
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
• Can also be represented as,
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
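The conversion from the dense rows to the sparse form can be sketched as below. `to_sparse` is a hypothetical helper that only formats a row as text; indices are 0-based, matching the example above.

```python
def to_sparse(row):
    """Format a dense row as a sparse ARFF row: {index value, ...} for non-zero entries."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:
            continue                      # zero entries are omitted
        text = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


print(to_sparse([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"]))  # {1 26, 6 63, 10 "class A"}
print(to_sparse([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"]))   # {3 42, 10 "class B"}
```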
48. Prof. Pier Luca Lanzi
Missing Values in ARFF
@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'
2,4.5,5.8, ?, ?,35,'ret_allw', ?, ?,'yes',11,'below_average', ?,'full', ?,'full','good'
?, ?, ?, ?, ?,38,'empl_contr', ?,5, ?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc', ?, ?, ?, ?,'yes', ?, ?, ?, ?,'yes', ?,'good'
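In the data section above, a missing value is written as `?`. A minimal sketch of reading one such line follows; `parse_row` is a hypothetical helper and assumes that quoted symbols contain no commas, which holds for this file.

```python
def parse_row(line):
    """Split one ARFF data line; '?' becomes None, quoted symbols lose their quotes."""
    values = []
    for token in line.split(","):
        token = token.strip()
        if token == "?":
            values.append(None)               # missing value
        elif token.startswith("'"):
            values.append(token.strip("'"))   # nominal symbol
        else:
            values.append(float(token))       # numeric value
    return values


row = parse_row("1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'")
missing = sum(v is None for v in row)
print(len(row), missing)   # 17 9
```

The first instance has 17 values, one per declared attribute, and more than half of them are missing.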
49. Prof. Pier Luca Lanzi
Attribute Types and Interpretation
• Interpretation of attribute types in ARFF depends
on the mining scheme
• Numeric attributes are interpreted as
§ Ordinal scales if less-than and greater-than are used
§ Ratio scales if distance calculations are performed
(normalization/standardization may be required)
• Instance-based schemes define distance between nominal values
(0 if values are equal, 1 otherwise)
• Integers in some given data file: nominal, ordinal, or ratio scale?
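A distance over mixed attributes, as an instance-based scheme might define it, can be sketched as follows. The function name, the min-max normalization, and the attribute ranges are assumptions made for illustration, not a specific tool's implementation.

```python
import math


def mixed_distance(a, b, kinds, ranges):
    """Euclidean-style distance over mixed attributes.

    kinds[i] is "numeric" or "nominal"; ranges[i] is (min, max) for numeric
    attributes (used for min-max normalization) and None for nominal ones.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "nominal":
            diff = 0.0 if x == y else 1.0     # 0 if values are equal, 1 otherwise
        else:
            lo, hi = rng
            diff = (x - y) / (hi - lo)        # rescaled so numeric attributes are comparable
        total += diff * diff
    return math.sqrt(total)


kinds = ["nominal", "numeric", "numeric", "nominal"]
ranges = [None, (64, 85), (65, 96), None]     # assumed attribute ranges
x1 = ["sunny", 85, 85, "false"]
x2 = ["sunny", 80, 90, "true"]
x3 = ["overcast", 83, 86, "false"]

print(round(mixed_distance(x1, x2, kinds, ranges), 3))
print(round(mixed_distance(x1, x3, kinds, ranges), 3))
```

Without the normalization step, an attribute with a wide numeric range would dominate the 0/1 contributions of the nominal attributes.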
50. Prof. Pier Luca Lanzi
DSPL: Dataset Publishing Language
• Open format by Google available at
http://code.google.com/apis/publicdata/
• Uses existing data: add an XML metadata file to existing CSV data files
• Read by the Google Public Data Explorer, which includes
animated bar chart, motion chart, and map visualization
• Allows linking to concepts in other datasets
• Geo-enabled: allows adding latitude and longitude data to your
concept definitions
52. Prof. Pier Luca Lanzi
Predictive Model Markup Language
• XML-based markup language developed by the Data Mining
Group (DMG) to provide a way for applications to define models
related to predictive analytics and data mining
• The goal is to share models between applications
• Vendor-independent method of defining models
• Allows models to be exchanged between applications
• PMML Components: data dictionary, data transformations, model,
mining schema, targets, output
54. Prof. Pier Luca Lanzi
Publicly Available Datasets
• UCI repository
§ http://archive.ics.uci.edu/ml/
§ Probably the most famous collection of datasets
• Kaggle
§ http://www.kaggle.com/
§ It is not a static repository of datasets, but a site that manages
Data Mining competitions
§ Example of the modern concept of crowdsourcing
55. Prof. Pier Luca Lanzi
Publicly Available Datasets
• KDNuggets
§ http://www.kdnuggets.com/datasets/
• PSPbenchmarks
§ http://www.infobiotic.net/PSPbenchmarks/
§ Datasets derived from Protein Structure Prediction problems
§ Interesting benchmarks because they can be parametrised in a
very large variety of ways
• Pascal Large Scale Learning Challenge
§ http://largescale.ml.tu-berlin.de/about/