Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Naive Bayes Classifier is a machine learning technique that is exceedingly useful to address several classification problems. It is often used as a baseline classifier to benchmark results. It is also used as a standalone classifier for tasks such as spam filtering where the naive assumption (conditional independence) made by the classifier seem reasonable. In this presentation we discuss the mathematical basis for the Naive Bayes and illustrate with examples
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Naive Bayes Classifier is a machine learning technique that is exceedingly useful to address several classification problems. It is often used as a baseline classifier to benchmark results. It is also used as a standalone classifier for tasks such as spam filtering where the naive assumption (conditional independence) made by the classifier seem reasonable. In this presentation we discuss the mathematical basis for the Naive Bayes and illustrate with examples
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano.
http://www.facebook.com/polimigamecollective
https://twitter.com/@POLIMIGC
http://www.youtube.com/PierLucaLanzi
http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano.
http://www.facebook.com/polimigamecollective
https://twitter.com/@POLIMIGC
http://www.youtube.com/PierLucaLanzi
http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering, game design, game development, sviluppo videogiochi
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering, game design, game development, sviluppo videogiochi
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano.
http://www.facebook.com/polimigamecollective
https://twitter.com/@POLIMIGC
http://www.youtube.com/PierLucaLanzi
http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering, game design, game development, sviluppo videogiochi
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering, game design, game development, sviluppo videogiochi
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Video game design and programming course for the Master in Computer Engineering at the Politecnico di Milano. http://www.facebook.com/polimigamecollective https://twitter.com/@POLIMIGC http://www.youtube.com/PierLucaLanzi http://www.polimigamecollective.org
Politecnico di Milano, Videogiochi, Video Games, Computer Engineering, game design, game development, sviluppo videogiochi
Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
The outcome of the Academy Award for Best Picture surprised us all. But, could that have been predicted? In this practical workshop you'll use a dataset that contains previous Oscar winners to build a prediction model to guess the winner for Best Picture. You'll get an introduction to a data scientist's tools and methods, including an overview of basic machine learning concepts. Unlike this year's Oscars, our model will predict only one winner!
A presentation on a special category of databases called Deductive Databases. It is an attempt to merge logic programming with relational database. Other types include Object-oriented databases, Graph databases, XML databases, Multi-model databases, etc.
Goal: Provide an overview of data mining
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
There are many examples of text-based documents (all in ‘electronic’ format…)
e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more…
Not enough time or patience to read
Can we extract the most vital kernels of information…
So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining them fully first…!
Some others (e.g. DNA seq.) are hard to comprehend!
Many of us data science and business analytics practitioners perform research and analysis for decision makers on a regular basis. The deliverable of such analysis often results in a Power Point presentation, and/or a model that needs to be productionalized. The code used to produce the analysis also needs to be considered a deliverable.
Many of us perform analysis without reproducibility in mind. With the increasing democratization of data, it is becoming more and more important for people that may not have scientific training to be able to create analysis that can be picked up by somebody else who can then reproduce your results. That, and creating reproducible research is just solid science.
We are going to spend an evening walking though the various tools available to create reproducible research on Big Data. You will get introduced to the Tidyverse of R packages and how to use them. We will discuss the ins and outs of various notebook technologies like Jupyter, and Zeppelin. You will have an opportunity to learn how to get up and running with R and Spark and the various options you have to learn on real clusters instead of just your local environment. There also be a quick introduction to source control and the various options you have around using Git.
The theme of the evening will be “getting started”. We will go over various training resources and show you the optimal path to go from zero to master. Some commentary will be provided around the current state of the job market and intel from the front lines of the data science language wars. This is a large topic and the evening will be fairly dynamic and responsive to the needs of the audience.
Bob Wakefield has spent the better part of 16 years building data systems for many organizations across various industries. He has been running Hadoop in a lab environment for 3 years. He is the principal of Mass Street Analytics, LLC a boutique data consultancy. Mass Street is a Hortonworks Consultant Partner and Confluent Partner.
In his spare time, he likes to work on an equity investment application that combines various sources of information to automatically arrive at investing decisions. When he is not doing that, you’ll find him flying his A-10 simulator. Full CV can be found here: https://www.linkedin.com/in/bobwakefieldmba/
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
Slides from the 2016/2017 edition of the Video game Design and Programming course at the Politecnico di Milano. More information at http://www.polimigamecollective.org Some of the video games developed by the students during the course are available at https://polimi-game-collective.itch.io
Slides from the 2016/2017 edition of the Video game Design and Programming course at the Politecnico di Milano. More information at http://www.polimigamecollective.org Some of the video games developed by the students during the course are available at https://polimi-game-collective.itch.io
Slides from the 2016/2017 edition of the Video game Design and Programming course at the Politecnico di Milano. More information at http://www.polimigamecollective.org Some of the video games developed by the students during the course are available at https://polimi-game-collective.itch.io
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
1. Prof. Pier Luca Lanzi
Data Preparation
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
2. Prof. Pier Luca Lanzi
Why Data Preprocessing?
• Data in the real world are “dirty”
• They are usually incomplete
§ Missing attribute values
§ Missing attributes of interest, or containing only aggregate data
• They are usually noisy, containing errors or outliers
§ e.g., Salary=“-10”
• They are usually inconsistent and contain discrepancies
§ e.g., Age=“42” Birthday=“03/07/1997”
§ e.g., Was rating “1,2,3”, now rating “A, B, C”
§ e.g., discrepancy between duplicate records
2
3. Prof. Pier Luca Lanzi
Why Is Data Dirty?
• Incomplete data may come from
§ “Not applicable” data value when collected
§ Different considerations between the time when the data was
collected and when it is analyzed.
§ Human/hardware/software problems
• Noisy data (incorrect values) may come from
§ Faulty data collection instruments
§ Human or computer error at data entry
§ Errors in data transmission
• Inconsistent data may come from
§ Different data sources
§ Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
3
4. Prof. Pier Luca Lanzi
Why Is Data Preprocessing Important?
• No quality in data, no quality in the mining results!
(trash in, trash out)
• Quality decisions needs quality in data
• E.g., duplicate or missing data may cause incorrect or even
misleading statistics.
• Data warehouses need consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the
majority of the work of building a data warehouse
4
5. Prof. Pier Luca Lanzi
Major Tasks in Data Preprocessing
• Data cleaning
§ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
§ Integration of multiple databases, data cubes, or files
• Data reduction
§ Dimensionality reduction
§ Numerosity reduction
§ Data compression
• Data transformation and data discretization
§ Normalization
§ Concept hierarchy generation
5
6. Prof. Pier Luca Lanzi
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
6
7. Prof. Pier Luca Lanzi
Missing Values
• Reasons for missing values
§ Information is not collected
(e.g., people decline to give their age and weight)
§ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
§ Eliminate Data Objects
§ Estimate Missing Values
§ Ignore the Missing Value During Analysis
§ Replace with all possible values (weighted by their
probabilities)
7
8. Prof. Pier Luca Lanzi
Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeous sources
• Examples: same person with multiple email addresses
• Data cleaning: process of dealing with duplicate data issues
8
9. Prof. Pier Luca Lanzi
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
9
10. Prof. Pier Luca Lanzi
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Data reduction
§ Reduce the number of attributes or objects
• Change of scale
§ Cities aggregated into regions
§ States, countries, etc
• More “stable” data
§ Aggregated data tends to have less variability
10
11. Prof. Pier Luca Lanzi
Sampling
• Sampling is the main technique employed for data selection
• It is often used for both the preliminary investigation of the data
and the final data analysis.
• Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming
• Sampling is used in data mining because processing the entire set
of data of interest is too expensive or time consuming
• Using a sample will work almost as well as using the entire data
sets, if the sample is representative
• A sample is representative if it has approximately the same
property (of interest) as the original set of data
11
12. Prof. Pier Luca Lanzi
Types of Sampling
• Simple Random Sampling
§ There is an equal probability of selecting any particular item
• Sampling without replacement
§ As each item is selected, it is removed from the population
• Sampling with replacement
§ Objects are not removed from the population as they are
selected for the sample.
§ In sampling with replacement, the same object can
be picked up more than once
• Stratified sampling
§ Split the data into several partitions
§ Then draw random samples from each partition
12
14. Prof. Pier Luca Lanzi
Sample Size
• What sample size is necessary to get at least
one object from each of 10 groups.
14
15. Prof. Pier Luca Lanzi
Dimensionality Reduction
• Purpose
§ Avoid curse of dimensionality
§ Reduce amount of time and memory required by data mining
algorithms
§ Allow data to be more easily visualized
§ May help to eliminate irrelevant features or reduce noise
• Techniques
§ Principle Component Analysis
§ Singular Value Decomposition
§ Others: supervised and non-linear techniques
15
16. Prof. Pier Luca Lanzi
Dimensionality Reduction:
Principal Component Analysis (PCA)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to
represent data
• Works for numeric data only
• Used when the number of dimensions is large
16
17. Prof. Pier Luca Lanzi
Dimensionality Reduction:
Principal Component Analysis (PCA)
• Normalize input data
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal
component vectors
• The principal components are sorted in order of decreasing
“significance” or strength
• Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance. (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data
17
18. Prof. Pier Luca Lanzi
Dimensionality Reduction:
Principal Component Analysis (PCA)
• Goal is to find a projection that captures the largest amount of
variation in data
18
Original axes
**
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
Data points
First principal componentSecond principal component
21. Prof. Pier Luca Lanzi
Singular Value Decomposition
• Theorem (Press 1992). It is always possible to decompose matrix
A into A = UΛVT where U, Λ, V are unique
• U, V are column orthonormal, i.e., columns are unit vectors,
orthogonal to each other, so that UTU = I; VTV = I
• Li,i are the singular values, they are positive, and sorted in
decreasing order
21
22. Prof. Pier Luca Lanzi
Singular Value Decomposition
• Data are viewed as a matrix A with n examples described by m
numerical attributes
A[n x m] = U[n x r] Λ [ r x r] (V[m x r])T
• A stores the original data
• U represents the n examples using r new concepts/attributes
• Λ represents the strength of each ‘concept’ (r is the rank of A)
• V contain m terms, r concepts
• Λi,i are called the singular values of A, if A is singular, some of the
wi will be 0; in general rank(A) = number of nonzero wi
• SVD is mostly unique (up to permutation of singular values, or if
some Li are equal)
22
23. Prof. Pier Luca Lanzi
Singular Value Decomposition
for Dimensionality Reduction
• Suppose you want to find best rank-k approximation to A
• Set all but the largest k singular values to zero
• Can form compact representation by eliminating columns of U
and V corresponding to zeroed Li
23
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
26. Prof. Pier Luca Lanzi
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
§ duplicate much or all of the information contained in one or
more other attributes
§ Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
§ contain no information that is useful for the data mining task
at hand
§ Example: students' ID is often irrelevant to the task of
predicting students' GPA
26
27. Prof. Pier Luca Lanzi
Feature Subset Selection
• Brute-force approach
§ Try all possible feature subsets as input to data mining algorithm
• Embedded approaches
§ Feature selection occurs naturally as part of
the data mining algorithm
• Filter approaches
§ Features are selected using a procedure that is
independent from a specific data mining algorithm
§ E.g., attributes are selected based on correlation measures
• Wrapper approaches
§ Use a data mining algorithm as a black box
to find best subset of attributes
§ E.g., apply a genetic algorithm and an algorithm for decision tree to
find the best set of features for a decision tree
27
28. Prof. Pier Luca Lanzi
Example of using the information gain to rank the importance of attributes.
29. Prof. Pier Luca Lanzi
Decision trees applied to all the attributes.
30. Prof. Pier Luca Lanzi
Decision trees applied to all the breast dataset without the four least ranked attributes.
31. Prof. Pier Luca Lanzi
Wrapper approach using exhaustive search and decision trees.
32. Prof. Pier Luca Lanzi
Feature Creation
• Create new attributes that can capture the important information
in a data set much more efficiently than the original attributes
• E.g., given the birthday, create the attribute age
• Three general methodologies:
§ Feature Extraction: domain-specific
§ Mapping Data to New Space
§ Feature Construction: combining features
32
33. Prof. Pier Luca Lanzi
Discretization
• Three types of attributes
§ Nominal: values from an unordered set, e.g., color
§ Ordinal: values from an ordered set,
e.g., military or academic rank
§ Continuous: real numbers,
e.g., integer or real numbers
• Discretization
§ Divide the range of a continuous attribute into intervals
§ Some classification algorithms only accept categorical
attributes
§ Reduce data size by discretization
§ Prepare for further analysis
33
34. Prof. Pier Luca Lanzi
Discretization Approaches
• Supervised
§ Attributes are discretized using the class information
§ Generates intervals that tries to minimize the loss of
information about the class
• Unsupervised
§ Attributes are discretized solely based on their values
34
35. Prof. Pier Luca Lanzi
Supervised Discretization
• Entropy based approach
35
3 categories for both x and y 5 categories for both x and y
36. Prof. Pier Luca Lanzi
Discretization Without Using Class Labels 36
Data Equal interval width
Equal frequency K-means
37. Prof. Pier Luca Lanzi
Segmentation by Natural Partitioning
• A simply 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
• If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width intervals (3
equal-width intervals for 3, 6, and 9; and 3 intervals in the
grouping of 2-3-2 for 7)
• If it covers 2, 4, or 8 distinct values at the most significant digit,
partition the range into 4 intervals
• If it covers 5 or 10 distinct values at the most significant digit,
partition the range into 5 intervals
37
38. Prof. Pier Luca Lanzi
Summary
• Data exploration and preparation, or preprocessing, is a big issue
for both data warehousing and data mining
• Descriptive data summarization is need for quality data
preprocessing
• Data preparation includes
§ Data cleaning and data integration
§ Data reduction and feature selection
§ Discretization
• A lot a methods have been developed but data preprocessing still
an active area of research
38