Machine Learning for Data Mining
Introduction
Andres Mendez-Vazquez
May 13, 2015
Outline
1 Why are we interested in Analyzing Data?
Intuitive Definition: The 3V's
Complexity
Data Everywhere
2 Machine Learning
Machine Learning Process
Features
Classification
Clustering Analysis
3 Data Mining
Definition
Applications
Example: Frequent Itemsets
4 Hardware Support
ASICs
GPUs
5 Projects
What projects can you do?
Intuitive Definition: Volume
When looking at the Volume of Information, we have:
Volumes of it: Terabytes ($10^{12}$ bytes), Petabytes ($10^{15}$ bytes) and up!
Examples of these Volumes are
1 Records
2 Transactions
3 Web Searches
4 etc.
However
Something Notable
What constitutes truly "high" volume varies by industry and even by geography.
Simply look at the DNA data for a single cell cycle.
Intuitive Definition: Variety
When looking at the Structure of the Information, we have:
Variety like there is no tomorrow:
It is structured, semi-structured, unstructured
So
Can you give some examples of structure in Information?
Intuitive Definition: Velocity
When looking at the Velocity of this Information, we have:
Data in Motion!
Velocity:
Dynamic Generation
Real-Time Generation
Problems with that: Latency
The lag time between when data is captured or generated and when it becomes available!
For example
Imagine that I have a stream of $m = 10^{25}$ integers, each taking a value in
$\{a_1, \dots, a_n\}$ with $n = 10{,}000{,}000$.
Now, somebody asks you to find the most frequent item!
A naive algorithm
1 Take a hash table that maps each item to a counter.
2 Then, for each number in the stream, increment its counter in the hash table.
Problems
Which problems do we have?
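The naive algorithm above can be sketched in a few lines of Python. This is only an illustration: the function name `most_frequent` and the toy stream are made up here, and the stream is assumed to be any iterable of integers. The point is the size of the table it returns: one exact counter per distinct item, so memory can grow to $n$ counters, each wide enough to count up to $m$.

```python
from collections import Counter

def most_frequent(stream):
    """Return (item, count, table_size) for the most frequent item.

    Naive approach: one exact counter per distinct item, so memory
    grows with the number of distinct values seen.
    """
    counts = Counter()
    for x in stream:          # single pass over the stream
        counts[x] += 1
    item, count = counts.most_common(1)[0]
    return item, count, len(counts)

# Hypothetical toy stream; a real stream of this kind would not fit in memory.
toy_stream = [3, 7, 3, 1, 7, 3, 9]
print(most_frequent(toy_stream))
```

The third value returned is the number of counters kept, which is the quantity that becomes a problem at scale.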
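However
There is the
Count-Min Sketch Algorithm
Invented by
Cormode and Muthukrishnan in 2004 (the related Count Sketch is due to Charikar, Chen and Farach-Colton)
With Properties
Space Used: $O\left(\frac{1}{\epsilon} \log\frac{1}{\delta} \cdot (\log m + \log n)\right)$
Error Probability: $\delta$
Error: $\epsilon$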
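A minimal sketch of a Count-Min Sketch in Python, under stated assumptions: the class name `CountMinSketch`, the `seed` parameter, and the hash family $((a x + b) \bmod p) \bmod w$ are illustrative choices, not taken from the slides. The table uses width $w = \lceil e/\epsilon \rceil$ and depth $d = \lceil \ln(1/\delta) \rceil$, which is the usual parameterization behind the space bound above. Items are assumed to be non-negative integers and $0 < \delta < 1$.

```python
import math
import random

class CountMinSketch:
    """Approximate frequency counts in sublinear space.

    With width w = ceil(e / eps) and depth d = ceil(ln(1 / delta)),
    an estimate exceeds the true count by more than eps * m with
    probability at most delta (m = stream length).
    """

    PRIME = (1 << 61) - 1  # Mersenne prime used by the assumed hash family

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        rng = random.Random(seed)
        # One (a, b) pair per row: h(x) = ((a*x + b) mod p) mod width
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, x):
        a, b = self.params[row]
        return ((a * x + b) % self.PRIME) % self.width

    def add(self, x, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, x)] += count

    def estimate(self, x):
        # Collisions only ever add, so the minimum over rows is the tightest estimate.
        return min(self.table[row][self._index(row, x)]
                   for row in range(self.depth))

sketch = CountMinSketch(eps=0.01, delta=0.01)
for x in [3, 7, 3, 1, 7, 3, 9]:
    sketch.add(x)
print(sketch.estimate(3))  # never below the true count of 3
```

The table holds `width * depth` counters regardless of how many distinct items appear, which is what replaces the one-counter-per-item hash table of the naive approach.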
Complexity
Given all these things
It is necessary to correlate and share data across entities.
It is necessary to link, match and transform data across business
entities and systems.
With this...
Complexity goes through the roof!!!
And it is through the roof! See the Linking Open Data community project.
Cautionary Tale
Something Notable
In 1880 the USA conducted a Census of the Population covering different aspects:
Population
Mortality
Agriculture
Manufacturing
However
Once the data was collected, it took 7 years to say something!
Ahhh...
Thus, Hollerith came up with the following machine (circa 1890)!
Hollerith Tabulating Machine
It was basically a sorter and counter
Using punched cards as memory.
And mercury sensors: pins passing through the card holes dipped into mercury to close a circuit.
It was FAST!
It took only
2 years!
Nevertheless in 1837
Babbage's Analytical Engine was
The first design for a general-purpose computer!
Turing-complete!
Way more complex than the tabulator! 53 years earlier!
Funny!!!
The Problem
Actually, the engine never reached completion because
Babbage was actually a poor project manager!
Data is Everywhere!
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/Credit Card transactions
Social Network
Many Places
The Staggering Numbers
An Ocean of Data
How much data is there in the world?
800 Terabytes, 2000
160 Exabytes, 2006
500 Exabytes (Internet), 2009
2.7 Zettabytes, 2012
35 Zettabytes by 2020
Generation
How much data is generated in ONE day?
7 TB, Twitter
10 TB, Facebook
Source: "Big data: The next frontier for innovation, competition, and productivity",
McKinsey Global Institute, 2011
Type of Data
Thus
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
And more...
Graph Data
Social Network, Semantic Web (RDF), . . .
Streaming Data
You can only scan the data once
The Ever Growing Landscape
Machine Learning
Definition
Algorithms or techniques that enable a computer (machine) to "learn" from
data. Related to many areas such as data mining, statistics, information
theory, etc.
Algorithm Types:
Unsupervised learning
Supervised learning
Reinforcement learning
Examples
Artificial Neural Network (ANN)
Support Vector Machine (SVM)
Expectation-Maximization (EM)
Deterministic Annealing (DA)
Machine Learning Process
Process
1 Feature Extraction/Feature Generation
2 Clustering ≈ Class Identification ≈ Unsupervised Learning
3 Classification ≈ Supervised Learning
Then...
We start thinking: We need to process a lot of data...
Or...
LARGE SCALE MACHINE LEARNING
Feature Generation/Dimensionality Reduction
Feature Generation
Given a set of measurements, the goal is to discover compact and
informative representations of the obtained data.
Examples
1 The Karhunen–Loève transform ≈ Principal Component Analysis
  1 Popular for feature generation and Dimensionality Reduction
2 The Singular Value Decomposition
  1 Used for Dimensionality Reduction
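To make the idea of a compact representation concrete, here is a minimal Python sketch that finds the first principal component of a small 2-D data set and projects the data onto it. It is an illustration under assumptions: the helper names and the toy data are invented here, and the leading eigenvector is found by power iteration on the sample covariance matrix rather than by a full SVD.

```python
def mean_center(rows):
    """Subtract the per-feature mean from every sample."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iters=200):
    """Power iteration: converges to the eigenvector of the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def project(centered, v):
    """One-dimensional feature: coordinate of each sample along v."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in centered]

# Hypothetical measurements lying close to the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
centered = mean_center(data)
pc1 = first_principal_component(covariance(centered))
print(pc1)
print(project(centered, pc1))
```

Two measurements per sample are reduced to a single feature while keeping almost all of the variance, because the data is nearly one-dimensional to begin with.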
Dimension Reduction/Feature Extraction
Definition
A process that transforms high-dimensional data into a low-dimensional representation to
improve accuracy, aid understanding, or remove noise.
Why?
Curse of dimensionality: the volume of the feature space grows exponentially as
extra dimensions are added, and so does the complexity of covering it with data.
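A small worked example of the exponential growth, assuming (arbitrarily) that each feature is discretized into 10 bins and that we would like at least one sample per grid cell. The function name `cells_needed` is made up for this illustration.

```python
def cells_needed(bins_per_axis, dims):
    """Number of grid cells when every axis is split into the same number of bins."""
    return bins_per_axis ** dims

for d in (1, 2, 3, 10, 20):
    print(d, cells_needed(10, d))
```

With 20 features the grid already has $10^{20}$ cells, far more than any realistic number of samples could cover.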
Feature Selection
Which features should be used for the classifier?
Why? The Curse of Dimensionality!
Use Hypothesis Testing to discriminate good features
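What can be done?
Measures for Class Separability
Example: Between-class scatter matrix:
$$S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T \qquad (1)$$
Where:
$\mu_0$ is the global mean vector, $\mu_0 = \sum_{i=1}^{M} P_i \mu_i$.
$\mu_i$ is the mean vector of class $\omega_i$.
$P_i \cong n_i / N$, the fraction of the $N$ samples that belong to class $\omega_i$.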
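Equation (1) can be computed directly from labeled samples. The following is a plain-Python sketch, with the function name `between_class_scatter` and the toy data chosen here for illustration; the priors $P_i$ are estimated as $n_i / N$ as stated above.

```python
def between_class_scatter(samples, labels):
    """S_b = sum_i P_i (mu_i - mu_0)(mu_i - mu_0)^T with P_i = n_i / N."""
    N, d = len(samples), len(samples[0])
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Class means mu_i and priors P_i estimated from the data.
    means = {c: [sum(x[j] for x in xs) / len(xs) for j in range(d)]
             for c, xs in by_class.items()}
    priors = {c: len(xs) / N for c, xs in by_class.items()}

    # Global mean mu_0 = sum_i P_i mu_i.
    mu0 = [sum(priors[c] * means[c][j] for c in means) for j in range(d)]

    Sb = [[0.0] * d for _ in range(d)]
    for c in means:
        diff = [means[c][j] - mu0[j] for j in range(d)]
        for r in range(d):
            for s in range(d):
                Sb[r][s] += priors[c] * diff[r] * diff[s]
    return Sb

# Hypothetical two-class data: class means (1, 0) and (5, 4), global mean (3, 2).
X = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
y = [0, 0, 1, 1]
print(between_class_scatter(X, y))  # [[4.0, 4.0], [4.0, 4.0]]
```

Larger entries in $S_b$ mean the class means sit farther from the global mean, which is the sense in which it measures separability.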
What can be done?
Feature Subset Selection
Examples:
Filter Approach:
Candidate combinations of features are scored with a class-separability
measure, independently of the classifier.
Wrapper Approach:
Use the chosen classifier itself to find the best feature subset.
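A minimal sketch of the filter approach in Python, assuming an exhaustive search over all subsets of a fixed size $k$. The separability measure used here, $J = \operatorname{trace}(S_b) / \operatorname{trace}(S_w)$, is one common choice and is an assumption of this example; the function names and toy data are also invented for illustration. The within-class scatter must be non-zero for the ratio to be defined.

```python
from itertools import combinations

def separability(samples, labels, features):
    """J = trace(S_b) / trace(S_w) restricted to the given feature indices."""
    N = len(samples)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    between = within = 0.0
    for j in features:
        mu0 = sum(x[j] for x in samples) / N
        for xs in by_class.values():
            p = len(xs) / N                               # prior P_i = n_i / N
            mu = sum(x[j] for x in xs) / len(xs)          # class mean of feature j
            between += p * (mu - mu0) ** 2
            within += p * sum((x[j] - mu) ** 2 for x in xs) / len(xs)
    return between / within

def best_subset(samples, labels, k):
    """Filter approach: score every subset of k features and keep the best."""
    d = len(samples[0])
    return max(combinations(range(d), k),
               key=lambda f: separability(samples, labels, f))

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0, 1.0], [1.0, 1.0, 2.0], [0.5, 3.0, 0.0],
     [5.0, 4.0, 2.0], [6.0, 2.0, 3.0], [5.5, 3.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
print(best_subset(X, y, 1))
print(best_subset(X, y, 2))
```

A wrapper approach would keep the same search loop but replace `separability` with the validation accuracy of the chosen classifier trained on each candidate subset.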
Classification
Definition
A procedure that assigns data to a given set of categories, learned from a
training set in a supervised way.
What do we want from classification?
Generalization vs. Specialization
Hard to achieve both
Avoid overfitting/overtraining
Early stopping
Holdout validation
K-fold cross-validation
Leave-one-out cross-validation
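The validation schemes listed above differ only in how the sample indices are split. A minimal sketch of K-fold splitting in Python follows; the function name `k_fold_indices` is made up here, and the sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

for train, val in k_fold_indices(10, 3):
    print(val)
```

Leave-one-out cross-validation is the special case `k = n_samples`, and holdout validation corresponds to using a single one of these splits.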
Avoid overfitting/overtraining
[Figure: validation error and training error curves, with the underfitting and overfitting regions marked]
Examples of Classification Algorithms
Many Possible Algorithms
Linear Classifiers: Perceptron
Probabilistic Classifiers: Naive Bayes
Kernel Method Classifiers: Support Vector Machines
Non-Linear Classifiers: Artificial Neural Networks
Graph Model Classifiers:
. . .
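As an illustration of the simplest entry in the list, here is a minimal perceptron sketch in Python. The function names, the learning rate, the epoch limit, and the toy data are assumptions of this example; labels are assumed to be +1 or -1, and training only terminates early when the data is linearly separable.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Perceptron learning rule; labels must be +1 or -1."""
    d = len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on the boundary)
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                    # a full pass with no errors: stop
            break
    return w, b

def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Hypothetical linearly separable data.
X = [[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
print(w, b)
print([predict(w, b, x) for x in X])
```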
Clustering Analysis
Definition
Grouping unlabeled data into clusters in order to infer
hidden structure or information.
Using, for example
Dissimilarity measurement
Angle: Inner product, . . .
Non-metric: Rank, Intensity, . . .
Distance: Euclidean ($\ell_2$), Manhattan ($\ell_1$), . . .
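The distance and angle measures named above are short enough to write out. A minimal Python sketch follows; the function names are chosen here, and using one minus the cosine as the angle-based dissimilarity is an assumption of this example (it requires non-zero vectors).

```python
import math

def euclidean(x, y):
    """l2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """l1 distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_dissimilarity(x, y):
    """1 - cos(angle), computed from the inner product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

print(euclidean([0, 0], [3, 4]))             # 5.0
print(manhattan([0, 0], [3, 4]))             # 7
print(cosine_dissimilarity([1, 0], [0, 1]))  # 1.0
```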
Example
39 / 56
Examples of Clustering Algorithms
Clustering
1 Basic Clustering Algorithms
  1 K-means
2 Clustering Based on Cost Functions
  1 Fuzzy C-means
  2 Possibilistic
3 Hierarchical Clustering
  1 Entropy based
4 Clustering Based on Graph Theory
40 / 56
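As an illustration of the first entry, here is a minimal sketch of K-means in its usual form (Lloyd's iteration), written in plain Python. The function name, the fixed iteration count, and the hand-picked starting centroids are assumptions made for this example; practical implementations add random initialization and a convergence test.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's iteration on tuples of floats, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared l2 distance).
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Toy data (hypothetical): two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```

The result depends on the starting centroids; different initializations can converge to different clusterings.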
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting information or patterns from data in large
databases.
Alternative names and their "inside stories":
Knowledge discovery (mining) in databases (KDD)
Knowledge extraction
Data/pattern analysis
Data archeology
Business intelligence
etc.
42 / 56
Examples: What is (not) Data Mining?
What is not Data Mining?
1 Look up a phone number in a phone directory
2 Query a Web search engine for information about "Amazon"
What is Data Mining?
1 Discover that certain names are more prevalent in certain US locations
(O'Brien, O'Rourke, O'Reilly . . . in the Boston area)
2 Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
43 / 56
Data Mining Applications
Applications
Mining the Web for Structured Data
Near Neighbor Search in High-Dimensional Data
Frequent Itemsets and Association Rules
Structure of the Webgraph
PageRank
Link Analysis
Proximity on Graphs
Mining Data Streams
Large-Scale Supervised Machine Learning Techniques
45 / 56
Example: Frequent Itemsets
Based on the Market-Basket Model
1 On the one hand, we have items.
2 On the other, we have baskets, sometimes called "transactions."
  1 Each basket consists of a set of items (an itemset).
  2 Baskets are small (each holds few items).
Examples
1 {Cat, and, dog, bites}
2 {Yahoo, news, claims, cat, dog, and, produced, viable, offspring}
3 {Cat, killer, likely, is, a, big, dog}
4 {Professional, free, advice, on, dog, training, puppy}
47 / 56
Example: Frequent Itemsets
Then, we do the following
Transaction ID   Cat   Dog   and   a   mated
1                1     1     1     0   0
2                1     1     1     1   1
3                1     1     0     1   0
4                0     1     0     0   0
48 / 56
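The step from baskets to a 0/1 table can be sketched as follows. The code assumes the four example baskets from the previous slide, lowercased, and the five column items of the table above; the function name is hypothetical. One caveat: basket 2 as listed on the slide contains neither "a" nor "mated", so those two entries come out as 0 here, whereas the table shows 1 (the table may reflect a longer version of that sentence).

```python
# Example baskets from the slides, lowercased (assumption: case is ignored).
baskets = [
    {"cat", "and", "dog", "bites"},
    {"yahoo", "news", "claims", "cat", "dog", "and", "produced", "viable", "offspring"},
    {"cat", "killer", "likely", "is", "a", "big", "dog"},
    {"professional", "free", "advice", "on", "dog", "training", "puppy"},
]
columns = ["cat", "dog", "and", "a", "mated"]

def incidence_matrix(baskets, columns):
    """One row per basket; an entry is 1 if the column's item occurs in the basket."""
    return [[1 if item in basket else 0 for item in columns] for basket in baskets]

# Print one line per transaction: its ID followed by the 0/1 entries.
for tid, row in enumerate(incidence_matrix(baskets, columns), start=1):
    print(tid, *row)
```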
Combinatorial Problem
Problem
How many subsets do we have? (With n distinct items, there are 2^n.)
But we can do the following
Given an itemset x and a database D of transactions $\{t_i\}_{i \in I}$, define the support
$\mathrm{supp}(x, D) = |\{ t_i \in D \mid x \subseteq t_i \}|$ (2)
Then, setting a threshold
How many itemsets are frequent, i.e. have supp(x, D) above the threshold?
49 / 56
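A brute-force sketch of the support count and the frequent-itemset question, using the rows of the table from the previous slide as transactions. The function names and the parameter name `threshold` are assumptions for this example. The code tests every non-empty subset of the items, which is exactly the exponential blow-up the slide warns about; practical algorithms such as Apriori prune most candidates.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, threshold):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support(frozenset(combo), transactions)
            if count > threshold:  # frequent means support above the threshold
                result[combo] = count
    return result

# Rows of the 0/1 table, written as sets of the items marked 1.
transactions = [
    {"cat", "dog", "and"},
    {"cat", "dog", "and", "a", "mated"},
    {"cat", "dog", "a"},
    {"dog"},
]
print(frequent_itemsets(transactions, threshold=2))
# {('cat',): 3, ('dog',): 4, ('cat', 'dog'): 3}
```

With five items the loop already visits 2^5 - 1 = 31 candidate itemsets, so this approach only works for toy data.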
Hardware Solutions: ASICs
Application-Specific Integrated Circuit (ASIC)
An ASIC is an integrated circuit customized for a particular use, rather
than intended for general-purpose use.
It allows for
1 Lower Power Consumption.
2 Better Cooling Approaches.
Example: From Microsoft Research
51 / 56
Hardware Solutions: GPUs
IDEAS
Based on CUDA parallel computing architecture from Nvidia
Emphasis on executing many concurrent LIGHT threads instead of
one HEAVY thread as in CPUs
Hardware for 8800
53 / 56
Advantages
Massively parallel
Hundreds of cores, millions of threads
High throughput
Limitations
May not be applicable for all tasks
Generic hardware (CPUs) closing the gap
54 / 56
Projects
Possible topics are:
Oil exploration detection.
Association Rule Preprocessing Project.
Neural Network-Based Financial Market Forecasting Project.
Page Ranking - Improving over the Google Matrix.
Influence Maximization in Social Networks.
Web Word Relevance Measures.
Recommendation Systems.
There are more possibilities at https://www.kaggle.com/competitions
56 / 56

More Related Content

Viewers also liked

31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
Andres Mendez-Vazquez
Ā 
Periodic table (1)
Periodic table (1)Periodic table (1)
Periodic table (1)jghopwood
Ā 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
Stephen Taylor
Ā 
11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning
Andres Mendez-Vazquez
Ā 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
Andres Mendez-Vazquez
Ā 
Preparation Data Structures 11 graphs
Preparation Data Structures 11 graphsPreparation Data Structures 11 graphs
Preparation Data Structures 11 graphs
Andres Mendez-Vazquez
Ā 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties
Andres Mendez-Vazquez
Ā 
15 B-Trees
15 B-Trees15 B-Trees
15 B-Trees
Andres Mendez-Vazquez
Ā 
13 Machine Learning Supervised Decision Trees
13 Machine Learning Supervised Decision Trees13 Machine Learning Supervised Decision Trees
13 Machine Learning Supervised Decision Trees
Andres Mendez-Vazquez
Ā 
20 Single Source Shorthest Path
20 Single Source Shorthest Path20 Single Source Shorthest Path
20 Single Source Shorthest Path
Andres Mendez-Vazquez
Ā 
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
Andres Mendez-Vazquez
Ā 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost
Andres Mendez-Vazquez
Ā 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
Andres Mendez-Vazquez
Ā 
12 Machine Learning Supervised Hidden Markov Chains
12 Machine Learning  Supervised Hidden Markov Chains12 Machine Learning  Supervised Hidden Markov Chains
12 Machine Learning Supervised Hidden Markov Chains
Andres Mendez-Vazquez
Ā 
Drainage
DrainageDrainage
Drainage
deepak190
Ā 
Work energy power 2 reading assignment -revision 2
Work energy power 2 reading assignment -revision 2Work energy power 2 reading assignment -revision 2
Work energy power 2 reading assignment -revision 2sashrilisdi
Ā 
08 Machine Learning Maximum Aposteriori
08 Machine Learning Maximum Aposteriori08 Machine Learning Maximum Aposteriori
08 Machine Learning Maximum Aposteriori
Andres Mendez-Vazquez
Ā 
Stormwater Management Technical Seminar - Armtec
Stormwater Management Technical Seminar - ArmtecStormwater Management Technical Seminar - Armtec
Stormwater Management Technical Seminar - Armtec
Communications Branding
Ā 
Preparation Data Structures 09 hash tables
Preparation Data Structures 09 hash tablesPreparation Data Structures 09 hash tables
Preparation Data Structures 09 hash tables
Andres Mendez-Vazquez
Ā 

Viewers also liked (20)

31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
Ā 
Periodic table (1)
Periodic table (1)Periodic table (1)
Periodic table (1)
Ā 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
Ā 
11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning
Ā 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
Ā 
Preparation Data Structures 11 graphs
Preparation Data Structures 11 graphsPreparation Data Structures 11 graphs
Preparation Data Structures 11 graphs
Ā 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties
Ā 
15 B-Trees
15 B-Trees15 B-Trees
15 B-Trees
Ā 
13 Machine Learning Supervised Decision Trees
13 Machine Learning Supervised Decision Trees13 Machine Learning Supervised Decision Trees
13 Machine Learning Supervised Decision Trees
Ā 
20 Single Source Shorthest Path
20 Single Source Shorthest Path20 Single Source Shorthest Path
20 Single Source Shorthest Path
Ā 
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
Ā 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost
Ā 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
Ā 
Waves
WavesWaves
Waves
Ā 
12 Machine Learning Supervised Hidden Markov Chains
12 Machine Learning  Supervised Hidden Markov Chains12 Machine Learning  Supervised Hidden Markov Chains
12 Machine Learning Supervised Hidden Markov Chains
Ā 
Drainage
DrainageDrainage
Drainage
Ā 
Work energy power 2 reading assignment -revision 2
Work energy power 2 reading assignment -revision 2Work energy power 2 reading assignment -revision 2
Work energy power 2 reading assignment -revision 2
Ā 
08 Machine Learning Maximum Aposteriori
08 Machine Learning Maximum Aposteriori08 Machine Learning Maximum Aposteriori
08 Machine Learning Maximum Aposteriori
Ā 
Stormwater Management Technical Seminar - Armtec
Stormwater Management Technical Seminar - ArmtecStormwater Management Technical Seminar - Armtec
Stormwater Management Technical Seminar - Armtec
Ā 
Preparation Data Structures 09 hash tables
Preparation Data Structures 09 hash tablesPreparation Data Structures 09 hash tables
Preparation Data Structures 09 hash tables
Ā 

Similar to 01 Machine Learning Introduction

On Inherent Complexity of Computation, by Attila Szegedi
On Inherent Complexity of Computation, by Attila SzegediOn Inherent Complexity of Computation, by Attila Szegedi
On Inherent Complexity of Computation, by Attila Szegedi
ZeroTurnaround
Ā 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
c.titus.brown
Ā 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013Ken Mwai
Ā 
Getting comfortable with Data
Getting comfortable with DataGetting comfortable with Data
Getting comfortable with Data
Ritvvij Parrikh
Ā 
Applying data visualisation to the analytics process
Applying data visualisation to the analytics processApplying data visualisation to the analytics process
Applying data visualisation to the analytics process
Casper Blicher Olsen
Ā 
Week 7 - Module 16 - PPT- Artificial Intelligence Representation of Knowledge...
Week 7 - Module 16 - PPT- Artificial Intelligence Representation of Knowledge...Week 7 - Module 16 - PPT- Artificial Intelligence Representation of Knowledge...
Week 7 - Module 16 - PPT- Artificial Intelligence Representation of Knowledge...
tscreations481
Ā 
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...
Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...Enterprise Search Europe 2015:  Fishing the big data streams - the future of ...
Enterprise Search Europe 2015: Fishing the big data streams - the future of ...
Charlie Hull
Ā 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
Ā 
DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfessionGary Rector
Ā 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
Stephen Withington
Ā 
Library Users of the Future... Or, projecting outward from that fringe of res...
Library Users of the Future... Or, projecting outward from that fringe of res...Library Users of the Future... Or, projecting outward from that fringe of res...
Library Users of the Future... Or, projecting outward from that fringe of res...
James Baker
Ā 
James Baker: Library Users of the Future
James Baker: Library Users of the FutureJames Baker: Library Users of the Future
James Baker: Library Users of the Future
Bodleian Libraries Staff Development
Ā 
Data and science
Data and scienceData and science
Data and science
Anand Deshpande
Ā 
Large Components in the Rearview Mirror
Large Components in the Rearview MirrorLarge Components in the Rearview Mirror
Large Components in the Rearview Mirror
Michelle Brush
Ā 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life Sciences
Ari Berman
Ā 
Machine Learning, AI and the Brain
Machine Learning, AI and the Brain Machine Learning, AI and the Brain
Machine Learning, AI and the Brain
TechExeter
Ā 
DataHub
DataHubDataHub
Big Data com Python
Big Data com PythonBig Data com Python
Big Data com Python
Marcel Caraciolo
Ā 
Contemporary Literacy
Contemporary LiteracyContemporary Literacy
Contemporary Literacy
dwarlick
Ā 
2015 Arts Midwest Workshop: Embracing the Digital Age
2015 Arts Midwest Workshop: Embracing the Digital Age2015 Arts Midwest Workshop: Embracing the Digital Age
2015 Arts Midwest Workshop: Embracing the Digital Age
The Metropolitan Museum of Art
Ā 

01 Machine Learning Introduction

  • 1. Machine Learning for Data Mining: Introduction. Andres Mendez-Vazquez, May 13, 2015
  • 2. Outline: 1. Why are we interested in Analyzing Data? (Intuitive Definition: The 3V's; Complexity; Data Everywhere) 2. Machine Learning (Machine Learning Process; Features; Classification; Clustering Analysis) 3. Data Mining (Definition; Applications; Example: Frequent Itemsets) 4. Hardware Support (ASICs; GPUs) 5. Projects (What projects can you do?)
  • 4. Intuitive Definition: Volume. When looking at the volumes of information, we have volumes of it: terabytes (10^12 bytes), petabytes (10^15 bytes) and UP!!! Examples of these volumes are: (1) records, (2) transactions, (3) web searches, (4) etc.
  • 9. However, something notable: what constitutes truly "high" volume varies by industry and even geography!!! Simply look at the DNA data for a cellular cycle. Example: [figure]
  • 11. Intuitive Definition: Variety. When looking at the structure of the information, we have variety like there is no tomorrow: it is structured, semi-structured and unstructured. So, do you have some examples of structures in information?
  • 14. Intuitive Definition: Velocity. When looking at the velocity of this information, we have data in motion!!! Velocity: dynamic generation and real-time generation. Problems with that: latency, that is, the lag time between the capture or generation of the data and the moment it is available!!!
  • 19. For example: imagine that I have a stream of m = 10^25 integers taking values in [a_1, ..., a_n] with n = 10,000,000. Now, somebody asks you to find the most frequent item!!! A naive algorithm: (1) take a hash table with a counter per item; (2) then, put the numbers in the hash table. Problems: which problems do we have?
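The naive algorithm above can be sketched in a few lines. This is only an illustration under assumptions: the function name `most_frequent` is hypothetical, and the stream is taken to be any Python iterable that fits in one pass. The point of the sketch is the memory cost: the table may hold one counter per distinct value (up to n entries), and each counter must be able to count up to m.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def most_frequent(stream: Iterable[Hashable]) -> Tuple[Hashable, int]:
    """Return (item, count) for the most frequent item in the stream.

    Naive approach: one hash-table counter per distinct item, so memory
    grows with the number of distinct values seen (up to n).
    """
    counts: Counter = Counter()
    for item in stream:          # single pass over the stream
        counts[item] += 1        # one counter per distinct item
    if not counts:
        raise ValueError("the stream is empty")
    item, count = counts.most_common(1)[0]
    return item, count
```

For example, `most_frequent([3, 1, 3, 2, 3, 1])` returns `(3, 3)`. With n = 10,000,000 distinct values the table alone needs millions of entries, which is the problem the slide is pointing at.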
  • 22. However, there is the Count-Min Sketch algorithm, introduced by Cormode and Muthukrishnan in 2004. Properties: space used $O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\cdot(\log m+\log n)\right)$; error probability $\delta$; error $\epsilon$.
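A minimal sketch of a Count-Min Sketch may make the space and error trade-off more concrete. Several things here are assumptions for illustration and are not taken from the slides: the class and method names, the restriction to integer items, the hash family `((a*x + b) mod p) mod width` with `p = 2^61 - 1`, and the usual parameter choice of `ceil(e / epsilon)` columns and `ceil(ln(1 / delta))` rows.

```python
import math
import random


class CountMinSketch:
    """Approximate frequency counts for a stream of integers."""

    def __init__(self, epsilon: float, delta: float, seed: int = 0) -> None:
        # Usual parameter choice: width controls the error, depth the
        # probability of exceeding it.
        self.width = math.ceil(math.e / epsilon)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.prime = (1 << 61) - 1  # illustrative choice of a large prime
        rng = random.Random(seed)
        # One (a, b) pair per row defines that row's hash function.
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(0, self.prime))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # number of items added so far

    def _index(self, row: int, item: int) -> int:
        a, b = self.hashes[row]
        return ((a * item + b) % self.prime) % self.width

    def add(self, item: int, count: int = 1) -> None:
        """Add `count` occurrences of `item` (one counter per row)."""
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count
        self.total += count

    def estimate(self, item: int) -> int:
        """Return an estimate that never undercounts the true frequency."""
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )
```

Collisions can only inflate a counter, so taking the minimum over the rows gives an estimate that is never below the true count; with the parameters above it exceeds the true count by more than `epsilon * total` with probability at most `delta`. The table size depends only on `epsilon` and `delta`, not on the number of distinct items.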
  • 26. Complexity. Given all these things, it is necessary to correlate and share data across entities, and to link, match and transform data across business entities and systems. With this, complexity goes through the roof!!!
  • 29. And it is through the roof!!! [Figure: Linking Open Data community project]
  • 30. Cautionary Tale. Something notable: in 1880 the USA took a census of the population covering different aspects: population, mortality, agriculture and manufacturing. However, once the data was collected, it took 7 years to say something!!!
  • 37. Ahhh... Thus, Hollerith came up with the following machine (circa 1890)!!!
  • 38. Hollerith Tabulating Machine. It was basically a sorter and counter, using punched cards as memory and mercury sensors. Example: [figure]
  • 40. It was FAST!!! It took only 2 years!!! Nevertheless, in 1837 Babbage's Analytical Engine was the first design for a general-purpose computer!!! Turing-complete!!! Way more complex than the tabulator!!! 53 years earlier!!!
  • 46. The Problem. It never reached completion because Babbage was actually a yucky project manager!!!
  • 49. Data is Everywhere! Lots of data is being collected and warehoused: web data and e-commerce; purchases at department and grocery stores; bank and credit card transactions; social networks; and many other places.
  • 54. The Staggering Numbers. An ocean of data: how much data is there in the world? 800 terabytes (2000); 160 exabytes (2006); 500 exabytes (Internet, 2009); 2.7 zettabytes (2012); 35 zettabytes by 2020. Generation: how much data is generated in ONE day? 7 TB at Twitter; 10 TB at Facebook. Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, 2011.
  • 63. Types of Data. Thus: Relational Data (Tables/Transaction/Legacy Data); Text Data (Web); Semi-structured Data (XML). And more: Graph Data (Social Network, Semantic Web (RDF), ...); Streaming Data (you can only scan the data once).
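The "scan the data once" constraint on streaming data can be illustrated with a one-pass statistic. The following is a minimal sketch, assuming a numeric stream; the function name `running_stats` and the sample readings are hypothetical, and the update rule used is Welford's online algorithm rather than anything prescribed by the slides.

```python
def running_stats(stream):
    """Consume an iterable exactly once; return (count, mean, variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance


# A generator can be read only once, much like a data stream.
readings = (v for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n, mean, var = running_stats(readings)
```

Nothing is stored except three running values, so memory use does not grow with the length of the stream.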
  • 70. The Ever-Growing Landscape
  • 71. Machine Learning. Definition: algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, information theory, etc. Algorithm types: Unsupervised learning; Supervised learning; Reinforcement learning. Examples: Artificial Neural Network (ANN); Support Vector Machine (SVM); Expectation-Maximization (EM); Deterministic Annealing (DA).
  • 80. Machine Learning Process. Process: (1) Feature Extraction/Feature Generation; (2) Clustering ≈ Class Identification ≈ Unsupervised Learning; (3) Classification ≈ Supervised Learning. Then we start thinking: we need to process a lot of data... or... LARGE SCALE MACHINE LEARNING.
  • 84. Feature Generation/Dimensionality Reduction. Feature Generation: given a set of measurements, the goal is to discover compact and informative representations of the obtained data. Examples: (1) the Karhunen–Loève transform ≈ Principal Component Analysis, popular for feature generation and dimensionality reduction; (2) the Singular Value Decomposition, used for dimensionality reduction.
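The link between the two examples can be sketched in a few lines: Principal Component Analysis can be computed from the Singular Value Decomposition of the centered data matrix. This is an illustrative sketch that assumes NumPy is available; the helper name `pca_svd` and the toy data are hypothetical.

```python
import numpy as np


def pca_svd(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # rows are principal directions
    explained_var = (s[:k] ** 2) / (X.shape[0] - 1)
    return Xc @ components.T, components, explained_var


# Toy data: five 2-D points lying close to the line y = x.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]])
Z, components, explained_var = pca_svd(X, k=1)
```

Here `Z` is the one-dimensional representation of the five points; the sign of each principal direction is arbitrary, so only its orientation is meaningful.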
  • 88. Dimension Reduction/Feature Extraction. Definition: the process of transforming high-dimensional data into low-dimensional data in order to improve accuracy, aid understanding, or remove noise. Why? The curse of dimensionality: the volume of the feature space, and with it the complexity of the problem, grows exponentially as extra dimensions are added.
  • 90. Feature Selection. Which features should be used for the classifier? Why? The Curse of Dimensionality!!! One option: hypothesis testing to discriminate good features.
  • 92. What can be done? Measures for Class Separability. Example: the between-class scatter matrix, $S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T$ (1), where $\mu_0 = \sum_{i=1}^{M} P_i \mu_i$ is the global mean vector, $\mu_i$ is the mean of class $\omega_i$, and $P_i \cong n_i / N$.
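Equation (1) can be computed directly from labeled samples. The sketch below uses plain Python lists and estimates each prior as n_i / N; the function names and the two-class toy data are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def between_class_scatter(classes):
    """classes: dict mapping a class label to its list of sample vectors."""
    total = sum(len(rows) for rows in classes.values())              # N
    priors = {c: len(rows) / total for c, rows in classes.items()}   # P_i = n_i / N
    means = {c: mean_vector(rows) for c, rows in classes.items()}    # mu_i
    dim = len(next(iter(means.values())))
    # Global mean: mu_0 = sum_i P_i * mu_i
    mu0 = [sum(priors[c] * means[c][d] for c in classes) for d in range(dim)]
    sb = [[0.0] * dim for _ in range(dim)]
    for c in classes:
        diff = [means[c][d] - mu0[d] for d in range(dim)]
        for r in range(dim):
            for col in range(dim):
                sb[r][col] += priors[c] * diff[r] * diff[col]        # P_i * outer product
    return sb


data = {
    "w1": [[0.0, 0.0], [2.0, 0.0]],   # class mean (1, 0)
    "w2": [[4.0, 2.0], [6.0, 2.0]],   # class mean (5, 2)
}
S_b = between_class_scatter(data)     # [[4.0, 2.0], [2.0, 1.0]]
```

With equal priors the global mean is (3, 1), so each class contributes half of the outer product of (2, 1) with itself.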
  • 96. What can be done? Feature Subset Selection. Examples: Filter Approach: all combinations of features are evaluated with a separability measure, independently of the classifier. Wrapper Approach: use the chosen classifier itself to find the best set.
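A filter approach can be sketched as an exhaustive search over feature subsets of a fixed size. The separability measure below (a summed per-feature Fisher-style ratio for two classes) is an assumption chosen for brevity, not the measure used in the slides; `fisher_score`, `best_subset` and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def fisher_score(a, b, features):
    """Assumed separability measure: summed per-feature Fisher-style ratio."""
    score = 0.0
    for f in features:
        col_a = [row[f] for row in a]
        col_b = [row[f] for row in b]
        spread = pvariance(col_a) + pvariance(col_b)
        score += (mean(col_a) - mean(col_b)) ** 2 / (spread + 1e-12)
    return score


def best_subset(a, b, k):
    """Score every size-k feature subset and return the best one."""
    n_features = len(a[0])
    return max(combinations(range(n_features), k),
               key=lambda subset: fisher_score(a, b, subset))


# Feature 0 separates the classes well, feature 1 barely, feature 2 moderately.
class_a = [[1.0, 5.0, 2.0], [1.2, 4.0, 2.5], [0.8, 6.0, 1.5]]
class_b = [[3.0, 5.1, 3.0], [3.2, 4.2, 3.5], [2.8, 5.9, 2.5]]
chosen = best_subset(class_a, class_b, k=2)   # (0, 2)
```

A wrapper approach would replace `fisher_score` with the validation accuracy of the chosen classifier trained on each candidate subset, which is usually far more expensive.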
  • 102. Classification. Definition: a procedure that divides data into a given set of categories, based on a training set, in a supervised way. What do we want from classification? Generalization vs. Specification: hard to achieve both. Avoid overfitting/overtraining by means of: early stopping; holdout validation; K-fold cross-validation; leave-one-out cross-validation.
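The K-fold scheme listed above amounts to splitting the sample indices into K disjoint validation folds. A minimal sketch follows, assuming no shuffling and no stratification; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample is used for validation exactly once. Setting `k` equal to `n_samples` gives leave-one-out cross-validation, and taking a single fold corresponds to holdout validation.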
  • 110. Avoid overfitting/overtraining. [Figure: Validation and Training Error; curves labeled Validation Error and Training Error, with the Underfitting and Overfitting regions marked.]
  • 111. Examples of Classification Algorithms. Many possible algorithms: Linear Classifiers (Perceptron); Probability Classifiers (Naive Bayes); Kernel Methods Classifiers (Support Vector Machines); Non-Linear Classifiers (Artificial Neural Networks); Graph Model Classifiers (...).
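As an illustration of the simplest entry in this list, here is a sketch of the classic perceptron learning rule on a linearly separable toy problem. The function names, the learning rate and the epoch limit are assumptions made for the example.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron learning rule; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified, or on the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                    # a full pass without errors
            break
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Linearly separable toy set: the logical AND of two inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
```

On separable data such as this the rule stops updating after finitely many mistakes; on non-separable data it would simply run until the epoch limit.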
  • 118. Clustering Analysis. Definition: grouping unlabeled data into clusters in order to infer hidden structures or information. Using, for example, a dissimilarity measurement: Angle (inner product, ...); Non-metric (rank, intensity, ...); Distance (Euclidean (l2), Manhattan (l1), ...).
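The distance-based and angle-based measures named above are short enough to write out. This sketch covers the Euclidean and Manhattan distances and one common angle-based dissimilarity (one minus the cosine of the angle); the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """l2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """l1 distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_dissimilarity(a, b):
    """1 - cos(angle between a and b), built from the inner product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


p, q = [0.0, 0.0], [3.0, 4.0]
d2 = euclidean(p, q)                                   # 5.0
d1 = manhattan(p, q)                                   # 7.0
angle = cosine_dissimilarity([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors give 1.0
```

The cosine version is undefined for a zero vector, so inputs are assumed to be non-zero.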
  • 124. Examples of Clustering Algorithms. Clustering: (1) Basic Clustering Algorithms: K-means; (2) Clustering Based on Cost Functions: Fuzzy C-means, Possibilistic; (3) Hierarchical Clustering: Entropy based; (4) Clustering Based on Graph Theory.
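K-means, the basic algorithm in the list, alternates an assignment step and an update step. The following is a sketch of Lloyd's algorithm with caller-supplied initial centroids; the initialization strategy, the function name and the toy points are assumptions.

```python
def k_means(points, centroids, iterations=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        if new_assignment == assignment:     # nothing changed: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:                      # an empty cluster keeps its centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, labels = k_means(points, centroids=[points[0], points[3]])
```

The result depends on the initial centroids, which is why practical implementations usually run several random restarts.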
  • 133. What Is Data Mining? Data mining (knowledge discovery in databases): extraction of interesting information or patterns from data in large databases. Alternative names and their “inside stories”: knowledge discovery (mining) in databases (KDD); knowledge extraction; data/pattern analysis; data archeology; business intelligence; etc.
  • 139. Examples: What is (not) Data Mining? What is not Data Mining? (1) Looking up a phone number in a phone directory; (2) Querying a Web search engine for information about “Amazon”. What is Data Mining? (1) Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly... in the Boston area); (2) Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com).
  • 144. Data Mining Applications. Applications: mining the Web for structured data; near-neighbor search in high-dimensional data; frequent itemsets and association rules; structure of the webgraph; PageRank; link analysis; proximity on graphs; mining data streams; large-scale supervised machine learning techniques.
  • 154. Example: Frequent Itemsets. Based on the Market-Basket Model: (1) on the one hand, we have items; (2) on the other, we have baskets, sometimes called “transactions”: each basket consists of a set of items (an itemset), and baskets are small. Examples: (1) {Cat, and, dog, bites}; (2) {Yahoo, news, claims, cat, dog, and, produced, viable, offspring}; (3) {Cat, killer, likely, is, a, big, dog}; (4) {Professional, free, advice, on, dog, training, puppy}.
  • 162. Example: Frequent Itemsets Then, we do the following
      Transaction ID | Cat | Dog | and | a | mated
      1              | 1   | 1   | 1   | 0 | 0
      2              | 1   | 1   | 1   | 1 | 1
      3              | 1   | 1   | 0   | 1 | 0
      4              | 0   | 1   | 0   | 0 | 0
      48 / 56
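The encoding step on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `to_binary_matrix` and the case-insensitive matching are assumptions, not something the slides specify. Note that basket 2 as listed on the earlier slide contains neither "a" nor "mated", so the row computed here for it has 0 in those two columns, unlike the table above.

```python
def to_binary_matrix(baskets, columns):
    """Return one 0/1 row per basket, with one entry per tracked item."""
    rows = []
    for basket in baskets:
        # Lowercase everything so "Cat" and "cat" count as the same item.
        items = {word.lower() for word in basket}
        rows.append([1 if col.lower() in items else 0 for col in columns])
    return rows


# Baskets exactly as listed on the Market-Basket slide.
baskets = [
    ["Cat", "and", "dog", "bites"],
    ["Yahoo", "news", "claims", "cat", "dog", "and",
     "produced", "viable", "offspring"],
    ["Cat", "killer", "likely", "is", "a", "big", "dog"],
    ["Professional", "free", "advice", "on", "dog", "training", "puppy"],
]
columns = ["Cat", "Dog", "and", "a", "mated"]

matrix = to_binary_matrix(baskets, columns)
for transaction_id, row in enumerate(matrix, start=1):
    print(transaction_id, row)
```

Each row is one transaction and each column one tracked item, which is the layout that support counting works on.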
  • 163. Combinatorial Problem Problem: How many subsets do we have? But we can do the following: given an itemset $x$ and a database $D$ of transactions $\{t_i\}_{i \in I}$, define $\mathrm{supp}(x, D) = |\{t_i \in D \mid x \subseteq t_i\}|$ (2) Then, setting a threshold: how many itemsets are frequent, i.e. have $\mathrm{supp}(x, D)$ above the threshold? 49 / 56
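A minimal brute-force sketch of the support count in equation (2) follows, using the four transactions from the 0/1 table on the previous slide. The names `support` and `frequent_itemsets` and the threshold value 2 are assumptions chosen for illustration. The sketch enumerates every non-empty subset of the items (2^n - 1 candidates for n items), which is exactly the combinatorial blow-up the slide asks about, so it is only workable for tiny examples.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, threshold):
    """Brute force: test every non-empty subset of the observed items."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(candidate, transactions)
            if count > threshold:  # strict inequality, as on the slide
                result[frozenset(candidate)] = count
    return result


# Rows of the 0/1 table, written as sets of the items marked 1.
transactions = [
    frozenset({"cat", "dog", "and"}),
    frozenset({"cat", "dog", "and", "a", "mated"}),
    frozenset({"cat", "dog", "a"}),
    frozenset({"dog"}),
]

frequent = frequent_itemsets(transactions, threshold=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

With five items there are 31 candidate itemsets, and only those whose support exceeds the threshold are kept.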
  • 167. Hardware Solutions: ASICs Application-Specific Integrated Circuit (ASIC) An ASIC is an integrated circuit customized for a particular use, rather than intended for general-purpose use. It allows for 1 Lower power consumption. 2 Better cooling approaches. Example: From Microsoft Research 51 / 56
  • 171. Hardware Solutions: GPUs IDEAS Based on the CUDA parallel computing architecture from Nvidia. Emphasis on executing many concurrent LIGHT threads instead of one HEAVY thread as in CPUs. Hardware for 8800 53 / 56
  • 172. Advantages: Massively parallel. Hundreds of cores, millions of threads. High throughput. Limitations: May not be applicable to all tasks. Generic hardware (CPUs) closing the gap. 54 / 56
  • 174. Projects Possible topics are: Oil exploration detection. Association Rule Preprocessing Project. Neural Network-Based Financial Market Forecasting Project. Page Ranking - Improving over the Google Matrix. Influence Maximization in Social Networks. Web Word Relevance Measures. Recommendation Systems. There are more possibilities at https://www.kaggle.com/competitions 56 / 56