Data Mining
Zahra Pourbahman and Behnaz Sadat Motavali
Supervisor: Dr. Alireza Bagheri
Advanced Database Course
November 2016
1/59
Outline
Introduction
DM Methods
Complementary Information
Conclusion
2/59
Introduction
What is Data Mining and why is it important?
1
3/59
Data Mining
 Extracting or mining knowledge from large amounts of data
 Knowledge discovery in databases
4/59
Why Data
Mining?
5/59
Process of
Knowledge
Discovery
6/59
Data
Mining
Methods
Data Mining Methods
Predictive:
 Classification (e.g., SVM)
 Regression (e.g., Linear)
Descriptive:
 Clustering (e.g., K-Means)
 Association Rules (e.g., Apriori)
7/59
DM Methods
What are Classification, Clustering, Association
Rules and Regression in Data Mining?
2
8/59
Classification
9/59
Classification
Problem  Given: Training set
 a labeled set of 𝑁 input-output pairs 𝐷 = {(x_i, y_i)}, i = 1, …, N
 y_i ∈ {1, …, 𝐾}
 Goal: Given an input 𝒙 as a test data, assign it to
one of 𝐾 classes
 Examples:
▸ Spam filter
▸ Shape recognition
10/59
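As a minimal illustration of the classification setting above, the sketch below fits a classifier on a labeled training set and then assigns a test point to one of the K classes. It assumes scikit-learn is available; the tiny 2-D dataset is invented for illustration only.

```python
# Minimal classification sketch: fit on labeled pairs (x_i, y_i), predict a class for a new x.
# Assumes scikit-learn is installed; the small 2-D dataset below is invented for illustration.
import numpy as np
from sklearn.svm import SVC

# Training set D = {(x_i, y_i)}, i = 1..N, with labels y in {0, 1} (K = 2 classes)
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],   # class 0
                    [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # a linear classifier, as in the slides
clf.fit(X_train, y_train)

x_test = np.array([[4.2, 4.1]])     # a new, unlabeled input
print(clf.predict(x_test))          # assigns x_test to one of the K classes, here [1]
```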
Learning and Decision Boundary
 Assume that the training data is perfectly linearly separable
 Note that we seek w such that
w^T x ≥ 0 when y = +1
w^T x < 0 when y = −1
 Equivalently, w^T x_n y_n ≥ 0 for every training sample n; a suitable w can be found by minimizing the perceptron criterion E(w) = −Σ_{n∈ℳ} w^T x_n y_n, where ℳ is the set of misclassified samples (a small sketch follows below)
11/59
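A small NumPy sketch of the separability check and the perceptron-style criterion written above; the weight vector and data points are invented, and w here is just a candidate, not a trained solution.

```python
# Check the linear decision rule w^T x * y >= 0 and the perceptron-style error
# E(w) = -sum over misclassified n of w^T x_n y_n (only misclassified points contribute).
# The weight vector and points below are invented for illustration.
import numpy as np

w = np.array([1.0, -1.0])                        # candidate weight vector
X = np.array([[2.0, 1.0], [0.5, 2.0], [3.0, 0.5]])
y = np.array([+1, +1, +1])                       # true labels

scores = X @ w * y                               # w^T x_n * y_n for each sample
misclassified = scores < 0                       # violated constraints
E = -np.sum(scores[misclassified])               # perceptron criterion (0 if w separates the data)
print(scores, E)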
Margin
 Which line is better to select as the
boundary to provide more
generalization capability?
 Larger margin provides better
generalization to unseen data
 A hyperplane that is farthest from
all training samples
 The largest margin has equal
distances to the nearest sample of
both classes
14/59
Hard Margin
Support Vector
Machine
(SVM)
 When the training samples are not linearly separable, the hard-margin problem has no solution.
16/59
Beyond Linear Separability
 Noise in the linearly separable
classes
 Overlapping classes that can be
approximately separated by a
linear boundary
17/59
Beyond Linear
Separability:
Soft-Margin
SVM
 Soft margin:
Maximizing a margin while trying to minimize the distance
between misclassified points and their correct margin plane
 SVM with slack variables:
Allows samples to fall within the margin, but penalizes them
18/59
Soft-Margin
SVM:
Parameter 𝐶 is a tradeoff parameter:
 small 𝐶 allows margin constraints to be easily ignored → large margin
 large 𝐶 makes constraints hard to ignore → narrow margin
 𝐶 = ∞ enforces all constraints: hard margin
19/59
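The effect of C can be seen directly with a soft-margin linear SVM. The sketch below assumes scikit-learn, generates overlapping 2-D data purely for illustration, and reports the margin width 2/‖w‖ and the number of support vectors for a few values of C.

```python
# Effect of the tradeoff parameter C in a soft-margin linear SVM:
# small C tolerates margin violations (wider margin), large C penalizes them (narrower margin).
# Assumes scikit-learn; the overlapping 2-D data is generated just for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))])   # overlapping classes
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])          # margin width = 2 / ||w||
    print(f"C={C:>6}: margin width ~ {margin:.2f}, "
          f"#support vectors = {len(clf.support_)}")
```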
Support Vectors
 Hard-margin support vectors: SVs = {x_i | α_i > 0}
 The direction of the hyperplane can be found based only on the support vectors:
w = Σ_{i ∈ SVs} α_i y^(i) x^(i)
(a small sketch follows below)
20/59
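A quick check, assuming scikit-learn, that the hyperplane direction is recoverable from the support vectors alone: dual_coef_ stores α_i·y^(i) for the support vectors, so their weighted sum should reproduce coef_. The small dataset is invented.

```python
# The hyperplane direction depends only on the support vectors: w = sum_i alpha_i * y_i * x_i.
# scikit-learn stores alpha_i * y_i in dual_coef_ for the support vectors, so we can verify this.
# Assumes scikit-learn; tiny invented dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [0.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)            # large C ~ hard margin on separable data

w_from_svs = clf.dual_coef_ @ clf.support_vectors_     # sum_i (alpha_i y_i) x_i over SVs only
print(np.allclose(w_from_svs, clf.coef_))              # True: same hyperplane direction
```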
Classifying
New Samples
Using only
SVs in SVM
Classification of a new sample 𝒙: ŷ = sign( Σ_{i ∈ SVs} α_i y^(i) (x^(i))ᵀ 𝒙 + b )
21/59
Clustering
22/59
Clusty.com
23/59
Clustering
Problem  We have a set of unlabeled data points {𝒙(i) }
where 1<i<N and we intend to find groups of
similar objects (based on the observed
features)
24/59
K-means
Clustering  Given: the number of clusters 𝐾 and a set of
unlabeled data 𝒳=𝒙1,…,𝒙N
 Goal: find groups of data points 𝒞={𝒞1,𝒞2,…,𝒞k}
 Hard Partitioning:
∀j: 𝒞_j ≠ ∅
∀i ≠ j: 𝒞_i ∩ 𝒞_j = ∅
⋃_j 𝒞_j = 𝒳
 Intra-cluster distances are small (compared with inter-cluster distances)
25/59
Distortion measure
 Our goal is to find 𝒞 = {𝒞1, 𝒞2, …, 𝒞K} and centroids {𝝁1, …, 𝝁K} so as to minimize the distortion J = Σ_{k=1}^{K} Σ_{x^(i) ∈ 𝒞_k} ‖x^(i) − 𝝁_k‖²
26/59
K-means
Algorithm
Select 𝐾 random points 𝝁1, 𝝁2, …, 𝝁K as the clusters’ initial centroids.
 Repeat until convergence (or another stopping criterion):
 for i = 1 to 𝑁:
 assign 𝒙^(𝑖) to the closest cluster (nearest centroid)
 for k = 1 to 𝐾:
 update centroid 𝝁_k to the mean of the points assigned to 𝒞_k
(a minimal code sketch follows below)
27/59
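A minimal NumPy sketch of the algorithm above, alternating the assignment and centroid-update steps and reporting the distortion J at the end. The three-blob dataset is generated only for illustration, and no empty-cluster handling is included.

```python
# Minimal K-means following the two alternating steps above (assign, then update centroids),
# with the distortion J = sum_k sum_{x in C_k} ||x - mu_k||^2. Data is invented.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
               rng.normal([3, 3], 0.5, (30, 2)),
               rng.normal([0, 3], 0.5, (30, 2))])
K = 3
mu = X[rng.choice(len(X), K, replace=False)]          # K random points as initial centroids

for _ in range(100):
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)          # N x K distances
    labels = d.argmin(axis=1)                                           # assign x to closest cluster
    new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # centroid update
    if np.allclose(new_mu, mu):                                         # converged
        break
    mu = new_mu

J = ((X - mu[labels]) ** 2).sum()                                       # distortion measure
print(labels[:10], J)
```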
K-means Algorithm
Step by Step
28/59
Assigning data to clusters Updating means
[Bishop]
29/59
Summary
of the First
Part
Data Mining
 Labeled data (classification, linearly separable):
 Hard-margin SVM: maximizing the margin
 Soft-margin SVM: handling noisy data and overlapping classes
 Unlabeled data (clustering):
 K-means: assigning data to clusters, centroid update
30/59
Association
Rules
Classification
Clustering
Association
Rules
Regression
31/59
Association Rules:
 Frequent Patterns
▹ Frequent Itemset
▹ Frequent Sequential Pattern
▹ …
 Relation Between Data
32/59
Example
{ Milk, Bread } is an itemset with k = 2 items
33/59
Support & Confidence
{Milk} → {Bread} [Support=50% , Confidence=100%]
ID | Items                        | k
1  | {Milk, Bread, Meat}          | 3
2  | {Sugar, Bread, Eggs}         | 3
3  | {Milk, Sugar, Bread, Butter} | 4
4  | {Bread, Butter}              | 2
34/59
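The support and confidence of {Milk} → {Bread} can be computed directly from the four transactions in the table; the short sketch below reproduces the 50% / 100% values on the slide.

```python
# Computing support and confidence for the rule {Milk} -> {Bread} over the four
# transactions in the table above (values match the slide: support 50%, confidence 100%).
transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]

X, Y = {"Milk"}, {"Bread"}
n_X  = sum(X <= t for t in transactions)            # transactions containing X
n_XY = sum((X | Y) <= t for t in transactions)      # transactions containing X and Y

support    = n_XY / len(transactions)               # 2 / 4 = 50%
confidence = n_XY / n_X                             # 2 / 2 = 100%
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```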
Association Rules
X → Y, where X ∩ Y = Ø
Rules must satisfy a minimum support (minsup) and a minimum confidence (minconf)
Y → X is not the same rule as X → Y
35/59
{Milk,Sugar} → {Bread}
Example
36/59
Frequent Pattern
{Milk,Bread}
With k items there are 2^k − 1 non-empty candidate itemsets whose support must be counted.
Over N transactions of up to A items each, a naive count takes on the order of (2^k − 1) · N · A comparisons.
 So we need an algorithm to decrease that → Apriori
37/59
Apriori
Algorithm  Used to find frequent itemsets
 Uses candidate generation
 Uses prior knowledge
 Level-wise search
 Uses a minimum support threshold
 Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
At level k, frequent k-itemsets are found; these are then used to generate candidate (k+1)-itemsets (a compact sketch follows below)
38/59
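A compact, illustrative Apriori sketch in plain Python: level-wise candidate generation, pruning by the Apriori property, and support counting. The transactions and the absolute min_support value are invented for the example and are not part of the original slides.

```python
# A compact level-wise Apriori sketch: generate (k+1)-item candidates from frequent
# k-itemsets, prune those with an infrequent subset, and count support by scanning.
# The transactions and min_support below are invented for illustration.
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]
min_support = 2   # absolute transaction count

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 1
while frequent:
    # Candidate generation: join frequent k-itemsets, keep only (k+1)-item results
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # Apriori property: every k-item subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k))}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent
    k += 1

print([set(f) for f in all_frequent])
```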
Apriori Algorithm
Step by Step
39/59
Example
11 items
5 transactions
40/59
Example …
without the minsup threshold there are 55 candidate 2-itemsets; with minsup pruning only 10 remain
41/59
Example …
combine frequent k-itemsets to generate candidate (k+1)-itemsets
{I1,I4} + {I2,I4} → {I1,I2,I4} ✓
{I1,I4} + {I2,I5} → {I1,I2,I4,I5} ✗ (not a (k+1)-itemset, so this pair is not joined)
42/59
Without the Apriori algorithm:
C(11,1) + C(11,2) + C(11,3) = 11 + 55 + 165 = 231 candidate itemsets
Using the Apriori algorithm:
11 + 10 + 6 = 27 candidate itemsets
43/59
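The candidate counts above can be verified directly (assuming the 27 comes from 11 frequent 1-itemsets, 10 candidate 2-itemsets and 6 candidate 3-itemsets in the worked example):

```python
# Candidate counts: without Apriori pruning, all 1-, 2- and 3-itemsets over 11 items;
# with pruning, only the candidates kept at each level of the example.
from math import comb

without_apriori = comb(11, 1) + comb(11, 2) + comb(11, 3)   # 11 + 55 + 165 = 231
with_apriori = 11 + 10 + 6                                  # 27
print(without_apriori, with_apriori)
```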
Association
Rules
 Now the association rules can be generated from the frequent patterns
 Each frequent k-itemset yields 2^k − 2 candidate rules (every non-empty proper subset can be the antecedent)
 Each candidate rule must also meet the minimum confidence
44/59
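A short sketch of rule generation from one frequent k-itemset: every non-empty proper subset is tried as the antecedent (2^k − 2 candidates) and only rules meeting a confidence threshold are kept. It reuses the earlier four-transaction example; the min_conf value is invented.

```python
# From one frequent k-itemset there are 2^k - 2 candidate rules (every non-empty proper
# subset as the antecedent). Keep the ones whose confidence meets min_conf.
# Uses the same four-transaction example; min_conf is invented for illustration.
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions)

itemset = frozenset({"Milk", "Bread"})      # a frequent 2-itemset -> 2^2 - 2 = 2 candidate rules
min_conf = 0.7

for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        conf = support(itemset) / support(antecedent)
        if conf >= min_conf:
            print(f"{set(antecedent)} -> {set(consequent)}  confidence = {conf:.0%}")
```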
Regression
45/59
Regression
 Unlike classification, which predicts categorical labels
 Used for prediction of numeric, continuous values
 Models the relation between independent and dependent variables
46/59
Linear Regression
Equation: y = w0 + w1·x1 + … + wp·xp
Simple equation (one attribute): y = w0 + w1·x
Error (sum of squared residuals): Σ_i (y_i − (w0 + w1·x_i))²
Least-squares estimates: w1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)², and w0 = ȳ − w1·x̄ (a code sketch follows below)
47/59
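A minimal NumPy sketch of simple linear regression using the least-squares estimates written above; the x/y data is invented and roughly follows y = 2x.

```python
# Simple linear regression y ~ w0 + w1*x with the least-squares estimates
# w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and w0 = y_bar - w1*x_bar.
# The data is invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])         # dependent variable (roughly y = 2x)

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

y_hat = w0 + w1 * x
sse = np.sum((y - y_hat) ** 2)                   # squared error being minimized
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}, SSE = {sse:.3f}")
```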
Example
48/59
Regression
Continue
 Linear regression assumes the class label (response) is linearly related to the attribute; the correlation coefficient can be used to check this
 Some nonlinear regressions can be transformed into linear form
 Logistic regression is a generalized linear model that models class probability
 Decision trees can be extended to regression trees by predicting continuous values rather than class labels
49/59
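For the correlation-coefficient check mentioned above, a short NumPy snippet (invented data) is enough:

```python
# Pearson correlation coefficient, used to check whether the response is (approximately)
# linearly related to an attribute before fitting a regression. Data is invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

r = np.corrcoef(x, y)[0, 1]       # close to +1 -> strong positive linear relation
print(f"r = {r:.3f}")
```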
Complementary
Information
Data Mining Tools, Usage and Types
3
50/59
Tools
Usage
Types
51/59
DM Tools
 Business Software:
 IBM Intelligent Miner
 SAS Enterprise Miner
 Microsoft SQL Server 2005
 SPSS Clementine
 …
 Open Source Software:
 Rapid-I RapidMiner
 Weka
 …
52/59
DM Usage
 Banking and finance
 Financial issues; high-quality data
 Granting loans, financial services
 Reducing risks
 Detecting money laundering and financial damage
 Marketing
 Massive, fast-growing data; e-commerce
 Shopping patterns, service quality, customer satisfaction
 Advertising, discounts
 Bioinformatics
 Laboratory information; protein structures (genes)
 Massive numbers of sequences
 Need for accurate computer algorithms to analyze them
53/59
DM Types
Text Mining
• No tables
• Books, articles, texts
• Semi-structured data
• Information retrieval and databases
• Key words
• Massive data and text
Web Mining
• Massive Unstructured,
Semi-structured,
Multimedia data
• Links, advertisements
• Poor quality, changing
• Web structure, Content,
Web usage Mining
• Search engines
Multimedia Mining
• Voice, image, video
• Nature of the data
• Key words or
Patterns and shapes
Graph Mining
• Electronic circuits,
image, web and …
• Graph search algorithm
• Difference, index
• Social network analysis
Spatial Mining
• Medical images,
VLSI layers
• Location based
• Efficient techniques
54/59
Conclusion
Challenges and Conclusion.
4
55/59
Challenges
 Individual systems or
Single-purpose systems
 Scalable and Interactive systems
 Standardization of data mining languages
 Complex data
 Distributed and Real Time data mining
56/59
Review
Introduction
• Data Mining
• Knowledge
Discovery
• DM Methods
DM Methods
• Classification
• Clustering
• Association Rules
• Regression
Conclusion
• Challenges
Complementary
Information
• DM Tools
• DM Usage
• DM Types
57/59
 C. M. Bishop; Pattern Recognition and Machine Learning; Springer, 2006.
 Jiawei Han, Micheline Kamber; Data Mining: Concepts and Techniques, Second Edition; Elsevier, 2006.
 J. Fürnkranz et al.; Foundations of Rule Learning (Cognitive Technologies); Springer-Verlag Berlin Heidelberg, 2012.
 Abraham Silberschatz, Henry F. Korth, S. Sudarshan; Database System Concepts, Sixth Edition; McGraw-Hill, 2010.
 Mehdi Esmaeili; Data Mining Concepts and Techniques (in Persian); Niaz-e Danesh, Tir 1391 (July 2012).
References
58/59
Thanks!
Any questions?
You can find us at z.poorbahman1@yahoo.com & bs.motavali@yahoo.com
😉
59/59