Introduction to Datamining
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: anto.satriyo@bppt.go.id, asnugroho@ieee.org
URL: http://asnugroho.net
• What is datamining?
• Datamining techniques
• Example applications of datamining
• Tutorial on using the datamining software “WEKA”
• Further Readings
Agenda
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Commercial Viewpoint
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
Large Scale Data: Sky Survey Cataloging
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
n Measuring the expression of
genes
n Possible to obtain the expression
of thousands of genes
n Disease classification
	
Microarray
http://cmgm.stanford.edu/pbrown/array.html
• Definition: the automatic (or semiautomatic) process of discovering meaningful patterns in data
• Extracting
– implicit
– previously unknown
– potentially useful
information from data
Definition of Datamining
The Datamining Process
Datamining Tasks
• Prediction
use some variables to predict unknown or future
values of other variables
• Description
find human-interpretable patterns that describe the
data
Datamining Tasks
• Prediction
classification
regression
deviation detection
• Description
clustering
association rule discovery
sequential pattern discovery
Datamining Techniques
• Classification: Decision Tree, Rule-based, Bayesian, Artificial Neural Network, Support Vector Machine
• Regression: Linear Regression, Regression Tree, Artificial Neural Network, Support Vector Machine
• Clustering: K-means Clustering, Self Organizing Map, Agglomerative Hierarchical Clustering, DBSCAN
Rule: find the most similar pattern in the training set, then assign the class of that pattern to the test pattern.
Example: X is the test pattern, and the training patterns belong to Class A or Class B. The pattern nearest to X belongs to Class A, so the class of X is A.
Nearest Neighbor Classifier
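The nearest neighbor rule can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the Euclidean distance, the function name `nearest_neighbor`, and the toy training points are all assumptions made for the example.

```python
import math

def nearest_neighbor(train, x):
    """Return the class label of the training pattern closest to x.

    train: list of (point, label) pairs; distance is Euclidean (an assumption).
    """
    _, best_label = min(train, key=lambda item: math.dist(item[0], x))
    return best_label

# Hypothetical training set with two classes, A and B
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

x = (2.0, 1.5)                      # test pattern X
print(nearest_neighbor(train, x))   # the nearest pattern is (1.5, 2.0), so X gets class A
```

Every query scans the whole training set, so there is no training phase, but prediction cost grows with the number of stored patterns.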
Association Rules & Basket Analysis
Cash register data:
“Customers who bought A and B have a high probability of buying expensive product C”
A, B ⇒ C
Marketing Strategy:
• Sell A, B and C as one set
• Place A, B and C in one corner
• Etc.
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
$I = \{i_1, i_2, \ldots, i_m\}$ : Items (products)
$D$ : Database of transactions
Association rule: $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
Rule $X \Rightarrow Y$ ($X$: antecedent, $Y$: consequent)
Support s%: the ratio of the number of transactions containing $X \cup Y$ to the total number of transactions
Confidence c%: the ratio of the number of transactions containing $X \cup Y$ to the number of transactions containing $X$
Confidence & Supports
TID   Items
001   Beer, coca cola, diapers
002   Beer, diapers
003   Beer, flour
004   Butter, egg, flour

Association Rule     Support   Confidence
beer ⇒ diapers       50%       67%
beer ⇒ coca cola     25%       33%
butter ⇒ flour       25%       100%
beer ⇒ flour         25%       33%
Confidence & Supports
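The support and confidence figures for the four example transactions can be checked with a short Python sketch. The function names `support` and `confidence` and the set-based representation of transactions are assumptions for illustration only.

```python
# The four example transactions (TID 001-004), each as a set of items
transactions = [
    {"beer", "coca cola", "diapers"},
    {"beer", "diapers"},
    {"beer", "flour"},
    {"butter", "egg", "flour"},
]

def support(x, y, db):
    """Fraction of all transactions that contain every item of X and Y."""
    both = x | y
    return sum(1 for t in db if both <= t) / len(db)

def confidence(x, y, db):
    """Among transactions containing X, the fraction that also contain Y."""
    both = x | y
    n_x = sum(1 for t in db if x <= t)
    n_both = sum(1 for t in db if both <= t)
    return n_both / n_x if n_x else 0.0

rules = [
    ({"beer"}, {"diapers"}),
    ({"beer"}, {"coca cola"}),
    ({"butter"}, {"flour"}),
    ({"beer"}, {"flour"}),
]
for x, y in rules:
    s = support(x, y, transactions)
    c = confidence(x, y, transactions)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.0%}, confidence {c:.0%}")
```

For beer ⇒ diapers, two of the four transactions contain both items (support 50%), and two of the three transactions containing beer also contain diapers (confidence 67%).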
• Items: m → the number of possible association rules is $\sum_{k=2}^{m} \binom{m}{k} \left(2^{k} - 2\right)$
• m = 10 → about 57,000 rules; m = 100 → about $5.15 \times 10^{47}$ rules
• A large number of rules are generated, but only a few of them are really useful
• Useful rules:
– high scores for both support & confidence
– a low support score means the rule applies to only a few cases
Confidence & Supports
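The growth in the number of candidate rules can be verified numerically. The sketch below evaluates the sum over itemset sizes k, where each k-itemset can be split into a non-empty antecedent and a non-empty consequent in 2^k − 2 ways. The function name `count_rules` is an assumption; the closed form 3^m − 2^(m+1) + 1 is used only as a cross-check.

```python
from math import comb

def count_rules(m):
    """Number of possible association rules over m items.

    For each itemset of size k >= 2, every split into a non-empty
    antecedent and a non-empty consequent gives one rule: 2**k - 2 splits.
    """
    return sum(comb(m, k) * (2**k - 2) for k in range(2, m + 1))

print(count_rules(10))    # 57002, i.e. about 57,000 rules
print(count_rules(100))   # a 48-digit number, roughly 5.15 x 10^47

# Cross-check against the closed form 3^m - 2^(m+1) + 1
assert count_rules(10) == 3**10 - 2**11 + 1
```

The count grows roughly like 3^m, which is why rule mining algorithms prune candidates by minimum support and confidence instead of enumerating every rule.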
Artificial Neural Networks
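[Figure: input signals x1, x2, x3, ..., xn are multiplied by weights w1, w2, w3, ..., wn, summed, and passed through f to produce the output y]
w = synapses (weights)
f = activation function
A mathematical model of information processing in the human brain
McCulloch-Pitts model (1943)
$y = f\left(\sum_{i=1}^{n} x_i \times w_i\right)$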
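The neuron model can be written directly as code. The sketch below is illustrative: the step activation, the threshold value, and the AND-gate weights are assumptions chosen for the example, not values from the slides.

```python
def step(a, threshold=0.0):
    """Threshold activation: fire (1) when the weighted sum reaches the threshold."""
    return 1 if a >= threshold else 0

def neuron(x, w, f=step):
    """y = f(sum_i x_i * w_i): weighted sum of the inputs passed through f."""
    return f(sum(xi * wi for xi, wi in zip(x, w)))

# Hypothetical example: a logical AND gate with weights (1, 1) and threshold 1.5
def and_activation(a):
    return step(a, threshold=1.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(x, (1, 1), and_activation))
```

Only the input (1, 1) gives a weighted sum of 2, which exceeds the threshold of 1.5, so the neuron fires for that input alone.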
Two aspects:
- architecture : how the neurons are connected
- training algorithm:
algorithm to adjust the synapses so that the ANN performs
the desired input-output mapping
Artificial Neural Network
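Two aspects:
- architecture: multilayer perceptron
- training algorithm: backpropagation algorithm
(popularized by Rumelhart, Hinton, and Williams, 1986)
[Figure: input information enters the input layer, passes through weights w to the hidden layer, then through weights w to the output layer, which produces the output]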
Artificial Neural Network
The decrease in error during the training phase
of a neural network
=
“knowledge” acquisition
Artificial Neural Network
(training phase)
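The idea that training drives the error down can be seen on a very small example. The sketch below is not full backpropagation through a multilayer perceptron: it trains a single sigmoid neuron by plain gradient descent on the mean squared error. The AND training set, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy training set (assumed): logical AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]   # synaptic weights
b = 0.0          # bias
lr = 1.0         # learning rate (assumed)
errors = []      # mean squared error recorded at every epoch

for epoch in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    total = 0.0
    for x, t in data:
        y = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total += (y - t) ** 2
        # derivative of (y - t)^2 with respect to the weighted sum
        delta = 2 * (y - t) * y * (1 - y)
        grad_w[0] += delta * x[0]
        grad_w[1] += delta * x[1]
        grad_b += delta
    n = len(data)
    errors.append(total / n)
    # move the weights against the averaged gradient
    w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    b -= lr * grad_b / n

print(f"error at first epoch: {errors[0]:.4f}")
print(f"error at last epoch:  {errors[-1]:.4f}")
```

The recorded error starts at 0.25 (every output is 0.5 before training) and ends lower. In this picture, the "knowledge" is whatever the adjusted weights now encode about the input-output mapping.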
• Introduced by Vapnik and colleagues (1992)
• SVM satisfies three conditions for an ideal pattern
recognition method
– Robustness
– Theoretical analysis
– Feasibility
• In principle, SVM works as a binary classifier
• Structural Risk Minimization
Support Vector Machines
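[Figure: several possible discrimination boundaries separating Class -1 from Class +1]
Binary Classification
[Figure: the margin between Class -1 and Class +1 around the separating hyperplane]
Optimal Hyperplane by SVM
[Figure: the mapping Φ takes a pattern X in the input space to Φ(X) in a high-dimensional feature space, where a hyperplane separates the classes]
Non Linear Classification in SVM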
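The effect of mapping patterns into a higher-dimensional feature space can be shown with a small hand-built example. This is only a sketch of the idea, not an SVM: the feature map `phi`, the toy samples, and the hyperplane parameters `w` and `b` are all hypothetical and chosen by hand, with no margin optimization performed.

```python
def phi(x):
    """Hypothetical feature map from a 1-D input space to a 2-D feature space."""
    return (x, x * x)

# Toy data (assumed): class +1 when |x| > 1, class -1 otherwise.
# No single threshold on x separates the classes in the input space.
samples = [(-2.0, +1), (-1.5, +1), (-0.5, -1), (0.0, -1),
           (0.5, -1), (1.5, +1), (2.0, +1)]

# Hand-picked hyperplane in feature space, w . z + b = 0, i.e. x^2 = 1
w = (0.0, 1.0)
b = -1.0

def classify(x):
    """Classify x by the side of the feature-space hyperplane that phi(x) falls on."""
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1

for x, label in samples:
    print(f"x = {x:5.1f}  true class {label:+d}  predicted {classify(x):+d}")
```

In the feature space the two classes sit on opposite sides of the line x² = 1, so a linear boundary there corresponds to a non-linear boundary in the original input space.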
• Fog forecasting
• Bioinformatics
• Sky survey Cataloging (Fayyad et al.)
• Spatio-Temporal Analysis of Disease Spreading using
Webmining
• Foreign Exchange Rate Prediction
• Network Intrusion Detection
• Etc.
Applications of Datamining
Sky Survey Cataloging
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially
visually faint ones, based on the telescopic survey images (from
Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some
of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Early
Intermediate
Late
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Class:
• Stages of Formation
Attributes:
• Image features,
• Characteristics of light
waves received, etc.
Courtesy: http://aps.umn.edu
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Classifying Galaxies
Further Readings
• Datamining books, including:
• Pang Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Datamining, Addison Wesley, 2006
• Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis [Datamining: Techniques for Utilizing Data for Business Purposes], Budi Santosa, Graha Ilmu, 2007
• Tutorial on datamining by Dr. Iko Pramudiono (NTT, Japan) http://datamining.japati.net/
• Datamining, Knowledge Discovery and Bioinformatics, Shinichi Morishita (winner of KDD 2001): http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
• AS. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi [Datamining in Bioinformatics: uncovering information hidden in the ocean of biological data], SDA Asia Magazine, No.13, pp. 64-66, March 2006 http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/
