Introduction to Datamining
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: anto.satriyo@bppt.go.id, asnugroho@ieee.org
URL: http://asnugroho.net
• What is datamining?
• Datamining techniques
• Example applications of datamining
• Tutorial on using the datamining software “WEKA”
• Further Readings
Agenda
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Commercial Viewpoint
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
Large Scale Data: Sky Survey Cataloging
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
n Measuring the expression of
genes
n Possible to obtain the expression
of thousands of genes
n Disease classification
	
Microarray
http://cmgm.stanford.edu/pbrown/array.html
• Definition: the automatic (or semi-automatic) process of discovering meaningful patterns in data
• Extracting
– implicit
– previously unknown
– potentially useful
information from data
Definition of Datamining
The Datamining Process
Datamining Tasks
• Prediction
use some variables to predict unknown or future
values of other variables
• Description
find human-interpretable patterns that describe the
data
Datamining Tasks
• Prediction
classification
regression
deviation detection
• Description
clustering
association rule discovery
sequential pattern discovery
Datamining Techniques
• Classification
Decision Tree
Rule-based
Bayesian
Artificial Neural Network
Support Vector Machine
• Regression
Linear Regression
Regression Tree
Artificial Neural Network
Support Vector Machine
• Clustering
K-means Clustering
Self Organizing Map
Agglomerative Hierarchical Clustering
DBSCAN
Rule: find the pattern in the training set that is most similar to the test pattern, then assign the test pattern the class of that pattern
[Figure: test pattern X plotted among training patterns of Class A and Class B]
X is the test pattern
The class of the nearest pattern is A, so the class of X is A
Nearest Neighbor Classifier
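The nearest neighbor rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the training patterns, the test pattern, and the function name `nearest_neighbor_classify` are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math

def nearest_neighbor_classify(training_set, x):
    """Return the class label of the training pattern closest to x (1-NN).

    training_set is a list of (features, label) pairs.
    Euclidean distance is an assumption; other measures could be used.
    """
    _, nearest_label = min(
        training_set,
        key=lambda pair: math.dist(pair[0], x),
    )
    return nearest_label

# Hypothetical 2-D training patterns for Class A and Class B
training_set = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

x = (2.0, 1.5)  # hypothetical test pattern
predicted = nearest_neighbor_classify(training_set, x)
print(predicted)
```

The classifier has no training step: all work happens at query time, so the cost of one prediction grows with the size of the training set.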
Association Rules & Basket Analysis
Cash register data:
“A customer who buys A and B has a high probability of buying the expensive product C”
Marketing Strategy:
• Sell A, B and C as one set
• Place A, B and C in one corner
• Etc.
A, B ⇒ C
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
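$I = \{i_1, i_2, \ldots, i_m\}$ : Items (products)
$D$ : Database of transactions
An association rule has the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$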
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
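For a rule $X \Rightarrow Y$, $X$ is the antecedent and $Y$ is the consequent
Support s% : the ratio of the number of transactions containing $X \cup Y$ to the total number of transactions
Confidence c% : the ratio of the number of transactions containing $X \cup Y$ to the number of transactions containing $X$
Confidence & Supports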
TID   Items
001   Beer, coca cola, diapers
002   Beer, diapers
003   Beer, flour
004   Butter, egg, flour

Association Rule      Support   Confidence
beer ⇒ diapers        50%       67%
beer ⇒ coca cola      25%       33%
butter ⇒ flour        25%       100%
beer ⇒ flour          25%       33%
Confidence & Supports
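The support and confidence values for this example can be checked with a short Python sketch. The four transactions are taken from the slide; the function names and the choice to return 0.0 when the antecedent never occurs are assumptions made for illustration.

```python
# Transactions from the slide, keyed by TID
transactions = {
    "001": {"beer", "coca cola", "diapers"},
    "002": {"beer", "diapers"},
    "003": {"beer", "flour"},
    "004": {"butter", "egg", "flour"},
}

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = antecedent | consequent
    hits = sum(1 for items in transactions.values() if both <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    with_x = [items for items in transactions.values() if antecedent <= items]
    if not with_x:
        return 0.0  # assumption: a rule whose antecedent never occurs gets 0
    hits = sum(1 for items in with_x if consequent <= items)
    return hits / len(with_x)

rules = [
    ({"beer"}, {"diapers"}),
    ({"beer"}, {"coca cola"}),
    ({"butter"}, {"flour"}),
    ({"beer"}, {"flour"}),
]

for x, y in rules:
    s = support(x, y, transactions)
    c = confidence(x, y, transactions)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.0%}, confidence {c:.0%}")
```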
• Items: m → the number of association rules is $\sum_{k=2}^{m} \binom{m}{k} \left(2^{k} - 2\right)$
• m = 10 → about 57,000 rules; m = 100 → about $5.15 \times 10^{47}$ rules
• A large number of rules is generated, but only a few of them are really useful
• Useful rules:
– High scores for both support and confidence
– A low support score means the rule applies to only a few cases
Confidence & Supports
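The rule counts quoted on this slide can be reproduced with a small Python sketch. It assumes the standard counting argument: every itemset of k ≥ 2 items can be split into a non-empty antecedent and a non-empty consequent in 2^k − 2 ways, and there are C(m, k) such itemsets. The function names and the closed form 3^m − 2^(m+1) + 1 (obtained from the binomial theorem) are not from the slides.

```python
from math import comb

def count_rules(m):
    """Number of possible association rules over m items (direct sum)."""
    return sum(comb(m, k) * (2**k - 2) for k in range(2, m + 1))

def count_rules_closed_form(m):
    """Same count via the binomial theorem: 3^m - 2^(m+1) + 1."""
    return 3**m - 2**(m + 1) + 1

for m in (10, 100):
    print(m, count_rules(m), f"{count_rules(m):.3e}")
```

The count grows roughly like 3^m, which is why practical algorithms prune candidates with minimum support and confidence thresholds instead of enumerating every rule.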
Artificial Neural Networks
[Figure: a single neuron. Input signals x1, x2, x3, ..., xn are weighted by w1, w2, w3, ..., wn and passed through f to produce the output y]
w = synapses
f = activation function
A mathematical model of information processing in the human brain
McCulloch-Pitts model (1943)
$y = f\left(\sum_{i=1}^{n} x_i \times w_i\right)$
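The neuron model y = f(Σ x_i × w_i) can be written directly in Python. This is a minimal sketch: the step activation with a threshold and the AND-gate weights are illustrative assumptions, not values from the slides.

```python
def step(a, threshold=0.0):
    """Threshold activation: fire (1) if the weighted sum reaches the threshold."""
    return 1 if a >= threshold else 0

def neuron(inputs, weights, f):
    """Compute y = f(sum of x_i * w_i) for one neuron."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return f(weighted_sum)

# Hypothetical example: weights (1, 1) with threshold 2 behave like a logical AND
and_gate = lambda x1, x2: neuron((x1, x2), (1.0, 1.0), lambda a: step(a, 2.0))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_gate(x1, x2))
```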
Two aspects:
- architecture : how the neurons are connected
- training algorithm: the algorithm that adjusts the synapses so that the ANN performs the desired input-output mapping
Artificial Neural Network
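Two aspects:
- architecture: multilayer perceptron
- training algorithm: backpropagation algorithm (popularized by Rumelhart, Hinton, and Williams, 1986)
[Figure: multilayer perceptron. Input information enters the input layer, passes through weights w to the hidden layer, then through weights w to the output layer, which produces the output]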
Artificial Neural Network
The decrease of the error during the training phase of a neural network = “knowledge” acquisition
Artificial Neural Network
(training phase)
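The idea that training drives the error down can be illustrated with a small Python sketch. It is deliberately simpler than backpropagation through a multilayer perceptron: a single sigmoid neuron is trained by batch gradient descent on a hypothetical OR dataset. The learning rate, epoch count, squared-error measure, and all names are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(samples, epochs=200, learning_rate=0.5):
    """Train one sigmoid neuron by batch gradient descent on squared error.

    Returns the learned weights, the bias, and the error recorded at each epoch.
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    errors = []
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        total_error = 0.0
        for inputs, target in samples:
            y = sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)
            diff = y - target
            total_error += 0.5 * diff * diff
            delta = diff * y * (1.0 - y)  # derivative of the error w.r.t. the weighted sum
            for i, x in enumerate(inputs):
                grad_w[i] += delta * x
            grad_b += delta
        errors.append(total_error)
        # Adjust the "synapses" against the gradient
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, errors

# Hypothetical training set: logical OR
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, errors = train(or_samples)
print(f"error at first epoch: {errors[0]:.4f}, at last epoch: {errors[-1]:.4f}")
```

Plotting `errors` against the epoch number gives the kind of falling error curve that this slide describes.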
• Introduced by Vapnik and colleagues (1992)
• SVM satisfies three conditions for an ideal pattern recognition method
– Robustness
– Theoretical analysis
– Feasibility
• In principle, SVM works as a binary classifier
• Structural Risk Minimization
Support Vector Machines
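[Figure: several possible discrimination boundaries separating Class -1 from Class +1]
Binary Classification
[Figure: the optimal hyperplane and its margin between Class -1 and Class +1]
Optimal Hyperplane by SVM
[Figure: the mapping Φ: X → Φ(X) takes patterns from the input space to a high-dimensional feature space, where a hyperplane separates them]
Non Linear Classification in SVM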
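The effect of mapping the input space to a higher-dimensional feature space can be shown with a toy Python sketch. XOR-style data cannot be separated by a line in the 2-D input space, but after the hypothetical mapping Φ(x1, x2) = (x1, x2, x1·x2) a plane separates it. The mapping, the hand-picked hyperplane (w, b), and all names are assumptions for illustration. The hyperplane is not obtained by SVM training and is not claimed to be the maximum-margin one.

```python
def phi(x):
    """Hypothetical feature map from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_decision(z, w, b):
    """Classify a feature-space point by the side of the hyperplane w.z + b = 0."""
    s = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if s > 0 else -1

# XOR-style data: not linearly separable in the input space
xor_data = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

# Hand-picked hyperplane in feature space (an assumption, not a trained SVM)
w = (1.0, 1.0, -2.0)
b = -0.5

for x, label in xor_data:
    print(x, phi(x), label, linear_decision(phi(x), w, b))
```

In practice SVMs usually avoid computing Φ(X) explicitly and use a kernel function to evaluate inner products in the feature space.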
• Fog forecasting
• Bioinformatics
• Sky survey Cataloging (Fayyad et al.)
• Spatio-Temporal Analysis of Disease Spreading using
Webmining
• Foreign Exchange Rate Prediction
• Network Intrusion Detection
• Etc.
Application of Datamining
Sky Survey Cataloging
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially
visually faint ones, based on the telescopic survey images (from
Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some
of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Early
Intermediate
Late
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Class:
• Stages of Formation
Attributes:
• Image features,
• Characteristics of light
waves received, etc.
Courtesy: http://aps.umn.edu
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Classifying Galaxies
Further Readings
• Datamining books, including:
• Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison Wesley, 2006
• Budi Santosa: Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Graha Ilmu, 2007
• Tutorial on datamining by Dr. Iko Pramudiono (NTT, Japan) http://datamining.japati.net/
• Datamining, Knowledge Discovery and Bioinformatics, Shinichi Morishita (winner of KDD 2001): http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
• A.S. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi, SDA Asia Magazine, No. 13, pp. 64-66, March 2006 http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/
