1. Introduction to Datamining
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: anto.satriyo@bppt.go.id, asnugroho@ieee.org
URL: http://asnugroho.net
2. Agenda
• What is datamining?
• Datamining techniques
• Example applications of datamining
• Tutorial on using the datamining software “WEKA”
• Further Readings
3. Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
4. Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
5. Large Scale Data: Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
7. Microarray
• Measuring the expression of genes
• Possible to obtain the expression of thousands of genes
• Disease classification
http://cmgm.stanford.edu/pbrown/array.html
8. Definition of Datamining
• Definition: the automatic (or semi-automatic) process of discovering meaningful patterns in data
• Extracting information from data that is
– implicit
– previously unknown
– potentially useful
10. Datamining Tasks
• Prediction: use some variables to predict unknown or future values of other variables
• Description: find human-interpretable patterns that describe the data
15. Nearest Neighbor Classifier
Rule: find the most similar pattern in the training set, then assign the class of that pattern to the test data
Figure: a test pattern X plotted among training patterns of Class A and Class B. The pattern nearest to X belongs to Class A, so the class of X is A.
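The nearest-neighbor rule can be sketched in a few lines of Python. This is only an illustration: the training points, the test pattern and the use of Euclidean distance as the similarity measure are assumptions, not taken from the slide.

```python
import math

def nearest_neighbor_classify(train, x):
    """Return the class label of the training pattern closest to x.

    train: list of (feature_tuple, label) pairs
    x:     feature tuple of the test pattern
    Similarity is measured by Euclidean distance (an assumption).
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        d = math.dist(features, x)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical training set: two patterns per class.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

x = (2.0, 1.5)  # test pattern X
predicted = nearest_neighbor_classify(train, x)
print("Class of X:", predicted)
```

The nearest training pattern to X here is (1.5, 2.0), which belongs to Class A, so X is assigned to Class A.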
17. Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
Cash register data:
“Customers who bought A and B have a high probability of buying expensive product C”
A, B ⇒ C
Marketing strategy:
• Sell A, B and C as one set
• Place A, B and C in one corner
• Etc.
18. Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
$I = \{i_1, i_2, \ldots, i_m\}$ : Items (products)
$D$ : Database of transactions
$X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, $X \cap Y = \emptyset$
19. Confidence & Supports
For a rule $X \Rightarrow Y$ ($X$: antecedent, $Y$: consequent):
Support s% : the ratio of transactions containing $X \cup Y$ to the total number of transactions
Confidence c% : the ratio of transactions containing $X \cup Y$ to the total number of transactions containing $X$
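A small worked example may help make support and confidence concrete. The transaction database below is made up for illustration, and the function names are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, X, Y):
    """Fraction of all transactions that contain X ∪ Y."""
    return count_containing(transactions, set(X) | set(Y)) / len(transactions)

def confidence(transactions, X, Y):
    """Fraction of the transactions containing X that also contain Y."""
    return (count_containing(transactions, set(X) | set(Y))
            / count_containing(transactions, X))

# Hypothetical database D of five transactions.
D = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "D"},
]

s = support(D, {"A", "B"}, {"C"})     # 2 of 5 transactions contain A, B and C
c = confidence(D, {"A", "B"}, {"C"})  # 2 of the 3 transactions with A and B contain C
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

For the rule A, B ⇒ C in this toy database, the support is 2/5 and the confidence is 2/3.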
21. Confidence & Supports
• Items: m → the number of association rules is $\sum_{k=2}^{m} \binom{m}{k} (2^k - 2)$
• m = 10 → about 57,000 rules; m = 100 → about $5.15 \times 10^{47}$ rules
• A large number of rules are generated, but only a few of them are really useful
• Useful rules:
– high scores for both support & confidence
– a low support score means the rule is applicable to only a few cases
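The rule-count formula can be checked numerically. The sketch below evaluates the sum directly and compares it with the closed form $3^m - 2^{m+1} + 1$, which follows from the binomial theorem (the closed form is derived here and is not on the slide).

```python
from math import comb

def count_rules(m):
    """Number of rules X => Y over m items: choose an itemset of size k >= 2,
    then split it into a non-empty antecedent and a non-empty consequent
    (2**k - 2 ways)."""
    return sum(comb(m, k) * (2**k - 2) for k in range(2, m + 1))

def count_rules_closed_form(m):
    """Closed form of the same sum, derived from the binomial theorem."""
    return 3**m - 2**(m + 1) + 1

rules_10 = count_rules(10)
rules_100 = count_rules(100)
print(rules_10)             # 57002
print(f"{rules_100:.3e}")
```

With 10 items there are already 57,002 possible rules, and with 100 items the count is on the order of 10^47, which is why filtering by support and confidence is needed.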
23. Artificial Neural Network
Two aspects:
- architecture: how the neurons are connected
- training algorithm: the algorithm that adjusts the synapses so that the ANN performs the desired input-output mapping
24. Artificial Neural Network
Two aspects:
- architecture: multilayer perceptron
- training algorithm: backpropagation algorithm (popularized by Rumelhart, Hinton and Williams, 1986)
Figure: input information → Input Layer → (weights w) → Hidden Layer → (weights w) → Output Layer → output
25. Artificial Neural Network (training phase)
Decrease of error during the training phase of a neural network = “knowledge” acquisition
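To illustrate how the error decreases during training, here is a minimal sketch of a multilayer perceptron with one hidden layer trained by backpropagation. The XOR data set, the network size, the learning rate and the number of epochs are all assumptions chosen for illustration, not values from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer perceptron with batch backpropagation.

    Returns the list of summed squared errors, one value per epoch.
    """
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # The last weight of every neuron is its bias.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    errors = []
    for _ in range(epochs):
        g_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_o = [0.0] * (n_hidden + 1)
        total = 0.0
        for x, t in data:
            # Forward pass: input layer -> hidden layer -> output layer.
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]
            y = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            total += 0.5 * (y - t) ** 2
            # Backward pass: propagate the output error to the hidden layer.
            delta_o = (y - t) * y * (1.0 - y)
            for j in range(n_hidden + 1):
                g_o[j] += delta_o * hb[j]
            for j in range(n_hidden):
                delta_h = delta_o * w_o[j] * h[j] * (1.0 - h[j])
                for i in range(n_in + 1):
                    g_h[j][i] += delta_h * xb[i]
        # Gradient descent step on all weights (the "synapses").
        for j in range(n_hidden + 1):
            w_o[j] -= lr * g_o[j]
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_h[j][i] -= lr * g_h[j][i]
        errors.append(total)
    return errors

# Hypothetical training set: the XOR function.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

errors = train_mlp(XOR)
print("error at first epoch:", errors[0])
print("error at last epoch: ", errors[-1])
```

Plotting `errors` against the epoch number would give the kind of decreasing curve the slide refers to. How low the error gets depends on the random initial weights.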
26. Support Vector Machines
• Invented by Vapnik (1992)
• SVM satisfies three conditions for an ideal pattern recognition method
– Robustness
– Theoretical analysis
– Feasibility
• In principle, SVM works as a binary classifier
• Structural Risk Minimization
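As a rough illustration of SVM as a binary classifier, the sketch below trains a linear soft-margin SVM by subgradient descent on the hinge loss with an L2 penalty on the weights. This is a simplification: it is not Vapnik's original quadratic-programming formulation, it uses no kernel, and the data, the hyperparameters (`lam`, `lr`, `epochs`) and the function names are all hypothetical.

```python
def train_linear_svm(samples, lam=0.01, lr=0.1, epochs=200):
    """Linear soft-margin SVM trained by subgradient descent.

    samples: list of (feature_tuple, label) with label in {-1, +1}
    lam:     weight of the L2 penalty on w (a smaller norm of w
             corresponds to a larger margin)
    """
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample is inside the margin or misclassified:
                # move the hyperplane towards classifying it correctly.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Sample is safely classified: only shrink w (regularization).
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Binary decision: +1 on one side of the hyperplane, -1 on the other."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical, linearly separable training data.
samples = [
    ((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.0, 3.0), 1),
    ((-2.0, -2.0), -1), ((-3.0, -3.0), -1), ((-2.0, -3.0), -1),
]

w, b = train_linear_svm(samples)
print([predict(w, b, x) for x, _ in samples])
```

Problems with more than two classes are usually handled by combining several binary classifiers of this kind.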
30. Applications of Datamining
• Fog forecasting
• Bioinformatics
• Sky Survey Cataloging (Fayyad et al.)
• Spatio-temporal analysis of disease spreading using webmining
• Foreign exchange rate prediction
• Network intrusion detection
• Etc.
31. Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
32. Classifying Galaxies
Class:
• Stages of formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Courtesy: http://aps.umn.edu
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
33. Further Readings
• Books on datamining, among others:
• Pang Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Datamining, Addison Wesley, 2006
• Budi Santosa: Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Graha Ilmu, 2007
• Tutorial on datamining by Dr. Iko Pramudiono (NTT, Japan): http://datamining.japati.net/
• Datamining, Knowledge Discovery and Bioinformatics, Shinichi Morishita (winner of KDD 2001): http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
• A.S. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi, SDA Asia Magazine, No. 13, pp. 64-66, March 2006: http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/