Sequence Data Mining:
Techniques and Applications
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Mark Craven
University of Wisconsin
http://www.biostat.wisc.edu/~craven
What is a sequence?
• Ordered set of elements: s = a1, a2, …, an
• Each element ai could be
– Numerical
– Categorical: domain a finite set of symbols S, |S|=m
– Multiple attributes
• The length n of a sequence is not fixed
• Order determined by time or position and could
be regular or irregular
Real-life sequences
• Classical applications
– Speech: sequence of phonemes
– Language: sequence of words and delimiters
– Handwriting: sequence of strokes
• Newer applications
– Bioinformatics:
• Genes: Sequence of 4 possible nucleotides, |S|=4
– Example: AACTGACCTGGGCCCAATCC
• Proteins: Sequence of 20 possible amino-acids, |S|=20
– Example:
– Telecommunications: alarms, data packets
– Retail data mining: customer behavior, sessions in an e-store
(e.g., Amazon)
– Intrusion detection
Intrusion detection
• Intrusions could be detected at
– Network-level (denial-of-service attacks,
port-scans, etc) [KDD Cup 99]
• Sequence of TCP-dumps
– Host-level (attacks on privileged programs
like lpr, sendmail)
• Sequence of system calls
• S = set of all possible system calls, |S| ≈ 100
Example trace: open lseek lstat mmap execve ioctl ioctl close execve close unlink
Outline
• Traditional mining operations on sequences
– Classification
– Clustering
– Finding repeated patterns
• Primitives for handling sequence data
• Sequence-specific mining operations
– Partial sequence classification (Tagging)
– Segmenting a sequence
– Predicting next symbol of a sequence
• Applications in biological sequence mining
Classification of whole sequences
Given:
– a set of classes C and
– a number of example sequences in each class,
train a model so that for an unseen sequence we
can say to which class it belongs
Example:
– Given a set of protein families, find family of a new
protein
– Given a sequence of packets, label session as
intrusion or normal
– Given several utterances of a set of words, classify a
new utterance to the right word
Conventional classification methods
• Conventional classification methods assume
– record data: fixed number of attributes
– single record per instance to be classified (no order)
• Sequences:
– variable length, order important.
• Adapting conventional classifiers to sequences
– Generative classifiers
– Boundary-based classifiers
– Distance based classifiers
– Kernel-based classifiers
Generative methods
• For each class i,
train a generative model Mi to
maximize likelihood over all
training sequences in the class i
• Estimate Pr(ci) as fraction of training
instances in class i
• For a new sequence x,
– find Pr(x|ci)*Pr(ci) for each i using Mi
– choose i with the largest value of Pr(x|ci)*P(ci)
[Figure: a new sequence x is scored against each class model, giving Pr(x|c1)*Pr(c1), Pr(x|c2)*Pr(c2), Pr(x|c3)*Pr(c3)]
Need a generative model for sequence data
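A minimal sketch of this decision rule in Python (not from the slides; `class_models` and `class_priors` are assumed interfaces, and the models' `log_likelihood` method stands in for whatever sequence model is used):

```python
import math

def classify_generative(x, class_models, class_priors):
    """Pick argmax_i Pr(x|c_i) * Pr(c_i), working in log space for numerical stability.

    class_models: dict label -> model object with a log_likelihood(x) method (hypothetical interface)
    class_priors: dict label -> Pr(c_i), estimated as the fraction of training instances in class i
    """
    best_label, best_score = None, -math.inf
    for label, model in class_models.items():
        score = model.log_likelihood(x) + math.log(class_priors[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```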
Boundary-based methods
• Data: points in a fixed multi-dimensional space
• Output of training: boundaries that define regions
within which the same class is predicted
• Application: Tests on boundaries to find region
Need to embed sequence data in a fixed coordinate space
Examples: decision trees, neural networks, linear discriminants
Kernel-based classifiers
• Define function K(xi, x) that intuitively defines similarity
between two sequences and satisfies two properties
– K is symmetric i.e., K(xi, x) = K(x, xi)
– K is positive definite
• Training: learn for each class c,
– wic for each train sequence xi
– bc
• Application: predict class of x
– For each class c, find f(x,c) = Σi wic K(xi, x) + bc
– Predicted class is c with highest value f(x,c)
• Well-known kernel classifiers
– Nearest neighbor classifier
– Support Vector Machines
– Radial Basis functions
Need kernel/similarity function
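As one concrete illustration (not from the slides), the k-gram "spectrum" count kernel below is symmetric and positive semi-definite, so it can be plugged into scikit-learn's SVC with kernel='precomputed'; the helper names, toy sequences, and labels are made up:

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def kgram_counts(seq, k=3):
    # counts of all length-k substrings of seq
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    # K(a, b) = sum over shared k-grams of count_a * count_b (an inner product of count vectors)
    ca, cb = kgram_counts(a, k), kgram_counts(b, k)
    return sum(ca[g] * cb[g] for g in ca if g in cb)

# toy training data: short symbol sequences with binary labels
train = ["olimeiio", "olimeiic", "aaabbbab", "ababbbaa"]
y = [0, 0, 1, 1]

gram = np.array([[spectrum_kernel(a, b) for b in train] for a in train], dtype=float)
clf = SVC(kernel="precomputed").fit(gram, y)

test = ["olimeaab"]
k_test = np.array([[spectrum_kernel(t, b) for b in train] for t in test], dtype=float)
print(clf.predict(k_test))
```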
Sequence clustering
• Given a set of sequences, create groups such
that similar sequences are in the same group
• Three kinds of clustering algorithms
– Distance-based:
• K-means
• Various hierarchical algorithms
– Model-based algorithms
• Expectation Maximization algorithm
– Density-based algorithms
• CLIQUE
Need: a similarity function (distance-based), generative models (model-based), a dimensional embedding (density-based)
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
– Embed sequence in a fixed dimensional space
• All conventional record mining techniques will apply
– Generative models for sequence
• Sequence classification: generative methods
• Clustering sequences: model-based approach
– Distance between two sequences
• Sequence classification: SVM and NN
• Clustering sequences: distance-based approach
• Sequence-specific mining operations
• Applications
Embedding sequences in a fixed dimensional space
• Ignore order, each symbol maps to a dimension
– extensively used in text classification and clustering
• Extract aggregate features
– Real-valued elements: Fourier coefficients, Wavelet coefficients,
Auto-regressive coefficients
– Categorical data: number of symbol changes
• Sliding window techniques (k: window size)
– Define a coordinate for each possible k-gram a (m^k coordinates)
• The a-th coordinate is the number of times a occurs in the sequence
• (k,b) mismatch score: the a-th coordinate is the number of k-grams in the sequence within b mismatches of a
– Define a coordinate for each of the k positions
• Multiple rows per sequence
Sliding window examples
Example trace: open lseek ioctl mmap execve ioctl ioctl open execve close mmap

One symbol per column (per-symbol counts, one row per trace):
sid   o   c   l   i   e   m
1     2   1   1   3   2   1
2     ..  ..  ..  ..  ..  ..
3     ..  ..  ..  ..  ..  ..

Sliding window, window size 3, one row per trace (k-gram counts):
sid   ioe  cli  oli  lie  lim  ...
1     1    0    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..

Multiple rows per trace (one row per window position):
sid   A1  A2  A3
1     o   l   i
1     l   i   m
1     i   m   e
1     ..  ..  ..
1     e   c   m

Mismatch scores, b = 1 (one row per trace):
sid   ioe  cli  oli  lie  lim  ...
1     2    1    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..
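A small sketch of how such fixed-dimensional rows might be built from a trace (hypothetical helper names; the k-gram vocabulary would normally be all m^k k-grams, or those observed in training):

```python
from collections import Counter

def kgram_vector(trace, k, vocab_kgrams):
    """One row per trace: count of each k-gram over a fixed k-gram vocabulary."""
    counts = Counter(tuple(trace[i:i + k]) for i in range(len(trace) - k + 1))
    return [counts[g] for g in vocab_kgrams]

def window_rows(trace, k):
    """Multiple rows per trace: one row of k symbols per window position."""
    return [trace[i:i + k] for i in range(len(trace) - k + 1)]

trace = ["open", "lseek", "ioctl", "mmap", "execve", "ioctl",
         "ioctl", "open", "execve", "close", "mmap"]
vocab = [("ioctl", "ioctl", "open"), ("open", "lseek", "ioctl")]  # illustrative subset of all m^k k-grams
print(kgram_vector(trace, 3, vocab))   # -> [1, 1]
print(window_rows(trace, 3)[:2])       # first two window rows
```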
Detecting attacks on privileged programs
• Short sequences of system calls made during
normal executions of a program are very
consistent, yet different from the sequences of
its abnormal executions
• Two approaches
– STIDE (Warrender 1999)
• Create a dictionary of unique k-windows in normal traces;
count what fraction of the windows in a new trace occur in the dictionary and threshold (see the sketch after this list)
– RIPPER –based (Lee 1998)
• next...
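Before moving on to the RIPPER-based approach, here is a rough sketch of the STIDE idea (hypothetical function names; the score used here is the fraction of windows not seen in normal traces, i.e. the mismatch fraction that is then thresholded):

```python
def stide_train(normal_traces, k):
    """Build the dictionary of all unique k-windows seen in normal traces."""
    normal = set()
    for trace in normal_traces:
        for i in range(len(trace) - k + 1):
            normal.add(tuple(trace[i:i + k]))
    return normal

def stide_score(trace, normal, k):
    """Fraction of k-windows in the new trace that never occurred in training."""
    windows = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
    if not windows:
        return 0.0
    return sum(w not in normal for w in windows) / len(windows)

# flag a trace as an intrusion if stide_score(...) exceeds a chosen threshold
```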
Classification models on k-grams trace data
• When both normal and abnormal data are available:
– class label = normal/abnormal
• When only normal traces are available:
– class label = k-th system call
Learn rules to predict the class label [RIPPER]

7-grams                                     class
vtimes open seek read read read seek       normal
lstat lstat lstat bind open close vtimes   abnormal
…                                           …

6-attributes                                class (the 7th call)
vtimes open seek read read read             seek
lstat lstat lstat bind open close           vtimes
…
Examples of output RIPPER rules
• Both-traces:
– if the 2nd system call is vtimes and the 7th is vtrace, then the
sequence is “normal”
– if the 6th system call is lseek and the 7th is sigvec, then the
sequence is “normal”
– …
– if none of the above, then the sequence is “abnormal”
• Only-normal:
– if the 3rd system call is lstat and the 4th is write, then the 7th is
stat
– if the 1st system call is sigblock and the 4th is bind, then the 7th
is setsockopt
– …
– if none of the above, then the 7th is open
Experimental results on sendmail traces
(numbers are the percent of mismatching traces)

Trace             Only-normal   BOTH
sscp-1            13.5          32.2
sscp-2            13.6          30.4
sscp-3            13.6          30.4
syslog-remote-1   11.5          21.2
syslog-remote-2    8.4          15.6
syslog-local-1     6.1          11.1
syslog-local-2     8.0          15.9
decode-1           3.9           2.1
decode-2           4.2           2.0
sm565a             8.1           8.0
sm5x               8.2           6.5
sendmail           0.6           0.1

• The output rule sets contain ~250 rules, each with 2 or 3 attribute tests
• Score each trace by counting the fraction of mismatches and thresholding
Summary: only normal traces are sufficient to detect intrusions
More realistic experiments
• Different programs need different thresholds
• Simple methods (e.g. STIDE) work as well
• Results sensitive to window size
• Is it possible to do better with sequence specific
methods?
             STIDE                    RIPPER
             threshold  %false-pos    threshold  %false-pos
Site-1 lpr   12         0.0           3          0.0016
Site-2 lpr   12         0.0013        4          0.0265
named        20         0.0019        10         0.0
xlock        20         0.00008       10         0.0
[from Warrender 99]
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
– Embed sequence in a fixed dimensional space
• All conventional record mining techniques will apply
– Distance between two sequences
• Sequence classification: SVM and NN
• Clustering sequences: distance-based approach
– Generative models for sequences
• Sequence classification: whole and partial
• Clustering sequences: model-based approach
• Sequence-specific mining operations
• Applications in biological sequence mining
Probabilistic models for sequences
• Independent model
• One-level dependence (Markov chains)
• Fixed memory (Order-l Markov chains)
• Variable memory models
• More complicated models
– Hidden Markov Models
Independent model
• Model structure
– A parameter for each symbol in S
– Example: Pr(A) = 0.1, Pr(C) = 0.9
• Probability of a sequence s being generated from the model
– Example: Pr(AACA) = Pr(A) Pr(A) Pr(C) Pr(A) = Pr(A)^3 Pr(C) = 0.1^3 × 0.9
• Training the symbol probabilities
– Data T: set of training sequences
– Count(s ∈ T): total number of times substring s appears in the training data T
– Pr(s) = Count(s ∈ T) / length(T)
First Order Markov Chains
• Model structure
– A state for each symbol in S
– Edges between states with transition probabilities
• Probability of a sequence s being generated from the model
– Example: Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 × 0.1 × 0.9 × 0.4
• Training transition probabilities between states
– Pr(s|b) = Count(bs ∈ T) / Count(b ∈ T)
[Figure: two-state chain over {A, C} with start probabilities 0.5/0.5 and transitions Pr(A|A) = 0.1, Pr(C|A) = 0.9, Pr(A|C) = 0.4, Pr(C|C) = 0.6]
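A minimal sketch of training and scoring such a chain (hypothetical helper names, not from the slides; unseen start symbols or transitions would need smoothing before scoring new sequences):

```python
from collections import defaultdict
import math

def train_markov_chain(sequences):
    """Estimate Pr(s|b) = Count(bs)/Count(b) and the start probabilities from data."""
    trans = defaultdict(lambda: defaultdict(int))
    starts = defaultdict(int)
    for seq in sequences:
        starts[seq[0]] += 1
        for b, s in zip(seq, seq[1:]):
            trans[b][s] += 1
    start_p = {a: c / len(sequences) for a, c in starts.items()}
    trans_p = {b: {s: c / sum(d.values()) for s, c in d.items()} for b, d in trans.items()}
    return start_p, trans_p

def log_prob(seq, start_p, trans_p):
    """log Pr(seq) under the first-order chain."""
    lp = math.log(start_p[seq[0]])
    for b, s in zip(seq, seq[1:]):
        lp += math.log(trans_p[b][s])
    return lp

start_p, trans_p = train_markov_chain(["AACA", "ACCA", "CACC"])
print(log_prob("AACA", start_p, trans_p))
```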
Higher order Markov Chains
l = memory of the sequence
• Model
– A state for each possible suffix of length l → |S|^l states
– Edges between states with transition probabilities
• Example (l = 2): Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 × 0.9 × 0.7
• Training the model
– Pr(a|s) = Count(sa ∈ T) / Count(s ∈ T) for each length-l suffix s and next symbol a
[Figure: l = 2 chain over states AA, AC, CA, CC; e.g. Pr(C|AA) = 0.9, Pr(A|AA) = 0.1, Pr(A|AC) = 0.7, Pr(C|AC) = 0.3, with start probability 0.5]
Variable Memory models
• Probabilistic Suffix Automata (PSA)
• Model
– States: substrings of length at most l, where no state string is a suffix of another
• Calculating Pr(AACA):
= Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC)
= 0.5 × 0.3 × 0.7 × 0.1
• Training: not straightforward
– Eased by Prediction Suffix Trees (PSTs)
– PSTs can be converted to PSAs after training
[Figure: PSA with states A, AC, CC; e.g. Pr(A|A) = 0.3, Pr(C|A) = 0.7, Pr(A|AC) = 0.1, Pr(C|AC) = 0.9, with start probabilities 0.5 and 0.2]
Prediction Suffix Trees (PSTs)
• Suffix trees with the emission probabilities of the next observation attached to each tree node
• Linear-time algorithms exist for constructing such PSTs from training data [Apostolico 2000]
[Figure: PST with root ε, children A and C, and children AC and CC under C; the pair (Pr(A), Pr(C)) at each node is ε: (0.28, 0.72), A: (0.3, 0.7), C: (0.25, 0.75), AC: (0.1, 0.9), CC: (0.4, 0.6)]
Pr(AACA) = 0.28 × 0.3 × 0.7 × 0.1
Hidden Markov Models
• Doubly stochastic models
• Efficient dynamic programming
algorithms exist for
– Finding Pr(S)
– The highest probability path P that
maximizes Pr(S|P) (Viterbi)
• Training the model
– (Baum-Welch algorithm)
[Figure: a 4-state HMM (S1 to S4) with transition probabilities such as 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 and per-state emission distributions over {A, C}: (0.6, 0.4), (0.3, 0.7), (0.5, 0.5), (0.9, 0.1)]
Discriminative training of HMMs
• Models trained to maximize likelihood of data
might perform badly when
– Model not representative of data
– Training data insufficient
• Alternatives to Maximum-likelihood/EM
– Objective functions:
• Minimum classification error
• Maximum posterior probability of actual label Pr(c|x)
• Maximum mutual information with class
– Harder to train above functions, number of
alternatives to EM proposed
• Generalized probabilistic descent [Katagiri 98]
• Deterministic annealing [Rao 01]
HMMs for profiling system calls
• Training:
– Initial number of states = 40 (roughly equals number
of distinct system calls)
– Train on normal traces
• Testing:
– Need to handle variable length and online data
– For each call, find the total probability of outputting it given all calls before it
• If this probability is below a threshold, call it abnormal
– A trace is abnormal if the fraction of abnormal calls is high
More realistic experiments
• HMMs
– Take longer time to train
– Less sensitive to thresholds, no window parameter
– Best overall performance
• VMMs and Sparse Markov Transducers have also been shown to perform
significantly better than fixed-window methods [Eskin 01]
             STIDE                    RIPPER                    HMM
             threshold  %false-pos    threshold  %false-pos     threshold  %false-pos
Site-1 lpr   12         0.0           3          0.0016         10^-7      0.0003
Site-2 lpr   12         0.0013        4          0.0265         10^-7      0.0015
named        20         0.0019        10         0.0            10^-7      0.0
xlock        20         0.00008       10         0.0            10^-7      0.0
[from Warrender 99]
ROC curves comparing different
methods
[from Warrender 99]
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Sequence-specific mining operations
– Partial sequence classification (Tagging)
– Segmenting a sequence
– Predicting next symbol of a sequence
• Applications in biological sequence mining
Partial sequence classification (Tagging)
• The tagging problem:
– Given:
• A set of tags L
• Training examples of sequences showing the breakup of
the sequence into the set of tags
– Learn to breakup a sequence into tags
– (classification of parts of sequences)
• Examples:
– Text segmentation
• Break sequence of words forming an address string into
subparts like Road, City name etc
– Continuous speech recognition
• Identify words in continuous speech
Text sequence segmentation
Example: addresses, bib records

Address: [4089 : House number] [Whispering Pines : Building] [Nobel Drive : Road] [San Diego : City] [CA : State] [92122 : Zip]

Bib record: [P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick : Author] [(1993) : Year] [Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media : Title] [J. Amer. Chem. Soc. : Journal] [115 : Volume] [12231-12237 : Page]
Approaches used for tagging
• Learn to identify start and end boundaries of
each label
– Method: for each label, build two classifiers for
accepting its two boundaries.
– Any conventional classifier would do:
• Rule-based most common.
• K-windows approach:
– For each label, train a classifier on k-windows
– During testing
• classify each k-window
• Adapt state-based generative models like HMM
State-based models for sequence
tagging
• Two approaches:
– Separate model per tag with special prefix and suffix states to
capture the start and end of a tag
[Figure: two per-tag HMMs, one for Road name and one for Building name, each with states S1 to S4 plus special Prefix and Suffix states]
Combined model over all tags
• Example: IE
… Mahatma Gandhi Road Near Parkland ...
… [Mahatma Gandhi Road Near : Landmark] Parkland ...
• Naïve model: one state per element
• Nested model: each element is itself another HMM
Other approaches
• Disadvantages of generative models (HMMs)
– Maximizing joint probability of sequence and labels
may not maximize accuracy
– Conditional independence of features a restrictive
assumption
• Alternative: Conditional Random Fields
– Maximize conditional probability of labels given
sequence
– Arbitrary overlapping features allowed
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Sequence-specific mining operations
– Partial sequence classification (Tagging)
– Segmenting a sequence
– Predicting next symbol of a sequence
• Applications in biological sequence mining
Segmenting sequences
Basic premise: Sequence
assumed to be generated
from multiple models
Goal: discover boundaries of
model change
Applications:
Bioinformatics
Better models by breaking
up heterogeneous
sequences
[Figure: player A picks a model, generates symbols from it, and returns it; player B observes the resulting sequence A C T G G T T T A C C C C T G T G and must recover the segment boundaries]
Simpler problem: Segmenting a 0/1
sequence
• Players A and B
• A has a set of coins with
different biases
• A repeatedly
– Picks arbitrary coin
– Tosses it arbitrary number
of times
• B observes H/T
• Guesses transition
points and biases
[Figure: A picks a coin, tosses it, and returns it; B observes the sequence 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1 and guesses the transition points and biases]
A MDL-based approach
• Given n head/tail observations
– Can assume n different coins with bias 0 or 1
• Data fits perfectly (with probability one)
• Many coins needed
– Or assume one coin
• May fit data poorly
• “Best segmentation” is a compromise between
model length and data fit
Example: 0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1, three segments with estimated biases 1/4, 5/7, 1/3
MDL
For each segment i:
L(Mi): cost of the model parameters (encoding Pr(head)) + segment boundary: log(sequence length)
L(Di|Mi): cost of describing the data Di in segment Si given model Mi: −log( h^H (1−h)^T ), where h = Pr(head), H = #heads, T = #tails
Goal: find the segmentation that leads to the smallest total cost
Σ over segments i of [ L(Mi) + L(Di|Mi) ]
How to find optimal segments
Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
Derived graph with 18 nodes (one per boundary position) and all possible edges
Edge cost = model cost + data cost of the segment the edge spans
Optimal segmentation = shortest path through the graph
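A sketch of this shortest-path / dynamic-programming search (the exact L(Mi) term is simplified here to log2(n) bits for the bias plus log2(n) for the boundary; function names are illustrative, not from the slides):

```python
import math

def seg_cost(seq, i, j, n):
    """MDL cost of one segment seq[i:j]: simplified model bits plus data bits."""
    H = sum(seq[i:j])
    T = (j - i) - H
    h = H / (j - i)
    data = 0.0
    if H:
        data -= H * math.log2(h)
    if T:
        data -= T * math.log2(1 - h)
    model = math.log2(n) + math.log2(n)   # crude: encode the bias and the boundary with log2(n) bits each
    return model + data

def best_segmentation(seq):
    """Shortest path over the n+1 boundary nodes (dynamic programming on a DAG)."""
    n = len(seq)
    best = [math.inf] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + seg_cost(seq, i, j, n)
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts), best[n]

seq = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
print(best_segmentation(seq))   # segment end positions and total description length
```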
Non-independent models
• In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form:
0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0   (the same coin generating several non-adjacent segments)
• The (k,h) segmentation problem:
– Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
– (k,k) segmentation is solvable by dynamic programming
– The general (k,h) problem is NP-hard
Approximations for (k,h) segmentation
1. Solve (k,k) to get k segments
2. Solve (n,h) to get h models
– Example: 2-medians
3. Assign each segment to the best of the h models
A second variant (replace step 2 above with):
– Find summary statistics for each of the k segments and cluster them into h clusters
[Figure: the sequence A C T G G T T T A C C C C T G T G split into segments S1 to S4, which are then assigned to the two models M1 and M2]
Results of segmenting genetic sequences
[Figure from Gionis & Mannila 2003]
Outline
• Traditional mining operations on sequences
• Primitives for handling sequence data
• Sequence-specific mining operations
• Applications in biological sequence mining
Two sequence mining applications
1. Classifying proteins according to family
2. Finding genes in DNA sequences
The protein classification task
Given: amino-acid sequence of a protein
Do: predict the family to which it belongs
GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The roles of proteins
• A protein family is…
Figure from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The roles of proteins
• The amino-acid sequence of a protein determines its
structure
• The structure of a protein determines its function
• Proteins play many key roles in cells
– structural support
– storage of amino acids
– transport of other substances
– coordination of an organism’s activities
– response of cell to chemical stimuli
– movement
– protection against disease
– selective acceleration of chemical reactions
Protein taxonomies
• The SCOP and CATH databases
provide hierarchical taxonomies
of protein families
An alignment of globin family proteins
Figure from www-cryst.bioc.cam.ac.uk/~max/res_globin.html
• The sequences in a family may vary in length
• Some positions are more conserved than others
Profile HMMs
• Profile HMMs are commonly used to model families of sequences
[Figure: profile HMM with match states m1–m3, insert states i0–i3, and delete states d1–d3 between start and end]
• Match states represent key conserved positions
• Insert states account for extra characters in some sequences
• Delete states are silent; they account for characters missing in some sequences
• Insert and match states have emission distributions over sequence characters
(e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, …)
Profile HMMs
• To classify sequences according to family, we can
train a profile HMM to model the proteins of each
family of interest
• Given a sequence x, use Bayes' rule to make the classification:
Pr(ci | x) = Pr(x | ci) Pr(ci) / Σj Pr(x | cj) Pr(cj)
How Likely is a Given Sequence?
• Consider computing Pr(x | ci) in a profile HMM
• The probability that the path π0 … πN is taken and the sequence x1 … xL is generated:
Pr(x1 … xL, π0 … πN) = a_{π0 π1} ∏ i=1..L [ e_{πi}(xi) a_{πi πi+1} ]
where the a terms are transition probabilities and the e terms are emission probabilities
How Likely Is A Given Sequence?
[Figure: a small example HMM with begin state 0, end state 5, four emitting states with emission distributions over {A, C, G, T} (e.g. A 0.4, C 0.1, G 0.1, T 0.4), and transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8]
Example, for one path through this model:
Pr(AAC, π) = a_01 e_1(A) a_11 e_1(A) a_13 e_3(C) a_35
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
How Likely is a Given Sequence?
• the probability over all paths is:
Pr(x1 … xL) = Σ over all paths π of Pr(x1 … xL, π0 … πN)
• but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently using dynamic programming
How Likely is a Given Sequence: The Forward Algorithm
• define f_k(i) to be the probability of being in state k having observed the first i characters of x
• we want to compute f_N(L), the probability of being in the end state having observed all of x
• can define this recursively

The Forward Algorithm
• initialization:
f_0(0) = 1   (probability that we're in the start state and have observed 0 characters from the sequence)
f_k(0) = 0, for states k that are not silent
• recursion for emitting states (i = 1…L):
f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_kl
• recursion for silent states:
f_l(i) = Σ_k f_k(i) a_kl
• termination:
Pr(x) = Pr(x1 … xL) = f_N(L) = Σ_k f_k(L) a_kN
(probability that we're in the end state and have observed the entire sequence)
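A sketch of the forward recursion for a plain HMM without silent states (toy numbers, not the slides' example; an explicit end state would add the final a_kN factors):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm (no silent states): sum over paths to get Pr(obs)."""
    # f_k(1) = start_p[k] * e_k(x_1)
    f = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        # f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
        f.append({l: emit_p[l][x] * sum(prev[k] * trans_p[k][l] for k in states)
                  for l in states})
    return sum(f[-1].values())   # Pr(x1..xL); multiply by a_kN terms if an explicit end state is used

# toy 2-state HMM over {A, C}; numbers are illustrative only
states = ["s1", "s2"]
start_p = {"s1": 0.5, "s2": 0.5}
trans_p = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}}
emit_p  = {"s1": {"A": 0.6, "C": 0.4}, "s2": {"A": 0.1, "C": 0.9}}
print(forward("AAC", states, start_p, trans_p, emit_p))
```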
Training a profile HMM
• The parameters in profile HMMs are typically trained
using the Baum-Welch method (an EM algorithm)
• Heuristic methods are used to determine the length of
the model
– Initialize using an alignment of the training
sequences (as in globin example)
– Iteratively make local adjustments to length if delete
or insert states are used “too much” for training
sequences
The Fisher kernel method for protein classification
• Standard HMMs are generative models
– Training involves estimating Pr(x | ci)
– Predictions based on Pr(x | ci) are made using Bayes' rule
• Sometimes we can get more accurate predictions using discriminative methods which try to optimize Pr(ci | x) directly
– One example: the Fisher kernel method [Jaakkola et al. '00]
The Fisher kernel method for protein classification
• Consider learning to discriminate proteins in class c1 from proteins in other families
1. Train an HMM for c1
2. Use the HMM to map each protein sequence x into a fixed-length vector U_x
3. Train a support vector machine (SVM) whose kernel function is based on the Euclidean distance between these vectors
• The resulting discriminative model is given by
D(x) = Σ_{i: xi ∈ c1} λi K(x, xi) − Σ_{i: xi ∉ c1} λi K(x, xi)
The Fisher kernel method for protein classification
• How can we get an informative fixed-length vector? Compute the Fisher score
U_x = ∇_θ log Pr(x | θ, c1)
i.e. each component of U_x is a derivative of the log-likelihood of x with respect to a particular parameter in the HMM:
U_x[i] = ∂ log Pr(x | θ, c1) / ∂θ_i
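A sketch of the idea only: the gradient is approximated numerically below, whereas for an HMM it would normally be computed analytically from forward-backward expected counts; `log_likelihood` is an assumed callable for a fixed sequence x:

```python
import numpy as np

def fisher_score(log_likelihood, theta, eps=1e-6):
    """Approximate U_x = grad_theta log Pr(x | theta) by central finite differences.

    log_likelihood: function theta -> log Pr(x | theta) for a fixed sequence x (assumed interface)
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (log_likelihood(up) - log_likelihood(down)) / (2 * eps)
    return grad

# each sequence x then maps to the fixed-length vector fisher_score(ll_x, theta_hat),
# and these vectors are fed to an SVM as described on the previous slide
```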
Profile HMM accuracy
[Figure from Jaakkola et al., ISMB 1999, comparing BLAST-based methods, profile HMMs, and profile HMMs with the Fisher kernel SVM]
• classifying 2447 proteins into 33 families
• x-axis represents the median fraction of negative sequences that score as high as a positive sequence for a given family's model
The gene finding task
Given: an uncharacterized DNA sequence
Do: locate the genes in the sequence, including the
coordinates of individual exons and introns
image from the UCSC Genome Browser
http://genome.ucsc.edu/
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The structure of genes
• Genes consist of alternating sequences of exons and
introns
• Introns are spliced out before the gene is translated into
protein
[Figure: a gene as alternating exons and introns between intergenic regions; example exon sequence ATG GAA ACC CGA TCG GGC … AC, followed by an intron and then G TAA AGT CTA …]
• Exons consist of three-letter words, called codons
• Each codon encodes a single amino acid (character in
a protein sequence)
The GENSCAN HMM for gene finding [Burge & Karlin '97]
[Figure: the GENSCAN state diagram]
• Each shape represents a functional unit of a gene or genomic region
• Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after the 1st base in a codon, after the 2nd base, or after the 3rd base)
• A complementary submodel (not shown) detects genes on the opposite DNA strand
The GENSCAN HMM
• For each sequence type, GENSCAN models
– the length distribution
– the sequence composition
• Length distribution models vary depending on sequence type
– Nonparametric (using histograms)
– Parametric (using geometric distributions)
– Fixed-length
• Sequence composition models vary depending on type
– 5th-order, inhomogeneous
– 5th-order, homogeneous
– Independent and 1st-order inhomogeneous
– Tree-structured variable memory
Representing exons in GENSCAN
• For exons, GENSCAN uses
– Histograms to represent exon lengths
– 5th-order, inhomogeneous Markov models to
represent exon sequences
• 5th-order, inhomogeneous models can represent
statistics about pairs of neighboring codons
A 5th-order Markov model for DNA
[Figure: states are 5-mers (AAAAA, …, GCTAC, …, TTTTT); the start state assigns Pr(GCTAC), and the edge from GCTAC to CTACA carries Pr(A | GCTAC), similarly for CTACG, CTACC, CTACT]
Pr(GCTACA) = Pr(GCTAC) Pr(A | GCTAC)
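A sketch of estimating and applying such a k-th order model (a homogeneous version with add-alpha smoothing and made-up helper names; GENSCAN's exon model would keep three such tables, one per codon position):

```python
from collections import defaultdict
import math

def train_kth_order(seqs, k=5, alpha=1.0, alphabet="ACGT"):
    """Estimate Pr(next base | previous k bases) with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + alpha * len(alphabet)
        model[ctx] = {a: (nxt.get(a, 0.0) + alpha) / total for a in alphabet}
    return model

def log_prob_continuation(s, model, k=5, alphabet="ACGT"):
    """log Pr(x_{k+1} .. x_n | first k bases) under the k-th order model."""
    lp = 0.0
    uniform = 1.0 / len(alphabet)   # fallback for contexts never seen in training
    for i in range(k, len(s)):
        lp += math.log(model.get(s[i - k:i], {}).get(s[i], uniform))
    return lp
```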
Markov models for exons
• for each “word” we evaluate, we’ll want to consider its
position with respect to the assumed codon framing
• thus we'll want to use an inhomogeneous model to represent genes
G C T A C G G A G C T T C G G A G C
G C T A C G G   (this G is in the 3rd codon position)
C T A C G G G   (this G is in the 1st position)
T A C G G A A   (this A is in the 2nd position)
A 5th-order inhomogeneous model
[Figure: three copies of the 5-mer state space (AAAAA, …, GCTAC, CTACG, CTACA, CTACC, CTACT, TACAG, TACAA, TACAC, TACAT, …, TTTTT), one per codon position 1, 2, 3; transitions from position 3 go back to states in position 1]
Inference with the gene-finding HMM
given: an uncharacterized DNA sequence
do: find the most probable path through the model for the
sequence
• This path will specify the coordinates of the predicted
genes (including intron and exon boundaries)
• The Viterbi algorithm is used to compute this path
Finding the most probable path: the Viterbi algorithm
• define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
• we want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
• can define this recursively
• can use dynamic programming to find v_N(L) efficiently
Finding the most probable path: the Viterbi algorithm
• initialization:
v_0(0) = 1
v_k(0) = 0, for states k that are not silent
The Viterbi algorithm
• recursion for emitting states (i = 1…L):
v_l(i) = e_l(x_i) max_k [ v_k(i−1) a_kl ]
ptr_l(i) = argmax_k [ v_k(i−1) a_kl ]   (keep track of the most probable path)
• recursion for silent states:
v_l(i) = max_k [ v_k(i) a_kl ]
ptr_l(i) = argmax_k [ v_k(i) a_kl ]
The Viterbi algorithm
• termination:
Pr(x, π) = max_k [ v_k(L) a_kN ]
π_L = argmax_k [ v_k(L) a_kN ]
• to recover the most probable path, follow the pointers back starting at π_L
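A sketch of generic Viterbi decoding in log space for an HMM without silent states (a toy interface, not GENSCAN itself; in a gene finder the decoded state path is read off as the exon/intron parse):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs and its log probability."""
    v = [{k: math.log(start_p[k]) + math.log(emit_p[k][obs[0]]) for k in states}]
    ptr = [{}]
    for x in obs[1:]:
        prev = v[-1]
        col, back = {}, {}
        for l in states:
            # best predecessor k for state l at this position
            best_k = max(states, key=lambda k: prev[k] + math.log(trans_p[k][l]))
            col[l] = prev[best_k] + math.log(trans_p[best_k][l]) + math.log(emit_p[l][x])
            back[l] = best_k
        v.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):   # follow back-pointers to recover the path
        path.append(ptr[t][path[-1]])
    return list(reversed(path)), v[-1][last]
```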
Parsing a DNA Sequence
ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA
The Viterbi path represents
a parse of a given sequence,
predicting exons, introns, etc
Some lessons from these
biological applications
• HMMs provide state-of-the-art performance in protein
classification and gene finding
• HMMs can naturally classify and parse sequences of
variable length
• Much domain knowledge can be incorporated into the
structure of the models
– Types, lengths and ordering of sequence features
– Appropriate amount of memory to represent various
sequence features
• Models can vary representation across sequence
features
• Discriminative methods often provide superior
predictive accuracy to generative methods
References
• S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman.
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Research, 25:3389–3402, 1997.
• Apostolico, A., and Bejerano, G. 2000. Optimal amnesic probabilistic automata or how to learn
and classify proteins in linear time and space. In Proceedings of RECOMB2000.
http://citeseer.nj.nec.com/apostolico00optimal.html
• Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation
for extracting structured records. SIGMOD 2001.
• C. Burge and S. Karlin, Prediction of Complete Gene Structures in Human Genomic DNA.
Journal of Molecular Biology, 268:78-94, 1997.
• Mary Elaine Califf and R. J. Mooney. Relational learning of pattern-match rules for information
extraction. AAAI 1999.
• S. Chakrabarti, S. Sarawagi and B.Dom, Mining surprising patterns using temporal description
length,VLDB, 1998
• M. Collins. Discriminative training methods for hidden Markov models: Theory and
experiments with perceptron algorithms. EMNLP 2002.
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic
models of proteins and nucleic acids, Cambridge University Press, 1998.
• Eleazar Eskin, Wenke Lee and Salvatore J. Stolfo. ``Modeling System Calls for Intrusion
Detection with Dynamic Window Sizes.'' Proceedings of DISCEX II. June 2001.
• IDS http://www.cs.columbia.edu/ids/publications/
• D Freitag and A McCallum, Information Extraction with HMM Structures Learned by
Stochastic Optimization, AAAI 2000
References
• A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003.
• Michael T. Goodrich, Efficient Piecewise-Approximation Using the Uniform Metric
Symposium on Computational Geometry , (1994)
• D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz,
1999.
• K. Karplus, C. Barrett, and R. Hughey, Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14(10): 846-856, 1998.
• L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in
2002: refinements accommodate structural genomics. Nucleic Acids Research, 30:264-
267, 2002
• Wenke Lee and Sal Stolfo. ``Data Mining Approaches for Intrusion Detection'' In
Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio,
TX, January 1998
• A. McCallum and D. Freitag and F. Pereira, Maximum entropy Markov models for
information extraction and segmentation, ICML-2000
• Rabiner, Lawrence R., 1990. A tutorial on hidden Markov models and selected
applications in speech recognition. In Alex Waibel and Kai-Fu Lee (eds.), Readings in
Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296.
• D. Ron, Y. Singer and N. Tishby. The power of amnesia: learning probabilistic automata
with variable memory length. Machine Learning, 25:117-- 149, 1996
• Warrender, Christina, Stephanie Forrest, and Barak Pearlmutter. Detecting Intrusions
Using System Calls: Alternative Data Models. To appear, 1999 IEEE Symposium on
Security and Privacy. 1999
More Related Content

Similar to sequencea.ppt

Machine learning with R
Machine learning with RMachine learning with R
Machine learning with RMaarten Smeets
 
Scan Based Side Channel Attack on Data Encryption Standard
Scan Based Side Channel Attack on Data Encryption StandardScan Based Side Channel Attack on Data Encryption Standard
Scan Based Side Channel Attack on Data Encryption StandardLei Hsiung
 
Toast 2015 qiime_talk
Toast 2015 qiime_talkToast 2015 qiime_talk
Toast 2015 qiime_talkTOASTworkshop
 
Basic use of xcms
Basic use of xcmsBasic use of xcms
Basic use of xcmsXiuxia Du
 
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...Junho Suh
 
Graph mining seminar_2009
Graph mining seminar_2009Graph mining seminar_2009
Graph mining seminar_2009Houw Liong The
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
Una introducción a la minería de series temporales
Una introducción a la minería de series temporalesUna introducción a la minería de series temporales
Una introducción a la minería de series temporalesFacultad de Informática UCM
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Rakibul Hasan Pranto
 
Real-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTMReal-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTMNumenta
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspectiveAnirban Santara
 
General pipeline of transcriptomics analysis
General pipeline of transcriptomics analysisGeneral pipeline of transcriptomics analysis
General pipeline of transcriptomics analysisSanty Marques-Ladeira
 

Similar to sequencea.ppt (20)

Machine learning with R
Machine learning with RMachine learning with R
Machine learning with R
 
BIRTE-13-Kawashima
BIRTE-13-KawashimaBIRTE-13-Kawashima
BIRTE-13-Kawashima
 
Scan Based Side Channel Attack on Data Encryption Standard
Scan Based Side Channel Attack on Data Encryption StandardScan Based Side Channel Attack on Data Encryption Standard
Scan Based Side Channel Attack on Data Encryption Standard
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Toast 2015 qiime_talk
Toast 2015 qiime_talkToast 2015 qiime_talk
Toast 2015 qiime_talk
 
Basic use of xcms
Basic use of xcmsBasic use of xcms
Basic use of xcms
 
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...
 
Graph mining seminar_2009
Graph mining seminar_2009Graph mining seminar_2009
Graph mining seminar_2009
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Una introducción a la minería de series temporales
Una introducción a la minería de series temporalesUna introducción a la minería de series temporales
Una introducción a la minería de series temporales
 
Unit i
Unit iUnit i
Unit i
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019
 
Real-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTMReal-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTM
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
General pipeline of transcriptomics analysis
General pipeline of transcriptomics analysisGeneral pipeline of transcriptomics analysis
General pipeline of transcriptomics analysis
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 

Recently uploaded

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 

Recently uploaded (20)

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 

sequencea.ppt

  • 1. Sequence Data Mining: Techniques and Applications Sunita Sarawagi IIT Bombay http://www.it.iitb.ac.in/~sunita Mark Craven University of Wisconsin http://www.biostat.wisc.edu/~craven
  • 2. What is a sequence? • Ordered set of elements: s = a1,a2,..an • Each element ai could be – Numerical – Categorical: domain a finite set of symbols S, |S|=m – Multiple attributes • The length n of a sequence is not fixed • Order determined by time or position and could be regular or irregular
  • 3. Real-life sequences • Classical applications – Speech: sequence of phonemes – Language: sequence of words and delimiters – Handwriting: sequence of strokes • Newer applications – Bioinformatics: • Genes: Sequence of 4 possible nucleotides, |S|=4 – Example: AACTGACCTGGGCCCAATCC • Proteins: Sequence of 20 possible amino-acids, |S|=20 – Example: – Telecommunications: alarms, data packets – Retail data mining: Customer behavior, sessions in a e-store (Example, Amazon) – Intrusion detection
  • 4. Intrusion detection • Intrusions could be detected at – Network-level (denial-of-service attacks, port-scans, etc) [KDD Cup 99] • Sequence of TCP-dumps – Host-level (attacks on privileged programs like lpr, sendmail) • Sequence of system calls • |S| = set of all possible system calls ~100 open lseek lstat mmap execve ioctl ioctl close execve close unlink
  • 5. Outline • Traditional mining operations on sequences – Classification – Clustering – Finding repeated patterns • Primitives for handling sequence data • Sequence-specific mining operations – Partial sequence classification (Tagging) – Segmenting a sequence – Predicting next symbol of a sequence • Applications in biological sequence mining
  • 6. Classification of whole sequences Given: – a set of classes C and – a number of example sequences in each class, train a model so that for an unseen sequence we can say to which class it belongs Example: – Given a set of protein families, find family of a new protein – Given a sequence of packets, label session as intrusion or normal – Given several utterances of a set of words, classify a new utterance to the right word
  • 7. Conventional classification methods • Conventional classification methods assume – record data: fixed number of attributes – single record per instance to be classified (no order) • Sequences: – variable length, order important. • Adapting conventional classifiers to sequences – Generative classifiers – Boundary-based classifiers – Distance based classifiers – Kernel-based classifiers
  • 8. Generative methods • For each class i, train a generative model Mi to maximize likelihood over all training sequences in the class i • Estimate Pr(ci) as fraction of training instances in class i • For a new sequence x, – find Pr(x|ci)*Pr(ci) for each i using Mi – choose i with the largest value of Pr(x|ci)*P(ci) x Pr(x|c1)*Pr(c1) Pr(x|c2)*Pr(c2) Pr(x|c3)*Pr(c3) Need a generative model for sequence data
  • 9. Boundary-based methods • Data: points in a fixed multi-dimensional space • Output of training: Boundaries that define regions within which same class predicted • Application: Tests on boundaries to find region Need to embed sequence data in a fixed coordinate space Decision trees Neural networks Linear discriminants
  • 10. Kernel-based classifiers • Define function K(xi, x) that intuitively defines similarity between two sequences and satisfies two properties – K is symmetric i.e., K(xi, x) = K(x, xi) – K is positive definite • Training: learn for each class c, – wic for each train sequence xi – bc • Application: predict class of x – For each class c, find f(x,c) = S wicK(xi, x)+bc – Predicted class is c with highest value f(x,c) • Well-known kernel classifiers – Nearest neighbor classifier – Support Vector Machines – Radial Basis functions Need kernel/similarity function
  • 11. Sequence clustering • Given a set of sequences, create groups such that similar sequences in the same group • Three kinds of clustering algorithms – Distance-based: • K-means • Various hierarchical algorithms – Model-based algorithms • Expectation Maximization algorithm – Density-based algorithms • Clique Need similarity function Need generative models Need dimensional embedding
  • 12. Outline • Traditional mining on sequences • Primitives for handling sequence data – Embed sequence in a fixed dimensional space • All conventional record mining techniques will apply – Generative models for sequence • Sequence classification: generative methods • Clustering sequences: model-based approach – Distance between two sequences • Sequence classification: SVM and NN • Clustering sequences: distance-based approach • Sequence-specific mining operations • Applications
  • 13. Embedding sequences in fixed dimensional space • Ignore order, each symbol maps to a dimension – extensively used in text classification and clustering • Extract aggregate features – Real-valued elements: Fourier coefficients, Wavelet coefficients, Auto-regressive coefficients – Categorical data: number of symbol changes • Sliding window techniques (k: window size) – Define a coordinate for each possible k-gram a (mk coordinates)  a-th coordinate is number of times a in sequence • (k,b) mismatch score: a-th coordinate is number of k-grams in sequence with b mismatches with a – Define a coordinate for each of the k-positions • Multiple rows per sequence
  • 14. Sliding window examples o c l i e m 1 2 1 1 3 2 1 2 .. .. .. .. .. .. 3 .. .. .. .. .. .. open lseek ioctl mmap execve ioctl ioctl open execve close mmap One symbol per column Sliding window: window-size 3 ioe cli oli lie lim ... 1 1 0 1 0 1 2 .. .. .. .. .. .. 3 .. .. .. .. .. .. One row per trace sid A1 A2 A3 1 o l i 1 l i m 1 i m e 1 .. .. .. 1 e c m Multiple rows per trace ioe cli oli lie lim ... 1 2 1 1 0 1 2 .. .. .. .. .. .. 3 .. .. .. .. .. .. mis-match scores: b=1
  • 15. Detecting attacks on privileged programs • Short sequences of system calls made during normal execution of system calls are very consistent, yet different from the sequences of its abnormal executions • Two approaches – STIDE (Warrender 1999) • Create dictionary of unique k-windows in normal traces, count what fraction occur in new traces and threshold. – RIPPER –based (Lee 1998) • next...
  • 16. Classification models on k-grams trace data • When both normal and abnormal data available – class label = normal/abnormal: • When only normal trace, – class-label=k-th system call 7-grams class vtimes open seek read read read seek normal lstat lstat lstat bind open close vtimes abnormal … … Learn rules to predict class-label [RIPPER] 6-attributes class vtimes open seek read read read seek lstat lstat lstat bind open close vtimes …
  • 17. Examples of output RIPPER rules • Both-traces: – if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is “normal” – if the 6th system call is lseek and the 7th is sigvec, then the sequence is “normal” – … – if none of the above, then the sequence is “abnormal” • Only-normal: – if the 3rd system call is lstat and the 4th is write, then the 7th is stat – if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt – … – if none of the above, then the 7th is open
  • 18. Experimental results on sendmail traces Only-normal BOTH sscp-1 13.5 32.2 sscp-2 13.6 30.4 sscp-3 13.6 30.4 syslog-remote-1 11.5 21.2 syslog-remote-2 8.4 15.6 syslog-local-1 6.1 11.1 syslog-local-2 8.0 15.9 decode-1 3.9 2.1 decode-2 4.2 2.0 sm565a 8.1 8.0 sm5x 8.2 6.5 sendmail 0.6 0.1 • The output rule sets contain ~250 rules, each with 2 or 3 attribute tests • Score each trace by counting fraction of mismatches and thresholding Summary: Only normal traces sufficient to detect intrusions Percent of mismatching traces
  • 19. More realistic experiments • Different programs need different thresholds • Simple methods (e.g. STIDE) work as well • Results sensitive to window size • Is it possible to do better with sequence specific methods? STIDE RIPPER threshold %false-pos threshold %false-pos Site-1 lpr 12 0.0 3 0.0016 Site-2 lpr 12 0.0013 4 0.0265 named 20 0.0019 10 0.0 xlock 20 0.00008 10 0.0 [from Warrender 99]
  • 20. Outline • Traditional mining on sequences • Primitives for handling sequence data – Embed sequence in a fixed dimensional space • All conventional record mining techniques will apply – Distance between two sequences • Sequence classification: SVM and NN • Clustering sequences: distance-based approach – Generative models for sequences • Sequence classification: whole and partial • Clustering sequences: model-based approach • Sequence-specific mining operations • Applications in biological sequence mining
  • 21. Probabilistic models for sequences • Independent model • One-level dependence (Markov chains) • Fixed memory (Order-l Markov chains) • Variable memory models • More complicated models – Hidden Markov Models
  • 22. • Model structure – A parameter for each symbol in S • Probability of a sequence s being generated from the model – example: Pr(AACA) = Pr(A) Pr(A) Pr(C) Pr(A) = Pr(A)3 Pr(C) = 0.13  0.9 • Training transition probabilities – Data T : set of sequences – Count(s ε T): total number of times substring s appears in training data T Pr(s) = Count(s ε T) / length(T) Independent model Pr(A) = 0.1 Pr(C) = 0.9
  • 23. • Model structure – A state for each symbol in S – Edges between states with probabilities • Probability of a sequence s being generated from the model – Example: Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5  0.1  0.9  0.4 • Training transition probability between states Pr(s|b) = Count(bs ε T) / Count(b ε T) First Order Markov Chains C A 0.9 0.4 0.1 0.6 start 0.5 0.5
  • 24. l = memory of sequence • Model – A state for each possible suffix of length l  |S|l states – Edges between states with probabilities • Pr(AACA) = Pr(AA)Pr(C|AA) Pr(A|AC) = 0.5  0.9  0.7 • Training model Pr(s|s) = count(ss ε T) / count(s ε T) Higher order Markov Chains AC AA C 0.3 C 0.9 A 0.1 l = 2 CC CA 0.8 A 0.7 C 0.2 A 0.4 C 0.6 start 0.5
  • 25. Variable Memory models • Probabilistic Suffix Automata (PSA) • Model – States: substrings of size no greater than l where no string is suffix of another • Calculating Pr(AACA): = Pr(A)Pr(A|A)Pr(C|A)Pr(A|AC) = 0.5  0.3  0.7  0.1 • Training: not straight-forward – Eased by Prediction Suffix Trees – PSTs can be converted to PSA after training CC AC C 0.7 C 0.9 A 0.1 A C 0.6 A 0.3 A 0.6 start 0.2 0.5
  • 26. Prediction Suffix Trees (PSTs) • Suffix trees with emission probabilities of observation attached with each tree node • Linear time algorithms exist for constructing such PSTs from training data [Apostolico 2000] CC AC C 0.7 C 0.9 A 0.1 A A 0.3 C 0.6 e A C AC CC 0.3, 0.7 0.28, 0.72 0.25, 0.75 0.1, 0.9 0.4, 0.6 Pr(AACA)=0.28  0.3  0.7  0.1
  • 27. Hidden Markov Models • Doubly stochastic models • Efficient dynamic programming algorithms exist for – Finding Pr(S) – The highest probability path P that maximizes Pr(S|P) (Viterbi) • Training the model – (Baum-Welch algorithm) S2 S4 S1 0.9 0.5 0.5 0.8 0.2 0.1 S3 A C 0.6 0.4 A C 0.3 0.7 A C 0.5 0.5 A C 0.9 0.1
  • 28. Discriminative training of HMMs • Models trained to maximize likelihood of data might perform badly when – Model not representative of data – Training data insufficient • Alternatives to Maximum-likelihood/EM – Objective functions: • Minimum classification error • Maximum posterior probability of actual label Pr(c|x) • Maximum mutual information with class – Harder to train above functions, number of alternatives to EM proposed • Generalized probabilistic descent [Katagiri 98] • Deterministic annealing [Rao 01]
  • 29. HMMs for profiling system calls • Training: – Initial number of states = 40 (roughly equals number of distinct system calls) – Train on normal traces • Testing: – Need to handle variable length and online data – For each call, find the total probability of outputting given all calls before it. • If probability below a threshold call it abnormal. – Trace is abnormal if fraction of abnormal calls are high
  • 30. More realistic experiments • HMMs – Take longer time to train – Less sensitive to thresholds, no window parameter – Best overall performance • VMM and Sparse Markov Transducers also shown to perform significantly better than fixed window methods [Eskin 01] STIDE RIPPER HMM thresh old %false- pos threshold %false- pos threshold %false- pos Site-1 lpr 12 0.0 3 0.0016 10-7 0.0003 Site-2 lpr 12 0.0013 4 0.0265 10-7 0.0015 named 20 0.0019 10 0.0 10-7 0.0 xlock 20 0.00008 10 0.0 10-7 0.0 [from Warrender 99]
  • 31. ROC curves comparing different methods [from Warrender 99]
  • 32. Outline • Traditional mining on sequences • Primitives for handling sequence data • Sequence-specific mining operations – Partial sequence classification (Tagging) – Segmenting a sequence – Predicting next symbol of a sequence • Applications in biological sequence mining
  • 33. Partial sequence classification (Tagging) • The tagging problem: – Given: • A set of tags L • Training examples of sequences showing the breakup of the sequence into the set of tags – Learn to breakup a sequence into tags – (classification of parts of sequences) • Examples: – Text segmentation • Break sequence of words forming an address string into subparts like Road, City name etc – Continuous speech recognition • Identify words in continuous speech
• 34. Text sequence segmentation Example: Addresses, bib records House number Building Road City Zip 4089 Whispering Pines Nobel Drive San Diego CA 92122 P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media J.Amer. Chem. Soc. 115, 12231-12237. Author Year Title Journal Volume Page State
  • 35. Approaches used for tagging • Learn to identify start and end boundaries of each label – Method: for each label, build two classifiers for accepting its two boundaries. – Any conventional classifier would do: • Rule-based most common. • K-windows approach: – For each label, train a classifier on k-windows – During testing • classify each k-window • Adapt state-based generative models like HMM
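A small sketch (invented names, not from the tutorial) of the k-windows idea above: each position becomes a window of the k surrounding tokens, labeled by whether a given tag starts there, and any conventional classifier can then be trained on these examples. The address example reuses the one from slide 34.

def k_window_examples(tokens, tags, target_tag, k=3):
    # Build (window, starts_here) training pairs for the start boundary of `target_tag`
    examples = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - k): i + k + 1]          # k tokens of context on each side
        starts_here = tags[i] == target_tag and (i == 0 or tags[i - 1] != target_tag)
        examples.append((window, starts_here))
    return examples

tokens = ["4089", "Whispering", "Pines", "Nobel", "Drive", "San", "Diego", "CA", "92122"]
tags   = ["House#", "Building", "Building", "Road", "Road", "City", "City", "State", "Zip"]
print(k_window_examples(tokens, tags, "Road", k=2))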
• 36. State-based models for sequence tagging
• Two approaches:
– A separate model per tag, with special prefix and suffix states to capture the start and end of the tag
– A combined model over all tags (next slide)
[diagram: per-tag HMMs for Road name and Building name, each with prefix and suffix states around states S1–S4]
  • 37. Combined model over all tags … Mahatma Gandhi Road Near Parkland ... … [Mahatma Gandhi Road Near : Landmark] Parkland ...  Example: IE  Naïve Model: One state per element Nested model Each element another HMM
  • 38. Other approaches • Disadvantages of generative models (HMMs) – Maximizing joint probability of sequence and labels may not maximize accuracy – Conditional independence of features a restrictive assumption • Alternative: Conditional Random Fields – Maximize conditional probability of labels given sequence – Arbitrary overlapping features allowed
  • 39. Outline • Traditional mining on sequences • Primitives for handling sequence data • Sequence-specific mining operations – Partial sequence classification (Tagging) – Segmenting a sequence – Predicting next symbol of a sequence • Applications in biological sequence mining
• 40. Segmenting sequences
• Basic premise: the sequence is assumed to be generated from multiple models
• Goal: discover the boundaries where the generating model changes
• Applications:
– Bioinformatics
– Better models obtained by breaking up heterogeneous sequences
[diagram: the sequence ACTGGTTTACCCCTGTG split into segments, each attributed to a different model]
• 41. Simpler problem: Segmenting a 0/1 sequence
• Players A and B
• A has a set of coins with different biases
• A repeatedly
– Picks an arbitrary coin
– Tosses it an arbitrary number of times
• B observes only the sequence of heads/tails and guesses the transition points and biases
Example observation sequence: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
• 42. A MDL-based approach
• Given n head/tail observations
– Can assume n different coins, each with bias 0 or 1
• Data fits perfectly (with probability one)
• But many coins are needed
– Or assume one coin
• May fit the data poorly
• “Best segmentation” is a compromise between model length and data fit
Example: 0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1, with per-segment head fractions 1/4, 5/7, 1/3
• 43. MDL
• For each segment i:
– L(Mi): cost of the model parameters (encoding Pr(head)) plus the segment boundary: log(sequence length)
– L(Di|Mi): cost of describing the data in segment Si given model Mi: −log(h^H (1−h)^T), where h is the head probability of the segment, H = #heads, T = #tails
• Goal: find the segmentation with the smallest total cost Σi [L(Mi) + L(Di|Mi)]
• 44. How to find optimal segments
• Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
• Construct a derived graph with 18 nodes (one per boundary position) and all possible forward edges
• Edge cost = model cost + data cost of the segment the edge spans
• The optimal segmentation corresponds to the shortest path through this graph (a sketch follows)
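A minimal sketch of this shortest-path formulation under the MDL cost from the previous slide, written as the equivalent dynamic program over boundary positions. Assumptions of mine: natural-log costs, and only the boundary term of the model cost is included; the function names are invented.

import math

def segment_cost(ones, zeros, n_total):
    # MDL cost of one segment: boundary cost + data cost −log(h^H (1−h)^T)
    boundary_cost = math.log(n_total)                 # encode where the segment ends
    if ones == 0 or zeros == 0:
        data_cost = 0.0                               # a bias of 0 or 1 fits its segment perfectly
    else:
        h = ones / (ones + zeros)
        data_cost = -(ones * math.log(h) + zeros * math.log(1 - h))
    return boundary_cost + data_cost

def best_segmentation(bits):
    # Shortest path over boundary nodes 0..n; edge (i, j) spans segment bits[i:j]
    n = len(bits)
    best = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            seg = bits[i:j]
            cost = best[i] + segment_cost(sum(seg), len(seg) - sum(seg), n)
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts, j = [], n                                   # recover boundaries via back-pointers
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts), best[n]

bits = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
print(best_segmentation(bits))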
• 45. Non-independent models
• In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form shown below
• The (k,h) segmentation problem:
– Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
– (k,k) segmentation is solvable by dynamic programming
– The general (k,h) problem is NP-hard
Example with a reused model: 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
• 46. Approximations for (k,h) segmentation
1. Solve (k,k) to get k segments
2. Solve (n,h) to get h models
– Example: 2-medians
3. Assign each segment to the best of the h models
• A second variant (replace step 2 above with):
– Find summary statistics for each of the k segments and cluster them into h clusters
[diagram: sequence ACTGGTTTACCCCTGTG split into segments S1–S4, assigned to models M1 and M2]
  • 47. Results of segmenting genetic sequences From: Gionis & Mannila 2003
  • 48. Outline • Traditional mining operations on sequences • Primitives for handling sequence data • Sequence-specific mining operations • Applications in biological sequence mining 1. Classifying proteins according to family 2. Finding genes in DNA sequences
  • 49. Two sequence mining applications
  • 50. The protein classification task Given: amino-acid sequence of a protein Do: predict the family to which it belongs GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH
  • 51. image from the DOE Human Genome Program http://www.ornl.gov/hgmis
  • 52. The roles of proteins • A protein family is… Figure from the DOE Human Genome Program http://www.ornl.gov/hgmis
  • 53. The roles of proteins • The amino-acid sequence of a protein determines its structure • The structure of a protein determines its function • Proteins play many key roles in cells – structural support – storage of amino acids – transport of other substances – coordination of an organism’s activities – response of cell to chemical stimuli – movement – protection against disease – selective acceleration of chemical reactions
  • 54. Protein taxonomies • The SCOP and CATH databases provide hierarchical taxonomies of protein families
• 55. An alignment of globin family proteins • Figure from www-cryst.bioc.cam.ac.uk/~max/res_globin.html – The sequences in a family may vary in length – Some positions are more conserved than others
• 56. Profile HMMs
• Profile HMMs are commonly used to model families of sequences
– Match states represent key conserved positions
– Insert states account for extra characters in some sequences
– Delete states are silent; they account for characters missing in some sequences
– Insert and match states have emission distributions over sequence characters (e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, …)
[diagram: profile HMM with match states m1–m3, insert states i0–i3, delete states d1–d3, and start/end states]
• 57. Profile HMMs
• To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest
• Given a sequence x, use Bayes’ rule to make the classification:
Pr(ci | x) = Pr(x | ci) Pr(ci) / Σj Pr(x | cj) Pr(cj)
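A small sketch of this Bayes-rule step, assuming a hypothetical family_log_likelihood(x, hmm) function (for example, the Forward algorithm shown a few slides later) and a dict of trained per-family profile HMMs; all names are my own.

import math

def classify_by_family(x, family_hmms, family_priors, family_log_likelihood):
    # Return Pr(family | x) for every family using Bayes' rule in log space
    log_joint = {c: family_log_likelihood(x, hmm) + math.log(family_priors[c])
                 for c, hmm in family_hmms.items()}
    shift = max(log_joint.values())                       # avoid underflow when exponentiating
    unnorm = {c: math.exp(v - shift) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}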
• 58. How Likely is a Given Sequence?
• Consider computing Pr(x | ci) in a profile HMM
• The probability that the path π_0 … π_N is taken and the sequence x_1 … x_L is generated:
Pr(x_1 … x_L, π_0 … π_N) = a_{0 π1} ∏_{i=1..L} e_{πi}(x_i) a_{πi πi+1}
where the a terms are transition probabilities and the e terms are emission probabilities
• 59. How Likely Is A Given Sequence?
[diagram: small HMM with begin state 0, end state 5, emitting states with distributions over {A, C, G, T}, and transition probabilities on the edges]
Pr(AAC, π) = a_{01} · e_1(A) · a_{11} · e_1(A) · a_{13} · e_3(C) · a_{35}
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
• 60. How Likely is a Given Sequence?
• the probability over all paths is:
Pr(x_1 … x_L) = Σ_π Pr(x_1 … x_L, π_0 … π_N)
• but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently using dynamic programming
• 61. How Likely is a Given Sequence: The Forward Algorithm
• define f_k(i) to be the probability of being in state k having observed the first i characters of x
• we want to compute f_N(L), the probability of being in the end state having observed all of x
• can define this recursively
• 62. The Forward Algorithm
• initialization:
f_0(0) = 1   (probability that we’re in the start state and have observed 0 characters from the sequence)
f_k(0) = 0   for states k that are not silent
• recursion for emitting states (i = 1…L):
f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_{kl}
• recursion for silent states:
f_l(i) = Σ_k f_k(i) a_{kl}
• termination:
Pr(x) = Pr(x_1 … x_L) = f_N(L) = Σ_k f_k(L) a_{kN}
(probability that we’re in the end state and have observed the entire sequence)
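A minimal sketch of this recursion for an HMM with no internal silent states, using hypothetical parameter names: a0 for the begin-state transitions, A for the state-to-state transitions, aN for the transitions into the end state, and E for the emission matrix.

import numpy as np

def forward_prob(x, a0, A, aN, E):
    # Pr(x) for a list of observation indices x
    f = a0 * E[:, x[0]]                    # f_k(1) = a_{0k} e_k(x_1)
    for symbol in x[1:]:
        f = (f @ A) * E[:, symbol]         # f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_{kl}
    return float(f @ aN)                   # f_N(L) = sum_k f_k(L) a_{kN}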
  • 63. Training a profile HMM • The parameters in profile HMMs are typically trained using the Baum-Welch method (an EM algorithm) • Heuristic methods are used to determine the length of the model – Initialize using an alignment of the training sequences (as in globin example) – Iteratively make local adjustments to length if delete or insert states are used “too much” for training sequences
• 64. The Fisher kernel method for protein classification
• Standard HMMs are generative models
– Training involves estimating Pr(x | ci)
– Predictions based on Pr(ci | x) are made using Bayes’ rule
• Sometimes we can get more accurate predictions using discriminative methods which try to optimize Pr(ci | x) directly
– One example: the Fisher kernel method [Jaakkola et al. ‘00]
• 65. The Fisher kernel method for protein classification
• Consider learning to discriminate proteins in class c1 from proteins in other families
1. Train an HMM for c1
2. Use the HMM to map each protein sequence x into a fixed-length vector U_x
3. Train a support vector machine (SVM) whose kernel function K is based on the Euclidean distance between these vectors
• The resulting discriminative model is given by
D(x) = Σ_{i: xi ∈ c1} λ_i K(x, xi) − Σ_{i: xi ∉ c1} λ_i K(x, xi)
• 66. The Fisher kernel method for protein classification
• How can we get an informative fixed-length vector? Compute the Fisher score
U_x = ∇_θ log Pr(x | c1, θ)
i.e. each component of U_x is a derivative of the log-likelihood of x with respect to a particular parameter θ_i of the HMM:
[U_x]_i = ∂ log Pr(x | c1, θ) / ∂θ_i
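A minimal sketch (not from the tutorial) of the Fisher-score mapping, differentiating numerically with respect to the emission parameters and reusing the forward_prob sketch given after slide 62; a real implementation would use the analytic gradient obtained from the forward-backward recursions.

import numpy as np

def fisher_score(x, a0, A, aN, E, eps=1e-6):
    # Numerical gradient of log Pr(x) w.r.t. the emission parameters E (flattened);
    # assumes forward_prob from the earlier sketch is in scope
    base = np.log(forward_prob(x, a0, A, aN, E))
    score = np.zeros(E.size)
    E_flat = E.ravel()
    for i in range(E.size):
        E_pert = E_flat.copy()
        E_pert[i] += eps
        score[i] = (np.log(forward_prob(x, a0, A, aN, E_pert.reshape(E.shape))) - base) / eps
    return score   # fixed-length vector U_x that the SVM kernel operates on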
• 67. Profile HMM accuracy • Figure from Jaakkola et al., ISMB 1999 • classifying 2,447 proteins into 33 families • x-axis represents the median fraction of negative sequences that score as high as a positive sequence for a given family’s model [figure: curves for BLAST-based methods, profile HMMs, and profile HMMs with the Fisher kernel SVM]
  • 68. The gene finding task Given: an uncharacterized DNA sequence Do: locate the genes in the sequence, including the coordinates of individual exons and introns image from the UCSC Genome Browser http://genome.ucsc.edu/
  • 69. image from the DOE Human Genome Program http://www.ornl.gov/hgmis
• 70. The structure of genes
• Genes consist of alternating sequences of exons and introns
• Introns are spliced out before the gene is translated into protein
• Exons consist of three-letter words, called codons
• Each codon encodes a single amino acid (a character in a protein sequence)
[diagram: intergenic region – exon – intron – exon – intron – exon – intergenic region, with exon codons such as ATG GAA ACC CGA TCG GGC … and a stop codon TAA]
  • 71. Each shape represents a functional unit of a gene or genomic region Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after 1st base in codon, after 2nd base or after 3rd base) Complementary submodel (not shown) detects genes on opposite DNA strand The GENSCAN HMM for gene finding [Burge & Karlin ‘97]
• 72. The GENSCAN HMM • For each sequence type, GENSCAN models – the length distribution – the sequence composition • Length distribution models vary depending on sequence type – Nonparametric (using histograms) – Parametric (using geometric distributions) – Fixed-length • Sequence composition models vary depending on type – 5th-order, inhomogeneous – 5th-order, homogeneous – Independent and 1st-order inhomogeneous – Tree-structured variable memory
  • 73. Representing exons in GENSCAN • For exons, GENSCAN uses – Histograms to represent exon lengths – 5th-order, inhomogeneous Markov models to represent exon sequences • 5th-order, inhomogeneous models can represent statistics about pairs of neighboring codons
• 74. A 5th-order Markov model for DNA
Pr(GCTACA) = Pr(GCTAC) Pr(A | GCTAC)
[diagram: states for 5-mers (AAAAA, …, GCTAC, …, TTTTT), a start state contributing Pr(GCTAC), and transitions such as Pr(A | GCTAC) into the next 5-mer states CTACA, CTACC, CTACG, CTACT]
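A minimal count-based sketch (invented names, not GENSCAN's implementation) of such a model: 5th-order transition probabilities are estimated from 6-mer counts, and a sequence is scored as Pr(first 5-mer) times a product of Pr(next base | previous 5 bases).

import math
from collections import Counter

def train_dna_markov(sequences, order=5, alpha=1.0):
    kmer_counts, context_counts, start_counts = Counter(), Counter(), Counter()
    for seq in sequences:
        start_counts[seq[:order]] += 1
        for i in range(order, len(seq)):
            context_counts[seq[i - order:i]] += 1          # previous `order` bases
            kmer_counts[seq[i - order:i + 1]] += 1          # previous bases + current base
    return kmer_counts, context_counts, start_counts, order, alpha

def dna_log_prob(seq, model):
    kmer_counts, context_counts, start_counts, order, alpha = model
    n_starts = sum(start_counts.values())
    lp = math.log((start_counts[seq[:order]] + alpha) / (n_starts + alpha * 4 ** order))
    for i in range(order, len(seq)):
        context, kmer = seq[i - order:i], seq[i - order:i + 1]
        lp += math.log((kmer_counts[kmer] + alpha) / (context_counts[context] + 4 * alpha))
    return lp

model = train_dna_markov(["GCTACGGAGCTTCGGAGC", "GCTACAACTG"])
print(dna_log_prob("GCTACA", model))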
• 75. Markov models for exons
• for each “word” we evaluate, we’ll want to consider its position with respect to the assumed codon framing
• thus we’ll want to use an inhomogeneous model to represent genes
Example (sequence G C T A C G G A G C T T C G G A G C):
– for the word G C T A C G, the final G is in the 3rd codon position
– for the word C T A C G G, the final G is in the 1st position
– for the word T A C G G A, the final A is in the 2nd position
• 76. A 5th-order inhomogeneous model
[diagram: three banks of 5-mer states, one per codon position; e.g. GCTAC in position 1 leads to CTACA/CTACC/CTACG/CTACT in position 2, which lead to TACA* states in position 3; transitions from position 3 go back to states in position 1]
  • 77. Inference with the gene-finding HMM given: an uncharacterized DNA sequence do: find the most probable path through the model for the sequence • This path will specify the coordinates of the predicted genes (including intron and exon boundaries) • The Viterbi algorithm is used to compute this path
• 78. Finding the most probable path: the Viterbi algorithm
• define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
• we want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
• can define v_N(L) recursively
• can use dynamic programming to find it efficiently
• 79. Finding the most probable path: the Viterbi algorithm
• initialization:
v_0(0) = 1
v_k(0) = 0   for states k that are not silent
• 80. The Viterbi algorithm
• recursion for emitting states (i = 1…L):
v_l(i) = e_l(x_i) max_k [v_k(i−1) a_{kl}]
ptr_l(i) = argmax_k [v_k(i−1) a_{kl}]   (keep track of the most probable path)
• recursion for silent states:
v_l(i) = max_k [v_k(i) a_{kl}]
ptr_l(i) = argmax_k [v_k(i) a_{kl}]
• 81. The Viterbi algorithm
• termination:
Pr(x, π*) = max_k [v_k(L) a_{kN}]
π*_L = argmax_k [v_k(L) a_{kN}]
• to recover the most probable path, follow the pointers back starting at π*_L
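A minimal sketch of this recursion for an HMM with no internal silent states, using the same hypothetical parameter names as the Forward sketch after slide 62 (a0: begin transitions, A: state-to-state transitions, aN: transitions into the end state, E: emissions); log space is used to avoid underflow.

import numpy as np

def viterbi(x, a0, A, aN, E):
    # Most probable state path for a list of observation indices x
    logA, logE = np.log(A), np.log(E)
    v = np.log(a0) + logE[:, x[0]]                  # v_k(1)
    ptr = []
    for symbol in x[1:]:
        scores = v[:, None] + logA                  # scores[k, l] = v_k(i-1) + log a_{kl}
        ptr.append(scores.argmax(axis=0))           # best predecessor for each state l
        v = scores.max(axis=0) + logE[:, symbol]    # v_l(i)
    last = int((v + np.log(aN)).argmax())           # termination: best final state
    path = [last]
    for back in reversed(ptr):                      # follow the pointers back
        path.append(int(back[path[-1]]))
    return list(reversed(path))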
  • 82. Parsing a DNA Sequence ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc
  • 83. Some lessons from these biological applications • HMMs provide state-of-the-art performance in protein classification and gene finding • HMMs can naturally classify and parse sequences of variable length • Much domain knowledge can be incorporated into the structure of the models – Types, lengths and ordering of sequence features – Appropriate amount of memory to represent various sequence features • Models can vary representation across sequence features • Discriminative methods often provide superior predictive accuracy to generative methods
• 84. References
• S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
• A. Apostolico and G. Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. In Proceedings of RECOMB 2000. http://citeseer.nj.nec.com/apostolico00optimal.html
• V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD 2001.
• C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78–94, 1997.
• M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.
• S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. VLDB, 1998.
• M. Collins. Discriminative training methods for Hidden Markov Models: theory and experiments with perceptron algorithms. EMNLP 2002.
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
• E. Eskin, W. Lee, and S. J. Stolfo. Modeling system calls for intrusion detection with dynamic window sizes. Proceedings of DISCEX II, June 2001.
• Columbia IDS publications: http://www.cs.columbia.edu/ids/publications/
• D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI 2000.
• 85. References
• A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003.
• M. T. Goodrich. Efficient piecewise-approximation using the uniform metric. Symposium on Computational Geometry, 1994.
• D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz, 1999.
• K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998.
• L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Research, 30:264–267, 2002.
• W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.
• A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML 2000.
• L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, 1990, pages 267–296.
• D. Ron, Y. Singer, and N. Tishby. The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25:117–149, 1996.
• C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: alternative data models. 1999 IEEE Symposium on Security and Privacy, 1999.