Video 4/4 – Sequence Mining
Alexis Bondu
www.edge-ml.fr
MODL : A Bayesian approach for model selection
Extraction of sequential rules
A new kind of variable
< … > sequences
Univariate
Multivariate
[4] M. E. Egho, D. Gay, N. Voisine, M. Boullé, F. Clérot. A Parameter-Free Approach for Mining Robust
Sequential Classification Rules. ICDM 2015.
Marc Boullé : http://www.marc-boulle.fr
Bibliography, implemented articles
Sequential data
Class, Sequence
0, <A,B,D,D,D,E,B,A,D,A,E,A,D>
1, <D,A,B,D,E,D,A,D,E,D,A,D,A,E,D,D,D,E,A,D,D,E,D,A,D,E>
1, <A,C,C,V,A,C,C,A,V,V,A,C,C,A,V,V,A,C,C,A,V>
0, <C,A,B,D,A,C,B,A,E,A,C>
0, <B,A,C,B,C,A,B,E>
1, <A,C,B,A,B,C,D,A,E>
0, <A,B,B,A,C,B,A,C,C,A,B,B,A,C>
1, <A,B,C>
0, <A,B,C,A,B,E,E>
1, <B,C,A,C,C,A,E,E,D,A,E,D,A>
1, <A,B,C,D,A,B,C,E>
0, <A,B,B,C,A,C,D,A,C,D,A,B,B,A,C,D,A,A,B,C,A,D,E,A,E,C,C,A,D>
1, <A,B,C,B,A,C,B,B,C,A,B,D,D,D,A,E>
0, <A,B,C,A,B,C,D,A,D,C,C,A,D,A,E>
0, <A,B,B,C,A,C,E,E,E>
DNA / Texts / WEB sessions / Predictive maintenance
Sequential data
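As a toy illustration, labeled sequences in the "Class, Sequence" form shown above can be parsed into simple Python pairs. This is only a sketch; the function name is illustrative.

```python
# Minimal sketch: parse a "Class, Sequence" line such as '0, <A,B,C>'
# into a (class, list-of-symbols) pair.

def parse_line(line):
    """Parse e.g. '0, <A,B,C>' into (0, ['A', 'B', 'C'])."""
    label, seq = line.split(",", 1)
    symbols = seq.strip().strip("<>").split(",")
    return int(label), symbols

dataset = [parse_line(l) for l in [
    "0, <A,B,D,D,D,E,B,A,D,A,E,A,D>",
    "1, <A,B,C>",
]]
print(dataset[1])  # (1, ['A', 'B', 'C'])
```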
An example
Text categorization
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative information about
the complete transcription profile of cells that facilitate drug and therapeutics development,
disease diagnosis, and understanding in the basic cell biology. One of the challenges in
microarray analysis, especially in cancerous gene expression profiles, is to identify genes
or groups of genes that are highly expressed in tumour cells but not in normal cells and
vice versa. Previously, we have shown that ensemble machine learning consistently
performs well in classifying biological data. In this paper, we focus on three different
supervised machine learning techniques in cancer classification, namely C4.5 decision
tree, and bagged and boosted decision trees.
Two classes of scientific articles : medicine, machine learning
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene_expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative_information
about the complete transcription profile of cells that facilitate drug and therapeutics
development, disease_diagnosis, and understanding in the basic cell biology. One of the
challenges in microarray analysis, especially in cancerous gene_expression profiles, is to
identify genes or groups of genes that are highly expressed in tumour_cells but not in
normal cells and vice versa. Previously, we have shown that ensemble machine_learning
consistently performs well in classifying biological data. In this paper, we focus on three
different supervised machine_learning techniques in cancer classification, namely C4.5
decision_tree, and bagged and boosted decision_trees.
< classifying, data > → P(ML) = 95%, P(medicine) = 5%
An example
Text categorization
Two classes of scientific articles : medicine, machine learning
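To make the rule above concrete, here is a minimal sketch of matching a sequential rule such as < classifying, data > against a tokenized abstract. The ordered-with-gaps matching semantics is an assumption of this sketch, and the class probabilities attached to a rule would come from the training data.

```python
# Sketch: a sequence matches a sequential rule if the rule's symbols
# occur in the sequence in order (not necessarily adjacently).

def matches(rule, sequence):
    """True if `rule` is an ordered subsequence of `sequence`."""
    it = iter(sequence)                      # membership tests consume the iterator,
    return all(symbol in it for symbol in rule)  # so order is enforced

tokens = ("we focus on three supervised machine learning techniques "
          "in classifying biological data").split()
print(matches(["classifying", "data"], tokens))  # True
print(matches(["data", "classifying"], tokens))  # False (wrong order)
```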
The MODL optimization criterion
Choice of the number of distinct
symbols within the rules
Choice of the length of the rule
Choice of the distinct symbols
within the rule
Choice of the order of the symbols
Description of the distribution of class values INSIDE the rule
The same OUTSIDE the rule
Likelihood of the data INSIDE the rule
Likelihood of the data OUTSIDE the rule
Prior : Favors simple rules
Likelihood : Favors informative rules
A natural tradeoff which favors robustness
The MODL optimization criterion
Robustness of the criterion
Robustness of the compression gain illustrated by using the dataset « skater »
Criteria compared : confidence, growth rate, compression gain (MODL)
GC = 1 - ( -log P(M | D) ) / ( -log P(M0 | D) )
Recall : the compression gain compares the coding length of the current model with that of the null model M0, which includes no element in the rule.
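The compression gain can be sketched in code as follows. The coding lengths below follow a generic MODL-style multinomial scheme and are NOT the exact criterion of [4]; in particular, the prior cost of the rule (length, choice of symbols, order) is collapsed into a single hypothetical number passed in by the caller.

```python
import math

# Schematic sketch of GC = 1 - cost(M) / cost(M0), where cost(.) is a
# coding length (prior + likelihood). Illustrative, not the criterion of [4].

def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def class_coding_length(counts):
    """Cost of describing a class distribution and the labels it generates."""
    n, J = sum(counts), len(counts)
    dist = log_binomial(n + J - 1, J - 1)    # choice of the class distribution
    labels = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return dist + labels                     # multinomial coding of the labels

def compression_gain(inside_counts, outside_counts, rule_prior_cost):
    total = [i + o for i, o in zip(inside_counts, outside_counts)]
    cost_rule = (rule_prior_cost
                 + class_coding_length(inside_counts)    # data INSIDE the rule
                 + class_coding_length(outside_counts))  # data OUTSIDE the rule
    cost_null = class_coding_length(total)               # M0 : no rule at all
    return 1 - cost_rule / cost_null

# A rule whose support is almost pure in class 1 compresses the labels well:
print(compression_gain([1, 40], [50, 9], rule_prior_cost=5.0) > 0)  # True
```

An uninformative split (same class proportions inside and outside) costs more than the null model, so its gain is negative: this is the prior/likelihood tradeoff favoring robustness.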
Other kinds of variable: adaptation of the criterion
< … > sequences
[ … ] lists
{ … } sets
1 – Computation of the supports (specific indexing technique)
2 – The order of the symbols is not encoded for the sub-set rules
Sub-list : contiguous symbols
Sub-set : symbols without order
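The three rule shapes differ only in their matching semantics, which a short sketch makes explicit (illustrative; this is not the indexing technique used for support computation in [4]):

```python
# Sketch of the three matching semantics for rules over symbol sequences.

def supports_subsequence(rule, seq):   # < … > : ordered, gaps allowed
    it = iter(seq)
    return all(s in it for s in rule)

def supports_sublist(rule, seq):       # [ … ] : contiguous symbols
    k = len(rule)
    return any(seq[i:i + k] == rule for i in range(len(seq) - k + 1))

def supports_subset(rule, seq):        # { … } : symbols without order
    return set(rule) <= set(seq)

seq = list("ABCADE")
print(supports_sublist(["C", "A"], seq))   # True  (contiguous "CA")
print(supports_sublist(["A", "E"], seq))   # False (never adjacent)
print(supports_subset(["E", "A"], seq))    # True  (order ignored)
```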
Rule Mining Algorithm
1 – Exploration
Random drawing of sub-sequences
2 – Filtering
Computation of the compression gain
3 – Collection of rules
A reservoir of rules is constituted
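The three-step mining loop above can be sketched as follows. The `compression_gain` callback stands in for the MODL criterion, and all names and defaults are illustrative.

```python
import random

# Sketch of the mining loop: draw random sub-sequences, keep those with a
# positive compression gain, and collect them in a reservoir of rules.

def draw_subsequence(sequences, max_len=3):
    """Draw a random contiguous sub-sequence from a random sequence."""
    seq = random.choice(sequences)
    k = random.randint(1, min(max_len, len(seq)))
    start = random.randrange(len(seq) - k + 1)
    return tuple(seq[start:start + k])

def mine_rules(sequences, compression_gain, n_draws=1000):
    reservoir = {}
    for _ in range(n_draws):                  # 1 - exploration
        rule = draw_subsequence(sequences)
        gain = compression_gain(rule)         # 2 - filtering
        if gain > 0:
            reservoir[rule] = gain            # 3 - collection
    return reservoir
```

In the real algorithm the gain is computed from the rule's support over the labeled data; here it is left abstract so the control flow stays visible.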
Classification
Recoding the rules & Training of a classifier
Ensemble of
informative rules
A B C D E F G
0 0 1 0 1 0 0
1 1 0 0 0 1 0
0 1 0 0 0 0 1
1 1 0 0 0 0 0
0 1 0 1 0 0 0
Binary recoding
Rules
Observations
Training of the
classifier
Compression gain > 0
Selection of independent rules
1 – Sort
Sub-sequences ordered by decreasing GC
2 – Filtering
Marginal compression gain
3 – Independent rules
Objective : interpretability / performance of the classifier
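The greedy selection can be sketched as follows. The `marginal_gain` callback is a hypothetical stand-in for the marginal compression gain of a rule given the rules already selected.

```python
# Sketch of independent-rule selection: sort by decreasing compression
# gain, then keep a rule only if its marginal gain over the already
# selected set is still positive.

def select_independent(rules_with_gain, marginal_gain):
    ranked = sorted(rules_with_gain, key=lambda rg: rg[1], reverse=True)  # 1 - sort
    selected = []
    for rule, _ in ranked:                                                # 2 - filtering
        if marginal_gain(rule, selected) > 0:
            selected.append(rule)                                         # 3 - independent rules
    return selected
```

With a toy marginal gain that zeroes out any rule redundant with an already-selected one, the redundant rule is skipped while independent rules survive.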
Use cases
Applications on textual datasets
SMS :
- No preprocessing
- 2 classes (spam / non spam)
- AUC = 0.96 with 50 rules
E-mails Reuters :
- No preprocessing
- 10 classes
- 4 sequential variables (organization / place / object / body)
- AUC = 0.975 with 1000 rules
- AUC = 0.935 with 50 rules
Wikipedia :
- No preprocessing
- 10 classes
- AUC = 0.991 with 2000 rules
FREE → P(spam) = 0.986842
have won → P(spam) = 1.000000
You have → P(spam) = 0.791667
URGENT! → P(spam) = 1.000000
to contact → P(spam) = 0.948718
STOP → P(spam) = 0.979167
now! → P(spam) = 0.907407
awarded → P(spam) = 1.000000
£1000 → P(spam) = 1.000000
guaranteed → P(spam) = 1.000000
Use cases
Applications on textual datasets
You know all about Edge ML’s algorithms!
The automated Machine Learning pipeline : MODL is everywhere!


Editor's Notes

  • #2 Edge ML YouTube channel (Auto ML). Video intended for data scientists. Videos to be watched in sequence!! (maths) - Today : intro to Auto ML + the specifics of MODL
  • #13 Two parameters, time and number of rules -> smart default values