Video 4/4 – Sequence Mining
Alexis Bondu
www.edge-ml.fr
MODL : A Bayesian approach for model selection
Extraction of sequential rules
A new kind of variable
< … > sequences
Univariate
Multivariate
[4] M. E. Egho, D. Gay, N. Voisine, M. Boullé, F. Clérot. A Parameter-Free Approach for Mining Robust
Sequential Classification Rules. ICDM 2015.
Marc Boullé : http://www.marc-boulle.fr
Bibliography, implemented articles
Sequential data
Class, Sequence
0, <A,B,D,D,D,E,B,A,D,A,E,A,D>
1, <D,A,B,D,E,D,A,D,E,D,A,D,A,E,D,D,D,E,A,D,D,E,D,A,D,E>
1, <A,C,C,V,A,C,C,A,V,V,A,C,C,A,V,V,A,C,C,A,V>
0, <C,A,B,D,A,C,B,A,E,A,C>
0, <B,A,C,B,C,A,B,E>
1, <A,C,B,A,B,C,D,A,E>
0, <A,B,B,A,C,B,A,C,C,A,B,B,A,C>
1, <A,B,C>
0, <A,B,C,A,B,E,E>
1, <B,C,A,C,C,A,E,E,D,A,E,D,A>
1, <A,B,C,D,A,B,C,E>
0, <A,B,B,C,A,C,D,A,C,D,A,B,B,A,C,D,A,A,B,C,A,D,E,A,E,C,C,A,D>
1, <A,B,C,B,A,C,B,B,C,A,B,D,D,D,A,E>
0, <A,B,C,A,B,C,D,A,D,C,C,A,D,A,E>
0, <A,B,B,C,A,C,E,E,E>
DNA / Texts / WEB sessions / Predictive maintenance
Sequential data
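As a toy illustration, labeled sequences in the "Class, Sequence" form shown above can be parsed into simple Python pairs. This is only a sketch; the function name is illustrative.

```python
# Minimal sketch: parse a "Class, Sequence" line such as '0, <A,B,C>'
# into a (class, list-of-symbols) pair.

def parse_line(line):
    """Parse e.g. '0, <A,B,C>' into (0, ['A', 'B', 'C'])."""
    label, seq = line.split(",", 1)
    symbols = seq.strip().strip("<>").split(",")
    return int(label), symbols

dataset = [parse_line(l) for l in [
    "0, <A,B,D,D,D,E,B,A,D,A,E,A,D>",
    "1, <A,B,C>",
]]
print(dataset[1])  # (1, ['A', 'B', 'C'])
```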
An example
Text categorization
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative information about
the complete transcription profile of cells that facilitate drug and therapeutics development,
disease diagnosis, and understanding in the basic cell biology. One of the challenges in
microarray analysis, especially in cancerous gene expression profiles, is to identify genes
or groups of genes that are highly expressed in tumour cells but not in normal cells and
vice versa. Previously, we have shown that ensemble machine learning consistently
performs well in classifying biological data. In this paper, we focus on three different
supervised machine learning techniques in cancer classification, namely C4.5 decision
tree, and bagged and boosted decision trees.
Two classes of scientific articles : medicine, machine learning
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene_expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative_information
about the complete transcription profile of cells that facilitate drug and therapeutics
development, disease_diagnosis, and understanding in the basic cell biology. One of the
challenges in microarray analysis, especially in cancerous gene_expression profiles, is to
identify genes or groups of genes that are highly expressed in tumour_cells but not in
normal cells and vice versa. Previously, we have shown that ensemble machine_learning
consistently performs well in classifying biological data. In this paper, we focus on three
different supervised machine_learning techniques in cancer classification, namely C4.5
decision_tree, and bagged and boosted decision_trees.
< classifying, data > → P(ML) = 95%, P(medicine) = 5%
An example
Text categorization
Two classes of scientific articles : medicine, machine learning
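To make the rule above concrete, here is a minimal sketch of matching a sequential rule such as < classifying, data > against a tokenized abstract. The ordered-with-gaps matching semantics is an assumption of this sketch, and the class probabilities attached to a rule would come from the training data.

```python
# Sketch: a sequence matches a sequential rule if the rule's symbols
# occur in the sequence in order (not necessarily adjacently).

def matches(rule, sequence):
    """True if `rule` is an ordered subsequence of `sequence`."""
    it = iter(sequence)                      # membership tests consume the iterator,
    return all(symbol in it for symbol in rule)  # so order is enforced

tokens = ("we focus on three supervised machine learning techniques "
          "in classifying biological data").split()
print(matches(["classifying", "data"], tokens))  # True
print(matches(["data", "classifying"], tokens))  # False (wrong order)
```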
The MODL optimization criterion
Choice of the number of distinct
symbols within the rules
Choice of the length of the rule
Choice of the distinct symbols
within the rule
Choice of the order of the symbols
Description of the distribution of class values INSIDE the rule
The same OUTSIDE the rule
Likelihood of the data INSIDE the rule
Likelihood of the data OUTSIDE the rule
Prior : Favors simple rules
Likelihood : Favors informative rules
A natural tradeoff which favors robustness
The MODL optimization criterion
Robustness of the criterion
Robustness of the compression gain illustrated by using the dataset « skater »
Criteria compared : confidence, growth rate, compression gain (MODL)
GC = 1 - ( -log P(M | D) ) / ( -log P(M0 | D) )
Recall : the compression gain compares the coding length of the current model with that of the null model M0, which includes no element in the rule.
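The compression gain can be sketched in code as follows. The coding lengths below follow a generic MODL-style multinomial scheme and are NOT the exact criterion of [4]; in particular, the prior cost of the rule (length, choice of symbols, order) is collapsed into a single hypothetical number passed in by the caller.

```python
import math

# Schematic sketch of GC = 1 - cost(M) / cost(M0), where cost(.) is a
# coding length (prior + likelihood). Illustrative, not the criterion of [4].

def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def class_coding_length(counts):
    """Cost of describing a class distribution and the labels it generates."""
    n, J = sum(counts), len(counts)
    dist = log_binomial(n + J - 1, J - 1)    # choice of the class distribution
    labels = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return dist + labels                     # multinomial coding of the labels

def compression_gain(inside_counts, outside_counts, rule_prior_cost):
    total = [i + o for i, o in zip(inside_counts, outside_counts)]
    cost_rule = (rule_prior_cost
                 + class_coding_length(inside_counts)    # data INSIDE the rule
                 + class_coding_length(outside_counts))  # data OUTSIDE the rule
    cost_null = class_coding_length(total)               # M0 : no rule at all
    return 1 - cost_rule / cost_null

# A rule whose support is almost pure in class 1 compresses the labels well:
print(compression_gain([1, 40], [50, 9], rule_prior_cost=5.0) > 0)  # True
```

An uninformative split (same class proportions inside and outside) costs more than the null model, so its gain is negative: this is the prior/likelihood tradeoff favoring robustness.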
Other kinds of variable: adaptation of the criterion
< … > sequences
[ … ] lists
{ … } sets
1 – Computation of the supports (specific indexing technique)
2 – The order of the symbols is not encoded for the sub-set rules
Sub-list : contiguous symbols
Sub-set : symbols without order
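The three rule shapes differ only in their matching semantics, which a short sketch makes explicit (illustrative; this is not the indexing technique used for support computation in [4]):

```python
# Sketch of the three matching semantics for rules over symbol sequences.

def supports_subsequence(rule, seq):   # < … > : ordered, gaps allowed
    it = iter(seq)
    return all(s in it for s in rule)

def supports_sublist(rule, seq):       # [ … ] : contiguous symbols
    k = len(rule)
    return any(seq[i:i + k] == rule for i in range(len(seq) - k + 1))

def supports_subset(rule, seq):        # { … } : symbols without order
    return set(rule) <= set(seq)

seq = list("ABCADE")
print(supports_sublist(["C", "A"], seq))   # True  (contiguous "CA")
print(supports_sublist(["A", "E"], seq))   # False (never adjacent)
print(supports_subset(["E", "A"], seq))    # True  (order ignored)
```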
Rule Mining Algorithm
1 – Exploration
Random drawing of sub-sequences
2 – Filtering
Computation of the compression gain
3 – Collection of rules
A reservoir of rules is constituted
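The three-step mining loop above can be sketched as follows. The `compression_gain` callback stands in for the MODL criterion, and all names and defaults are illustrative.

```python
import random

# Sketch of the mining loop: draw random sub-sequences, keep those with a
# positive compression gain, and collect them in a reservoir of rules.

def draw_subsequence(sequences, max_len=3):
    """Draw a random contiguous sub-sequence from a random sequence."""
    seq = random.choice(sequences)
    k = random.randint(1, min(max_len, len(seq)))
    start = random.randrange(len(seq) - k + 1)
    return tuple(seq[start:start + k])

def mine_rules(sequences, compression_gain, n_draws=1000):
    reservoir = {}
    for _ in range(n_draws):                  # 1 - exploration
        rule = draw_subsequence(sequences)
        gain = compression_gain(rule)         # 2 - filtering
        if gain > 0:
            reservoir[rule] = gain            # 3 - collection
    return reservoir
```

In the real algorithm the gain is computed from the rule's support over the labeled data; here it is left abstract so the control flow stays visible.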
Classification
Recoding the rules & Training of a classifier
Ensemble of
informative rules
A B C D E F G
0 0 1 0 1 0 0
1 1 0 0 0 1 0
0 1 0 0 0 0 1
1 1 0 0 0 0 0
0 1 0 1 0 0 0
Binary recoding
Rules
Observations
Training of the
classifier
Compression gain > 0
Selection of independent rules
1 – Sort
Sub-sequences ordered by decreasing GC
2 – Filtering
Marginal compression gain
3 – Independent rules
Objective : interpretability / performance of the classifier
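The greedy selection can be sketched as follows. The `marginal_gain` callback is a hypothetical stand-in for the marginal compression gain of a rule given the rules already selected.

```python
# Sketch of independent-rule selection: sort by decreasing compression
# gain, then keep a rule only if its marginal gain over the already
# selected set is still positive.

def select_independent(rules_with_gain, marginal_gain):
    ranked = sorted(rules_with_gain, key=lambda rg: rg[1], reverse=True)  # 1 - sort
    selected = []
    for rule, _ in ranked:                                                # 2 - filtering
        if marginal_gain(rule, selected) > 0:
            selected.append(rule)                                         # 3 - independent rules
    return selected
```

With a toy marginal gain that zeroes out any rule redundant with an already-selected one, the redundant rule is skipped while independent rules survive.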
Use cases
Applications on textual datasets
SMS :
- No preprocessing
- 2 classes (spam / non spam)
- AUC = 0.96 with 50 rules
E-mails Reuters :
- No preprocessing
- 10 classes
- 4 sequential variables (organization / place / object / body)
- AUC = 0.975 with 1000 rules
- AUC = 0.935 with 50 rules
Wikipedia :
- No preprocessing
- 10 classes
- AUC = 0.991 with 2000 rules
FREE → P(spam) = 0.986842
have won → P(spam) = 1.000000
You have → P(spam) = 0.791667
URGENT! → P(spam) = 1.000000
to contact → P(spam) = 0.948718
STOP → P(spam) = 0.979167
now! → P(spam) = 0.907407
awarded → P(spam) = 1.000000
£1000 → P(spam) = 1.000000
guaranteed → P(spam) = 1.000000
Use cases
Applications on textual datasets
You know all about Edge ML’s algorithms!
The automated Machine Learning pipeline : MODL is everywhere!


Editor's Notes

  • #2 Edge ML YouTube channel (Auto ML). Video intended for data scientists. Videos to be watched in sequence!! (maths) - Today : intro to Auto ML + the specifics of MODL
  • #13 Two parameters, time and number of rules -> smart default values