SENTIMENT ANALYSIS AND
CLASSIFICATION WITH A
NEURAL APPROACH
Patrice Bellot
Aix-Marseille Université - CNRS (LIS-INS2I)
patrice.bellot@univ-amu.fr
ANF CNRS INRAE workshop
November 2021
Machine learning, model, evaluation
2
Set of examples
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Training set
Validation set
Model
Test set
Data (X)  Expected answer (Y)    Data (X)  Expected answer (Y)
Training that seeks to minimize the error on these data
Evaluation, then tuning of the configuration (hyper-parameters, architecture…)
M
S
1
2
3
4
Data (X)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
Data (X)  Expected answer (Y)
5
Data (X)
Data (X)
Steps 2 and 3 are repeated for a number of epochs
Evaluation
Deployment
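
A minimal sketch of this 5-step workflow (illustrative only; scikit-learn and the toy arrays are assumptions, none of this code comes from the slides):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(100, 5)           # toy data (X)
y = np.random.randint(0, 2, 100)     # toy expected answers (Y)

# step 1: split the examples, e.g. 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# steps 2-3: train on (X_train, y_train), evaluate on (X_val, y_val),
#            tune hyper-parameters, and repeat over the epochs
# step 4:    final evaluation on (X_test, y_test)
# step 5:    apply the model to new data (X) without expected answers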
SENTIMENT (POLARITY) ANALYSIS
ON MOVIE REVIEWS
3
4
http://ai.stanford.edu/~amaas/data/sentiment/

Large Movie Review Dataset
Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y.,
& Potts, C. (2011, June). Learning word vectors for
sentiment analysis. In Proceedings of the 49th annual
meeting of the association for computational linguistics:
Human language technologies (pp. 142-150).
Training corpus (train): 12,500 positive reviews, 12,500 negative reviews
… uses disjoint sets of movies for training and testing.
These steps minimize the ability of a learner to rely
on idiosyncratic word–class associations, thereby
focusing attention on genuine sentiment features.
4.3.2 IMDB Review Dataset
We constructed a collection of 50,000 reviews from
IMDB, allowing no more than 30 reviews per movie.
The constructed dataset contains an even number of
positive and negative reviews, so randomly guessing
yields 50% accuracy. Following previous work on
polarity classification, we consider only highly po-
larized reviews. A negative review has a score ≤ 4
out of 10, and a positive review has a score ≥ 7
out of 10. Neutral reviews are not included in the
dataset. In the interest of providing a benchmark for
future work in this area, we release this dataset to
the public.2
We evenly divided the dataset into training and
test sets. The training set is the same 25,000 la-
beled reviews used to induce word vectors with our
model. We evaluate classifier performance after
cross-validating classifier parameters on the training
set, again using a linear SVM in all cases. Table 2
shows classification performance on our subset of
IMDB reviews. Our model showed superior per-
formance to other approaches, and performed best
when concatenated with bag of words representa-
tion. Again the variant of our model which utilized
extra unlabeled data during training performed best.
Differences in accuracy are small, but, because
our test set contains 25,000 examples, the variance
of the performance estimate is quite low. For ex-
ample, an accuracy increase of 0.1% corresponds to
correctly classifying an additional 25 reviews.
4.4 Subjectivity Detection
As a second evaluation task, we performed sentence-
level subjectivity classification. In this task, a clas-
sifier is trained to decide whether a given sentence is
subjective, expressing the writer’s opinions, or ob-
jective, expressing purely facts. We used the dataset
of Pang and Lee (2004), which contains subjective
sentences from movie review summaries and objec-
tive sentences from movie plot summaries. This task
is substantially different from the review classification task because it uses sentences as opposed to entire documents and the target concept is subjectivity instead of opinion polarity. We randomly split the 10,000 examples into 10 folds and report 10-fold cross-validation accuracy using the SVM training protocol of Pang and Lee (2004).
Table 2 shows classification accuracies from the sentence subjectivity experiment. Our model again provided superior features when compared against other VSMs. Improvement over the bag-of-words baseline is obtained by concatenating the two feature vectors.
5 Discussion
We presented a vector space model that learns word representations capturing semantic and sentiment information. The model's probabilistic foundation gives a theoretically justified technique for word vector induction as an alternative to the overwhelming number of matrix factorization-based techniques commonly used. Our model is parametrized as a log-bilinear model following recent success in using similar techniques for language models (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton, 2007), and it is related to probabilistic latent topic models (Blei et al., 2003; Steyvers and Griffiths, 2006). We parametrize the topical component of our model in a manner that aims to capture word representations instead of latent topics. In our experiments, our method performed better than LDA, which models latent topics directly.
We extended the unsupervised model to incorporate sentiment information and showed how the extended model can leverage the abundance of sentiment-labeled texts available online to yield word representations that capture both sentiment and semantic relations. We demonstrated the utility of such representations on two tasks of sentiment classification, using existing datasets as well as a larger one that we release for future research. These tasks involve relatively simple sentiment information, but the model is highly flexible in this regard; it can be used to characterize a wide variety…
The simplest approach: use existing modules
5
from textblob import TextBlob

print(TextBlob("I hate that movie").sentiment.polarity)     # -0.8

# review text as printed on the slide (each line is cropped at the slide edge)
texte = ("This is a gem. As a Film Four production - the anticipated quality was indeed delivered. Shot with … "
         "that reminded me some Errol Morris films, well arranged and simply gripping. It's long yet horrifying to the … "
         "excruciating. We know something bad happened (one can guess by the lack of participation of a person in the … "
         "but we are compelled to see it, a bit like a car accident in slow motion. The story spans most conceivable a… "
         "unlike some documentaries did not try and refrain from showing the grimmer sides of the stories, as also dea… "
         "guilt of the people Don left behind him, wondering why they didn't stop him in time. It took me a few hours … "
         "the melancholy that gripped me after seeing this very-well made documentary.")
print(TextBlob(texte).sentiment.polarity)                   # -0.054
But… how does it work? What is its average performance? How can it be improved?
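
To answer the "average performance" question, one could score the whole labeled corpus with TextBlob and compare against the labels — a hedged sketch (the CSV column names 'review' and 'sentiment' are assumptions):

import pandas as pd
from textblob import TextBlob

df = pd.read_csv("movie_dataset.csv")            # column names below are assumptions
pred = df["review"].apply(lambda t: 1 if TextBlob(t).sentiment.polarity > 0 else 0)
gold = (df["sentiment"] == "positive").astype(int)
print("mean accuracy of the TextBlob baseline:", (pred == gold).mean())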
6
size of the movie_dataset.csv file: 65.9 MB (50,000 rows, 14 million tokens, 194,758 distinct words)
7
the most frequent tokens:
100,749 words appear only once; 'the' appears 573,397 times
{'the': 573397, ',': 544031, '.': 467886, 'and': 309118, 'a': 309103, 'of': 285087, 'to': 263658, 'is': 214740, … }
the most frequent tokens after removing the stop words:
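
The frequency counts on this and the next slides can be reproduced with a simple counter — a sketch assuming NLTK (with its 'punkt' and 'stopwords' data installed) and the `df` dataframe from the previous sketch:

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

tokens = [t.lower() for review in df["review"] for t in word_tokenize(review)]
print(Counter(tokens).most_common(10))           # 'the', ',', '.', 'and', ...

stop = set(stopwords.words("english"))
content = Counter(t for t in tokens if t.isalpha() and t not in stop)
print(content.most_common(10))                   # most frequent tokens without stop words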
8
All the reviews combined
9
The positive reviews
10
The negative reviews
With a naive Bayes classifier (NB)
11
Splitting the examples:
50% training (train)
50% test
P(w_1, w_2, ..., w_{T-1}, w_T) = Π_{t=1}^{T} P(w_t | w_{t-1}, w_{t-2}, ..., w_1)

Similarity(d_1, d_2) ≈ d⃗_1 · d⃗_2
Similarity(d_1, d_2) ≈ cos(d⃗_1, d⃗_2)

P(Class | (w_1, w_2, ..., w_{T-1}, w_T))

−∑_{i=1}^{n} P_i log_2 P_i

X = (x_1, x_2, ..., x_n)

P(class | X) = P(X | class) P(class) / P(X)

P(X | class) = Π_i P(x_i | class) = P(x_1 | class) × ... × P(x_i | class) × ... × P(x_n | class)

accuracy(Y, Ŷ) = (1 / n_examples) ∑_i 1(ŷ_i = y_i)
Evaluation (classes 0 and 1) on the test data
predicted classes
actual classes
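
A hedged scikit-learn sketch of this naive Bayes setup (a bag-of-words MultinomialNB; the slides do not show their implementation, and `df` / `gold` come from the earlier sketches):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# 50% train / 50% test, as on the slide
X_train, X_test, y_train, y_test = train_test_split(df["review"], gold, test_size=0.5, random_state=0)

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(X_train), y_train)
y_pred = nb.predict(vec.transform(X_test))

print(accuracy_score(y_test, y_pred))            # the conclusion slide reports about 0.85
print(confusion_matrix(y_test, y_pred))          # predicted vs. actual classes (0 and 1)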
Neural network
Learning (non-linear) relations between the inputs and the output
weights
weights
weights
weights
activation
activation
weights
weights
activation
0 or 1
features (words)
"negative" class vs. "positive" class
Neural architectures
13
Riti	Dass


https://medium.com/predict/the-complete-list-to-make-you-an-ai-pro-be83448720b8
Deep	Learning	for	Ligand-Based	Virtual	Screening	in	Drug	Discovery


	 October	2018


	 DOI:	10.1109/PAIS.2018.8598488


	 Conference:	2018	3rd	International	Conference	on	Pattern	Analysis	and	Intelligent	Systems	(PAIS)


Meriem Bahi, Mohamed Batouche
[the left half of this column of the paper is cropped on the slide; the related-work text is not recoverable]
… modern DL has cracked the code for training stability
and generalization and scales on big data [17].
Deep learning algorithms are really deep architectures of
consecutive layers based learning multiple levels of repre-
sentation and abstraction. The potency of deep learning for
drug discovery shows the extraordinary promise. Here, we
provide a new compound classification technique based on
Deep Learning using Spark-H2O platform-python for Ligand-
based Virtual Screening. Within this scope, we use a Deep
Neural Networks algorithm to reduce considerably the search
space and classify rapidly the ligands as active or inactive
based on their heterogeneous molecular descriptors.
Fig. 1. Deep Neural Network architecture
As shown in Fig. 1, our deep neural network (DNN) is
a sequence of fully-connected layers that take the features of
each compound as input and classify these ligands as drug-like
or nondrug-like. Once input features are given to the DNN,
the output values are computed sequentially along the layers
of the network. At each layer, the weighted sum is calculated
by multiplying the input vector including the output values of
each unit in the previous layer by the weight vector for each
unit in the current hidden layer. The basic DNN architecture
consists of an input layer L0, an output layer Lout and H
hidden layers Lh (h ∈ {1, 2, ..., H}) between the input and output
layers. Each hidden layer Lh is a set of several units which
can be arranged as a vector a_h ∈ R^|Lh|, where |Lh| stands for
the number of units in Lh. Then, each hidden layer Lh can …
Dimensionality reduction (projection)
Word embeddings
14
Tommaso	Teofili


https://jaxenter.com/deep-learning-search-word2vec-147782.html
Adrian	Colyer	


https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
Efficient Estimation of Word Representations in Vector Space – Mikolov et al. 2013
Distributed Representations of Words and Phrases and their Compositionality – Mikolov et al. 2013
Word2Vec representation learning
15
for each word, a vector in 200 dimensions
The Word2Vec model learned on the reviews
16
#### Saving the learned model
# (reduced format here: saves space but does not allow resuming training
# on new texts -- to save a full model, call model.save instead)
#%%
nomEmbeddings = 'imdb_embeddings_word2vec_200_5_100'
model.wv.save_word2vec_format(nomEmbeddings, binary=False)
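
The counterpart of the save call above, to reload the vectors later (the probe word is only an illustration):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('imdb_embeddings_word2vec_200_5_100', binary=False)
print(wv.most_similar('excellent', topn=5))   # sanity check on the reloaded embeddings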
Visualizing the embeddings (2 dimensions)
17
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/
https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b
neighborhoods preserved (distances)
overall shape preserved (variance)
https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
18
representation of the full vocabulary (t-SNE approach)
t-distributed Stochastic Neighbor Embedding

tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)

about 30 s / iteration (single core)
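
Fitting and plotting with the tsne_model above could look like this — a sketch reusing the `wv` vectors loaded earlier (the `wv.vocab` attribute belongs to the gensim 3.x API that matches the `size=` keyword used on these slides; gensim 4.x uses `wv.index_to_key`):

import matplotlib.pyplot as plt

words = list(wv.vocab)                        # full vocabulary (gensim 3.x API)
coords = tsne_model.fit_transform(wv[words])  # slow: about 30 s per iteration on one core

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.show()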
19
20
https://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html
PCA of 3,000 randomly chosen words
21
PCA of 3,000 randomly chosen words, limiting Word2Vec to the words having at least 100 occurrences (i.e. 6,561 words)

model = gensim.models.Word2Vec(sentences=review_lines, size=200, window=5, workers=4, min_count=100)
22
t-SNE, limiting Word2Vec to the words having at least 100 occurrences (i.e. 6,561 words)
250 iterations
1000 iterations
23
t-SNE of the words with more than 500 occurrences, limiting Word2Vec to the words having at least 100 occurrences
24
t-SNE of the words with more than 1,000 occurrences, limiting Word2Vec to the words having at least 100 occurrences
250 iterations
1000 iterations
2500 iterations
25
t-SNE of the words with more than 5,000 occurrences, limiting Word2Vec to the words having at least 100 occurrences
2500 iterations

tsne_model = TSNE(perplexity=40, n_components=2, init='random', n_iter=2500, random_state=23, verbose=True, n_jobs=12)
Tested architecture
26
the embeddings of the review
Embedding layer
Flatten
… …
Dense
… …
Dense
output (predicted class)
word word word….
word 1
word 2
word 3
word 4
word 5
word 6
max length (input_length)
a movie review
lookup table
word
word
word
word
<— number of dimensions (output_dim) —>
..
The vectors (embeddings)
vocabulary size (input_dim)
[[6, 16], [42, 24], [2, 17], [42, 24], [18], [17], [22, 17], [27, 42],
[[ 6 16  0  0]
 [42 24  0  0]
 [ 2 17  0  0]
 [42 24  0  0]
 [18  0  0  0]
 [17  0  0  0]
 [22 17  0  0]
 [27 42  0  0]
 [22 24  0  0]
 [49 46 16 34]]
0 or 1
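
The index lists and the zero-padded matrix shown above are typically produced as follows — a hedged Keras sketch where `reviews_train` and `MAX_LEN` are placeholder names, not from the slides:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews_train)                      # one string per review
sequences = tokenizer.texts_to_sequences(reviews_train)    # e.g. [[6, 16], [42, 24], ...]
X_train_pad = pad_sequences(sequences, maxlen=MAX_LEN, padding='post')  # trailing zeros, as above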
27
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

DIMENSION_EMBEDDINGS = 200
modelEmbeddings = gensim.models.Word2Vec(sentences=review_lines, size=DIMENSION_EMBEDDINGS, window=5, workers=12, min_count=100)

1st try: keep only the words that appear at least 100 times
Network: a single hidden layer
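
The `embedding_layer` used above is not defined on the slides; a plausible construction copies the gensim vectors into a frozen Keras Embedding, matching the input_dim / output_dim / input_length labels of the previous slide (`tokenizer` and `MAX_LEN` come from the padding sketch):

import numpy as np
from keras.layers import Embedding

vocab_size = len(tokenizer.word_index) + 1                 # input_dim
weights = np.zeros((vocab_size, DIMENSION_EMBEDDINGS))     # output_dim columns
for word, i in tokenizer.word_index.items():
    if word in modelEmbeddings.wv:                         # words below min_count keep zero vectors
        weights[i] = modelEmbeddings.wv[word]

embedding_layer = Embedding(input_dim=vocab_size, output_dim=DIMENSION_EMBEDDINGS,
                            weights=[weights], input_length=MAX_LEN, trainable=False)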
28
29
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

2nd try: add a 2nd layer
30
31
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Epoch 16/20
35000/35000 [==============================] - 12s 330us/step - loss: 0.0199 - accuracy: 0.9935 - val_loss: 3.4786 - val_accuracy: 0.493

The 6,000 most frequent words are not enough
3rd try: add a 3rd layer…
32
DIMENSION_EMBEDDINGS = 200
# note: the slide reuses the name `model` for the gensim embeddings and then for the Keras classifier
model = gensim.models.Word2Vec(sentences=review_lines, size=DIMENSION_EMBEDDINGS, window=5, workers=12, min_count=10)
history = model.fit(X_train_pad, y_train, batch_size=128, epochs=20, validation_data=(X_test_pad, y_test), verbose=1)

Epoch 1/20
35000/35000 [==============================] - 12s 340us/step - loss: 0.4560 - accuracy: 0.7825 - val_loss: 0.4157 - val_accuracy: 0.806
Epoch 2/20
35000/35000 [==============================] - 11s 304us/step - loss: 0.2631 - accuracy: 0.8859 - val_loss: 0.4545 - val_accuracy: 0.798

model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Clear improvement with the 27,000 most frequent words
4th try: take more of the vocabulary into account
IN GOOGLE COLAB
33
https://colab.research.google.com
34
35
36
37
In conclusion
— "Naive" Bayes approach: 0.85
- training time: a few seconds

— Neural approach "embeddings + dense layers":
- badly configured: 0.50 (i.e. the equivalent of random guessing…)
- after a few adjustments and trials: 0.80
- training time:
— with CPU only (12 cores): about 10 s / epoch, i.e. 3 min
— with GPU (Google Colab): about 8 s / epoch

— Neural approach "embeddings + recurrent networks":
- best score: 0.88 (a 3% gain) — 0.9
- training time:
— with CPU only: about 3,000 s / epoch, i.e. > 24 h
— with TPU (Google Colab): about 200 s / epoch
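
The recurrent variant itself is not shown on the slides; a hedged sketch of what an "embeddings + recurrent network" classifier could look like in the same Keras style (the layer size is a guess):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(64))                                # recurrent layer instead of Flatten + Dense
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])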


38
39
40
https://d2l.ai/index.html
41
https://towardsdatascience.com
https://www.kdnuggets.com
TEXT CATEGORIZATION AND
UNSUPERVISED CLASSIFICATION
42
43
.CSV file with 27 columns (ISTEX metadata)
44
Identify the useful (and useless) columns
45
Read the .csv file in Python with the Pandas module
46
Explore the data
47
Formatting the .csv for Weka
— Weka normally uses the .ARFF format but can import .CSV files
— It is easier to comply with Weka's CSV conventions before importing…
48
We extract the Resume and Categories columns:

resumesClasses = data[['Resume','Categories']][data['Resume'].notnull()]
resumesClasses = resumesClasses[resumesClasses['Categories'].notnull()]

We keep only a single category per row:

for index, row in resumesClasses.iterrows():
    cat = str(row['Categories'])
    catRetenue = re.search(" 2 - ([a-z]+)", cat)
    if catRetenue:
        row['Categories'] = catRetenue[1]
    else:
        catRetenue = re.search("1 - ([a-z]+)", cat)
        if catRetenue:
            row['Categories'] = catRetenue[1]

We save as a "Weka" .csv:

# the quoting constant is truncated on the slide; csv.QUOTE_NONE is the one consistent
# with the quotechar/doublequote/escapechar settings; the slide shows escapechar='',
# but pandas requires a single character (e.g. '\\')
resumesClasses.to_csv(fichierSortie, sep=',', na_rep='?', index=False, header=True,
                      quoting=csv.QUOTE_NONE, quotechar='"', doublequote=False, escapechar='\\')
49
that is not enough… the types are not specified
50
Opening the .csv in the Weka "Explorer"
Converting the Resume attribute to "string"
It remains to vectorize the abstracts with a suitable filter
53
1
2
3
54
We apply StringToWordVector: there are as many attributes as there are words
55
Using a Bayesian classifier to learn to predict the categories
Be careful to select the right attribute to predict
56
Training data  →  Bayesian classifier (Bayes' rule)

The probability of each candidate class (as many scores as there are classes):

P(class | X) = P(X | class) P(class) / P(X)

with:
— X = (x_1, x_2, ..., x_n): the x are the features of the instance to classify
— P(class): the prior knowledge
— P(X): not needed when comparing the scores of the classes

P(X | class) = Π_i P(x_i | class) = P(x_1 | class) × P(x_2 | class) × ... × P(x_n | class)

where P(x_1 | class) is the frequency with which x_1 is observed in the class among the
examples (training data), and similarly P(x_n | class) for x_n.

}  the learned model

….   ….
Max ? (the predicted class is the one with the highest score)
57
58
to be compared with 48% using a 2/3 — 1/3 (test) split
59
And with a decision tree?
60
61
References and links for WEKA
62
Data Mining: Practical Machine Learning Tools and Techniques,
by Ian H. Witten, Eibe Frank, et al.
Unsupervised classification (with Python)
— Goal: group the documents according to their similarities and visualize the resulting clusters
— Means:
— partitioning algorithms such as k-means
— or self-organizing maps
— visualization with PCA
— The initial space has a very high dimension (the size of the vocabulary):
— grouping similar words = projecting the documents onto a reduced space
63
64
1,275 abstracts
100 dimensions
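
One plausible way to obtain such a 1,275 × 100 document matrix, given that the following slides feed a variable named `doc2vec` to the clustering code — a hedged gensim Doc2Vec sketch where `abstracts` (a list of tokenized abstracts) is an assumption:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, [i]) for i, words in enumerate(abstracts)]
d2v = Doc2Vec(docs, vector_size=100, epochs=20)    # keyword names vary across gensim versions
doc2vec = np.array([d2v.docvecs[i] for i in range(len(docs))])   # 1275 x 100 matrix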
65
k-means, then PCA to visualize the clusters
3 clusters
6 clusters
15 clusters
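
A minimal sketch of this k-means + PCA pipeline (assuming scikit-learn and the `doc2vec` matrix of the previous slide; the cluster count and plot settings are illustrative):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

labels = KMeans(n_clusters=6, random_state=0).fit_predict(doc2vec)   # try 3, 6, 15 clusters
coords = PCA(n_components=2).fit_transform(doc2vec)                  # project to 2D for display

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.show()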
66
4 x 4    10 x 10
30 x 30
Self-organizing maps (SOM)
som = somoclu.Somoclu(4, 4, maptype="toroid")
som.train(doc2vec)

More Related Content

What's hot

Cours de Gestion de la « E- Réputation»
Cours de Gestion de la « E- Réputation» Cours de Gestion de la « E- Réputation»
Cours de Gestion de la « E- Réputation» Babacar LO
 
Rapport de stage PFE - Mémoire master: Développement d'une application Android
Rapport de stage PFE - Mémoire master: Développement d'une application AndroidRapport de stage PFE - Mémoire master: Développement d'une application Android
Rapport de stage PFE - Mémoire master: Développement d'une application AndroidBadrElattaoui
 
Mise à niveau d’un système de gestion de clientèle (CRM)
Mise à niveau d’un système de gestion de clientèle (CRM)Mise à niveau d’un système de gestion de clientèle (CRM)
Mise à niveau d’un système de gestion de clientèle (CRM)Nawres Farhat
 
conception et realisation PFE
conception et realisation PFEconception et realisation PFE
conception et realisation PFEAlàa Assèli
 
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...Yasmine Lachheb
 
Etat de l’art approche et outils BI
Etat de l’art approche et outils BIEtat de l’art approche et outils BI
Etat de l’art approche et outils BISaid Sadik
 
Banque et transformation digitale
Banque et transformation digitaleBanque et transformation digitale
Banque et transformation digitaleJerome Caudrelier
 
présentation soutenance PFE.ppt
présentation soutenance PFE.pptprésentation soutenance PFE.ppt
présentation soutenance PFE.pptMohamed Ben Bouzid
 
Presentation usage reseaux sociaux.ppt
Presentation usage reseaux sociaux.pptPresentation usage reseaux sociaux.ppt
Presentation usage reseaux sociaux.pptElusNumeriques
 
Piloter sa campagne digital marketing de A à Z
Piloter sa campagne digital marketing de A à ZPiloter sa campagne digital marketing de A à Z
Piloter sa campagne digital marketing de A à ZAmar LAKEL, PhD
 
Club E-réputation : questions d'entreprise sur l'e-réputation
Club E-réputation : questions d'entreprise sur l'e-réputationClub E-réputation : questions d'entreprise sur l'e-réputation
Club E-réputation : questions d'entreprise sur l'e-réputationSindup
 
Cahier des charges site internet
Cahier des charges site internetCahier des charges site internet
Cahier des charges site internetEPC Familia
 
Soutenance mémoire de fin d'études
Soutenance mémoire de fin d'étudesSoutenance mémoire de fin d'études
Soutenance mémoire de fin d'étudesFabrice HAUHOUOT
 
La spécification des besoins
La spécification des besoinsLa spécification des besoins
La spécification des besoinsIsmahen Traya
 
Cours community management
Cours community managementCours community management
Cours community managementKerim Bouzouita
 
Seo (Search Engine Optimization) Référencement naturel
Seo (Search Engine Optimization) Référencement naturel Seo (Search Engine Optimization) Référencement naturel
Seo (Search Engine Optimization) Référencement naturel Soukaina EL HAYOUNI
 
La veille au sein d’ooredoo
La veille au sein d’ooredooLa veille au sein d’ooredoo
La veille au sein d’ooredooSoulef riahi
 

What's hot (20)

Cours de Gestion de la « E- Réputation»
Cours de Gestion de la « E- Réputation» Cours de Gestion de la « E- Réputation»
Cours de Gestion de la « E- Réputation»
 
Rapport de stage PFE - Mémoire master: Développement d'une application Android
Rapport de stage PFE - Mémoire master: Développement d'une application AndroidRapport de stage PFE - Mémoire master: Développement d'une application Android
Rapport de stage PFE - Mémoire master: Développement d'une application Android
 
Mise à niveau d’un système de gestion de clientèle (CRM)
Mise à niveau d’un système de gestion de clientèle (CRM)Mise à niveau d’un système de gestion de clientèle (CRM)
Mise à niveau d’un système de gestion de clientèle (CRM)
 
conception et realisation PFE
conception et realisation PFEconception et realisation PFE
conception et realisation PFE
 
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...
Rapport PFE BIAT Conception et mise en place d’une plate-forme de gestion des...
 
Etat de l’art approche et outils BI
Etat de l’art approche et outils BIEtat de l’art approche et outils BI
Etat de l’art approche et outils BI
 
Banque et transformation digitale
Banque et transformation digitaleBanque et transformation digitale
Banque et transformation digitale
 
présentation soutenance PFE.ppt
présentation soutenance PFE.pptprésentation soutenance PFE.ppt
présentation soutenance PFE.ppt
 
Presentation usage reseaux sociaux.ppt
Presentation usage reseaux sociaux.pptPresentation usage reseaux sociaux.ppt
Presentation usage reseaux sociaux.ppt
 
Piloter sa campagne digital marketing de A à Z
Piloter sa campagne digital marketing de A à ZPiloter sa campagne digital marketing de A à Z
Piloter sa campagne digital marketing de A à Z
 
Club E-réputation : questions d'entreprise sur l'e-réputation
Club E-réputation : questions d'entreprise sur l'e-réputationClub E-réputation : questions d'entreprise sur l'e-réputation
Club E-réputation : questions d'entreprise sur l'e-réputation
 
Opinion Mining
Opinion MiningOpinion Mining
Opinion Mining
 
Cahier des charges site internet
Cahier des charges site internetCahier des charges site internet
Cahier des charges site internet
 
Soutenance mémoire de fin d'études
Soutenance mémoire de fin d'étudesSoutenance mémoire de fin d'études
Soutenance mémoire de fin d'études
 
La spécification des besoins
La spécification des besoinsLa spécification des besoins
La spécification des besoins
 
Deep learning
Deep learningDeep learning
Deep learning
 
Cours community management
Cours community managementCours community management
Cours community management
 
Seo (Search Engine Optimization) Référencement naturel
Seo (Search Engine Optimization) Référencement naturel Seo (Search Engine Optimization) Référencement naturel
Seo (Search Engine Optimization) Référencement naturel
 
Présentation PFE
Présentation PFEPrésentation PFE
Présentation PFE
 
La veille au sein d’ooredoo
La veille au sein d’ooredooLa veille au sein d’ooredoo
La veille au sein d’ooredoo
 

Similar to Deep Learning Drug Discovery Classification

A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEijnlc
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEkevig
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiersbutest
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .pptbutest
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2Viral Gupta
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similaritypathsproject
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsVincenzo Lomonaco
 
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector SpaceBeyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector SpaceVijay Prakash Dwivedi
 
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningIRJET Journal
 
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...Jinho Choi
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 

Similar to Deep Learning Drug Discovery Classification (20)

A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
DL.pptx
DL.pptxDL.pptx
DL.pptx
 
228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiers
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .ppt
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2
 
1808.10245v1 (1).pdf
1808.10245v1 (1).pdf1808.10245v1 (1).pdf
1808.10245v1 (1).pdf
 
UWB semeval2016-task5
UWB semeval2016-task5UWB semeval2016-task5
UWB semeval2016-task5
 
Test for AI model
Test for AI modelTest for AI model
Test for AI model
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector SpaceBeyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
 
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
 
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...
Robust Coreference Resolution and Entity Linking on Dialogues: Character Iden...
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Overfitting and-tbl
Overfitting and-tblOverfitting and-tbl
Overfitting and-tbl
 

More from Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)

More from Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I) (9)

Introduction à la fouille de textes et positionnement de l'offre logicielle
Introduction à la fouille de textes et positionnement de l'offre logicielleIntroduction à la fouille de textes et positionnement de l'offre logicielle
Introduction à la fouille de textes et positionnement de l'offre logicielle
 
A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
 
Introduction générale sur les enjeux du Text and Data Mining TDM
Introduction générale sur les enjeux du Text and Data Mining TDMIntroduction générale sur les enjeux du Text and Data Mining TDM
Introduction générale sur les enjeux du Text and Data Mining TDM
 
Recommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenuRecommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenu
 
Scholarly Book Recommendation
Scholarly Book RecommendationScholarly Book Recommendation
Scholarly Book Recommendation
 
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
 
Huma-Num une Infrastructure pour les SHS
Huma-Num une Infrastructure pour les SHSHuma-Num une Infrastructure pour les SHS
Huma-Num une Infrastructure pour les SHS
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
OpenEdition Lab projects in Text Mining
OpenEdition Lab projects in Text MiningOpenEdition Lab projects in Text Mining
OpenEdition Lab projects in Text Mining
 

Recently uploaded

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfWildaNurAmalia2
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantadityabhardwaj282
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2John Carlo Rollon
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxFarihaAbdulRasheed
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)DHURKADEVIBASKAR
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are important
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 

Deep Learning Drug Discovery Classification

  • 1. ANALYSE DE SENTIMENTS ET CLASSIFICATION PAR APPROCHE NEURONALE Patrice Bellot 
 Aix-Marseille Université - CNRS (LIS-INS2I) patrice.bellot@univ-amu.fr Atelier ANF CNRS INRAE 
 novembre 2021
  • 2. Apprentissage automatique, modèle, évaluation 2 Ensemble d’exemples Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Ensemble d’entraînement Ensemble de validation Modèle Ensemble de test Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Entraînement cherchant à minimiser l’erreur sur ces données Evaluation puis réglages de la configuration (hyper-paramètres, architecture…) M S 1 2 3 4 Donnée (X) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) Donnée (X) Réponse attendue (Y) 5 Donnée (X) Donnée (X) 2 et 3 sont répétées en un nombre d’époques (epochs) Evaluation Exploitation
  • 3. ANALYSE DE SENTIMENT (POLARITE) SUR DES CRITIQUES DE FILMS 3
  • 4. 4 http://ai.stanford.edu/~amaas/data/sentiment / Large Movie Review Dataset Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150). Corpus d’entraînement (train) : 12 500 positives, 12 500 négatives uses disjoint sets of movies for training and testing. These steps minimize the ability of a learner to rely on idiosyncratic word–class associations, thereby focusing attention on genuine sentiment features. 4.3.2 IMDB Review Dataset We constructed a collection of 50,000 reviews from IMDB, allowing no more than 30 reviews per movie. The constructed dataset contains an even number of positive and negative reviews, so randomly guessing yields 50% accuracy. Following previous work on polarity classification, we consider only highly po- larized reviews. A negative review has a score 4 out of 10, and a positive review has a score 7 out of 10. Neutral reviews are not included in the dataset. In the interest of providing a benchmark for future work in this area, we release this dataset to the public.2 We evenly divided the dataset into training and test sets. The training set is the same 25,000 la- beled reviews used to induce word vectors with our model. We evaluate classifier performance after cross-validating classifier parameters on the training set, again using a linear SVM in all cases. Table 2 shows classification performance on our subset of IMDB reviews. Our model showed superior per- formance to other approaches, and performed best when concatenated with bag of words representa- tion. Again the variant of our model which utilized extra unlabeled data during training performed best. Differences in accuracy are small, but, because our test set contains 25,000 examples, the variance of the performance estimate is quite low. For ex- ample, an accuracy increase of 0.1% corresponds to correctly classifying an additional 25 reviews. 4.4 Subjectivity Detection As a second evaluation task, we performed sentence- level subjectivity classification. In this task, a clas- sifier is trained to decide whether a given sentence is subjective, expressing the writer’s opinions, or ob- jective, expressing purely facts. We used the dataset of Pang and Lee (2004), which contains subjective sentences from movie review summaries and objec- tive sentences from movie plot summaries. This task 2 is substantially different from the review classi tion task because it uses sentences as opposed to tire documents and the target concept is subject instead of opinion polarity. We randomly split 10,000 examples into 10 folds and report 10- cross validation accuracy using the SVM trai protocol of Pang and Lee (2004). Table 2 shows classification accuracies from sentence subjectivity experiment. Our model a provided superior features when compared aga other VSMs. Improvement over the bag-of-w baseline is obtained by concatenating the two fea vectors. 5 Discussion We presented a vector space model that learns w representations captuing semantic and sentimen formation. The model’s probabilistic founda gives a theoretically justified technique for w vector induction as an alternative to the overwh ing number of matrix factorization-based techniq commonly used. 
Our model is parametrized log-bilinear model following recent success in ing similar techniques for language models (Be et al., 2003; Collobert and Weston, 2008; Mnih Hinton, 2007), and it is related to probabilistic la topic models (Blei et al., 2003; Steyvers and G fiths, 2006). We parametrize the topical compo of our model in a manner that aims to capture w representations instead of latent topics. In our periments, our method performed better than L which models latent topics directly. We extended the unsupervised model to in porate sentiment information and showed how extended model can leverage the abundance sentiment-labeled texts available online to y word representations that capture both sentim and semantic relations. We demonstrated the ity of such representations on two tasks of s ment classification, using existing datasets as as a larger one that we release for future resea These tasks involve relatively simple sentimen formation, but the model is highly flexible in regard; it can be used to characterize a wide va
  • 5. Le plus simple : utiliser des modules existants 5 from textblob import TextBlo b print(TextBlob("I hate that movie").sentiment.polarity texte = "This is a gem. As a Film Four production - the anticipated quality was indeed delivered. Shot with that reminded me some Errol Morris films, well arranged and simply gripping. It's long yet horrifying to the excruciating. We know something bad happened (one can guess by the lack of participation of a person in the but we are compelled to see it, a bit like a car accident in slow motion. The story spans most conceivable a unlike some documentaries did not try and refrain from showing the grimmer sides of the stories, as also dea guilt of the people Don left behind him, wondering why they didn't stop him in time. It took me a few hours the melancholy that gripped me after seeing this very-well made documentary. print(TextBlob(texte).sentiment.polarity ) - 0,8 - 0,054 Mais…. comment ? quelle performance en moyenne ? comment l’améliorer ?
  • 7. 7 les tokens les plus fréquents : 100 749 mots n’apparaissent qu’une fois : ‘the’ apparait 573 397 fois {'the': 573397, ',': 544031, '.': 467886, 'and': 309118, 'a': 309103, 'of': 285087, 'to': 263658, 'is': 21474 0 les tokens les plus fréquents après suppressions des mots outils :
  • 11. Avec un classifieur bayésien naïf (NB) 11 Division des exemples : 50 % entraînement (train) 50 % test BRIEF ARTICLE THE AUTHOR P(w1, w2, ..., wT 1, wT ) = T Y t=1 P(wt|wt 1, wt 2, ..., w1) Similarité(d1, d2) ⇡ ! d1. ! d2 Similarité(d1, d2) ⇡ cos( ! d1, ! d2) P(Classe|(w1, w2, ..., wT 1, wT )) n X i=1 Pi. log2 Pi X = (x1, x2, ..., xn) P(classe|X) = P(X|classe)P(classe) P(X) P(X|classe) = Y i P(x1|classe) ⇥ ... ⇥ P(xi|classe) ⇥ ... ⇥ P(xn|classe) accuracy(Y, b Y ) = 1 nexemples X i 1(b yi = yi) Evaluation (classes 0 et 1) 
 données de test classes prédites classes réelles
  • 13. Architectures neuronales 13 Riti Dass https://medium.com/predict/the-complete-list-to-make-you-an-ai-pro-be83448720b8 Deep Learning for Ligand-Based Virtual Screening in Drug Discovery October 2018 DOI: 10.1109/PAIS.2018.8598488 Conference: 2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS) Meriem BahiMohamed Batouche ein receptor is known. ed is high-throughput suitable for very large igand-based approach y observed for small ve similar biological perform poorly when nds available. play a significant role order to address the ], [7]. Unfortunately, wing in an exponential ance predicting in ma- g has rapidly emerged in the pharmaceutical munity, as a promising The first deep learning roposed by Merck in Quantitative Structure- er few years, Dahl et ased on the multi-task chemical and genomic s molecular structure. ated the performance target prediction task. ively multitask neural new drug from finger- highlighted by recent drugs and the identi- ies like the prediction ular fingerprints. [12]. cient technique based k named AtomNet to es for drug discovery. nce of Deep Learning compounds. Pereira approach based on improve molecular- al., [15] used a deep drugs into therapeutic iptional data. ug design and discov- solving the problems past, modern DL has cracked the code for training stability and generalization and scales on big data [17]. Deep learning algorithms are really deep architectures of consecutive layers based learning multiple levels of repre- sentation and abstraction. The potency of deep learning for drug discovery shows the extraordinary promise. Here, we provide a new compound classification technique based on Deep Learning using Spark-H2O platform-python for Ligand- based Virtual Screening. Within this scope, we use a Deep Neural Networks algorithm to reduce considerably the search space and classify rapidly the ligands as active or inactive based on their heterogeneous molecular descriptors. Fig. 1. Deep Neural Network architecture As shown in Fig. 1, our deep neural network (DNN) is a sequence of fully-connected layers that take the features of each compound as input and classify these ligands as drug-like or nondrug-like. Once input features are given to the DNN, the output values are computed sequentially along the layers of the network. At each layer, the weighted sum is calculated by multiplying the input vector including the output values of each unit in the previous layer by the weight vector for each unit in the current hidden layer. The basic DNN architecture consists of an input layer L0, an output layer Lout and H hidden layers Lh (h 2 {1, 2, ..., H}) between input and output layers. Each hidden layer Lh is a set of several units which can be arranged as a vector ah R| Lh|, where |Lh| stands for the number of units in Lh. Then, each hidden layer Lh can
  • 14. Dimensionality reduction (projection): word embeddings. Tommaso Teofili, https://jaxenter.com/deep-learning-search-word2vec-147782.html ; Adrian Colyer, https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/ ; Efficient Estimation of Word Representations in Vector Space – Mikolov et al. 2013 ; Distributed Representations of Words and Phrases and their Compositionality – Mikolov et al. 2013
  • 15. Representation learning: Word2Vec. For each word, a vector in 200 dimensions.
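A minimal sketch of learning such vectors with gensim, assuming review_lines is the list of tokenized reviews (note: gensim ≥ 4.0 uses vector_size where the older code on the following slides uses size):

    import gensim

    model = gensim.models.Word2Vec(
        sentences=review_lines,  # list of token lists
        vector_size=200,         # 200-dimensional vectors (size= in gensim < 4.0)
        window=5,
        min_count=100,           # ignore rare words
        workers=12)

    print(model.wv['excellent'].shape)              # (200,)
    print(model.wv.most_similar('excellent', topn=5))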
  • 16. The Word2Vec model learned on the reviews. #### Saving the learned model (here in a reduced format: saves space but does not allow training to continue on new texts; to save a full model, call model.save instead)
nomEmbeddings = 'imdb_embeddings_word2vec_200_5_100'
model.wv.save_word2vec_format(nomEmbeddings, binary=False)
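The counterpart for loading, shown as a sketch (the reduced word2vec format reloads as KeyedVectors, not as a trainable model):

    from gensim.models import KeyedVectors

    word_vectors = KeyedVectors.load_word2vec_format(
        'imdb_embeddings_word2vec_200_5_100', binary=False)
    print(word_vectors['movie'][:5])  # first components of the 200-d vector for 'movie'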
  • 17. Visualising the embeddings (2 dimensions). https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/ ; https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b ; t-SNE: neighbourhoods preserved (distances); PCA: overall shape preserved (variance). https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
  • 18. Representation of the full vocabulary (t-SNE approach: t-distributed Stochastic Neighbor Embedding).
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
Approx. 30 s per iteration (on a single core).
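A minimal end-to-end sketch of this visualisation, assuming the word_vectors loaded above (gensim ≥ 4.0 API); plotting only a sample of the vocabulary keeps t-SNE tractable:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    words = list(word_vectors.key_to_index)[:1000]  # a sample of the vocabulary
    vectors = np.array([word_vectors[w] for w in words])

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca',
                      n_iter=2500, random_state=23)
    coords = tsne_model.fit_transform(vectors)

    plt.figure(figsize=(12, 12))
    plt.scatter(coords[:, 0], coords[:, 1], s=3)
    for w, (x, y) in zip(words[:100], coords[:100]):  # label only the first 100 points
        plt.annotate(w, (x, y), fontsize=7)
    plt.show()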
  • 19. 19
  • 25. t-SNE of the words with more than 5,000 occurrences, with Word2Vec limited to words having at least 100 occurrences; 2,500 iterations.
tsne_model = TSNE(perplexity=40, n_components=2, init='random', n_iter=2500, random_state=23, verbose=True, n_jobs=12)
  • 26. Architecture tested: the embeddings of a review feed an Embedding layer, followed by Flatten, two Dense layers, and the output (predicted class: 0 or 1). The Embedding layer is a lookup table from each word of a review (word 1, word 2, …, up to the maximum length input_length) to its vector; its parameters are the vocabulary size (input_dim) and the number of dimensions of the vectors (output_dim). The reviews, encoded as index sequences of varying length, e.g. [[6, 16], [42, 24], [2, 17], [42, 24], [18], [17], [22, 17], [27, 42], [22, 24], [49, 46, 16, 34]], are padded with zeros to a fixed length:
[[ 6 16  0  0]
 [42 24  0  0]
 [ 2 17  0  0]
 [42 24  0  0]
 [18  0  0  0]
 [17  0  0  0]
 [22 17  0  0]
 [27 42  0  0]
 [22 24  0  0]
 [49 46 16 34]]
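A sketch of this index-encoding and padding step with Keras (train_texts and the maxlen value are illustrative assumptions):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(train_texts)                  # train_texts: list of review strings
    sequences = tokenizer.texts_to_sequences(train_texts)
    X_train_pad = pad_sequences(sequences, maxlen=4,
                                padding='post')          # zeros appended, as on the slide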
  • 27. First attempt: keep only the words that appear at least 100 times. Network: a single hidden layer.
DIMENSION_EMBEDDINGS = 200
modelEmbeddings = gensim.models.Word2Vec(sentences=review_lines, size=DIMENSION_EMBEDDINGS, window=5, workers=12, min_count=100)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
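A runnable sketch of how this network can be assembled, building embedding_layer from the gensim vectors (the way the weight matrix is filled here, and the name MAX_LENGTH, are assumptions consistent with the slides, not the author's exact code):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Flatten, Dense

    vocab_size = len(tokenizer.word_index) + 1   # +1 for the padding index 0
    embedding_matrix = np.zeros((vocab_size, DIMENSION_EMBEDDINGS))
    for word, i in tokenizer.word_index.items():
        if word in modelEmbeddings.wv:
            embedding_matrix[i] = modelEmbeddings.wv[word]

    embedding_layer = Embedding(input_dim=vocab_size,
                                output_dim=DIMENSION_EMBEDDINGS,
                                weights=[embedding_matrix],
                                input_length=MAX_LENGTH,  # hypothetical padded length
                                trainable=False)

    model = Sequential()
    model.add(embedding_layer)
    model.add(Flatten())
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])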
  • 28. 28
  • 30. 30
  • 31. Third attempt: add a third hidden layer…
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Epoch 16/20
35000/35000 [==============================] - 12s 330us/step - loss: 0.0199 - accuracy: 0.9935 - val_loss: 3.4786 - val_accuracy: 0.493
The network memorises the training set (accuracy 0.9935) while validation accuracy collapses to 0.493: it overfits. The 6,000 most frequent words are not enough.
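One common mitigation for this kind of overfitting, shown as a sketch (early stopping is not mentioned in the deck; it is an illustrative addition):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss', patience=3,
                               restore_best_weights=True)
    history = model.fit(X_train_pad, y_train,
                        batch_size=128, epochs=20,
                        validation_data=(X_test_pad, y_test),
                        callbacks=[early_stop], verbose=1)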
  • 32. Fourth attempt: take more of the vocabulary into account, with the 27,000 most frequent words (min_count lowered from 100 to 10; note that the slide reuses the name model for both the gensim model and the Keras model).
DIMENSION_EMBEDDINGS = 200
model = gensim.models.Word2Vec(sentences=review_lines, size=DIMENSION_EMBEDDINGS, window=5, workers=12, min_count=10)
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
history = model.fit(X_train_pad, y_train, batch_size=128, epochs=20, validation_data=(X_test_pad, y_test), verbose=1)
Epoch 1/20
35000/35000 [==============================] - 12s 340us/step - loss: 0.4560 - accuracy: 0.7825 - val_loss: 0.4157 - val_accuracy: 0.806
Epoch 2/20
35000/35000 [==============================] - 11s 304us/step - loss: 0.2631 - accuracy: 0.8859 - val_loss: 0.4545 - val_accuracy: 0.798
A clear improvement.
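The next slides are image-only; one typical way to inspect such runs is to plot the history object returned by fit (an illustrative sketch, not the author's code):

    import matplotlib.pyplot as plt

    plt.plot(history.history['accuracy'], label='train accuracy')
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()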
  • 34. 34
  • 35. 35
  • 36. 36
  • 37. 37
  • 38. In conclusion — "Naive" Bayesian approach: 0.85; training time: a few seconds.
— Neural approach "embeddings + dense layers": badly configured: 0.50 (i.e. equivalent to random guessing…); after some tuning and trials: 0.80; training time — with CPU only (12 cores): about 10 s per epoch, i.e. 3 min — with a GPU (Google Colab): about 8 s per epoch.
— Neural approach "embeddings + recurrent networks": best score: 0.88 (a 3-point gain), up to 0.90; training time: — with CPU only: about 3,000 s per epoch, i.e. more than 24 h — with a TPU (Google Colab): about 200 s per epoch.
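The recurrent architecture itself is not detailed in the deck; a minimal sketch of what an "embeddings + recurrent network" model could look like (the LSTM layer size and dropout rate are assumptions):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout

    rnn = Sequential()
    rnn.add(embedding_layer)        # same pre-trained embedding layer as before
    rnn.add(LSTM(64))               # recurrent layer reading the review word by word
    rnn.add(Dropout(0.5))
    rnn.add(Dense(1, activation='sigmoid'))
    rnn.compile(optimizer='adam', loss='binary_crossentropy',
                metrics=['accuracy'])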
  • 39. 39
  • 42. TEXT CATEGORISATION AND UNSUPERVISED CLASSIFICATION
  • 47. 47
  • 48. Formatting the .csv for Weka — Weka normally uses the .ARFF format but can import .CSV files — It is easier to match Weka's CSV conventions before importing…
Extract the Resume and Categories columns, keeping only non-null rows:
resumesClasses = data[['Resume','Categories']][data['Resume'].notnull()]
resumesClasses = resumesClasses[resumesClasses['Categories'].notnull()]
Keep only one category per row (writing back through .loc, since assigning to the row yielded by iterrows does not reliably modify the DataFrame):
for index, row in resumesClasses.iterrows():
    cat = str(row['Categories'])
    catRetenue = re.search(" 2 - ([a-z]+)", cat)
    if not catRetenue:
        catRetenue = re.search("1 - ([a-z]+)", cat)
    if catRetenue:
        resumesClasses.loc[index, 'Categories'] = catRetenue[1]
Save as a "Weka" .csv (the quoting constant is truncated on the slide; csv.QUOTE_NONNUMERIC is a plausible reading, and the slide's empty escapechar is replaced by '\\' so the call runs):
resumesClasses.to_csv(fichierSortie, sep=',', na_rep='?', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC, quotechar='"', doublequote=False, escapechar='\\')
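The same extraction can be written more idiomatically with .apply; a sketch (regex patterns taken from the slide, the helper name extract_category is hypothetical):

    import re

    def extract_category(cat):
        # keep the level-2 category if present, otherwise the level-1 one
        m = (re.search(r" 2 - ([a-z]+)", str(cat))
             or re.search(r"1 - ([a-z]+)", str(cat)))
        return m[1] if m else cat

    resumesClasses['Categories'] = resumesClasses['Categories'].apply(extract_category)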
  • 56. Bayesian classifier (Bayes' rule). Training data: examples whose descriptors (features) $x_i$ describe the individual to classify, $X = (x_1, x_2, ..., x_n)$. The probability of each candidate class:

$P(\text{class}|X) = \dfrac{P(X|\text{class})\,P(\text{class})}{P(X)}$

with:

$P(X|\text{class}) = \prod_i P(x_i|\text{class}) = P(x_1|\text{class}) \times P(x_2|\text{class}) \times ... \times P(x_n|\text{class})$

where $P(\text{class})$ is the prior knowledge; $P(X)$ is not needed to compare the classes (it is the same for all of them); and each $P(x_i|\text{class})$ is the frequency with which $x_i$ is observed in the class among the examples (training data); these frequencies are the learned model. This yields as many scores as there are classes; the predicted class is the one with the maximum score.
  • 57. 57
  • 60. 60
  • 61. 61
  • 62. References and links for WEKA: Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank, et al.
  • 63. Unsupervised classification (with Python) — Objective: group the documents according to their similarities and visualise the resulting clusters — Means: — partitioning algorithms such as k-means
or self-organising maps — visualisation by PCA (see the sketch below) — The initial space has a very high dimension (the size of the vocabulary): — grouping similar words = projecting the documents onto a reduced space
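A minimal sketch of this pipeline with scikit-learn (doc2vec here stands for a matrix of document vectors, as on the last slide; building it with TF-IDF, and the names documents and n_clusters=5, are assumptions):

    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    doc2vec = TfidfVectorizer(max_features=2000).fit_transform(documents).toarray()

    kmeans = KMeans(n_clusters=5, random_state=0).fit(doc2vec)  # partitioning (k-means)
    coords = PCA(n_components=2).fit_transform(doc2vec)         # projection for display

    plt.scatter(coords[:, 0], coords[:, 1], c=kmeans.labels_, s=5)
    plt.show()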
  • 66. Self-organising maps (SOM): maps of 4 × 4, 10 × 10 and 30 × 30 cells.
som = somoclu.Somoclu(4, 4, maptype="toroid")
som.train(doc2vec)
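A slightly fuller sketch of this call (somoclu expects a float32 matrix; view_umatrix is one way to display the trained map):

    import numpy as np
    import somoclu

    data = np.float32(doc2vec)                     # document vectors as float32
    som = somoclu.Somoclu(4, 4, maptype="toroid")  # 4 x 4 toroidal map
    som.train(data)
    som.view_umatrix(bestmatches=True)             # U-matrix with best-matching units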