This document contains a summary of a presentation on natural language processing of text given at Devoxx in April 2019. It discusses using natural language processing for contract management, data extraction, and review. The document also mentions using a machine learning pipeline to analyze documents and extract titles.
Introduction to NLP with some practical exercises (tokenization, keyword extraction, topic modelling) using Python libraries like NLTK, Gensim and TextBlob, plus a general overview of the field.
This document discusses transfer learning using Transformers (BERT) in Thai. It begins by outlining the topics to be covered, including an overview of deep learning for text processing, the BERT model architecture, pre-training, fine-tuning, state-of-the-art results, and alternatives to BERT. It then explains why transfer learning with Transformers is interesting due to its strong performance on tasks like question answering and intent classification in Thai. The document dives into details of BERT's pre-training including masking words and predicting relationships between sentences. In the end, BERT has learned strong language representations that can then be fine-tuned for downstream tasks.
KiwiPyCon 2014 talk - Understanding human language with Python - Alyona Medelyan
Introduction into Natural Language Processing:
- Fiction vs Reality
- Complexities of NLP
- NLP with Python: NLTK, Gensim, TextBlob
(stopword removal, part-of-speech tagging, TF-IDF, text categorization, sentiment analysis)
- What's next
The document discusses a lecture on developing an AI chatbot using Python and TensorFlow, covering setting up a Docker environment, explaining example code in Jupyter notebooks, and introducing the two speakers and their backgrounds working on machine learning and chatbots.
DeepPavlov is an open-source framework for the development of production-ready chat-bots and complex conversational systems, as well as NLP and dialog systems research.
1. The document proposes segmenting DNA sequences into "words" to enable natural language processing techniques to be applied to DNA analysis.
2. It describes developing a DNA vocabulary through unsupervised methods applied to multiple genome sequences, with words defined as 12-15 base pairs.
3. Segmenting new sequences based on the vocabulary achieves stability of over 90% when the vocabulary is built from mixed genomes, and over 95% when built from a single genome.
The document provides an overview of the Natural Language Toolkit (NLTK). It discusses that NLTK is a Python library for natural language processing that includes corpora, tokenizers, stemmers, part-of-speech taggers, parsers, and other tools. The document outlines the modules in NLTK and their functionality, such as the nltk.corpus module for corpora, nltk.tokenize and nltk.stem for tokenizers and stemmers, and nltk.tag for part-of-speech tagging. It also provides instructions on installing NLTK and downloading its data.
RoFormer: Enhanced Transformer with Rotary Position Embedding - taeseon ryu
Hello, this is the deep learning paper reading group. Today's uploaded review video covers a paper published this year, titled RoFormer: Enhanced Transformer with Rotary Position Embedding.
The paper improves the Transformer by using Rotary Position Embedding. Position embeddings are one of the important components used to encode positional information for self-attention; Rotary Position Embedding replaces them with an encoding based on rotation matrices from linear algebra, which boosts the model's performance.
From the background of the paper to a detailed walkthrough of the equations, Jin Myung-hoon from the NLP team kindly provided a detailed review of the paper.
NLTK: Natural Language Processing made easy - outsider2
The Natural Language Toolkit (NLTK), an open-source library that simplifies the implementation of Natural Language Processing (NLP) in Python, is introduced. It is useful for getting started with NLP and also for research and teaching.
Shankar Ambady of Session M will give a tutorial on the Python NLTK (Natural Language Tool Kit). Shankar had previously presented a comprehensive overview of the NLTK last December at the Python meetup. The Python NLTK is a very powerful collection of libraries that can be applied to a variety of NLP applications such as sentiment analysis. His presentation from last December may be found here (click on Boston Python Meetup Materials) : http://www.shankarambady.com/
The document provides tips for developing Korean chatbots, including discussing chatbot goals, architectures, data collection, natural language processing tools, and machine learning algorithms. It recommends focusing chatbots for business on a small number of important intents, using a modular architecture for easier debugging, and training natural language tools on domain-specific data collected from sources like web scraping.
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION - cscpconf
This paper introduces an advanced, efficient approach for rule-based English to Bengali (E2B) machine translation (MT) that uses Penn Treebank part-of-speech (PoS) tags and an HMM (Hidden Markov Model) tagger. A fuzzy if-then-rule approach is used to select the lemma from the rule-based knowledge. The proposed E2B-MT has been tested with the F-score measure, and its accuracy is more than eighty percent.
This is a brief overview of Natural Language Processing using the Python module NLTK. The demonstration code can be found via the GitHub link given in the references slide.
This document discusses fine-tuning the BERT model with PyTorch and the Transformers library. It provides an overview of BERT, how it was trained, its special tokens, the Transformers library, preprocessing text for BERT, using the BertModel class, the approach to fine-tuning BERT for a task, creating a dataset and data loaders, and training and validating the model.
The document discusses parts-of-speech (POS) tagging. It defines POS tagging as labeling each word in a sentence with its appropriate part of speech. It provides an example tagged sentence and discusses the challenges of POS tagging, including ambiguity and open/closed word classes. It also discusses common tag sets and stochastic POS tagging using hidden Markov models.
BERT - Part 1 Learning Notes of Senthil Kumar - Senthil Kumar M
In this part 1 presentation, I have attempted to provide a '30,000 feet view' of BERT (Bidirectional Encoder Representations from Transformer) - a state of the art Language Model in NLP with high level technical explanations. I have attempted to collate useful information about BERT from various useful sources.
This document discusses natural language processing (NLP) from a developer's perspective. It provides an overview of common NLP tasks like spam detection, machine translation, question answering, and summarization. It then discusses some of the challenges in NLP like ambiguity and new forms of written language. The document goes on to explain probabilistic models and language models that are used to complete sentences and rearrange phrases based on probabilities. It also covers text processing techniques like tokenization, regular expressions, and more. Finally, it discusses spelling correction techniques using noisy channel models and confusion matrices.
BERT is a deeply bidirectional, unsupervised language representation model pre-trained using only plain text. It is the first model to use a bidirectional Transformer for pre-training. BERT learns representations from both left and right contexts within text, unlike previous models like ELMo which use independently trained left-to-right and right-to-left LSTMs. BERT was pre-trained on two large text corpora using masked language modeling and next sentence prediction tasks. It establishes new state-of-the-art results on a wide range of natural language understanding benchmarks.
Creating Chatbots Using TensorFlow | Chatbot Tutorial | Deep Learning Trainin... - Edureka!
** AI & Deep Learning with Tensorflow Training: https://www.edureka.co/ai-deep-learning-with-tensorflow **
This Edureka tutorial on "Chatbots using TensorFlow" gives you an idea of what chatbots are and how they came into existence. It provides a brief introduction to all the layers involved in creating a chatbot using TensorFlow and machine learning.
This document discusses a lecture on knowledge representation in digital humanities. It covers:
1. An introduction to the lecture, which teaches Python programming and develops programming skills for knowledge representation and modeling.
2. A discussion of the previous assignment to consolidate concepts from readings and discuss specific solutions.
3. An overview of Chapter 4 on the Python programming language, covering features of Python, programming in Python using variables, expressions, conditionals and iterations.
This document discusses NLP's "Imagenet Moment" with the emergence of transfer learning approaches like BERT, ELMo, and GPT. It explains that these models were pretrained on large datasets and can now be downloaded and fine-tuned for specific tasks, similar to how pretrained ImageNet models revolutionized computer vision. The document also provides an overview of BERT, including its bidirectional Transformer architecture, pretraining tasks, and performance on tasks like GLUE and SQuAD.
This document discusses different approaches for building chatbots, including retrieval-based and generative models. It describes recurrent neural networks like LSTMs and GRUs that are well-suited for natural language processing tasks. Word embedding techniques like Word2Vec are explained for representing words as vectors. Finally, sequence-to-sequence models using encoder-decoder architectures are presented as a promising approach for chatbots by using a context vector to generate responses.
This document provides an introduction to natural language processing (NLP) and the Natural Language Toolkit (NLTK) module for Python. It discusses how NLP aims to develop systems that can understand human language at a deep level, lists common NLP applications, and explains why NLP is difficult due to language ambiguity and complexity. It then describes how corpus-based statistical approaches are used in NLTK to tackle NLP problems by extracting features from text corpora and using statistical models. The document gives an overview of the main NLTK modules and interfaces for common NLP tasks like tagging, parsing, and classification. It provides an example of word tokenization and discusses tokens and types in NLTK.
Categorizing and POS tagging with NLTK Python - Janu Jahnavi
https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/
https://www.learntek.org/
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and management courses.
Chat bot making process using Python 3 & TensorFlow - Jeongkyu Shin
Since 2015, chatbots have attracted public attention as a new mobile user interface. They are widely used to reduce human-to-human interaction, from consultation to online shopping and negotiation, and their application coverage is still expanding. Chatbots are also the basis of conversational interfaces and, combined with voice recognition, of non-physical input interfaces.
Traditional chatbots were developed based on natural language processing (NLP) and Bayesian statistics for user intention recognition and template-based responses. However, since 2012, the accelerated advance of deep learning and of NLP techniques built on it has opened the possibility of creating chatbots with machine learning. Machine learning (ML)-based chatbot development has advantages; for instance, once the model is trained to an appropriate level, ML-based bots can generate (somewhat nonsensical but acceptable) responses even to random questions that have no connection with the context.
In this talk, I will introduce the garage chatbot creation process step by step, covering the problems encountered and their solutions when building a deep-learning-based chatbot with Python 3 and TensorFlow. I share the idea and implementation of a multi-modal machine learning model combining a context engine and a conversation engine, and also discuss how to implement Korean natural language processing, continuous conversation, and tone manipulation.
This document provides an outline for a tutorial on deep learning for natural language processing. It begins with an introduction to deep learning and its history, then discusses how neural methods have become prominent in natural language processing. The rest of the tutorial is outlined covering deep semantic models for text, recurrent neural networks for text generation, neural question answering models, and deep reinforcement learning for dialog systems.
NLTK - Natural Language Processing in Python - shanbady
For full details, including the address, and to RSVP see: http://www.meetup.com/bostonpython/calendar/15547287/ NLTK is the Natural Language Toolkit, an extensive Python library for processing natural language. Shankar Ambady will give us a tour of just a few of its extensive capabilities, including sentence parsing, synonym finding, spam detection, and more. Linguistic expertise is not required, though if you know the difference between a hyponym and a hypernym, you might be able to help the rest of us! Socializing at 6:30, Shankar's presentation at 7:00. See you at the NERD.
This document provides an overview of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing model. It discusses how BERT is pretrained using masked language modeling and next sentence prediction on large corpora. It then explains how BERT can be fine-tuned on downstream tasks to achieve state-of-the-art results in tasks like question answering, text classification, and more. It also notes some limitations of BERT like its vulnerability to adversarial examples and issues around interpreting its predictions.
Recurrent neural networks (RNNs) are well-suited for analyzing text data because they can model sequential and structural relationships in text. RNNs use gating mechanisms like LSTMs and GRUs to address the problem of exploding or vanishing gradients when training on long sequences. Modern RNNs trained with techniques like gradient clipping, improved initialization, and optimized training algorithms like Adam can learn meaningful representations from text even with millions of training examples. RNNs may outperform conventional bag-of-words models on large datasets but require significant computational resources. The author describes an RNN library called Passage and provides an example of sentiment analysis on movie reviews to demonstrate RNNs for text analysis.
This was presented to software developers with the goal of introducing them to the basic machine learning workflow, code snippets, possibilities, and the state of the art in NLP, and to give some clues on where to get started.
The document provides an overview of the state of natural language processing (NLP) and Amazon's NLP offering Amazon Comprehend. It discusses the evolution of NLP from rule-based systems to modern neural models like BERT and Transformer and the increasing complexity of NLP tasks. The document also describes Amazon Comprehend's capabilities in areas like sentiment analysis, named entity recognition, keyphrase extraction, and language detection.
The document discusses two paradigms for natural language processing: knowledge engineering and machine learning. It provides examples of how each approach handles tasks like parsing, translation, and question formation. While knowledge engineering relies on hand-coded rules and representations, machine learning trains statistical models on large datasets. The document also notes Microsoft's interests in using NLP for applications like search and summarization.
Deep Learning for Natural Language ProcessingParrotAI
The document discusses deep learning approaches for natural language processing (NLP). It introduces NLP and common applications. Word representations like one-hot and distributed representations are covered, with a focus on Word2Vec models. Recurrent neural networks (RNNs) are described as useful for sequential language data, including variants like bidirectional RNNs and applications such as neural machine translation and sentiment analysis.
The document provides an overview of machine learning for natural language processing (NLP) tasks. It discusses framing NLP problems as supervised learning tasks, preprocessing text, feature extraction using the FEX tool, and examples of NLP tasks like part-of-speech tagging and named entity recognition that can be solved using these techniques. It also describes the typical components of a machine learning system for NLP, including preprocessing, feature extraction, classifiers, and evaluation.
Deep learning and Watson Studio can be used for various tasks including planet discoveries, particle physics experiments at CERN, and scientific publications analysis. Convolutional neural networks are commonly used for image-related tasks like cancer diagnosis, object detection, and style transfer, while recurrent neural networks with LSTM or GRU are useful for sequential data like text for machine translation, sentiment analysis, and music generation. Hybrid and complex models combine different neural network architectures for tasks such as named entity recognition, music generation, blockchain security, and lip reading. Deep learning is now implemented using frameworks like TensorFlow and Keras on GPUs and distributed systems. Transfer learning helps accelerate development by reusing pre-trained models. Watson Studio provides a platform for developing, testing, and deploying models.
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - Codemotion
In the beginning there was "rule-based" machine translation, like Babelfish, which didn't work at all. Then came statistical machine translation, powering the likes of Google Translate, and all was good. Nowadays, it's all about deep learning, and Neural Machine Translation is the state of the art, with unmatched translation fluency. Let's dive into the internals of a Neural Machine Translation system, explaining the principles and the advantages over the past approaches.
R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations like stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term document matrices and classifying text with k-nearest neighbors (KNN) algorithms. Specifically, it shows classifying speeches from Obama and Romney with over 90% accuracy using KNN classification in R.
Deep Dive on Deep Learning (June 2018) - Julien SIMON
This document provides a summary of a presentation on deep learning concepts, common architectures, Apache MXNet, and infrastructure for deep learning. The agenda includes an overview of deep learning concepts like neural networks and training, common architectures like convolutional neural networks and LSTMs, a demonstration of Apache MXNet's symbolic and imperative APIs, and a discussion of infrastructure for deep learning on AWS like optimized EC2 instances and Amazon SageMaker.
Recent Advances in Natural Language Processing - Apache MXNet
The document provides an overview of recent advances in natural language processing (NLP), including traditional methods like bag-of-words models and word2vec, as well as more recent contextualized word embedding techniques like ELMo and BERT. It discusses applications of NLP like text classification, language modeling, machine translation and question answering, and how different models like recurrent neural networks, convolutional neural networks, and transformer models are used.
This document discusses how to build intelligent and awesome web applications using machine learning techniques in Python. It covers clustering algorithms like k-means clustering to group similar news articles. It also discusses classification algorithms like Naive Bayes classifiers to analyze sentiment of tweets. Recommendation systems using collaborative filtering are also presented. The document provides code examples in Django to implement clustering of news and sentiment analysis of tweets. It highlights challenges in machine learning and lists additional techniques like SVM, canopy clustering and locality sensitive hashing.
This document provides an industrial training presentation on Python programming. It introduces Python, explaining that it is an interpreted, object-oriented, high-level programming language. It then covers why Python is used, its basic data types like numbers, strings, lists, dictionaries, tuples and sets. The presentation also discusses Python concepts like conditionals, loops, functions, exceptions, file handling, object-oriented programming and databases. It concludes that Python supports both procedural and object-oriented programming and can be universally accepted. References for further reading are also included.
Machine learning has become an important toolset in mobile development, enabling many smart capabilities in modern mobile apps. If you are a mobile developer who is new to machine learning and want a quick introduction about the machine learning techniques that you can integrate to your mobile app, this PowerPoint show is for you!
In this presentation, given at the AI & ML meetup on 2nd Feb, Sangram Mishra develops the same NLP solution using both NLTK and OpenNLP, then compares and contrasts the two open-source technologies for deeper understanding and insights on choosing and using them for real-world projects.
3. Machine Learning Pipeline
Documents (illustrated in the slide by a mock document containing titles, body paragraphs, a header, a table, and a watermark)
Document classification
Optical Character
Recognition
Text cleaning and
recomposition
Paragraph segmentation
Paragraph classification
Named Entity Recognition
Hierarchical Data
Recomposition
Understanding
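To make the pipeline above concrete, here is a minimal sketch of such a stage chain in Python; every function below is a hypothetical placeholder of my own, not code from the talk:

def classify_document(doc):
    # e.g. contract vs. invoice vs. annex
    return {"raw": doc, "doc_type": "contract"}

def run_ocr(state):
    # in a real system this would call an OCR engine on scanned pages
    state["text"] = state["raw"]
    return state

def clean_text(state):
    # drop headers, footers and watermarks, fix hyphenation, etc.
    state["text"] = state["text"].strip()
    return state

def segment_paragraphs(state):
    state["paragraphs"] = [p for p in state["text"].split("\n\n") if p]
    return state

def classify_paragraphs(state):
    # label each paragraph: title, clause, table, ...
    state["labels"] = ["clause"] * len(state["paragraphs"])
    return state

def extract_entities(state):
    # named entity recognition per paragraph (placeholder)
    state["entities"] = []
    return state

PIPELINE = [classify_document, run_ocr, clean_text,
            segment_paragraphs, classify_paragraphs, extract_entities]

def process(document):
    state = document
    for stage in PIPELINE:
        state = stage(state)
    return state

print(process("The Issuer hereby agrees...\n\nThis Agreement shall terminate..."))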
4. Common NLP tasks
My father went to Devoxx last year when he was in France.
Named Entity Recognition (NER): "Devoxx" = ORGANIZATION, "France" = LOCATION
Part-of-speech tagging: "went" = VERB
Coreference Resolution (CR): "My father" and "he" refer to the same PERSON
Entity Mention Detection (EMD)
Relation Extraction (RE)
● Language Modeling
● Question Answering
● Summarization
● Machine Translation
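As a concrete illustration of NER and part-of-speech tagging on the example sentence above, here is a minimal sketch using spaCy; spaCy and its en_core_web_sm model are my assumptions, not tools shown in the talk:

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My father went to Devoxx last year when he was in France.")

# Named Entity Recognition: entity span and label (labels may differ from the slide)
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tagging: one tag per token, e.g. "went" -> VERB
for token in doc:
    print(token.text, token.pos_)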
10. Going Deep
From one layer to many hidden layers
(diagram: word vectors for "to", "Devoxx", "last" are fed through a stack of learning functions, i.e. hidden layers; the network outputs a label prediction such as ORGANIZATION, a loss function compares it with the expected label, and the error is propagated back through every layer by backpropagation)
11. Word Vectors
(plot: the words "Confidential", "Personal" and "cat" shown as points in the word-vector space)
Source : Efficient Estimation of Word Representations in Vector Space - Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean - 2013
“The Issuer hereby agrees to hold and treat all Confidential Information”
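As a sketch of how such word vectors can be trained and queried, here is a minimal gensim Word2Vec example; gensim and the toy corpus are my assumptions, not material from the talk:

from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large collection of tokenized sentences.
sentences = [
    ["the", "issuer", "agrees", "to", "hold", "all", "confidential", "information"],
    ["the", "parties", "shall", "treat", "personal", "data", "as", "confidential"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

print(model.wv["confidential"].shape)                    # a 50-dimensional vector
print(model.wv.similarity("confidential", "personal"))   # cosine similarity of two words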
13. Paragraph and document embedding
Produce a vector from a paragraph or document
Source : Distributed Representations of Sentences and Documents - Quoc Le, Tomas Mikolov
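A corresponding gensim Doc2Vec sketch (my illustration; the slide only cites the paper), producing one vector per paragraph:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

paragraphs = [
    "The Issuer hereby agrees to hold and treat all Confidential Information.",
    "This Agreement shall terminate upon written notice by either party.",
]
corpus = [TaggedDocument(words=p.lower().split(), tags=[i])
          for i, p in enumerate(paragraphs)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=50)

# Infer a vector for a new, unseen paragraph
vector = model.infer_vector("the issuer shall keep the information confidential".split())
print(vector.shape)   # (50,)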
14. Term Frequency–Inverse Document
Frequency
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
from sklearn.feature_extraction.text import TfidfVectorizer

# df is a pandas DataFrame of consumer complaints; the
# 'Consumer_complaint_narrative' column holds the raw complaint text.
tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
features.shape
>> (4569, 12633)
4569 documents represented by 12633 features, representing the tf-idf score for different
unigrams and bigrams
15. Entity Recognition with Deep Learning
My father went to Devoxx last year when he was in France.
(in the slide, "Devoxx" is labelled ORG and the remaining tokens "-")
16. Recurrent Neural Network
(diagram: the tokens "My", "father", "went", ..., "Devoxx" are fed one at a time into a recurrent cell; the output for "Devoxx" is ORG, the other outputs are "-")
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
17. Long Short-Term Memory
(diagram: an LSTM cell processes "went", "to", "Devoxx" in sequence; "went" and "to" are labelled "-" and "Devoxx" is labelled ORG)
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
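As an illustration of tagging tokens with a recurrent model, here is a minimal PyTorch LSTM tagger sketch; this is my own example, not code from the talk:

import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    # Embed token ids, run an LSTM over the sequence, predict one tag per token.
    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):          # (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                # (batch, seq_len, hidden_dim)
        return self.out(h)                 # (batch, seq_len, num_tags)

# Toy usage with 2 tags: 0 = "-" (no entity), 1 = ORG
tagger = LSTMTagger(vocab_size=100, num_tags=2)
tokens = torch.tensor([[1, 2, 3, 4]])      # "My father went Devoxx" as ids
print(tagger(tokens).shape)                # torch.Size([1, 4, 2])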
18. Deep Contextualized Word Representations
ELMo (Embeddings from Language Models)
LSTM-based language model trained on large corpus of text.
(diagram: the words "My", "father", "went" pass through a word embedding layer, then a forward LSTM and a backward LSTM, and the model is trained on word prediction)
19. Deep Contextualized Word Representations
ELMo captures the word sense based on the context
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
20. Deep Contextualized Word Representations
Improves results on most NLP tasks
But slower by an order of magnitude (predictions around ~20x slower)
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
21. Sequence to Sequence
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
22. Sequence to Sequence
Source : Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014
23. Augmented Recurrent Neural Networks with
Attention
Source : CHRIS OLAH, SHAN CARTER - https://distill.pub/2016/augmented-rnns/#attentional-interfaces
24. Encoder Decoder with Attention
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
25. Attention: Transformer
Source : Transformer: A Novel Neural Network Architecture for Language Understanding -
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Self-attention mechanism directly models relationships
between all words in a sentence, regardless of their respective
position
26. Attention: Transformer
Source : Attention Is All You Need - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia
Polosukhin - 2017
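To make self-attention concrete, here is a minimal NumPy sketch of the scaled dot-product attention described in the paper; it is my own illustration, not code from the slides:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V: every position attends to every other
    # position in one step, regardless of distance.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # 4 "words", 8-dimensional vectors
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)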
28. BERT
“The Issuer hereby agrees to hold and treat all Confidential Information”
Masked Language Model
“The Issuer hereby agrees to [...]” || “This Agreement shall terminate [...]”
Next sentence prediction
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
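To see the masked-language-model objective in action, here is a small example with the Hugging Face transformers library (my assumption, not part of the slides); it downloads the pretrained bert-base-uncased weights:

from transformers import pipeline

# fill-mask uses BERT's masked language model head to predict the hidden token
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask(
    "The Issuer hereby agrees to hold and treat all [MASK] Information.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))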
29. BERT
Source : BERT Explained: State of the art language model for NLP -
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 - 2018
30. BERT - Training cost
Dataset: BookCorpus (800M words) + English Wikipedia (2500M words)
According to the paper, English models took 4 days to pre-train on 16 to 64 TPUs (~500 USD for a BERT-base model)
English + multilingual models released by Google
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
31. BERT - NER
My father went to Devoxx last year when he was in France.
It was the best conference he ever attended.
(in the slide, "Devoxx" is labelled ORG and the other tokens "-"; the model stack, from bottom to top, is: Embedding, BERT Transformer encoder, Conditional Random Field)
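A hedged sketch of this kind of stack with the transformers library (my illustration; the CRF layer shown in the slide is omitted, and the token classification head is untrained until fine-tuned):

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=2)

inputs = tokenizer("My father went to Devoxx last year.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, num_subword_tokens, num_labels)
print(logits.argmax(dim=-1))               # one predicted label id per subword token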
32. BERT - Model Architecture Comparison
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
33. Benchmark
General Language Understanding Evaluation (GLUE) benchmark
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
34. Our feedback on BERT
● Quite fast to fine-tune from BERT-base (minutes to hours)
● Fine-tuning on the training corpus is needed (compared to fine-tuning only on a general corpus)
● Fine-tuning only the extractor is already enough, but jointly learning BERT + classifier helps a little more
● More experiments should be done with >128 tokens and BERT-large
35. Multi-Task Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
36. Multitask Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
39. LSTM Text Generation
GoT Book 6 (LSTM trained on the first five ASOIAF/GOT books)
Tyrion could hear Lord Aemon’s coughing. “I miss for it. Why did you proper?”
“I feared Master Sansa, Ser,” Ser Jaime reminded her. “She Baratheon is one of the crossing. The second
sons of your onion concubine.”
Lady Donella length of a longsword, the hair that went ready to climb side from her. And all between
them were belaquo bonebreaker and the night’s watch ride in their room. Only he could not look at
them, even others sure. “How could you leave the world?”
“Some must, for you,” a woman’s voiced up lazily. “Gods, Reek.”
She poured off two eyes and stepped down under the fire. “She will find your brother, and now I heard
her since she was standing the bowl. The night was fair and damp.
Source : https://github.com/zackthoutt/got-book-6
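For reference, a rough sketch of how such a character-level LSTM generator can be set up with Keras; this is my own illustration, not the code behind the GoT model linked above:

import numpy as np
from tensorflow import keras

text = "the night was fair and damp. " * 100       # toy stand-in for the training corpus
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
seq_len = 20

# Build (sequence of characters -> next character) training pairs
X = np.array([[char_to_id[c] for c in text[i:i + seq_len]]
              for i in range(len(text) - seq_len)])
y = np.array([char_to_id[text[i + seq_len]] for i in range(len(text) - seq_len)])

model = keras.Sequential([
    keras.layers.Embedding(len(chars), 32),
    keras.layers.LSTM(128),
    keras.layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=1, verbose=0)

# Predict the most likely next character after a seed sequence
seed = np.array([[char_to_id[c] for c in text[:seq_len]]])
print(chars[int(model.predict(seed, verbose=0)[0].argmax())])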
40. GPT 2
"New AI fake text generator may be too dangerous to release"
SYSTEM PROMPT (HUMAN-WRITTEN)
A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its
whereabouts are unknown.
The incident occurred on the downtown train line, which runs from Covington and Ashland stations.
In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad
Administration to find the thief.
“The theft of this nuclear material will have significant negative consequences on public and environmental
health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement.
“Our top priority is to secure the theft and ensure it doesn’t happen again.”
The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site,
according to a news release from Department officials.
The Nuclear Regulatory Commission did not immediately release any information.
According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading
that team’s investigation.
...
Trained on 40 gigabytes of text retrieved from sources
linked by high-ranking Reddit posts. 1.5 billion parameters.
Source : https://openai.com/blog/better-language-models/
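For reference, generating text from the publicly released GPT-2 weights with the transformers library; this is my example, not part of the slides:

from transformers import pipeline

# The small public GPT-2 checkpoint (the model the slides later call GPT 2 "junior")
generator = pipeline("text-generation", model="gpt2")

prompt = ("A train carriage containing controlled nuclear materials was stolen "
          "in Cincinnati today. Its whereabouts are unknown.")
result = generator(prompt, max_length=80, num_return_sequences=1)
print(result[0]["generated_text"])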
41. GPT 2 - Common Sense Reasoning
The trophy doesn’t fit into the brown suitcase because it
is too large.
Correct answer: it = trophy
Model answer: it = trophy
The trophy doesn’t fit into the brown suitcase because it
is too small.
Correct answer: it = suitcase
Model answer: it = suitcase
42. GPT 2 - Question Answering
Who wrote the book the origin of species?
Correct answer: Charles Darwin
Model answer: Charles Darwin
What is the largest state in the U.S. by land mass?
Correct answer: Alaska
Model answer: California
43. GPT 2 "junior" - Repetitive speeches
Public GPT 2: 117 million parameter version of the model
Source : https://arstechnica.com/information-technology/2019/02/twenty-minutes-into-the-future-with-openais-deep-fake-text-ai/
The last time a Republican Governor signed a bill that would force the states to accept millions of
more immigrants than they already do, that was in 2009. That's a long time to wait to see if that
legislation would pass, but it is already happening now. It is a long time to wait to see if that legislation
passes again. And that's part of the reason why our president is so popular. He is one who says things
that are good for the country. He has said things that are good for America. As a businessman, he has
said things that are good for America. And he has said things that are good for our communities. He
has said things that are good for our families. He has said things that are good for our nation. He has
said things that are good for America. And he has said things that are good for our country. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America.
44. GPT 2 "junior" - Fake News Test
SYSTEM PROMPT (HUMAN-WRITTEN)
Iran has announced that it has successfully tested a hydrogen bomb.
As for the U.N., the council's permanent representative under President Barack Obama has
met to see to it that it follows the Paris climate agreement, which has allowed states to
move away from fossil fuels.
…
Kim Jong-un, the leader of North Korea and most closely aligned with the United States, has
warned of an imminent U.S. attack. The test of a hydrogen bomb Thursday killed 13 people
and injured several others in a Pyongyang explosion, the country's state TV station
reported.<|endoftext|>Coconut Cream
…
This is an amazing flavor that will satisfy any dessert enthusiast, but in most cases doesn't
taste sweet in moderation. It also creates a great nut-free breakfast cereal and can be
served with peanut butter or coconut milk. The taste can just as great as any dessert
topping.
Ingredients:
1/3 cup coconut milk
1/2 cup low-fat unsweetened coconut milk
...