Natural Language Processing on Text
Devoxx, April 2019
Hyperlex
Contract Management
Analysis
Data Extraction
and
Review
Machine Learning Pipeline
Documents
[Figure: sample document page of placeholder text, with its layout regions labeled: Title 1, Title 2, Header, Table, Watermark]
Document classification
Optical Character
Recognition
Text cleaning and
recomposition
Paragraph segmentation
Paragraph classification
Named Entity Recognition
Hierarchical Data
Recomposition
Understanding
Common NLP tasks
My father went to Devoxx last year when he was in France.
ORGANIZATION
Named Entity Recognition (NER)
Part-of-speech tagging
VERB
PERSON PERSON
Coreference Resolution (CR)
LOCATION
Entity Mention Detection (EMD)
Relation Extraction (RE)
● Language Modeling
● Question Answering
● Summarization
● Machine Translation
Traditional Machine Learning
[Figure: the tokens "to", "Devoxx" and "last" are turned into a feature representation, which a learning function maps to a label prediction (ORGANIZATION)]
Preprocessing
Stemming
Lemmatization
Word segmentation
Vectorization
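The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the function names are invented for the example, the suffix-stripping rule is a deliberately naive stand-in for a real stemmer (it is not the Porter algorithm), and lemmatization is left out because it needs a dictionary.

```python
import re
from collections import Counter

def segment(text):
    """Word segmentation: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Toy stemmer (assumption): strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def vectorize(tokens, vocabulary):
    """Bag-of-words vectorization: count each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

text = "My father attended Devoxx and enjoyed the talks."
tokens = [naive_stem(t) for t in segment(text)]
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)
```

The resulting vector has one count per vocabulary term, which is the kind of fixed-size input the learning functions on the next slide expect.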
Learning Functions
Linear Regression
Logistic Regression
Support-vector machine
Perceptron
Conditional Random Field
def word2features(sent, i):
    # sent is a list of (token, POS tag, ...) tuples; i is the token index
    word = sent[i][0]
    postag = sent[i][1]
    # Features of the current token
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Features of the previous token
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # Beginning of sentence
        features['BOS'] = True
    ...
import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# CoNLL 2002 data
nltk.corpus.conll2002.fileids()
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
...
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,  # L1 regularization coefficient
    c2=0.1,  # L2 regularization coefficient
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
...
# Labels to score (the list leaves out the 'O' tag)
labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER',
          'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred,
                      average='weighted',
                      labels=labels)
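The snippet above calls sent2features and sent2labels without showing them. A plausible sketch follows, assuming each sentence is a list of (token, POS tag, IOB label) triples. The word2features below is a cut-down stand-in so that the block runs on its own, and the example sentence is hypothetical.

```python
def word2features(sent, i):
    """Cut-down stand-in for the slide's feature function (assumption)."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i == 0:
        features['BOS'] = True  # beginning of sentence
    return features

def sent2features(sent):
    """One feature dict per token of the sentence."""
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    """Gold label of each token (third element of each triple)."""
    return [label for _token, _postag, label in sent]

# Hypothetical sentence in (token, POS tag, IOB label) form.
sent = [('Devoxx', 'NNP', 'B-ORG'), ('rocks', 'VBZ', 'O')]
X = sent2features(sent)
y = sent2labels(sent)
```

Each sentence becomes a list of feature dicts paired with a list of labels of the same length, which is the input shape crf.fit is given on the slide.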
Conditional random field
             precision  recall  f1-score  support
B-LOC            0.775   0.757     0.766     1084
I-LOC            0.601   0.631     0.616      325
B-MISC           0.698   0.499     0.582      339
I-MISC           0.644   0.567     0.603      557
B-ORG            0.795   0.801     0.798     1400
I-ORG            0.831   0.773     0.801     1104
B-PER            0.812   0.876     0.843      735
I-PER            0.873   0.931     0.901      634
avg / total      0.779   0.764     0.770     6178
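The "avg / total" row is consistent with a support-weighted mean of the per-label scores, matching the average='weighted' argument used in the code. A short check, with the numbers copied from the report (the variable and function names are illustrative):

```python
# (precision, recall, f1-score, support) per label, copied from the report.
report = {
    'B-LOC':  (0.775, 0.757, 0.766, 1084),
    'I-LOC':  (0.601, 0.631, 0.616, 325),
    'B-MISC': (0.698, 0.499, 0.582, 339),
    'I-MISC': (0.644, 0.567, 0.603, 557),
    'B-ORG':  (0.795, 0.801, 0.798, 1400),
    'I-ORG':  (0.831, 0.773, 0.801, 1104),
    'B-PER':  (0.812, 0.876, 0.843, 735),
    'I-PER':  (0.873, 0.931, 0.901, 634),
}

total_support = sum(row[3] for row in report.values())

def weighted(column):
    """Support-weighted mean of one column (0=precision, 1=recall, 2=f1)."""
    return sum(row[column] * row[3] for row in report.values()) / total_support

precision_avg = round(weighted(0), 3)
recall_avg = round(weighted(1), 3)
f1_avg = round(weighted(2), 3)
```

Frequent labels such as B-ORG therefore weigh more in the final score than rare ones such as I-LOC.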
Going Deeper
Going Deep
From one layer to many hidden layers
[Figure: the word vectors for "to", "Devoxx" and "last" pass through several stacked learning functions (hidden layers) to a label prediction (ORGANIZATION); the loss function on the prediction is backpropagated through each layer]
Word Vectors
Confidential cat Personal
Source : Efficient Estimation of Word Representations in Vector Space - Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean - 2013
“The Issuer hereby agrees to hold and treat all Confidential Information”
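Word vectors place related words close together, and closeness is commonly measured with cosine similarity. The sketch below uses three hypothetical 3-dimensional vectors invented for illustration; real embeddings have hundreds of dimensions and are learned from text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only.
vectors = {
    'confidential': [0.9, 0.8, 0.1],
    'personal':     [0.8, 0.9, 0.2],
    'cat':          [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(vectors['confidential'], vectors['personal'])
sim_far = cosine_similarity(vectors['confidential'], vectors['cat'])
```

With these toy values, "confidential" scores much closer to "personal" than to "cat".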
Word Vectors
Paragraph and document embedding
Produce a vector from a paragraph or document
Source : Distributed Representations of Sentences and Documents - Quoc Le, Tomas Mikolov
Term Frequency–Inverse Document Frequency
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
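The two formulas can be applied by hand to a tiny corpus. This is a minimal sketch on three hypothetical, already tokenized documents. Note that scikit-learn's TfidfVectorizer by default uses a smoothed IDF and normalizes each row, so its numbers differ from this textbook version.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term / total terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: ln(total documents / documents with the term).

    Raises ZeroDivisionError if no document contains the term.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

# Hypothetical three-document corpus.
corpus = [
    ['the', 'issuer', 'agrees'],
    ['the', 'agreement', 'terminates'],
    ['the', 'issuer', 'pays', 'the', 'fee'],
]

tf_issuer = tf('issuer', corpus[0])   # 1 occurrence out of 3 terms
idf_issuer = idf('issuer', corpus)    # ln(3 / 2)
idf_the = idf('the', corpus)          # ln(3 / 3)
score = tf_issuer * idf_issuer
```

A word that occurs in every document, such as "the", gets an IDF of zero and so contributes nothing, whatever its term frequency.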
from sklearn.feature_extraction.text import TfidfVectorizer

# df is assumed to be a pandas DataFrame, loaded elsewhere,
# with a Consumer_complaint_narrative text column
tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
features.shape
>> (4569, 12633)
Each of the 4569 documents is represented by 12633 features: the tf-idf scores of the different unigrams and bigrams.
Entity Recognition with Deep Learning
My father went to Devoxx last year when he was in France.
Labels: Devoxx = ORG, every other token = - (no entity)
Recurrent Neural Network
My father went Devoxx
ORG
- --
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
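The recurrence pictured on this slide can be sketched as a plain-Python forward pass: each step combines the current word vector with the previous hidden state. The weights and the 2-dimensional inputs below are hand-picked, hypothetical values; a real network learns its weights by backpropagation and adds an output layer to predict a label per token.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    hidden_size = len(b_h)
    h = []
    for j in range(hidden_size):
        total = b_h[j]
        total += sum(W_xh[j][k] * x[k] for k in range(len(x)))
        total += sum(W_hh[j][k] * h_prev[k] for k in range(hidden_size))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Run the cell over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Hypothetical word vectors for a three-token sentence.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_xh = [[0.5, -0.5], [0.3, 0.8]]   # input-to-hidden weights
W_hh = [[0.1, 0.0], [0.0, 0.1]]    # hidden-to-hidden weights
b_h = [0.0, 0.0]

states = run_rnn(inputs, W_xh, W_hh, b_h)
```

Because each state depends on the previous one, the state at "Devoxx" carries information about the words that came before it.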
Long Short-Term Memory
went to Devoxx
- - ORG
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
Deep Contextualized Word Representations
ELMo (Embeddings from Language Models)
LSTM-based language model trained on a large corpus of text.
My
father
went
Word Embedding
Forward LSTM Backward LSTM
Word Prediction
Deep Contextualized Word Representations
ELMo captures the sense of a word from its context
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
Deep Contextualized Word Representations
Provides better results on most NLP tasks
But slower by an order of magnitude (predictions around 20x slower)
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
Sequence to Sequence
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Sequence to Sequence
Source : Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014
Augmented Recurrent Neural Networks with Attention
Source : Chris Olah, Shan Carter - https://distill.pub/2016/augmented-rnns/#attentional-interfaces
Encoder Decoder with Attention
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Attention: Transformer
Source : Transformer: A Novel Neural Network Architecture for Language Understanding -
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Self-attention mechanism directly models relationships
between all words in a sentence, regardless of their respective
position
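The statement above can be illustrated with scaled dot-product self-attention in plain Python. This sketch is a simplification: it uses the input vectors directly as queries, keys and values (a real Transformer applies learned projections and several heads), and the three 2-dimensional word vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (simplification)."""
    d = len(X[0])
    all_weights = []
    outputs = []
    for q in X:
        # Score this word against every word in the sentence, whatever its position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        all_weights.append(w)
        # The output is the attention-weighted mix of all word vectors.
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs, all_weights

# Hypothetical word vectors for a three-word sentence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
```

The weight between two words depends only on their vectors, not on how far apart they are, which is what lets the model relate distant words in one step.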
Attention: Transformer
Source : Attention Is All You Need - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia
Polosukhin - 2017
Image Captioning
Source : Image Captioning - https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning - 2018
BERT
“The Issuer hereby agrees to hold and treat all Confidential Information”
Masked Language Model
“The Issuer hereby agrees to [...]” || “This Agreement shall terminate [...]”
Next sentence prediction
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
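The masked language model objective can be sketched as follows: hide a fraction of the tokens and keep the originals as prediction targets. This is a simplified illustration. It splits on whitespace rather than using WordPiece, always substitutes [MASK] (the paper also sometimes keeps the token or swaps in a random one), and the function name and seed are arbitrary.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # what the model must predict
        masked[pos] = '[MASK]'       # what the model sees
    return masked, targets

sentence = "The Issuer hereby agrees to hold and treat all Confidential Information"
tokens = sentence.split()
masked, targets = mask_tokens(tokens)
```

Training then asks the model to recover each entry of targets from the surrounding unmasked words on both sides, which is why the resulting representations are bidirectional.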
BERT
Source : BERT Explained: State of the art language model for NLP -
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 - 2018
BERT - Training cost
Dataset: BookCorpus (800M words) + English Wikipedia (2500M words)
According to the paper: English models took 4 days to pre-train on 16 to
64 TPU chips (~500 USD for a BERT-base model)
English + multilingual models released by Google
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
BERT - NER
My father went to Devoxx last year when he was in France. (Devoxx = ORG, every other token = -)
It was the best conference he ever attended. (every token = -)
[Figure: model stack, from input to output: Embedding, BERT Transformer encoder, Conditional Random Field]
BERT - Model Architecture Comparison
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
Benchmark
General Language Understanding Evaluation (GLUE) benchmark
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
Our feedback on BERT
● Fine-tuning from BERT-base is quite fast (minutes to an hour)
● Fine-tuning on the training corpus is needed (as opposed to fine-tuning only on a general corpus)
● Fine-tuning only the extractor is already enough, but jointly training BERT and the classifier helps a little more
● More experiments should be done with >128 tokens and BERT-large
Multi-Task Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
Multi-Task Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
Chronology
Source : Unsupervised Deep Learning - https://media.neurips.cc/Conferences/NIPS2018/Slides/graves-deeplearning2.pdf
Example
LSTM Text Generation
GoT Book 6 (LSTM trained on the first five ASOIAF/GOT books)
Tyrion could hear Lord Aemon’s coughing. “I miss for it. Why did you proper?”
“I feared Master Sansa, Ser,” Ser Jaime reminded her. “She Baratheon is one of the crossing. The second
sons of your onion concubine.”
Lady Donella length of a longsword, the hair that went ready to climb side from her. And all between
them were belaquo bonebreaker and the night’s watch ride in their room. Only he could not look at
them, even others sure. “How could you leave the world?”
“Some must, for you,” a woman’s voiced up lazily. “Gods, Reek.”
She poured off two eyes and stepped down under the fire. “She will find your brother, and now I heard
her since she was standing the bowl. The night was fair and damp.
Source : https://github.com/zackthoutt/got-book-6
GPT-2
"New AI fake text generator may be too dangerous to release"
SYSTEM PROMPT (HUMAN-WRITTEN)
A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its
whereabouts are unknown.
The incident occurred on the downtown train line, which runs from Covington and Ashland stations.
In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad
Administration to find the thief.
“The theft of this nuclear material will have significant negative consequences on public and environmental
health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement.
“Our top priority is to secure the theft and ensure it doesn’t happen again.”
The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site,
according to a news release from Department officials.
The Nuclear Regulatory Commission did not immediately release any information.
According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading
that team’s investigation.
...
Trained on 40 gigabytes of text retrieved from sources
linked by high-ranking Reddit posts. 1.5 billion parameters.
Source : https://openai.com/blog/better-language-models/
GPT-2 - Common Sense Reasoning
The trophy doesn’t fit into the brown suitcase because it
is too large.
Correct answer: it = trophy
Model answer: it = trophy
The trophy doesn’t fit into the brown suitcase because it
is too small.
Correct answer: it = suitcase
Model answer: it = suitcase
GPT-2 - Question Answering
Who wrote the book the origin of species?
Correct answer: Charles Darwin
Model answer: Charles Darwin
What is the largest state in the U.S. by land mass?
Correct answer: Alaska
Model answer: California
GPT-2 "junior" - Repetitive speeches
Public GPT-2: the 117-million-parameter version of the model
Source : https://arstechnica.com/information-technology/2019/02/twenty-minutes-into-the-future-with-openais-deep-fake-text-ai/
The last time a Republican Governor signed a bill that would force the states to accept millions of
more immigrants than they already do, that was in 2009. That's a long time to wait to see if that
legislation would pass, but it is already happening now. It is a long time to wait to see if that legislation
passes again. And that's part of the reason why our president is so popular. He is one who says things
that are good for the country. He has said things that are good for America. As a businessman, he has
said things that are good for America. And he has said things that are good for our communities. He
has said things that are good for our families. He has said things that are good for our nation. He has
said things that are good for America. And he has said things that are good for our country. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America.
GPT-2 "junior" - Fake News Test
SYSTEM PROMPT (HUMAN-WRITTEN)
Iran has announced that it has successfully tested a hydrogen bomb.
As for the U.N., the council's permanent representative under President Barack Obama has
met to see to it that it follows the Paris climate agreement, which has allowed states to
move away from fossil fuels.
…
Kim Jong-un, the leader of North Korea and most closely aligned with the United States, has
warned of an imminent U.S. attack. The test of a hydrogen bomb Thursday killed 13 people
and injured several others in a Pyongyang explosion, the country's state TV station
reported.<|endoftext|>Coconut Cream
…
This is an amazing flavor that will satisfy any dessert enthusiast, but in most cases doesn't
taste sweet in moderation. It also creates a great nut-free breakfast cereal and can be
served with peanut butter or coconut milk. The taste can just as great as any dessert
topping.
Ingredients:
1/3 cup coconut milk
1/2 cup low-fat unsweetened coconut milk
...
Questions?
Hyperlex.ai
Alexis Agahi
@aagahi

Programming Foundation Models with DSPy - Meetup Slides
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Devoxx: Natural Language Processing on Text in 2019 (Traitement automatique du langage sur du texte)

  • 1. Natural Language Processing on Text (Traitement Automatique du Langage sur du texte), Devoxx, April 2019
  • 3. Machine Learning Pipeline. Input: Documents (sample page with titles, header, table and watermark). Stages: Document classification, Optical Character Recognition, Text cleaning and recomposition, Paragraph segmentation, Paragraph classification, Named Entity Recognition, Hierarchical Data Recomposition, Understanding
  • 4. Common NLP tasks. Example sentence: “My father went to Devoxx last year when he was in France.” Annotation tasks shown on it: Named Entity Recognition (NER), Part-of-speech tagging, Coreference Resolution (CR), Entity Mention Detection (EMD), Relation Extraction (RE), with labels ORGANIZATION, VERB, PERSON, LOCATION. Other tasks: ● Language Modeling ● Question Answering ● Summarization ● Machine Translation
  • 5. Traditional Machine Learning. Diagram: tokens (“to”, “Devoxx”, “last”), Feature Representation, Learning Function, Label prediction (ORGANIZATION). Preprocessing: Stemming, Lemmatization, Word segmentation, Vectorization
  • 6. Learning Functions: Linear Regression, Logistic Regression, Support-vector machine, Perceptron
  • 7. Conditional Random Field
        import nltk
        import sklearn_crfsuite
        from sklearn_crfsuite import metrics

        def word2features(sent, i):
            # sent is a list of (word, POS tag, label) tuples
            word = sent[i][0]
            postag = sent[i][1]
            features = {
                'bias': 1.0,
                'word.lower()': word.lower(),
                'word[-3:]': word[-3:],
                'word[-2:]': word[-2:],
                'word.isupper()': word.isupper(),
                'word.istitle()': word.istitle(),
                'word.isdigit()': word.isdigit(),
                'postag': postag,
                'postag[:2]': postag[:2],
            }
            if i > 0:
                # features of the previous token
                word1 = sent[i-1][0]
                postag1 = sent[i-1][1]
                features.update({
                    '-1:word.lower()': word1.lower(),
                    '-1:word.istitle()': word1.istitle(),
                    '-1:word.isupper()': word1.isupper(),
                    '-1:postag': postag1,
                    '-1:postag[:2]': postag1[:2],
                })
            else:
                features['BOS'] = True  # beginning of sentence
            ...

        # CoNLL 2002 data
        nltk.corpus.conll2002.fileids()
        X_train = [sent2features(s) for s in train_sents]
        y_train = [sent2labels(s) for s in train_sents]
        ...
        crf = sklearn_crfsuite.CRF(
            algorithm='lbfgs',
            c1=0.1,  # L1 regularization coefficient
            c2=0.1,  # L2 regularization coefficient
            max_iterations=100,
            all_possible_transitions=True
        )
        crf.fit(X_train, y_train)
        ...
        # labels to score (the 'O' tag is left out)
        labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']
        y_pred = crf.predict(X_test)
        metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)
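The training code on this slide calls `sent2features` and `sent2labels` without showing them. The sketch below is one plausible way to write them, assuming each sentence is a list of `(word, POS tag, label)` tuples as in the CoNLL 2002 corpus. The shortened `word2features` and the toy sentence are illustrative assumptions, not the speakers' code.

```python
def word2features(sent, i):
    """Shortened version of the slide's feature function (subset of features)."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i == 0:
        features['BOS'] = True  # beginning of sentence
    return features


def sent2features(sent):
    """One feature dict per token: the input format a CRF expects for a sentence."""
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    """Extract the gold label of each token."""
    return [label for _word, _postag, label in sent]


# Hypothetical toy sentence in (word, POS tag, BIO label) form
sent = [('Devoxx', 'NNP', 'B-ORG'), ('in', 'IN', 'O'), ('France', 'NNP', 'B-LOC')]
X = sent2features(sent)
y = sent2labels(sent)
```

Each sentence becomes a list of feature dicts (`X`) paired with a list of labels (`y`), which is the shape `crf.fit(X_train, y_train)` is given on the slide.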
  • 8. Conditional random field
                       precision   recall   f1-score   support
        B-LOC          0.775       0.757    0.766      1084
        I-LOC          0.601       0.631    0.616      325
        B-MISC         0.698       0.499    0.582      339
        I-MISC         0.644       0.567    0.603      557
        B-ORG          0.795       0.801    0.798      1400
        I-ORG          0.831       0.773    0.801      1104
        B-PER          0.812       0.876    0.843      735
        I-PER          0.873       0.931    0.901      634
        avg / total    0.779       0.764    0.770      6178
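As a sanity check on the report above, the F1 score is the harmonic mean of precision and recall, and the `average='weighted'` total is the support-weighted mean of the per-label scores. The sketch below recomputes a few values from the rounded figures on the slide. Because it starts from rounded inputs, it is only expected to agree to about three decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (f1-score, support) per label, copied from the slide
report = {
    'B-LOC': (0.766, 1084), 'I-LOC': (0.616, 325),
    'B-MISC': (0.582, 339), 'I-MISC': (0.603, 557),
    'B-ORG': (0.798, 1400), 'I-ORG': (0.801, 1104),
    'B-PER': (0.843, 735), 'I-PER': (0.901, 634),
}

total_support = sum(support for _, support in report.values())
weighted_f1 = sum(score * support for score, support in report.values()) / total_support

b_loc_f1 = f1(0.775, 0.757)  # precision and recall of B-LOC from the slide
```

Here `round(b_loc_f1, 3)` reproduces the 0.766 shown for B-LOC, and `round(weighted_f1, 3)` reproduces the 0.770 total over 6178 tokens.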
  • 10. Going Deep: from one layer to many hidden layers. Diagram: vectors for “to”, “Devoxx”, “last” pass through stacked Learning Functions to a Label prediction (ORGANIZATION); a loss function feeds Backpropagation to each Learning Function
  • 11. Word Vectors. Example sentence: “The Issuer hereby agrees to hold and treat all Confidential Information”. Words shown in the vector-space figure: Confidential, Personal, cat. Source : Efficient Estimation of Word Representations in Vector Space - Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean - 2013
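Word vectors are usually compared with cosine similarity, so that related words end up with a score close to 1. The sketch below shows the computation on hand-picked three-dimensional toy vectors. These values are invented for illustration and are not real word2vec embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (assumption: chosen by hand, not learned)
vectors = {
    'confidential': [0.9, 0.8, 0.1],
    'personal': [0.8, 0.9, 0.2],
    'cat': [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(vectors['confidential'], vectors['personal'])
sim_unrelated = cosine_similarity(vectors['confidential'], vectors['cat'])
```

With these toy values, `sim_related` is much higher than `sim_unrelated`, which is the kind of geometry a trained embedding is expected to produce for related and unrelated words.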
  • 13. Paragraph and document embedding: produce a vector from a paragraph or document. Source : Distributed Representations of Sentences and Documents - Quoc Le, Tomas Mikolov
  • 14. Term Frequency–Inverse Document Frequency
        TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
        IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

        from sklearn.feature_extraction.text import TfidfVectorizer
        tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2), stop_words='english')
        features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
        features.shape
        >> (4569, 12633)

    4569 documents represented by 12633 features, representing the tf-idf score for different unigrams and bigrams
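To make the two formulas concrete, here is a minimal pure-Python sketch that applies them exactly as written on the slide to a small invented corpus. Note that scikit-learn's `TfidfVectorizer` does not use this exact formula by default (it smooths the IDF and L2-normalizes each row), so its scores will differ from the ones computed here.

```python
import math


def tf(term, doc_tokens):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Natural log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus, already tokenized on whitespace
corpus = [
    "the issuer agrees to hold confidential information".split(),
    "the agreement shall terminate".split(),
    "the issuer shall notify the holder".split(),
]

score_the = tf_idf('the', corpus[2], corpus)
score_confidential = tf_idf('confidential', corpus[0], corpus)
```

A word that appears in every document, such as "the", gets an IDF of log(1) = 0 and therefore a score of 0, while "confidential", which appears in a single document, gets a positive score.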
  • 15. Entity Recognition with Deep Learning. Example: “My father went to Devoxx last year when he was in France.” with “Devoxx” tagged ORG and the other tokens tagged “-”
  • 16. Recurrent Neural Network. Example: tokens “My”, “father”, “went”, “Devoxx” tagged “-”, “-”, “-”, ORG. Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
  • 17. Long Short-Term Memory. Example: tokens “went”, “to”, “Devoxx” tagged “-”, “-”, ORG. Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
  • 18. Deep Contextualized Word Representations: ELMo (Embeddings from Language Models), an LSTM-based language model trained on a large corpus of text. Diagram: input “My father went”, then Word Embedding, Forward LSTM and Backward LSTM, Word Prediction
  • 19. Deep Contextualized Word Representations: ELMo captures the word sense based on the context. Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
  • 20. Deep Contextualized Word Representations: provides results on most NLP tasks, but is slower by an order of magnitude (predictions around 20x slower). Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
  • 21. Sequence to Sequence Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
  • 22. Sequence to Sequence Source : Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014
  • 23. Augmented Recurrent Neural Networks with Attention Source : CHRIS OLAH, SHAN CARTER - https://distill.pub/2016/augmented-rnns/#attentional-interfaces
  • 24. Encoder Decoder with Attention Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
  • 25. Attention: Transformer. The self-attention mechanism directly models relationships between all words in a sentence, regardless of their respective position. Source : Transformer: A Novel Neural Network Architecture for Language Understanding - https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
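The idea that every word attends to every other word, whatever its position, can be sketched in a few lines. The example below is a deliberately stripped-down scaled dot-product self-attention in pure Python. It assumes queries, keys and values are the input vectors themselves, whereas a real Transformer applies learned projections, uses several heads and adds positional encodings. The toy input vectors are invented.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the query against every position, including its own
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all input vectors
        outputs.append([
            sum(w * value[i] for w, value in zip(row, vectors))
            for i in range(dim)
        ])
    return outputs, weights


# Hypothetical 2-dimensional vectors for a three-word sentence
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` describes how much one word draws from every word in the sentence. Nothing in the computation depends on the distance between positions, which is the property the slide highlights.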
  • 26. Attention: Transformer Source : Attention Is All You Need - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin - 2017
  • 27. Image Captioning Source : Image Captioning - https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning - 2018
  • 28. BERT. Masked Language Model: “The Issuer hereby agrees to hold and treat all Confidential Information”. Next sentence prediction: “The Issuer hereby agrees to [...]” || “This Agreement shall terminate [...]”. Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
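The masked language model objective can be illustrated with a small input-corruption routine. The sketch below follows the recipe described in the BERT paper: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% are left unchanged. It works on whitespace tokens with a tiny invented vocabulary, whereas BERT itself operates on WordPiece tokens, so treat it as an illustration rather than the reference preprocessing.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append('[MASK]')
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(token)  # kept as-is, but still predicted
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels


sentence = "The Issuer hereby agrees to hold and treat all Confidential Information".split()
vocab = ['agreement', 'party', 'shall', 'terminate']  # hypothetical toy vocabulary
corrupted, labels = mask_tokens(sentence, vocab)
```

Positions where `labels` is not `None` are the only ones that contribute to the masked language model loss. All other tokens pass through unchanged and serve as context.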
  • 29. BERT Source : BERT Explained: State of the art language model for NLP - https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 - 2018
  • 30. BERT - Training cost. Dataset: BookCorpus (800M words) + English Wikipedia (2500M words). According to the paper, English models took 4 days to pre-train on 16 to 64 TPUs (~500 USD for a BERT-base model). English and multilingual models were released by Google. Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
  • 31. BERT - NER. Diagram, bottom to top: Embedding, BERT Transformer encoder, Conditional Random Field. Example: “My father went to Devoxx last year when he was in France. It was the best conference he ever attended.” with “Devoxx” tagged ORG and the other tokens tagged “-”
  • 32. BERT - Model Architecture Comparison Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
  • 33. Benchmark: General Language Understanding Evaluation (GLUE) benchmark. Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
  • 34. Our feedback on BERT
        ● Quite fast to finetune from BERT-base (minutes to an hour)
        ● Finetuning on the training corpus is needed (compared to finetuning only on a general corpus)
        ● Finetuning only the extractor is already enough, but jointly training BERT and the classifier helps a little more
        ● More experiments should be done with >128 tokens and BERT-large
  • 35. Multi-Task Learning Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
  • 36. Multitask Learning Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
  • 37. Chronology Source : Unsupervised Deep Learning - https://media.neurips.cc/Conferences/NIPS2018/Slides/graves-deeplearning2.pdf
  • 39. LSTM Text Generation: GoT Book 6 (LSTM trained on the first five ASOIAF/GOT books). Source : https://github.com/zackthoutt/got-book-6
    Generated sample:
        Tyrion could hear Lord Aemon’s coughing. “I miss for it. Why did you proper?” “I feared Master Sansa, Ser,” Ser Jaime reminded her. “She Baratheon is one of the crossing. The second sons of your onion concubine.” Lady Donella length of a longsword, the hair that went ready to climb side from her. And all between them were belaquo bonebreaker and the night’s watch ride in their room. Only he could not look at them, even others sure. “How could you leave the world?” “Some must, for you,” a woman’s voiced up lazily. “Gods, Reek.” She poured off two eyes and stepped down under the fire. “She will find your brother, and now I heard her since she was standing the bowl. The night was fair and damp.
  • 40. GPT 2: "New AI fake text generator may be too dangerous to release". Trained on 40 gigabytes of text retrieved from sources linked by high-ranking Reddit posts. 1.5 billion parameters. Source : https://openai.com/blog/better-language-models/
    SYSTEM PROMPT (HUMAN-WRITTEN):
        A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
    Model completion:
        The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. “The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement. “Our top priority is to secure the theft and ensure it doesn’t happen again.” The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site, according to a news release from Department officials. The Nuclear Regulatory Commission did not immediately release any information. According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading that team’s investigation. ...
  • 41. GPT 2 - Common Sense Reasoning
        “The trophy doesn’t fit into the brown suitcase because it is too large.” Correct answer: it = trophy. Model answer: it = trophy.
        “The trophy doesn’t fit into the brown suitcase because it is too small.” Correct answer: it = suitcase. Model answer: it = suitcase.
  • 42. GPT 2 - Question Answering
        “Who wrote the book the origin of species?” Correct answer: Charles Darwin. Model answer: Charles Darwin.
        “What is the largest state in the U.S. by land mass?” Correct answer: Alaska. Model answer: California.
  • 43. GPT 2 "junior" - Repetitive speeches. Public GPT 2: 117 million parameter version of the model. Source : https://arstechnica.com/information-technology/2019/02/twenty-minutes-into-the-future-with-openais-deep-fake-text-ai/
    Generated sample:
        The last time a Republican Governor signed a bill that would force the states to accept millions of more immigrants than they already do, that was in 2009. That's a long time to wait to see if that legislation would pass, but it is already happening now. It is a long time to wait to see if that legislation passes again. And that's part of the reason why our president is so popular. He is one who says things that are good for the country. He has said things that are good for America. As a businessman, he has said things that are good for America. And he has said things that are good for our communities. He has said things that are good for our families. He has said things that are good for our nation. He has said things that are good for America. And he has said things that are good for our country. And he has said things that are good for America. And he has said things that are good for America. And he has said things that are good for America. And he has said things that are good for America. And he has said things that are good for America. And he has said things that are good for America.
  • 44. GPT 2 "junior" - Fake News Test SYSTEM PROMPT (HUMAN-WRITTEN) Iran has announced that it has successfully tested a hydrogen bomb. As for the U.N., the council's permanent representative under President Barack Obama has met to see to it that it follows the Paris climate agreement, which has allowed states to move away from fossil fuels. … Kim Jong-un, the leader of North Korea and most closely aligned with the United States, has warned of an imminent U.S. attack. The test of a hydrogen bomb Thursday killed 13 people and injured several others in a Pyongyang explosion, the country's state TV station reported.<|endoftext|>Coconut Cream … This is an amazing flavor that will satisfy any dessert enthusiast, but in most cases doesn't taste sweet in moderation. It also creates a great nut-free breakfast cereal and can be served with peanut butter or coconut milk. The taste can just as great as any dessert topping. Ingredients: 1/3 cup coconut milk 1/2 cup low-fat unsweetened coconut milk ...