NLP in the Deep Learning Era: the story so far
9th PyData Athens Meetup
Tuesday, April 23
18:30 - 19:30
The Cube Athens
nlp.cs.aueb
Industry
Data Scientist, Cognitiv+
● Study use cases for contract analytics
● Train models to extract:
   ○ the document structure,
   ○ keystone insights (data points),
   ○ keystone actions (mechanisms)
Head of Data Science, Cognitiv+
● Explore new markets (leases, financial documents, regulatory compliance)
● Investigate visual aspects of document processing to improve data cleansing / document formatting
Academia
Research Associate, National and Kapodistrian University of Athens
● Legislative knowledge representation as Linked Data
● Model relations between structured documents
● Nomothesia Project: Open Data Platform for Greek Legislation (http://legislation.di.uoa.gr)
PhD Candidate, Athens University of Economics and Business
● Contract Entity Extraction and Structure Identification
● Sentence Classification in contractual clauses
● Extreme Multi-label Classification on EU legislation
● Judicial Decision Prediction
● Large-scale auxiliary pre-training for unsupervised learning
[Figures: text classification examples — spam detection (SPAM) and contract clause classification (CONVENIENCE, MATERIAL BREACH, RENEWAL, ASSISTANCE; AUTOMATIC / NON-AUTOMATIC)]
Named-entity recognition example: Donald Trump [PERSON] left White House [ORGANIZATION] and visited China [GPE] for his new re-election campaign, as the candidate of the Republican Party [NORP].
[Figure: contract entity extraction — labels include TITLE, CONTRACT PARTY, LANDLORD, TENANT, PROPERTY ADDRESS, START DATE, CONTRACT PERIOD]
[Figure: question answering over a large document collection (DOCUMENT 1, DOCUMENT 2, DOCUMENT 3, …, DOCUMENT 1683), mapping a question (Q:) to an answer (A:)]
X  = (1, 3)
W1 = (3, 4)
h1 = (1, 4)
W2 = (4, 4)
h2 = (1, 4)
Wo = (4, 1)
Y  = (1, 1)
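A minimal NumPy sketch of this forward pass, assuming tanh hidden activations and a sigmoid output (the slides only give the shapes; the weights here are random stand-ins, not trained values):

import numpy as np

X = np.random.rand(1, 3)              # input X: (1, 3)
W1 = np.random.rand(3, 4)             # 1st hidden layer weights: (3, 4)
h1 = np.tanh(X @ W1)                  # h1: (1, 4)
W2 = np.random.rand(4, 4)             # 2nd hidden layer weights: (4, 4)
h2 = np.tanh(h1 @ W2)                 # h2: (1, 4)
Wo = np.random.rand(4, 1)             # output weights: (4, 1)
Y = 1 / (1 + np.exp(-(h2 @ Wo)))      # Y: (1, 1)
print(Y.shape)                        # (1, 1)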
[Figure: training objective — a loss compares the predicted outputs y_p with the gold labels y_g]
1. Word Embeddings
Example word features: Proper Noun (part of speech), Ccccc (word shape), ['p', 'a', 'p', 'a', …, 'i', 'o', 'u'] (characters)
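Since the embedding slides themselves are image-based, here is a small hedged illustration of what pre-trained word vectors provide, using gensim (the vector file name is an assumption for illustration; any word2vec-format file works):

from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec text format (hypothetical path)
vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.w2v.txt')

# Nearest neighbours in the embedding space
print(vectors.most_similar('contract', topn=5))

# The classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))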
2. Recurrent Neural Networks
3. Self-Attention Mechanism
Word Representations w[t]        Attention Scores a[t]    Weighted aw[t] = a[t] × w[t]
w1 = It       1.2  0.4  0.2      a1 = 0.1                 0.12  0.04  0.02
w2 = was      0.3  0.3  0.2      a2 = 0.1                 0.15  0.15  0.01
w3 = a        0.1  0.2  0.2      a3 = 0.1                 0.01  0.02  0.02
w4 = bad      1.5  0.8  0.2      a4 = 0.5                 0.75  0.40  0.10
w5 = movie    1.2  1.9  0.2      a5 = 0.2                 0.24  0.38  0.04

Sentence Representation ∑aw[t] = [1.27  0.99  0.19]
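The same computation as a NumPy sketch (values copied from the table above; in a real model the attention scores come from a learned scoring function followed by a softmax, rather than being given):

import numpy as np

# Word representations w[t], one row per token: It, was, a, bad, movie
w = np.array([[1.2, 0.4, 0.2],
              [0.3, 0.3, 0.2],
              [0.1, 0.2, 0.2],
              [1.5, 0.8, 0.2],
              [1.2, 1.9, 0.2]])
# Attention scores a[t]
a = np.array([0.1, 0.1, 0.1, 0.5, 0.2])

aw = a[:, None] * w          # weighted representations aw[t]
sentence = aw.sum(axis=0)    # sentence representation, the sum over aw[t]
print(sentence)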
Example label sets (extreme multi-label classification on EU legislation):
common customs tariff | tariff nomenclature | mushroom growing | tobacco | pharmaceutical product
common customs tariff | tobacco | tariff nomenclature | tobacco industry | alcoholic beverage
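For such multi-label setups, the output layer typically swaps the softmax for independent per-label sigmoids; a minimal Keras sketch under that assumption (num_labels and encoder_output are hypothetical placeholders):

from keras.layers import Dense

# One independent probability per label, instead of a distribution over mutually exclusive classes
outputs = Dense(num_labels, activation='sigmoid', name='outputs')(encoder_output)
# Train with a per-label loss, e.g. model.compile(loss='binary_crossentropy', optimizer='adam')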
4. Transformers
NOT ME!
5. Transfer Learning
[Figures: transfer learning with contextualised embeddings — ELMo combines the biLM layer outputs as a learned weighted average, ELMo[t] = γ · ∑j αj · hj[t], with softmax-normalised layer weights αj and a task-specific scale γ]
Use of NLP goodies as software
Code Time!
import spacy
# Load NLP pipeline processor
nlp = spacy.load('en')
# Parse text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico.')
# Print per-token annotations
print('{:15} {:15} {:15} {:15} {:15} {:15}'.format('TOKEN', 'NORM', 'LEMMA', 'POS', 'DEPENDENCY', 'SHAPE'))
print('======================================================================================')
for token in doc:
    print('{:15} {:15} {:15} {:15} {:15} {:15}'.format(token.text, token.norm_, token.lemma_,
                                                       token.pos_, token.dep_, token.shape_))
# Print named entities
print('{:15} {:15} {:15} {:15}'.format('ENTITY', 'START', 'END', 'LABEL'))
print('=======================================================================')
for ent in doc.ents:
    print('{:15} {:15} {:15} {:15}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load NLP pipeline processor (as above)
nlp = spacy.load('en')
events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))
# Init rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Register the patterns under the new category
matcher.add("EVENT", None, *events_patterns)

# Define the custom pipeline component
def events_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'EVENT'
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    # Extend doc.ents with the matched spans
    doc.ents += spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(events_component, after="ner")
# Process the text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico, as Congress overwhelmingly approved '
          'a border security agreement that would prevent a second damaging government shutdown.')
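A quick sanity check of the extended pipeline (the exact entity list depends on the statistical NER model, so output may vary):

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expect the rule-based spans, e.g. 'national emergency EVENT' and
# 'government shutdown EVENT', alongside the model's own entities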
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense
from keras.models import Model
# Word Ids
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
# Word Embeddings
embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='word_embeddings_dropout')(embeddings)
# Stack of bidirectional LSTMs
# 1st bi-LSTM
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
# 2nd bi-LSTM
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output Layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)
Layer (type) Output Shape Param #
==============================================================================================================
word_ids (InputLayer) (None, None ) 0
______________________________________________________________________________________________________________
word_embeddings (Embedding) (None, None, 200) 80000000
______________________________________________________________________________________________________________
word_embeddings_dropout (Dropout) (None, None, 200) 0
______________________________________________________________________________________________________________
1st_bilstms (Bidirectional) (None, None, 200) 240800
______________________________________________________________________________________________________________
2nd_bilstms (Bidirectional) (None, 200) 240800
______________________________________________________________________________________________________________
outputs (Dense) (None, 5) 1005
==============================================================================================================
Total params: 80,482,605
Trainable params: 80,482,605
Non-trainable params: 0
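To actually train this classifier, compile and fit it; a hedged sketch (the optimizer, batch size and the x_train / y_train arrays are illustrative assumptions, not from the slides):

# x_train: (num_examples, max_len) int32 word ids, zero-padded
# y_train: (num_examples, 5) one-hot class labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)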
[Figure: the ELMo mix — γ · ∑j αj · hj[t] — feeding the task model]
from keras.layers import Layer
import tensorflow_hub as hub
import keras.backend as K
import tensorflow as tf

class ELMo(Layer):
    def __init__(self, elmo_representation='elmo', trainable=True, **kwargs):
        self.module_output = elmo_representation
        self.trainable = trainable
        self.elmo = None
        super(ELMo, self).__init__(**kwargs)

    def build(self, input_shape):
        # Set up the TensorFlow Hub module
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2',
                               trainable=self.trainable, name="{}_module".format(self.name))
        # Assign the module's trainable weights to the model
        self.trainable_weights += K.tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ELMo, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                           as_dict=True,
                           signature='default')[self.module_output]
        return result
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense, concatenate
from keras.models import Model
# Word ids and raw texts as inputs
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
elmo_inputs = Input(shape=(1,), dtype='string', name='texts')
# Word embeddings
word_embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# ELMo embeddings as a weighted average across layers
elmo_embeddings = ELMo(name='elmo')(elmo_inputs)
# Concatenate word + ELMo embeddings
concatenated_embeddings = concatenate([word_embeddings, elmo_embeddings], axis=-1,
                                      name='concat_embeddings')
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='dropout_embeddings')(concatenated_embeddings)
# Stack of bidirectional LSTMs
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=[inputs, elmo_inputs], outputs=outputs)
# Print topology
model.summary(110)
Layer (type)                     Output Shape         Param #     Connected to
==============================================================================================================
word_ids (InputLayer)            (None, None)         0
______________________________________________________________________________________________________________
texts (InputLayer)               (None, 1)            0
______________________________________________________________________________________________________________
word_embeddings (Embedding)      (None, None, 200)    80000000    word_ids[0][0]
______________________________________________________________________________________________________________
elmo (ELMo)                      (None, None, 1024)   4           texts[0][0]
______________________________________________________________________________________________________________
concat_embeddings (Concatenate)  (None, None, 1224)   0           word_embeddings[0][0]
                                                                  elmo[0][0]
______________________________________________________________________________________________________________
dropout_embeddings (Dropout)     (None, None, 1224)   0           concat_embeddings[0][0]
______________________________________________________________________________________________________________
1st_bilstms (Bidirectional)      (None, None, 200)    1060000     dropout_embeddings[0][0]
______________________________________________________________________________________________________________
2nd_bilstms (Bidirectional)      (None, 200)          240800      1st_bilstms[0][0]
______________________________________________________________________________________________________________
outputs (Dense)                  (None, 5)            1005        2nd_bilstms[0][0]
==============================================================================================================
Total params: 81,301,809
Trainable params: 81,301,809
Non-trainable params: 0
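One practical caveat when mixing TF-Hub modules with Keras on TensorFlow 1.x: the module's variables and lookup tables must be initialised in the session before training; a common pattern (an assumption, not shown in the slides):

import tensorflow as tf
import keras.backend as K

# Initialise hub module variables and string-lookup tables
sess = K.get_session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())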
from keras.layers import Layer
import tensorflow_hub as hub
import keras.backend as K

class BERT(Layer):
    def __init__(self, trainable=True, **kwargs):
        self.bert = None
        self.trainable = trainable
        super(BERT, self).__init__(**kwargs)

    def build(self, input_shape):
        # Set up the TensorFlow Hub module
        self.bert = hub.Module('https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1',
                               trainable=self.trainable, name="{}_module".format(self.name))
        # Skip the unused masked-LM head ('/cls/' variables) and assign trainable parameters
        self.trainable_weights += [var for var in self.bert.variables
                                   if "/cls/" not in var.name]
        super(BERT, self).build(input_shape)

    def call(self, x, mask=None):
        # Split the stacked inputs into (word, mask, segment) ids
        splits = K.tf.split(x, num_or_size_splits=3, axis=2)
        input_ids, input_mask, segment_ids = [K.tf.squeeze(s, axis=-1) for s in splits]
        result = self.bert(dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
                           as_dict=True,
                           signature='tokens')['pooled_output']
        return result
from keras.models import Model
from keras.layers import Input, Dropout, Dense
# BERT inputs: (word, mask, segment) ids stacked along the last axis
inputs = Input(shape=(None, 3), dtype='int32', name='token_ids')
cls = BERT(name='bert')(inputs)
dropped_cls = Dropout(rate=0.1, name='dropout_bert_cls')(cls)
outputs = Dense(5, activation='softmax', name='outputs')(dropped_cls)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)

Layer (type)                     Output Shape         Param #
==============================================================================================================
token_ids (InputLayer)           (None, None, 3)      0
______________________________________________________________________________________________________________
bert (BERT)                      (None, 768)          109482240
______________________________________________________________________________________________________________
dropout_bert_cls (Dropout)       (None, 768)          0
______________________________________________________________________________________________________________
outputs (Dense)                  (None, 5)            3845
==============================================================================================================
Total params: 109,486,085
Trainable params: 109,486,085
Non-trainable params: 0
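The BERT layer above expects the three input tensors stacked along a final axis of size 3; a hedged sketch of building such an input for one sentence (the ids below are illustrative placeholders, not real WordPiece ids):

import numpy as np

max_len = 8
word_ids = [101, 2057, 2293, 17953, 2361, 102, 0, 0]   # [CLS] ... [SEP] + padding (illustrative)
input_mask = [1, 1, 1, 1, 1, 1, 0, 0]                  # 1 = real token, 0 = padding
segment_ids = [0] * max_len                            # single-sentence input

# Stack into the (batch, max_len, 3) layout expected by the 'token_ids' input
x = np.stack([word_ids, input_mask, segment_ids], axis=-1)[None, ...]
print(x.shape)   # (1, 8, 3)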
Cognitiv+ GrayBox - Rapid deployment of NLP models for business document analysis
Cognitiv+ Annotator
● Our annotator preserves layout information, which is valuable when annotating business documents, and is easy to use.
Cognitiv+ Document Insights View
● New documents (.pdf, .doc) can be uploaded to the platform, digitized, and previewed in a near-identical rendering of their original format.
Cognitiv+ Document Insights View
● Insights are extracted automatically by all general-purpose models, as well as by any custom model trained with our annotator and GrayBox.
Cognitiv+ Document Index View
● The document structure (index) is also extracted automatically to improve the user's browsing experience.
Questions?