9th PyData Athens Meetup
Tuesday, 23 April
18:30 - 19:30
The Cube Athens
Industry
Data scientist, Cognitiv+
● Study use cases for contract analytics
● Train models to extract:
○ the document structure,
○ keystone insights (data points),
○ keystone actions (mechanisms)
Head of Data Science, Cognitiv+
● Explore new markets (leases, financial documents,
regulatory compliance)
● Investigate visual aspects of document processing to improve
data cleansing / document formatting.
Academia
Research Associate, National and Kapodistrian University
of Athens
● Legislative knowledge representation as Linked Data
● Model relations between structured documents
● Nomothesia Project: Open Data Platform for Greek
Legislation
PhD Candidate, Athens University of Economics and Business
● Contract Entity Extraction and Structure Identification
● Sentence Classification in contractual clauses
● Extreme Multi-label Classification on EU legislation
● Judicial Decision Prediction
● Large-scale aux. pre-training for unsupervised learning
Donald Trump [PERSON] left White House [ORGANIZATION] and visited China [GPE]
for his new re-election campaign, as the candidate of the Republican Party [NORP].
● …
X = (1, 3)
W1 = (3, 4)
h1 = (1, 4)
W2 = (4, 4)
h2 = (1, 4)
Wo = (4, 1)
Y = (1, 1)
yp yg
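The shapes above can be checked with a small forward pass. This is a minimal sketch in plain Python, not the speakers' code: the slides only give the tensor shapes, so the tanh and sigmoid activations, the random weight values, and the helper names (`matmul`, `shape`, `random_matrix`) are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


def random_matrix(rows, cols, rng):
    """Illustrative random weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
X = random_matrix(1, 3, rng)    # one example with three features
W1 = random_matrix(3, 4, rng)   # first hidden layer weights
W2 = random_matrix(4, 4, rng)   # second hidden layer weights
Wo = random_matrix(4, 1, rng)   # output layer weights

# Assumed activations: tanh for the hidden layers, sigmoid for the output
h1 = [[math.tanh(v) for v in row] for row in matmul(X, W1)]
h2 = [[math.tanh(v) for v in row] for row in matmul(h1, W2)]
Y = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(h2, Wo)]

print(shape(h1), shape(h2), shape(Y))
```

Each matrix product keeps the single row of X and takes the column count of the weight matrix, which is why h1 and h2 are (1, 4) and Y is (1, 1).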
1. Word Embeddings
Proper Noun, Ccccc, [‘p’, ‘a’, ‘p’, ‘a’, …, ‘i’, ‘o’, ‘u’]
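The line above lists three kinds of word-level features: a part-of-speech tag, a word shape, and the character sequence. As a rough sketch of how the last two could be derived, the snippet below uses a hypothetical `word_shape` helper; the character classes ('C', 'c', 'd'), the cap of four repeated symbols, and the example word are assumptions chosen to reproduce the "Ccccc" pattern, not the speakers' implementation.

```python
def word_shape(word, max_run=4):
    """Map characters to classes: 'C' uppercase, 'c' lowercase, 'd' digit, others kept.

    Runs of the same class are truncated to max_run symbols (an assumption).
    """
    shape = []
    run = 0
    for ch in word:
        if ch.isupper():
            cls = 'C'
        elif ch.islower():
            cls = 'c'
        elif ch.isdigit():
            cls = 'd'
        else:
            cls = ch
        # Count how long the current run of identical classes is
        run = run + 1 if shape and shape[-1] == cls else 1
        if run <= max_run:
            shape.append(cls)
    return ''.join(shape)


word = "Papadimitriou"            # hypothetical example word
shape_feature = word_shape(word)  # word shape feature
characters = list(word.lower())   # character sequence feature

print(shape_feature, characters)
```

The part-of-speech tag would come from a tagger such as the spaCy pipeline shown later in the talk.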
2. Recurrent Neural Networks
3. Self-Attention Mechanism
Word Representations w[t], Attention Scores a[t], Weighted aw[t] = w[t] × a[t]

word          w[t]            a[t]    aw[t]
w1 = It       1.2 0.4 0.2     0.1     0.12 0.04 0.02
w2 = was      0.3 0.3 0.2     0.1     0.03 0.03 0.02
w3 = a        0.1 0.2 0.2     0.1     0.01 0.02 0.02
w4 = bad      1.5 0.8 0.2     0.5     0.75 0.40 0.10
w5 = movie    1.2 1.9 0.2     0.2     0.24 0.38 0.04

Sentence Representation ∑aw[t] = 1.15 0.87 0.20
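The attention computation on these slides is a weighted sum: each word representation w[t] is scaled by its attention score a[t], and the scaled vectors are summed into one sentence representation. The sketch below recomputes it in plain Python with the slide's word representations and the attention scores 0.1, 0.1, 0.1, 0.5, 0.2; the `attend` helper is a hypothetical name, and in a real model the scores would be produced by a learned scoring function followed by a softmax rather than fixed by hand.

```python
words = ["It", "was", "a", "bad", "movie"]
representations = [
    [1.2, 0.4, 0.2],
    [0.3, 0.3, 0.2],
    [0.1, 0.2, 0.2],
    [1.5, 0.8, 0.2],
    [1.2, 1.9, 0.2],
]
scores = [0.1, 0.1, 0.1, 0.5, 0.2]  # one attention score per word, summing to 1


def attend(representations, scores):
    """Scale each word vector by its score, then sum the scaled vectors per dimension."""
    weighted = [[a * v for v in w] for w, a in zip(representations, scores)]
    sentence = [sum(column) for column in zip(*weighted)]
    return weighted, sentence


weighted, sentence = attend(representations, scores)
for word, row in zip(words, weighted):
    print(word, [round(v, 2) for v in row])
print("sentence", [round(v, 2) for v in sentence])
```

The word "bad" carries half of the attention mass, so its representation dominates the sentence vector.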
common customs tariff | tariff nomenclature | mushroom growing | tobacco | pharmaceutical product
common customs tariff | tobacco | tariff nomenclature | tobacco industry | alcoholic beverage
5. Transfer Learning
Use of NLP goodies as software
Code Time!
import spacy

# Load NLP pipeline processor ('en' is the spaCy 2.x shortcut; spaCy 3 uses 'en_core_web_sm')
nlp = spacy.load('en')

# Parse Text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding his '
          'long-promised border wall with Mexico.')

# Print token-level annotations
print('{:15} {:15} {:15} {:15} {:15} {:15}'.format('TOKEN', 'NORM', 'LEMMA', 'POS', 'DEPENDENCY', 'SHAPE'))
print('=' * 95)
for token in doc:
    print('{:15} {:15} {:15} {:15} {:15} {:15}'.format(token.text, token.norm_, token.lemma_,
                                                       token.pos_, token.dep_, token.shape_))

# Print Named Entities
print('{:15} {:15} {:15} {:15}'.format('ENTITY', 'START', 'END', 'LABEL'))
print('=' * 63)
for ent in doc.ents:
    print('{:15} {:15} {:15} {:15}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load NLP pipeline processor (spaCy 2.x API, as in the previous example)
nlp = spacy.load('en')

events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))

# Init rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Fill new category information
matcher.add("EVENT", None, *events_patterns)

# Define the custom component
def events_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'EVENT'
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    # Append the matched spans to doc.ents (doc.ents is a tuple, so build a new list;
    # the spans must not overlap entities already found by the 'ner' component)
    doc.ents = list(doc.ents) + spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(events_component, after="ner")

# Process the text and print the text and label for the doc.ents
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding his '
          'long-promised border wall with Mexico, as Congress overwhelmingly approved a border '
          'security agreement that would prevent a second damaging government shutdown.')
print([(ent.text, ent.label_) for ent in doc.ents])
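To make the matcher's output concrete, here is a simplified, pure-Python illustration of phrase matching over whitespace tokens. It is not spaCy's implementation (spaCy matches against its own tokenization and vocabulary); the `find_phrases` helper and the shortened example sentence are hypothetical, and the sketch only mimics the `(start, end)` token offsets that the component above turns into `Span` objects.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end, phrase) triples for exact token-sequence matches.

    start is inclusive and end is exclusive, as with token offsets of a span.
    """
    matches = []
    for phrase in phrases:
        words = phrase.split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                matches.append((start, start + len(words), phrase))
    return sorted(matches)


text = ("Donald Trump has vowed to declare a national emergency "
        "to prevent a second damaging government shutdown")
tokens = text.split()  # naive whitespace tokenization, for illustration only
events = ["national elections", "national emergency", "government shutdown"]

matches = find_phrases(tokens, events)
for start, end, phrase in matches:
    print(start, end, ' '.join(tokens[start:end]))
```

Only two of the three event phrases occur in the sentence, so "national elections" produces no match.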
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense
from keras.models import Model

# Word Ids
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
# Word Embeddings
embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='word_embeddings_dropout')(embeddings)
# Stack of bidirectional LSTMs
# 1st bi-LSTM
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
# 2nd bi-LSTM
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output Layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)

Layer (type)                         Output Shape           Param #
=====================================================================
word_ids (InputLayer)                (None, None)           0
word_embeddings (Embedding)          (None, None, 200)      80000000
word_embeddings_dropout (Dropout)    (None, None, 200)      0
1st_bilstms (Bidirectional)          (None, None, 200)      240800
2nd_bilstms (Bidirectional)          (None, 200)            240800
outputs (Dense)                      (None, 5)              1005
=====================================================================
Total params: 80,482,605
Trainable params: 80,482,605
Non-trainable params: 0
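The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the layer sizes (400000 × 200 embeddings, 100 LSTM units per direction, 5 output classes) are taken from the model definition.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim * units + units * units + units)


def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


layers = {
    'word_embeddings': embedding_params(400000, 200),
    '1st_bilstms': bilstm_params(200, 100),      # input: 200-dim embeddings
    '2nd_bilstms': bilstm_params(2 * 100, 100),  # input: concatenated directions
    'outputs': dense_params(2 * 100, 5),
}
total = sum(layers.values())

for name, count in layers.items():
    print('{:20} {:>12,}'.format(name, count))
print('{:20} {:>12,}'.format('total', total))
```

Almost all of the parameters sit in the embedding matrix, which is why pre-trained word embeddings matter so much for a model of this size.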
from keras.layers import Layer
import tensorflow_hub as hub
import keras.backend as K
import tensorflow as tf

class ELMo(Layer):
    def __init__(self, elmo_representation='elmo', trainable=True, **kwargs):
        self.module_output = elmo_representation
        self.trainable = trainable
        self.elmo = None
        super(ELMo, self).__init__(**kwargs)

    def build(self, input_shape):
        # SetUp tensorflow Hub module
        # (the module URL is missing from the source slides, so it is left empty here)
        self.elmo = hub.Module('', trainable=self.trainable,
                               name="{}_module".format(self.name))
        # Assign module's trainable weights to model
        # (this call is truncated in the source; tf.trainable_variables(scope=...) and
        # self.name are a reconstruction, assuming TensorFlow 1.x)
        self.trainable_weights += tf.trainable_variables(
            scope="^{}_module/.*".format(self.name))
        super(ELMo, self).build(input_shape)

    def call(self, x, mask=None):
        # Cast the (batch, 1) text input to strings and drop the singleton axis
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                           as_dict=True, signature='default')[self.module_output]
        return result
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense, concatenate
from keras.models import Model

# Word Ids
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
# Raw texts for ELMo
elmo_inputs = Input(shape=(1,), dtype='string', name='texts')
# Word Embeddings
word_embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# ELMo embeddings as weighted average across layers
elmo_embeddings = ELMo(name='elmo')(elmo_inputs)
# Concat Word + ELMo embeddings
concatenated_embeddings = concatenate([word_embeddings, elmo_embeddings], axis=-1,
                                      name='concat_embeddings')
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='dropout_embeddings')(concatenated_embeddings)
# Stack of bidirectional LSTMs
# 1st bi-LSTM
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
# 2nd bi-LSTM
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output Layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=[inputs, elmo_inputs], outputs=outputs)
# Print topology
model.summary(110)

Layer (type)                       Output Shape           Param #     Connected to
=====================================================================================================
word_ids (InputLayer)              (None, None)           0
texts (InputLayer)                 (None, 1)              0
word_embeddings (Embedding)        (None, None, 200)      80000000    word_ids[0][0]
elmo (ELMo)                        (None, None, 1024)     4           texts[0][0]
concat_embeddings (Concatenate)    (None, None, 1224)     0           word_embeddings[0][0], elmo[0][0]
dropout_embeddings (Dropout)       (None, None, 1224)     0           concat_embeddings[0][0]
1st_bilstms (Bidirectional)        (None, None, 200)      1060000     dropout_embeddings[0][0]
2nd_bilstms (Bidirectional)        (None, 200)            240800      1st_bilstms[0][0]
outputs (Dense)                    (None, 5)              1005        2nd_bilstms[0][0]
=====================================================================================================
Total params: 81,301,809
Trainable params: 81,301,809
Non-trainable params: 0
from keras.layers import Layer, Lambda
import tensorflow_hub as hub
import keras.backend as K
import tensorflow as tf

class BERT(Layer):
    def __init__(self, trainable=True, **kwargs):
        self.bert = None
        self.trainable = trainable
        super(BERT, self).__init__(**kwargs)

    def build(self, input_shape):
        # SetUp tensorflow Hub module
        # (the module URL is missing from the source slides, so it is left empty here)
        self.bert = hub.Module('', trainable=self.trainable,
                               name="{}_module".format(self.name))
        # Remove unused layers and assign trainable parameters
        self.trainable_weights += [var for var in self.bert.variables
                                   if "/cls/" not in var.name]
        super(BERT, self).build(input_shape)

    def call(self, x, mask=None):
        # Split inputs into (word, mask, segment) ids
        # (the lambda bodies are truncated in the source; tf.split and K.squeeze
        # are a reconstruction)
        splits = Lambda(lambda k: tf.split(k, num_or_size_splits=3, axis=2))(x)
        inputs = [Lambda(lambda s: K.squeeze(s, axis=-1))(splits[i]) for i in range(len(splits))]
        result = self.bert(dict(input_ids=inputs[0], input_mask=inputs[1], segment_ids=inputs[2]),
                           as_dict=True, signature='tokens')['pooled_output']
        return result
from keras.models import Model
from keras.layers import Input, Dropout, Dense

# Load BERT pre-split IDs: one (word, mask, segment) triplet per token
inputs = Input(shape=(None, 3), dtype='int32', name='token_ids')
# Pooled [CLS] representation
cls = BERT(name='bert')(inputs)
dropped_cls = Dropout(rate=0.1, name='dropout_bert_cls')(cls)
# Output Layer
outputs = Dense(5, activation='softmax', name='outputs')(dropped_cls)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)

Layer (type)                   Output Shape         Param #
==============================================================
token_ids (InputLayer)         (None, None, 3)      0
bert (BERT)                    (None, 768)          109482240
dropout_bert_cls (Dropout)     (None, 768)          0
outputs (Dense)                (None, 5)            3845
==============================================================
Total params: 109,486,085
Trainable params: 109,486,085
Non-trainable params: 0
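The model above expects each token as a (word, mask, segment) triplet, giving an input of shape (sequence length, 3) per example. The following is a rough sketch of how such an input could be assembled in plain Python; `pack_bert_inputs` is a hypothetical helper, the token ids are illustrative placeholders rather than the output of a real tokenizer, and padding with zeros is an assumption.

```python
def pack_bert_inputs(token_ids, max_len, pad_id=0, segment_id=0):
    """Return max_len triplets [word_id, mask, segment_id] for one text.

    mask is 1 for real tokens and 0 for padding; longer inputs are truncated.
    """
    token_ids = list(token_ids[:max_len])
    padding = max_len - len(token_ids)
    word_ids = token_ids + [pad_id] * padding
    mask = [1] * len(token_ids) + [0] * padding
    segments = [segment_id] * len(token_ids) + [0] * padding
    return [list(triplet) for triplet in zip(word_ids, mask, segments)]


# Illustrative ids only; a real pipeline would take them from the BERT tokenizer
example = pack_bert_inputs([101, 2023, 3185, 102], max_len=6)
for triplet in example:
    print(triplet)
```

Stacking the three id sequences along the last axis lets the Keras model take a single input tensor, which the layer's `call` method then splits back into its three parts.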
Cognitiv+ GrayBox - Rapid deployment of NLP models
for business document analysis
Cognitiv+ Annotator
● Our annotator preserves layout information, which is valuable when annotating
business documents, and it is easy to use.
Cognitiv+ Document Insights View
● New documents (.pdf, .doc) can be uploaded to the platform, digitized, and
previewed in a form nearly identical to their original format.
● Insights are automatically extracted by all general-use models and by any
custom model that has been trained using our annotator and GrayBox.
Cognitiv+ Document Index View
● Document structure (index) is also extracted automatically to help users
browse the document.

NLP in the Deep Learning Era: the story so far
