NLP in the Deep Learning Era: the story so far
9th PyData Athens Meetup
Tuesday, April 23
18:30 - 19:30
The Cube Athens
nlp.cs.aueb
Industry
Data Scientist, Cognitiv+
● Study use cases for contract analytics
● Train models to extract:
   ○ the document structure,
   ○ keystone insights (data points),
   ○ keystone actions (mechanisms)
Head of Data Science, Cognitiv+
● Explore new markets (leases, financial documents, regulatory compliance)
● Investigate visual aspects of document processing to improve data cleansing / document formatting
Academia
Research Associate, National and Kapodistrian University of Athens
● Legislative knowledge representation as Linked Data
● Model relations between structured documents
● Nomothesia Project: Open Data Platform for Greek Legislation (http://legislation.di.uoa.gr)
PhD Candidate, Athens University of Economics and Business
● Contract Entity Extraction and Structure Identification
● Sentence Classification in contractual clauses
● Extreme Multi-label Classification on EU legislation
● Judicial Decision Prediction
● Large-scale auxiliary pre-training for unsupervised learning
[Figures: text classification examples — spam detection (SPAM) and contract clause classification (CONVENIENCE, MATERIAL BREACH, RENEWAL, ASSISTANCE; AUTOMATIC / NON-AUTOMATIC)]
Named-entity recognition example: Donald Trump [PERSON] left White House [ORGANIZATION] and visited China [GPE] for his new re-election campaign, as the candidate of the Republican Party [NORP].
[Figure: contract entity extraction — labels include TITLE, CONTRACT PARTY, LANDLORD, TENANT, PROPERTY ADDRESS, START DATE, CONTRACT PERIOD]
[Figure: question answering over a large document collection (DOCUMENT 1, DOCUMENT 2, DOCUMENT 3, …, DOCUMENT 1683), mapping a question (Q:) to an answer (A:)]
X  = (1, 3)
W1 = (3, 4)
h1 = (1, 4)
W2 = (4, 4)
h2 = (1, 4)
Wo = (4, 1)
Y  = (1, 1)
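A minimal NumPy sketch of this forward pass, assuming tanh hidden activations and a sigmoid output (the slides only give the shapes; the weights here are random stand-ins, not trained values):

import numpy as np

X = np.random.rand(1, 3)              # input X: (1, 3)
W1 = np.random.rand(3, 4)             # 1st hidden layer weights: (3, 4)
h1 = np.tanh(X @ W1)                  # h1: (1, 4)
W2 = np.random.rand(4, 4)             # 2nd hidden layer weights: (4, 4)
h2 = np.tanh(h1 @ W2)                 # h2: (1, 4)
Wo = np.random.rand(4, 1)             # output weights: (4, 1)
Y = 1 / (1 + np.exp(-(h2 @ Wo)))      # Y: (1, 1)
print(Y.shape)                        # (1, 1)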
[Figure: training objective — a loss compares the predicted outputs y_p with the gold labels y_g]
1. Word Embeddings
Example word features: Proper Noun (part of speech), Ccccc (word shape), ['p', 'a', 'p', 'a', …, 'i', 'o', 'u'] (characters)
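Since the embedding slides themselves are image-based, here is a small hedged illustration of what pre-trained word vectors provide, using gensim (the vector file name is an assumption for illustration; any word2vec-format file works):

from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec text format (hypothetical path)
vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.w2v.txt')

# Nearest neighbours in the embedding space
print(vectors.most_similar('contract', topn=5))

# The classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))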
2. Recurrent Neural Networks
3. Self-Attention Mechanism
Word Representations w[t]        Attention Scores a[t]    Weighted aw[t] = a[t] × w[t]
w1 = It       1.2  0.4  0.2      a1 = 0.1                 0.12  0.04  0.02
w2 = was      0.3  0.3  0.2      a2 = 0.1                 0.15  0.15  0.01
w3 = a        0.1  0.2  0.2      a3 = 0.1                 0.01  0.02  0.02
w4 = bad      1.5  0.8  0.2      a4 = 0.5                 0.75  0.40  0.10
w5 = movie    1.2  1.9  0.2      a5 = 0.2                 0.24  0.38  0.04

Sentence Representation ∑aw[t] = [1.27  0.99  0.19]
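The same computation as a NumPy sketch (values copied from the table above; in a real model the attention scores come from a learned scoring function followed by a softmax, rather than being given):

import numpy as np

# Word representations w[t], one row per token: It, was, a, bad, movie
w = np.array([[1.2, 0.4, 0.2],
              [0.3, 0.3, 0.2],
              [0.1, 0.2, 0.2],
              [1.5, 0.8, 0.2],
              [1.2, 1.9, 0.2]])
# Attention scores a[t]
a = np.array([0.1, 0.1, 0.1, 0.5, 0.2])

aw = a[:, None] * w          # weighted representations aw[t]
sentence = aw.sum(axis=0)    # sentence representation, the sum over aw[t]
print(sentence)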
Example label sets (extreme multi-label classification on EU legislation):
common customs tariff | tariff nomenclature | mushroom growing | tobacco | pharmaceutical product
common customs tariff | tobacco | tariff nomenclature | tobacco industry | alcoholic beverage
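For such multi-label setups, the output layer typically swaps the softmax for independent per-label sigmoids; a minimal Keras sketch under that assumption (num_labels and encoder_output are hypothetical placeholders):

from keras.layers import Dense

# One independent probability per label, instead of a distribution over mutually exclusive classes
outputs = Dense(num_labels, activation='sigmoid', name='outputs')(encoder_output)
# Train with a per-label loss, e.g. model.compile(loss='binary_crossentropy', optimizer='adam')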
4. Transformers
NOT ME!
5. Transfer Learning
[Figures: transfer learning with contextualised embeddings — ELMo combines the biLM layer outputs as a learned weighted average, ELMo[t] = γ · ∑j αj · hj[t], with softmax-normalised layer weights αj and a task-specific scale γ]
Use of NLP goodies as software
Code Time!
import spacy
# Load NLP pipeline processor
nlp = spacy.load('en')
# Parse text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico.')
# Print per-token annotations
print('{:15} {:15} {:15} {:15} {:15} {:15}'.format('TOKEN', 'NORM', 'LEMMA', 'POS', 'DEPENDENCY', 'SHAPE'))
print('======================================================================================')
for token in doc:
    print('{:15} {:15} {:15} {:15} {:15} {:15}'.format(token.text, token.norm_, token.lemma_,
                                                       token.pos_, token.dep_, token.shape_))
# Print named entities
print('{:15} {:15} {:15} {:15}'.format('ENTITY', 'START', 'END', 'LABEL'))
print('=======================================================================')
for ent in doc.ents:
    print('{:15} {:15} {:15} {:15}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load NLP pipeline processor (as above)
nlp = spacy.load('en')
events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))
# Init rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Register the patterns under the new category
matcher.add("EVENT", None, *events_patterns)

# Define the custom pipeline component
def events_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'EVENT'
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    # Extend doc.ents with the matched spans
    doc.ents += spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(events_component, after="ner")
# Process the text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico, as Congress overwhelmingly approved '
          'a border security agreement that would prevent a second damaging government shutdown.')
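A quick sanity check of the extended pipeline (the exact entity list depends on the statistical NER model, so output may vary):

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expect the rule-based spans, e.g. 'national emergency EVENT' and
# 'government shutdown EVENT', alongside the model's own entities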
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense
from keras.models import Model
# Word Ids
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
# Word Embeddings
embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='word_embeddings_dropout')(embeddings)
# Stack of bidirectional LSTMs
# 1st bi-LSTM
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
# 2nd bi-LSTM
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output Layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)
Layer (type) Output Shape Param #
==============================================================================================================
word_ids (InputLayer) (None, None ) 0
______________________________________________________________________________________________________________
word_embeddings (Embedding) (None, None, 200) 80000000
______________________________________________________________________________________________________________
word_embeddings_dropout (Dropout) (None, None, 200) 0
______________________________________________________________________________________________________________
1st_bilstms (Bidirectional) (None, None, 200) 240800
______________________________________________________________________________________________________________
2nd_bilstms (Bidirectional) (None, 200) 240800
______________________________________________________________________________________________________________
outputs (Dense) (None, 5) 1005
==============================================================================================================
Total params: 80,482,605
Trainable params: 80,482,605
Non-trainable params: 0
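To actually train this classifier, compile and fit it; a hedged sketch (the optimizer, batch size and the x_train / y_train arrays are illustrative assumptions, not from the slides):

# x_train: (num_examples, max_len) int32 word ids, zero-padded
# y_train: (num_examples, 5) one-hot class labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)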
[Figure: the ELMo mix — γ · ∑j αj · hj[t] — feeding the task model]
from keras.layers import Layer
import tensorflow_hub as hub
import keras.backend as K
import tensorflow as tf

class ELMo(Layer):
    def __init__(self, elmo_representation='elmo', trainable=True, **kwargs):
        self.module_output = elmo_representation
        self.trainable = trainable
        self.elmo = None
        super(ELMo, self).__init__(**kwargs)

    def build(self, input_shape):
        # Set up the TensorFlow Hub module
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2',
                               trainable=self.trainable, name="{}_module".format(self.name))
        # Assign the module's trainable weights to the model
        self.trainable_weights += K.tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ELMo, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                           as_dict=True,
                           signature='default')[self.module_output]
        return result
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, Dense, concatenate
from keras.models import Model
# Word ids and raw texts as inputs
inputs = Input(shape=(None, ), dtype='int32', name='word_ids')
elmo_inputs = Input(shape=(1,), dtype='string', name='texts')
# Word embeddings
word_embeddings = Embedding(400000, 200, name='word_embeddings')(inputs)
# ELMo embeddings as a weighted average across layers
elmo_embeddings = ELMo(name='elmo')(elmo_inputs)
# Concatenate word + ELMo embeddings
concatenated_embeddings = concatenate([word_embeddings, elmo_embeddings], axis=-1,
                                      name='concat_embeddings')
# Dropout regularization over embeddings
dropped_embeddings = Dropout(rate=0.2, name='dropout_embeddings')(concatenated_embeddings)
# Stack of bidirectional LSTMs
lstms_1 = Bidirectional(LSTM(100, return_sequences=True), name='1st_bilstms')(dropped_embeddings)
lstms_2 = Bidirectional(LSTM(100), name='2nd_bilstms')(lstms_1)
# Output layer
outputs = Dense(5, activation='softmax', name='outputs')(lstms_2)
# Wrap model
model = Model(inputs=[inputs, elmo_inputs], outputs=outputs)
# Print topology
model.summary(110)
Layer (type)                     Output Shape         Param #     Connected to
==============================================================================================================
word_ids (InputLayer)            (None, None)         0
______________________________________________________________________________________________________________
texts (InputLayer)               (None, 1)            0
______________________________________________________________________________________________________________
word_embeddings (Embedding)      (None, None, 200)    80000000    word_ids[0][0]
______________________________________________________________________________________________________________
elmo (ELMo)                      (None, None, 1024)   4           texts[0][0]
______________________________________________________________________________________________________________
concat_embeddings (Concatenate)  (None, None, 1224)   0           word_embeddings[0][0]
                                                                  elmo[0][0]
______________________________________________________________________________________________________________
dropout_embeddings (Dropout)     (None, None, 1224)   0           concat_embeddings[0][0]
______________________________________________________________________________________________________________
1st_bilstms (Bidirectional)      (None, None, 200)    1060000     dropout_embeddings[0][0]
______________________________________________________________________________________________________________
2nd_bilstms (Bidirectional)      (None, 200)          240800      1st_bilstms[0][0]
______________________________________________________________________________________________________________
outputs (Dense)                  (None, 5)            1005        2nd_bilstms[0][0]
==============================================================================================================
Total params: 81,301,809
Trainable params: 81,301,809
Non-trainable params: 0
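One practical caveat when mixing TF-Hub modules with Keras on TensorFlow 1.x: the module's variables and lookup tables must be initialised in the session before training; a common pattern (an assumption, not shown in the slides):

import tensorflow as tf
import keras.backend as K

# Initialise hub module variables and string-lookup tables
sess = K.get_session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())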
from keras.layers import Layer
import tensorflow_hub as hub
import keras.backend as K

class BERT(Layer):
    def __init__(self, trainable=True, **kwargs):
        self.bert = None
        self.trainable = trainable
        super(BERT, self).__init__(**kwargs)

    def build(self, input_shape):
        # Set up the TensorFlow Hub module
        self.bert = hub.Module('https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1',
                               trainable=self.trainable, name="{}_module".format(self.name))
        # Skip the unused masked-LM head ('/cls/' variables) and assign trainable parameters
        self.trainable_weights += [var for var in self.bert.variables
                                   if "/cls/" not in var.name]
        super(BERT, self).build(input_shape)

    def call(self, x, mask=None):
        # Split the stacked inputs into (word, mask, segment) ids
        splits = K.tf.split(x, num_or_size_splits=3, axis=2)
        input_ids, input_mask, segment_ids = [K.tf.squeeze(s, axis=-1) for s in splits]
        result = self.bert(dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
                           as_dict=True,
                           signature='tokens')['pooled_output']
        return result
from keras.models import Model
from keras.layers import Input, Dropout, Dense
# BERT inputs: (word, mask, segment) ids stacked along the last axis
inputs = Input(shape=(None, 3), dtype='int32', name='token_ids')
cls = BERT(name='bert')(inputs)
dropped_cls = Dropout(rate=0.1, name='dropout_bert_cls')(cls)
outputs = Dense(5, activation='softmax', name='outputs')(dropped_cls)
# Wrap model
model = Model(inputs=inputs, outputs=outputs)
# Print topology
model.summary(110)

Layer (type)                     Output Shape         Param #
==============================================================================================================
token_ids (InputLayer)           (None, None, 3)      0
______________________________________________________________________________________________________________
bert (BERT)                      (None, 768)          109482240
______________________________________________________________________________________________________________
dropout_bert_cls (Dropout)       (None, 768)          0
______________________________________________________________________________________________________________
outputs (Dense)                  (None, 5)            3845
==============================================================================================================
Total params: 109,486,085
Trainable params: 109,486,085
Non-trainable params: 0
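The BERT layer above expects the three input tensors stacked along a final axis of size 3; a hedged sketch of building such an input for one sentence (the ids below are illustrative placeholders, not real WordPiece ids):

import numpy as np

max_len = 8
word_ids = [101, 2057, 2293, 17953, 2361, 102, 0, 0]   # [CLS] ... [SEP] + padding (illustrative)
input_mask = [1, 1, 1, 1, 1, 1, 0, 0]                  # 1 = real token, 0 = padding
segment_ids = [0] * max_len                            # single-sentence input

# Stack into the (batch, max_len, 3) layout expected by the 'token_ids' input
x = np.stack([word_ids, input_mask, segment_ids], axis=-1)[None, ...]
print(x.shape)   # (1, 8, 3)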
Cognitiv+ GrayBox - Rapid deployment of NLP models for business document analysis
Cognitiv+ Annotator
● Our annotator preserves layout information, which is valuable when annotating business documents, and is easy to use.
Cognitiv+ Document Insights View
● New documents (.pdf, .doc) can be uploaded to the platform, digitized, and previewed in a near-identical rendering of their original format.
Cognitiv+ Document Insights View
● Insights are extracted automatically by all general-purpose models, as well as by any custom model trained with our annotator and GrayBox.
Cognitiv+ Document Index View
● The document structure (index) is also extracted automatically to improve the user's browsing experience.
Questions?