4.
Industry
Data Scientist, Cognitiv+
● Study use cases for contract analytics
● Train models to extract:
○ the document structure,
○ keystone insights (data points),
○ keystone actions (mechanisms)
Head of Data Science, Cognitiv+
● Explore new markets (leases, financial documents,
regulatory compliance)
● Investigate visual aspects of document processing to improve
data cleansing / document formatting.
Academia
Research Associate, National and Kapodistrian University
of Athens
● Legislative knowledge representation as Linked Data
● Model relations between structured documents
● Nomothesia Project: Open Data Platform for Greek
Legislation (http://legislation.di.uoa.gr)
PhD Candidate, Athens University of Economics and Business
● Contract Entity Extraction and Structure Identification
● Sentence Classification in contractual clauses
● Extreme Multi-label Classification on EU legislation
● Judicial Decision Prediction
● Large-scale auxiliary pre-training for unsupervised learning
10.
Example of named entity annotations: Donald Trump (PERSON) left the White House (ORGANIZATION) and visited China (GPE) for his new re-election campaign, as the candidate of the Republican Party (NORP).
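This inline entity markup is the kind of output spaCy's displaCy visualizer produces. A minimal sketch, assuming the en_core_web_sm model is installed (exact labels depend on the model; e.g. spaCy's English models use ORG rather than ORGANIZATION):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Donald Trump left the White House and visited China for his '
          'new re-election campaign, as the candidate of the Republican Party.')
# Render named entities as inline highlighted markup
# (returns an HTML string when run outside a Jupyter notebook)
html = displacy.render(doc, style='ent')
print(html)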
64.
import spacy
# Load the English NLP pipeline
# (model must be installed: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Parse text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico.')
65.
import spacy
# Load the English NLP pipeline
nlp = spacy.load('en_core_web_sm')
# Parse text
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico.')
# Print per-token attributes in aligned columns
print('{:15} {:15} {:15} {:15} {:15} {:15}'.format('TOKEN', 'NORM', 'LEMMA', 'POS', 'DEPENDENCY', 'SHAPE'))
print('=' * 90)
for token in doc:
    print('{:15} {:15} {:15} {:15} {:15} {:15}'.format(token.text, token.norm_, token.lemma_,
                                                       token.pos_, token.dep_, token.shape_))
67.
# Print named entities with character offsets and labels
print('{:15} {:15} {:15} {:15}'.format('ENTITY', 'START', 'END', 'LABEL'))
print('=' * 70)
for ent in doc.ents:
    print('{:15} {:<15} {:<15} {:15}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))
69.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
events = ["national elections", "national emergency", "government shutdown"]
# Convert each phrase to a Doc to use as a match pattern
events_patterns = list(nlp.pipe(events))
# Initialize the rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Register the patterns under the new 'EVENT' label
matcher.add("EVENT", None, *events_patterns)
70.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))
# Initialize the rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Register the patterns under the new 'EVENT' label
matcher.add("EVENT", None, *events_patterns)

# Define the custom pipeline component
def events_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'EVENT'
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    # Add the matched spans to the existing entities
    doc.ents = list(doc.ents) + spans
    return doc
71.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))
# Initialize the rule-based matcher
matcher = PhraseMatcher(nlp.vocab)
# Register the patterns under the new 'EVENT' label
matcher.add("EVENT", None, *events_patterns)

# Define the custom pipeline component
def events_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'EVENT'
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    # Add the matched spans to the existing entities
    doc.ents = list(doc.ents) + spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(events_component, after="ner")
# Process the text and print the text and label of each entity
doc = nlp('Donald Trump has vowed to declare a national emergency as a way of funding '
          'his long-promised border wall with Mexico, as Congress overwhelmingly approved '
          'a border security agreement that would prevent a second damaging government shutdown.')
for ent in doc.ents:
    print(ent.text, ent.label_)
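The snippets above follow the spaCy 2.x API (matcher.add with an on_match argument, add_pipe with a function object). Under spaCy 3.x, components are registered by name and PhraseMatcher.add takes the pattern list directly; a minimal sketch of the same EVENT component, assuming spaCy >= 3.0 and the same en_core_web_sm model:

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
events = ["national elections", "national emergency", "government shutdown"]
events_patterns = list(nlp.pipe(events))
matcher = PhraseMatcher(nlp.vocab)
# spaCy 3.x: patterns are passed as a list, with no on_match positional argument
matcher.add("EVENT", events_patterns)

# spaCy 3.x: components are registered by a string name
@Language.component("events_component")
def events_component(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="EVENT") for match_id, start, end in matches]
    doc.ents = list(doc.ents) + spans
    return doc

# spaCy 3.x: components are added to the pipeline by their registered name
nlp.add_pipe("events_component", after="ner")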
95.
Cognitiv+ Annotator
● Our annotator preserves layout information, which is particularly valuable when
annotating business documents, and is easy to use.
96.
Cognitiv+ Document Insights View
● New documents (.pdf, .doc) can be uploaded to the platform, where they are digitized
and previewed in a format nearly identical to the original.
97.
Cognitiv+ Document Insights View
● Insights are extracted automatically by all general-purpose models, as well as by any
custom model trained with our annotator and graybox.
98.
Cognitiv+ Document Index View
● The document structure (index) is also extracted automatically to improve the user's
browsing experience.