A Novel Method and Architecture for Law Processing, Utilising High Performance Computing Infrastructures

Yannis Charalabidis, University of the Aegean, Greece – yannisx@aegean.gr
Michalis Loutsaris, University of the Aegean, Greece – mloutsaris@aegean.gr
Samos, July 2019

Presentation Structure
• The Manylaws Processing Flow & Outputs: a novel method for extracting data,
relations and meaning from the law
• The most important processing steps explained
• The Manylaws Architecture, for allowing parallel processing over High
Performance Computing infrastructures

The size of the problem (or why we need HPC)
The information to be acquired, over the internet only and primarily through web services where available,
comprises:
• All the legal artefacts published by the European Parliament, the European Commission, the EU Council
(EURlex, EUDOR)
• All the legal artefacts published by the 28 national parliaments, as national laws, in English and/or other languages
• News published in EU member states, concerning legal events (e.g. law publication, draft law deliberation, EU
directive publication)
• Other administration-generated content (e.g. local communications, regulations)
• Other citizen-generated relevant content (e.g. blogs, newsletters, social media posts)
We estimate that the above database will contain more than 1 trillion words in 21 different languages,
corresponding to about 10 million “volumes” of classical books, with another 5,000 such “volumes” added
for study every day.
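A back-of-the-envelope check of these figures can be sketched as follows. It assumes that every "volume" has the same size and takes 1 trillion words as a lower bound; neither assumption is stated on the slide.

```python
# Figures quoted on the slide.
TOTAL_WORDS = 1_000_000_000_000   # "more than 1 trillion words" (used as a lower bound)
TOTAL_VOLUMES = 10_000_000        # "about 10 million volumes"
DAILY_VOLUMES = 5_000             # volumes added per day

# Assumption: all volumes are equally long.
words_per_volume = TOTAL_WORDS // TOTAL_VOLUMES
daily_words = words_per_volume * DAILY_VOLUMES

print(f"{words_per_volume:,} words per volume")
print(f"{daily_words:,} new words per day")
```

Under these assumptions a "volume" holds roughly 100,000 words, so the daily intake is on the order of 500 million words.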

Legal Text Mining – The Manylaws Process
Law Acquisition → Law Preprocessing → Metadata Extraction → Law Decomposition → Law Correlation → Parts of Speech Extraction → N-Grams Creation → Translation → JSON File Generation

Legal Text Mining – The Manylaws Outputs
• Law Acquisition (bulk download, or retrieval via API / crawler)
• Law Preprocessing (Rapidminer Trigger, Convert PDF to Text)
• Metadata Extraction (Get the title of the law, Get the number of the law, Get the year of the law, Get the
topic of the law, Get 10 more attributes from Law Source)
• Law Decomposition (Extract Sections, Extract Parts, Extract Chapters, Extract Articles, Extract Paragraphs,
Extract Sub Paragraphs, Extract Clauses, Extract Sentences)
• Law Correlation (Extract Law Number, Extract Presidential Decree Identifier, Extract Ministerial Decree
Identifier, Extract Constitution Article Number, Extract Circular Identifier, Extract Regulation Identifier,
Extract Act of Legislative Content Identifier, Extract Directive Number)

Legal Text Mining – The Manylaws Outputs
• Parts of Speech Extraction (Extract Nouns, Extract Adjectives, Extract Verbs, Extract
Adverbs)
• N-Grams Creation (Adjective + Noun, Noun + Adjective, Noun + Verb + Noun, Adjective +
Noun + Verb + Adjective + Noun, Adjective + Noun + Noun)
• Translation (Word Translation, Phrase Translation)
• JSON File Generation
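The N-Grams Creation step listed on this slide can be sketched as pattern matching over part-of-speech tags. This is a minimal illustration, not the Manylaws implementation: the tag names (`ADJ`, `NOUN`, `VERB`) and the function `extract_ngrams` are hypothetical, and the input is assumed to be already tagged.

```python
# The five patterns named on the slide, written as tag sequences.
PATTERNS = [
    ("ADJ", "NOUN"),
    ("NOUN", "ADJ"),
    ("NOUN", "VERB", "NOUN"),
    ("ADJ", "NOUN", "VERB", "ADJ", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
]


def extract_ngrams(tagged):
    """Return every word sequence whose tags match one of PATTERNS.

    `tagged` is a list of (word, tag) pairs in sentence order.
    """
    words = [word for word, _ in tagged]
    tags = tuple(tag for _, tag in tagged)
    found = []
    for pattern in PATTERNS:
        size = len(pattern)
        # Slide a window of the pattern's length over the tag sequence.
        for start in range(len(tags) - size + 1):
            if tags[start:start + size] == pattern:
                found.append(" ".join(words[start:start + size]))
    return found


tagged_sentence = [
    ("national", "ADJ"), ("authority", "NOUN"), ("issues", "VERB"),
    ("fishing", "ADJ"), ("licences", "NOUN"),
]
print(extract_ngrams(tagged_sentence))
```

Results are grouped by pattern rather than by position, which keeps the sketch short; a production version would probably record the offsets as well.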

Law Acquisition
• Web scraper for et.gr
• HEP API calls for additional metadata and …..

Law Preprocessing – Rapidminer Trigger
Each country has its own repository, which triggers the Rapidminer process.

Law Preprocessing - Convert PDF to plain text
• Remove new lines
• Replace English Characters with Greek Characters
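The two clean-up steps above can be sketched as follows. The look-alike table and the rule "only fix words that already contain a Greek letter" are assumptions made for this illustration; the slides do not say which characters are replaced or how. The names `clean_pdf_text` and `fix_word` are hypothetical.

```python
import re
import unicodedata

# Assumed look-alike pairs: Latin capital -> name of the Greek capital it resembles.
LOOKALIKES = {
    "A": "ALPHA", "B": "BETA", "E": "EPSILON", "Z": "ZETA", "H": "ETA",
    "I": "IOTA", "K": "KAPPA", "M": "MU", "N": "NU", "O": "OMICRON",
    "P": "RHO", "T": "TAU", "X": "CHI", "Y": "UPSILON",
}
LATIN_TO_GREEK = {
    ord(latin): unicodedata.lookup(f"GREEK CAPITAL LETTER {name}")
    for latin, name in LOOKALIKES.items()
}

# Greek and Greek Extended Unicode blocks.
GREEK_LETTER = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def fix_word(word):
    # Assumption: a word mixing Greek and Latin letters is an extraction error;
    # purely Latin words (e.g. "PDF") are left untouched.
    if GREEK_LETTER.search(word):
        return word.translate(LATIN_TO_GREEK)
    return word


def clean_pdf_text(raw):
    # Step 1: remove new lines (collapse them, and surrounding spaces, to one space).
    text = re.sub(r"\s*\n\s*", " ", raw).strip()
    # Step 2: replace Latin look-alikes inside Greek words.
    return " ".join(fix_word(word) for word in text.split())


# Latin "N" followed by the Greek letters omicron, mu, omicron, sigma.
sample = "N\u039f\u039c\u039f\u03a3\n4624 PDF"
print(clean_pdf_text(sample))
```

Restricting the replacement to mixed-script words is a design choice of this sketch; it avoids corrupting genuine Latin-script terms that appear in Greek legal texts.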

Metadata Extraction (1/3) – Title & Date
• Title: extracted with a Regexp
• Date: extracted with a Regexp

Metadata Extraction (2/3) – Law keywords
Tokenize → Stemming → Remove stop words and common words → Term Frequency → Top 15 words
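The keyword pipeline of this slide can be sketched in a few lines. The stop-word list and the suffix-stripping stemmer below are hypothetical stand-ins chosen for illustration (and use English rather than Greek); the actual process runs in Rapidminer and its configuration is not shown on the slides.

```python
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "shall", "be"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def stem(token):
    # Naive suffix stripping; a stand-in for a real stemmer.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def top_keywords(text, n=15):
    # Tokenize -> stem -> drop stop words -> term frequency -> top n.
    stems = [stem(token) for token in tokenize(text)]
    # Stop words are stemmed too, because filtering happens after stemming.
    stop_stems = {stem(word) for word in STOP_WORDS}
    kept = [s for s in stems if s not in stop_stems]
    return Counter(kept).most_common(n)


text = "The permit is issued. Permits issued to vessels."
print(top_keywords(text, 3))
```

With the default `n=15` the function returns the "Top 15 words" named on the slide, or fewer if the text has fewer distinct stems.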

Metadata Extraction (3/3) – Other
Other metadata are extracted in two ways:
1. Extraction of PDF file metadata using Python (such as Author, Creation Date, etc.)
2. Extraction of PDF metadata using Rapidminer (such as Pages, file size, etc.)

Law Decomposition (1/2)
Sections → Parts → Chapters → Articles → Paragraphs → Sub-Paragraphs → Clauses → Sentences
In some cases, however, a Greek law quotes text from another law (e.g. within an article), which conflicts
with the separation. We therefore replace these texts with an id and restore them at the end of the process.
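The replace-with-id-and-restore idea can be sketched as below. Two things are assumed for the illustration and are not stated on the slides: that embedded law text sits inside « » quotation marks, and that articles start with a heading of the form "Article <number>" (a Greek gazette would use its own heading keyword). All function names are hypothetical.

```python
import re

QUOTED = re.compile(r"«[^»]*»")               # assumption: quoted law text is inside « »
ARTICLE_HEADING = re.compile(r"Article \d+")  # assumption: English-style article headings


def protect_quotes(text):
    """Replace each quoted passage with an id and remember the original."""
    store = {}

    def stash(match):
        key = f"@@QUOTE{len(store)}@@"
        store[key] = match.group(0)
        return key

    return QUOTED.sub(stash, text), store


def split_articles(text):
    protected, store = protect_quotes(text)
    # Split at article headings; headings hidden inside placeholders are not seen.
    starts = [m.start() for m in ARTICLE_HEADING.finditer(protected)]
    bounds = starts + [len(protected)]
    parts = [protected[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Restore the quoted passages at the end of the process.
    restored = []
    for part in parts:
        for key, original in store.items():
            part = part.replace(key, original)
        restored.append(part)
    return restored


law = (
    "Article 1 The following is added to Law 1234/2000: "
    "«Article 9 The fee is abolished.» "
    "Article 2 This law enters into force today."
)
for article in split_articles(law):
    print(article)
```

Without the placeholder step, the quoted "Article 9" would be split off as a third article of the amending law. Text before the first heading is dropped in this sketch.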

Law Correlation
Search Regexp (e.g. ν. [0-9]{4}/[0-9]{4} ) → Keep only Law Numbers with correlations → Generate graphs with Gephi
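The correlation steps can be sketched as a regular-expression search followed by an edge-list export. The pattern follows the example on the slide, with the dot escaped and the space made optional (both are adjustments made here). The `Source`/`Target` column headers are the ones Gephi's spreadsheet import recognises for edge lists; the function names and the sample data are hypothetical.

```python
import csv
import io
import re

# \u03bd is the Greek small letter nu, the abbreviation used in the slide's pattern.
LAW_REF = re.compile(r"\u03bd\.\s?([0-9]{4})/([0-9]{4})")


def find_references(text):
    return [f"{number}/{year}" for number, year in LAW_REF.findall(text)]


def edge_list_csv(laws):
    """Build a Source,Target edge list from a mapping of law id -> plain text.

    Laws without any reference produce no rows, so only laws with
    correlations appear in the output.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for law_id, text in laws.items():
        for target in sorted(set(find_references(text))):
            if target != law_id:  # ignore self-references
                writer.writerow([law_id, target])
    return buffer.getvalue()


sample_laws = {
    "4624/2019": "see \u03bd. 4412/2016 and \u03bd. 3979/2011, also \u03bd. 4412/2016",
    "1111/1999": "no references here",
}
print(edge_list_csv(sample_laws))
```

The resulting text can be saved as a `.csv` file and loaded into Gephi as an edge table.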

Parts of Speech Extraction (1/2)
Tokenize → POS tagging based on the endings, using Java code
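Ending-based tagging can be sketched as an ordered list of suffix rules. The slides use Java code for this step, which is not shown; the Python sketch below is only an illustration, and its rules are hypothetical English endings rather than the Greek endings the real tagger would need.

```python
import re

# Hypothetical ending -> tag rules, checked in order; the real rules are not on the slides.
ENDING_RULES = [
    ("ly", "ADV"),
    ("ous", "ADJ"),
    ("al", "ADJ"),
    ("ize", "VERB"),
    ("tion", "NOUN"),
    ("ment", "NOUN"),
]


def tag(token):
    lowered = token.lower()
    for ending, pos in ENDING_RULES:
        if lowered.endswith(ending):
            return pos
    return "UNK"  # no rule matched


def pos_tag(text):
    # Tokenize, then tag each token by its ending.
    return [(token, tag(token)) for token in re.findall(r"\w+", text)]


print(pos_tag("national regulation"))
```

Rule order matters: the first matching ending wins, so more specific endings should be listed before shorter, more general ones.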

Parts of Speech Extraction (2/2) – Translation
IATE API calls to translate words

Output Data
• MongoDB -> saves the JSON
• Relational DB -> saves the tables
• File repository for XML files
Converting the JSON file to an XML file is an easy procedure.
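A JSON-to-XML conversion of the kind mentioned on this slide can be sketched with the Python standard library. The JSON structure, the root tag `law` and the `item` tag for list elements are assumptions for illustration; the actual Manylaws JSON schema is not shown on the slides. The sketch also assumes that every JSON key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively convert a parsed JSON value into an XML element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            element.append(to_xml("item", item))
    else:
        element.text = str(value)
    return element


def json_to_xml(json_text, root_tag="law"):
    root = to_xml(root_tag, json.loads(json_text))
    return ET.tostring(root, encoding="unicode")


sample_json = '{"number": 4624, "year": 2019, "keywords": ["data", "privacy"]}'
print(json_to_xml(sample_json))
```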
