A Novel Method and Architecture for Law
Processing, Utilising High Performance Computing
Infrastructures
Yannis Charalabidis, University of the Aegean, Greece – yannisx@aegean.gr
Michalis Loutsaris, University of the Aegean, Greece – mloutsaris@aegean.gr
Samos, July 2019
2
Presentation Structure
• The Manylaws Processing Flow & Outputs: a novel method for extracting data,
relations and meaning from the law
• The most important processing steps explained
• The Manylaws Architecture, for allowing parallel processing over High
Performance Computing infrastructures
3
The size of the problem (or why we need HPC)
The information to be acquired, over the internet only and primarily through web services where
available, comprises:
• All the legal artefacts published by the European Parliament, the European Commission, the EU Council
(EURlex, EUDOR)
• All the legal artefacts published by the 28 national parliaments, as national laws, in English and/or other languages
• News published in EU member states, concerning legal events (e.g. law publication, draft law deliberation, EU
directive publication)
• Other administration-generated content (e.g. local communications, regulations)
• Other citizen-generated relevant content (e.g. blogs, newsletters, social media posts)
We estimate that the above database will contain more than 1 trillion words in 21 different languages,
corresponding to about 10 million “volumes” of classical books, with another 5,000 such “volumes” added
for study every day.
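As a rough check of scale, the quoted figures can be combined as below. This is a back-of-the-envelope sketch: the words-per-volume and words-per-day values are derived from the slide's numbers, not stated on it.

```python
# Figures quoted on the slide.
TOTAL_WORDS = 1_000_000_000_000   # "more than 1 trillion words"
TOTAL_VOLUMES = 10_000_000        # "about 10 million volumes"
DAILY_VOLUMES = 5_000             # "volumes" added per day

# Derived values (our own arithmetic, not figures from the slide).
words_per_volume = TOTAL_WORDS // TOTAL_VOLUMES
daily_words = DAILY_VOLUMES * words_per_volume

print(f"Implied words per volume: {words_per_volume:,}")
print(f"Implied new words per day: {daily_words:,}")
```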
Law Acquisition → Law Preprocessing → Metadata Extraction → Law Decomposition → Law Correlation → Parts of Speech Extraction → N-Grams Creation → Translation → JSON File Generation
4
Legal Text Mining – The Manylaws Process
5
Legal Text Mining – The Manylaws Outputs
• Law Acquisition (Get in bulk, Get from API / crawler)
• Law Preprocessing (Rapidminer Trigger, Convert PDF to Text)
• Metadata Extraction (Get the title of the law, Get the number of the law, Get the year of the law, Get the
topic of the law, Get 10 more attributes from Law Source)
• Law Decomposition (Extract Sections, Extract Parts, Extract Chapters, Extract Articles, Extract Paragraphs,
Extract Sub Paragraphs, Extract Clauses, Extract Sentences)
• Law Correlation (Extract Laws Number, Extract Presidential Decrees Identifier, Extract Ministerial Decrees
Identifier, Extract article number of Constitution, Extract Circular Identifier, Extract Regulation Identifier,
Extract Act of Legislative Content Identifier, Extract Directive Number)
6
Legal Text Mining – The Manylaws Outputs
• Parts of Speech Extraction (Extract Nouns, Extract Adjectives, Extract Verbs, Extract
Adverbs)
• N-Grams Creation (Adjective + Noun, Noun + Adjective, Noun + Verb + Noun, Adjective +
Noun + Verb + Adjective + Noun, Adjective + Noun + Noun)
• Translation (Word Translation, Phrase Translation)
• JSON File Generation
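The N-Grams Creation step can be sketched as pattern matching over part-of-speech tags. This is a minimal illustration, assuming tokens arrive already tagged; the tag names (`ADJ`, `NOUN`, `VERB`) and the function `pos_ngrams` are hypothetical, and only the five patterns come from the slide.

```python
# The five patterns listed on the slide, written as tag sequences.
PATTERNS = [
    ("ADJ", "NOUN"),
    ("NOUN", "ADJ"),
    ("NOUN", "VERB", "NOUN"),
    ("ADJ", "NOUN", "VERB", "ADJ", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
]


def pos_ngrams(tagged, patterns=PATTERNS):
    """Return every run of words whose tag sequence equals one of the patterns.

    `tagged` is a list of (word, tag) pairs in sentence order.
    """
    tags = [tag for _, tag in tagged]
    found = []
    for pattern in patterns:
        n = len(pattern)
        # Slide a window of the pattern's length over the sentence.
        for i in range(len(tagged) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(word for word, _ in tagged[i:i + n]))
    return found


# Made-up English sentence, used only to show the mechanics.
sentence = [
    ("national", "ADJ"), ("law", "NOUN"), ("amends", "VERB"),
    ("previous", "ADJ"), ("decree", "NOUN"),
]
print(pos_ngrams(sentence))
```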
Web scraper for et.gr
7
Law Acquisition
HEP API calls for additional
metadata and …..
Each country has its own repository, which triggers the Rapidminer process
8
Law Preprocessing – Rapidminer Trigger
9
Law Preprocessing - Convert PDF to plain text
• Remove new lines
• Replace English characters with Greek characters
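The two clean-up steps above could look roughly as follows. This is a sketch under assumptions: the look-alike table is an illustrative subset, and restricting the replacement to words that already contain a Greek letter is our own safeguard (so genuinely Latin tokens such as "EU" survive), not something stated on the slide.

```python
import re

# Latin capitals that look like Greek capitals (assumed, partial table).
LATIN_TO_GREEK = str.maketrans({
    "A": "Α", "B": "Β", "E": "Ε", "H": "Η", "I": "Ι", "K": "Κ", "M": "Μ",
    "N": "Ν", "O": "Ο", "P": "Ρ", "T": "Τ", "X": "Χ", "Y": "Υ", "Z": "Ζ",
})

# Any character from the Greek or Greek Extended Unicode blocks.
GREEK_CHAR = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def remove_new_lines(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fix_mixed_script(text):
    """Swap Latin look-alikes for Greek letters inside partly Greek words."""
    def fix_word(match):
        word = match.group(0)
        if GREEK_CHAR.search(word):
            return word.translate(LATIN_TO_GREEK)
        return word  # purely Latin or numeric tokens are left untouched
    return re.sub(r"\S+", fix_word, text)


def clean_pdf_text(text):
    return fix_mixed_script(remove_new_lines(text))
```

Usage: `clean_pdf_text(raw_text)` is applied to the plain text produced by the PDF conversion, before any regexp-based extraction.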
10
Metadata Extraction (1/3) – Title & Date
Regexp (one for the title, one for the date)
Tokenize → Stemming → Remove stop words and common words → Term Frequency → Top 15 words
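The keyword pipeline above can be approximated in a few lines. This is a minimal sketch, not the Rapidminer process itself: `naive_stem` is a placeholder suffix stripper standing in for a real Greek stemmer, and the stop-word list is a small assumed sample.

```python
import re
from collections import Counter

# Small assumed sample of Greek stop words; a real list would be much longer.
STOP_WORDS = {"και", "του", "της", "το", "η", "ο", "των", "την",
              "με", "σε", "για", "από", "που", "να"}

# Illustrative endings only, tried in this order.
SUFFIXES = ("ους", "ων", "ου", "ος", "ες", "ης", "α", "η", "ο", "ς")


def naive_stem(token):
    """Strip one common ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


# Stop words are stemmed too, because filtering happens after stemming.
STOP_STEMS = {naive_stem(word) for word in STOP_WORDS}


def top_keywords(text, n=15):
    tokens = re.findall(r"\w+", text.lower())              # tokenize
    stems = [naive_stem(t) for t in tokens]                # stemming
    kept = [s for s in stems
            if s not in STOP_STEMS and not s.isdigit()]    # stop words
    return Counter(kept).most_common(n)                    # term frequency, top n


sample = "φόρος φόρου φόρων και φόρος του νόμου"
print(top_keywords(sample))
```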
11
Metadata Extraction (2/3) – Law keywords
Other metadata are extracted in two ways:
1. Extraction of PDF file metadata using Python (such as Author, Creation Date etc.)
2. Extraction of PDF metadata using Rapidminer (such as pages, file size etc.)
12
Metadata Extraction (3/3) – Other
Sections → Parts → Chapters → Articles → Paragraphs → Sub-Paragraphs → Clauses → Sentences
13
Law Decomposition (1/2)
In some cases, however, a Greek law quotes text
from another law (e.g. within an article),
which conflicts with the separation. We therefore
replace these texts with an id and restore
them at the end of the process.
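The replace-and-restore idea can be sketched as below. Two assumptions are ours, not the slide's: that quoted text is delimited by guillemets («…»), and that articles start with the heading "Άρθρο" followed by a number. The placeholder format and all function names are hypothetical.

```python
import re

QUOTED = re.compile(r"«[^«»]*»")             # assumed delimiter for quoted text
ARTICLE_START = re.compile(r"(?=Άρθρο\s+\d+)")  # assumed article heading


def mask_quotes(text):
    """Replace each quoted passage with an id; return the text and the lookup."""
    store = {}

    def swap(match):
        key = f"[[QUOTE_{len(store)}]]"
        store[key] = match.group(0)
        return key

    return QUOTED.sub(swap, text), store


def split_articles(text):
    """Split in front of every article heading (needs Python 3.7+)."""
    return [part.strip() for part in ARTICLE_START.split(text) if part.strip()]


def unmask(text, store):
    """Put the original quoted passages back."""
    for key, original in store.items():
        text = text.replace(key, original)
    return text


# Made-up sample: article 1 quotes a replacement for "Άρθρο 5" of another law.
law = "Άρθρο 1 Αντικαθίσταται ως εξής: «Άρθρο 5 Νέο κείμενο.» Άρθρο 2 Ισχύς."

naive = split_articles(law)                  # wrongly splits inside the quote
masked, store = mask_quotes(law)
articles = [unmask(part, store) for part in split_articles(masked)]
print(len(naive), len(articles))
```

Without masking, the quoted heading is treated as a third article; with masking, the quote stays inside the article that contains it.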
14
Law Decomposition (2/2)
Search Regexp (e.g. ν. [0-9]{4}/[0-9]{4}) → Keep only law numbers with correlations → Generate graphs with Gephi
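A possible shape for this step is shown below. It is a sketch: the pattern is the slide's example with the dot escaped and optional whitespace allowed, and the `Source,Target` edge-list layout is the CSV form Gephi can import. The function names and the sample law numbers are made up.

```python
import csv
import io
import re

# Slide example, with "." escaped and the space made optional.
LAW_REF = re.compile(r"ν\.\s*([0-9]{4}/[0-9]{4})")


def extract_references(law_id, text):
    """Return (citing law, cited law) pairs, without duplicates or self-links."""
    cited = sorted(set(LAW_REF.findall(text)) - {law_id})
    return [(law_id, ref) for ref in cited]


def to_gephi_csv(edges):
    """Serialise edges as a CSV edge list with Source/Target columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


# Made-up text and law numbers, for illustration only.
text = "όπως ο ν. 2000/2002 και ο ν.1000/2001, καθώς και ο ν. 2000/2002"
edges = extract_references("3000/2003", text)
print(to_gephi_csv(edges))
```

Laws that yield no edges drop out naturally, which matches keeping only law numbers with correlations.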
15
Law Correlation
16
Parts of Speech Extraction (1/2)
Tokenize → POS tagging based on the endings, using Java code
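Ending-based tagging can be illustrated as follows. The slide states that the actual tagger is written in Java; this Python sketch only shows the general idea, and the suffix table is a tiny assumed sample that will mis-tag many real words.

```python
import re

# Assumed, illustrative rules. More specific endings come first so that
# "-ικός" (adjective) is tested before the generic noun ending "-ος".
SUFFIX_RULES = [
    ("ονται", "VERB"), ("εται", "VERB"), ("ουν", "VERB"), ("ει", "VERB"),
    ("ικός", "ADJ"), ("ική", "ADJ"), ("ικό", "ADJ"),
    ("ως", "ADV"),
    ("ση", "NOUN"), ("μα", "NOUN"), ("ος", "NOUN"),
]


def tag(token):
    """Return the tag of the first matching ending, or UNK if none matches."""
    lowered = token.lower()
    for suffix, pos in SUFFIX_RULES:
        if lowered.endswith(suffix):
            return pos
    return "UNK"


def pos_tag(text):
    """Tokenize on word characters, then tag each token by its ending."""
    return [(token, tag(token)) for token in re.findall(r"\w+", text)]


print(pos_tag("ο δημοτικός νόμος ορίζει"))
```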
IATE API calls to translate words
17
Parts of Speech Extraction (2/2) – Translation
18
Generation of JSON file
Converting the JSON file to an XML file is straightforward
• MongoDB: stores the JSON
• Relational DB: stores the tables
• File repository: stores the XML files
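The JSON-to-XML conversion can be sketched with the Python standard library. The record layout, field names and values below are hypothetical, as is the choice of `item` for list entries; the slides do not show the real schema.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively turn a JSON-style value into an XML element.

    Dictionary keys become element names, so they must be valid XML names.
    """
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            element.append(to_xml("item", item))
    else:
        element.text = str(value)
    return element


# Hypothetical record with placeholder values.
law = {
    "number": "1234",
    "year": 2000,
    "title": "Sample law",
    "references": ["1000/1999"],
}

json_text = json.dumps(law, ensure_ascii=False, indent=2)
xml_text = ET.tostring(to_xml("law", json.loads(json_text)), encoding="unicode")
print(xml_text)
```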
19
Output Data
20
Manylaws Architecture
