A Novel Method and Architecture for Law
Processing, Utilising High Performance Computing
Infrastructures
Yannis Charalabidis, University of the Aegean, Greece – yannisx@aegean.gr
Michalis Loutsaris, University of the Aegean, Greece – mloutsaris@aegean.gr
Samos, July 2019
Presentation Structure
• The Manylaws Processing Flow & Outputs: a novel method for extracting data,
relations and meaning from the law
• The most important processing steps explained
• The Manylaws Architecture, for allowing parallel processing over High
Performance Computing infrastructures
The size of the problem (or why we need HPC)
The information to be acquired, through internet only, and primarily through web services communication where
available, contains:
• All the legal artefacts published by the European Parliament, the European Commission, the EU Council
(EURlex, EUDOR)
• All the legal artefacts published by the 28 national parliaments, as national laws, in English and/or other languages
• News published in EU member states, concerning legal events (e.g. law publication, draft law deliberation, EU
directive publication)
• Other administration-generated content (e.g. local communications, regulations)
• Other citizen-generated relevant content (e.g. blogs, newsletters, social media posts)
We estimate that the above database will contain more than 1 trillion words in 21 different languages,
corresponding to about 10 million “volumes” of classical books, while another 5,000 such “volumes” will be added
for study on a daily basis.
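The figures above imply a rough size per "volume" and a daily intake in words. The short calculation below makes that arithmetic explicit. The per-worker throughput is a purely hypothetical number chosen for illustration; it does not come from the presentation.

```python
# Figures quoted on the slide.
TOTAL_WORDS = 1_000_000_000_000   # "more than 1 trillion words"
TOTAL_VOLUMES = 10_000_000        # "about 10 million volumes"
DAILY_VOLUMES = 5_000             # "another 5,000 such volumes" per day

# Implied size of one "volume" (derived here, not stated on the slide).
words_per_volume = TOTAL_WORDS // TOTAL_VOLUMES

# Implied daily intake in words.
daily_words = DAILY_VOLUMES * words_per_volume

# Hypothetical throughput of a single worker; an assumption for illustration only.
ASSUMED_WORDS_PER_SECOND = 10_000


def days_to_process(words: int, workers: int) -> float:
    """Days needed to process `words` with `workers` equal, independent workers."""
    seconds = words / (ASSUMED_WORDS_PER_SECOND * workers)
    return seconds / 86_400
```

Under that assumed throughput, `days_to_process(TOTAL_WORDS, 1)` is on the order of three years, while a thousand workers would bring it down to roughly a day, which is the kind of gap that motivates parallel processing.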
Legal Text Mining – The Manylaws Process
Law Acquisition → Law Preprocessing → Metadata Extraction → Law Decomposition → Law Correlation → Parts of Speech Extraction → N-Grams Creation → Translation → JSON File Generation
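One way to picture the process is as a chain of stages, each taking a document record and returning an enriched one. The sketch below only illustrates that chaining: the stage bodies are placeholders, and the names `Document`, `make_stage` and `run_pipeline` are hypothetical, not part of the Manylaws implementation.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]

# Stage names in the order shown in the process flow.
STAGE_NAMES = [
    "Law Acquisition",
    "Law Preprocessing",
    "Metadata Extraction",
    "Law Decomposition",
    "Law Correlation",
    "Parts of Speech Extraction",
    "N-Grams Creation",
    "Translation",
    "JSON File Generation",
]


def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def stage(doc: Document) -> Document:
        completed = list(doc.get("completed", []))
        completed.append(name)
        return {**doc, "completed": completed}
    return stage


PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(doc: Document, stages: List[Stage] = PIPELINE) -> Document:
    """Apply every stage in order, passing each result to the next stage."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage depends only on its input record, different laws can in principle be pushed through the chain independently of one another.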
Legal Text Mining – The Manylaws Outputs
• Law Acquisition (bulk download, retrieval via API / crawler)
• Law Preprocessing (RapidMiner trigger, Convert PDF to text)
• Metadata Extraction (Get the title, the number, the year and the topic of the law, plus 10 more attributes
from the law source)
• Law Decomposition (Extract Sections, Parts, Chapters, Articles, Paragraphs, Sub-Paragraphs, Clauses,
Sentences)
• Law Correlation (Extract Law Numbers, Presidential Decree Identifiers, Ministerial Decree Identifiers,
Constitution article numbers, Circular Identifiers, Regulation Identifiers, Act of Legislative Content
Identifiers, Directive Numbers)
Legal Text Mining – The Manylaws Outputs
• Parts of Speech Extraction (Extract Nouns, Extract Adjectives, Extract Verbs, Extract
Adverbs)
• N-Grams Creation (Adjective + Noun, Noun + Adjective, Noun + Verb + Noun, Adjective +
Noun + Verb + Adjective + Noun, Adjective + Noun + Noun)
• Translation (Word Translation, Phrase Translation)
• JSON File Generation
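The N-gram patterns listed above can be matched against a sequence of tagged words. The sketch below assumes the input is already a list of (word, tag) pairs and uses the hypothetical tag names `ADJ`, `NOUN` and `VERB`; the tag set actually used by Manylaws is not given in the slides.

```python
from typing import List, Sequence, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)

# The five patterns listed on the slide, written with assumed tag names.
PATTERNS: List[Tuple[str, ...]] = [
    ("ADJ", "NOUN"),
    ("NOUN", "ADJ"),
    ("NOUN", "VERB", "NOUN"),
    ("ADJ", "NOUN", "VERB", "ADJ", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
]


def extract_ngrams(tagged: Sequence[Tagged],
                   patterns: Sequence[Tuple[str, ...]] = PATTERNS) -> List[str]:
    """Return every word window whose tag sequence equals one of the patterns."""
    tags = [tag for _, tag in tagged]
    found: List[str] = []
    for pattern in patterns:
        n = len(pattern)
        for start in range(len(tagged) - n + 1):
            if tuple(tags[start:start + n]) == pattern:
                words = (word for word, _ in tagged[start:start + n])
                found.append(" ".join(words))
    return found
```

Results are grouped by pattern, in the order the patterns are listed, so the same words can appear in both a short and a long N-gram.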
Law Acquisition
• Web scraper for et.gr
• HEP API calls for additional metadata and …..
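A scraper of this kind typically has to find the links to law PDFs on a listing page. The sketch below shows only that step with the Python standard library, on an HTML string that is already downloaded. The page markup, the `.pdf` suffix rule and the `example.org` base URL are assumptions for illustration; the actual structure of et.gr is not described in the slides.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that point to PDF files."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumed rule: a law document is any link ending in ".pdf".
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def collect_pdf_links(html: str, base_url: str) -> List[str]:
    """Return the PDF links found in `html`, resolved against `base_url`."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Downloading the listing page and the PDFs themselves is left out on purpose, since it depends on the repository of each country.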
Law Preprocessing – RapidMiner Trigger
• Each country has its own repository, which triggers the RapidMiner process
Law Preprocessing – Convert PDF to plain text
• Remove new lines
• Replace Latin (English) characters with their look-alike Greek characters
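The two clean-up steps can be sketched as follows. The look-alike table and the rule of touching only tokens that already contain a Greek letter are assumptions made for this example, so that genuinely English words are left alone; the slides do not say which characters are replaced or how. The Greek letters are written as Unicode escapes because they are visually identical to the Latin ones.

```python
import re

# Latin letters and their assumed Greek look-alikes, position by position:
# Alpha, Beta, Epsilon, Zeta, Eta, Iota, Kappa, Mu, Nu, Omicron, Rho, Tau,
# Upsilon, Chi, and lowercase omicron.
LATIN = "ABEZHIKMNOPTYXo"
GREEK = ("\u0391\u0392\u0395\u0396\u0397\u0399\u039a\u039c"
         "\u039d\u039f\u03a1\u03a4\u03a5\u03a7\u03bf")
LATIN_TO_GREEK = str.maketrans(LATIN, GREEK)

# Any character from the Greek or Greek Extended Unicode blocks.
GREEK_LETTER = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]")


def remove_new_lines(text: str) -> str:
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fix_mixed_script(text: str) -> str:
    """Replace Latin look-alikes, but only inside tokens that contain Greek."""
    def fix(token: str) -> str:
        if GREEK_LETTER.search(token):
            return token.translate(LATIN_TO_GREEK)
        return token
    return " ".join(fix(token) for token in text.split(" "))


def clean_extracted_text(raw: str) -> str:
    """Apply both clean-up steps to text extracted from a PDF."""
    return fix_mixed_script(remove_new_lines(raw))
```

A token made up entirely of Latin look-alikes is not corrected by this heuristic; catching those would need a dictionary or language detection.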
Metadata Extraction (1/3) – Title & Date
• Regexp
• Regexp
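As an illustration of regexp-based extraction, the sketch below pulls a title and a publication date out of a single header line. The header layout (`LAW No. 1234/2019 (15.03.2019): Title`) is entirely hypothetical; real gazette headers differ, and the expressions used in Manylaws are not shown in the slides.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Hypothetical header layout: "LAW No. <number>/<year> (<dd>.<mm>.<yyyy>): <title>"
HEADER = re.compile(
    r"LAW No\.\s*(?P<number>\d+)/(?P<year>\d{4})\s*"
    r"\((?P<day>\d{2})\.(?P<month>\d{2})\.(?P<pub_year>\d{4})\):\s*(?P<title>.+)"
)


def extract_title_and_date(header_line: str) -> Optional[Tuple[str, date]]:
    """Return (title, publication date), or None if the line does not match."""
    match = HEADER.search(header_line)
    if match is None:
        return None
    published = date(int(match["pub_year"]), int(match["month"]), int(match["day"]))
    return match["title"].strip(), published
```

Returning `None` for a non-matching line lets the caller log the document for manual inspection instead of storing wrong metadata.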
Metadata Extraction (2/3) – Law keywords
Tokenize → Stemming → Remove stop words and common words → Term Frequency → Top 15 words
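The keyword steps can be sketched end to end as follows. The stop-word list and the suffix-stripping "stemmer" are deliberately tiny English stand-ins chosen for this example; they are assumptions, not the resources used in Manylaws.

```python
import re
from collections import Counter
from typing import List, Tuple

# Illustrative stop-word list only; a real list would be far longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "by", "shall"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(token: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def top_keywords(text: str, top_n: int = 15) -> List[Tuple[str, int]]:
    """Tokenize, stem, drop stop words, count, and keep the most frequent terms."""
    stems = [stem(token) for token in tokenize(text)]
    kept = [s for s in stems if s not in STOP_WORDS]
    return Counter(kept).most_common(top_n)
```

The default of 15 mirrors the "Top 15 words" step; each returned pair is a stem together with its term frequency.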
Metadata Extraction (3/3) – Other
Other metadata are extracted in two ways:
1. Extraction of PDF file metadata using Python (such as Author, Creation Date etc.)
2. Extraction of PDF metadata using RapidMiner (such as Pages, file size etc.)
Law Decomposition (1/2)
Sections → Parts → Chapters → Articles → Paragraphs → Sub-Paragraphs → Clauses → Sentences
Law Decomposition (2/2)
In some cases, however, Greek laws contain text from another law (e.g. within an article), which
conflicts with the separation. We therefore replace these texts with an id and restore
them at the end of the process.
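The replace-and-restore idea can be sketched as below. Two assumptions are made for illustration: embedded text is delimited by guillemets («…»), and articles start with a heading of the form "Article N". Neither detail is stated in the slides, and nested quotations are not handled.

```python
import re
from typing import Dict, List, Tuple

# Assumption: text taken from another law is enclosed in guillemets.
QUOTED = re.compile(r"«[^»]*»")


def mask_quotes(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each quoted passage with an id and remember the original."""
    store: Dict[str, str] = {}

    def replace(match: "re.Match[str]") -> str:
        key = f"[[QUOTE_{len(store)}]]"
        store[key] = match.group(0)
        return key

    return QUOTED.sub(replace, text), store


def unmask_quotes(text: str, store: Dict[str, str]) -> str:
    """Put the original quoted passages back in place of their ids."""
    for key, original in store.items():
        text = text.replace(key, original)
    return text


def split_articles(text: str) -> List[str]:
    """Split at every assumed 'Article N' heading."""
    parts = re.split(r"(?=Article \d+)", text)
    return [part.strip() for part in parts if part.strip()]
```

Splitting the masked text and then calling `unmask_quotes` on each part keeps a quoted "Article 5" inside the article that quotes it, instead of creating a spurious extra article.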
Law Correlation
Search Regexp (e.g. ν. [0-9]{4}/[0-9]{4}) → Keep only Law Numbers with correlations → Generate graphs with Gephi
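The correlation step can be sketched as: find references with the regular expression, keep them as directed edges between laws, and write an edge list that Gephi can import. The sketch escapes the dot and allows an optional space after it, which is a small tightening of the example pattern; the input dictionary and function names are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# \u03bd is the Greek small letter nu, as in the slide's example pattern.
LAW_REF = re.compile(r"\u03bd\. ?([0-9]{4}/[0-9]{4})")


def find_references(text: str) -> Set[str]:
    """Return the distinct law numbers referenced in `text`."""
    return set(LAW_REF.findall(text))


def build_edges(laws: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build (citing law, cited law) pairs; laws without references add nothing."""
    edges: List[Tuple[str, str]] = []
    for law_id, text in laws.items():
        for target in sorted(find_references(text)):
            if target != law_id:  # ignore self-references
                edges.append((law_id, target))
    return edges


def to_edge_list_csv(edges: List[Tuple[str, str]]) -> str:
    """Serialise edges as CSV with the Source/Target header Gephi expects."""
    lines = ["Source,Target"]
    lines += [f"{source},{target}" for source, target in edges]
    return "\n".join(lines)
```

Saving the returned string to a `.csv` file gives an edge table that can be loaded through Gephi's spreadsheet import.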
Parts of Speech Extraction (1/2)
Tokenize → POS tagging based on the word endings, using Java code
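Ending-based tagging can be illustrated with a small suffix table. The original is implemented in Java and its rules are not shown, so the sketch below is in Python for consistency with the other examples, and the handful of Greek endings is a crude, assumed rule set that will mis-tag many words. Accents are stripped first so that one rule covers accented and unaccented forms.

```python
import re
import unicodedata
from typing import List, Tuple

# Assumed, highly simplified ending rules (unaccented); not the Manylaws rule set.
_RAW_RULES = [
    ("ονται", "VERB"), ("εται", "VERB"), ("ουν", "VERB"), ("ει", "VERB"),
    ("ως", "ADV"),
    ("ικος", "ADJ"), ("ικη", "ADJ"), ("ικο", "ADJ"),
    ("ος", "NOUN"), ("η", "NOUN"), ("α", "NOUN"), ("ο", "NOUN"),
]
# Try longer endings first so that "ικος" wins over "ος".
RULES = sorted(_RAW_RULES, key=lambda rule: -len(rule[0]))


def strip_accents(word: str) -> str:
    """Remove combining accent marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tag_word(word: str) -> str:
    """Return the tag of the first (longest) matching ending, or 'UNK'."""
    normalized = strip_accents(word.lower())
    for ending, tag in RULES:
        if normalized.endswith(ending):
            return tag
    return "UNK"


def tag_text(text: str) -> List[Tuple[str, str]]:
    """Tokenize into runs of letters and tag every token."""
    return [(token, tag_word(token)) for token in re.findall(r"[^\W\d_]+", text)]
```

Short function words such as articles are tagged incorrectly by rules this coarse, so a practical tagger would need an exception list on top of the ending table.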
Parts of Speech Extraction (2/2) – Translation
• IATE API calls to translate words
Generation of JSON file
Output Data
• Converting the JSON file to an XML file is an easy procedure
• MongoDB: stores the JSON
• Relational DB: stores the tables
• File repository for the XML files
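To show why the JSON-to-XML conversion is straightforward, here is a minimal recursive converter using only the Python standard library. The field names in the usage example are hypothetical, the `item` tag for list entries is an arbitrary choice, and the sketch assumes every JSON key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag: str, value: object) -> ET.Element:
    """Convert a decoded JSON value into an XML element named `tag`."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            # Arbitrary convention: each list entry becomes an <item> element.
            element.append(to_xml("item", item))
    else:
        element.text = "" if value is None else str(value)
    return element


def json_to_xml(json_text: str, root_tag: str = "law") -> str:
    """Parse a JSON document and return it serialised as an XML string."""
    root = to_xml(root_tag, json.loads(json_text))
    return ET.tostring(root, encoding="unicode")
```

For example, `json_to_xml('{"number": "4412", "articles": ["a", "b"]}')` nests `<number>` and `<articles>` under a `<law>` root. Type information is lost in this direction, since numbers and strings both become element text.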
Manylaws Architecture
