SlideShare a Scribd company logo
Harnessing court data using NLP and spoken language technology
Constantin Orăsan and Hadeel Saadany
{C.Orasan,Hadeel.Saadany}@surrey.ac.uk
Objective 1: Improve Automatic Transcription of Courtroom Hearings
Amazon Web Service Transcription
gives errors with legal terminology/
jargon, named entities, numbers
We minimize the transcription errors by building a Custom
Language Model (CLM) trained on legal text and edited
transcriptions. Our CLM results correct most errors of AWS
generic model and reduce Word Error Rate (WER)
1. Employ NLP tools to extract a vocabulary list with most common bigrams in the edited transcripts.
2. Employ Blackstone library to create a list of legal terms
3. Scraped a large dataset of judgements (approx. 25k documents) from the UK Tribunal Decisions
(https://www.judiciary.uk) to use for training the CLM.
4. Use the extracted list of legal terms and most common entities to build CLM using different
training datasets (supreme court judgements and edited transcripts).
Our pipeline for Improving Transcription Quality
Objective 2: Linking Judgement to Transcripts
The objective and design expectations
• Transcripts (very long 10+ hrs.) can be used as a tool for
students/academics/lawyers to learn about a particular law.
• We semantically link time spans of transcription to
sections/paragraphs in the judgement.
• This can help understand how the judgment is reached.
Progress so far
• GPT 3 semantic search spots the gold-standard timestamp in
a list of top 20 most relevant.
• Manual analysis of results shows that the 20 extracted
timestamps are relevant to the judgement and
demonstrates good linking.
Our Method
We treat the judgement sections as a Query and the Transcriptions as potential
answers. We rank spans using the cosine similarity between the GPT3 sentence
embeddings of the judgement and transcription spans
Next steps
• Create a large gold standard for training and testing
• Experiment with more NLP methods to improve linking
accuracy
• Produce a smaller list of timestamps linked to judgement
• Gain a better understanding of how the users would use
such a system
• Create a pipeline for cleaning, segmenting and linking the
Courtroom hearings and the Transcriptions produced by
our CLM model
Publications
Saadany, H., Breslin, C., Orăsan, C. and Walker, S. (2022) Better Transcription of UK Supreme Court Hearings. arXiv:2211.17094

More Related Content

Similar to Harnessing Court Data using NLP and Spoken Language Technology

Implementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerImplementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy Server
Abdelrahman Hosny
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
REMEGIUSPRAVEENSAHAY
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
Kripa (कृपा) Rajshekhar
 
Top Essay Writing Websites
Top Essay Writing WebsitesTop Essay Writing Websites
Top Essay Writing Websites
Beth Hernandez
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
cscpconf
 
Grouper
GrouperGrouper
Grouper
Preet Kanwal
 
Test Estimation
Test Estimation Test Estimation
Test Estimation
SQALab
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Edmond Lepedus
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
IJERA Editor
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Sheeyam Shellvacumar
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
balmanme
 
Network and Multimedia QoE Management
Network and Multimedia QoE ManagementNetwork and Multimedia QoE Management
Network and Multimedia QoE Management
Sheng-Wei (Kuan-Ta) Chen
 
Moses
MosesMoses
A Recorded Debating Dataset
A Recorded Debating DatasetA Recorded Debating Dataset
A Recorded Debating Dataset
Scott Faria
 
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesSemi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Mohamed El-Geish
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
Kripa (कृपा) Rajshekhar
 
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
white paper
 
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
IJMER
 
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptxMADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
awadalsabbah
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
kevig
 

Similar to Harnessing Court Data using NLP and Spoken Language Technology (20)

Implementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerImplementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy Server
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
 
Top Essay Writing Websites
Top Essay Writing WebsitesTop Essay Writing Websites
Top Essay Writing Websites
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
 
Grouper
GrouperGrouper
Grouper
 
Test Estimation
Test Estimation Test Estimation
Test Estimation
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
 
Network and Multimedia QoE Management
Network and Multimedia QoE ManagementNetwork and Multimedia QoE Management
Network and Multimedia QoE Management
 
Moses
MosesMoses
Moses
 
A Recorded Debating Dataset
A Recorded Debating DatasetA Recorded Debating Dataset
A Recorded Debating Dataset
 
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesSemi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
 
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
 
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
 
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptxMADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 

More from Constantin Orasan

New trends in NLP applications
New trends in NLP applicationsNew trends in NLP applications
New trends in NLP applications
Constantin Orasan
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
Constantin Orasan
 
QALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic WebQALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic Web
Constantin Orasan
 
The role of linguistic information for shallow language processing
The role of linguistic information for shallow language processingThe role of linguistic information for shallow language processing
The role of linguistic information for shallow language processing
Constantin Orasan
 
What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?
Constantin Orasan
 
Tutorial on automatic summarization
Tutorial on automatic summarizationTutorial on automatic summarization
Tutorial on automatic summarization
Constantin Orasan
 
Message project leaflet
Message project leafletMessage project leaflet
Message project leaflet
Constantin Orasan
 
Porting the QALL-ME framework to Romanian
Porting the QALL-ME framework to RomanianPorting the QALL-ME framework to Romanian
Porting the QALL-ME framework to Romanian
Constantin Orasan
 
Annotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processingAnnotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processing
Constantin Orasan
 

More from Constantin Orasan (9)

New trends in NLP applications
New trends in NLP applicationsNew trends in NLP applications
New trends in NLP applications
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
 
QALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic WebQALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic Web
 
The role of linguistic information for shallow language processing
The role of linguistic information for shallow language processingThe role of linguistic information for shallow language processing
The role of linguistic information for shallow language processing
 
What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?
 
Tutorial on automatic summarization
Tutorial on automatic summarizationTutorial on automatic summarization
Tutorial on automatic summarization
 
Message project leaflet
Message project leafletMessage project leaflet
Message project leaflet
 
Porting the QALL-ME framework to Romanian
Porting the QALL-ME framework to RomanianPorting the QALL-ME framework to Romanian
Porting the QALL-ME framework to Romanian
 
Annotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processingAnnotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processing
 

Recently uploaded

Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Jeffrey Haguewood
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 

Recently uploaded (20)

Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

Harnessing Court Data using NLP and Spoken Language Technology

  • 1. Harnessing court data using NLP and spoken language technology Constantin Orăsan and Hadeel Saadany {C.Orasan,Hadeel.Saadany}@surrey.ac.uk Objective 1: Improve Automatic Transcription of Courtroom Hearings Amazon Web Service Transcription gives errors with legal terminology/ jargon, named entities, numbers We minimize the transcription errors by building a Custom Language Model (CLM) trained on legal text and edited transcriptions. Our CLM results correct most errors of AWS generic model and reduce Word Error Rate (WER) 1. Employ NLP tools to extract a vocabulary list with most common bigrams in the edited transcripts. 2. Employ Blackstone library to create a list of legal terms 3. Scraped a large dataset of judgements (approx. 25k documents) from the UK Tribunal Decisions (https://www.judiciary.uk) to use for training the CLM. 4. Use the extracted list of legal terms and most common entities to build CLM using different training datasets (supreme court judgements and edited transcripts). Our pipeline for Improving Transcription Quality Objective 2: Linking Judgement to Transcripts The objective and design expectations • Transcripts (very long 10+ hrs.) can be used as a tool for students/academics/lawyers to learn about a particular law. • We semantically link time spans of transcription to sections/paragraphs in the judgement. • This can help understand how the judgment is reached. Progress so far • GPT 3 semantic search spots the gold-standard timestamp in a list of top 20 most relevant. • Manual analysis of results shows that the 20 extracted timestamps are relevant to the judgement and demonstrates good linking. Our Method We treat the judgement sections as a Query and the Transcriptions as potential answers. We rank spans using the cosine similarity between the GPT3 sentence embeddings of the judgement and transcription spans Next steps • Create a large gold standard for training and testing • Experiment with more NLP methods to improve linking accuracy • Produce a smaller list of timestamps linked to judgement • Gain a better understanding of how the users would use such a system • Create a pipeline for cleaning, segmenting and linking the Courtroom hearings and the Transcriptions produced by our CLM model Publications Saadany, H., Breslin, C., Orăsan, C. and Walker, S. (2022) Better Transcription of UK Supreme Court Hearings. arXiv:2211.17094