SlideShare a Scribd company logo
1 of 1
Download to read offline
Harnessing court data using NLP and spoken language technology
Constantin Orăsan and Hadeel Saadany
{C.Orasan,Hadeel.Saadany}@surrey.ac.uk
Objective 1: Improve Automatic Transcription of Courtroom Hearings
Amazon Web Service Transcription
gives errors with legal terminology/
jargon, named entities, numbers
We minimize the transcription errors by building a Custom
Language Model (CLM) trained on legal text and edited
transcriptions. Our CLM results correct most errors of AWS
generic model and reduce Word Error Rate (WER)
1. Employ NLP tools to extract a vocabulary list with most common bigrams in the edited transcripts.
2. Employ Blackstone library to create a list of legal terms
3. Scraped a large dataset of judgements (approx. 25k documents) from the UK Tribunal Decisions
(https://www.judiciary.uk) to use for training the CLM.
4. Use the extracted list of legal terms and most common entities to build CLM using different
training datasets (supreme court judgements and edited transcripts).
Our pipeline for Improving Transcription Quality
Objective 2: Linking Judgement to Transcripts
The objective and design expectations
• Transcripts (very long 10+ hrs.) can be used as a tool for
students/academics/lawyers to learn about a particular law.
• We semantically link time spans of transcription to
sections/paragraphs in the judgement.
• This can help understand how the judgment is reached.
Progress so far
• GPT 3 semantic search spots the gold-standard timestamp in
a list of top 20 most relevant.
• Manual analysis of results shows that the 20 extracted
timestamps are relevant to the judgement and
demonstrates good linking.
Our Method
We treat the judgement sections as a Query and the Transcriptions as potential
answers. We rank spans using the cosine similarity between the GPT3 sentence
embeddings of the judgement and transcription spans
Next steps
• Create a large gold standard for training and testing
• Experiment with more NLP methods to improve linking
accuracy
• Produce a smaller list of timestamps linked to judgement
• Gain a better understanding of how the users would use
such a system
• Create a pipeline for cleaning, segmenting and linking the
Courtroom hearings and the Transcriptions produced by
our CLM model
Publications
Saadany, H., Breslin, C., Orăsan, C. and Walker, S. (2022) Better Transcription of UK Supreme Court Hearings. arXiv:2211.17094

More Related Content

Similar to Harnessing Court Data using NLP and Spoken Language Technology

ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
Kripa (कृपा) Rajshekhar
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Edmond Lepedus
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
balmanme
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
Kripa (कृपा) Rajshekhar
 
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
white paper
 
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptxMADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
awadalsabbah
 

Similar to Harnessing Court Data using NLP and Spoken Language Technology (20)

Implementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerImplementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy Server
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
 
Top Essay Writing Websites
Top Essay Writing WebsitesTop Essay Writing Websites
Top Essay Writing Websites
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
 
Grouper
GrouperGrouper
Grouper
 
Test Estimation
Test Estimation Test Estimation
Test Estimation
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
 
Network and Multimedia QoE Management
Network and Multimedia QoE ManagementNetwork and Multimedia QoE Management
Network and Multimedia QoE Management
 
Moses
MosesMoses
Moses
 
A Recorded Debating Dataset
A Recorded Debating DatasetA Recorded Debating Dataset
A Recorded Debating Dataset
 
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesSemi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
 
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
A Two-Tiered On-Line Server-Side Bandwidth Reservation Framework for the Real...
 
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
An Efficient Distributed Control Law for Load Balancing in Content Delivery N...
 
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptxMADS4007_Fall2022-Intro to Web Technologies.docx.pptx
MADS4007_Fall2022-Intro to Web Technologies.docx.pptx
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
 

More from Constantin Orasan

New trends in NLP applications
New trends in NLP applicationsNew trends in NLP applications
New trends in NLP applications
Constantin Orasan
 

More from Constantin Orasan (9)

New trends in NLP applications
New trends in NLP applicationsNew trends in NLP applications
New trends in NLP applications
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
 
QALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic WebQALL-ME: Ontology and Semantic Web
QALL-ME: Ontology and Semantic Web
 
The role of linguistic information for shallow language processing
The role of linguistic information for shallow language processingThe role of linguistic information for shallow language processing
The role of linguistic information for shallow language processing
 
What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?What is Computer-Aided Summarisation and does it really work?
What is Computer-Aided Summarisation and does it really work?
 
Tutorial on automatic summarization
Tutorial on automatic summarizationTutorial on automatic summarization
Tutorial on automatic summarization
 
Message project leaflet
Message project leafletMessage project leaflet
Message project leaflet
 
Porting the QALL-ME framework to Romanian
Porting the QALL-ME framework to RomanianPorting the QALL-ME framework to Romanian
Porting the QALL-ME framework to Romanian
 
Annotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processingAnnotation of anaphora and coreference for automatic processing
Annotation of anaphora and coreference for automatic processing
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
UK Journal
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 

Recently uploaded (20)

The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 

Harnessing Court Data using NLP and Spoken Language Technology

  • 1. Harnessing court data using NLP and spoken language technology Constantin Orăsan and Hadeel Saadany {C.Orasan,Hadeel.Saadany}@surrey.ac.uk Objective 1: Improve Automatic Transcription of Courtroom Hearings Amazon Web Service Transcription gives errors with legal terminology/ jargon, named entities, numbers We minimize the transcription errors by building a Custom Language Model (CLM) trained on legal text and edited transcriptions. Our CLM results correct most errors of AWS generic model and reduce Word Error Rate (WER) 1. Employ NLP tools to extract a vocabulary list with most common bigrams in the edited transcripts. 2. Employ Blackstone library to create a list of legal terms 3. Scraped a large dataset of judgements (approx. 25k documents) from the UK Tribunal Decisions (https://www.judiciary.uk) to use for training the CLM. 4. Use the extracted list of legal terms and most common entities to build CLM using different training datasets (supreme court judgements and edited transcripts). Our pipeline for Improving Transcription Quality Objective 2: Linking Judgement to Transcripts The objective and design expectations • Transcripts (very long 10+ hrs.) can be used as a tool for students/academics/lawyers to learn about a particular law. • We semantically link time spans of transcription to sections/paragraphs in the judgement. • This can help understand how the judgment is reached. Progress so far • GPT 3 semantic search spots the gold-standard timestamp in a list of top 20 most relevant. • Manual analysis of results shows that the 20 extracted timestamps are relevant to the judgement and demonstrates good linking. Our Method We treat the judgement sections as a Query and the Transcriptions as potential answers. We rank spans using the cosine similarity between the GPT3 sentence embeddings of the judgement and transcription spans Next steps • Create a large gold standard for training and testing • Experiment with more NLP methods to improve linking accuracy • Produce a smaller list of timestamps linked to judgement • Gain a better understanding of how the users would use such a system • Create a pipeline for cleaning, segmenting and linking the Courtroom hearings and the Transcriptions produced by our CLM model Publications Saadany, H., Breslin, C., Orăsan, C. and Walker, S. (2022) Better Transcription of UK Supreme Court Hearings. arXiv:2211.17094