Abaca: Technically Assisted Sensitivity Review of Digital Records
Agenda
● Transferring of Records to Archives
● The Digital Problem
● The Abaca Project
● Abaca Classifier Experiment
● The Test Collection
● The Abaca Project - Where Next?
● Break-Out Group Session
● Groups Discussion
Transferring of Records to Archives
● Department selects and appraises records for permanent preservation
– In paper, about 5% of output is selected; for digital this may rise to 20%
● Prior to transfer, department must complete sensitivity review
– Paper review is well understood
– Digital presents many new challenges and is not so well understood
● Hence our research!
The Digital Problem
● The file has gone
● Volume will increase
– The way business is done has changed
– Largely unstructured despite EDRMs
● Big transfers of departmental records
● Appraisal
– Separate issue not addressed today
● Precautionary closure
– Need to research a solution
● Not unique to public records
Our Approach
● Provide a Framework of Utilities ...
– to assist the Review Process
● Need Methods ...
– that respect the reality of Digital Records in all their “Glory”
– that can be tailored to specific circumstances
● Need tools ...
– to help reviewers be more productive
The Abaca Project
● Research to show that utilities will help
● Two Phases
– Proof of Concept (In Progress)
– Full Project (Seeking external funding)
● Today we are describing our proof-of-concept work
● Abaca: Technically Assisted Sensitivity Review of Digital Records
Abaca Classifier Experiment
● Overview of the Task & Approach
● Predicting Exemptions using a Classifier
– Features
– Types of Features
● Example Sensitive Document
● Research Question
● Overview of Classification
● Evaluation Methodology
● Results
The Task
Produce a classifier that can predict the presence of sensitive material within unstructured text.
Initially focusing on two FOIA sensitivities:
● Section 27: International Relations
● Section 40: Personal Information
Approach
Manually review sensitive data to create a test collection.
Split test collection into training and test sets.
Train a classifier to predict the sensitivities in documents using the set of identified features.
Test the classifier on previously “unseen” documents.
Measure classification success.
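The steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's implementation: the toy documents, the helper names (`split_collection`, `featurise`, `train`, `predict`) and the similarity-based classifier are all assumptions made for the example, since the slides do not say which learning algorithm was used.

```python
import math
import random
from collections import Counter

def split_collection(docs, test_fraction=0.25, seed=0):
    """Shuffle labelled documents and split them into (training, test) lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def featurise(text):
    """Toy feature extraction (an assumption): lower-cased word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[term] for term, value in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(training):
    """Sum the feature vectors of each class (a stand-in for a real learner)."""
    model = {}
    for text, label in training:
        model.setdefault(label, Counter()).update(featurise(text))
    return model

def predict(model, text):
    """Return the label whose class vector is most similar to the document."""
    features = featurise(text)
    return max(sorted(model), key=lambda label: cosine(features, model[label]))

# Hypothetical, hand-made test collection of (text, assessor judgement) pairs.
collection = [
    ("ambassador cable on treaty negotiations", "sensitive"),
    ("minister briefed on treaty talks with embassy", "sensitive"),
    ("embassy report on treaty dispute", "sensitive"),
    ("confidential treaty draft sent to ambassador", "sensitive"),
    ("canteen menu for next week", "not_sensitive"),
    ("canteen closed for refurbishment", "not_sensitive"),
    ("new canteen opening hours", "not_sensitive"),
    ("staff canteen survey results", "not_sensitive"),
]

training, test = split_collection(collection)
model = train(training)
correct = sum(predict(model, text) == label for text, label in test)
print(f"{correct} of {len(test)} unseen documents classified correctly")
```

Fixing the random seed keeps the split reproducible, so that runs with different feature sets are compared on the same unseen documents.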
Predict Exemptions Using a Classifier
[Diagram: two parallel pipelines, each drawing on External Resources. Training: Feature Extraction, then Learn Classifier, producing the Learned Model. Prediction: Feature Extraction, then Run Classifier using the Learned Model, producing Predictions. In both pipelines, features are represented as real numbers and documents as feature vectors.]
Features
A document's features, such as the words it contains or the entities it references, convey information about the document.
A document can be modelled by using a statistical representation of its features.
We use external knowledge bases, Natural Language Processing and semantic analysis to better understand the document features.
The classifier recognises patterns in the documents’ feature sets and uses them for prediction.
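As a rough illustration of a "statistical representation", the sketch below turns each document into a vector of tf-idf weights, one real number per word. It assumes whitespace tokenisation and one common tf-idf variant (relative term frequency times the log of inverse document frequency); the slides do not state which weighting the project used, and the example sentences are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document using a basic tf-idf weighting."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Invented example documents.
docs = [
    "visit to the embassy",
    "minutes of the meeting",
    "the embassy meeting",
]
vectors = tf_idf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vector.items()})
```

A word that occurs in every document, such as "the" here, receives a weight of zero, while words concentrated in a few documents are weighted more heavily.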
Types of Features
The features we use can be divided into three main categories.
● Structure
– Examples: Lists of Words (tf/idf), Document Length, Number of Recipients
– Comments: Ubiquitous throughout the collection. Can expose patterns in document types. High value information about the nature of the communication.
● Content
– Examples: Subjectivity, Verbs, “D.O.B”, Negation
– Comments: By applying techniques such as Natural Language Processing and dictionary-based term matching, we can identify the tone of the communication.
● Entities
– Examples: Countries, People, Organisations
– Comments: Tells us what the document “is about”. Context related to the entity, such as a “high-risk” country or a “significant” person or role, can suggest sensitivity likelihood.
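A few of these feature types can be sketched with simple dictionary-based term matching. Everything specific below is hypothetical: the country names and risk scores, the negation word list, the "D.O.B" pattern and the feature names are placeholders chosen for illustration, not the resources or definitions used in the project.

```python
import re

# Hypothetical dictionaries; a real system would draw these from external resources.
COUNTRY_RISK = {"ruritania": 0.9, "freedonia": 0.2}
NEGATION_TERMS = {"not", "no", "never", "cannot"}
DOB_PATTERN = re.compile(r"\b(?:d\.?o\.?b\.?|date of birth)(?![a-z])", re.IGNORECASE)

def extract_features(text, recipients):
    """Map a document to named real-valued features (structure, content, entities)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    countries = [t for t in tokens if t in COUNTRY_RISK]
    negations = sum(t in NEGATION_TERMS for t in tokens)
    return {
        # Structure
        "document_length": float(len(tokens)),
        "recipient_count": float(len(recipients)),
        # Content
        "dob_score": float(len(DOB_PATTERN.findall(text))),
        "negation_score": negations / len(tokens) if tokens else 0.0,
        # Entities
        "country_count": float(len(countries)),
        "country_risk_score": max((COUNTRY_RISK[c] for c in countries), default=0.0),
    }

# Invented example message.
features = extract_features(
    "The applicant (D.O.B 01/02/1970) did not travel to Ruritania.",
    recipients=["a@example.org", "b@example.org"],
)
print(features)
```

Each feature is a real number, so the resulting dictionary can be flattened into a feature vector and appended to the word-based features.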
Research Question:
Can we produce a classifier that can predict the presence of sensitive material within unstructured text?
Measure:
Balanced Accuracy - arithmetic mean of the true positive rate and the true negative rate, with random = 0.5000
Test Collection:
Total Documents 1849
Total Section 27 208
Total Section 40 142
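Balanced accuracy can be computed as in the sketch below. The function name and the toy judgements are assumptions for illustration. The example uses an imbalanced collection to show why the measure suits this task: a classifier that never predicts "sensitive" scores well on plain accuracy but only 0.5 on balanced accuracy.

```python
def balanced_accuracy(judgements, predictions):
    """Mean of the true positive rate and the true negative rate.

    Both arguments are lists of booleans (True = sensitive).
    Assumes at least one sensitive and one non-sensitive judgement.
    """
    pairs = list(zip(judgements, predictions))
    true_pos = sum(j and p for j, p in pairs)
    true_neg = sum((not j) and (not p) for j, p in pairs)
    positives = sum(judgements)
    negatives = len(judgements) - positives
    return 0.5 * (true_pos / positives + true_neg / negatives)

# Toy collection: 2 sensitive documents out of 10.
judgements = [True, True] + [False] * 8

never_sensitive = [False] * 10
plain_accuracy = sum(j == p for j, p in zip(judgements, never_sensitive)) / len(judgements)
print(plain_accuracy)                                  # 0.8
print(balanced_accuracy(judgements, never_sensitive))  # 0.5

some_skill = [True, False] + [False] * 7 + [True]
print(balanced_accuracy(judgements, some_skill))       # 0.6875
```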
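Overview of Classification
[Diagram: the Test Collection is split into training data and unseen data. Learn Classifier on training data produces the Learned Model; Run Classifier on unseen data uses the Learned Model to produce Predictions.]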
Evaluation Methodology
[Diagram: Assessor Judgments from the Test Collection and Classifier Predictions feed into Statistical analysis, which produces the Results.]
Results
By adding features to a tf/idf text classification baseline, we see noticeable improvement in both Section 27 and Section 40 predictions; the highest scores are for + Country Count (0.6453 for s27, 0.6406 for s40).
But there is still much work to be done!
Balanced Accuracy
Features               s27     s40
Text Classification    0.6327  0.6344
+ Source Count         0.6369  0.6303
+ Country Count        0.6453  0.6406
+ Country Risk Score   0.6417  0.6368
+ DOB Score            0.6327  0.6391
+ Negation Score       0.6378  0.6382
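To make the size of each change easier to read, the sketch below subtracts the Text Classification baseline from every other row of the table. It assumes each row can be compared directly with the baseline; the slides do not say whether the added features are cumulative.

```python
# Balanced accuracy figures copied from the results table: (s27, s40).
results = {
    "Text Classification": (0.6327, 0.6344),
    "+ Source Count": (0.6369, 0.6303),
    "+ Country Count": (0.6453, 0.6406),
    "+ Country Risk Score": (0.6417, 0.6368),
    "+ DOB Score": (0.6327, 0.6391),
    "+ Negation Score": (0.6378, 0.6382),
}

baseline_s27, baseline_s40 = results["Text Classification"]

# Difference from the baseline, rounded to the table's four decimal places.
deltas = {
    name: (round(s27 - baseline_s27, 4), round(s40 - baseline_s40, 4))
    for name, (s27, s40) in results.items()
    if name != "Text Classification"
}

for name, (d27, d40) in deltas.items():
    print(f"{name:<22} s27 {d27:+.4f}  s40 {d40:+.4f}")
```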
Test Collection - Aims
● To provide sensitivity judgements and training data to develop and measure tools
● To measure and understand assessors’ behaviour
Test Collection - Measurements
● Time
● Agreement of sensitivity
– Not previously studied
● Hard Judgements
● Identify borderline cases
● Sensitivity sub-categories
– Good indicator for features
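One common way to quantify agreement between two assessors is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The slides do not say which agreement statistic the project used, so the sketch below is only an assumed illustration with invented judgements.

```python
from collections import Counter

def cohens_kappa(assessor_a, assessor_b):
    """Chance-corrected agreement between two assessors' labels.

    Assumes both lists judge the same documents in the same order and that
    chance agreement is below 1 (otherwise the denominator is zero).
    """
    n = len(assessor_a)
    observed = sum(a == b for a, b in zip(assessor_a, assessor_b)) / n
    freq_a = Counter(assessor_a)
    freq_b = Counter(assessor_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented judgements for eight documents: S = sensitive, N = not sensitive.
assessor_a = ["S", "S", "N", "N", "N", "N", "S", "N"]
assessor_b = ["S", "N", "N", "N", "N", "N", "S", "N"]

kappa = cohens_kappa(assessor_a, assessor_b)
print(round(kappa, 3))
```

Here the assessors agree on 7 of 8 documents (0.875), but because most documents are not sensitive, a fair amount of that agreement would be expected by chance, so kappa is lower than the raw rate.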
The Abaca Project - Where Next?
● Understanding the real digital environment
– Changes in working practice
● Testing our proof-of-concept system against real data
● More, wider and deeper
– More exemptions, more data, more features
– BIS, HO, MOJ, FCO, ... and more to come!
– Funding
Questions and Feedback
Break-Out Groups
Aims:
Discuss sensitivity review in the Welsh Government and language context.
Share your understanding and develop some ideas.
Break-Out Groups
Questions:
1. What digital records does the Welsh Government create?
2. What sort of sensitivities are expected within these digital records?
3. What aspects of the sensitivity review process could be technically supported by a software tool or system?
4. What document features could be used to identify the expected sensitivities?
Contact
http://projectabaca.wordpress.com/
graham.mcdonald@glasgow.ac.uk

