Jeremie Charlet
25th May 2016
Presentation of Taxonomy Applications
and their development
to the BBC
Introduction
– Categorisation was initially done with Autonomy: it took the Taxonomy team 2 years of work to write and perfect the category queries
– Since we migrated our search engine to Solr, we had to build the taxonomy tools from scratch
Example of a category query ("air force"):
"Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
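A category query of this kind is an OR of quoted phrases. As a minimal sketch of how such a query could be evaluated against a record, here is a plain-Python version. It is not the Autonomy or Lucene query parser: `parse_phrases` and `matches` are hypothetical helper names, the sample record texts are made up, and matching is assumed to be case-insensitive on whole words.

```python
import re
from typing import List


def parse_phrases(query: str) -> List[str]:
    """Extract the quoted phrases from an OR-separated category query."""
    return re.findall(r'"([^"]+)"', query)


def matches(query: str, text: str) -> bool:
    """Return True if at least one phrase occurs in the text.

    Assumption: matching is case-insensitive and on whole words only.
    """
    for phrase in parse_phrases(query):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


# Shortened version of the query shown on the slide.
AIR_FORCE_QUERY = '"Air Force" OR "air forces" OR "Air Ministry" OR "Air Council"'

# Made-up record descriptions, for illustration only.
print(matches(AIR_FORCE_QUERY, "Minutes of the Air Council, 1936"))
print(matches(AIR_FORCE_QUERY, "Poor law union correspondence"))
```

A real search engine adds analysis steps (stemming, synonyms, stop words) on top of this, which is what the tuning questions later in the deck are about.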
Plan
Introduction
1. Solution
2. How we implemented it
3. Attempt on Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
1. Solution
– Categories displayed on Discovery, our archives portal
– Administration User Interface for taxonomists
– Command Line Interface to categorise everything once
– Batch Job to categorise documents every day
1. Solution / Discovery
1. Solution / admin GUI
1. Solution / daily updates
Application to categorise documents every day:
1. To categorise new documents
2. To re-categorise documents when they are updated
1. Solution / categorise all docs
Application to categorise everything once:
1. To do it for the first time
2. To apply the latest modifications from taxonomists to all documents
Under the hood of taxonomy-batch-cat-all
Categorisation and updates on Solr are decoupled
1. Solution
Architecture diagram for daily updates (Java side)
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
3. Attempt on Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
2. Implementation
To get it right
• Algorithm
To get it fast
• Fine tuning
• Distributed system with Akka
2. Implementation / get it right
Many parameters to take into account:
• Is case sensitivity important? It depends
• Use punctuation? It depends
• Use synonyms? Yes
• Ignore stop words (of, the, a, …)? No, use stop words
• Use wildcards? * ?
• Which metadata to use? Title, description, context description, categories, people, places, corporate bodies
= Iterative process
How to evaluate if our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare results
To do that quickly, we created a Command Line Interface:
[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
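The comparison step (categorise the old documents again and compare with the categories from the former system) can be sketched as follows. The slides do not say which metric was used, so micro-averaged precision and recall are an assumption here. The function name `evaluate` and the sample document IDs and categories are hypothetical.

```python
from typing import Dict, Set, Tuple


def evaluate(expected: Dict[str, Set[str]],
             actual: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Compare new categorisation results with those of the former system.

    expected: document ID -> categories assigned by the former system
    actual:   document ID -> categories assigned by the new system
    Returns (precision, recall), micro-averaged over all documents.
    """
    true_pos = sum(len(cats & actual.get(doc, set()))
                   for doc, cats in expected.items())
    n_actual = sum(len(actual.get(doc, set())) for doc in expected)
    n_expected = sum(len(cats) for cats in expected.values())
    precision = true_pos / n_actual if n_actual else 0.0
    recall = true_pos / n_expected if n_expected else 0.0
    return precision, recall


# Made-up evaluation data set, for illustration only.
former = {"C1": {"Air Force", "Military"}, "C2": {"Poverty"}}
new = {"C1": {"Air Force"}, "C2": {"Poverty", "Military"}}

precision, recall = evaluate(former, new)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running such a comparison after each parameter change (synonyms on or off, stop words kept or dropped) is one way to make the iterative tuning measurable.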
2. Implementation / get it fast
We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc)
• We create an in-memory index containing a single document and run our queries against it. Then we run the matching queries against the complete index to get a score that enables us to rank matches
• Distributed system with Akka (13 processes running on 2 servers)
2 × 24-core CPU
40 GB RAM
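As a quick sanity check of these figures, the arithmetic can be written out. The per-process number assumes the load is spread evenly over the 13 processes, which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

records = 22_000_000   # records to categorise (from the slide)
days = 1.5             # total run time (from the slide)
processes = 13         # Akka processes (from the slide)

total_seconds = days * SECONDS_PER_DAY

# Wall-clock time per document for the whole system.
wall_ms_per_doc = total_seconds * 1000 / records

# Overall throughput.
docs_per_second = records / total_seconds

# Assumption: work is evenly spread, so each process spends
# `processes` times the wall-clock time on each of its documents.
ms_per_doc_per_process = wall_ms_per_doc * processes

print(f"{wall_ms_per_doc:.1f} ms per doc (wall clock)")
print(f"{docs_per_second:.0f} docs per second")
print(f"{ms_per_doc_per_process:.0f} ms per doc within one process")
```

This gives a little under 6 ms of wall-clock time per document, the same order of magnitude as the ~5 ms quoted.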
2. Implementation / get it fast
Use the right Directory implementation for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries
Use a filter instead of a query to search on only 1 document
+ use the low-level API carefully
Profile your application frequently
> Identify ugly code, where to add a cache, where to add concurrency
We spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory
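The last point (build the query objects once, then reuse them for every document) can be sketched like this. Compiled regular expressions stand in for Lucene Query objects, and the category names, expressions and function names are hypothetical. Python's `re` module has its own internal cache, so this only illustrates the structure and is not a benchmark.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical category definitions, for illustration only.
CATEGORY_QUERIES = {
    "Air Force": r"\bair (force|forces|ministry)\b",
    "Poverty": r"\b(poverty|poor law)\b",
}


def build_queries(raw: Dict[str, str]) -> Dict[str, Pattern[str]]:
    """Parse every category query once, up front."""
    return {name: re.compile(expr, re.IGNORECASE) for name, expr in raw.items()}


# Built a single time at start-up and kept in memory.
COMPILED_QUERIES = build_queries(CATEGORY_QUERIES)


def categorise(text: str) -> List[str]:
    """Return the matching categories, reusing the prebuilt query objects."""
    return sorted(name for name, query in COMPILED_QUERIES.items()
                  if query.search(text))


print(categorise("Air Ministry files on poor law relief"))
print(categorise("Railway timetables"))
```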
2. Implementation / get it fast
How to transmit documents to categorise efficiently?
By sending messages to workers
See the problem?
[Diagram: one Categorisation Supervisor and three Categorisation Workers, with many messages of the form "C456321;C65465;C654879;C56879" queued up]
Solution: http://www.michaelpollmeier.com/akka-work-pulling-pattern/
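The linked article describes the work pulling pattern: instead of the master pushing messages at workers, idle workers ask the master for work, so no worker accumulates a backlog. Below is a minimal single-threaded sketch of the idea. Plain Python objects stand in for Akka actors and messages, and all class and function names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Batch = List[str]  # a list of document IDs


class Master:
    """Holds the pending work and hands it out only on request."""

    def __init__(self, batches: List[Batch]) -> None:
        self.pending: Deque[Batch] = deque(batches)
        self.results: List[Tuple[str, Batch]] = []

    def request_work(self) -> Optional[Batch]:
        """Called by an idle worker; returns one batch, or None if nothing is left."""
        return self.pending.popleft() if self.pending else None

    def work_done(self, worker_name: str, batch: Batch) -> None:
        self.results.append((worker_name, batch))


class Worker:
    """Pulls one batch at a time, so it never holds a backlog."""

    def __init__(self, name: str, master: Master) -> None:
        self.name = name
        self.master = master

    def run_once(self) -> bool:
        batch = self.master.request_work()
        if batch is None:
            return False
        # Real categorisation would happen here.
        self.master.work_done(self.name, batch)
        return True


def run(batches: List[Batch], n_workers: int = 3) -> List[Tuple[str, Batch]]:
    master = Master(batches)
    workers = [Worker(f"worker-{i}", master) for i in range(n_workers)]
    busy = True
    while busy:
        busy = False
        for worker in workers:
            if worker.run_once():
                busy = True
    return master.results


sample_batches = [[f"C{i}"] for i in range(5)]
for worker_name, batch in run(sample_batches):
    print(worker_name, batch)
```

In an actor system the same exchange happens asynchronously through messages, but the key property is the same: a worker only ever holds the batch it asked for.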
2. Implementation / get it fast
Applied to the taxonomy applications
https://github.com/nationalarchives/taxonomy
There are 2 types of batch application (each runs in its own application server):
• 1 instance of Taxonomy-cat-all-supervisor
• N instances of Taxonomy-cat-all-worker
The categorisation supervisor browses the whole index and retrieves 1000 documents at a time
A categorisation worker receives categorisation requests, each containing a list of documents to categorise
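The supervisor's browsing of the index in chunks of 1000 documents can be sketched with a small generator. This is an illustration only and not the code from the repository. The function name `batches` and the generated document IDs are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(doc_ids: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` document IDs."""
    iterator = iter(doc_ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Made-up document IDs standing in for the records of the index.
all_ids = (f"C{i}" for i in range(2500))

for request in batches(all_ids):
    # Each list would become one categorisation request sent to a worker.
    print(len(request), request[0], request[-1])
```

Because the input is consumed lazily, the supervisor never needs to hold the IDs of all 22 million records in memory at once.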
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
– Get it right
– Get it fast
• Fine tuning
• Distributed system with Akka
3. Attempt on Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
3. Attempt on Machine Learning
Research on a training-set-based solution for 2 months
1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
– Train the system with the training set
– Evaluate it using the test set
– Iterate until satisfactory
3. Move it to production
– Classify new documents using the trained system
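The split, train and evaluate loop can be sketched as follows. The toy word-overlap classifier is a stand-in chosen for brevity, not the Lucene built-in tool that was actually used. The data set, the function names and the 80/20 split are all assumptions.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Example = Tuple[str, str]  # (document text, category)


def split(data: List[Example], test_ratio: float = 0.2,
          seed: int = 42) -> Tuple[List[Example], List[Example]]:
    """Shuffle the classified documents and split them into (training, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train(training_set: List[Example]) -> Callable[[str], str]:
    """Toy trainer: remember which words were seen with each category."""
    vocabulary: Dict[str, Set[str]] = {}
    for text, category in training_set:
        vocabulary.setdefault(category, set()).update(text.lower().split())

    def classify(text: str) -> str:
        words = set(text.lower().split())
        return max(vocabulary, key=lambda cat: len(words & vocabulary[cat]))

    return classify


def accuracy(classify: Callable[[str], str], test_set: List[Example]) -> float:
    """Share of test documents whose predicted category is the known one."""
    correct = sum(1 for text, category in test_set if classify(text) == category)
    return correct / len(test_set)


# Made-up classified documents, for illustration only.
data = [(f"air force squadron report {i}", "Air Force") for i in range(5)]
data += [(f"poor law relief register {i}", "Poverty") for i in range(5)]

training_set, test_set = split(data)
classifier = train(training_set)
print(f"accuracy on test set: {accuracy(classifier, test_set):.2f}")
```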
3. Attempt on Machine Learning
Why it did not work
1. Using category queries to create the training set
– Highly dependent on the validity/accuracy of the category queries
2. Nature of our categories
– far too many (136)
– categories too vague / broad or too similar (“Poverty”, “Military”): do not suit such a system
3. Not the right tool? We used the tool built into Lucene (search engine)
4. Nature of the data? Quality of the metadata?
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
– Get it right
– Get it fast
• Fine tuning
• Distributed system with Akka
3. Attempt on Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
Conclusion: learnings and next steps
Gains and losses
– Loss: no * wildcard within words
– Gain: categorisation 10 times faster
– Gain: use of free solutions (*)
– Gain: admin interface more fluid and usable
Possible improvements
- Update documents for 1 category on demand
- Create a more generic solution
- Add the missing GUI (reporting, categorise all)
- Build the solution upon Solr, not Lucene
- Use Cloud Services instead of on-site servers
Next steps
- Categorise other archives
- Work on new born-digital records
> New categories?
> New research on machine learning?
Solr
Lucene
Thank you for listening
Any questions?

TNA taxonomies 20160525

  • 1.
  • 2. Jeremie Charlet, 25th May 2016. Presentation of Taxonomy Applications and their development to the BBC
  • 3. Introduction – Categorisation was initially done with Autonomy: 2 years' work from the Taxonomy team to write and perfect category queries – Since we migrated our search engine to Solr, we had to build the taxonomy tools from scratch. Example category query for "air force": "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
  • 4. Plan: Introduction; 1. Solution; 2. How we implemented it; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  • 5. 1/ Solution: Categories displayed on Discovery, our archives portal; Administration User Interface for taxonomists; Command Line Interface to categorise everything once; Batch Job to categorise documents every day
  • 6. 1. Solution / Discovery
  • 7. 1. Solution / admin GUI
  • 8. 1. Solution / admin GUI
  • 9. 1. Solution / daily updates: Application to categorise documents every day: 1. to categorise new documents; 2. to re-categorise documents when they are updated
  • 10. 1. Solution / daily updates
  • 11. 1. Solution / categorise all docs: Application to categorise everything once: 1. to do it for the first time; 2. to apply the latest modifications from taxonomists to all documents
  • 12. 1. Solution / categorise all docs: under the hood of taxonomy-batch-cat-all
  • 13. 1. Solution / categorise all docs: categorisation and updates on Solr are decoupled
  • 14. 1. Solution: architecture diagram for daily updates (Java side)
  • 15. Plan: Introduction; 1. Solution (Discovery portal, Administration UI, Tool to categorise everything once, Batch Job to categorise every day); 2. How we implemented it; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  • 16. 2. Implementation: To get it right: algorithm. To get it fast: fine tuning, distributed system with Akka
  • 17. 2. Implementation / get it right: Many parameters to take into account. Is case sensitivity important? It depends. Use punctuation? It depends. Use synonyms? Yes. Ignore stop words (of, the, a, …)? No, use stop words. Use wildcards? "*" and "?". Which metadata to use? Title, description, context description, categories, people, places, corporate bodies. = Iterative process. How to evaluate if our results are valid? > Use documents and categories from the former system > Categorise them again and compare results. To do that quickly, we created a Command Line Interface: [jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
  • 18. 2. Implementation / get it fast: We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc) • We create an in-memory index containing a single document and run our queries against it. Then we run the matching queries against the complete index to get a score that lets us rank matches • Distributed system with Akka (13 processes running on 2 servers). Hardware: 2 × 24-core CPU, 40 GB RAM
  • 19. 2. Implementation / get it fast: Use the right driver (Lucene Directory implementation) for your system (NRTCachingDirectory instead of the default one) > 1 line in 1 file = 20% faster on search queries. Use a filter instead of a query to search on only 1 document + use the low-level API carefully. Profile your application frequently > Identify ugly code, where to add caching, where to add concurrency. 7% of the time was spent creating Query objects for every document: instead, create them once and store them in memory
  • 20. 2. Implementation / get it fast: How to transmit documents to categorise efficiently? By sending messages to workers. See the problem? [Diagram: a Categorisation Supervisor sending many messages, each carrying a list of document IDs such as C456321;C65465;C654879;C56879, to three Categorisation Workers]
  • 22. 2. Implementation / get it right: Applied to the taxonomy applications (https://github.com/nationalarchives/taxonomy). There are 2 types of batch application (each runs in its own application server): • 1 instance of Taxonomy-cat-all-supervisor • N instances of Taxonomy-cat-all-worker. The categorisation supervisor browses the whole index and retrieves 1000 documents at a time. A categorisation worker receives categorisation requests, each containing a list of documents to categorise
  • 23. Plan: Introduction; 1. Solution (Discovery portal, Administration UI, Tool to categorise everything once, Batch Job to categorise every day); 2. How we implemented it (Get it right; Get it fast: fine tuning, distributed system with Akka); 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  • 24. 3. Attempt on Machine Learning: Research on a training-set-based solution for 2 months. 1. Take a data set of known (already classified) documents. 2. Split it into a test set and a training set – Train the system with the training set – Evaluate it using the test set – Iterate until satisfactory. 3. Move it to production – Classify new documents using the trained system
  • 25. 3. Attempt on Machine Learning: Why it did not work. 1. Using category queries to create the training set – Highly dependent on the validity/accuracy of the category queries. 2. Nature of our categories – far too many (136) – categories too vague/broad or too similar ("Poverty", "Military"): they do not suit such a system. 3. Not the right tool? We used the built-in tool of Lucene (search engine). 4. Nature of the data? Quality of the metadata?
  • 26. Plan: Introduction; 1. Solution (Discovery portal, Administration UI, Tool to categorise everything once, Batch Job to categorise every day); 2. How we implemented it (Get it right; Get it fast: fine tuning, distributed system with Akka); 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  • 27. Conclusion: learnings and next steps. Gains and losses: no * within words; categorisation 10 times faster; use of free solutions (*); admin interface more fluid and usable
  • 28. Conclusion: learnings and next steps. Possible improvements: - Update documents for 1 category on demand - Create a more generic solution - Add missing GUI (reporting, categorise all) - Build the solution upon Solr, not Lucene - Use cloud services instead of on-site servers. Next steps: - Categorise other archives - Work on new digital-born records > New categories? > New research on machine learning?
  • 29. Thank you for listening. Any questions?
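
The evaluation step on slide 17 (re-categorise documents whose categories are already known from the former system, then compare) can be sketched as follows. This is a minimal illustration, not the project's code: the class `CategorisationEval`, its method names, and the choice of micro-averaged precision and recall as the comparison metric are assumptions; the slides only say that the results were compared.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: compares the new system's categories with the former system's. */
final class CategorisationEval {

    /** Micro-averaged precision and recall over all evaluated documents. */
    static final class Score {
        final double precision;
        final double recall;

        Score(double precision, double recall) {
            this.precision = precision;
            this.recall = recall;
        }
    }

    /**
     * @param expected categories from the former system, keyed by document id
     * @param actual   categories from the new system, keyed by document id
     * Documents that appear only in {@code actual} are ignored.
     */
    static Score evaluate(Map<String, Set<String>> expected,
                          Map<String, Set<String>> actual) {
        long agreed = 0;     // categories assigned by both systems
        long predicted = 0;  // categories assigned by the new system
        long relevant = 0;   // categories assigned by the former system
        for (Map.Entry<String, Set<String>> entry : expected.entrySet()) {
            Set<String> want = entry.getValue();
            Set<String> got = actual.getOrDefault(entry.getKey(), Set.of());
            Set<String> both = new HashSet<>(want);
            both.retainAll(got);
            agreed += both.size();
            predicted += got.size();
            relevant += want.size();
        }
        double precision = predicted == 0 ? 0.0 : (double) agreed / predicted;
        double recall = relevant == 0 ? 0.0 : (double) agreed / relevant;
        return new Score(precision, recall);
    }
}
```

Running such a comparison after each parameter change (synonyms, stop words, and so on) gives a single pair of numbers to track across iterations.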

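The supervisor/worker split on slide 22 can be sketched with plain JDK concurrency. This is not Akka and not the project's code: a thread pool stands in for the worker processes, and `BatchSupervisor`, `partition` and `categoriseAll` are hypothetical names. Only the batch size of 1000 documents comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical stand-in for the supervisor: hands batches of document ids to workers. */
final class BatchSupervisor {

    /** Batch size taken from slide 22. */
    static final int BATCH_SIZE = 1000;

    /** Splits document ids into consecutive batches of at most batchSize ids. */
    static List<List<String>> partition(List<String> docIds, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int from = 0; from < docIds.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docIds.size());
            batches.add(new ArrayList<>(docIds.subList(from, to)));
        }
        return batches;
    }

    /**
     * Submits each batch to a pool of worker threads and returns the total
     * number of documents the workers report as processed.
     * The worker function is a placeholder for categorising one batch.
     */
    static int categoriseAll(List<String> docIds, int workers,
                             Function<List<String>, Integer> worker) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> batch : partition(docIds, BATCH_SIZE)) {
                Callable<Integer> task = () -> worker.apply(batch);
                pending.add(pool.submit(task));
            }
            int processed = 0;
            for (Future<Integer> result : pending) {
                processed += result.get();
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each message carries a list of document ids rather than a single id, so the number of messages is roughly the number of documents divided by the batch size.
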
Editor's Notes

  1. 1 developer, 6-month project
  2. It's as if we were trying to categorise a catalogue of books by using their titles instead of their contents
  3. Gains and losses. Almost no loss: we decided to stop using the * wildcard within words, but it was very rarely used (only in a few categories). (*): all in-house, 6 months of developer time, using free technologies for the new platform dedicated to taxonomies. It is connected to the Discovery platform, which uses the enterprise version of the same tools (Mongo)
  4. Main possible improvements
     - Update documents for 1 category on demand. Technology influenced the end solution a lot. We tried an entirely different approach with a training-set-based solution, and when we reached a dead end and had no time left for new research, we rolled back to building something similar to the former system: taxonomists work on their queries, and the categorisation of the complete collection is done separately. It works this way because we decided to categorise one document at a time rather than work per category. The latter solution would have made it easier to update a category after the taxonomists change it.
     - Create a more generic solution. Those categories suit our TNA collections very well, but do they suit new collections that are not curated but computer-generated? What if we have access not only to the titles but also to the contents? It would require a massive, regular amount of work from the taxonomists.
     - An extra GUI to categorise everything and get reports would make the Taxonomy team completely autonomous. Linux shell scripts to: monitor daily the number of documents requested to categorise and the number of documents categorised on Discovery; get a report on untagged documents; track in real time the number of documents processed when categorising all documents, plus the speed of categorisation and the estimated completion time.
     - A much simpler solution would have been to use Solr instead of Lucene: build a POC with Solr, another with Lucene, and compare the performance.
     - Use cloud PaaS/IaaS to spawn VMs on demand: completely isolate the service from others, only use the servers we need, and possibly make it even faster.
     Our next steps
     - Categorise other archives.
     - Work on new digital-born records: an opportunity to test those categories on new/different collections and maybe create a new set of categories.
     - Complete our work with a new GUI to provide the required reporting and trigger the categorisation of everything.
     - New opportunity to do research on machine learning and come up with a more automated classification system. I would work with the taxonomy team to come up with a new taxonomy tree, consisting of maybe fewer categories, and then build a prototype with a cloud machine learning API.
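
The notes mention tracking the number of documents processed, the categorisation speed and an estimated completion time. A minimal sketch of that estimate follows. `ProgressEstimator` and its method names are hypothetical, and the estimate assumes the processing rate observed so far stays constant.

```java
import java.time.Duration;

/** Hypothetical helper for reporting categorisation speed and time left. */
final class ProgressEstimator {

    /** Average wall-clock time per document so far, in milliseconds. */
    static double millisPerDoc(long processed, Duration elapsed) {
        if (processed <= 0) {
            throw new IllegalArgumentException("processed must be positive");
        }
        return (double) elapsed.toMillis() / processed;
    }

    /** Estimated time left, assuming the rate observed so far stays constant. */
    static Duration remaining(long processed, long total, Duration elapsed) {
        long left = Math.max(0, total - processed);
        return Duration.ofMillis(Math.round(millisPerDoc(processed, elapsed) * left));
    }
}
```

With the figures from slide 18, `millisPerDoc(22_000_000, Duration.ofHours(36))` gives about 5.9 ms per document of wall-clock time, consistent with the slide's "~5 ms per doc".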