Content OCRing and Scoring
Mauro Teixeira
mauro.teixeira@cern.ch | @mauro1855
Learn. Connect. Collaborate.
About CERN
• At the European Organization for Nuclear Research,
physicists and engineers are probing the fundamental
structure of the universe;
• Accelerators boost beams of particles to high energies
before the beams are made to collide with each other
or with stationary targets. Detectors observe and record
the results of these collisions;
• The WWW was born at CERN.
Our usage of Alfresco
• Administrative staff connect to eFiles (our instance of Alfresco) to upload documents or to search for them;
• Uploads are restricted to PDFs and, for each document uploaded, users need to input a predefined set of metadata;
• To search for documents, we use Alfresco search capabilities with SOLR 4, configured to search on our custom properties (metadata) as well as node content.
OCR
The challenge of the OCR*
Challenge:
• In order to search for words in the PDFs, they need to have embedded text, which meant that scanned documents needed to be OCRed;
• Users often had to manually run an OCR process on the PDFs before uploading them;
• Still, around 30% of PDFs uploaded didn’t have embedded text content.
Solution:
• Ideally, eFiles would OCR documents automatically as soon as they were uploaded;
• We would offer OCR as a Service, so that users and other application developers could also benefit from our efforts.
*OCR = Optical character recognition
Task requirements
• OCR Service side:
  ✓ independent asynchronous service to OCR PDF documents;
  ✓ protection against own failure;
  ✓ protection against failure of the client.
• Alfresco side:
  ✓ uploaded documents are sent to the OCR Service;
  ✓ protection against failure of the OCR Service;
  ✓ expose an endpoint to receive the OCRed document.
OCR Service
Independent asynchronous service
• Spring Boot application with one endpoint exposed;
• Arriving OCR requests are queued;
• Each queued document is OCRed with pypdfocr*;
• The resulting PDF file is sent back to Alfresco.
*@virantha/pypdfocr
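The queueing step can be illustrated with a minimal sketch, assuming a single worker thread and a request priority where a lower number is served first. The slides do not show how the service's own executor is built, so the class `PriorityQueueSketch`, the `OcrJob` type and the tie-break on request ID are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PriorityQueueSketch {

    /** Hypothetical queued job; a lower priority value runs first (an assumption). */
    static final class OcrJob implements Runnable, Comparable<OcrJob> {
        final short priority;
        final long requestId;
        final Runnable work;

        OcrJob(short priority, long requestId, Runnable work) {
            this.priority = priority;
            this.requestId = requestId;
            this.work = work;
        }

        @Override
        public void run() {
            work.run();
        }

        @Override
        public int compareTo(OcrJob other) {
            int byPriority = Short.compare(priority, other.priority);
            // Ties fall back to arrival order, approximated here by the request ID.
            return byPriority != 0 ? byPriority : Long.compare(requestId, other.requestId);
        }
    }

    /** Single worker backed by a priority queue. Jobs must go through execute(), not submit(). */
    static ThreadPoolExecutor newExecutor() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<Runnable>());
    }

    /** Queues three jobs behind a blocked worker and returns the order in which they ran. */
    static List<Long> demoOrder() {
        ThreadPoolExecutor executor = newExecutor();
        List<Long> processed = Collections.synchronizedList(new ArrayList<Long>());
        CountDownLatch gate = new CountDownLatch(1);
        // The first job goes straight to the worker and holds it until the gate opens.
        executor.execute(new OcrJob((short) 0, 0L, () -> awaitQuietly(gate)));
        executor.execute(new OcrJob((short) 5, 1L, () -> processed.add(1L)));
        executor.execute(new OcrJob((short) 1, 2L, () -> processed.add(2L)));
        executor.execute(new OcrJob((short) 5, 3L, () -> processed.add(3L)));
        gate.countDown();
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(processed);
    }

    private static void awaitQuietly(CountDownLatch gate) {
        try {
            gate.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(demoOrder());
    }
}
```

In this sketch the jobs are handed to `execute()` because `submit()` would wrap them in a `FutureTask`, which is not comparable and cannot be ordered by the priority queue.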
OCR Service
Independent asynchronous service
@RequestMapping(value = "/request", method = RequestMethod.POST)
@ResponseBody
public ResponseEntity<String> registerNewOCRRequest(
        @RequestParam(required = false) String requestorReference,
        @RequestParam Short priority,
        @RequestParam String callbackEndpoint,
        @RequestParam HttpMethod callbackMethod,
        @RequestParam MultipartFile file)
OCR Service
Independent asynchronous service
Response:
202 – ACCEPTED
{
  "success": true,
  "message": "The request was accepted",
  "requestorReference": "SameRequestorReferenceAsBefore",
  "requestID": 12000
}
OCR Service
Protection against own failure
• Every step of the way is stored in a database table;
• When the application starts, it will fetch all incomplete requests:
@PostConstruct
public void restoreRequestsFromLastSession() {
    // Re-queue every request that was not completed before the last shutdown
    List<OCRRequest> requests = ocrRequestRepository.getAllUnprocessedRequests();
    for (OCRRequest req : requests) {
        priorityExecutor.submit(ocrRequestWorker.getRunnable(req));
    }
}
OCR Service
Protection against failure of the client
• If the client cannot be contacted, the request is marked as “Failed communication”, and the service retries sending the file to the client every 30 seconds:
@Scheduled(fixedDelay = 30000L)
public void processReplyQueue() {
    List<OCRRequest> requests = getAllFailedCommunicationRequests();
    // tries to reply to the client of each request
    for (OCRRequest req : requests) {
        replyToRequest(req);
    }
}
Alfresco
Uploaded documents are sent to the OCR Service
• Every time a document is uploaded and its node type is changed to our custom model type, a behavior requests OCR for the document;
• Our custom metadata model includes an aspect for OCR status.
Document sent to OCR Service:
OCR: {
  "ocrRequestId": 12000,
  "ocrRequestStartDate": Date,
  "ocrRequestFinishDate": null,
  "ocrRequestSuccessful": false
}
OCR is complete:
OCR: {
  "ocrRequestId": 12000,
  "ocrRequestStartDate": Date,
  "ocrRequestFinishDate": Date + Δt,
  "ocrRequestSuccessful": true
}
Alfresco
Protection against failure of the OCR Service
• The OCR Service might fail, in which case we need to send all the documents that are still not
OCRed to the service once it is available. We use a job that runs every hour;
• Additionally, this job also allows documents uploaded prior to the existence of the service to be
sent for OCR.
@Override
public void execute() {
    // Send every eFiles node that has not been OCRed yet to the OCR Service
    List<NodeEntity> nodesToOCR = efilesCannedQueryService.getEfilesNodesNotOCRed();
    for (NodeEntity node : nodesToOCR) {
        NodeRef nodeRef = node.getNodeRef();
        efilesOCRService.sendOCRNodeRequest(nodeRef);
    }
}
Alfresco
Expose an endpoint to receive the OCRed document
• Because OCRing a document takes time, the OCR Service sends the OCRed PDF back at a later time, so we created an endpoint on the Alfresco side:
@AlfrescoAuthentication(AuthenticationType.USER)
@AlfrescoTransaction
@RequestMapping(value = "/receiveOCR",
        method = RequestMethod.POST,
        consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
@ResponseBody
public ResponseEntity<String> getOCRResponse(
        @RequestParam Long requestId,
        @RequestParam String requestorReference,
        @RequestParam Integer statusCode,
        @RequestParam(required = false) MultipartFile file)
@dgradecak/alfresco-mvc
OCR Service deployed
• Since 8 May 2017, the OCR Service has processed more than 170,000 files;
• Average time to OCR a document: 24 seconds;
• OCR success rate: 99.94%.
@mauro1855/ocr-service
Score
The challenge of metadata quality
• When users upload a document, they need to fill out a set of metadata for the document;
• The administrative personnel report that the metadata for files is not always correct;
• We want to devise a way to automatically confirm the metadata of the document based on its content;
• Because we OCRed all PDFs, we now have the OCRed text to use in our validation.
Metadata quality scoring system
➢ Find metadata property in content;
➢ Calculate a score for the metadata property;
➢ Calculate the aggregated score for the document;
➢ Score all documents;
➢ Inform users of documents with low score;
➢ Allow users to confirm document metadata.
Scoring system
Find metadata in content
• OCR software is not 100% accurate, and it is more likely than not that it has misrecognized some letters or words in the PDF;
• It is therefore not possible to simply use equals() to find the metadata;
• Instead, we use string distance algorithms to find the string in the content that is most similar to the metadata property;
• Because of OCR misrecognition, we first remove all accents and non-alphanumeric characters, and only then compare.
Scoring system
Find metadata in content
Normalizer.normalize(contentToScore, Normalizer.Form.NFD)
        .replaceAll("[^\\p{Alnum}]", "");
• Canonical decomposition: é (U+00E9) → e´ (U+0065 U+0301)
• Code application example:
1123 T!éix$#èïr]a5
→ 1123 T!e´ix$#e`i¨r]a5 (after NFD decomposition)
→ 1123Teixeira5 (after removing non-alphanumeric characters)
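As a runnable version of the normalization step, here is a minimal sketch using only the JDK. The class name `NormalizeSketch` and the method name `normalize` are hypothetical; the two calls inside are the ones shown on the slide.

```java
import java.text.Normalizer;

public class NormalizeSketch {

    /** Decomposes accented characters, then drops everything that is not an ASCII letter or digit. */
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{Alnum} is ASCII-only by default, so the combining accents split off by NFD are removed too.
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Accented letters are written as escapes so the source file encoding does not matter.
        System.out.println(normalize("1123 T!\u00e9ix$#\u00e8\u00efr]a5")); // prints 1123Teixeira5
    }
}
```

Note that the regular expression needs the escaped form `\\p{Alnum}` in Java source; without the backslashes the character class would match the literal letters and braces instead.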
Scoring system
Find metadata in content
• We split the content into every possible substring with the same length as the metadata value we are looking for, and apply the Levenshtein distance algorithm to each one.
StringUtils.getLevenshteinDistance(metadataProperty, partialContent)
Example:
Metadata property: “Teixeira”
Content: 1123Teixeira5
• We record the minimum distance found and convert it to a score:
$\text{score} = \frac{\text{length} - \text{distance}}{\text{length}} \times 100\%$
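The sliding comparison and the score formula can be sketched as follows. This is an illustration under stated assumptions, not the eFiles implementation: the class `SlidingScoreSketch` and its methods are hypothetical, the Levenshtein distance is written out by hand to keep the example dependency-free (the slide uses `StringUtils.getLevenshteinDistance`), and lower-casing both strings is an assumption. Content shorter than the metadata value simply scores 0 here.

```java
import java.util.Locale;

public class SlidingScoreSketch {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Best score in percent over every content window as long as the metadata value. */
    static double bestScore(String metadata, String content) {
        String m = metadata.toLowerCase(Locale.ROOT); // case folding is an assumption
        String c = content.toLowerCase(Locale.ROOT);
        int length = m.length();
        if (length == 0) {
            return 0.0;
        }
        int minDistance = length; // worst case for two strings of equal length
        for (int start = 0; start + length <= c.length(); start++) {
            String window = c.substring(start, start + length);
            minDistance = Math.min(minDistance, levenshtein(m, window));
            if (minDistance == 0) {
                break; // exact match, no better window exists
            }
        }
        return (length - minDistance) * 100.0 / length;
    }

    public static void main(String[] args) {
        System.out.println(bestScore("Teixeira", "1123Teixeira5"));
        System.out.println(bestScore("Teixeira", "1123Teixelra5"));
    }
}
```

With the slide's example, the window "Teixeira" has distance 0 and the score is 100%; a single misread letter in an 8-character value gives (8 - 1) / 8 = 87.5%.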
Scoring system
Calculate the aggregated score for the document
• An Alfresco node can be represented as a tree:
• The root is the node itself;
• The children of the node are aspects;
• The children of the aspects are properties.
Node
  Person aspect: ID, First name, Last name
  Institute aspect: ID, Institute name
  Budget Code aspect: ID, Budget code name
[Diagram labels: Average, Weight, Max, Min; 50%, 25%, 25%]
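One way to read the tree is sketched below. The slide text does not preserve which aggregation function or weight belongs to which aspect, so everything here is an assumption made for illustration: each aspect combines its property scores with one of Average, Max or Min, and the node score is a weighted average of the aspect scores. The class `NodeScoreSketch` and its methods are hypothetical.

```java
import java.util.Arrays;

public class NodeScoreSketch {

    /** Aggregation functions named on the slide; how they map to aspects is assumed. */
    enum Aggregation { AVERAGE, MAX, MIN }

    /** Combines the property scores of one aspect into a single aspect score. */
    static double aggregate(Aggregation how, double[] propertyScores) {
        switch (how) {
            case MAX:
                return Arrays.stream(propertyScores).max().orElse(0.0);
            case MIN:
                return Arrays.stream(propertyScores).min().orElse(0.0);
            default:
                return Arrays.stream(propertyScores).average().orElse(0.0);
        }
    }

    /** Weighted average of the aspect scores; weights need not sum to 1. */
    static double nodeScore(double[] weights, double[] aspectScores) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < aspectScores.length; i++) {
            total += weights[i] * aspectScores[i];
            weightSum += weights[i];
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        // Illustrative property scores, not data from the presentation.
        double person = aggregate(Aggregation.AVERAGE, new double[] {100, 80, 60});
        double institute = aggregate(Aggregation.MAX, new double[] {40, 90});
        double budget = aggregate(Aggregation.MIN, new double[] {20, 70});
        double[] weights = {0.50, 0.25, 0.25};
        System.out.println(nodeScore(weights, new double[] {person, institute, budget}));
    }
}
```

With these made-up inputs the aspect scores are 80, 90 and 20, and the weighted node score is 0.5 × 80 + 0.25 × 90 + 0.25 × 20 = 67.5.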
Scoring system
Calculate the aggregated score for the document
scored: {
  "score": 55,
  "scoringDate": Date,
  "scoringExecutionTime": 10,
  "scoringSkipped": false,
  "manuallyChecked": false
}
Scoring system
Score all documents
• Score all existing and new documents in order to find incorrect metadata;
• This can be achieved through a job that retrieves documents that have already been OCRed;
• Every hour, 2,500 documents are retrieved and scored.
Scoring system
Inform users of documents with low score
• Our team decided that users should double-check documents with a score lower than 60%;
• We used smart folders to show these low-scoring documents to our users.
"query": "PATH:'<our path>' AND (ef:score:[0 TO 60])"
Scoring system
Allow users to confirm document metadata
• All documents that are scored have a new action available in Share: “Confirm metadata is correct”;
• A “manual check” from the user overrides the score and confirms that the metadata is correct for the document.
scored: {
  "score": 100,
  "scoringDate": Date,
  "scoringExecutionTime": 10,
  "scoringSkipped": false,
  "manuallyChecked": true
}
Scoring system deployed
• After testing with thousands of randomly chosen documents, we got some indicators on the average score for documents in our Alfresco instance;
• Initially we were only trying to find documents with incorrect metadata;
• We soon realized that the scores obtained also allowed us to evaluate the quality of our brand new OCR Service.
Scoring system deployed
Scoring for contracts:
Other documents with similar results:
• CVs;
• Bank details;
• …
Scoring system deployed
Scoring for Visas:
Other documents with similar results:
• Passports;
• ID Cards;
• …
Scoring system deployed
OCRed text for the previous slide’s example:
...m...immlwÈ"'23:22:12”nponype/TypeCddlgodoPels/mœunmsmœ/omœräamnmWALMM.PCPRTTHATXXX[DilAaefunWSumm/NomPEREÎIRA
TEIXEIRALUZ]Manuel's)préprln(s)/GhannannlsHPrénomrns)MAUR31}RAFAEL[031Ma.../mary...[04]Alum/Hummus:PDRTUGUESA1.66m,.[05]Da
tedenasclrrærlbo/Dumalm/Daœdem[08]Nu'merude...mir-nd...—WOULD BE TOO EASY[GT]Sexo/Sum[08]LocaldeneWafa—
uma...LISBOAïISBOA[09]Datadeemlssäo/DmMime/Damœdeuvmnœ[10]Aumndadelmw/AMXXXXXXXXXXSEF-
SERVEREFRONÎEIRAS[11]Vélidoeté/Daœufwiry/Daœdwmnm[12]AssinatmedaMar/mam'm/Signm‘uredunmlanXXXXXXXXXXX
➔ Score of 100% for the word “Teixeira”
Bad, but common, example for Visa text (OCRed):
ami-md'un_I..-[ifi";"“E'"...“UL"für—J0-...‘1-u*;M...“"“"1n&li'"r-ninhf:ni—_«...._A‘.,1‘.p,V.,,‘i‘A.....‘È!_,":..“!.???KL':-
_......__‘rigamAuK"_rmàyîwfäa,ngrs—suaŸ“wat—seanxäâ‘k—zææâäê.it?-<%“"|__-F‘ITGE‘T'n’Jfn-
TEIXEIRA
Next steps
• Find a more suitable OCR software solution;
• Improve the scoring algorithm to take into account common OCR misrecognitions;
• Create a suggestion service that reads the OCRed content and suggests the right metadata to the user.
Thank you!

Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Content OCRing and Scoring

  • 1. Content OCRing and Scoring Mauro Teixeira mauro.teixeira@cern.ch | @mauro1855
  • 2. Learn. Connect. Collaborate. About CERN
    • At the European Organization for Nuclear Research, physicists and engineers are probing the fundamental structure of the universe;
    • Accelerators boost beams of particles to high energies before the beams are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions;
    • The WWW was born at CERN.
  • 3. Our usage of Alfresco
    • Administrative staff connect to eFiles (our instance of Alfresco) to upload documents or to search for them;
    • Uploaded documents are restricted to PDFs and, for each uploaded document, users need to input a predefined set of metadata;
    • To search for documents, we use Alfresco search capabilities with SOLR 4, configured to search on our custom properties (metadata) as well as on node content.
  • 4. OCR
  • 5. The challenge of the OCR*
    Challenge:
    • In order to search for words in the PDFs, they need to have embedded text, which meant that scanned documents needed to be OCRed;
    • Users often had to manually run an OCR process on the PDFs before uploading them;
    • Still, around 30% of PDFs uploaded didn’t have embedded text content.
    Solution:
    • Ideally, eFiles would OCR documents automatically as soon as they were uploaded;
    • We would offer OCR as a Service, so that users and other application developers could also benefit from our efforts.
    *OCR = Optical character recognition
  • 6. Task requirements
    • OCR Service side:
      - independent asynchronous service to OCR PDF documents;
      - protection against own failure;
      - protection against failure of the client.
    • Alfresco side:
      - uploaded documents are sent to the OCR Service;
      - protection against failure of the OCR Service;
      - expose an endpoint to receive the OCRed document.
  • 7. OCR Service: Independent asynchronous service
    • Spring Boot application with one endpoint exposed;
    • Arriving OCR requests are queued;
    • OCR (pypdfocr*);
    • Send the PDF file back to Alfresco.
    *@virantha/pypdfocr
  • 8. OCR Service: Independent asynchronous service
        @RequestMapping(value = "/request", method = RequestMethod.POST)
        @ResponseBody
        public ResponseEntity<String> registerNewOCRRequest(
                @RequestParam(required = false) String requestorReference,
                @RequestParam Short priority,
                @RequestParam String callbackEndpoint,
                @RequestParam HttpMethod callbackMethod,
                @RequestParam MultipartFile file)
  • 9. OCR Service: Independent asynchronous service
    Response: 202 – ACCEPTED
        {
          "success": true,
          "message": "The request was accepted",
          "requestorReference": "SameRequestorReferenceAsBefore",
          "requestID": 12000
        }
  • 10. OCR Service: Protection against own failure
    • Every step of the way is stored in a database table;
    • When the application starts, it will fetch all incomplete requests:
        @PostConstruct
        public void restoreRequestsFromLastSession() {
            // Re-queue every request that was not finished in the previous session
            List<OCRRequest> requests = ocrRequestRepository.getAllUnprocessedRequests();
            for (OCRRequest req : requests) {
                priorityExecutor.submit(ocrRequestWorker.getRunnable(req));
            }
        }
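The slides do not show how `priorityExecutor` is built. As a minimal sketch only, one way to get priority ordering is a `ThreadPoolExecutor` backed by a `PriorityBlockingQueue`. The names `PrioritizedTask` and `PriorityExecutorSketch` are hypothetical, and "lower number runs first" is an assumption. Note that with a plain `ThreadPoolExecutor`, `submit()` wraps the task in a `FutureTask` that is not `Comparable`, so this sketch uses `execute()`; the service on the slide presumably handles this differently.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical task wrapper: a lower priority value runs first (an assumption).
class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
    final short priority;
    final Runnable work;

    PrioritizedTask(short priority, Runnable work) {
        this.priority = priority;
        this.work = work;
    }

    @Override
    public void run() {
        work.run();
    }

    @Override
    public int compareTo(PrioritizedTask other) {
        return Short.compare(priority, other.priority);
    }
}

class PriorityExecutorSketch {
    // A single worker thread, so queued tasks run strictly in priority order.
    static ThreadPoolExecutor newPriorityExecutor() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<Runnable>());
    }

    // Queues three tasks behind a blocked worker and returns the order they ran in.
    static List<String> demo() {
        ThreadPoolExecutor executor = newPriorityExecutor();
        List<String> order = new CopyOnWriteArrayList<>();
        CountDownLatch gate = new CountDownLatch(1);

        // Occupy the only worker so the following tasks pile up in the queue.
        executor.execute(new PrioritizedTask((short) 0, () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        executor.execute(new PrioritizedTask((short) 5, () -> order.add("low")));
        executor.execute(new PrioritizedTask((short) 1, () -> order.add("high")));
        executor.execute(new PrioritizedTask((short) 3, () -> order.add("medium")));

        gate.countDown();    // release the worker
        executor.shutdown(); // queued tasks still run after shutdown()
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return order;
    }
}
```

Calling `PriorityExecutorSketch.demo()` returns the labels in priority order rather than submission order, which is the behaviour the `priority` request parameter suggests.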
  • 11. OCR Service: Protection against failure of the client
    • If the client cannot be contacted, the request is marked as “Failed communication”. The service then retries sending the file to the client every 30 seconds:
        @Scheduled(fixedDelay = 30000L)
        public void processReplyQueue() {
            List<OCRRequest> requests = getAllFailedCommunicationRequests();
            // tries to reply to the client of each request
            for (OCRRequest req : requests) {
                replyToRequest(req);
            }
        }
  • 12. Alfresco: Uploaded documents are sent to the OCR Service
    • Every time a document is uploaded and its node type is changed to our custom model type, we request the document OCR using a behavior;
    • Our custom metadata model includes an aspect for OCR status.
    Document sent to OCR Service:
        OCR: {
          "ocrRequestId": 12000,
          "ocrRequestStartDate": Date,
          "ocrRequestFinishDate": null,
          "ocrRequestSuccessful": false
        }
    OCR is complete:
        OCR: {
          "ocrRequestId": 12000,
          "ocrRequestStartDate": Date,
          "ocrRequestFinishDate": Date + Δt,
          "ocrRequestSuccessful": true
        }
  • 13. Alfresco: Protection against failure of the OCR Service
    • The OCR Service might fail, in which case we need to send all the documents that are still not OCRed to the service once it is available again. We use a job that runs every hour;
    • Additionally, this job allows documents uploaded before the service existed to be sent for OCR.
        @Override
        public void execute() {
            // Fetch every eFiles node that has not been OCRed yet and request OCR for it
            List<NodeEntity> nodesToOCR = efilesCannedQueryService.getEfilesNodesNotOCRed();
            for (NodeEntity node : nodesToOCR) {
                NodeRef nodeRef = node.getNodeRef();
                efilesOCRService.sendOCRNodeRequest(nodeRef);
            }
        }
  • 14. Alfresco: Expose an endpoint to receive the OCRed document
    • Because OCRing a document takes time, the OCR Service sends the OCRed PDF back at a later time, so we created an endpoint on the Alfresco side (@dgradecak/alfresco-mvc):
        @AlfrescoAuthentication(AuthenticationType.USER)
        @AlfrescoTransaction
        @RequestMapping(value = "/receiveOCR", method = RequestMethod.POST,
                        consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
        @ResponseBody
        public ResponseEntity<String> getOCRResponse(
                @RequestParam Long requestId,
                @RequestParam String requestorReference,
                @RequestParam Integer statusCode,
                @RequestParam(required = false) MultipartFile file)
  • 15. OCR Service deployed
    • Since 08/May/2017, the OCR Service has processed more than 170,000 files;
    • Average time to OCR a document: 24 seconds;
    • OCR success rate: 99.94%.
    @mauro1855/ocr-service
  • 16. Score
  • 17. The challenge of metadata quality
  • 18. The challenge of metadata quality
    • When users upload a document, they need to fill out a set of metadata for the document;
    • Administrative personnel report that the metadata for files is not always correct;
    • We want to devise a way to automatically confirm the metadata of a document based on its content;
    • Because we OCRed all PDFs, we now have the OCRed text to use in our validation.
  • 19. Metadata quality scoring system
    - Find metadata property in content;
    - Calculate a score for the metadata property;
    - Calculate the aggregated score for the document;
    - Score all documents;
    - Inform users of documents with low score;
    - Allow users to confirm document metadata.
  • 20. Scoring system: Find metadata in content
    • OCR software is not 100% accurate, and it is more likely than not that it misrecognized some letters or words in the PDF;
    • It is therefore not possible to simply use equals() to find the metadata;
    • Instead, use string distance algorithms to find the string in the content that is most similar to the metadata property;
    • Because of OCR misrecognition, remove all accents and non-alphanumeric characters first, and only then compare.
  • 21. Scoring system: Find metadata in content
        Normalizer.normalize(contentToScore, Normalizer.Form.NFD)
                  .replaceAll("[^\\p{Alnum}]", "");
    • Canonical decomposition: é (U+00e9) → e´ (U+0065 U+0301)
    • Code application example: 1123 T!éix$#èïr]a5 → 1123 T!e´ix$#e`i¨r]a5 → 1123Teixeira5
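The normalization step can be tried in isolation. The following is a small sketch that wraps the call from the slide in a hypothetical helper class, `ContentNormalizer` (the class and method names are assumptions, not part of eFiles). It relies on the fact that Java's `\p{Alnum}` matches only ASCII letters and digits by default, so the combining accents produced by NFD decomposition are removed along with punctuation and whitespace.

```java
import java.text.Normalizer;

// Hypothetical helper; only the normalize/replaceAll call comes from the slides.
class ContentNormalizer {

    // Decomposes accented characters (NFD), then drops everything that is not
    // an ASCII letter or digit, including the detached combining accents.
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Same input as the slide example, written with Unicode escapes
        // so the source file does not depend on its encoding.
        String raw = "1123 T!\u00e9ix$#\u00e8\u00efr]a5";
        System.out.println(normalize(raw)); // prints 1123Teixeira5
    }
}
```

Both the metadata value and the OCRed content would need to go through the same function before they are compared.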
  • 22. Scoring system: Find metadata in content
    • We split the content into every possible substring that has the same length as the metadata value we are looking for, and compute the Levenshtein distance for each one:
        StringUtils.getLevenshteinDistance(metadataProperty, partialContent)
      Example: Metadata property: “Teixeira”; Content: 1123Teixeira5
    • We record the minimum distance found and derive the score from it:
      $\text{score} = \frac{\text{length} - \text{distance}}{\text{length}} \times 100\%$
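The sliding-window search and the score formula can be sketched as follows. This is an illustration under stated assumptions, not the eFiles code: the class `MetadataScorer` is hypothetical, the Levenshtein distance is implemented inline instead of calling `StringUtils.getLevenshteinDistance` so the example has no dependencies, and both strings are assumed to have been normalized identically beforehand (including letter case, which the slides do not discuss).

```java
// Hypothetical scorer; assumes metadata and content are already normalized.
class MetadataScorer {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Slides a window of metadata.length() over the content, keeps the minimum
    // distance, and returns (length - distance) / length * 100.
    static double score(String metadata, String content) {
        int length = metadata.length();
        if (length == 0) {
            return 0.0; // assumption: an empty property cannot be confirmed
        }
        int minDistance = length; // worst case; also covers content shorter than the property
        for (int start = 0; start + length <= content.length(); start++) {
            String window = content.substring(start, start + length);
            minDistance = Math.min(minDistance, levenshtein(metadata, window));
            if (minDistance == 0) {
                break; // exact match, no need to look further
            }
        }
        return (length - minDistance) * 100.0 / length;
    }

    public static void main(String[] args) {
        System.out.println(score("Teixeira", "1123Teixeira5")); // exact match
        System.out.println(score("Teixeira", "1123Te1xeira5")); // one misread character
    }
}
```

With the slide's example the window "Teixeira" matches exactly, giving a score of 100; a single misrecognized character in an eight-character value gives 87.5.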
  • 23. Scoring system: Calculate the aggregated score for the document
    • An Alfresco node can be represented as a tree:
      - The root is the node itself;
      - The children of the node are aspects;
      - The children of the aspects are properties.
    Tree from the slide diagram:
        Node
          Person aspect: ID, First name, Last name
          Institute aspect: ID, Institute name
          Budget Code aspect: ID, Budget code name
    Diagram labels: Average, Weight, Max, Min; 50%, 25%, 25%
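The transcript keeps only the labels of the diagram, so the exact aggregation rules are not recoverable. The sketch below shows one plausible reading, stated as an assumption: each inner node combines the scores of its children with one of four strategies (average, weighted, max, min), and the 50% / 25% / 25% figures are the weights of the three aspects. The class `ScoreTree`, the strategy chosen for each aspect, and the leaf scores are all invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the node -> aspects -> properties tree.
class ScoreTree {
    enum Strategy { AVERAGE, WEIGHTED, MAX, MIN }

    static class ScoreNode {
        final String name;
        final double weight;      // only used when the parent is WEIGHTED
        final double leafScore;   // only used when the node has no children
        final Strategy strategy;  // how an inner node combines its children
        final List<ScoreNode> children;

        private ScoreNode(String name, double weight, double leafScore,
                          Strategy strategy, List<ScoreNode> children) {
            this.name = name;
            this.weight = weight;
            this.leafScore = leafScore;
            this.strategy = strategy;
            this.children = children;
        }

        static ScoreNode leaf(String name, double score) {
            return new ScoreNode(name, 1.0, score, Strategy.AVERAGE,
                    Collections.<ScoreNode>emptyList());
        }

        static ScoreNode group(String name, double weight, Strategy strategy,
                               ScoreNode... children) {
            return new ScoreNode(name, weight, 0.0, strategy, Arrays.asList(children));
        }

        // Recursively aggregates the scores of the subtree rooted at this node.
        double score() {
            if (children.isEmpty()) {
                return leafScore;
            }
            double sum = 0, weightedSum = 0, weights = 0;
            double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
            for (ScoreNode child : children) {
                double s = child.score();
                sum += s;
                weightedSum += child.weight * s;
                weights += child.weight;
                max = Math.max(max, s);
                min = Math.min(min, s);
            }
            switch (strategy) {
                case MAX: return max;
                case MIN: return min;
                case WEIGHTED: return weights == 0 ? 0 : weightedSum / weights;
                default: return sum / children.size();
            }
        }
    }

    // Invented leaf scores; aspect weights follow the 50% / 25% / 25% labels.
    static ScoreNode example() {
        ScoreNode person = ScoreNode.group("Person aspect", 0.50, Strategy.MAX,
                ScoreNode.leaf("ID", 100), ScoreNode.leaf("First name", 40),
                ScoreNode.leaf("Last name", 80));
        ScoreNode institute = ScoreNode.group("Institute aspect", 0.25, Strategy.AVERAGE,
                ScoreNode.leaf("ID", 60), ScoreNode.leaf("Institute name", 80));
        ScoreNode budget = ScoreNode.group("Budget Code aspect", 0.25, Strategy.MIN,
                ScoreNode.leaf("ID", 20), ScoreNode.leaf("Budget code name", 60));
        return ScoreNode.group("Node", 1.0, Strategy.WEIGHTED, person, institute, budget);
    }
}
```

In this made-up example the aspects score 100, 70 and 20, so the document score is 0.5 × 100 + 0.25 × 70 + 0.25 × 20 = 72.5.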
  • 24. Scoring system: Calculate the aggregated score for the document
        scored: {
          "score": 55,
          "scoringDate": Date,
          "scoringExecutionTime": 10,
          "scoringSkipped": false,
          "manuallyChecked": false
        }
  • 25. Scoring system: Score all documents
    • Score all existing and new documents in order to find incorrect metadata;
    • This can be achieved through a job that retrieves already OCRed documents;
    • Every hour, 2500 documents are retrieved and scored.
  • 26. Scoring system: Inform users of documents with low score
    • Our team decided that users should double-check documents with a score lower than 60%;
    • We used smart folders to show these low-scoring documents to our users.
        "query": "PATH:'<our path>' AND (ef:score:[0 TO 60])"
  • 27. Scoring system: Allow users to confirm document metadata
    • All documents that are scored have a new action available in Share: “Confirm metadata is correct”;
    • A “manual check” by the user overrides the score and confirms that the metadata is correct for the document.
        scored: {
          "score": 100,
          "scoringDate": Date,
          "scoringExecutionTime": 10,
          "scoringSkipped": false,
          "manuallyChecked": true
        }
  • 28. Scoring system deployed
    • After testing with thousands of randomly chosen documents, we got some indicators of the average score for documents in our Alfresco instance;
    • Initially we were only trying to find documents with incorrect metadata;
    • We soon realized that the scores obtained also allowed us to evaluate the quality of our brand new OCR Service.
  • 29. Scoring system deployed
    Scoring for contracts:
    Other documents with similar results:
    • CVs;
    • Bank details;
    • …
  • 31. Scoring system deployed
    Scoring for Visas:
    Other documents with similar results:
    • Passports;
    • ID Cards;
    • …
  • 33. Scoring system deployed
    OCRed text for the previous slide’s example:
        ...m...immlwÈ"'23:22:12”nponype/TypeCddlgodoPels/mœunmsmœ/omœräamnmWALMM.PCPRTTHATXXX[DilAaefunWSumm/NomPEREÎIRA TEIXEIRALUZ]Manuel's)préprln(s)/GhannannlsHPrénomrns)MAUR31}RAFAEL[031Ma.../mary...[04]Alum/Hummus:PDRTUGUESA1.66m,.[05]Da tedenasclrrærlbo/Dumalm/Daœdem[08]Nu'merude...mir-nd...—WOULD BE TOO EASY[GT]Sexo/Sum[08]LocaldeneWafa— uma...LISBOAïISBOA[09]Datadeemlssäo/DmMime/Damœdeuvmnœ[10]Aumndadelmw/AMXXXXXXXXXXSEF- SERVEREFRONÎEIRAS[11]Vélidoeté/Daœufwiry/Daœdwmnm[12]AssinatmedaMar/mam'm/Signm‘uredunmlanXXXXXXXXXXX
    → Score of 100% for word “Teixeira”
    Bad, but common, example for Visa text (OCRed):
        ami-md'un_I..-[ifi";"“E'"...“UL"für—J0-...‘1-u*;M...“"“"1n&li'"r-ninhf:ni—_«...._A‘.,1‘.p,V.,,‘i‘A.....‘È!_,":..“!.???KL':- _......__‘rigamAuK"_rmàyîwfäa,ngrs—suaŸ“wat—seanxäâ‘k—zææâäê.it?-<%“"|__-F‘ITGE‘T'n’Jfn- TEIXEIRA
  • 34. Next steps
    • Find a more suitable OCR software solution;
    • Improve the scoring algorithm to take common OCR misrecognitions into account;
    • Create a suggestion service that reads the OCRed content and suggests the right metadata to the user.