Content OCRing and Scoring

Mauro Teixeira
mauro.teixeira@cern.ch | @mauro1855

About CERN
• At the European Organization for Nuclear Research,
physicists and engineers are probing the fundamental
structure of the universe;
• Accelerators boost beams of particles to high energies
before the beams are made to collide with each other
or with stationary targets. Detectors observe and record
the results of these collisions;
• The WWW was born at CERN.

Our usage of Alfresco
• Administrative staff connect to eFiles (our instance of Alfresco) to upload documents or to
search for them;
• Uploads are restricted to PDFs and, for each uploaded document, users must enter a
predefined set of metadata;
• To search for documents, we use Alfresco's search capabilities with SOLR 4, configured to search
both our custom properties (metadata) and node content.

The challenge of the OCR*
Challenge:
• In order to search for words in the PDFs,
they need to have embedded text, so
scanned documents needed to be OCRed;
• Users often had to manually run an OCR
process on the PDFs before uploading
them;
• Still, around 30% of uploaded PDFs didn’t
have embedded text content.
Solution:
• Ideally, eFiles would OCR documents
automatically as soon as they were
uploaded;
• We would offer OCR as a Service, so that
users and other application developers
could also benefit from our efforts.
*OCR = Optical character recognition

Task requirements
• OCR Service side:
  - independent asynchronous service to OCR PDF documents;
  - protection against own failure;
  - protection against failure of the client.
• Alfresco side:
  - uploaded documents are sent to the OCR Service;
  - protection against failure of the OCR Service;
  - expose an endpoint to receive the OCRed document.

OCR Service
Independent asynchronous service
• Spring Boot application with one endpoint exposed;
• Arriving OCR requests are queued;
• OCR, using pypdfocr*;
• Send the PDF file back to Alfresco.
*@virantha/pypdfocr

@RequestMapping(value = "/request", method = RequestMethod.POST)
@ResponseBody
public ResponseEntity<String> registerNewOCRRequest(
        @RequestParam(required = false) String requestorReference,
        @RequestParam Short priority,
        @RequestParam String callbackEndpoint,
        @RequestParam HttpMethod callbackMethod,
        @RequestParam MultipartFile file)

OCR Service
Response:
202 ACCEPTED
{
    "success": true,
    "message": "The request was accepted",
    "requestorReference": "SameRequestorReferenceAsBefore",
    "requestID": 12000
}

OCR Service
Protection against own failure
• Every step of the way is stored in a database table;
• When the application starts, it will fetch all incomplete requests:
@PostConstruct
public void restoreRequestsFromLastSession() {
    // Re-queue every request that was not finished before the last shutdown
    List<OCRRequest> requests = ocrRequestRepository.getAllUnprocessedRequests();
    for (OCRRequest req : requests) {
        priorityExecutor.submit(ocrRequestWorker.getRunnable(req));
    }
}
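
The slides do not show how priorityExecutor is built. As a minimal sketch, assuming a single worker thread and assuming that a lower priority value runs first, a priority-ordered queue can be obtained with plain JDK classes. The names PriorityQueueSketch, PrioritizedTask and newPriorityExecutor are hypothetical and not taken from the OCR Service:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PriorityQueueSketch {

    /** A queued job. Assumption: a lower priority value runs first. */
    static final class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
        final short priority;
        final Runnable body;

        PrioritizedTask(short priority, Runnable body) {
            this.priority = priority;
            this.body = body;
        }

        @Override
        public void run() {
            body.run();
        }

        @Override
        public int compareTo(PrioritizedTask other) {
            return Short.compare(priority, other.priority);
        }
    }

    /** Single worker thread whose backlog is ordered by task priority. */
    static ThreadPoolExecutor newPriorityExecutor() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<>());
    }

    /** Queues three tasks behind a blocked worker and returns the order in which they ran. */
    static List<String> demo() {
        ThreadPoolExecutor executor = newPriorityExecutor();
        List<String> order = new CopyOnWriteArrayList<>();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy the only worker so that the next tasks pile up in the queue.
            executor.execute(new PrioritizedTask((short) 0, () -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            started.await();

            // execute() is used on purpose: submit() would wrap the task in a
            // FutureTask, which is not Comparable and would break the queue.
            executor.execute(new PrioritizedTask((short) 5, () -> order.add("low")));
            executor.execute(new PrioritizedTask((short) 1, () -> order.add("high")));
            executor.execute(new PrioritizedTask((short) 3, () -> order.add("medium")));

            release.countDown();
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [high, medium, low]
    }
}
```

With a queue like this, requests restored at startup and newly arriving requests share one backlog, so an urgent request is not stuck behind a large batch of restored ones.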

OCR Service
Protection against failure of the client
• If the client cannot be contacted, the request is marked as “Failed communication”. The service
then retries sending the file to the client every 30 seconds:
@Scheduled(fixedDelay = 30000L)
public void processReplyQueue() {
    List<OCRRequest> requests = getAllFailedCommunicationRequests();
    // tries to reply to the client of each request
    for (OCRRequest req : requests) {
        replyToRequest(req);
    }
}

Alfresco
Uploaded documents are sent to the OCR Service
• Every time a document is uploaded and its node type is changed to our custom model type, we
request OCR of the document using a behavior;
• Our custom metadata model includes an aspect for OCR status.
Document sent to OCR Service:
OCR: {
    "ocrRequestId": 12000,
    "ocrRequestStartDate": Date,
    "ocrRequestFinishDate": null,
    "ocrRequestSuccessful": false
}
OCR is complete:
OCR: {
    "ocrRequestId": 12000,
    "ocrRequestStartDate": Date,
    "ocrRequestFinishDate": Date + Δt,
    "ocrRequestSuccessful": true
}

Alfresco
Protection against failure of the OCR Service
• The OCR Service might fail, in which case we need to send all the documents that are still not
OCRed to the service once it is available. We use a job that runs every hour;
• Additionally, this job also allows documents uploaded prior to the existence of the service to be
sent for OCR.
@Override
public void execute() {
    // Send every node that has not been OCRed yet to the OCR Service
    List<NodeEntity> nodesToOCR = efilesCannedQueryService.getEfilesNodesNotOCRed();
    for (NodeEntity node : nodesToOCR) {
        NodeRef nodeRef = node.getNodeRef();
        efilesOCRService.sendOCRNodeRequest(nodeRef);
    }
}

Alfresco
Expose an endpoint to receive the OCRed document
• Because OCRing a document takes time, the OCR Service sends the OCRed PDF back at a later
time, so we created an endpoint on the Alfresco side:
@AlfrescoAuthentication(AuthenticationType.USER)
@AlfrescoTransaction
@RequestMapping(value = "/receiveOCR",
        method = RequestMethod.POST,
        consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
@ResponseBody
public ResponseEntity<String> getOCRResponse(
        @RequestParam Long requestId,
        @RequestParam String requestorReference,
        @RequestParam Integer statusCode,
        @RequestParam(required = false) MultipartFile file)
@dgradecak/alfresco-mvc

OCR Service deployed
• Since 8 May 2017, the OCR Service has processed more than 170,000 files;
• Average time to OCR a document: 24 seconds;
• OCR success rate: 99.94%.
@mauro1855/ocr-service

The challenge of metadata quality
• When users upload a document, they need to fill out a set of metadata for the document;
• Administrative personnel report that the metadata for files is not always correct;
• We want to devise a way to automatically confirm the metadata of a document based on its
content;
• Because we OCRed all PDFs, we now have the OCRed text to use in our validation.

Metadata quality scoring system
• Find metadata property in content;
• Calculate a score for the metadata property;
• Calculate the aggregated score for the document;
• Score all documents;
• Inform users of documents with low score;
• Allow users to confirm document metadata.

Scoring system
Find metadata in content
• OCR software is not 100% accurate, and it is more likely than not that it misrecognized some
letters or words in the PDF;
• It’s not possible to simply do an equals() to find the metadata;
• Instead, use string distance algorithms to find the string in the content that is most similar to
the metadata property;
• Because of OCR misrecognition, remove all accents and non-alphanumeric characters first, and
only then compare.

Scoring system
Normalizer.normalize(contentToScore, Normalizer.Form.NFD)
    .replaceAll("[^\\p{Alnum}]", "");
• Canonical decomposition: é (U+00E9) → e´ (U+0065 U+0301)
• Code application example:
1123 T!éix$#èïr]a5 (input)
1123 T!e´ix$#e`i¨r]a5 (after canonical decomposition)
1123Teixeira5 (after removing non-alphanumeric characters)
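
The normalization step above can be tried in isolation. This is a minimal sketch using only the JDK; the class and method names (ContentNormalizer, normalize) are hypothetical, and the input is written with Unicode escapes so that the result does not depend on the source file encoding:

```java
import java.text.Normalizer;

final class ContentNormalizer {

    /** Decomposes accented letters (NFD), then drops everything that is not an ASCII letter or digit. */
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{Alnum} is ASCII-only by default in java.util.regex, so the
        // combining accents produced by NFD are removed together with
        // spaces and punctuation.
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Same input as the slide: 1123 T!éix$#èïr]a5
        String input = "1123 T!\u00e9ix$#\u00e8\u00efr]a5";
        System.out.println(normalize(input)); // prints 1123Teixeira5
    }
}
```

The same normalization has to be applied to both the metadata value and the content, otherwise an accented metadata value would never match its stripped counterpart.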

Scoring system
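• We divide the content into all possible substrings with the same length as the metadata we are
looking for, and apply the Levenshtein distance algorithm to each one.
StringUtils.getLevenshteinDistance(metadataProperty, partialContent)
Example:
Metadata property: “Teixeira”
Content: 1123Teixeira5
• We record the minimum distance found and convert it into a score:
$\text{score} = \frac{\text{length} - \text{distance}}{\text{length}} \times 100\%$
where length is the length of the metadata property and distance is the minimum distance found.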
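
The sliding-window search and the score formula can be sketched as follows. To keep the example self-contained, it uses its own textbook Levenshtein implementation instead of StringUtils.getLevenshteinDistance; the class MetadataMatcher and its methods are hypothetical names, and returning 0 when the content is shorter than the metadata is an assumption, not something the slides specify:

```java
final class MetadataMatcher {

    /** Textbook dynamic-programming Levenshtein distance, keeping two rows of the table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Score in percent: (length - minimum distance) / length * 100. */
    static double score(String metadata, String content) {
        int length = metadata.length();
        if (length == 0 || content.length() < length) {
            return 0.0; // assumption: nothing to compare means no match
        }
        int best = length; // two strings of equal length differ by at most `length` edits
        for (int start = 0; start + length <= content.length(); start++) {
            String window = content.substring(start, start + length);
            best = Math.min(best, levenshtein(metadata, window));
        }
        return (length - best) * 100.0 / length;
    }

    public static void main(String[] args) {
        System.out.println(score("Teixeira", "1123Teixeira5")); // 100.0
        System.out.println(score("Teixeira", "1123Teixcira5")); // 87.5
    }
}
```

In the second call a single misrecognized letter gives a minimum distance of 1 over a length of 8, so the score is 7/8 = 87.5%. The comparison is case-sensitive as written; whether the real system folds case before comparing is not stated in the slides.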

Scoring system
Calculate the aggregated score for the document
• An Alfresco node can be represented as a tree:
  - The root is the node itself;
  - The children of the node are aspects;
  - The children of the aspects are properties.
Node
├── Person aspect: ID, First name, Last name
├── Institute aspect: ID, Institute name
└── Budget Code aspect: ID, Budget code name
Aggregation labels shown in the diagram: Average, Weight, Max, Min; 50%, 25%, 25%.
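
The slide diagram does not survive extraction well enough to tell exactly which aggregation applies at which level. As one possible reading only, the sketch below assumes that each aspect score is the average of its property scores and that the node score is a weighted sum of the aspect scores with weights 50%, 25% and 25%. The class names and all property scores are invented for illustration:

```java
import java.util.List;

final class NodeScoreSketch {

    /** One aspect of the node: a weight plus the scores (0-100) of its properties. */
    static final class AspectScore {
        final double weight;
        final List<Double> propertyScores;

        AspectScore(double weight, List<Double> propertyScores) {
            this.weight = weight;
            this.propertyScores = propertyScores;
        }

        /** Assumption: an aspect's score is the plain average of its property scores. */
        double average() {
            return propertyScores.stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0);
        }
    }

    /** Assumption: the node score is the weighted sum of the aspect scores. */
    static double nodeScore(List<AspectScore> aspects) {
        double total = 0.0;
        for (AspectScore aspect : aspects) {
            total += aspect.weight * aspect.average();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical property scores for the three aspects of the example tree.
        List<AspectScore> aspects = List.of(
                new AspectScore(0.50, List.of(100.0, 100.0, 100.0)), // Person
                new AspectScore(0.25, List.of(50.0, 100.0)),         // Institute
                new AspectScore(0.25, List.of(0.0, 40.0)));          // Budget Code
        System.out.println(nodeScore(aspects)); // 73.75
    }
}
```

Swapping average() for a maximum or minimum over the property scores would cover the other labels that appear in the diagram.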

Scoring system
Calculate the aggregated score for the document
scored: {
    "score": 55,
    "scoringDate": Date,
    "scoringExecutionTime": 10,
    "scoringSkipped": false,
    "manuallyChecked": false
}

Scoring system
Score all documents
• Score all existing and new documents in order to find incorrect metadata;
• Can be achieved through a job that retrieves already OCRed documents;
• Every hour, 2,500 documents are retrieved and their scores calculated.

Scoring system
Inform users of documents with low score
• Our team decided that users should double-check documents with a score lower than 60%;
• We used smart folders to show these low-scoring documents to our users.
"query": "PATH:'<our path>' AND (ef:score:[0 TO 60])"

Scoring system
Allow users to confirm document metadata
• All documents that are scored have a new action available in
Share: “Confirm metadata is correct”;
• A “manual check” from the user overrides the score and
confirms that the document metadata is correct for the
document.
scored: {
    "score": 100,
    "scoringDate": Date,
    "scoringExecutionTime": 10,
    "scoringSkipped": false,
    "manuallyChecked": true
}

Scoring system deployed
• After testing with thousands of randomly
chosen documents, we got some indicators
of the average score for documents in our
Alfresco instance;
• Initially we were only trying to find
documents with incorrect metadata;
• We soon realized that the scores obtained
also allowed us to evaluate the quality of our
brand-new OCR Service.

Scoring for contracts:
Other documents with similar results:
• CVs;
• Bank details;
• …

Scoring for Visas:
Other documents with similar results
• Passports;
• ID Cards;
• …

OCRed text for the previous slide’s example:
...m...immlwÈ"'23:22:12”nponype/TypeCddlgodoPels/mœunmsmœ/omœräamnmWALMM.PCPRTTHATXXX[DilAaefunWSumm/NomPEREÎIRA
TEIXEIRALUZ]Manuel's)préprln(s)/GhannannlsHPrénomrns)MAUR31}RAFAEL[031Ma.../mary...[04]Alum/Hummus:PDRTUGUESA1.66m,.[05]Da
tedenasclrrærlbo/Dumalm/Daœdem[08]Nu'merude...mir-nd...—WOULD BE TOO EASY[GT]Sexo/Sum[08]LocaldeneWafa—
uma...LISBOAïISBOA[09]Datadeemlssäo/DmMime/Damœdeuvmnœ[10]Aumndadelmw/AMXXXXXXXXXXSEF-
SERVEREFRONÎEIRAS[11]Vélidoeté/Daœufwiry/Daœdwmnm[12]AssinatmedaMar/mam'm/Signm‘uredunmlanXXXXXXXXXXX
→ Score of 100% for the word “Teixeira”
Bad, but common, example for Visa text (OCRed):
ami-md'un_I..-[ifi";"“E'"...“UL"für—J0-...‘1-u*;M...“"“"1n&li'"r-ninhf:ni—_«...._A‘.,1‘.p,V.,,‘i‘A.....‘È!_,":..“!.???KL':-
_......__‘rigamAuK"_rmàyîwfäa,ngrs—suaŸ“wat—seanxäâ‘k—zææâäê.it?-<%“"|__-F‘ITGE‘T'n’Jfn-
TEIXEIRA

Next steps
• Find a more suitable OCR software solution;
• Improve the scoring algorithm to take into account common OCR misrecognitions;
• Create a suggestion service that reads the OCRed content and suggests, to the user, the right
metadata.
