BIT Alpha - ICoC

Presentation of the OCR software BIT Alpha by BIT Entreprise at the ICoC Annual General Meeting 2015

Published in: Technology

  1. 1. Bureau Ingénieur Tomasi S.A.R.L. Adaptive OCR Solution
     ______________________________________________________________________________________
     B.I.T. Bureau Ingénieur Tomasi Sarl – 2649 route de Pujaudran - 31530 Mérenvielle - Tel: 05 62 13 95 32 - e-mail: bit.entreprise.at@gmail.com - N° SIRET 503 902 983 00017 R. C. TOULOUSE
     Summary
     I. General presentation (p. 2)
     II. Binarisation (p. 2)
     III. Segmentation (p. 3)
     IV. OCR Recognition (p. 4)
     V. Sequencer (p. 5)
     VI. Post-OCR correction with Spellchecking (p. 6)
     VII. Pictures Treatment/Export (p. 7)
     VIII. Export of content (p. 7)
     IX. Contact (p. 8)
  2. 2. I. General presentation
     B.I.T. has developed an adaptive OCR solution called BIT-Alpha. This semi-automatic OCR adapts itself to all types of text, independently of their language, typeface or age. Developed specifically for the processing of historical and heritage documents, BIT-Alpha supports scientific research and access to content.
     BIT-Alpha is a tool covering the whole workflow:
       • Binarisation
       • Segmentation
       • OCR recognition
       • Post-OCR correction with spellchecking
       • Picture processing/Export
       • Export of content
     II. Binarisation
     BIT-Alpha offers 3 binarisation modes:
       • Binarisation by threshold, ideal for newspapers. BIT-Alpha analyses the document region by region instead of analysing the whole document globally, so the binarisation is not the same at the bottom, at the top or in the corners of the page. In this way the binarisation adapts to the different contrasts within the document.
       • Binarisation with the “Niblack” algorithm. BIT-Alpha analyses the contrast variance around each letter. It can therefore tell a letter from a color spot close to that letter, and eliminates the background noise without eliminating parts of the letter.
  3. 3. BIT-Alpha performs the variance analysis over neighborhoods and thereby determines whether a pixel is part of a text area, a non-text area, an interline space or a picture.
       • Binarisation based on an algorithm developed by B.I.T. Thanks to this spectral-decomposition algorithm, BIT-Alpha is able to redraw/reconstruct damaged letters, as if it were choosing an optimal paint brush (fine or broad). It also preserves very fine character strokes which other algorithms may delete.
     These binarisation modes prepare the document as well as possible, in order to obtain the best OCR results achievable for historical/heritage documents.
     III. Segmentation
     BIT-Alpha segments titles, sub-titles, pictures, picture captions, chapters and articles, for example in newspapers.
     [Figure: Fraktur dated 1805 to 1944: segmentation of title, sub-titles and chapters]
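The Niblack mode described in section II can be illustrated with a minimal sketch. It is not BIT-Alpha's implementation: it only shows the textbook rule in which each pixel is compared with a local threshold T = m + k·s, where m and s are the mean and standard deviation of the grey levels in a window around the pixel. The function name `niblack_binarise`, the window size and the value of `k` are assumptions chosen for illustration, and the pure-Python loops favour readability over speed.

```python
from statistics import fmean, pstdev


def niblack_binarise(gray, window=15, k=-0.2):
    """Return a grid of booleans: True = background, False = ink.

    gray   : list of rows of grey levels (0 = black, 255 = white)
    window : side of the square neighbourhood (assumed value)
    k      : weight of the local standard deviation (assumed value)
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood of the pixel, clipped at the image borders.
            patch = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            # Local threshold: mean plus k times the standard deviation.
            threshold = fmean(patch) + k * pstdev(patch)
            row.append(gray[y][x] >= threshold)
        result.append(row)
    return result


if __name__ == "__main__":
    # Page with a bright left half and a darker right half,
    # each crossed by one vertical dark stroke.
    page = [[200] * 10 + [100] * 10 for _ in range(20)]
    for line in page:
        line[4] = 50
        line[15] = 20
    binary = niblack_binarise(page, window=5)
    print("".join("." if is_background else "#" for is_background in binary[10]))
```

Because the threshold follows the local contrast, the stroke in the darker half is still separated from its background, which a single global threshold chosen for the bright half would not achieve. Plain Niblack is also known to produce false ink pixels where the illumination changes abruptly, which is one reason practical systems refine it.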
  4. 4. During segmentation, BIT-Alpha detects each line, each word within a line and each character within a word individually. Note that BIT-Alpha can output the position of each character (for example into an ALTO file).
     IV. OCR Recognition
     Developed for the processing of historical/heritage documents, BIT-Alpha is an adaptive OCR capable of adapting itself to all types of text, independently of their language, typeface or age.
     Character learning can be done manually and automatically:
       • Manual training, with human action: the digital signatures of the characters are stored in memory. As the “image” of a character is much heavier than its digital signature, BIT-Alpha can build bigger databases than tools that save “images” of characters.
       • Automatic training, without human action:
           • BIT-Alpha can learn the characters automatically from the text to be processed. During a batch process, BIT-Alpha reads and recognizes the characters it already knows; those recognized with high reliability are then used to train the OCR engine. BIT-Alpha’s reliability rate thereby increases with each processed page.
           • A spellchecking database adapted to the type of documents to be processed (for example a Latin database) can be loaded into BIT-Alpha. If BIT-Alpha recognizes a word from the database, it automatically learns all the characters that make up this word. BIT-Alpha can handle databases of more than 500 000 words.
       • BIT-Alpha is able to identify the fonts that make up a text, even when the fonts are mixed: Gothic (before 1845), Fraktur (after 1845), Antiqua, Cursive, Greek, Hebrew...
  5. 5.   • BIT-Alpha is able to recognize and read embellished letters, miniatures and abbreviations, and can deal with unusual characters.
     V. Sequencer
     The Sequencer makes it possible to:
       • Reconstruct fragmented characters: a letter is sometimes fragmented into two or more parts. BIT-Alpha recognises the fragments of a letter and reassembles it.
         [Figure: Recognition of the right-hand side of a lower-case “n” (RKN)]
         [Figure: Recognition of the left-hand side of a lower-case “n” (LKN)]
  6. 6.     [Figure: Assembling of the two fragments by the sequencer and reconstruction of the “n”]
       • Expand abbreviations: in Roman writing, a “q” followed by “;” means “que”.
       • Correct wrong sequences of letters: where another OCR reads “nnn”, the sequencer corrects it to “mm”. BIT-Alpha takes into account the typical letter sequences of the language of the document being processed and is therefore able to correct wrong ones. For example, in Latin the wrong sequence “dcn” is changed into the typical one, “den”. Another example is the wrong sequence “qn”, which is changed into the typical Latin sequence “qu”.
     The Sequencer comes with more than 900 sequences preprogrammed in BIT-Alpha’s database. With each use, the Sequencer’s database can be extended; conversely, preprogrammed sequences that prove disturbing can be removed.
     VI. Post-OCR correction with Spellchecking
     BIT-Alpha’s post-OCR correction is based on the “Levenshtein” distance algorithm. BIT-Alpha analyses the edit distance between two words, the word in the text and the reference word from the database; different editing operations correspond to different OCR mistakes and may have different weights. Thanks to this technology, BIT-Alpha is able to reconstitute words or to separate them with blanks if needed. For example, German compound words (very common in German) may be checked by checking their components individually against known words from the database.
  7. 7. For Latin texts, by contrast (where compound words rarely occur), BIT-Alpha separates words that are stuck together by inserting blanks.
     BIT-Alpha allows the post-OCR correction to be switched off, and how aggressively it corrects the raw OCR results to be adjusted.
     VII. Pictures Treatment/Export
     BIT-Alpha has advanced technology for the processing of pictures (for example in newspapers). BIT-Alpha is able to detect pictures, to remove the dithering from dithered images by interpolation and to deliver a high-quality true-color digital image.
     [Figure: Dithered image (binary)]
     [Figure: Interpolated image without dithering (greyscale)]
     VIII. Export of content
     The results can be rendered in different formats, for example:
       • TXT
       • PDF with highlighting (text as a transparent overlay over the original image, allowing search, selection and copying). BIT-Alpha creates a lightweight PDF by reducing the resolution (dpi) of the document, in order to facilitate exchange of the document or online publication.
       • ALTO (positions in pixels or in tenths of a millimetre)
       • TEI
  8. 8.   • HTML: the HTML export from BIT-Alpha keeps mathematical formulas, pictures, etc. and positions them at the same place as in the original document.
     IX. Contact
     Head of sales department: Anne Tomasi, +33 786 844 845, bit.entreprise.at@gmail.com
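As an illustration of sections V and VI, the following minimal sketch combines a sequence-rewriting step with a weighted Levenshtein lookup. It is not BIT-Alpha's code. The four rules are the examples quoted in section V; everything else is an assumption made for the sketch: the function names, the confusion weights in `CONFUSION_COST`, the tiny lexicon and the `max_cost` cut-off, which stands in for the setting that controls how aggressively raw OCR results are corrected.

```python
# Rules quoted in section V: wrong sequence -> typical sequence.
SEQUENCE_RULES = {"nnn": "mm", "dcn": "den", "qn": "qu", "q;": "que"}

# Assumed weights: substitutions between look-alike glyphs cost less than 1.
CONFUSION_COST = {("c", "e"): 0.3, ("e", "c"): 0.3,
                  ("n", "u"): 0.3, ("u", "n"): 0.3}


def apply_sequencer(text, rules=SEQUENCE_RULES):
    """Rewrite every known wrong sequence, longest patterns first."""
    for wrong in sorted(rules, key=len, reverse=True):
        text = text.replace(wrong, rules[wrong])
    return text


def weighted_distance(ocr_word, ref_word, confusion=CONFUSION_COST):
    """Levenshtein distance with per-pair substitution weights."""
    previous = [float(j) for j in range(len(ref_word) + 1)]
    for i, a in enumerate(ocr_word, start=1):
        current = [float(i)]
        for j, b in enumerate(ref_word, start=1):
            substitution = 0.0 if a == b else confusion.get((a, b), 1.0)
            current.append(min(
                previous[j] + 1.0,              # delete a
                current[j - 1] + 1.0,           # insert b
                previous[j - 1] + substitution  # replace a by b
            ))
        previous = current
    return previous[-1]


def correct_word(ocr_word, lexicon, max_cost=1.0):
    """Return the closest lexicon word, or the input if none is close enough."""
    best = min(lexicon, key=lambda ref: weighted_distance(ocr_word, ref))
    if weighted_distance(ocr_word, best) <= max_cost:
        return best
    return ocr_word


if __name__ == "__main__":
    lexicon = ["deus", "dies", "domus", "quod", "summa"]  # hypothetical word list
    for raw in ["qnod", "sunnna", "dcus"]:
        print(raw, "->", correct_word(apply_sequencer(raw), lexicon))
```

Setting `max_cost` to 0 accepts only exact matches, which effectively switches the correction off, while larger values correct more aggressively. A blind string replacement such as “qn” to “qu” can also damage valid words in other languages, which is consistent with the source's remark that disturbing sequences can be removed from the Sequencer's database.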
