SlideShare a Scribd company logo
1 of 3
Extracting Person Names from Diverse and Noisy OCR Text<br />Thomas Packer, Joshua Lutes, Aaron Stewart, David Embley, Eric Ringger, Kevin SeppiLee JensenDepartment of Computer ScienceAncestry.com, Inc.Brigham Young UniversityProvo, Utah, USAProvo, Utah, USA<br />Information extraction can be of great value to genealogy and other historical research.  The more automatically and accurately we can categorize the words appearing in the scanned images of a document, the more value a researcher can get out of a large collection of documents.  For example, the greater the number of person names that can be correctly identified in a collection (the higher the recall), the more likely those names will be discovered by someone searching for them.  Likewise, the fewer the number of words that are incorrectly categorized as being part of a person's name (the higher the precision), the less time it will take a researcher to sift through a collection to find information about a person of interest.<br />In the research presented here, we were concerned with extracting person names from scanned document images, including books and newspapers.  In doing this, we faced two challenges.  The first was extracting information from a collection of documents with great genre and format diversity and with few specialized resources from which to develop the extractors (e.g. labeled training data for machine learning approaches).  Pages of our test set were sampled from 10 documents from the following genres: family and local history books, newspapers, city directories, Navy cruise books and church congregation year books.<br />The second challenge was extracting information from the noisy text produced by the processes of scanning and OCR transcription.  In the dataset we used, these processes introduce the following kinds of errors: character substitutions; incorrectly split or joined words; and word-order, text-line boundaries and other document structure information being lost by post-processing of the OCR output.  For example, in the section of the page shown in Figure 1, the two-column list of names runs together with text on the opposite page to produce a non-delimited stream of words.  Splotches on the page also cause character recognition errors.<br />Figure 1.  The upper right-hand corner of one of the “noisier” pages in our collection.  The OCR transcription of the first three lines read as follows: “ot Bo quo qu 4 o ot 4 uo q o qu uot q i JKIF q o t q o ot quot q u i I q John Adams John Keit h to have died with his armour on The his death James MacIceen James Wright though infirm and ill he Sunday preached preceding to his ople from the text - R MacDonald Huh Fraser”<br />We applied the following six techniques for extracting person names.  Their extraction qualities are compared in the graph in Figure 2.  The lighter green bar (left side of each pair) is the exact-match F-measure where an extracted name is considered to be correct only if every word within that name is classified as part of the name and no additional words are joined with the name.  the darker blue bar (right side of each pair) is the fragment-match F-measure where credit for correct and incorrect classification is given on a word-by-word basis.  In both cases, F-measure is the harmonic mean of precision and recall.  These metrics are computed against the hand-annotated full names in our test set.<br />,[object Object]
Regular Expression (Regex).  Similar to Dictionary, but only full names that matched one of several hand-written regular expression patterns were output, e.g. “Title GivenName Initial Surname”.
Context-free Grammar (CFG).  Similar to Regex but with a more complicated set of hand-written rules in a more expressive weighted context-free grammar.

More Related Content

Similar to Extracting Names from Diverse OCR Text

Automatic multiple choice question generation system for
Automatic multiple choice question generation system forAutomatic multiple choice question generation system for
Automatic multiple choice question generation system forAlexander Decker
 
Document Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language StructureDocument Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language Structurekevig
 
Document Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language StructureDocument Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language Structurekevig
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingTheodore J. LaGrow
 
Analysis And Indexing General Terms Experimentation
Analysis And Indexing General Terms ExperimentationAnalysis And Indexing General Terms Experimentation
Analysis And Indexing General Terms ExperimentationAshley Hernandez
 
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMRESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMijnlc
 
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMRESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMkevig
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations onijistjournal
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...butest
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio EditionJordan Chapman
 
ryanandbernard- Handbook Sage.pdf
ryanandbernard- Handbook Sage.pdfryanandbernard- Handbook Sage.pdf
ryanandbernard- Handbook Sage.pdfabhishekkumar4959
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Nischal Lal Shrestha
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...greatsalvation813
 
A Simple Approach To Classify Fictional And Non-Fictional Genres
A Simple Approach To Classify Fictional And Non-Fictional GenresA Simple Approach To Classify Fictional And Non-Fictional Genres
A Simple Approach To Classify Fictional And Non-Fictional GenresAndrea Porter
 
Domain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachDomain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachWaqas Tariq
 
On Automated Landmark Identification in Written Route Descriptions by Visuall...
On Automated Landmark Identification in Written Route Descriptions by Visuall...On Automated Landmark Identification in Written Route Descriptions by Visuall...
On Automated Landmark Identification in Written Route Descriptions by Visuall...Vladimir Kulyukin
 
Improvement of soundex algorithm for indian language based on phonetic matching
Improvement of soundex algorithm for indian language based on phonetic matchingImprovement of soundex algorithm for indian language based on phonetic matching
Improvement of soundex algorithm for indian language based on phonetic matchingIJCSEA Journal
 

Similar to Extracting Names from Diverse OCR Text (20)

Automatic multiple choice question generation system for
Automatic multiple choice question generation system forAutomatic multiple choice question generation system for
Automatic multiple choice question generation system for
 
Document Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language StructureDocument Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language Structure
 
Document Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language StructureDocument Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language Structure
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
 
Analysis And Indexing General Terms Experimentation
Analysis And Indexing General Terms ExperimentationAnalysis And Indexing General Terms Experimentation
Analysis And Indexing General Terms Experimentation
 
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMRESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
 
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMRESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition
 
ryanandbernard- Handbook Sage.pdf
ryanandbernard- Handbook Sage.pdfryanandbernard- Handbook Sage.pdf
ryanandbernard- Handbook Sage.pdf
 
Token
TokenToken
Token
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
 
A Simple Approach To Classify Fictional And Non-Fictional Genres
A Simple Approach To Classify Fictional And Non-Fictional GenresA Simple Approach To Classify Fictional And Non-Fictional Genres
A Simple Approach To Classify Fictional And Non-Fictional Genres
 
Domain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachDomain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised Approach
 
On Automated Landmark Identification in Written Route Descriptions by Visuall...
On Automated Landmark Identification in Written Route Descriptions by Visuall...On Automated Landmark Identification in Written Route Descriptions by Visuall...
On Automated Landmark Identification in Written Route Descriptions by Visuall...
 
Improvement of soundex algorithm for indian language based on phonetic matching
Improvement of soundex algorithm for indian language based on phonetic matchingImprovement of soundex algorithm for indian language based on phonetic matching
Improvement of soundex algorithm for indian language based on phonetic matching
 
Ceis 3
Ceis 3Ceis 3
Ceis 3
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Extracting Names from Diverse OCR Text

  • 1.
  • 2. Regular Expression (Regex). Similar to Dictionary, but only full names that matched one of several hand-written regular expression patterns were output, e.g. “Title GivenName Initial Surname”.
  • 3. Context-free Grammar (CFG). Similar to Regex but with a more complicated set of hand-written rules in a more expressive weighted context-free grammar.
  • 4. Maximum Entropy Markov Model (MEMM). This statistical sequence model was trained on hand-labeled newswire text from a CoNLL dataset. The trained model was then applied to the noisy OCR text with some domain adaptation attempted.
  • 5. Conditional Random Field (CRF). Another well known statistical sequence model trained and executed just like the MEMM.
  • 6. Ensemble. The best ensemble among several variations was one that allowed the Dictionary, Regex and MEMM extractors to vote. It output a full name when two or more agreed.Figure 2. Exact-match (left) and fragment-match (right) F-measure of each method evaluated on all 10 documents of the test set. There were one to two pages of hand annotated text taken from each document and placed in this test set.<br />In conclusion, from this project we learned that relatively simple methods (1 and 2) can outperform more sophisticated methods in a limited-resource setting. We were also encouraged to see an improvement over the individual methods by using a simple voting ensemble.<br />There is still room for improvement. We will take the results of, and experience gained from, this initial project as a starting point on which to build better, more adaptive extractors using page format recognition, OCR error correction and other advanced techniques.<br />