DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation

  1. Digital Humanities 101 - 2013/2014 - Course 7
     Digital Humanities Laboratory
     Andrea Mazzei and Frédéric Kaplan
     andrea.mazzei,frederic.kaplan@epfl.ch
  2. A Job offer
     • Running an OCR transcription of 320 pages
     • About 60 hours of work
     • 25 CHF / hour
     Digital Humanities 101 - 2013/2014 - Course 7 | 2013
  3. Results of the peer grading process
  4. Results of the peer grading process
  5. Results of the peer grading process
  6. Results of the peer grading process
  7. Results of the peer grading process
  8. New projects
  9. Venetian opera staging and machinery
     • A project that seeks ways to better understand and visualize opera staging, based on evidence found in historical sources (treatises, music prints, etc.)
     • Rosand, E. 1990. Opera in Seventeenth-Century Venice: The Creation of a Genre. Berkeley: University of California Press.
     • Bjurström, P. 1962. Giacomo Torelli and Baroque Stage Design. Stockholm: Almqvist and Wiksell.
     • Leclerc, H. 1987. Venise et l'avènement de l'opéra public à l'âge baroque. Paris: A. Colin.
     • Larson, O. K. 1980. Giacomo Torelli, Sir Philip Skippon, and Stage Machinery for the Venetian Opera, Theatre Journal, Vol. 32, No. 4, pp. 448-457. www.jstor.org/stable/3207407
  10. Venetian storytelling in the Middle Ages
      • Marin Sanudo was a historical writer. In contrast to other writers of the epoch, he kept a diary noting all the events that happened in Venice. Of course, it is not the only diary written in Venice. Imagine how to use this personal information.
  11. Looking at music printing typefaces
      • A project that looks at the different music typefaces used in Venetian prints. Typical questions are: the size of the typeface, when they were used, for what repertoire, what printers used them, etc.
      • Agee, R. 1998. The Gardano Music Printing Firms, 1569-1611. Rochester, University of Rochester Press.
      • Bernstein, J. 1998. Music Printing in Renaissance Venice. The Scotto Press (1539-1572). Oxford, Oxford University Press.
      • Bernstein, J. 2001. Print Culture and Music in Sixteenth-Century Venice. Oxford, Oxford University Press.
  12. Music at San Marco
      • A project that can look at how the cappella di San Marco evolved over time: how many musicians, where they played in the Basilica, what they played, etc.
      • Selfridge-Field, E. 1994. Venetian instrumental music from Gabrieli to Vivaldi. New York: Dover.
      • Moretti, L. 2004. Jacopo Sansovino and Adrian Willaert at St Mark's, Early Music History, Vol. 23, pp. 153-184.
  13. Venetian music prints in libraries today
      • A project that looks at the production of music prints in Venice and where they are held today in libraries and archives around the world
      • The Répertoire International des Sources Musicales, Series A/I on music prints. http://www.rism.info [will be made available digitally for the project]
  14. Semester 1: Content of each course
      • (1) 19.09 Introduction to the course / Live Tweeting and Collective note taking
      • (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment
      • (3) 2.10 Introduction to the Venice Time Machine project / Zotero
      • 9.10 No course
      • (4) 16.10 Digitization techniques / Deadline first assignment
      • (5) 23.10 Datafication / Presentation of projects
      • (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first assignment
  15. Semester 1: Content of each course
      • (7) 6.11 Pattern recognition / OCR / Semantic disambiguation
      • (8) 13.11 Historical Geographical Information Systems, Procedural modelling / City Engine / Deadline Project selection
      • (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap
      • (10) 27.11 Cultural heritage interfaces and visualisation / Museographic experiences
      • 4.12 Group work on the projects
      • 11.12 Oral exam / Presentation of projects / Deadline Project blog
      • 18.12 Oral exam / Presentation of projects
  16. Today's course
      • Printed Text Recognition
      • Hand Writing Recognition
      • Ornament Recognition
      • Text Mining and semantic disambiguation: Extracting named entities (people, places, etc.) in a text using Wikipedia
  17. Part I: Printed Text Recognition
  18. OCR: Optical Character Recognition
      A system that recognizes all the printed characters on a physical medium simply by scanning it.
  19. Mori et al. (1992). Historical review of OCR R&D
      • 1940: The first version of OCR
      • 1950: The first OCR machines appear
      • 1960 - 1965: First generation OCR: NOF, Farrington 360, IBM 1418. They all used a special font.
      • 1965 - 1975: Second generation OCR: IBM 1287, NEC, Toshiba. They could also recognize constrained hand-printed alpha-numerals.
      • 1975 - 1985: Third generation OCR: IBM 1975. Poor print quality or handwritten characters. 275 fonts. Handwriting recognition.
      • 1986 - Today: OCR to the people
      Eikvil, L. (1993). Optical Character Recognition
  20. OCR capabilities
      The recognition performance depends on the type and number of fonts recognized.
      • Fixed font: the system can recognize only one font
      • Multi font: the system can recognize multiple fonts
      • Omni font: the system can recognize most nonstylized fonts without having to maintain huge databases of specific font information
  21. Omni-font OCR Overview Of Processing
  22. Preprocessing: Text Lines Straightening
      Zhang, Z., & Tan, C. L. (2002, June). Straightening warped text lines using polynomial regression. In Image Processing. 2002. Proceedings. 2002 International Conference on (Vol. 3, pp. 977-980). IEEE.
  23. Preprocessing: Character Detection
      • Image binarization using local adaptive thresholding
      • Character detection using region growing-based methods. PROBLEM!
      Eikvil, L. (1993). Optical Character Recognition
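As an illustration of local adaptive thresholding, here is a minimal Python sketch. It is not taken from the slides: the mean-of-window rule, the `window` and `offset` parameters, and the function name `adaptive_binarize` are assumptions chosen for clarity, and real OCR systems typically use more refined variants.

```python
def adaptive_binarize(image, window=3, offset=0):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel is marked 1 (ink) when it is darker than the mean of its local
    window minus `offset`, and 0 (background) otherwise. Both parameters
    are illustrative assumptions, not values from the course.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood around (x, y), clipped at the image borders.
            neighbours = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


# A single dark pixel on a light background.
page = [[200] * 5, [200, 200, 40, 200, 200], [200] * 5]
binary = adaptive_binarize(page, window=3, offset=10)
```

Because the threshold is computed per neighbourhood, a page with uneven illumination does not need one global cut-off value.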
  24. Segmentation Problems: Touching and fragmented characters
      • Joints will occur if the document is a dark photocopy or if it is scanned at a low threshold.
      • Joints are common if the fonts are serifed.
      • The characters may be split if the document stems from a light photocopy or is scanned at a high threshold.
  25. Segmentation Problems: Distinguishing noise from text
      Dots and accents may be mistaken for noise, and vice versa.
  26. Segmentation Problems: Mistaking graphics for text
      This leads to non-text being sent to recognition, or to text not being sent.
  27. Feature Extraction
      From each character several features can be extracted:
      • Rasterized pixels
      • Geometric moment invariants
      • Morphological features
  28. Feature Extraction: Zoning
      The character image is divided into M×N zones, and the average gray level of each zone is used as a feature.
      Due Trier, O., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for character recognition-a survey. Pattern recognition, 29(4), 641-662
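The zoning idea can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation from the cited survey; the function name `zoning_features` and the integer zone boundaries are assumptions, and the image must have at least M rows and N columns.

```python
def zoning_features(image, m, n):
    """Return the average gray level of each of the m x n zones.

    `image` is a list of rows of gray levels. Zones are listed row by row,
    so the result has m * n entries.
    """
    height, width = len(image), len(image[0])
    features = []
    for zone_row in range(m):
        y0, y1 = zone_row * height // m, (zone_row + 1) * height // m
        for zone_col in range(n):
            x0, x1 = zone_col * width // n, (zone_col + 1) * width // n
            pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(pixels) / len(pixels))
    return features


# A 4x4 glyph whose left half is black and right half is white.
glyph = [[0, 0, 255, 255]] * 4
print(zoning_features(glyph, 2, 2))  # [0.0, 255.0, 0.0, 255.0]
```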
  29. Feature Extraction: Projection Profile
  30. Feature Extraction: Structural Analysis
      Strokes, bays, end-points, intersections between lines and loops.
      High tolerance to noise and style variations.
  31. Classification
      The principal approaches to decision-theoretic recognition are minimum distance classifiers, statistical classifiers and neural networks.
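A minimum distance classifier can be sketched as follows, assuming each class is summarized by a mean feature vector and that Euclidean distance is used. The class labels and feature values below are invented for illustration.

```python
import math


def nearest_mean_classify(feature_vector, class_means):
    """Assign the label whose mean feature vector is closest (Euclidean)."""
    return min(
        class_means,
        key=lambda label: math.dist(feature_vector, class_means[label]),
    )


# Hypothetical two-feature prototypes for two character classes.
means = {"I": [0.1, 0.9], "O": [0.8, 0.8]}
print(nearest_mean_classify([0.2, 0.85], means))  # I
```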
  32. Matching
  33. Optimum statistical classifiers
      • Bayesian classifier. Given an unknown symbol described by its feature vector, the probability that the symbol belongs to the class c is computed for all classes c = 1...N. The symbol is then assigned the class which gives the maximum probability.
      • ...
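The decision rule above can be illustrated with a small Python sketch. It assumes, purely for the example, that features are independent and Gaussian within each class (a naive Bayes model); the slides do not specify the likelihood model, and all priors, means and variances below are made up.

```python
import math


def gaussian_log_likelihood(x, mean, variance):
    """Log of the Gaussian density N(x; mean, variance)."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)


def bayes_classify(features, classes):
    """Return the class with the maximum posterior probability.

    `classes` maps a label to (prior, means, variances). Scores are
    log-posteriors up to a constant shared by all classes.
    """
    scores = {}
    for label, (prior, means, variances) in classes.items():
        score = math.log(prior)
        for x, mean, variance in zip(features, means, variances):
            score += gaussian_log_likelihood(x, mean, variance)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical models for the characters "c" and "e".
models = {
    "c": (0.5, [0.3, 0.5], [0.01, 0.01]),
    "e": (0.5, [0.6, 0.5], [0.01, 0.01]),
}
print(bayes_classify([0.55, 0.5], models))  # e
```

Working with log-probabilities avoids numerical underflow when many features are multiplied together.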
  34. Post Processing: Grouping
      From symbols to strings using symbol proximity
      Eikvil, L. (1993). Optical Character Recognition
  35. Post Processing: Error Detection and Correction
      • Use of rules defining the syntax of the word. Ex. In English the k never appears after the h.
      • Use of dictionaries. If the word is not in the dictionary, an error has been detected, and may be corrected by changing the word into the most similar word.
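Dictionary-based correction can be sketched in Python as below. The slides do not say how "most similar" is measured; this sketch assumes Levenshtein edit distance, and the word list is a toy example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def correct_word(word, dictionary):
    """Keep a known word; otherwise return the closest dictionary entry."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))


lexicon = ["recognition", "reception", "cognition"]
print(correct_word("rec0gnition", lexicon))  # recognition
```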
  36. Self-learning
      Modern OCR systems enlarge the database of characters when new fonts are encountered. Character recognition is based on a previously built database, which contains the important features of the characters that are already known. This database must be able to expand itself as more and more new characters are met, in order to increase the recognition ability.
  37. Handwriting Recognition (HWR)
  38. Offline HWR: Many difficult problems
      • Stroke ordering
      • Broken lines
      • Merged blobs
  39. From Offline to Simulated Online
      It is not reliable
      • What order were the strokes written in?
      • Doubled-up line segments?
      • Ink blobs?
      • Spurious joins between letters?
      • Missing joins?
  40. Segmentation: Strokes Extraction
  41. Segmentation: Segments Fitting
      Robustly cut letters into segments
      Match multiple segments to detect letters
      Easier than matching whole letters
      Hutchison, L. Handwriting Recognition for Genealogical Records
  42. Analytical Approach
      It treats a word as a collection of simpler sub-units such as characters
      • Segmentation of the word into these units
      • Identification of the units
      • Word-level interpretation using a predefined lexicon
  43. Problems with the Analytical Approach
      • Segmentation ambiguity: deciding where to segment the word image
      • Variability of segment shape: determining the identity of each segment
  44. Holistic Matching
      Treats the word as a single, indivisible entity and attempts to recognize it using features of the word as a whole.
  45. Advantages of Holistic Matching
      Coarticulation effect: the appearance of a character changes as a function of the shapes of neighboring characters.
  46. Advantages of Holistic Matching
      Orthogonality of holistic features: they carry information about the word that is clearly orthogonal to knowledge of the characters in it, so it stands to reason that introducing this information should improve recognition.
  47. Advantages of Holistic Matching
      Evidence from psychological studies: psychological studies of reading point towards the fact that humans do not, in general, read words letter by letter.
  48. Dynamic Global Search
      Assemble word spelling from possible letter readings
  49. Result 1
  50. Result 2
  51. Result 3
  52. ABBYY FineReader: A Case Study
  53. Scanned Document
  54. Image Rotation Adjustment
  55. Image Rotation Adjustment
  56. First Extraction
  57. Synthesizing the Table
  58. Second Extraction
  59. Retrieval of the ornaments from the Hand-Press Period
  60. Problem Statement
      For millions of intact books and tens of millions of loose pages, the provenance of the manuscripts may be in doubt or completely unknown.
  61. Manual Solution
      Human experts can recover the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are rare and this investigation is a time-consuming process.
  62. Automatic Solution
      By comparing the initial letters in the manuscript to annotated initial letters whose origin is known, the provenance can be determined. This process can be automated.
  63. What are the Challenges?
  64. Ornament Segmentation
      Ornament(s) detection and localization with respect to the page reference system.
      Baudrier, E., Busson, S., Corsini, S., Delalandre, M., Landré, J., & Morain-Nicolier, F. (2009, July). Retrieval of the ornaments from the hand-press ...
  65. A Compression Based Distance Measure for Texture
      The distance between a window W and an annotated initial letter IL is defined as:
      $$\mathrm{dist}_{CK1}(W, IL) = \frac{\mathrm{mpegSize}(W, IL) + \mathrm{mpegSize}(IL, W)}{\mathrm{mpegSize}(W, W) + \mathrm{mpegSize}(IL, IL)} - 1$$
      The first image supplied to mpegSize is assigned as an I frame and the second becomes a P frame.
      Campana, B. J., & Keogh, E. J. (2010). A compression-based distance measure for texture. Statistical Analysis and Data Mining, 3(6), 381-398
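To show the shape of the CK1 formula, here is a rough Python sketch. Important caveat: it does not use MPEG video compression. As a stand-in assumption, `compressed_size` concatenates two byte strings and compresses them with zlib, which only mimics the idea that the second input is cheap to encode when it resembles the first. The toy "textures" are synthetic byte patterns, not images.

```python
import zlib


def compressed_size(first, second):
    """Stand-in for mpegSize (assumption): zlib size of the two inputs together."""
    return len(zlib.compress(first + second, 9))


def ck1_distance(w, il):
    """CK1 distance between a window `w` and an initial letter `il` (as bytes)."""
    numerator = compressed_size(w, il) + compressed_size(il, w)
    denominator = compressed_size(w, w) + compressed_size(il, il)
    return numerator / denominator - 1


# Two synthetic byte patterns standing in for textures.
stripes = bytes([0, 255] * 512)
speckle = bytes((7 * i * i + 13 * i) % 251 for i in range(1024))

same = ck1_distance(stripes, stripes)       # identical inputs give 0.0
different = ck1_distance(stripes, speckle)  # dissimilar inputs give a larger value
```

The distance is symmetric by construction, since both the numerator and the denominator are unchanged when the two arguments are swapped.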
  66. Properties of CK1 Distance Measure
      Efficient, robust and parameter-free texture similarity measure.
      Rotation, colour and illumination invariant.
  67. Gabor Filters
      Images are convolved with each filter.
      The mean and standard deviation of each response give a feature vector of length 48.
      Vectors are compared using the Euclidean distance.
      Wang, X., Ding, X., & Liu, C. (2005). Gabor filters-based feature extraction for character recognition. Pattern recognition, 38(3), 369-379
  68. Data Sets
  69. Experimental Results
  70. Part II: Text mining and semantic disambiguation
  71. Case study: Extracting named entities (people, places, etc.) in a text using Wikipedia
  72. Using Wikipedia
      • A unique ID: a Wikipedia article is identified by a unique name, which is the article title itself. The URL of a Wikipedia article can be created by concatenating the words in the article title and appending the result to the Wikipedia URL root.
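The title-to-URL rule can be sketched as follows. The English-language root `https://en.wikipedia.org/wiki/` is an assumption for the example (the course works on French newspaper text, so another language edition may have been used); words are joined with underscores and the remaining special characters are percent-encoded.

```python
from urllib.parse import quote

# Assumed URL root; other language editions use a different subdomain.
WIKIPEDIA_ROOT = "https://en.wikipedia.org/wiki/"


def article_url(title, root=WIKIPEDIA_ROOT):
    """Build an article URL from its title."""
    return root + quote(title.strip().replace(" ", "_"))


print(article_url("Marin Sanudo"))  # https://en.wikipedia.org/wiki/Marin_Sanudo
```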
  73. Using Wikipedia
      • Redirections: Some entities can have multiple names. To address this issue, Wikipedia has some article titles that do not have a substantive article and are only redirected to a different Wikipedia article with another title. This mechanism is called redirection. Redirections are also used for other purposes such as spelling resolution (e.g. the article title Oranges is redirected to Orange) and abbreviation resolution (e.g. the article title UCLA is redirected to University of California, Los Angeles).
  74. Using Wikipedia
      • Disambiguation pages: A disambiguation page is created for ambiguous entity names and it enumerates all the possible articles for that name. For example, the disambiguation page for Paris enumerates 25 places called Paris (in America, Canada and Europe), 33 people having Paris as name or surname, 10 television series and films whose title contains the word Paris, etc.
  75. Using Wikipedia
      • Outgoing links: In the body text of the Wikipedia article there are references (links) to other articles. The references are within pairs of double square brackets.
      • Infobox: An infobox is a fixed-format table designed to be added to the top right-hand corner of articles to consistently present a summary of some unifying aspect that the articles share and sometimes to improve navigation to other interrelated articles.
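Extracting outgoing links from the double square brackets can be sketched with a regular expression. This is a simplified illustration: it keeps the link target, drops any `|label` or `#section` part, and does not filter out file or category links. The sample wikitext is invented.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; group 1 is the target.
LINK_PATTERN = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def outgoing_links(wikitext):
    """Return the targets of all wiki links found in the article source."""
    return [match.group(1).strip() for match in LINK_PATTERN.finditer(wikitext)]


sample = (
    "'''Paris''' is the capital of [[France]], on the [[Seine|river Seine]]. "
    "See also [[Louvre#History]]."
)
print(outgoing_links(sample))  # ['France', 'Seine', 'Louvre']
```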
  76. 3 steps
      • Data extraction: A (sequence of) word(s) is extracted from a "Le Temps" article (e.g. Le Paris). Set the right boundaries in the extracted data (e.g. "Paris" is retrieved from "Le Paris").
      • Disambiguation: Retrieve all the Wikipedia articles whose title contains the word "Paris" (e.g. Paris (France), Paris (Texas), Paris Hilton, Paris (mythology), etc.). Find the Wikipedia article that maximizes the agreement between the content extracted from Wikipedia and the context of the "Le Temps" article.
      • Entity classification: Classify the entity as place, person, company, etc., based on the chosen Wikipedia article
  77. Disambiguation strategy
  78. (1) Data extraction
      • The first step is the extraction of possible named entities. This step is based on the fact that named entities consist of capitalized words. The rules that we apply for the extraction of possible named mentions in the text are the following:
        • Retrieve all the capitalized words (e.g. England)
        • Retrieve recursively terms T0 of the form T1 Particle T2, where Particle is a connecting word such as a preposition or conjunction, and the terms T1 and T2 are capitalized words or sequences of capitalized words (e.g. University of Edinburgh, European Society of Athletic Therapy and Training)
        • In French, some entities can contain non-capitalized words after some specific words. Therefore, we retrieve non-capitalized words if they follow a word that is contained in a predefined set of words (e.g. Union, Bibliothèque, etc.). For example, the Union soviétique is considered as an entity.
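The three extraction rules can be sketched in Python as below. This is an illustrative approximation, not the course implementation: the `PARTICLES` set is an assumption, `TRIGGERS` contains only the two words named on the slide, and the loop is iterative rather than recursive. Sentence-initial words would also be picked up, which a real system has to filter.

```python
import re

PARTICLES = {"of", "and", "de", "du", "des"}  # assumed list of connecting words
TRIGGERS = {"Union", "Bibliothèque"}          # examples from the slide; the full set is not given


def extract_candidates(text):
    """Return candidate named entities following the three rules above."""
    # Words (letters only) and punctuation marks; punctuation ends a candidate.
    tokens = re.findall(r"[^\W\d_]+|[^\w\s]", text)
    candidates, i = [], 0
    while i < len(tokens):
        if not tokens[i][0].isupper():
            i += 1
            continue
        j = i + 1
        while j < len(tokens):
            if tokens[j][0].isupper():
                j += 1  # rule 1: another capitalized word
            elif (tokens[j] in PARTICLES and j + 1 < len(tokens)
                  and tokens[j + 1][0].isupper()):
                j += 2  # rule 2: T1 Particle T2
            elif tokens[j - 1] in TRIGGERS:
                j += 1  # rule 3: lower-case word after a trigger word
            else:
                break
        candidates.append(" ".join(tokens[i:j]))
        i = j
    return candidates


text = ("in 1990 the University of Edinburgh welcomed visitors "
        "from England and the Union soviétique.")
print(extract_candidates(text))
# ['University of Edinburgh', 'England', 'Union soviétique']
```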
  79. (2) Disambiguation
      • The disambiguation process employs a vector space model, in which a vectorial representation of the processed article is compared with the vectorial representations of the Wikipedia entities.
      • The vectorial representation of the processed article (article vector) is a vector having all the possible entities of the specific article obtained during the previous step, while the vectorial representation of a Wikipedia article (Wikipedia vector) is a vector having all the outgoing links in the body text of the article.
      • Once a Wikipedia article is identified as the most similar to the processed article, the article vector is updated by adopting the features of the chosen Wikipedia vector.
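A minimal Python sketch of this comparison follows. The slides do not name the similarity measure; cosine similarity over binary (presence/absence) vectors is assumed here, and "adopting the features" is interpreted as a set union. The link lists are invented for illustration.

```python
import math


def cosine_similarity(article_entities, wikipedia_links):
    """Cosine similarity between two binary vectors given as collections of names."""
    a, b = set(article_entities), set(wikipedia_links)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def disambiguate(article_entities, candidates):
    """Pick the candidate article most similar to the processed article.

    `candidates` maps a Wikipedia title to its outgoing links. Returns the
    chosen title and the updated article vector (assumed here to be the
    union of both vectors).
    """
    best = max(
        candidates,
        key=lambda title: cosine_similarity(article_entities, candidates[title]),
    )
    updated = set(article_entities) | set(candidates[best])
    return best, updated


article = ["Paris", "Seine", "France", "Louvre"]
candidates = {
    "Paris (France)": ["France", "Seine", "Louvre", "Eiffel Tower"],
    "Paris (Texas)": ["Texas", "United States", "Lamar County"],
    "Paris Hilton": ["Hilton Hotels", "New York City"],
}
best, updated = disambiguate(article, candidates)
print(best)  # Paris (France)
```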
  80. (3) Entity classification
      • The last step is to classify the entities into persons, places, companies, etc.
      • Example: is the entity a place? If the Wikipedia article contains an infobox, then we retrieve it and we search for specific tags in it that can classify the entity as a place.
      • If the Wikipedia article does not have an infobox, then we use the first sentence of the body text.
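The two-stage classification can be sketched as follows. Everything specific here is an assumption made for illustration: the mapping from infobox type to entity class, the keyword cues used on the first sentence, and the function name `classify_entity`. The slides only state that tags in the infobox are inspected, with the first sentence as a fallback.

```python
import re

# Assumed mapping from infobox type to entity class.
INFOBOX_TYPES = {
    "settlement": "place",
    "country": "place",
    "person": "person",
    "company": "company",
}

# Assumed keyword cues for the first-sentence fallback.
SENTENCE_CUES = [
    ("place", {"city", "town", "capital", "village"}),
    ("person", {"born", "writer", "historian"}),
    ("company", {"company", "corporation"}),
]


def classify_entity(wikitext):
    """Classify an article source as place, person, company or unknown."""
    match = re.search(r"\{\{\s*Infobox\s+([^|}\n]+)", wikitext, re.IGNORECASE)
    if match:
        return INFOBOX_TYPES.get(match.group(1).strip().lower(), "unknown")
    # No infobox: fall back to the words of the first sentence.
    first_sentence = wikitext.split(".")[0].lower()
    words = set(re.findall(r"\w+", first_sentence))
    for label, cues in SENTENCE_CUES:
        if words & cues:
            return label
    return "unknown"


print(classify_entity("{{Infobox settlement\n| name = Paris\n}}"))  # place
```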
  81. Partial results
      • We have implemented the algorithm and tested it on a subset of the database
      • Our current estimate is that 85% of the entities are retrieved
      • Main issue: some entities are not in Wikipedia
  82. From Wikipedia to Wikipast
      • The first principle of Wikipedia is that it is an encyclopedia. Not all entities are allowed. Sourcing is important but secondary
      • Ongoing discussion with Wikimedia to create an alternative to Wikipedia, allowing a page on any person, place, etc. from the past as long as it is clearly sourced
