Basic Techniques in NLP
Making Structure from Unstructured Text
Data mining applications use structured information
Data is in a spreadsheet format (Tables)
Text is presented in a different format
Text mining uses many data mining algorithms
Need to present the text in a tabular form
Structuring Unstructured Data
Example sentence: “The company provides jobs to overseas students”

Doc No.   Company   Provide   Jobs   Overseas   Students
1         1         1         1      1          1
2         0         0         1      1          0
3         1         0         1      1          0
4         1         0         1      0          1
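As an illustration, the tabular form above can be produced as a binary term–document matrix; a minimal sketch using scikit-learn (the library choice and toy documents are assumptions, not part of the original slides):

# Minimal sketch: turn free text into the tabular (term-document) form above.
# Assumes scikit-learn is installed; the documents are toy examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The company provides jobs to overseas students",
    "Overseas jobs are advertised by the company",
]
vectorizer = CountVectorizer(binary=True)    # 1 = term occurs in the document
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(matrix.toarray())                      # one row per document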
Applications
Classification
Prediction
Categorization
Information retrieval
Similarity between documents
Not based on linguistic analysis
Based on statistical and associational analysis
Text Preprocessing
Document triage
Character encoding identification: determines the character
encoding (or encodings) of a file and optionally converts
between encodings.
Language identification: determines the natural language of a
document; this step is closely linked to character encoding identification.
Text sectioning: identifies the actual content within a file,
discarding undesirable elements such as images, tables, headers,
links, and HTML formatting.
Character Encoding Identification
 Most texts were encoded in the 7-bit character set ASCII, which allows only 128
(2^7) characters. Eight-bit character sets can encode 256 (2^8) characters using a
single 8-bit byte, but most of these 8-bit sets reserve the first 128 characters for
the original ASCII characters
 The same range of numeric values can represent different characters in different
encodings, which can be a problem for tokenization.
 For example, English and Spanish texts are both normally stored in the common 8-bit
encoding Latin-1 (or ISO-8859-1). An English or Spanish tokenizer would need to
be aware that bytes in the (decimal) range 161–191 in Latin-1 represent
punctuation marks and other symbols (such as ‘¡’, ‘¿’, ‘£’, and ‘©’)
 Tokenization rules are required to handle each symbol (and thus its byte code)
appropriately for that language
 Tokenizers must be targeted to a specific language in a specific encoding (a
detection sketch follows)
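A minimal detection-and-conversion sketch, assuming the third-party chardet package and a hypothetical input file (neither is part of the original slides):

# Minimal sketch: detect a file's encoding and normalize it to UTF-8.
# Assumes the third-party 'chardet' package (pip install chardet);
# 'document.txt' is a hypothetical input file.
import chardet

raw = open("document.txt", "rb").read()
guess = chardet.detect(raw)              # e.g. {'encoding': 'ISO-8859-1', ...}
text = raw.decode(guess["encoding"])     # decode with the detected encoding
open("document_utf8.txt", "w", encoding="utf-8").write(text)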
Language Identification
 For languages with a unique alphabet not used by any other language (e.g., Greek or
Hebrew), language identification reduces to character set identification
 Character set identification can be used to narrow the task of language identification to a
smaller number of languages that all share many characters
 Arabic vs. Persian, Russian vs. Ukrainian, or Norwegian vs. Swedish
 For European languages that use exactly the same character set but with different
character frequencies, final identification can be performed by training models of
byte/character distributions in each of the languages
 A very effective algorithm sorts the bytes in a file by frequency count and uses
the sorted list as a signature vector for comparison via an n-gram or vector distance
model (sketched below)
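A minimal sketch of such a byte-frequency signature comparison (the rank-distance measure and toy profiles are assumptions for illustration):

# Minimal sketch of byte-frequency language identification (assumed
# training data and distance measure, not the exact published algorithm).
from collections import Counter

def signature(data: bytes, top: int = 50) -> list:
    """Bytes sorted by descending frequency: a compact signature of the text."""
    return [b for b, _ in Counter(data).most_common(top)]

def distance(sig_a: list, sig_b: list) -> int:
    """Sum of rank differences between two signatures (lower = more similar)."""
    rank_b = {b: i for i, b in enumerate(sig_b)}
    return sum(abs(i - rank_b.get(b, len(sig_b))) for i, b in enumerate(sig_a))

def identify(text: bytes, profiles: dict) -> str:
    """Pick the language whose training signature is closest to the text's."""
    sig = signature(text)
    return min(profiles, key=lambda lang: distance(sig, profiles[lang]))

# profiles would be built from known-language corpora, e.g.:
# profiles = {"norwegian": signature(no_corpus), "swedish": signature(sv_corpus)}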
Normalization
 In “right-to-left languages” like Hebrew and Arabic: you can have
“left-to-right” text interspersed (e.g., for dollar amounts).
 Need to “normalize” indexed text as well as query terms into the
same form
 Character-level alphabet detection and conversion
 Tokenization not separable from this.
 Sometimes ambiguous:
7月30日 vs. 7/30 (the same date, July 30, in two written forms)
Morgen will ich in MIT … (“Tomorrow I want [to be] in MIT” – or is
MIT here the German preposition mit, “with”?)
Text Preprocessing
Text segmentation (Tokenization)
Process of converting a well-defined text corpus into its
component words and sentences.
Word segmentation: breaks up the sequence of characters in a
text by locating the word boundaries
Sentence segmentation: process of determining the longer
processing units consisting of one or more words
Character Dependence
Depends on the “writing System” used
Logographic (Egyptian hieroglyphics)
 each character represents a word or morpheme
Syllabic (Devanagari)
The consonants each have an inherent vowel, which can be changed to
another vowel or muted by means of diacritics or other modifications
Alphabetic (English)
Standard set of letters (graphemes) of consonants and vowels that
encode based on the general principle that they represent speech sounds
Corpus Dependence
 Availability of large corpora in multiple languages that encompass a wide range of
data types (e.g., newswire texts, email messages, closed captioning data, Internet
news pages, and weblogs) has required the development of robust NLP
approaches
 These corpora frequently contain misspellings, erratic punctuation and spacing,
and other irregular features.
 It is notoriously difficult to prescribe rules governing the use of a written language;
it is even more difficult to get people to “follow the rules”
 Example (request for help)
 ive just loaded pcl onto my akcl. when i do an ‘in- package’ to load pcl, ill get the prompt
but im not able to use functions like defclass, etc... is there womething basic im missing
or am i just left hanging, twisting in the breeze?
AKCL: Austin Kyoto Common Lisp; PCL: Portable Common Loops
Application Dependence
 There is no absolute definition for what constitutes a word or a sentence.
 The English words I am are frequently contracted to I’m; one tokenizer may expand the
contraction to recover the essential grammatical features of the pronoun and the verb
 Another tokenizer that does not expand the contraction to its component words would pass the
single token I’m to later processing stages. Unless these processors, which may include
morphological analyzers, part-of-speech taggers, lexical lookup routines, or parsers, are aware
of both the contracted and uncontracted forms, the token may be treated as an unknown
word
 Treatment of the English possessive ’s in various tagged corpora. In the Brown corpus the
word governor’s is considered one token and is tagged as a possessive noun. In the Susanne
corpus the same word is treated as two tokens, governor and ’s, tagged singular noun and
possessive, respectively
Format/language stripping
 Documents being indexed can include docs from many different
languages
 A single index may have to contain terms of several languages.
 Sometimes a document or its components can contain multiple
languages/formats
 French email with a Portuguese pdf attachment.
 What is a unit document?
 An email?
 With attachments?
 An email with a zip containing documents?
Tokenization
Space delimited languages
English, other European languages
“The Cow Jumped over the Moon”
Unsegmented languages
Chinese/Japanese
牛は月を飛び越えました
(Ushi wa tsuki o tobikoemashita)
Tokenization in Space-Delimited Languages
Very Common in Artificial Languages (C++)
Words are separated by whitespace
 NS Synthetics Ltd. said it expects to report a net loss for its second quarter
ended March 16 and doesn’t expect to meet analysts’ profit estimates of ₹3.9
to ₹4 million, or 76 paisa a share to 79 paisa share, for its year ending Sept. 24.
 The example uses periods in three different ways: abbreviation, decimal
point, and end of sentence
 It uses apostrophes in two ways: genitive (analysts’) and contraction
(doesn’t)
Tokenization
How to treat “76 paisa a share”
Is it 4 tokens or one token?
₹3.9 to ₹4 million
Same as
3.9 to 4 million rupees or ₹3,900,000 to ₹4,000,000
Is 3.9 same as 3.90, 3.9000?
Tokenization
 An initial tokenization of a space-delimited language would be to
consider as a separate token any sequence of characters
preceded and followed by space
 Other punctuation characters such as , . ; ‘ “ are treated as
separate tokens
 What about 4.983 million? Rs. 40,000? Doesn’t?
 Markup and headers (including HTML markup), extra
whitespace, and extraneous control characters must also be
stripped (a first-pass sketch follows)
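A minimal first-pass sketch along these lines (illustrative only; it deliberately ignores the hard cases just listed):

# Minimal sketch of a first-pass tokenizer for a space-delimited language:
# split on whitespace, then peel punctuation off as separate tokens.
import re

def tokenize(text: str) -> list:
    tokens = []
    for chunk in text.split():                 # whitespace-delimited chunks
        # split each chunk into runs of word characters vs. punctuation
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize('He said, "4.983 million."'))
# -> ['He', 'said', ',', '"', '4', '.', '983', 'million', '.', '"']
# Note how "4.983" is wrongly split -- exactly the problem raised above.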
Tokenization of Punctuation
Many times punctuation characters should be
“attached” to another token
Vary from one language to another
Abbreviations – important for both tokenization and
sentence segmentation
St. Theresa High School on Bendur St. is a well-known
landmark in Mangalore
Tokenization of Punctuation
 Quotation marks (‘words’ or “words”) create ambiguity
 Determine opening or closing of a quoted passage
 The single quote also doubles as the apostrophe
 The apostrophe is used to mark the genitive form of a noun, to mark
contractions, and to mark certain plural forms (Tom’s; isn’t; 80’s)
 Quotation marks are also commonly used when “Romanizing” writing
systems, where umlauts and accents are denoted by a double/single quotation
mark
 Some European languages do not use an apostrophe in the genitive
(German Peters Kopf = Peter’s head)
Tokenization of Punctuation
(apostrophe)
 Multiple contractions (fo’c’s’le = forecastle)
 French and other such languages (l’homme, c’etait)
 Recognizing which contractions to expand requires knowledge
of the language; the specific contractions to expand, as well as
their expanded forms, must be enumerated, which allows the
proper tokenization of multi-contracted words
 we’ve → we have (two tokens)
 All other word-internal apostrophes are treated as part of
the token and not expanded (a lookup sketch follows)
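A minimal sketch of enumerated contraction expansion (the table entries are illustrative assumptions, not an exhaustive list):

# Minimal sketch: expand only enumerated contractions; leave all other
# word-internal apostrophes untouched.
CONTRACTIONS = {
    "we've": ["we", "have"],
    "doesn't": ["does", "not"],
    "i'm": ["i", "am"],
}

def expand(token: str) -> list:
    """Expand a known contraction; otherwise return the token unchanged."""
    return CONTRACTIONS.get(token.lower(), [token])

print(expand("we've"))        # ['we', 'have']
print(expand("fo'c's'le"))    # ["fo'c's'le"] -- unknown, kept as one token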
Tokenization of Punctuation
Hyphen
 Agglutinating constructions (Supercalifragilisticexpialidocious)
 Super-cali-fragilistic-expiali-docious (super- "above", cali- "beauty", fragilistic- "delicate",
expiali- "to atone", and -docious "educable", with the sum of these parts signifying roughly
"atoning for educability through delicate beauty")
 Hyphenated compounds are more common in German (Feuer- und
Lebensversicherung = fire and life insurance → four tokens)
 Hyphens at the ends of lines break a word too long to fit
on one line (about 5% of the end-of-line hyphens in an English
corpus were word-internal hyphens)
 No help from “white-space”
Tokenization of Multi-word Expressions
 Multiple words in a given order – need to be treated as a single token
 in spite of; de facto
 Multi-word numerical expressions
 March 13, 2016; Mar. 13 2016; 13 Mar. 2016; 13/3/2016; 3/13/2016
 1.25 Rupees a share dividend (one token or 5 tokens?)
 Highly language-dependent and application-dependent, but can easily
be handled in the tokenization stage (see the sketch after this list)
 “No one” = no-one = nobody; compare with “No one man can do it
alone”
 What is the difference between can’t and cannot?
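A minimal sketch of multiword-expression merging at the tokenization stage (the MWE list is an illustrative assumption); note that it would also wrongly merge “No one man…”, which is exactly the application dependence noted above:

# Minimal sketch: merge enumerated multiword expressions into single tokens.
MWES = [("in", "spite", "of"), ("de", "facto"), ("no", "one")]

def merge_mwes(tokens: list) -> list:
    out, i = [], 0
    while i < len(tokens):
        for mwe in MWES:
            if tuple(t.lower() for t in tokens[i:i + len(mwe)]) == mwe:
                out.append(" ".join(tokens[i:i + len(mwe)]))  # one token
                i += len(mwe)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes("he won in spite of everything".split()))
# ['he', 'won', 'in spite of', 'everything']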
Unsegmented Languages
 No white spaces. Requires more informed approach
 An extensive word list combined with an informed segmentation
algorithm works well
 Problem recognizing unknown (or out-of-vocabulary) words
 A simple word segmentation algorithm consists of considering each
character to be a distinct word. This is practical for Chinese because the
average word length is very short
 Does not work with Japanese
 Modern Japanese mixes kanji (Chinese hanzi symbols), hiragana (a syllabary for
grammatical markers and for words of Japanese origin), katakana (a syllabary for
words of foreign origin), romaji (words written in the Roman alphabet), and Arabic
numerals
Greedy Algorithm
 The greedy (maximum-matching) algorithm starts at the first character in a text
 Using a word list for the language being segmented, it attempts
to find the longest word in the list starting with that character
 If a word is found, the maximum-matching algorithm marks a
boundary at the end of the longest word
 It then begins the same longest-match search starting at the
character following the match
 “thetabledownthere” would be segmented by the greedy algorithm as “theta
bled own there”
 The reverse maximum-matching algorithm, scanning from the end of the
text, gets it right: “the table down there” (see the sketch below)
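A minimal sketch of forward and reverse maximum matching under a toy word list (an assumption for illustration):

# Minimal sketch of forward and reverse maximum matching.
WORDS = {"the", "theta", "table", "bled", "down", "own", "there"}
MAX_LEN = max(len(w) for w in WORDS)

def forward_mm(text: str) -> list:
    """Greedy left-to-right longest match."""
    i, out = 0, []
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in WORDS or j == i + 1:   # fall back to one char
                out.append(text[i:j])
                i = j
                break
    return out

def reverse_mm(text: str) -> list:
    """Greedy right-to-left longest match."""
    j, out = len(text), []
    while j > 0:
        for i in range(max(0, j - MAX_LEN), j):
            if text[i:j] in WORDS or i == j - 1:   # fall back to one char
                out.append(text[i:j])
                j = i
                break
    return out[::-1]

print(forward_mm("thetabledownthere"))  # ['theta', 'bled', 'own', 'there']
print(reverse_mm("thetabledownthere"))  # ['the', 'table', 'down', 'there']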
Sentence Segmentation
 Requires understanding of the various uses of punctuation characters in that
language
 Problem reduces to disambiguating all instances of punctuation characters
that may delimit sentences
 Punctuation marks which can denote sentence boundaries: periods,
question marks, exclamation points etc.
 But many of these can also occur in the middle of a sentence
 Ellipsis (a series of periods (...)) can occur both within sentences and at
sentence boundaries
 Exclamation points and question marks can occur at the end of a sentence,
but also within quotation marks or parentheses (Yahoo!)
Alice in Wonderland
 Sentence boundary punctuation marks considered are the period,
question mark, and exclamation point, and the definition of
sentence is limited to the text sentence which begins with a capital
letter and ends in a full stop.
 There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way
to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be late!’ (when she thought it over
afterwards, it occurred to her that she ought to have wondered at this, but at the time it all
seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS
WAISTCOAT-POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed
across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was
just in time to see it pop down a large rabbit-hole under the hedge.
Alice in Wonderland
 If the semicolon and comma were allowed to end sentences, the
example could be decomposed into as many as ten grammatical
sentences
 ‘Oh dear! Oh dear! I shall be late!’
 Embedded sentence boundary: Treating embedded sentences and
their punctuation differently (multiple levels of embedding would be
possible)
 The approach needs to integrate segmentation with parsing
 Application of trainable techniques to broader problems of this kind
remains an open research area
Simple Rules
 Simple rule: find a period followed by one or more spaces followed
by a word beginning with a capital letter (a regex sketch follows the examples)
 The Call of the Wild by Jack London has 1640 periods as sentence boundaries; this single
rule correctly identifies 1608 boundaries (98%)
 A small WSJ corpus has 16,466 periods as sentence boundaries; this simple rule
detects only 14,562 (88.4%) while producing 2900 false positives
 “Two high-ranking positions were filled Friday by Penn St. University
President Graham Spanier” – one sentence
 Two high-ranking positions were filled Friday at Penn St. University
President Graham Spanier announced the appointments – two
sentences
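A minimal regex sketch of the simple rule, applied to the one-sentence example above:

# Minimal sketch of the simple rule: split after a period that is
# followed by whitespace and a capitalized word.
import re

def split_sentences(text: str) -> list:
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

text = ("Two high-ranking positions were filled Friday by Penn St. "
        "University President Graham Spanier.")
print(split_sentences(text))
# -> ['Two high-ranking positions were filled Friday by Penn St.',
#     'University President Graham Spanier.']
# The rule wrongly splits after "St." even though this is one sentence.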
Context can Help
 Case distinctions—In languages and corpora where both uppercase and
lowercase letters are consistently used, whether a word is capitalized provides
information about sentence boundaries.
 Part of speech—Palmer and Hearst (1997) showed that the parts of speech of
the words within three tokens of the punctuation mark can assist in sentence
segmentation. Their results indicate that even an estimate of the possible parts of
speech can produce good results.
 Word length—Riley (1989) used the length of the words before and after a
period as one contextual feature.
 Lexical endings—Müller et al. (1980) used morphological analysis to recognize
suffixes and thereby filter out words which were not likely to be abbreviations. The
analysis made it possible to identify words that were not otherwise present in the
extensive word lists used to identify abbreviations.
Context can Help
 Prefixes and suffixes—Reynar and Ratnaparkhi (1997) used both
prefixes and suffixes of the words surrounding the punctuation mark
as one contextual feature.
 Abbreviation classes—Riley (1989) and Reynar and Ratnaparkhi
(1997) further divided abbreviations into categories such as titles and
corporate designators
 Internal punctuation—Kiss and Strunk (2006) used the presence of
periods within a token as a feature.
 Proper nouns—Mikheev (2002) used the presence of a proper noun
to the right of a period as a feature
Training Models
 Riley (1989) used classification and regression trees to classify periods according to
contextual features describing the single word preceding and
following the period.
 These contextual features included word length, punctuation after
the period, abbreviation class, case of the word, and the probability
of the word occurring at the beginning or end of a sentence
 He trained on 25 million words from the AP newswire and reported an
accuracy of 99.8% when tested on the Brown corpus.
 Riley, M. D. (1989). Some applications of tree-based modelling to speech and
language indexing. In Proceedings of the DARPA Speech and Natural Language
Workshop, San Mateo, CA, pp. 339–352. Morgan Kaufmann
Training Models
 Palmer and Hearst (1997) used a machine learning algorithm to disambiguate all
occurrences of periods, exclamation points, and question marks.
 The system defined a contextual feature array for three words
preceding and three words following the punctuation mark; the
feature array encoded the context as the parts of speech, which can
be attributed to each word in the context.
 Using the lexical feature arrays, both a neural network and a decision
tree were trained to disambiguate the punctuation marks, and
achieved a high accuracy rate (98%–99%) on a large corpus from the
Wall Street Journal
 Palmer, D. D. and M. A. Hearst (1997). Adaptive multilingual sentence boundary
disambiguation. Computational Linguistics 23(2), 241–67
Training Models
 Reynar and Ratnaparkhi (1997) identified English sentence boundaries using a statistical
maximum entropy model.
 They used a system of contextual templates, which encoded one word of context
preceding and following the punctuation mark, using such features as prefixes,
suffixes, and abbreviation class.
 The system was successful in inducing an abbreviation list from the training data for use
in the disambiguation.
 The algorithm, trained in less than 30 min on 40,000 manually annotated
sentences, achieved an accuracy rate of 98%+ on the same test corpus used by
Palmer and Hearst without requiring specific lexical information, word lists, or
any domain-specific information.
 Reynar, J. C. and A. Ratnaparkhi (1997). A maximum entropy approach to identifying
sentence boundaries. In Proceedings of the Fifth ACL Conference on Applied Natural
Language Processing, Washington, DC
Training Models
 Mikheev (2002) proposed a sentence segmentation algorithm that jointly identifies
abbreviations, proper names, and sentence boundaries.
 The algorithm casts the sentence segmentation problem as one of
disambiguating abbreviations to the left of a period and proper names
to the right.
 While using unsupervised training methods, the algorithm encodes a
great deal of manual information regarding abbreviation structure and
length.
 The algorithm also relies heavily on consistent capitalization in order to
identify proper names
 Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics 28(3), 289–
318
Training Models
 Kiss and Strunk (2006) developed a largely unsupervised approach
to sentence boundary detection that focuses primarily on
identifying abbreviations.
 The algorithm encodes manual heuristics for abbreviation detection
into a statistical model that first identifies abbreviations and then
disambiguates sentence boundaries.
 The approach is essentially language independent, and they report
results for a large number of European languages.
 Kiss, T. and J. Strunk (2006). Unsupervised multilingual sentence boundary detection.
Computational Linguistics 32(4), 485–525.
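For reference, the Kiss and Strunk (2006) approach underlies the Punkt sentence tokenizer shipped with NLTK; a minimal usage sketch, assuming NLTK is installed and its punkt model has been downloaded:

# Minimal sketch using NLTK's Punkt tokenizer, an implementation of the
# Kiss & Strunk (2006) unsupervised approach. Assumes the 'punkt' model
# was fetched beforehand via nltk.download('punkt').
import nltk

text = "The group included Dr. M. R. Rao. He arrived at 5 p.m. Saturday."
print(nltk.sent_tokenize(text))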
Case on Sentence Boundary Identification
 Disambiguation of sentence boundary is critical
 “The group included Dr. M. R. Rao and T. Art Stoker Jr.”
 “This issue crosses party lines and crosses philosophical lines!” said
Rep. John Rowland (R., Conn.).
 “It was due Friday by 5 p.m. Saturday would be too late.”
 “She has an appointment at 5 p.m. Saturday to get her car fixed.”
Methodology
 Classification trees and ANNs are used
 The objective is to predict “sentence boundary”
 Several corpora in three languages were used: English, German, and
French
 The input data is the lexical information obtained after
tokenization
 Initially, tokenization was done to extract the tokens
(words/tokens)
Methodology
 The context surrounding a punctuation mark is represented as a sequence of
vectors
 The context vector (descriptor arrays) constructed for each context
word represents an estimate of the part-of-speech distribution for
the word, obtained from a lexicon containing part-of-speech
frequency data.
 These vectors form the input to a machine learning algorithm
trained to disambiguate sentence boundaries
 The output of the learning algorithm is then used to determine the
role of the punctuation mark in the sentence
Descriptor Array
 The context surrounding a punctuation mark is defined by using the
individual words preceding and following the punctuation mark
 “at the plant. He had thought”
 the context is approximated by using a single part-of-speech for
each word
 preposition article noun . pronoun verb verb
 Either the prior probabilities of all parts of speech for that word
are assigned, or a binary value for each possible part of speech is
assigned for that word
 preposition(1.0) article(1.0) noun(0.8)/verb(0.2)
 pronoun(1.0) verb(1.0) noun(0.1)/verb(0.9)
Binary Value
 A binary part-of-speech value is assigned for each possible part of
speech
 The vector is assigned the value 1 if the word can ever occur
as that part-of-speech (according to the lexicon), and the
value 0 if it cannot
 preposition(1) article(1) noun(1)/verb(1)
 pronoun(1) verb(1) noun(1)/verb(1)
 The part-of-speech data necessary to construct probabilistic
and binary vectors is based on
 the lexicon of a part-of-speech tagger, or
 an existing NLP tool, or
 word lists
Heuristics for unknown words
 Unknown tokens containing a digit (0-9) are assumed to be numbers.
 Any token beginning with a period, exclamation point, or question mark is
assigned a “possible end-of-sentence punctuation” tag. (“?!” and “...”)
 Common morphological endings are recognized and the appropriate
part(s)-of-speech is assigned to the entire word.
 Words containing a hyphen are assigned a series of tags and frequencies
equally distributed between adjective, common noun, and proper noun
 Words containing an internal period are assumed to be abbreviations
 A capitalized word is not always a proper noun, even when it appears
somewhere other than in a sentence’s initial position (e.g., “American”).
 Capitalized words not present in the lexicon are assigned a probability of 0.9 for proper noun,
and the remainder is distributed uniformly among adjective, common noun, verb, and abbreviation
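A minimal sketch of these unknown-word heuristics (the category names follow the bullets above; the function itself is an illustrative assumption):

# Minimal sketch: assign a part-of-speech distribution to an unknown token.
def guess_pos(token: str) -> dict:
    if any(ch.isdigit() for ch in token):
        return {"number": 1.0}
    if token[0] in ".!?":                       # e.g. "?!" and "..."
        return {"sentence-ending punctuation": 1.0}
    if "." in token:                            # internal period
        return {"abbreviation": 1.0}
    if "-" in token:                            # hyphenated word
        return {"adjective": 1/3, "common noun": 1/3, "proper noun": 1/3}
    if token[0].isupper():                      # capitalized, not in lexicon
        rest = 0.1 / 4
        return {"proper noun": 0.9, "adjective": rest,
                "common noun": rest, "verb": rest, "abbreviation": rest}
    return {"others": 1.0}

print(guess_pos("U.N."))          # {'abbreviation': 1.0}
print(guess_pos("data-driven"))   # three-way split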
Descriptor Array Construction
 Parts of speech (18 categories):
noun; verb; conjunction; pronoun; preposition; proper noun;
number; comma or semicolon; left parenthesis; right parenthesis;
article; modifier; non-punctuation character; possessive;
colon or dash; abbreviation; sentence-ending punctuation; others
Descriptor Array Construction
 The 18 category frequencies for the word are converted to
probabilities by using the relative frequencies
 For a binary vector, all categories with a non-zero frequency
count are assigned a value of 1, and all others are assigned a
value of 0.
 The descriptor array contains two additional flags that indicate if
the word begins with a capital letter and if it follows a
punctuation mark
 Thus there are 20 items in each descriptor array (a construction
sketch follows)
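A minimal construction sketch, assuming a toy lexicon of category frequencies (the lexicon contents are an illustrative assumption):

# Minimal sketch of descriptor-array construction: 18 POS probabilities
# plus two flags (capitalized?, follows punctuation?).
CATEGORIES = [
    "noun", "verb", "conjunction", "pronoun", "preposition", "proper noun",
    "number", "comma or semicolon", "left parenthesis", "right parenthesis",
    "article", "modifier", "non-punctuation character", "possessive",
    "colon or dash", "abbreviation", "sentence-ending punctuation", "others",
]

LEXICON = {"plant": {"noun": 80, "verb": 20}}   # toy frequency counts

def descriptor_array(word: str, follows_punct: bool) -> list:
    freqs = LEXICON.get(word.lower(), {})
    total = sum(freqs.values()) or 1
    probs = [freqs.get(c, 0) / total for c in CATEGORIES]  # relative freqs
    flags = [float(word[0].isupper()), float(follows_punct)]
    return probs + flags                                   # 20 items

vec = descriptor_array("plant", follows_punct=False)
print(len(vec), vec[0], vec[1])   # 20 0.8 0.2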
ANNs
 Results by context size (number of tokens preceding and following the
punctuation mark):

Context Size   Training Epochs   Testing Errors   Error (%)
4              1731              1424             5.2
6              218               409              1.5
8              831               877              3.2
ANNs – Mixed Cases
 Experimented by changing the case of the input text:

Type         Probabilistic                      Binary
             Epochs   Errors   Error (%)        Epochs   Errors   Error (%)
Mixed Case   368      483      1.8              312      474      1.7
Lower Case   182      890      3.3              148      813      3.0
Upper Case   542      956      3.5              190      744      2.7

(Epochs = training epochs; Errors = testing errors)
Classification Trees
Questions?
Clarifications?

Editor's Notes

  • #3: Handbook of Natural Language Processing, Nitin Indurkhya and Fred J. Damerau (eds.)
  • #38: Adaptive Multilingual Sentence Boundary Disambiguation, David D. Palmer and Marti A. Hearst