Natural Language Processing
Text Normalization
& Corpus
Text Normalization
• Conversion of text that includes 'nonstandard' words, such as numbers,
abbreviations, and misspellings, into standard words.
Example:
u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
"$200" would be pronounced as "two hundred dollars" in English.
• Text normalization requires being aware of what type of text is to be
normalized and how it is to be processed afterwards; there is no all-purpose
normalization procedure.
Text Normalization
• Text normalization is frequently used when converting text to speech.
• Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to
be pronounced differently depending on context.
• M, me, mein (nonstandard) - mein (Hindi, standard); challenging to normalize. Examples:
• M school ja rahi h
• Me schl jaaaa ri hu
• OMG - (rule-based normalization)
• Gr8 - great
• $200 - two hundred dollars
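The rule-based cases on this slide (OMG, Gr8, $200) can be sketched as a lookup table plus one currency rule. This is a minimal illustration, not a general normalizer: the table ABBREVIATIONS, the helpers number_to_words and normalize, and the restriction to whole amounts from 0 to 999 are all assumptions made for this example.

```python
import re

# Hypothetical lookup table built from the slide's examples.
ABBREVIATIONS = {"omg": "oh my god", "gr8": "great", "u": "you", "r": "are"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 (assumed range for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def normalize(text):
    """Expand dollar amounts, then replace known abbreviations token by token."""
    def expand_currency(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Matches "$200" and "$ 200"; amounts with more than three digits are left alone.
    text = re.sub(r"\$\s?(\d{1,3})\b", expand_currency, text)
    # Whitespace tokenization: a token with attached punctuation ("gr8!") is not matched.
    return " ".join(ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split())


print(normalize("OMG that is gr8"))
print(normalize("It costs $200"))
```

A table like this only covers forms that have a single expansion. Inputs such as "M" or "me", whose reading depends on the sentence, need the context-sensitive treatment raised on the next slide.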
Text Normalization
• Given a string of characters in a text, what is the (reasonable) set of possible
actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
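The two questions above map onto two steps: generate candidate words, then pick one using context. The sketch below is one possible way to do this, under stated assumptions: candidates are found by comparing vowel-free "skeletons" (which suits vowel-dropped text such as "btr"), and the choice is made with bigram counts. VOCABULARY and BIGRAM_COUNTS are made-up toy values, and skeleton, candidates, and choose are hypothetical helper names.

```python
from collections import Counter

# Toy vocabulary and bigram counts, invented for illustration only.
VOCABULARY = ["better", "butter", "bitter", "than", "thin", "then", "doing"]
BIGRAM_COUNTS = Counter({
    ("doing", "better"): 5,
    ("peanut", "butter"): 7,
    ("better", "than"): 6,
    ("better", "then"): 1,
})


def skeleton(word):
    """Drop vowels, then collapse adjacent repeats: 'better' -> 'btr'."""
    out = []
    for ch in word.lower():
        if ch in "aeiou":
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def candidates(token):
    """Step 1: every vocabulary word whose skeleton matches the token's."""
    target = skeleton(token)
    return [word for word in VOCABULARY if skeleton(word) == target]


def choose(previous_word, token):
    """Step 2: pick the candidate seen most often after previous_word."""
    options = candidates(token)
    if not options:
        return token  # nothing to offer; leave the token unchanged
    # Ties (including all-zero counts) resolve to the earliest vocabulary entry.
    return max(options, key=lambda word: BIGRAM_COUNTS[(previous_word, word)])


print(candidates("btr"))
print(choose("doing", "btr"))
print(choose("peanut", "btr"))
```

The same token "btr" resolves to different words depending on the preceding word, which is the point of the second question on the slide.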
What is Corpus
• Corpus is a large collection of texts. It is a body of written or spoken material
upon which a linguistic analysis is based.
• The plural form of corpus is corpora.
• Some popular corpora are the British National Corpus (BNC), the
COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.
• The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words
in Turkish, Japanese, Russian, Chinese, and other languages.
• A corpus may be composed of written language, spoken language, or both.
A spoken corpus is usually in the form of audio recordings.
Types of Corpus
• A corpus may be open or closed. An open corpus is one which does not
claim to contain all data from a specific area while a closed corpus does
claim to contain all or nearly all data from a particular field. Medical
corpora, for example, are closed as there can be no further input to an area.
• Monolingual corpora represent only one language while bilingual corpora
represent two languages.
• Parallel corpus
• Balanced Corpus
Balanced Corpus
What should be covered in a balanced corpus?
Balanced: covers a range of text categories
• Definition depends upon the intended uses
• No true objective measure of balance
• Usually based on proportional sampling
• Balance can be based on a text typology, a classification of text types
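Proportional sampling over a text typology can be sketched as follows. This is an illustration only: the category names, the target shares, and the function proportional_sample are assumptions, and choosing the shares is exactly the judgment call the slide says has no objective answer.

```python
import random


def proportional_sample(texts_by_category, target_shares, total, seed=0):
    """Draw about `total` texts so each category gets its target share.

    texts_by_category: dict mapping a category name to a list of texts.
    target_shares: dict mapping a category name to a fraction (summing to 1).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = {}
    for category, share in target_shares.items():
        quota = round(total * share)
        pool = texts_by_category[category]
        # Never ask for more texts than the category actually holds.
        sample[category] = rng.sample(pool, min(quota, len(pool)))
    return sample


# Hypothetical text pools; real pools would hold documents, not labels.
pools = {
    "news": [f"news_{i}" for i in range(50)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(20)],
}
shares = {"news": 0.5, "fiction": 0.3, "academic": 0.2}

balanced = proportional_sample(pools, shares, total=10)
for category, texts in balanced.items():
    print(category, len(texts))
```

Because each quota is rounded separately, the sampled total can differ slightly from the requested total for some share combinations.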
Uses of Corpus
• A corpus provides grammarians, lexicographers, and other interested parties
with better descriptions of a language.
• Computer-processable corpora allow linguists to adopt the principle of total
accountability, retrieving either all occurrences of a particular word or
structure for inspection, or a randomly selected sample of them.
• Corpus analysis provides lexical information, morphosyntactic information,
semantic information, and pragmatic information.
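Retrieving every occurrence of a word for inspection is what a concordance (keyword-in-context listing) does. The sketch below assumes a corpus held as a plain list of strings and a simple regular-expression tokenizer; the function name concordance and its width parameter are hypothetical.

```python
import re


def concordance(texts, word, width=3):
    """Return every occurrence of `word` with up to `width` tokens of context.

    Each hit is (text index, left context, matched token, right context).
    """
    target = word.lower()
    hits = []
    for text_id, text in enumerate(texts):
        tokens = re.findall(r"\w+", text.lower())  # lowercase word tokens
        for i, token in enumerate(tokens):
            if token == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((text_id, left, token, right))
    return hits


texts = [
    "The corpus is large.",
    "A balanced corpus covers many text types.",
    "No match here.",
]
for hit in concordance(texts, "corpus", width=2):
    print(hit)
```

Drawing a random subset of the returned hits (for example with random.sample) gives the "randomly selected sample" alternative.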
Applications of Corpus
• Corpora are used in the development of NLP tools.
• Applications include spell-checking, grammar-checking, speech recognition,
text-to-speech and speech-to-text synthesis, automatic abstraction and
indexing, information retrieval and machine translation.
• Corpora are also used to create new dictionaries and grammars for
learners.
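As one illustration of how a corpus feeds an NLP tool, here is a toy spell-checker whose vocabulary and word frequencies come from a corpus. It is a sketch under assumptions: CORPUS is a two-sentence stand-in for real data, suggest is a hypothetical helper, and string similarity from Python's difflib is used in place of whatever matching method a production tool would use.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Stand-in corpus; a real tool would count words over millions of tokens.
CORPUS = (
    "The corpus is a large collection of texts. "
    "The corpus may be composed of written language, spoken language or both."
)
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))


def suggest(word, max_suggestions=3):
    """Return the word itself if the corpus contains it, else close matches."""
    word = word.lower()
    if word in WORD_COUNTS:
        return [word]
    close = get_close_matches(word, list(WORD_COUNTS), n=max_suggestions, cutoff=0.7)
    # Prefer words the corpus uses more often; sorted() is stable, so
    # equally frequent words keep their similarity order.
    return sorted(close, key=lambda w: -WORD_COUNTS[w])


print(suggest("corpsu"))
print(suggest("Texts"))
```

A word that is correct but missing from the corpus would be flagged here, which is one reason coverage and balance of the underlying corpus matter for such tools.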
