NLP?

Natural language processing (NLP) is the process of building computational models for understanding natural language. It studies the problems of automated generation and understanding of natural human languages. NLP includes natural-language-generation systems, which convert information from computer databases into normal human language, and natural-language-understanding systems, which convert samples of human language into more formal representations that are easier for computer programs to manipulate.

NLP involves multiple disciplines, including artificial intelligence techniques, multivariate analysis, logical inference, statistics, linguistics and any other technique that can be used to process, generate or interpret language with computers.

To understand this field, it is fundamental to know the meaning of the terms it uses. These terms can refer either to the processes used in the field or to definitions of different kinds of information. These information definitions are:

1- Repositories of knowledge containing linguistic information, real facts and the different kinds of relations that can be found in language.
2- Specifications describing kinds of content and how to obtain them from texts, which provide information about different aspects of texts.

Machine learning and NLP

Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.

We can view NLP as “an extension of machine learning” or “a special kind of machine learning”. Both need to build models from algorithms and datasets so that new data can be processed with the already built models.

Machine learning can provide natural language processing with a range of alternative learning algorithms, as well as additional general approaches and methodologies.

NLP also introduces new learning frameworks and techniques, ranging from information retrieval and extraction through speech recognition to syntax, semantics and language-understanding tasks.
It also presents the theoretical paradigms (learning-theoretic, probabilistic and information-theoretic) and the relations among them, along with the main algorithmic techniques developed within these paradigms and in key natural language applications.

The 2 NLP approaches

1. Statistical NLP: comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
2. Linguistic-oriented NLP: based on large repositories that contain information about texts, for example a list of synonyms, a taxonomy, or a definition of the grammar rules of a language.

Major tasks in NLP

- Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.
- Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to be solved properly.
- Part-of-speech tagging: Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to it. Chinese is also prone to such ambiguity because it is a tonal language when spoken, and tonal inflection is not readily conveyed by its orthography.
- Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar of natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, a typical sentence may have thousands of potential parses (most of which will seem completely nonsensical to a human).
- Sentiment analysis: Extract subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media for marketing purposes.
- Topic segmentation and recognition: Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.

Part of the NLP-specific vocabulary and its meaning

Linguistics is the scientific and philosophical study of language, encompassing a number of sub-fields. At the core of theoretical linguistics is the study of language structure (grammar) and the study of meaning (semantics). The first of these encompasses morphology (the formation and composition of words) and syntax (the rules that determine how words combine into phrases and sentences).

A controlled vocabulary is a list of terms that have been enumerated explicitly. This list is controlled by, and is available from, a controlled vocabulary registration authority. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition.

Named entity recognition is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc.

A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical (tree-shaped) structure. Each term in the taxonomy is in one or more parent-child relationships.
The child kind of thing has, by definition, the same constraints as its parent type plus one or more additional constraints. For example, car is a child of vehicle, so any car is also a vehicle, but not every vehicle is a car. There are also specific kinds of taxonomies, such as an “enterprise taxonomy”, which contains terms related only to that specific field. Taxonomies are seen as narrower than ontologies because ontologies include logical inference and allow a larger variety of relation types.

An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. Ontologies are a form of knowledge representation.

Part-of-speech (POS) tagging is a process whereby tokens are sequentially labeled with syntactic labels, such as "finite verb", "gerund" or "subordinating conjunction".

Morphology is the study of the internal structure of words.

Lexeme: the distinction between two senses of "word" is arguably the most important one in morphology. In the first sense, the one in which dog and dogs are "the same word", a word is called a lexeme. In the second sense, in which dog and dogs are different words, it is called a word-form. We thus say that dog and dogs are different forms of the same lexeme. The form of a word that is chosen conventionally to represent the canonical form of a lexeme is its lemma, so dog and dogs have a common lemma. A stemmer is used to reduce words to a root form that approximates the lemma. A lexicon is the collection of all the lexemes of a language.

Grammar is the field of linguistics that covers the rules governing the use of any given language. It mainly includes morphology and syntax, but it can be complemented with other linguistic fields.

Syntax is the study of the principles and rules for constructing sentences in natural languages; the term syntax is also used to refer directly to the rules and principles that govern sentence structure.
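
As a small illustration of the taxonomy definition above (car is a child of vehicle), the parent-child relation can be sketched in Python. The PARENT table, its terms and the is_a helper are hypothetical names chosen for this example, and each term is given a single parent for simplicity, although a taxonomy term may have more than one.

```python
# Hypothetical toy taxonomy: each term maps to its parent (None for the root).
PARENT = {
    "vehicle": None,
    "car": "vehicle",
    "bicycle": "vehicle",
    "sports car": "car",
}

def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or is one of its descendants."""
    current = term
    while current is not None:
        if current == ancestor:
            return True
        # Walk one level up; unknown terms have no parent and end the loop.
        current = PARENT.get(current)
    return False

print(is_a("car", "vehicle"))   # True: any car is also a vehicle
print(is_a("vehicle", "car"))   # False: not every vehicle is a car
```

Walking up the parent links is enough here because a taxonomy only records the "is a kind of" relation; an ontology would need a richer structure to hold other relation types.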
Semantics is, basically, the study of the meaning of signs. These studies can be performed at the word level, sentence level, paragraph level, and even at the level of larger units of discourse.

A corpus is a large and structured set of texts used for statistical analysis, text mining, validation of linguistic rules, calculation of document similarities, etc.

A slow but well-organized video introduction: http://www.youtube.com/watch?v=bDPULOFFlaI
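
To make one of the corpus uses mentioned above more concrete, here is a minimal sketch of computing document similarity from raw word counts. The three-sentence corpus and the helper names are made up for illustration, and cosine similarity over word counts is only one common choice of measure, not one that the text prescribes.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real corpus would be far larger.
corpus = [
    "the book is on the table",
    "the book is on the shelf",
    "I want to book a flight",
]

def word_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = [word_counts(doc) for doc in corpus]
sim_close = cosine_similarity(vectors[0], vectors[1])  # differ in one word
sim_far = cosine_similarity(vectors[0], vectors[2])    # share only "book"
print(sim_close > sim_far)
```

Note that "book" is counted identically in all three sentences even though it is a noun in the first two and a verb in the third, so plain counts ignore the part-of-speech ambiguity described earlier.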