  1. 1. Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics): 1. Course Overview. Jonas Kuhn, Universität Potsdam, 2007
  2. 2. Outline <ul><li>Course Overview & Introduction </li></ul><ul><li>Some Python Programming </li></ul>
  3. 3. Course Overview <ul><li>Simple Python Programming </li></ul><ul><li>Basic Probability Theory </li></ul><ul><li>N-Gram Language Modeling </li></ul><ul><ul><li>Basic Information Theory: Entropy </li></ul></ul><ul><ul><li>Data Sparseness & Smoothing Techniques </li></ul></ul><ul><li>Machine Learning Paradigms </li></ul><ul><li>Part-of-Speech-Tagging with Statistical and ML Techniques </li></ul><ul><li>Probabilistic Grammars & Parsing </li></ul><ul><li>Statistical Machine Translation </li></ul>
  4. 4. The Status of Statistical Methods <ul><li>Eric Brill and Raymond J. Mooney (1997): </li></ul><ul><li>An Overview of Empirical Natural Language Processing </li></ul><ul><li>In: AI Magazine, 18(4): Winter 1997, 13-24. </li></ul><ul><li>The linguistic knowledge-acquisition problem </li></ul><ul><ul><li>Rationalist methods </li></ul></ul><ul><ul><li>Empirical or corpus-based methods </li></ul></ul>
  5. 5. Rationalist methods
  6. 6. Empirical or corpus-based methods
  7. 7. History of NLP <ul><li>1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner) </li></ul><ul><li>Mid-1950s: </li></ul><ul><ul><li>Chomsky’s program </li></ul></ul><ul><ul><ul><li>Observational and explanatory adequacy </li></ul></ul></ul><ul><ul><ul><li>Arguments against learnability of language from data; Innateness hypothesis </li></ul></ul></ul><ul><ul><li>Rationalist methods in AI research in NLP </li></ul></ul><ul><ul><ul><li>Hand coding of rules </li></ul></ul></ul><ul><li>Starting in early 1980s </li></ul><ul><ul><li>Some work on induction of lexical and syntactic information from text </li></ul></ul><ul><ul><li>Empirical methods in speech recognition (hidden Markov models; HMMs) </li></ul></ul>
  8. 8. History of NLP <ul><li>Late 1980s/1990s: Statistical techniques in various areas of NLP </li></ul><ul><ul><li>POS tagging </li></ul></ul><ul><ul><li>Machine translation </li></ul></ul><ul><ul><li>Probabilistic context-free grammars </li></ul></ul><ul><ul><li>Word sense disambiguation </li></ul></ul><ul><ul><li>Anaphora resolution </li></ul></ul>
  9. 9. Reasons for the Resurgence of Empiricism <ul><li>Empirical methods offer potential solutions to several related, long-standing problems in NLP: </li></ul><ul><li>(1) Acquisition, automatically identifying and coding all the necessary knowledge </li></ul><ul><li>(2) Coverage, accounting for all the phenomena in a given domain or application </li></ul><ul><li>(3) Robustness, accommodating real data that contain noise and aspects not accounted for by the underlying model </li></ul><ul><li>(4) Extensibility, easily extending or porting a system to a new set of data or a new task or domain </li></ul>
  10. 10. Reasons for the Resurgence of Empiricism <ul><li>Additional factors: </li></ul><ul><li>(1) computing resources, the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data </li></ul><ul><li>(2) data resources, the development and availability of large corpora of linguistic and lexical data for training and testing systems </li></ul><ul><li>(3) emphasis on applications and evaluation, industrial and government focus on the development of practical systems that are experimentally evaluated on real data </li></ul>
  11. 11. Categories of Empirical Methods (1) <ul><li>Probabilistic methods </li></ul><ul><li>Symbolic learning methods </li></ul><ul><li>Neural network/connectionist methods </li></ul>
  12. 12. Categories of Empirical Methods (2) <ul><li>Different dimension: type of training data </li></ul><ul><ul><li>Supervised learning </li></ul></ul><ul><ul><ul><li>Annotated text </li></ul></ul></ul><ul><ul><li>Unsupervised learning </li></ul></ul><ul><ul><ul><li>Indirect feedback </li></ul></ul></ul><ul><li>Important: combination of rationalist and empirical methods </li></ul>
  13. 13. An Interdisciplinary Field <ul><li>Diagram labels: Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Computational Linguistics, Philosophy, Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics, Empirical Sciences, Statistics, Psycholinguistics, Statistical NLP </li></ul>
  14. 14. Practical Aspects <ul><li>We will use </li></ul><ul><ul><li>Python for small programming exercises </li></ul></ul><ul><ul><ul><li>http://www.python.org/ </li></ul></ul></ul><ul><ul><li>NLTK library (in Python) – Natural Language Toolkit </li></ul></ul><ul><ul><ul><li>http://nltk.sourceforge.net/ </li></ul></ul></ul><ul><ul><li>(probably) WEKA for small Machine Learning experiments </li></ul></ul><ul><ul><ul><li>http://www.cs.waikato.ac.nz/ml/weka/ </li></ul></ul></ul>
  15. 15. Python <ul><li>Tutorial introduction in an NLP context: </li></ul><ul><ul><li>http://nltk.sourceforge.net/docs.html </li></ul></ul><ul><ul><li>Chapter 2: Programming </li></ul></ul>
  16. 16. Python: Key Features <ul><li>Simple yet powerful, shallow learning curve </li></ul><ul><li>Object-oriented: encapsulation, re-use </li></ul><ul><li>Scripting language, facilitates interactive exploration </li></ul><ul><li>Excellent functionality for processing linguistic data </li></ul><ul><li>Extensive standard library, including graphics, web, and numerical processing </li></ul><ul><li>Free to download from http://www.python.org/ </li></ul>Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  17. 17. Python example <ul><li>import sys </li></ul><ul><li>for line in sys.stdin.readlines(): </li></ul><ul><ul><li>for word in line.split(): </li></ul></ul><ul><ul><ul><li>if word.endswith('ing'): </li></ul></ul></ul><ul><ul><ul><ul><li>print(word) </li></ul></ul></ul></ul><ul><li>whitespace: nesting lines of code; scope </li></ul><ul><li>object-oriented: attributes, methods (e.g. line) </li></ul><ul><li>readable </li></ul>Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
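The loop on slide 17 depends on indentation that this transcript cannot show. Below is a minimal sketch with the nesting restored, written for Python 3. The helper name `words_ending_with` and the sample sentences are assumptions added for illustration; they are not part of the original slide.

```python
def words_ending_with(lines, suffix="ing"):
    """Yield every whitespace-separated word in `lines` that ends with `suffix`."""
    for line in lines:                    # outer loop: one line of text at a time
        for word in line.split():         # inner loop: split the line on whitespace
            if word.endswith(suffix):     # string method, as on the slide
                yield word


# Sample input invented for illustration.
sample = ["She was singing while walking home.", "Nothing else happened."]
for word in words_ending_with(sample):
    print(word)
```

Passing `sys.stdin` (after `import sys`) instead of `sample` should give the behavior described on the slide: every word from standard input that ends in "ing" is printed on its own line.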
  18. 18. Comparison with Perl <ul><li>while (<>) { </li></ul><ul><ul><li>foreach my $word (split) { </li></ul></ul><ul><ul><ul><li>if ($word =~ /ing$/) { </li></ul></ul></ul><ul><ul><ul><ul><li>print "$word\n"; </li></ul></ul></ul></ul><ul><ul><ul><li>} </li></ul></ul></ul><ul><ul><li>} </li></ul></ul><ul><li>} </li></ul><ul><li>syntax is obscure: what are <>, $, my, split? </li></ul><ul><li>“it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond Perl Programming for Linguists 2003:47) </li></ul><ul><li>large programs difficult to maintain, reuse </li></ul>Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  19. 19. What NLTK adds to Python <ul><li>NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides: </li></ul><ul><ul><li>Basic classes for representing data relevant to natural language processing </li></ul></ul><ul><ul><li>Standard interfaces for performing tasks, such as tokenization, tagging, and parsing </li></ul></ul><ul><ul><li>Standard implementations for each task, which can be combined to solve complex problems </li></ul></ul><ul><ul><li>Extensive documentation, including tutorials and reference documentation </li></ul></ul>Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  20. 20. Installing Python and NLTK <ul><li>Install Python, Numeric </li></ul><ul><li>Install NLTK-Lite, NLTK-Lite-Corpora </li></ul><ul><li>Set environment variable NLTK_LITE_CORPORA </li></ul><ul><li>For detailed instructions, see: </li></ul><ul><ul><li>http://nltk.sourceforge.net/install.html </li></ul></ul>
  21. 21. Running Project Idea <ul><li>Language Identification </li></ul><ul><ul><li>In what language is a given text document? </li></ul></ul><ul><ul><li>First ideas? </li></ul></ul><ul><ul><li>(Using simple text processing techniques) </li></ul></ul>
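One possible first idea for the language identification question, offered as a sketch and not as the approach the course goes on to develop: compare character trigram counts of the unknown text against counts from a sample of each candidate language, and pick the closest match. The function names (`trigram_profile`, `similarity`, `identify_language`) and the tiny training sentences are invented for illustration; a usable system would need much more training text per language.

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram count profiles (0.0 if either is empty)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[g] * profile_b[g] for g in shared)
    norm_a = sum(c * c for c in profile_a.values()) ** 0.5
    norm_b = sum(c * c for c in profile_b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_language(text, training_texts):
    """Return the language whose training profile is most similar to `text`."""
    target = trigram_profile(text)
    profiles = {lang: trigram_profile(sample)
                for lang, sample in training_texts.items()}
    return max(profiles, key=lambda lang: similarity(target, profiles[lang]))


# Toy training data, invented for illustration only.
training_texts = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then runs through the forest",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und läuft dann durch den wald",
}

print(identify_language("the dog runs over the hill", training_texts))
print(identify_language("der hund läuft durch den garten", training_texts))
```

Character trigrams are used here because they need no tokenizer or dictionary, which fits the "simple text processing techniques" constraint; word frequency lists would be another reasonable starting point.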