  • 1. Statistical Methods in Computational Linguistics (Statistische Methoden in der Computerlinguistik). 1. Course Overview. Jonas Kuhn, Universität Potsdam, 2007
  • 2. Outline
    • Course Overview & Introduction
    • Some Python Programming
  • 3. Course Overview
    • Simple Python Programming
    • Basic Probability Theory
    • N-Gram Language Modeling
      • Basic Information Theory: Entropy
      • Data Sparseness & Smoothing Techniques
    • Machine Learning Paradigms
    • Part-of-Speech-Tagging with Statistical and ML Techniques
    • Probabilistic Grammars & Parsing
    • Statistical Machine Translation
  • 4. The Status of Statistical Methods
    • Eric Brill and Raymond J. Mooney (1997):
    • An Overview of Empirical Natural Language Processing
    • In: AI Magazine, 18(4): Winter 1997, 13-24.
    • The linguistic knowledge-acquisition problem
      • Rationalist methods
      • Empirical or corpus-based methods
  • 5. Rationalist methods
  • 6. Empirical or corpus-based methods
  • 7. History of NLP
    • 1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner)
    • Mid-1950s:
      • Chomsky’s program
        • Observational and explanatory adequacy
        • Arguments against learnability of language from data; Innateness hypothesis
      • Rationalist methods in AI research in NLP
        • Hand coding of rules
    • Starting in early 1980s
      • Some work on induction of lexical and syntactic information from text
      • Empirical methods in speech recognition (hidden Markov models; HMMs)
  • 8. History of NLP
    • Late 1980s/1990s: Statistical techniques in various areas of NLP
      • POS tagging
      • Machine translation
      • Probabilistic context-free grammars
      • Word sense disambiguation
      • Anaphora resolution
  • 9. Reasons for the Resurgence of Empiricism
    • Empirical methods offer potential solutions to several related, long-standing problems in NLP:
    • (1) Acquisition, automatically identifying and coding all the necessary knowledge
    • (2) Coverage, accounting for all the phenomena in a given domain or application
    • (3) Robustness, accommodating real data that contain noise and aspects not accounted for by the underlying model
    • (4) Extensibility, easily extending or porting a system to a new set of data or a new task or domain
  • 10. Reasons for the Resurgence of Empiricism
    • Additional factors:
    • (1) computing resources, the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
    • (2) data resources, the development and availability of large corpora of linguistic and lexical data for training and testing systems
    • (3) emphasis on applications and evaluation, industrial and government focus on the development of practical systems that are experimentally evaluated on real data
  • 11. Categories of Empirical Methods (1)
    • Probabilistic methods
    • Symbolic learning methods
    • Neural network/connectionist methods
  • 12. Categories of Empirical Methods (2)
    • Different dimension: type of training data
      • Supervised learning
        • Annotated text
      • Unsupervised learning
        • Indirect feedback
    • Important: combination of rationalist and empirical methods
  • 13. An Interdisciplinary Field
    • [Diagram] Labels: Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Computational Linguistics, Philosophy, Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics, Empirical Sciences, Statistics, Psycholinguistics, Statistical NLP
  • 14. Practical Aspects
    • We will use
      • Python for small programming exercises
        • http://www.python.org/
      • NLTK library (in Python) – Natural Language Toolkit
        • http://nltk.sourceforge.net/
      • (probably) WEKA for small Machine Learning experiments
        • http://www.cs.waikato.ac.nz/ml/weka/
  • 15. Python
    • Tutorial introduction in an NLP context:
      • http://nltk.sourceforge.net/docs.html
      • Chapter 2: Programming
  • 16. Python: Key Features
    • Simple yet powerful, shallow learning curve
    • Object-oriented: encapsulation, re-use
    • Scripting language, facilitates interactive exploration
    • Excellent functionality for processing linguistic data
    • Extensive standard library, incl. graphics, web, numerical processing
    • Can be downloaded for free from http://www.python.org/
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 17. Python example
    • import sys
    • for line in sys.stdin.readlines():
    •     for word in line.split():
    •         if word.endswith('ing'):
    •             print(word)
    • whitespace: nesting lines of code; scope
    • object-oriented: objects have attributes and methods (e.g. line.split())
    • readable
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
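The loop on slide 17 can also be packaged as a small function, which makes it easier to try out at the interactive prompt. This is a minimal sketch assuming Python 3 syntax; the function name `words_ending_in` and the sample sentences are illustrative and do not appear on the original slide.

```python
def words_ending_in(lines, suffix="ing"):
    """Return all whitespace-separated words in `lines` that end with `suffix`."""
    matches = []
    for line in lines:                 # outer block: one iteration per input line
        for word in line.split():      # split() is a method of the string object
            if word.endswith(suffix):  # endswith() is another string method
                matches.append(word)
    return matches


# Illustrative input; the slide reads from sys.stdin instead.
sample = ["I was walking and singing", "the king went home"]
print(words_ending_in(sample))
```

Note that the test is purely string-based, so a noun such as "king" is matched as well as the verb forms.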
  • 18. Comparison with Perl
    • while (<>) {
    •     foreach my $word (split) {
    •         if ($word =~ /ing$/) {
    •             print "$word ";
    •         }
    •     }
    • }
    • syntax is obscure: what are: <> $ my split ?
    • “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003: 47)
    • large programs difficult to maintain, reuse
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 19. What NLTK adds to Python
    • NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
      • Basic classes for representing data relevant to natural language processing
      • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
      • Standard implementations for each task, which can be combined to solve complex problems
      • Extensive documentation, including tutorials and reference documentation
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
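To illustrate what a "standard interface" with interchangeable implementations might look like, here is a minimal sketch in plain Python. It does not use NLTK's actual classes; `SpaceTokenizer`, `PatternTokenizer` and `count_tokens` are hypothetical names chosen for this illustration only.

```python
import re


class SpaceTokenizer:
    """Hypothetical tokenizer: splits on whitespace only."""

    def tokenize(self, text):
        return text.split()


class PatternTokenizer:
    """Hypothetical tokenizer: returns every match of a regular expression."""

    def __init__(self, pattern=r"\w+"):
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        return self._regex.findall(text)


def count_tokens(tokenizer, text):
    """Works with any object that offers a tokenize(text) method."""
    return len(tokenizer.tokenize(text))


text = "Colorless green ideas sleep furiously."
print(SpaceTokenizer().tokenize(text))    # punctuation stays attached to the last word
print(PatternTokenizer().tokenize(text))  # punctuation is dropped
```

Because both classes expose the same `tokenize` method, code such as `count_tokens` can be written once and combined with either implementation.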
  • 20. Installing Python and NLTK
    • Install Python, Numeric
    • Install NLTK-Lite, NLTK-Lite-Corpora
    • Set environment variable NLTK_LITE_CORPORA
    • For detailed instructions, see:
      • http://nltk.sourceforge.net/install.html
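After setting the environment variable, a quick check from Python can confirm that it is visible and points to a real directory. This is a sketch using only the standard library; the helper name `corpora_dir` is hypothetical, and how NLTK-Lite itself reads the variable is not shown here.

```python
import os


def corpora_dir(environ=os.environ):
    """Return the directory named by NLTK_LITE_CORPORA, or None if the
    variable is unset or does not point to an existing directory."""
    path = environ.get("NLTK_LITE_CORPORA")
    if path and os.path.isdir(path):
        return path
    return None


print(corpora_dir() or "NLTK_LITE_CORPORA is not set to an existing directory")
```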
  • 21. Running Project Idea
    • Language Identification
      • In what language is a given text document?
      • First ideas?
      • (Using simple text processing techniques)
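One possible first idea, offered only as a starting point and not as the course's solution, is to count how many tokens of a text appear in a short list of very frequent words for each candidate language. The word lists below are tiny, hand-picked and purely illustrative; `COMMON_WORDS` and `guess_language` are hypothetical names.

```python
# Tiny hand-picked lists of frequent function words (illustrative, not exhaustive).
COMMON_WORDS = {
    "English": {"the", "and", "of", "is", "in", "to", "a"},
    "German": {"der", "die", "das", "und", "ist", "in", "ein"},
    "French": {"le", "la", "les", "et", "est", "de", "un"},
}


def guess_language(text):
    """Return the language whose common-word list matches the most tokens."""
    tokens = text.lower().split()
    scores = {}
    for language, words in COMMON_WORDS.items():
        scores[language] = sum(1 for token in tokens if token in words)
    # On a tie (including empty input) the first language listed wins.
    return max(scores, key=scores.get)


print(guess_language("Der Hund ist in dem Garten und die Katze ist im Haus"))
print(guess_language("The cat is in the garden and the dog is in the house"))
```

This approach ignores punctuation and breaks down on very short texts, which is one motivation for the statistical techniques listed in the course overview, such as n-gram models.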