  • 1. Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics). 1. Course Overview. Jonas Kuhn, Universität Potsdam, 2007
  • 2. Outline
    • Course Overview & Introduction
    • Some Python Programming
  • 3. Course Overview
    • Simple Python Programming
    • Basic Probability Theory
    • N-Gram Language Modeling
      • Basic Information Theory: Entropy
      • Data Sparseness & Smoothing Techniques
    • Machine Learning Paradigms
    • Part-of-Speech-Tagging with Statistical and ML Techniques
    • Probabilistic Grammars & Parsing
    • Statistical Machine Translation
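To give a first impression of the topics listed above (n-gram language modeling, entropy, smoothing), here is a minimal Python sketch. It is not taken from the course materials: the function names `bigram_counts`, `entropy`, and `add_one_prob` and the toy sentence are assumptions made for illustration, and add-one smoothing stands in for the smoothing techniques the course covers later.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def entropy(counts):
    """Shannon entropy, in bits, of the relative-frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def add_one_prob(bigram, bigrams, unigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w2 | w1).

    Simplification: the unigram count of w1 is used as the context count.
    """
    w1, _ = bigram
    return (bigrams[bigram] + 1) / (unigrams[w1] + vocab_size)


# Toy data, purely illustrative.
tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = bigram_counts(tokens)

print(entropy(unigrams))
print(add_one_prob(("the", "cat"), bigrams, unigrams, len(unigrams)))
print(add_one_prob(("cat", "mat"), bigrams, unigrams, len(unigrams)))
```

The last call shows the point of smoothing: the bigram "cat mat" never occurs in the toy data, yet it still receives a small non-zero probability.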
  • 4. The Status of Statistical Methods
    • Eric Brill and Raymond J. Mooney (1997):
    • An Overview of Empirical Natural Language Processing
    • In: AI Magazine, 18(4): Winter 1997, 13-24.
    • The linguistic knowledge-acquisition problem
      • Rationalist methods
      • Empirical or corpus-based methods
  • 5. Rationalist methods
  • 6. Empirical or corpus-based methods
  • 7. History of NLP
    • 1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner)
    • Mid-1950s:
      • Chomsky’s program
        • Observational and explanatory adequacy
        • Arguments against learnability of language from data; Innateness hypothesis
      • Rationalist methods in AI research in NLP
        • Hand coding of rules
    • Starting in the early 1980s:
      • Some work on induction of lexical and syntactic information from text
      • Empirical methods in speech recognition (hidden Markov models; HMMs)
  • 8. History of NLP
    • Late 1980s/1990s: Statistical techniques in various areas of NLP
      • POS tagging
      • Machine translation
      • Probabilistic context-free grammars
      • Word sense disambiguation
      • Anaphora resolution
  • 9. Reasons for the Resurgence of Empiricism
    • Empirical methods offer potential solutions to several related, long-standing problems in NLP:
    • (1) Acquisition: automatically identifying and coding all the necessary knowledge
    • (2) Coverage: accounting for all the phenomena in a given domain or application
    • (3) Robustness: accommodating real data that contain noise and aspects not accounted for by the underlying model
    • (4) Extensibility: easily extending or porting a system to a new set of data or a new task or domain
  • 10. Reasons for the Resurgence of Empiricism
    • Additional factors:
    • (1) Computing resources: the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
    • (2) Data resources: the development and availability of large corpora of linguistic and lexical data for training and testing systems
    • (3) Emphasis on applications and evaluation: industrial and government focus on the development of practical systems that are experimentally evaluated on real data
  • 11. Categories of Empirical Methods (1)
    • Probabilistic methods
    • Symbolic learning methods
    • Neural network/connectionist methods
  • 12. Categories of Empirical Methods (2)
    • Different dimension: type of training data
      • Supervised learning
        • Annotated text
      • Unsupervised learning
        • Indirect feedback
    • Important: combination of rationalist and empirical methods
  • 13. An Interdisciplinary Field
    • [Diagram; only its labels survive:] Statistical NLP, Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Computational Linguistics, Philosophy, Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics, Empirical Sciences, Statistics, Psycholinguistics
  • 14. Practical Aspects
    • We will use
      • Python for small programming exercises
      • NLTK library (in Python) – Natural Language Toolkit
        • http:// /
      • (probably) WEKA for small Machine Learning experiments
        • http:// /
  • 15. Python
    • Tutorial introduction in an NLP context:
      • Chapter 2: Programming
  • 16. Python: Key Features
    • Simple yet powerful, shallow learning curve
    • Object-oriented: encapsulation, re-use
    • Scripting language, facilitates interactive exploration
    • Excellent functionality for processing linguistic data
    • Extensive standard library, incl. graphics, web, numerical processing
    • Downloaded for free from
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 17. Python example
    • import sys
    • for line in sys.stdin.readlines():
    •     for word in line.split():
    •         if word.endswith('ing'):
    •             print(word)
    • whitespace: nesting lines of code; scope
    • object-oriented: attributes, methods (e.g. line.split())
    • readable
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 18. Comparison with Perl
    • while (<>) {
    •     foreach my $word (split) {
    •         if ($word =~ /ing$/) {
    •             print "$word\n";
    •         }
    •     }
    • }
    • syntax is obscure: what are: <> $ my split ?
    • “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)
    • large programs difficult to maintain, reuse
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 19. What NLTK adds to Python
    • NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
      • Basic classes for representing data relevant to natural language processing
      • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
      • Standard implementations for each task, which can be combined to solve complex problems
      • Extensive documentation, including tutorials and reference documentation
    Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
  • 20. Installing Python and NLTK
    • Install Python, Numeric
    • Install NLTK-Lite, NLTK-Lite-Corpora
    • Set environment variable NLTK_LITE_CORPORA
    • For detailed instructions, see:
  • 21. Running Project Idea
    • Language Identification
      • In what language is a given text document?
      • First ideas?
      • (Using simple text processing techniques)
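One possible first idea, sketched here under stated assumptions rather than as the course's intended solution: compare the character trigram profile of an unknown text with profiles built from sample texts in each candidate language, and pick the closest one by cosine similarity. The names `char_ngrams`, `cosine`, `SAMPLES`, `PROFILES`, and `identify` are hypothetical, and the one-sentence samples are toy data; a usable system would need far more text per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of character n-grams after lowercasing and
    collapsing runs of whitespace to single spaces."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy training samples (assumption: real profiles need much more data).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat is on the mat",
    "german": "der schnelle braune fuchs springt und die katze "
              "ist auf der matte neben dem hund",
}

PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify("the dog is on the mat"))
print(identify("der hund ist auf der matte"))
```

Character n-grams are used here because they need no tokenizer or dictionary, which keeps the sketch within simple text processing; counting frequent function words per language would be an equally simple alternative.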