Transcript

  • 1. Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
      1. Course Overview
      Jonas Kuhn, Universität Potsdam, 2007
  • 2. Outline
      • Course Overview & Introduction
      • Some Python Programming
  • 3. Course Overview
      • Simple Python Programming
      • Basic Probability Theory
      • N-Gram Language Modeling
          • Basic Information Theory: Entropy
          • Data Sparseness & Smoothing Techniques
      • Machine Learning Paradigms
      • Part-of-Speech Tagging with Statistical and ML Techniques
      • Probabilistic Grammars & Parsing
      • Statistical Machine Translation
  • 4. The Status of Statistical Methods
      • Eric Brill and Raymond J. Mooney (1997): An Overview of Empirical Natural Language Processing. In: AI Magazine, 18(4): Winter 1997, 13-24.
      • The linguistic knowledge-acquisition problem
          • Rationalist methods
          • Empirical or corpus-based methods
  • 5. Rationalist methods
  • 6. Empirical or corpus-based methods
  • 7. History of NLP
      • 1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner)
      • Mid-1950s:
          • Chomsky’s program
              • Observational and explanatory adequacy
              • Arguments against learnability of language from data; innateness hypothesis
          • Rationalist methods in AI research in NLP
              • Hand coding of rules
      • Starting in the early 1980s:
          • Some work on induction of lexical and syntactic information from text
          • Empirical methods in speech recognition (hidden Markov models; HMMs)
  • 8. History of NLP
      • Late 1980s/1990s: Statistical techniques in various areas of NLP
          • POS tagging
          • Machine translation
          • Probabilistic context-free grammars
          • Word sense disambiguation
          • Anaphora resolution
  • 9. Reasons for the Resurgence of Empiricism
      • Empirical methods offer potential solutions to several related, long-standing problems in NLP:
      • (1) Acquisition: automatically identifying and coding all the necessary knowledge
      • (2) Coverage: accounting for all the phenomena in a given domain or application
      • (3) Robustness: accommodating real data that contain noise and aspects not accounted for by the underlying model
      • (4) Extensibility: easily extending or porting a system to a new set of data or a new task or domain
  • 10. Reasons for the Resurgence of Empiricism
      • Additional factors:
      • (1) Computing resources: the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
      • (2) Data resources: the development and availability of large corpora of linguistic and lexical data for training and testing systems
      • (3) Emphasis on applications and evaluation: industrial and government focus on the development of practical systems that are experimentally evaluated on real data
  • 11. Categories of Empirical Methods (1)
      • Probabilistic methods
      • Symbolic learning methods
      • Neural network/connectionist methods
  • 12. Categories of Empirical Methods (2)
      • Different dimension: type of training data
          • Supervised learning
              • Annotated text
          • Unsupervised learning
              • Indirect feedback
      • Important: combination of rationalist and empirical methods
  • 13. An Interdisciplinary Field
      [Diagram: Computational Linguistics and Statistical NLP shown among related fields and topics. Labels: Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Philosophy, Empirical Sciences, Statistics, Psycholinguistics, Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics]
  • 14. Practical Aspects
      • We will use
          • Python for small programming exercises
              • http://www.python.org/
          • NLTK library (in Python) – Natural Language Toolkit
              • http://nltk.sourceforge.net/
          • (probably) WEKA for small Machine Learning experiments
              • http://www.cs.waikato.ac.nz/ml/weka/
  • 15. Python
      • Tutorial introduction in an NLP context:
          • http://nltk.sourceforge.net/docs.html
          • Chapter 2: Programming
  • 16. Python: Key Features
      • Simple yet powerful, shallow learning curve
      • Object-oriented: encapsulation, re-use
      • Scripting language, facilitates interactive exploration
      • Excellent functionality for processing linguistic data
      • Extensive standard library, incl. graphics, web, numerical processing
      • Can be downloaded for free from http://www.python.org/
      (Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP)
  • 17. Python example
        import sys

        # Print every whitespace-separated word on standard input that ends in "ing".
        for line in sys.stdin.readlines():
            for word in line.split():
                if word.endswith('ing'):
                    print(word)

      • whitespace: nesting lines of code; scope
      • object-oriented: attributes, methods (e.g. line)
      • readable
      (Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP)
  • 18. Comparison with Perl
        while (<>) {
            foreach my $word (split) {
                if ($word =~ /ing$/) {
                    print "$word\n";
                }
            }
        }

      • syntax is obscure: what are: <> $ my split ?
      • “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)
      • large programs difficult to maintain, reuse
      (Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP)
  • 19. What NLTK adds to Python
      • NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
          • Basic classes for representing data relevant to natural language processing
          • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
          • Standard implementations for each task, which can be combined to solve complex problems
          • Extensive documentation, including tutorials and reference documentation
      (Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP)
  • 20. Installing Python and NLTK
      • Install Python, Numeric
      • Install NLTK-Lite, NLTK-Lite-Corpora
      • Set environment variable NLTK_LITE_CORPORA
      • For detailed instructions, see:
          • http://nltk.sourceforge.net/install.html
  • 21. Running Project Idea
      • Language Identification
          • In what language is a given text document?
          • First ideas?
          • (Using simple text processing techniques)
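
One possible first idea for the language identification question, sketched in Python 3 with only simple text processing: count how many tokens of a document appear in a short list of very common words for each candidate language and pick the language with the most hits. The word lists, the `COMMON_WORDS` name, and the `guess_language` function are illustrative assumptions made for this sketch; they are not taken from the slides, and a realistic system would need much larger lists or a different technique.

```python
import re

# Tiny hand-picked lists of frequent function words.
# Assumption for illustration only; not taken from the course material.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def guess_language(text):
    """Return the language whose common words occur most often in text.

    Returns None if no token matches any of the word lists.
    """
    # Lowercase the text and extract runs of letters as tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Score each language by the number of tokens found in its word list.
    scores = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Der Hund ist nicht in dem Haus und die Katze ist zu Hause"))
```

Short texts and languages that share function words (for example "in" in English and German) can easily confuse this approach, which is one reason to look at statistical models later on.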
