
MathFoundations.ppt


  1. 1. IGERT External Advisory Board Meeting Wednesday, March 14, 2007 INSTITUTE FOR COGNITIVE SCIENCES University of Pennsylvania
  2. 2. COGS 501 & COGS 502 A two-semester sequence that aims to provide basic mathematical and algorithmic tools for the study of animal, human or machine communication. COGS 501: 1. Mathematical and programming basics: linear algebra and Matlab 2. Probability 3. Model fitting 4. Signal processing COGS 502 topics: 1. Information theory 2. Formal language theory 3. Logic 4. Machine learning
  3. 3. Challenges
     • Diverse student background: from almost nothing to MS-level skills in
       • Mathematics
       • Programming
     • Breadth of topics and applications: normally many courses with many prerequisites
     • Lack of suitable instructional materials
  4. 4. Precedent: LING 525 / CIS 558 Computer Analysis and Modeling of Biological Signals and Systems. A hands-on signal and image processing course for non-EE graduate students needing these skills. We will go through all the fundamentals of signal and image processing using computer exercises developed in MATLAB. Examples will be drawn from speech analysis and synthesis, computer vision, and biological modeling.
  5. 5. History CIS 558 / LING 525: “Digital signal processing for non-EEs” - started in 1996 by Simoncelli and Liberman - similar problems of student diversity, breadth of topics, lack of suitable materials Solutions: - Matlab-based lab course: concepts, digital methods, applications - Several tiers for each topic: basic, intermediate, advanced - Extensive custom-built lecture notes and problem sets - Lots of individual hand-holding Results: - Successful uptake for wide range of student backgrounds (e.g. from “no math since high school” to “MS in math”; from “never programmed” to “five years in industry”) - Successor now a required course in NYU neuroscience program: “Mathematical tools for neural science”
  6. 6. Mathematical Foundations Course goal: IGERT students should understand and be able to apply
     • Models of language and communication
     • Experimental design and analysis
     • Corpus-based methods
     in research areas including
     • Sentence processing
     • Animal communication
     • Language learning
     • Communicative interaction
     • Cognitive neuroscience
  7. 7. COGS 501-2 rev. 0.1
     • Problem is somewhat more difficult:
       • Students are even more diverse
       • Concepts and applications are even broader
     • Advance preparation was inadequate
       • Lack of pre-prepared lecture notes and problems (except for those derived from other courses)
       • Not enough coordination by faculty
       • Not enough explicit connections to research
  8. 8. COGS 501-2: how to do better
     • Will start sequence again in Fall 2007
     • Plans for rev. 0.9:
       • Advance preparation of course-specific lecture notes and problem sets
       • Systematic remediation where needed
         • for mathematical background
         • for entry-level Matlab programming
       • Connection to research themes (e.g. sequence modeling, birdsong analysis, artificial language learning)
         • Historical papers
         • Contemporary research
  9. 9. Research theme: example “Colorless green ideas sleep furiously”
     • Shannon 1948
     • Chomsky 1957
     • Pereira 2000
     • (?)
  10. 10. Word sequences: Shannon C. Shannon, “A mathematical theory of communication”, BSTJ, 1948: …a sufficiently complex stochastic process will give a satisfactory representation of a discrete source. [The entropy] H … can be determined by limiting operations directly from the statistics of the message sequences… [Specifically:] Theorem 6: Let $p(B_i, S_j)$ be the probability of sequence $B_i$ followed by symbol $S_j$ and $p_{B_i}(S_j)$ … be the conditional probability of $S_j$ after $B_i$. Let $F_N = -\sum_{i,j} p(B_i, S_j) \log p_{B_i}(S_j)$, where the sum is over all blocks $B_i$ of $N-1$ symbols and over all symbols $S_j$. Then $F_N$ is a monotonic decreasing function of $N$, … and $\lim_{N \to \infty} F_N = H$.
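Shannon's $F_N$ can be estimated directly from N-gram counts. The following is a minimal sketch, not part of the original slides: the function name `block_entropy_rate` is hypothetical, probabilities are plain relative frequencies, and the logarithm is taken base 2 (an assumption, giving bits per symbol).

```python
import math
from collections import Counter

def block_entropy_rate(symbols, n):
    """Estimate F_N (bits per symbol) from empirical N-gram counts.

    F_N = -sum_{i,j} p(B_i, S_j) * log2 p_{B_i}(S_j), where B_i ranges over
    blocks of N-1 symbols and S_j over the symbol that follows the block.
    """
    if n < 1 or len(symbols) < n:
        raise ValueError("need 1 <= n <= len(symbols)")
    ngrams = Counter(tuple(symbols[k:k + n]) for k in range(len(symbols) - n + 1))
    total = sum(ngrams.values())
    # Count each (N-1)-block only where a symbol follows it, so that the
    # conditional probabilities p_{B_i}(S_j) sum to 1 for every block.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    f_n = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total                 # p(B_i, S_j)
        p_cond = count / contexts[gram[:-1]]    # p_{B_i}(S_j)
        f_n -= p_joint * math.log2(p_cond)
    return f_n

# A strictly alternating sequence: two equiprobable symbols, but each symbol
# is fully determined by its predecessor.
print(block_entropy_rate("abababab", 1))
print(block_entropy_rate("abababab", 2))
```

On finite samples these relative-frequency estimates tend to fall below the true value as N grows, because most long blocks are seen only once; that is one reason entropy estimation appears as a problem topic in its own right.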
  11. 11. Word sequences: Chomsky N. Chomsky, Syntactic Structures, 1957: (1) Colorless green ideas sleep furiously. (2) Furiously sleep ideas green colorless. . . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.
  12. 12. Word sequences: Pereira F. Pereira, “Formal grammar and information theory: together again?”, 2000: [Chomsky’s argument] relies on the unstated assumption that any probabilistic model necessarily assigns zero probability to unseen events. Indeed, this would be the case if the model probability estimates were just the relative frequencies of observed events (the maximum-likelihood estimator). But we now understand that this naive method badly overfits the training data. […] To avoid this, one usually smoothes the data […] In fact, one of the earliest such methods, due to Turing and Good (Good, 1953), had been published before Chomsky's attack on empiricism…
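Pereira's point about relative frequencies can be illustrated on a toy corpus. This is a sketch under stated assumptions: the corpus sentence and both function names are invented for illustration, and only the simplest Good-Turing result is shown, namely that the total probability of all unseen events is estimated as N1/N (the number of events seen exactly once, divided by the sample size).

```python
from collections import Counter

def mle_probability(counts, event):
    """Relative-frequency (maximum-likelihood) estimate; 0 for unseen events."""
    total = sum(counts.values())
    return counts.get(event, 0) / total

def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen events: N1 / N."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total

# Invented toy corpus (not from the slides or from Pereira's paper).
words = "the dog chased the cat and the cat chased the mouse".split()
bigrams = Counter(zip(words, words[1:]))

print(mle_probability(bigrams, ("the", "cat")))       # seen bigram
print(mle_probability(bigrams, ("mouse", "chased")))  # unseen bigram: zero under MLE
print(good_turing_unseen_mass(bigrams))               # mass reserved for unseen bigrams
```

The maximum-likelihood estimator rules out every unseen bigram on identical grounds, which is exactly the assumption Chomsky's argument relies on; a smoothed estimator reserves some probability for unseen events and can then distribute it unevenly.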
  13. 13. Word sequences: Pereira Hidden variables … can also be used to create factored models of joint distributions that have far fewer parameters to estimate, and are thus easier to learn, than models of the full joint distribution. As a very simple but useful example, we may approximate the conditional probability p(x, y) of occurrence of two words x and y in a given configuration as $p(x, y) = p(x) \sum_c p(y \mid c)\, p(c \mid x)$, where c is a hidden “class” variable for the associations between x and y … When $(x, y) = (v_i, v_{i+1})$ we have an aggregate bigram model … which is useful for modeling word sequences that include unseen bigrams. With such a model, we can approximate the probability of a string $p(w_1 \cdots w_n)$ by $p(w_1 \cdots w_n) = p(w_1) \prod_{i=2}^{n} p(w_i \mid w_{i-1})$.
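The two formulas on this slide can be sketched in a few lines. Everything numeric below is a hypothetical toy parameterization with C = 2 classes and a three-word vocabulary, chosen only so the arithmetic is easy to follow; none of it comes from Pereira's trained model. The transition probability is taken as $p(y \mid x) = \sum_c p(y \mid c)\, p(c \mid x)$, which is the slide's factorization divided by $p(x)$.

```python
import math

# Hypothetical toy parameters (C = 2 hidden classes); not from Pereira's model.
P_FIRST = {"green": 0.5, "ideas": 0.3, "sleep": 0.2}   # p(w1)
P_CLASS_GIVEN_WORD = {                                  # p(c | x), one entry per class
    "green": [0.9, 0.1],
    "ideas": [0.2, 0.8],
    "sleep": [0.5, 0.5],
}
P_WORD_GIVEN_CLASS = [                                  # p(y | c), one dict per class
    {"green": 0.1, "ideas": 0.8, "sleep": 0.1},
    {"green": 0.1, "ideas": 0.1, "sleep": 0.8},
]

def next_word_prob(x, y):
    """p(y | x) = sum_c p(y | c) p(c | x)."""
    return sum(p_y[y] * p_c
               for p_y, p_c in zip(P_WORD_GIVEN_CLASS, P_CLASS_GIVEN_WORD[x]))

def string_log_prob(words):
    """log p(w1 ... wn) = log p(w1) + sum over i = 2..n of log p(w_i | w_{i-1})."""
    logp = math.log(P_FIRST[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(next_word_prob(prev, cur))
    return logp

forward = string_log_prob(["green", "ideas", "sleep"])
backward = string_log_prob(["sleep", "ideas", "green"])
print(math.exp(forward - backward))  # probability ratio of the two word orders
```

Because every transition passes through the class variable, the model needs on the order of 2·C·V parameters rather than V² for a vocabulary of size V, and it assigns nonzero probability to word pairs that never co-occurred in training.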
  14. 14. Word sequences: Pereira Using this estimate for the probability of a string and an aggregate model with C=16 trained on newspaper text using the expectation-maximization method, we find that $\frac{p(\text{Colorless green ideas sleep furiously})}{p(\text{Furiously sleep ideas green colorless})} \approx 2 \times 10^5$.
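For readers who want to see what expectation-maximization training of an aggregate bigram model involves, here is a minimal sketch. It is not Pereira's training setup: the function name, the random initialization, the toy corpus, and the choice of two classes are all assumptions made for illustration. The E-step computes the posterior of the hidden class for each observed bigram, and the M-step renormalizes the resulting expected counts.

```python
import math
import random
from collections import Counter

def train_aggregate_bigram(bigram_counts, num_classes, iterations=20, seed=0):
    """EM for p(y | x) = sum_c p(y | c) p(c | x).

    Returns (p_c_given_x, p_y_given_c, log_likelihood_per_iteration).
    """
    rng = random.Random(seed)
    xs = sorted({x for x, _ in bigram_counts})
    ys = sorted({y for _, y in bigram_counts})
    classes = range(num_classes)

    def random_dist(keys):
        weights = [rng.random() + 0.1 for _ in keys]
        z = sum(weights)
        return {k: w / z for k, w in zip(keys, weights)}

    p_c_x = {x: random_dist(classes) for x in xs}   # p(c | x)
    p_y_c = {c: random_dist(ys) for c in classes}   # p(y | c)
    history = []
    for _ in range(iterations):
        exp_c_x = {x: dict.fromkeys(classes, 0.0) for x in xs}
        exp_y_c = {c: dict.fromkeys(ys, 0.0) for c in classes}
        loglik = 0.0
        # E-step: posterior p(c | x, y) is proportional to p(y | c) p(c | x).
        for (x, y), n in bigram_counts.items():
            joint = [p_y_c[c][y] * p_c_x[x][c] for c in classes]
            p_y_x = sum(joint)
            loglik += n * math.log(p_y_x)
            for c in classes:
                expected = n * joint[c] / p_y_x
                exp_c_x[x][c] += expected
                exp_y_c[c][y] += expected
        history.append(loglik)  # likelihood under the parameters before this update
        # M-step: renormalize the expected counts.
        for x in xs:
            z = sum(exp_c_x[x].values())
            p_c_x[x] = {c: v / z for c, v in exp_c_x[x].items()}
        for c in classes:
            z = sum(exp_y_c[c].values())
            p_y_c[c] = {y: v / z for y, v in exp_y_c[c].items()}
    return p_c_x, p_y_c, history

# Invented toy corpus; a real model would be trained on far more text.
words = "the dog chased the cat and the cat chased the mouse".split()
counts = Counter(zip(words, words[1:]))
p_c_x, p_y_c, history = train_aggregate_bigram(counts, num_classes=2, iterations=15)
print(history[0], history[-1])  # log-likelihood should not decrease across iterations
```

The sketch has no smoothing and no guard against a class collapsing to zero probability, both of which a serious implementation would need to handle.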
  15. 15. Word sequences: concepts & problems
     • Concepts:
       • Information entropy (and conditional entropy, cross-entropy, mutual information)
       • Markov models, N-gram models
       • Chomsky hierarchy (first glimpse)
     • Problems:
       • Entropy estimation algorithms (n-gram, LZW, BW, etc.)
       • LNRE smoothing methods
       • EM estimation of hidden variables (learned earlier for Gaussian mixtures)
