Invited Talk for the SIGNLL Conference on Computational Natural Language Learning 2017 (CoNLL 2017). Chris Dyer (DeepMind / CMU). 3 August 2017, Vancouver, Canada.
Talk at KTH, 14 May 2014, about matrix factorization, different latent-factor and neighborhood models, graphs and energy diffusion for recommender systems, as well as what makes good and bad recommendations.
State of Blockchain 2017: Smartnetworks and the Blockchain Economy (Melanie Swan)
Blockchain is a fundamental information technology for secure value transfer over networks. For any asset registered in a cryptographic ledger, the whole Internet acts as a VPN for its confirmation, assurance, and transfer. Blockchain reinvents economics and governance for the digital age. The long-tail structure of digital networks allows personalized economic and governance services. Smartnetworks are a new form of automated global infrastructure for large-scale, next-generation projects.
Deep Qualia: Philosophy of Statistics, Deep Learning, and Blockchain
Deep learning: What is it, why is it important, and what do I need to know?
The aim of this talk is to discuss deep learning as an advanced computational method and its philosophical implications. Computing is a fundamental model through which we understand more about ourselves and the world. We think that reality is composed of patterns, which machine learning methods can detect.
Deep learning is a complexity optimization technique in which algorithms learn from data by modeling high-level abstractions and assigning probabilities to nodes as they characterize a system and make predictions. An important challenge in deep learning is that these methods work well in certain domains (image, speech, and text recognition), but we do not have a good explanation of why, which impedes wider application of these solutions.
Another recent advance in computational methods is blockchain technology, which allows the secure transfer of assets and information, and the automated coordination of operations, via a trackable remunerative ledger and smart contracts (automatically executing Internet-based programs).
This talk looks at how deep learning technology, particularly when coupled with blockchain systems, might be used to produce a new kind of global computing platform. The goal is for blockchain deep learning systems to address higher-dimensional computing challenges that require learning and dynamic response in domains such as economics and financial risk, epidemiology, social modeling, public health (cancer, aging), dark matter, atomic reactions, network modeling (transportation, energy, smart cities), artificial intelligence, and consciousness.
Vectorland: Brief Notes from Using Text Embeddings for Search (Bhaskar Mitra)
(Invited talk at Search Solutions 2015)
A lot of recent work in neural models and "Deep Learning" is focused on learning vector representations for text, images, speech, entities, and other nuggets of information. From word analogies to automatically generating human-level descriptions of images, text embeddings have become a key ingredient in many natural language processing (NLP) and information retrieval (IR) tasks.
In this talk, I will present some personal lessons from working on (neural and non-neural) text embeddings for IR, as well as highlight a few key recent insights from the broader academic community. I will talk about the affinity of certain embeddings for certain kinds of tasks, and how the notion of relatedness in an embedding space depends on how the vector representations are trained. The goal of this talk is to encourage everyone to start thinking about text embeddings as more than just the output of a "black box" machine learning model, and to highlight that the relationships between different embedding spaces are about as interesting as the relationships between items within an embedding space.
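As a toy illustration of the vector arithmetic the abstract alludes to (my own sketch, not from the talk), here is cosine similarity over hypothetical embedding vectors, plus the classic analogy-by-arithmetic trick. Real vectors come from a trained model, and different training objectives yield spaces in which "related" means different things.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-d embeddings, for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.8, 0.3]),
    "man":   np.array([0.5, 0.9, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9, 0.1]),
}

# Word analogy via vector arithmetic: king - man + woman ~ queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"),
           key=lambda w: cosine(emb[w], target))
print(best)  # -> queen (with these toy vectors)
```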
A weekend software hack called "Movie Hack Attack".
Video content is played and analyzed in real time for sentiment, emotions, and more.
Sentiment is shown in a chart below the video; emotions and objects of attention appear to the left, and persons, locations, and organizations to the right, each illustrated with a random picture grabbed from Google image search.
Writing NodeJS applications is an easy task for JavaScript developers. However, what happens under the hood in NodeJS can be intimidating, and understanding it is vital for web developers.
Indeed, when you try to learn NodeJS, most tutorials cover the NodeJS ecosystem (Express, Socket.IO, PassportJS); tutorials about the NodeJS runtime itself are rare.
In this meetup, I want to shine a light on some advanced NodeJS topics and help developers answer the questions an experienced NodeJS developer is expected to answer. Understanding these topics is essential to making you a much more desirable developer. I will explore several topics, including the famous event loop, NodeJS module patterns, and how dependencies actually work in NodeJS.
I hope this meetup helps you become more comfortable reading advanced code written in NodeJS.
The next phase of Smart Network Convergence could be putting deep learning systems on the Internet. Deep learning and blockchain technology might be combined in the smart networks of the future for automated identification (deep learning) and automated transactions (blockchain). Large-scale, future-class problems might be addressed with Blockchain Deep Learning nets as an advanced computational infrastructure: challenges such as million-member genome banks, energy storage markets, global financial risk assessment, real-time voting, and asteroid mining.
Blockchain Deep Learning nets, and Smart Networks more generally, are computing networks with intelligence built in, such that identification and transfer are performed by the network itself through sophisticated protocols that automatically identify (deep learning) and validate, confirm, and route transactions (blockchain) within the network.
Blockchain Smartnetworks: Bitcoin and Blockchain Explained (Melanie Swan)
Beyond digitizing money, payments, economics, finance, and governance, and enabling smart property and smart contracts, blockchains can secure automated fleet coordination.
The implication could be an orderly transition to the automation economy and to the trust-rich digital smartnetwork societies of the future.
The main thesis here is this: (i) the data-driven approach to NLU is utterly fallacious; (ii) logical semantics has been seriously misguided; and (iii) logical semantics can be rectified, and here we suggest how this can be done and how to go forward.
KBSI 2015: Towards multimodal indicators of idea improvement in knowledge building (Bodong Chen)
An invited talk about potential multimodal indicators for assessing idea improvement in group knowledge building. Trieste, Italy. Early thoughts; comments welcomed.
Assessing Experiential Learning: Epic Finales and Roleplaying Rubrics (SeriousGamesAssoc)
Tony Crider, Professor of Astrophysics | Elon University
Assessing Experiential Learning: Epic Finales and Roleplaying Rubrics
The Reacting to the Past curriculum for higher education offers many games that are ready for adoption "as-is" into many different types of college classes.
The rubrics and grading methods are templates that are easily ported. The more creative examples of roleplay (e.g., the Epic Finales from a class about extraterrestrials) are meant to be more inspirational than directly adoptable.
As educators adopt more engaged teaching practices, multiple-choice and essay exams become increasingly incapable of capturing and reflecting student learning. In this talk, we will highlight engaged role-playing and experiential approaches, including Reacting to the Past games, Cultures of the Imagination, and selections from Anthony Weston's book, Teaching as the Art of Staging.
We'll also look at examples of the implementation of these in a variety of classes ranging from first-year seminars to astrophysics. We will then review rubrics and guidelines on how to assess student learning during these activities and at the end of the semester with Epic Finales.
Presented by the Serious Play Conference (seriousplayconf.com)
Université du Québec à Montréal (University of Quebec in Montreal)
Montreal, Quebec, Canada
July 10-12, 2019
Assessing Experiential Learning: Epic Finales and Roleplaying Rubrics (SeriousGamesAssoc)
The same talk as above, presented by the Serious Play Conference (seriousplayconf.com) at the University of Central Florida (UCF), Orlando, July 24-26, 2019.
- High-level overview
- Challenges in natural language processing
- What is intelligence?
- Sequence prediction
- A very short history of Solomonoff induction
- Meaning acquisition
- Logistic loss
- Gradient descent
- Applications
The Deconstruction of the Chinese Room (Bogdan Bocse)
In spite of being tempting and easy to understand, thought experiments have a high chance of reinforcing the preconscious biases and assumptions of the researcher, student, or listener. This "injection of bias" happens because, upon hearing a story, the listener preconsciously fills in the blanks and unknowns with hidden, unspoken assumptions, which then never reach the conscious debate. As you may expect, this phenomenon especially affects universities in Western countries with biases about non-Western cultures.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organised by the Excellence Foundation for South Sudan on 8 and 9 June 2024, from 1 PM to 3 PM each day.
Delivering Micro-Credentials in Technical and Vocational Education and Training (AG2 Design)
Explore how micro-credentials are transforming Technical and Vocational Education and Training (TVET) with this comprehensive slide deck. Discover what micro-credentials are, their importance in TVET, the advantages they offer, and insights from industry experts. Additionally, learn about the top software applications available for creating and managing micro-credentials. This presentation also includes valuable resources and a discussion on the future of these specialised certifications.
For more detailed information on delivering micro-credentials in TVET, visit https://tvettrainer.com/delivering-micro-credentials-in-tvet/
This presentation was provided by Steph Pollock of The American Psychological Association's Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One, 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
A Strategic Approach: GenAI in Education (Peter Windle)
Artificial Intelligence (AI) technologies such as Generative AI, image generators, and Large Language Models have had a dramatic impact on teaching, learning, and assessment over the past 18 months. The most immediate threat AI posed was to academic integrity, with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, and policies were put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning, and assessment, leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support, or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students, and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning, and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities, and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
This presentation covers the basics of PCOS, its pathology and treatment, along with the Ayurvedic correlation of PCOS and the Ayurvedic line of treatment described in the classics.
A review of the growth of the Israel Genealogy Research Association database collection over the last 12 months. Our collection has now passed the 3 million mark and is still growing. See which archives have contributed the most, the different types of records we have, and which years have had records added. You can also see what we have planned for the future.
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I... (NelTorrente)
This research concludes that while the readiness of teachers in Caloocan City to implement the MATATAG Curriculum is generally positive, targeted efforts in professional development, resource distribution, support networks, and comprehensive preparation can address the existing gaps and ensure successful curriculum implementation.
Chris Dyer - 2017 - CoNLL Invited Talk: Should Neural Network Architecture Reflect Linguistic Structure?
1. Should Neural Network Architecture Reflect Linguistic Structure?
CoNLL, August 3, 2017
Adhi Kuncoro (Oxford)
Phil Blunsom (DeepMind)
Dani Yogatama (DeepMind)
Chris Dyer (DeepMind/CMU)
Miguel Ballesteros (IBM)
Wang Ling (DeepMind)
Noah A. Smith (UW)
Ed Grefenstette (DeepMind)
4-8. Modeling language: Sentences are hierarchical
(1) a. The talk I gave did not appeal to anybody.
b. *The talk I gave appealed to anybody. [anybody is a negative polarity item (NPI)]
Generalization hypothesis: not must come before anybody
(2) *The talk I did not give appealed to anybody.
Examples adapted from Everaert et al. (TICS 2015)
9-13. Language is hierarchical
Examples adapted from Everaert et al. (TICS 2015)
[Figure 1, Everaert et al. (TICS 2015): Negative Polarity. (A) Negative polarity licensed: the negative element c-commands the negative polarity item ("The book that I bought did not appeal to anybody"). (B) Negative polarity not licensed: the negative element does not c-command the negative polarity item ("The book that I did not buy appealed to anybody").]
The structure dominating not must also dominate the structure containing anybody. (This structural configuration is called c(onstituent)-command in the linguistics literature.) When the relationship between not and anybody adheres to this structural configuration, the sentence is well-formed. In sentence (3), by contrast, not sequentially precedes anybody, but the triangle dominating not in Figure 1B fails to also dominate the structure containing anybody. Consequently, the sentence is not well-formed.
Generalization: not must "structurally precede" anybody
- The psychological reality of structural sensitivity is not empirically controversial
- There are many different theories of the details of structure
- A more controversial hypothesis: kids learn language easily because they don't consider many "obvious" structure-insensitive hypotheses
14-15. Recurrent neural networks: Good models of language?
• Recurrent neural networks are incredibly powerful models of sequences (e.g., of words)
• In fact, RNNs are Turing complete! (Siegelmann, 1995)
• But do they make good generalizations from finite samples of data? What inductive biases do they have?
• Part 1: Using syntactic annotations to improve inductive bias
• Part 2: Inferring better compositional structure without explicit annotation
16-17. Recurrent neural networks: Inductive bias
• Understanding the biases of neural networks is tricky
• We have enough trouble understanding the representations they learn in specific cases, much less general cases!
• But there is lots of evidence that RNNs have a bias for sequential recency
• Evidence 1: Gradients become attenuated across time. Functional analysis; experiments with synthetic datasets (yes, LSTMs/GRUs help, but they also forget)
• Evidence 2: Training regimes like reversing sequences in seq2seq learning
• Evidence 3: Modeling enhancements to use attention (direct connections back to remote time steps)
• Evidence 4: Linzen et al. (2017) findings on English subject-verb agreement
18. The keys to the cabinet in the closet is/are on the table.
[AGREE: the verb must agree with the structurally connected subject head "keys", not the linearly closer "closet"]
Linzen, Dupoux, Goldberg (2017)
19. Chomsky (crudely paraphrasing 60 years of work): Sequential recency is not the right bias for modeling human language.
20-22. An alternative: Recurrent Neural Net Grammars (Adhi Kuncoro)
• Generate symbols sequentially using an RNN
• Add some control symbols to rewrite the history occasionally
• Occasionally compress a sequence into a constituent
• The RNN predicts the next terminal/control symbol based on the history of compressed elements and non-compressed terminals
• This is a top-down, left-to-right generation of a tree+sequence (see the sketch below)
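As a concrete illustration of this action loop, here is a minimal sketch (my own, not the authors' implementation): NT(X) opens a constituent, GEN(w) emits a word, and REDUCE closes the most recent open constituent; the toy stack stands in for the neural state.

```python
def apply_action(stack, action):
    """Apply one RNNG-style action to the stack (in place)."""
    kind = action[0]
    if kind == "NT":              # open a new constituent, e.g. NT(S)
        stack.append(("open", action[1]))
    elif kind == "GEN":           # generate a terminal word
        stack.append(action[1])
    elif kind == "REDUCE":        # close the most recent open constituent
        children = []
        while not isinstance(stack[-1], tuple):
            children.append(stack.pop())
        label = stack.pop()[1]
        # compose the children into a single (closed) constituent symbol
        stack.append("(%s %s)" % (label, " ".join(reversed(children))))
    return stack

# The derivation walked through on the next slides:
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
           ("GEN", "cat"), ("REDUCE",), ("NT", "VP"), ("GEN", "meows"),
           ("REDUCE",), ("GEN", "."), ("REDUCE",)]

stack = []
for a in actions:
    apply_action(stack, a)
print(stack[0])   # -> (S (NP The hungry cat) (VP meows) .)
```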
30-42. The derivation of "The hungry cat meows." as (stack, action, probability) steps; each REDUCE compresses the finished constituent (e.g., "The hungry cat") into a single composite symbol:

stack                                 | action      | probability
                                      | NT(S)       | p(nt(S) | top)
(S                                    | NT(NP)      | p(nt(NP) | (S)
(S (NP                                | GEN(The)    | p(gen(The) | (S, (NP)
(S (NP The                            | GEN(hungry) | p(gen(hungry) | (S, (NP, The)
(S (NP The hungry                     | GEN(cat)    | p(gen(cat) | ...)
(S (NP The hungry cat                 | REDUCE      | p(reduce | ...)
(S (NP The hungry cat)                | NT(VP)      | p(nt(VP) | (S, (NP The hungry cat))
(S (NP The hungry cat) (VP            | GEN(meows)  | p(gen(meows) | ...)
(S (NP The hungry cat) (VP meows      | REDUCE      | p(reduce | ...)
(S (NP The hungry cat) (VP meows)     | GEN(.)      | p(gen(.) | ...)
(S (NP The hungry cat) (VP meows) .   | REDUCE      | p(reduce | ...)
(S (NP The hungry cat) (VP meows) .)  |             |
43. Some things you can show
• Valid (tree, string) pairs are in bijection with valid sequences of actions (specifically, the DFS, left-to-right traversal of the trees)
• Every stack configuration perfectly encodes the complete history of actions
• Therefore, the probability decomposition is justified by the chain rule (see the sketch below):

p(x, y) = p(\text{actions}(x, y))   (prop. 1)
p(\text{actions}(x, y)) = \prod_i p(a_i \mid a_{<i})   (chain rule)
= \prod_i p(a_i \mid \text{stack}(a_{<i}))   (prop. 2)
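A hedged sketch of how this factorization becomes a training objective: the log probability of a (tree, string) pair is the sum of per-action log probabilities, each conditioned only on an encoding of the current stack. Here `stack_encoder` and `action_distribution` are hypothetical stand-ins for the learned stack encoder and softmax layer, and `apply_action` is the transition function sketched earlier.

```python
import math

def log_prob_of_parse(actions, stack_encoder, action_distribution):
    """Sum log p(a_i | stack(a_<i)) over a derivation (chain rule).

    stack_encoder(stack) -> vector summarizing the stack contents
    action_distribution(vector) -> dict mapping actions to probabilities
    Both are hypothetical stand-ins for the learned model components.
    """
    stack, total = [], 0.0
    for a in actions:
        probs = action_distribution(stack_encoder(stack))
        total += math.log(probs[a])
        apply_action(stack, a)   # transition function from the earlier sketch
    return total
```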
44-48. Modeling the next action
p(a_i | (S (NP The hungry cat) (VP meows)
Two problems, two solutions:
1. Unbounded depth → recurrent neural nets [diagram: unrolled RNN states h1, h2, h3, h4 over the stack symbols]
2. Arbitrarily complex trees → recursive neural nets
49-54. Syntactic composition
Need a representation for: (NP The hungry cat)
[diagram: the elements The, hungry, cat and the closing bracket are read in sequence, together with the NP label ("What head type?"), to build a single vector for the phrase]
59-60. Syntactic composition: Recursion
Need a representation for: (NP The (ADJP very hungry) cat)
[diagram: the inner constituent (ADJP very hungry) is first composed into a vector v; the outer NP is then composed from ( NP The v cat ), so composition applies recursively]
61-64. Modeling the next action
p(a_i | (S (NP The hungry cat) (VP meows)
1. Unbounded depth → recurrent neural nets
2. Arbitrarily complex trees → recursive neural nets
After a REDUCE, the next prediction is p(a_{i+1} | (S (NP The hungry cat) (VP meows))
3. Limited updates to state → stack RNNs (see the sketch below)
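A minimal sketch of the stack-RNN idea (my own illustration, assuming a simple Elman-style cell rather than the paper's stack LSTM): instead of re-running the RNN over the whole prefix after each stack operation, keep a stack of hidden states, so a push adds one state and a pop restores the previous one.

```python
import numpy as np

class StackRNN:
    """Stack of RNN hidden states: push extends the encoding by one step,
    pop restores the summary as it was before the popped element."""

    def __init__(self, dim, input_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(scale=0.1, size=(dim, dim))
        self.W_x = rng.normal(scale=0.1, size=(dim, input_dim))
        self.states = [np.zeros(dim)]   # bottom-of-stack state

    def push(self, x):
        # one recurrent step from the current top state
        h = np.tanh(self.W_h @ self.states[-1] + self.W_x @ x)
        self.states.append(h)

    def pop(self):
        return self.states.pop()        # previous summary becomes current

    def summary(self):
        return self.states[-1]          # encodes the current stack contents
```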
65-69. RNNGs: Inductive bias?
• If we accept the following two propositions...
• Sequential RNNs have a recency bias
• Syntactic composition learns to represent trees by endocentric heads
• ...then we can say that RNNGs have a bias for syntactic recency rather than sequential recency
RNNG: (S (NP The keys (PP to the cabinet (PP in the closet))) (VP is/are
[in the RNNG's stack, the whole subject NP is summarized by its head, "keys", which is therefore syntactically recent when the verb is generated]
Sequential RNN: The keys to the cabinet in the closet is/are
[for the sequential RNN, the most recent word is "closet"]
70. RNNGs: A few empirical results
• Generative models are great! We have defined a joint distribution p(x, y) over strings (x) and parse trees (y)
• We can look at two questions:
• What is p(x) for a given x? [language modeling]
• What is max_y p(y | x) for a given x? [parsing]
71-74. English PTB (Parsing)

Model                                    | Type              | F1
Petrov and Klein (2007)                  | Gen               | 90.1
Shindo et al. (2012), single model       | Gen               | 91.1
Vinyals et al. (2015), PTB only          | Disc              | 90.5
Shindo et al. (2012), ensemble           | Gen               | 92.4
Vinyals et al. (2015), semisupervised    | Disc+SemiSup      | 92.8
Discriminative, PTB only                 | Disc              | 91.7
Generative, PTB only                     | Gen               | 93.6
Choe and Charniak (2016), semisupervised | Gen+SemiSup       | 93.8
Fried et al. (2017)                      | Gen+Semi+Ensemble | 94.7
76. Summary
• RNNGs are effective for modeling language and parsing
• Finding 1: Generative parser > discriminative parser
• Better sample complexity
• No label bias in modeling the action sequence
• Finding 2: RNNGs > RNNs
• PTB phrase structure is a useful inductive bias for making good linguistic generalizations!
• Open question: Do RNNGs solve the Linzen, Dupoux, and Goldberg problem?
77. Recurrent neural networks: Good models of language?
• Recurrent neural networks are incredibly powerful models of sequences (e.g., of words)
• In fact, RNNs are Turing complete! (Siegelmann, 1995)
• But do they make good generalizations from finite samples of data? What inductive biases do they have?
• Part 1: Using syntactic annotations to improve inductive bias
• Part 2: Inferring better compositional structure without explicit annotation
78. Unsupervised structure: Do we need syntactic annotation? (Dani Yogatama)
• We have compared two end points: a sequence model and a PTB-based syntactic model
• If we search for structure to obtain the best performance on downstream tasks, what will it find?
• In this part of the talk, I focus on representation learning for sentences.
79-80. Background
• Advances in deep learning have led to three predominant approaches for constructing representations of sentences:
• Convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014; Ma et al., 2015)
• Recurrent neural networks (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014)
• Recursive neural networks (Socher et al., 2011; Socher et al., 2013; Tai et al., 2015; Bowman et al., 2016)
81. Recurrent Encoder
input (word embeddings): A boy drags his sleds through the snow
output: L = log p(y | x)
83. Recursive Encoder
• Prior work on tree-structured neural networks assumed that the trees are either provided as input or predicted based on explicit human annotations (Socher et al., 2013; Bowman et al., 2016; Dyer et al., 2016)
• When children learn a new language, they are not given parse trees
• Infer tree structure as a latent variable that is marginalized during learning of a downstream task that depends on the composed representation (see the toy sketch below):

L = \log p(y \mid x) = \log \sum_{z \in Z(x)} p(y, z \mid x)

• Hierarchical structures can also be encoded into a word embedding model to improve performance and interpretability (Yogatama et al., 2015)
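A toy sketch of this marginalization (my own illustration): enumerate every binary bracketing z of a short sentence and sum p(z | x) p(y | z, x) under hypothetical stub distributions. Real models avoid this enumeration, since the number of trees grows as the Catalan numbers, which is why the talk turns to reinforcement learning below.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def all_binary_trees(words):
    """Enumerate every binary bracketing of a tuple of words (Catalan many)."""
    if len(words) == 1:
        return [words[0]]
    trees = []
    for split in range(1, len(words)):
        for left in all_binary_trees(words[:split]):
            for right in all_binary_trees(words[split:]):
                trees.append((left, right))
    return trees

def marginal_log_likelihood(words, p_tree, p_label_given_tree):
    """log p(y|x) = log sum_z p(z|x) p(y|z,x), by brute-force enumeration.
    p_tree and p_label_given_tree are hypothetical stub distributions."""
    trees = all_binary_trees(tuple(words))
    return math.log(sum(p_tree(z) * p_label_given_tree(z) for z in trees))

# e.g. a uniform prior over trees and a dummy label likelihood:
words = ["a", "boy", "drags", "sleds"]
n = len(all_binary_trees(tuple(words)))        # 5 trees for 4 words
print(marginal_log_likelihood(words, lambda z: 1.0 / n, lambda z: 0.5))
```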
86-90. Model: Shift-reduce parsing (Aho and Ullman, 1972)
Sentence: "A boy drags his sleds"
SHIFT moves the next word from the queue onto the stack: after one SHIFT, the stack holds [A] and the queue holds [boy, drags, his, sleds]; after another SHIFT, the stack holds [A, boy] and the queue holds [drags, his, sleds].
91. Model: Shift-reduce parsing (Aho and Ullman, 1972)
REDUCE composes the top two elements of the stack (here, "A" and "boy") with a Tree LSTM (Tai et al., 2015; Zhu et al., 2015), shown in code below:

i = \sigma(W_I [h_i, h_j] + b_I)
o = \sigma(W_O [h_i, h_j] + b_O)
f_L = \sigma(W_{F_L} [h_i, h_j] + b_{F_L})
f_R = \sigma(W_{F_R} [h_i, h_j] + b_{F_R})
g = \tanh(W_G [h_i, h_j] + b_G)
c = f_L \odot c_i + f_R \odot c_j + i \odot g
h = o \odot c
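A direct numpy transcription of these composition equations (a sketch: the weights are random for illustration, and I follow the slide in setting h = o ⊙ c rather than the more common h = o ⊙ tanh(c)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_compose(hi, ci, hj, cj, params):
    """Binary Tree-LSTM composition of two children (Tai et al., 2015;
    Zhu et al., 2015), as written on the slide."""
    x = np.concatenate([hi, hj])
    W, b = params  # dicts keyed by gate name: I, O, FL, FR, G
    i  = sigmoid(W["I"]  @ x + b["I"])
    o  = sigmoid(W["O"]  @ x + b["O"])
    fL = sigmoid(W["FL"] @ x + b["FL"])
    fR = sigmoid(W["FR"] @ x + b["FR"])
    g  = np.tanh(W["G"]  @ x + b["G"])
    c  = fL * ci + fR * cj + i * g   # gated sum of child cells + new input
    h  = o * c                       # slide's output gate (often o * tanh(c))
    return h, c

# Usage with random weights of dimension d:
d = 8
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, 2 * d)) for k in ["I", "O", "FL", "FR", "G"]}
b = {k: np.zeros(d) for k in ["I", "O", "FL", "FR", "G"]}
h, c = tree_lstm_compose(*(rng.normal(size=d) for _ in range(4)), (W, b))
```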
92-95. Model: Shift-reduce parsing (Aho and Ullman, 1972)
The composed result, Tree-LSTM(A, boy), is pushed back onto the stack. SHIFT and REDUCE continue until the queue is empty and a single element remains on the stack, representing the whole sentence:
Tree-LSTM(Tree-LSTM(A, boy), Tree-LSTM(drags, Tree-LSTM(his, sleds)))
96. Model: Shift-reduce parsing (Aho and Ullman, 1972)
Different SHIFT/REDUCE sequences lead to different tree structures over the same sentence, e.g. for a four-word input:
S, S, R, S, S, R, R | S, S, S, R, R, S, R | S, S, R, S, R, S, R | S, S, S, S, R, R, R
[diagrams: the four distinct binary trees induced by these action sequences; see the sketch below]
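To make the correspondence concrete, here is a small sketch (my own illustration, not from the slides) that replays a SHIFT/REDUCE string over a token list and returns the resulting binary tree as nested tuples.

```python
def run_shift_reduce(tokens, actions):
    """Replay a SHIFT/REDUCE ('S'/'R') string over tokens; return the tree
    as nested tuples. Each 'R' composes the top two stack elements."""
    queue, stack = list(tokens), []
    for a in actions:
        if a == "S":
            stack.append(queue.pop(0))
        else:  # "R"
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not queue, "malformed action sequence"
    return stack[0]

words = ["a", "boy", "drags", "sleds"]
print(run_shift_reduce(words, "SSRSSRR"))  # (('a', 'boy'), ('drags', 'sleds'))
print(run_shift_reduce(words, "SSSSRRR"))  # ('a', ('boy', ('drags', 'sleds')))
```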
97. Learning
• How do we learn the policy for the shift-reduce sequence?
• Supervised learning! Imitation learning! But we need a treebank.
• What if we don't have labels?
98. Reinforcement Learning
• Given a state and a set of possible actions, an agent needs to decide which action is the best to take
• There is no supervision, only rewards, which may not be observed until several actions have been taken
99. Reinforcement Learning
• Given a state and a set of possible actions, an agent needs to decide which action is the best to take
• There is no supervision, only rewards, which may not be observed until several actions have been taken
For a shift-reduce "agent":
• State: embeddings of the top two elements of the stack, and the embedding of the head of the queue
• Actions: SHIFT, REDUCE
• Reward: log likelihood on a downstream task, given the produced representation
100. Reinforcement Learning
• We use a simple policy gradient method, REINFORCE (Williams, 1992)
• The goal of training is to train a policy network \pi(a \mid s; W_R) to maximize rewards (see the sketch below):

R(W) = \mathbb{E}_{\pi(a, s; W_R)} \left[ \sum_{t=1}^{T} r_t a_t \right]

• Other techniques for approximating the marginal likelihood are available (Guu et al., 2017)
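A hedged sketch of one REINFORCE step for such a policy (illustrative; a generic score-function gradient, not the paper's exact estimator). In a real parser the state features would be recomputed after each action; here they are given as a fixed list to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_actions(theta, features):
    """Sample SHIFT(True)/REDUCE(False) decisions from a logistic policy."""
    return [np.random.rand() < sigmoid(theta @ x) for x in features]

def reinforce_update(theta, features, actions, reward, lr=0.01):
    """One REINFORCE step (Williams, 1992): grad E[R] is estimated as
    reward * sum_t grad log pi(a_t | s_t) for the sampled trajectory."""
    grad = np.zeros_like(theta)
    for x, a in zip(features, actions):
        p = sigmoid(theta @ x)          # pi(SHIFT | s_t)
        grad += (float(a) - p) * x      # d log pi / d theta for a Bernoulli
    return theta + lr * reward * grad
```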
101-102. Stanford Sentiment Treebank (Socher et al., 2013)

Method                                                   | Accuracy
Naive Bayes (from Socher et al., 2013)                   | 81.8
SVM (from Socher et al., 2013)                           | 79.4
Average of word embeddings (from Socher et al., 2013)    | 80.1
Bayesian optimization (Yogatama et al., 2015)            | 82.4
Weighted average of word embeddings (Arora et al., 2017) | 82.4
Left-to-right LSTM                                       | 84.7
Right-to-left LSTM                                       | 83.9
Bidirectional LSTM                                       | 84.7
Supervised syntax                                        | 85.3
Semi-supervised syntax                                   | 86.1
Latent syntax                                            | 86.5
105. Grammar Induction: Summary
• The induced trees look "non-linguistic"
• But downstream performance is great!
• They ignore the generation of the sentence in favor of structures optimal for interpretation
• Natural languages must balance both.
106. Linguistic Structure: Summary
• Do we need better bias in our models?
• Yes! They are making the wrong generalizations, even from large data.
• Do we have to have the perfect model?
• No! Small steps in the right direction can pay big dividends.