CORPORA-BASED GENERATION OF
DEPENDENCY PARSER MODELS FOR
NATURAL LANGUAGE PROCESSING
by Edmond Lepedus
supervised by Marek Grześ, Christian Kissig and Laura Bocchi
BACKGROUND
Dependency Parsing
• Structure consists of dependencies between words:
Hello world
(head) (dependent)
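A dependency structure like the one above can be represented as a mapping from each word to the index of its head. This is an illustrative sketch only, not from the slides; following the slide's label order, "Hello" is treated as the head and "world" as its dependent, with index 0 as the artificial root.

```python
# A dependency parse as head indices: each word index maps to the index
# of its head; 0 conventionally denotes the artificial root.
sentence = ["Hello", "world"]
heads = {1: 0, 2: 1}  # word 1 ("Hello") attaches to the root; word 2 ("world") depends on word 1

def dependents(head_index, heads):
    """Return the indices of all words whose head is head_index."""
    return [dep for dep, head in heads.items() if head == head_index]
```

For example, `dependents(1, heads)` recovers `[2]`, i.e. "world" is the sole dependent of "Hello".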
Stanford CoreNLP
• Free, open-source NLP toolkit
• Includes a dependency parser backed by a neural network classifier
• Parses 1000 sentences per second at 92.2% accuracy
• Trained on manually annotated text
AIM
Aim
Train a classifier using an unparsed corpus of English-language text
MOTIVATION
Motivation
• Decrease the cost of training data
• Increase the availability of training data
• Increase parsing accuracy
• Enable the parsing of languages with few remaining speakers
APPROACH
Overview
1. Create a ‘blank’ model
2. Parse the corpus with the model & log decisions
3. Extract heuristics from the corpus & parse log
4. Generate training examples by modifying the logged decisions to fit the discovered heuristics
5. Train the model on the new examples
Diagram
Create a ‘blank’ model → Parse & log → Extract heuristics → Generate training examples → Train new model
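The data flow in the diagram can be sketched as a short loop. Everything below is a hypothetical illustration of the workflow, not CoreNLP code: the 'blank' model is a stand-in that always proposes a left arc labelled 'unknown', and the heuristic is the frequent-bigram assumption described under Heuristic Extraction.

```python
from collections import Counter

def create_blank_model():
    # 1. A 'blank' model: always predicts a left arc with an 'unknown' label.
    return lambda w1, w2: ("left-arc", "unknown")

def parse_and_log(model, corpus):
    # 2. Record every decision the model makes on adjacent word pairs.
    log = []
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            action, label = model(w1, w2)
            log.append({"pair": (w1, w2), "action": action, "label": label})
    return log

def extract_heuristics(corpus, threshold=2):
    # 3. Frequent bigrams are assumed to indicate a dependency.
    counts = Counter(pair for s in corpus for pair in zip(s, s[1:]))
    return {pair for pair, n in counts.items() if n >= threshold}

def generate_examples(log, heuristics):
    # 4. Keep only logged decisions whose word pair a heuristic supports
    #    (a simplification of "modifying decisions to fit the heuristics").
    return [entry for entry in log if entry["pair"] in heuristics]

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
model = create_blank_model()
log = parse_and_log(model, corpus)
examples = generate_examples(log, extract_heuristics(corpus))
# Step 5 (retraining on `examples`) is omitted here.
```

On this toy corpus, only the bigram ("the", "cat") occurs twice, so only the two logged decisions for that pair survive as training examples.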
IMPLEMENTATION
Blank Model Creation
• Outputs left arcs with a custom ‘unknown’ label
• Supports the creation of new training examples
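A minimal stand-in for the 'blank' model's behaviour, assuming a transition-based parser where each decision looks at the stack top and buffer front (the function and field names are illustrative, not the project's actual code):

```python
# The 'blank' model's decision function: regardless of the input words,
# it proposes a left arc with the custom 'unknown' label, giving the
# training loop an initial set of decisions to later correct.
def blank_decision(stack_top, buffer_front):
    return {"action": "left-arc", "label": "unknown"}
```

Because every decision is identical, the first parse of the corpus is uninformative on its own; its value is in producing a complete, loggable decision sequence to rewrite against the heuristics.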
Parse Decision Logs
• Log every parse decision to YAML, capturing the information required for training
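The slide's YAML example is not preserved in this transcript. As a stand-in, a log entry might be serialised like this; the field names (`step`, `action`, `label`, `stack`, `buffer`) are assumptions, not the project's actual schema, and a real implementation would use a YAML library rather than string formatting:

```python
# Hand-rolled serialisation of one parse decision as a YAML list item.
def log_entry_yaml(step, action, label, stack, buffer):
    lines = [
        f"- step: {step}",
        f"  action: {action}",
        f"  label: {label}",
        f"  stack: [{', '.join(stack)}]",
        f"  buffer: [{', '.join(buffer)}]",
    ]
    return "\n".join(lines)

entry = log_entry_yaml(1, "left-arc", "unknown", ["Hello"], ["world"])
```

Logging the parser configuration (stack and buffer) alongside each action is what makes the entries usable as training examples later.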
Heuristic Extraction
• Count bigram occurrences
• Assume that frequent bigrams indicate a dependency
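The bigram heuristic can be sketched with a plain counter; the frequency threshold here is an illustrative parameter, not a value from the project:

```python
from collections import Counter

def frequent_bigrams(sentences, threshold):
    """Count adjacent word pairs across the corpus; pairs occurring at
    least `threshold` times are taken as evidence of a dependency."""
    counts = Counter()
    for words in sentences:
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= threshold}

bigrams = frequent_bigrams([["the", "cat", "sat"], ["the", "cat", "ran"]], 2)
```

Here only ("the", "cat") clears the threshold, so it is the only pair treated as a likely head–dependent relation.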
Training Example Generation
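The examples on these slides are not preserved in the transcript. One plausible reading of step 4 of the Overview ("modifying the logged decisions to fit the discovered heuristics") is sketched below; the field names and the shift/left-arc rewrite rule are assumptions for illustration, not the project's actual logic:

```python
# Rewrite a logged decision: if the word pair matches a discovered
# heuristic, keep/confirm the arc; otherwise replace it with a shift,
# so the generated example no longer asserts an unsupported dependency.
def to_training_example(logged, heuristic_pairs):
    example = dict(logged)
    example["action"] = "left-arc" if logged["pair"] in heuristic_pairs else "shift"
    return example

ex_pos = to_training_example({"pair": ("the", "cat"), "action": "left-arc"},
                             {("the", "cat")})
ex_neg = to_training_example({"pair": ("cat", "sat"), "action": "left-arc"},
                             {("the", "cat")})
```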
RESULTS
Post-training Parses
FURTHER WORK
Further Work
• Improve efficiency to enable the use of larger corpora
• Develop better heuristic analyses
• Implement arc labels
CONCLUSION
Conclusion
• We modified the Stanford CoreNLP toolkit to enable the creation of ‘blank’ parser models
• We developed a workflow for training parser models without using annotated corpora
• We showed that this quickly yields qualitative improvements in parser outputs over the ‘blank’ models
• We proposed three avenues for further research
ANY QUESTIONS?
REFERENCES
[1] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Stroudsburg, PA, USA, 2014, pp. 55–60.
[2] J. Nivre, “Dependency Parsing,” Language and Linguistics Compass, vol. 4, no. 3, pp. 138–152, Mar. 2010.
[3] D. Chen and C. D. Manning, “A Fast and Accurate Dependency Parser using Neural Networks,” in Proceedings of EMNLP, 2014, pp. 740–750.
[4] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice-Hall, 2000.
