This document discusses structured prediction problems in machine learning for natural language processing tasks. It covers using linear classifiers like perceptrons and SVMs for structured outputs by factorizing feature representations. Sequence labeling tasks are used as a running example, with explanations of how to apply the Viterbi algorithm for inference and conditional random fields for learning. Dependency parsing is presented as a case study for structured prediction.
1. Machine Learning for Language Technology
Uppsala University
Department of Linguistics and Philology
Structured Prediction
October 2013
Slides borrowed from previous courses.
Thanks to Ryan McDonald (Google Research)
and Prof. Joakim Nivre
2. Structured Prediction
Outline
Last time:
Preliminaries: input/output, features, etc.
Linear classifiers
Perceptron
Large-margin classifiers (SVMs, MIRA)
Logistic regression
Today:
Structured prediction with linear classifiers
Structured perceptron
Structured large-margin classifiers (SVMs, MIRA)
Conditional random fields
Case study: Dependency parsing
3. Structured Prediction
Structured Prediction (i)
Sometimes our output space Y does not consist of simple atomic classes
Examples:
Parsing: for a sentence x, Y is the set of possible parse trees
Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags, named-entity tags
Machine translation: for a source sentence x, Y is the set of possible target language sentences
4. Structured Prediction
Hidden Markov Models
Generative model – maximizes the joint likelihood P(x, y)
We are looking at discriminative versions of this
Not just sequences, though that will be the running example
5. Structured Prediction
Structured Prediction (ii)
Can’t we just use our multiclass learning algorithms?
In all these cases, the size of the set Y is exponential in the length of the input x
It is non-trivial to apply our learning algorithms in such cases
6. Structured Prediction
Perceptron
Training data: T = {(xt, yt)}, t = 1, …, |T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. Let y′ = arg maxy w(i) · f(xt, y) (**)
5. if y′ ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y′)
7. i = i + 1
8. return w(i)
(**) Solving the argmax requires a search over an exponential space of outputs!
7. Structured Prediction
Large-Margin Classifiers
Batch (SVMs):
min ½||w||²
such that:
w · f(xt, yt) − w · f(xt, y′) ≥ 1
∀(xt, yt) ∈ T and y′ ∈ ¯Yt (**)   (¯Yt = Y − {yt}, the set of incorrect outputs)
Online (MIRA):
Training data: T = {(xt, yt)}, t = 1, …, |T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. w(i+1) = arg minw∗ ||w∗ − w(i)||
such that:
w∗ · f(xt, yt) − w∗ · f(xt, y′) ≥ 1 ∀y′ ∈ ¯Yt (**)
5. i = i + 1
6. return w(i)
(**) There are exponentially many constraints in the size of each input!!
8. Structured Prediction
Factor the Feature Representations
We can assume that our feature representation factors relative to the output
Example:
Context-free parsing:
f(x, y) = Σ_{A→BC ∈ y} f(x, A → BC)
Sequence analysis – Markov assumptions:
f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function
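To make the factorization concrete, here is a minimal Python sketch of a first-order feature map for sequence labeling. The dict-based sparse representation and the particular feature templates are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def local_features(x, i, prev_tag, tag):
    """Local feature map f(x, y_{i-1}, y_i): indicator features on the
    current word together with the adjacent tag pair (templates are
    illustrative)."""
    return Counter({
        "word=%s,tag=%s" % (x[i], tag): 1.0,
        "prev=%s,tag=%s" % (prev_tag, tag): 1.0,
    })

def global_features(x, y):
    """f(x, y) = sum_{i=1}^{|y|} f(x, y_{i-1}, y_i) -- the first-order
    factorization from this slide."""
    f = Counter()
    for i in range(1, len(y)):
        f.update(local_features(x, i, y[i - 1], y[i]))
    return f
```

Because f(x, y) decomposes into local terms, the global score w · f(x, y) decomposes the same way, which is exactly what Viterbi and CKY exploit.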
9. Structured Prediction
Example – Sequence Labeling
Many NLP problems can be cast in this light
Part-of-speech tagging
Named-entity extraction
Semantic role labeling
...
Input: x = x0x1 … xn
Output: y = y0y1 … yn
Each yi ∈ Yatom – which is small
Each y ∈ Y = Yatom^n – which is large
Example: part-of-speech tagging – Yatom is the set of tags
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
10. Structured Prediction
Sequence Labeling – Output Interaction
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
Why not just break up the sequence into a set of multi-class predictions?
Because there are interactions between neighbouring tags
What tag does “saw” have?
What if I told you the previous tag was article?
What if it was noun?
11. Structured Prediction
Sequence Labeling – Markov Factorization
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
Markov factorization – factor by adjacent labels
First-order (like HMMs)
f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
kth-order
f(x, y) = Σ_{i=k}^{|y|} f(x, y_{i−k}, …, y_{i−1}, y_i)
12. Structured Prediction
Sequence Labeling – Features
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
First-order
f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
f(x, yi−1, yi) is any feature of the input & two adjacent labels
fj(x, yi−1, yi) = 1 if xi = “saw” and yi−1 = noun and yi = verb, 0 otherwise
fk(x, yi−1, yi) = 1 if xi = “saw” and yi−1 = article and yi = verb, 0 otherwise
wj should get high weight and wk should get low weight
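As a sketch, the two indicator features above could be written as follows (the function names and indices are illustrative):

```python
def f_j(x, i, prev_tag, tag):
    # Fires when "saw" follows a noun and is tagged verb: should get high weight.
    return 1.0 if (x[i] == "saw" and prev_tag == "noun" and tag == "verb") else 0.0

def f_k(x, i, prev_tag, tag):
    # Fires when "saw" follows an article and is tagged verb: should get low weight,
    # since after an article "saw" is almost always a noun.
    return 1.0 if (x[i] == "saw" and prev_tag == "article" and tag == "verb") else 0.0
```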
13. Structured Prediction
Sequence Labeling - Inference
How does factorization affect inference?
y′ = arg max_y w · f(x, y)
   = arg max_y w · Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
   = arg max_y Σ_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
   = arg max_y Σ_{i=1}^{|y|} Σ_{j=1}^{m} wj · fj(x, y_{i−1}, y_i)
Can use the Viterbi algorithm
14. Structured Prediction
Sequence Labeling – Viterbi Algorithm
Let αy,i be the score of the best labeling of the sequence x0x1 … xi with yi = y
If we know α, then maxy αy,n is the score of the best labeling of the whole sequence
αy,i can be calculated with the following recursion
αy,0 = 0.0 ∀y ∈ Yatom
αy,i = max_{y∗} [αy∗,i−1 + w · f(x, y∗, y)]
15. Structured Prediction
Sequence Labeling - Back-Pointers
But that only tells us what the best score is
Let βy,i be the (i−1)th label in the best labeling of the sequence x0x1 … xi with yi = y – i.e., a back-pointer
βy,i can be calculated with the following recursion
βy,0 = nil ∀y ∈ Yatom
βy,i = arg max_{y∗} [αy∗,i−1 + w · f(x, y∗, y)]
Thus:
The last label in the best sequence is yn = arg maxy αy,n
The second-to-last label is yn−1 = βyn,n
...
The first label is y0 = βy1,1
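A compact Python sketch of the two recursions above, assuming a callback score(i, y∗, y) that returns w · f(x, y∗, y); all names are illustrative:

```python
def viterbi(n, tags, score):
    """Best labeling of positions 0..n-1 under a first-order factorization.
    score(i, prev, cur) should return w . f(x, y_{i-1}=prev, y_i=cur)."""
    # alpha[i][t]: score of the best labeling of x_0..x_i ending with tag t
    alpha = [{t: 0.0 for t in tags}]
    back = [{t: None for t in tags}]  # back-pointers (the slide's beta)
    for i in range(1, n):
        alpha.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: alpha[i - 1][p] + score(i, p, t))
            alpha[i][t] = alpha[i - 1][prev] + score(i, prev, t)
            back[i][t] = prev
    # Recover the sequence by following back-pointers from the best final tag.
    y = [max(tags, key=lambda t: alpha[n - 1][t])]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))
```

Combined with the factored feature map sketched earlier, score could be, e.g., lambda i, p, t: sum(w[k] * v for k, v in local_features(x, i, p, t).items()).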
16. Structured Prediction
Structured Learning
We know we can solve the inference problem
At least for sequence labeling
But also for many other problems where one can factor features appropriately (context-free parsing, dependency parsing, semantic role labeling, …)
How does this change learning?
for the perceptron algorithm?
for SVMs?
for logistic regression?
17. Structured Prediction
Structured Perceptron
Exactly like original perceptron
Except now the argmax function uses factored features
Which we can solve with algorithms like the Viterbi algorithm
All of the original analysis carries over!!
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. Let y′ = arg maxy w(i) · f(xt, y) (**)
5. if y′ ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y′)
7. i = i + 1
8. return w(i)
(**) Solve the argmax with Viterbi for sequence problems!
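Putting the pieces together, a sketch of this training loop reusing the local_features, global_features, and viterbi helpers from the earlier sketches (weight averaging, shuffling, and other practical details are omitted):

```python
from collections import Counter

def train_structured_perceptron(data, tags, epochs=10):
    """data: list of (x, y_gold) pairs, x a word list, y_gold a tag list."""
    w = Counter()
    for _ in range(epochs):
        for x, y_gold in data:
            score = lambda i, p, t: sum(
                w[k] * v for k, v in local_features(x, i, p, t).items())
            y_pred = viterbi(len(x), tags, score)  # (**) argmax via Viterbi
            if y_pred != y_gold:                   # update only on mistakes
                w.update(global_features(x, y_gold))     # w += f(x, y_t)
                w.subtract(global_features(x, y_pred))   # w -= f(x, y')
    return w
```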
18. Structured Prediction
Online Structured SVMs (or Online MIRA)
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. w(i+1) = arg minw∗ ||w∗ − w(i)||
such that:
w∗ · f(xt, yt) − w∗ · f(xt, y′) ≥ L(yt, y′)
∀y′ ∈ ¯Yt and y′ ∈ k-best(xt, w(i))
5. i = i + 1
6. return w(i)
k-best(xt, w(i)) is the set of k outputs with the highest scores under w(i)
Simple solution – only consider the single highest-scoring output y′ ∈ ¯Yt
Note: the old fixed margin of 1 is now a fixed loss L(yt, y′) between two structured outputs
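For the simple single-constraint case (k = 1), the constrained step in line 4 has a well-known closed form; a sketch under that assumption, with illustrative names:

```python
from collections import Counter

def mira_update(w, f_gold, f_pred, loss):
    """Smallest change to w (in Euclidean norm) so that the gold output
    beats the current best prediction by a margin of loss = L(y_t, y')."""
    delta = Counter(f_gold)
    delta.subtract(f_pred)                            # delta = f(x,y_t) - f(x,y')
    margin = sum(w[k] * v for k, v in delta.items())  # w . delta
    norm_sq = sum(v * v for v in delta.values())      # ||delta||^2
    if norm_sq > 0 and margin < loss:                 # constraint violated
        tau = (loss - margin) / norm_sq
        for k, v in delta.items():
            w[k] += tau * v                           # w <- w + tau * delta
    return w
```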
19. Structured Prediction
Structured SVMs
min ½||w||²
such that:
w · f(xt, yt) − w · f(xt, y′) ≥ L(yt, y′)
∀(xt, yt) ∈ T and y′ ∈ ¯Yt
Still have an exponential number of constraints
Feature factorizations permit solutions (max-margin Markov networks, structured SVMs)
Note: the old fixed margin of 1 is now a fixed loss L(yt, y′) between two structured outputs
20. Structured Prediction
Conditional Random Fields (i)
What about structured logistic regression?
Such a thing exists – Conditional Random Fields (CRFs)
Consider again the sequential case with 1st order factorization
Inference is identical to the structured perceptron – use Viterbi
arg max_y P(y|x) = arg max_y e^{w·f(x,y)} / Zx
                 = arg max_y e^{w·f(x,y)}
                 = arg max_y w · f(x, y)
                 = arg max_y Σ_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
21. Structured Prediction
Conditional Random Fields (ii)
However, learning does change
Reminder: pick w to maximize log-likelihood of training data:
w = arg max_w Σ_t log P(yt|xt)
Take the gradient and use gradient ascent; writing F(w) for the log-likelihood:
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y′∈Y} P(y′|xt) fi(xt, y′)
And the gradient is:
∇F(w) = (∂F(w)/∂w0, ∂F(w)/∂w1, …, ∂F(w)/∂wm)
22. Structured Prediction
Conditional Random Fields (iii)
Problem: sum over output space Y
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y′∈Y} P(y′|xt) fi(xt, y′)
          = Σ_t Σ_{j=1}^{|yt|} fi(xt, yt,j−1, yt,j) − Σ_t Σ_{y′∈Y} Σ_{j=1}^{|y′|} P(y′|xt) fi(xt, y′j−1, y′j)
Can easily calculate first term – just empirical counts
What about the second term?
23. Structured Prediction
Conditional Random Fields (iv)
Problem: sum over output space Y
Σ_t Σ_{y′∈Y} Σ_{j=1}^{|y′|} P(y′|xt) fi(xt, y′j−1, y′j)
We need to show we can compute it for an arbitrary xt:
Σ_{y′∈Y} Σ_{j=1}^{|y′|} P(y′|xt) fi(xt, y′j−1, y′j)
Solution: the forward-backward algorithm
24. Structured Prediction
Forward Algorithm (i)
Let α^m_u be the forward scores
Let |xt| = n
α^m_u is the sum over all labelings of x0 … xm such that ym = u:
α^m_u = Σ_{|y′|=m, y′m=u} e^{w·f(xt, y′)} = Σ_{|y′|=m, y′m=u} e^{Σ_{j=1}^{m} w·f(xt, y′j−1, y′j)}
i.e., the sum over all labelings of length m, ending at position m with label u
Note then that Zxt = Σ_{y′} e^{w·f(xt, y′)} = Σ_u α^n_u
25. Structured Prediction
Forward Algorithm (ii)
We can fill in α as follows:
α^0_u = 1.0 ∀u
α^m_u = Σ_v α^{m−1}_v × e^{w·f(xt, v, u)}
26. Structured Prediction
Backward Algorithm
Let β^m_u be the symmetric backward scores
i.e., the sum over all labelings of xm … xn such that ym = u
We can fill in β as follows:
β^n_u = 1.0 ∀u
β^m_u = Σ_v β^{m+1}_v × e^{w·f(xt, u, v)}
Note: β is overloaded – different from the back-pointers
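The two recursions translate directly into code. A sketch that works in probability space for clarity (a real implementation would use log-space sums to avoid overflow; the names are illustrative):

```python
import math

def forward_backward(n, tags, score):
    """score(i, u, v) = w . f(x, y_{i-1}=u, y_i=v) for positions 0..n-1.
    Returns forward scores alpha, backward scores beta, and Z_x."""
    alpha = [{u: 1.0 for u in tags}]                  # alpha^0_u = 1.0
    for m in range(1, n):
        alpha.append({u: sum(alpha[m - 1][v] * math.exp(score(m, v, u))
                             for v in tags) for u in tags})
    beta = [None] * n
    beta[n - 1] = {u: 1.0 for u in tags}              # beta^n_u = 1.0
    for m in range(n - 2, -1, -1):
        beta[m] = {u: sum(math.exp(score(m + 1, u, v)) * beta[m + 1][v]
                          for v in tags) for u in tags}
    Z = sum(alpha[n - 1][u] for u in tags)            # Z_x = sum_u alpha^n_u
    return alpha, beta, Z
```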
27. Structured Prediction
Conditional Random Fields - Final
Let’s show we can compute it for an arbitrary xt:
Σ_{y′∈Y} Σ_{j=1}^{|y′|} P(y′|xt) fi(xt, y′j−1, y′j)
We can re-write it as a sum over positions and adjacent label pairs:
Σ_j Σ_{u,v} [α^{j−1}_u × e^{w·f(xt, u, v)} × β^j_v / Zxt] × fi(xt, u, v)
where the bracketed factor is the marginal P(yj−1 = u, yj = v | xt)
Forward-backward can calculate the partial derivatives efficiently
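With alpha, beta, and Zxt in hand, the expected feature counts (the second term of the gradient) follow the marginal formula above. A sketch reusing the forward_backward helper; local_feats(j, u, v) is an assumed callback returning the sparse local features:

```python
import math
from collections import Counter

def expected_counts(n, tags, score, local_feats):
    """sum_j sum_{u,v} P(y_{j-1}=u, y_j=v | x) * f(x, u, v)."""
    alpha, beta, Z = forward_backward(n, tags, score)
    expected = Counter()
    for j in range(1, n):
        for u in tags:
            for v in tags:
                # edge marginal: alpha^{j-1}_u * e^{score} * beta^j_v / Z_x
                p = alpha[j - 1][u] * math.exp(score(j, u, v)) * beta[j][v] / Z
                for k, val in local_feats(j, u, v).items():
                    expected[k] += p * val
    return expected
```

Subtracting these expected counts from the empirical counts gives the gradient coordinate by coordinate, exactly as on slide 22.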
28. Structured Prediction
Conditional Random Fields Summary
Inference: Viterbi
Learning: Use the forward-backward algorithm
What about non-sequential problems?
Context-free parsing – can use inside-outside algorithm
General problems – message passing & belief propagation
29. Structured Prediction
Case Study: Dependency Parsing
Given an input sentence x, predict syntactic dependencies y
30. Structured Prediction
Model 1: Arc-Factored Graph-Based Parsing
y′ = arg max_y w · f(x, y) = arg max_y Σ_{(i,j)∈y} w · f(i, j)
(i, j) ∈ y means xi → xj , i.e., a dependency from xi to xj
Solving the argmax
w · f(i, j) is the weight of the arc (i, j)
A dependency tree is a spanning tree of a dense graph over x
Use max spanning tree algorithms for inference
31. Structured Prediction
Defining f(i, j)
Can contain any feature over the arc or the input sentence
Some example features
Identities of xi and xj
Their part-of-speech tags
The part-of-speech of surrounding words
The distance between xi and xj
...
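A sketch of what f(i, j) might look like using the templates above; the exact templates and the distance bucketing are illustrative assumptions:

```python
from collections import Counter

def arc_features(words, pos, i, j):
    """Features for a candidate dependency arc x_i -> x_j."""
    return Counter({
        "head=%s,dep=%s" % (words[i], words[j]): 1.0,           # word identities
        "hpos=%s,dpos=%s" % (pos[i], pos[j]): 1.0,              # POS pair
        "hpos=%s,dpos=%s,dist=%d" % (pos[i], pos[j],
                                     min(abs(i - j), 5)): 1.0,  # bucketed distance
    })
```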
32. Structured Prediction
Empirical Results
Spanning tree dependency parsing results (McDonald 2006)
Trained using MIRA (online SVMs)
          English             Czech               Chinese
          Accuracy  Complete  Accuracy  Complete  Accuracy  Complete
          90.7      36.7      84.1      32.2      79.7      27.2
Simple structured linear classifier
Near state-of-the-art performance for many languages
Higher-order models give higher accuracy
33. Structured Prediction
Model 2: Transition-Based Parsing
y′ = arg max_y w · f(x, y) = arg max_y Σ_{t(s)∈T(y)} w · f(s, t)
t(s) ∈ T(y) means that the derivation of y includes the application of transition t to state s
Solving the argmax
w · f(s, t) is the score of transition t in state s
Use beam search to find the best derivation from the start state s0
34. Structured Prediction
Defining f(s, t)
Can contain any feature over parser states
Some example features
Identities of words in s (e.g., top of stack, head of queue)
Their part-of-speech tags
Their head and dependents (and their part-of-speech tags)
The number of dependents of words in s
...
35. Structured Prediction
Empirical Results
Transition-based dependency parsing with beam search (**)
(Zhang and Nivre 2011)
Trained using perceptron
          English             Chinese
          Accuracy  Complete  Accuracy  Complete
          92.9      48.0      86.0      36.9
Simple structured linear classifier
State-of-the-art performance with rich non-local features
(**) Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search that reduces memory requirements, but it only finds an approximate solution.
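A generic sketch of the beam search described in the footnote, independent of any particular transition system; expand, score, and is_final are assumed callbacks, and expand is assumed to yield at least one transition for every non-final state:

```python
import heapq

def beam_search(start, beam_size, expand, score, is_final):
    """Keep only the beam_size highest-scoring partial derivations at each
    step; returns the best complete derivation found (approximate)."""
    beam = [(0.0, start)]
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))  # keep finished derivations
            else:
                for transition, nxt in expand(state):
                    candidates.append((total + score(state, transition), nxt))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```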
36. Structured Prediction
Structured Prediction Summary
Can’t use multiclass algorithms – search space too large
Solution: factor representations
Can allow for efficient inference and learning
Showed for sequence learning: Viterbi + forward-backward
But also true for other structures
CFG parsing: CKY + inside-outside
Dependency parsing: spanning tree algorithms or beam search
General graphs: message passing and belief propagation