This document discusses discrepancies between automatic machine translation evaluation metrics, such as BLEU and RIBES, and human judgements of translation quality. The authors find that higher scores on these metrics do not always correspond to translations that humans judge to be better. Translations separated by only minor BLEU differences can nonetheless contain major adequacy errors that the metric fails to detect. The metrics may also correlate better with human judgements for translations rated positively than for those rated negatively. The authors conclude that a higher BLEU or RIBES score does not necessarily mean the machine translation is better.
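To make the BLEU point concrete, here is a minimal illustrative sketch (not taken from the paper; the sentences and scores are invented for illustration, and the sacrebleu package is assumed to be installed). Because BLEU measures n-gram overlap with a reference, a hypothesis that inserts a single negating word shares almost all of its n-grams with the reference and still scores high, even though its meaning is reversed.

```python
# Illustrative only: shows how n-gram overlap (BLEU) can stay high while
# adequacy collapses. Assumes `pip install sacrebleu`; sentences are invented.
import sacrebleu

reference = ["The committee approved the proposal after a long debate."]

# Hypothesis A matches the reference; hypothesis B negates the meaning
# but shares nearly every n-gram with the reference.
hyp_adequate = ["The committee approved the proposal after a long debate."]
hyp_negated = ["The committee never approved the proposal after a long debate."]

for name, hyp in [("adequate", hyp_adequate), ("negated", hyp_negated)]:
    bleu = sacrebleu.corpus_bleu(hyp, [reference])
    print(f"{name}: BLEU = {bleu.score:.1f}")

# The negated hypothesis keeps a high BLEU score from n-gram overlap alone,
# although a human judge would flag it as a serious translation error.
```

This kind of single-sentence contrast is only a toy demonstration of the general finding the paper reports at corpus level: small differences in metric scores need not track differences in adequacy.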