Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools, by Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling whether systems are actually making progress. Traditional human judgments are time-consuming and expensive. On the other hand, existing automatic MT evaluation metrics have several weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– they rely on incomprehensive factors (e.g. precision only).
To address these problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
Experiments on the ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The metrics have been published at top international conferences, e.g. COLING and MT Summit. Evaluation work is closely related to similarity measurement, so it can be further developed for other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further cooperation would be even more exciting.
Multi-system machine translation using online APIs for English-Latvian (Matīss)
This paper describes a hybrid machine translation (HMT) system that employs several online MT system application program interfaces (APIs), forming a multi-system machine translation (MSMT) approach. The goal is to improve the automated translation of English-Latvian texts over each of the individual MT APIs. The best hypothesis translation is selected by calculating the perplexity of each hypothesis. Experiment results show a slight improvement in BLEU score and WER (word error rate).
Real-time Direct Translation System for Sinhala and Tamil Languages (Sheeyam Shellvacumar)
Presented my research on "Real-time Direct Translation System for Sinhala and Tamil Languages" at the FedCSIS 2015 Research Conference hosted by the University of Lodz, Poland, 13-17 September 2015.
Currently, MT is taking on a new cross-paradigm development characterized by the hybridization of data-driven statistical approaches and linguistic knowledge models. This presentation introduces the structural design of CCID's hybrid MT system, including extracting term and lexicon entries from parallel corpora, extracting multi-word expressions (MWEs) in native English from three-tuple comparable corpora, and a new approach to statistical post-editing of rule-based MT for domain adaptation. Research and application of MT and language corpora in areas such as corporate financial narratives will also be discussed in this session.
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010 (ivan provalov)
Two presentations from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by the Cengage Learning search platform development team.
Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them.
Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over an IBM paper and describes the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.
Two Level Disambiguation Model for Query Translation (IJECEIAES)
Selection of the most suitable translation among all translation candidates returned by a bilingual dictionary has always been quite a challenging task for cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation of a user query. Algorithms using such statistics have certain shortcomings, which are the focus of this paper. We propose a novel method for ambiguity resolution, named the "two level disambiguation model". At the first level of disambiguation, the model properly weighs the importance of the translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate being selected as the final translation of a query term. This removes the problem of making a binary decision for translation candidates. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translation of all query terms simultaneously, also taking into account the weights of the translation alternatives. This is contrary to previous research, which selects the translation for each word in the source-language query independently. Experimental results on English-Hindi cross-language information retrieval show that the proposed two level disambiguation model achieved 79.53% and 83.50% of monolingual performance, and a 21.11% and 17.36% improvement over greedy disambiguation strategies, in terms of MAP for short and long queries respectively.
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA... (ijnlc)
The quality of Neural Machine Translation (NMT) systems, like that of Statistical Machine Translation (SMT) systems, heavily depends on the size of the training data set, while for some pairs of languages high-quality parallel data are a scarce resource. To respond to this low-resource training data bottleneck, we employ the pivoting approach in both the neural MT and statistical MT frameworks. In our experiments on Persian-Spanish, taken as an under-resourced translation task, we found that this method, in both frameworks, significantly improves translation quality in comparison to the standard direct translation approach.
How Masterly Are People at Playing with Their Vocabulary? (Matīss)
In this paper, we describe adaptation of a simple word guessing game that occupied the hearts and minds of people around the world. There are versions for all three Baltic countries and even several versions of each. We specifically pay attention to the Latvian version and look into how people form their guesses given any already uncovered hints. The paper analyses guess patterns, easy and difficult word characteristics, and player behaviour and response.
Processing of multi-word expressions (MWEs) is a known problem for any natural language processing task, and even neural machine translation (NMT) struggles to overcome it. This paper presents results of experiments on investigating NMT attention allocation to MWEs and on improving the automated translation of sentences that contain MWEs in English -> Latvian and English -> Czech NMT systems. Two improvement strategies were explored: (1) bilingual pairs of automatically extracted MWE candidates were added to the parallel corpus used to train the NMT system, and (2) full sentences containing the automatically extracted MWE candidates were added to the parallel corpus. Both approaches increased automated evaluation results. The best result, a 0.99 BLEU point increase, was reached with the first approach, while the second approach achieved only minimal improvements. We also provide the open-source software and tools used for MWE extraction and alignment inspection.
Combining machine translated sentence chunks from multiple MT systems
1. Combining machine translated sentence chunks
from multiple MT systems
Matīss Rikters and Inguna Skadiņa
17th International Conference on Intelligent Text Processing and Computational Linguistics
Konya, Turkey
April 5, 2016
2. Contents
Hybrid Machine Translation
Multi-System Hybrid MT
Simple combining of translations
Combining whole translations
Combining translations of sentence chunks
Combining translations of linguistically motivated chunks
Other work
Future plans
3. Hybrid Machine Translation
Statistical rule generation
Rules for RBMT systems are generated from training corpora
Multi-pass
Process data through RBMT first, and then through SMT
Multi-System hybrid MT
Multiple MT systems run in parallel
4. Multi-System Hybrid MT
Related work:
SMT + RBMT (Ahsan and Kolachina, 2010)
Confusion Networks (Barrault, 2010)
+ Neural Network Model (Freitag et al., 2015)
SMT + EBMT + TM + NE (Santanu et al., 2014)
Recursive sentence decomposition (Mellebeek et al., 2006)
5. Combining whole translations
Translate the full input sentence with multiple MT systems
Choose the best translation as the output
Combining Translations
6. Combining whole translations
Translate the full input sentence with multiple MT systems
Choose the best translation as the output
Combining translations of sentence chunks
Split the sentence into smaller chunks
The chunks are the top level subtrees of the syntax tree of the sentence
Translate each chunk with multiple MT systems
Choose the best translated chunks and combine them
Combining Translations
7. Combining whole translations
Sentence tokenization
Translation with the online MT APIs (Google Translate, Bing Translator, LetsMT)
Selection of the best translation
Output
8. Combining whole translations
Choosing the best translation:
KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history w_f^n:

p(w_n | w_1^(n-1)) = p(w_n | w_f^(n-1)) · ∏_{i=1}^{f-1} b(w_i^(n-1))

where the probability p(w_n | w_f^(n-1)) and the backoff penalties b(w_i^(n-1)) are given by an already-estimated language model. Perplexity is then calculated using this probability: given an unknown probability distribution p and a proposed probability model q, the model is evaluated by determining how well it predicts a separate test sample x1, x2, ..., xN drawn from p.
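The backoff lookup and the perplexity computed from it can be sketched in plain Python. The model tables below are hypothetical toy numbers standing in for a real ARPA-style model; the names (unigram_logp, bigram_logp, etc.) are ours for illustration, not KenLM's API.

```python
# Toy ARPA-style model: log10 probabilities and backoff weights
# (hypothetical numbers; a real system would load these from KenLM).
unigram_logp = {"the": -0.5, "cat": -1.5, "sat": -1.5, "<unk>": -3.0}
unigram_backoff = {"the": -0.3, "cat": -0.2, "sat": -0.2}
bigram_logp = {("the", "cat"): -0.4, ("cat", "sat"): -0.4}

def logp(word, history):
    """log10 p(word | history): use the longest observed n-gram if present,
    otherwise back off to the shorter history, adding the backoff penalty
    b() of the skipped history, as in the formula above."""
    if history and (history[-1], word) in bigram_logp:
        return bigram_logp[(history[-1], word)]
    penalty = unigram_backoff.get(history[-1], 0.0) if history else 0.0
    return penalty + unigram_logp.get(word, unigram_logp["<unk>"])

def perplexity(tokens):
    """Per-word perplexity of a token sequence under the model."""
    total = sum(logp(w, tuple(tokens[:i])) for i, w in enumerate(tokens))
    return 10 ** (-total / len(tokens))
```

A seen bigram ("the cat") is scored directly; an unseen continuation ("the dog") pays the backoff penalty of "the" on top of the unigram score.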
9. Combining whole translations
Choosing the best translation:
A 5-gram language model was trained with
KenLM
JRC-Acquis corpus v. 3.0 (Steinberger, 2006) - 1.4 million Latvian legal
domain sentences
Sentences are scored with the query program that comes with KenLM
10. Combining whole translations
Choosing the best translation:
A 5-gram language model was trained with
KenLM
JRC-Acquis corpus v. 3.0 (Steinberger, 2006) - 1.4 million Latvian legal
domain sentences
Sentences are scored with the query program that comes with KenLM
Test data
1581 random sentences from the JRC-Acquis corpus
Tested with the ACCURAT balanced evaluation corpus - 512
general domain sentences (Skadiņš et al., 2010), but
the results were not as good
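The selection step can be sketched as follows: score every hypothesis with a language model and keep the one with the lowest perplexity. The toy unigram frequencies here are an assumed stand-in for the 5-gram KenLM model scored with the query program.

```python
import math

def perplexity(sentence, logprob):
    """Per-word perplexity; `logprob` stands in for a real language
    model such as the 5-gram KenLM model described above."""
    tokens = sentence.split()
    return 10 ** (-sum(logprob(t) for t in tokens) / len(tokens))

def select_best(hypotheses, logprob):
    """Pick the hypothesis translation with the lowest perplexity,
    i.e. the one the language model finds most fluent."""
    return min(hypotheses, key=lambda h: perplexity(h, logprob))

# Toy unigram frequencies as a stand-in language model.
freq = {"the": 0.1, "cat": 0.01, "sat": 0.01}
lp = lambda t: math.log10(freq.get(t, 1e-6))

hypotheses = ["the cta zat", "the cat sat"]  # e.g. outputs of two MT APIs
best = select_best(hypotheses, lp)           # "the cat sat"
```

The hypothesis with garbled words gets near-zero probabilities, so its perplexity is far higher and it is rejected.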
12. Combining translated chunks of sentences
Sentence tokenization
Syntactic analysis
Sentence chunking
Translation with the online MT APIs (Google Translate, Bing Translator, LetsMT)
Selection of the best chunks
Sentence recomposition
Output
13. Syntactic analysis:
Berkeley Parser (Petrov et al., 2006)
Sentences are split into chunks from the top level subtrees
of the syntax tree
Combining translated chunks of sentences
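The chunking step above can be sketched in plain Python, assuming Penn-style bracketed parser output (the helper functions are ours for illustration, not the Berkeley Parser API):

```python
import re

def children(tree):
    """Split a bracketed tree string into its immediate subtree strings."""
    body = tree.strip()
    body = body[body.index(" ") + 1 : body.rindex(")")]  # drop "(LABEL" and ")"
    subs, depth, start = [], 0, None
    for i, ch in enumerate(body):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                subs.append(body[start:i + 1])
    return subs

def leaves(tree):
    """The word sequence at the leaves of a bracketed subtree."""
    return " ".join(re.findall(r"\([^()\s]+ ([^()\s]+)\)", tree))

def top_level_chunks(parse):
    """Chunks = yields of the top-level subtrees of the syntax tree."""
    return [leaves(c) for c in children(parse)]

tree = "(S (ADVP (RB Recently)) (NP (EX there)) (VP (VBZ is) (NP (NN interest))) (. .))"
chunks = top_level_chunks(tree)  # ["Recently", "there", "is interest", "."]
```

Each chunk is then sent to every MT API separately, and the best-scoring translated chunks are recombined.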
14. Syntactic analysis:
Berkeley Parser (Petrov et al., 2006)
Sentences are split into chunks from the top level subtrees
of the syntax tree
Selection of the best chunk:
5-gram LM trained with KenLM and the JRC-Acquis corpus
Sentences are scored with the query program that comes with KenLM
Combining translated chunks of sentences
15. Syntactic analysis:
Berkeley Parser (Petrov et al., 2006)
Sentences are split into chunks from the top level subtrees
of the syntax tree
Selection of the best chunk:
5-gram LM trained with KenLM and the JRC-Acquis corpus
Sentences are scored with the query program that comes with KenLM
Test data
1581 random sentences from the JRC-Acquis corpus
Tested with the ACCURAT balanced evaluation corpus,
but the results were not as good
Combining translated chunks of sentences
16. System                        BLEU             Hybrid selection
                                MSMT    SyMHyT   Google   Bing   LetsMT
Google Translate                18.09      -      100%      -       -
Bing Translator                 18.87      -        -     100%      -
LetsMT                          30.28      -        -       -     100%
Hybrid Google + Bing            18.73    21.27     74%     26%      -
Hybrid Google + LetsMT          24.50    26.24     25%      -      75%
Hybrid LetsMT + Bing            24.66    26.63      -      24%     76%
Hybrid Google + Bing + LetsMT   22.69    24.72     17%     18%     65%
September 2015 (Rikters and Skadiņa 2016)
Combining translated chunks of sentences
17. Combining translations of
linguistically motivated chunks
An advanced approach to chunking
Traverse the syntax tree bottom up, from right to left
Add a word to the current chunk if
The current chunk is not too long (sentence word count / 4)
The word is non-alphabetic or only one symbol long
The word begins with a genitive phrase («of »)
Otherwise, initialize a new chunk with the word
If chunking results in too many chunks, repeat the process, allowing more words (than sentence word count / 4) in a chunk
18. An advanced approach to chunking
Traverse the syntax tree bottom up, from right to left
Add a word to the current chunk if
The current chunk is not too long (sentence word count / 4)
The word is non-alphabetic or only one symbol long
The word begins with a genitive phrase («of »)
Otherwise, initialize a new chunk with the word
If chunking results in too many chunks, repeat the process, allowing more words (than sentence word count / 4) in a chunk
Changes in the MT API systems
LetsMT API temporarily replaced with Hugo.lv API
Added Yandex API
Combining translations of
linguistically motivated chunks
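A minimal flat-token sketch of the heuristic above. The actual approach walks the syntax tree bottom-up, right to left; treating the sentence as a plain token list, and the exact conditions shown, are simplifying assumptions.

```python
def chunk_tokens(tokens, limit_divisor=4):
    """Right-to-left chunking sketch: keep adding a word to the current
    chunk while the chunk is not too long, the word is non-alphabetic or
    only one symbol long, or the word begins a genitive phrase ("of");
    otherwise start a new chunk."""
    max_len = max(1, len(tokens) // limit_divisor)
    chunks, current = [], []
    for tok in reversed(tokens):
        if not current:
            current = [tok]
        elif (len(current) < max_len      # chunk not too long yet
              or not tok.isalpha()        # punctuation, numbers, symbols
              or len(tok) == 1
              or tok.lower() == "of"):    # keep genitive phrases together
            current.insert(0, tok)
        else:
            chunks.insert(0, current)
            current = [tok]
    if current:
        chunks.insert(0, current)
    return [" ".join(c) for c in chunks]

sentence = ("Recently there has been an increased interest in the automated "
            "discovery of equivalent expressions in different languages .")
chunks = chunk_tokens(sentence.split())
```

On this 18-token sentence the limit is 4 words, so the sketch yields five chunks; note that "of" and the final "." are absorbed into their neighbouring chunks rather than starting new ones.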
20. Selection of the best translation:
6-gram and 12-gram LMs trained with
KenLM
JRC-Acquis corpus v. 3.0
DGT-Translation Memory corpus (Steinberger, 2011) – 3.1 million Latvian
legal domain sentences
Sentences scored with the query program from KenLM
Combining translations of
linguistically motivated chunks
21. Selection of the best translation:
6-gram and 12-gram LMs trained with
KenLM
JRC-Acquis corpus v. 3.0
DGT-Translation Memory corpus (Steinberger, 2011) – 3.1 million Latvian
legal domain sentences
Sentences scored with the query program from KenLM
Test data
1581 random sentences from the JRC-Acquis corpus
ACCURAT balanced evaluation corpus
Combining translations of
linguistically motivated chunks
22. Sentence chunks with SyMHyT vs. with ChunkMT
SyMHyT:
• Recently
• there
• has been an increased interest in the automated discovery of equivalent expressions in different languages
• .
ChunkMT:
• Recently there has been an increased interest
• in the automated discovery of equivalent expressions
• in different languages .
Combining translations of
linguistically motivated chunks
25. System                        BLEU    Equal    Bing     Google   Hugo     Yandex
Single-system BLEU                  -       -      17.43    17.73    17.14    16.04
MSMT - Google + Bing              17.70    7.25%   43.85%   48.90%     -        -
MSMT - Google + Bing + LetsMT     17.63    3.55%   33.71%   30.76%   31.98%     -
SyMHyT - Google + Bing            17.95    4.11%   19.46%   76.43%     -        -
SyMHyT - Google + Bing + LetsMT   17.30    3.88%   15.23%   19.48%   61.41%     -
ChunkMT - Google + Bing           18.29   22.75%   39.10%   38.15%     -        -
ChunkMT - all four                19.21    7.36%   30.01%   19.47%   32.25%   10.91%
January 2016
Combining translations of
linguistically motivated chunks
26. Related publications
• Matīss Rikters. "Multi-system machine translation using online APIs for English-Latvian." ACL-IJCNLP 2015
• Matīss Rikters and Inguna Skadiņa. "Syntax-based multi-system machine translation." LREC 2016
27. K-translate - interactive multi-system machine translation
About the same as ChunkMT but with a nice user interface
Draws a syntax tree with chunks highlighted
Designates which chunks were chosen from which system
Provides a confidence score for the choices
Allows using online APIs or user provided machine translations
Comes with resources for translating between English, French, German and Latvian
Can be used in a web browser
Work in progress
28. K-translate - interactive multi-system machine translation
(Interface screenshots: start page; translate with online systems; input translations to combine; input translated chunks; settings; translation results; input source sentence)
Work in progress
30. Future work
More enhancements for the chunking step
Add special processing of multi-word expressions (MWEs)
Try out other types of LMs
POS tag + lemma
Recurrent Neural Network Language Model
(Mikolov et al., 2010)
Continuous Space Language Model
(Schwenk et al., 2006)
Character-Aware Neural Language Model
(Kim et al., 2015)
Choose the best translation candidate with MT quality estimation
QuEst++ (Specia et al., 2015)
SHEF-NN (Shah et al., 2015)
Future ideas
31. References
Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA - The Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, 2010.
Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.
Pal, Santanu, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing, 2014.
Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." 2006.
Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058, 2006.
Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
Steinberger, Ralf, et al. "DGT-TM: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226, 2013.
Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 2192, 125-132, 2010.
Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH, 2010.
Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL Main Conference Poster Sessions. Association for Computational Linguistics, 2006.
Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615, 2015.
Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." ACL-IJCNLP 2015 System Demonstrations, 2015.
Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015.