Russian Learner Translator Corpus: design, research potential and applications
17th International Conference on Text, Speech and Dialogue Brno, Czech Republic, September 8–12 2014
RusLTC at TSD-2014 (Brno)
1. RUSSIAN LEARNER TRANSLATOR CORPUS:
design, research potential and
applications
Andrey Kutuzov
National Research University Higher School of Economics
Maria Kunilovskaya
Tyumen State University
17th International Conference on Text, Speech and Dialogue
Brno, Czech Republic, September 8–12 2014
2. General description
• inspired by MeLLANGE
• online and downloadable http://rus-ltc.org
• 1.3 million tokens
• translations from 10 universities
• 11 source text genres (incl. essays, educational,
informational)
• multiple translations per source: 263 sources, 1952 translations
• bi-directional:
approx. 200 English STs (≈300 thousand tokens) with their 1300
Russian translations (≈700 thousand tokens), and
over 40 Russian STs and approx. 600 English translations
• 10 types of linguistic and extralinguistic metadata
• Lexical and POS query interface (Freeling-based linguistic
mark-up)
RusLTC at TSD-2014 2
3. Corpus design
1) Txt-archive structured by file-naming conventions
RU_1_23.txt and EN_1_23_9.txt
RU_1_23.head.txt and EN_1_23_9.head.txt
2) TMX file
• pair-wise alignment with LF Aligner in batch mode
• manual correction (Olifant / Heartsome TMX editors)
• merging TUVs with identical source segments + adding XML tags to
link segments to head files (a homegrown script)
3) Error-tagged subcorpus
• a collection of 265 annotated translations (for 33 sources)
• stand-off, machine-readable annotation
• pre-defined error classification
• 6,471 error tags
• online tag editor based on brat http://brat.nlplab.org/index.html
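The merging step in 2) can be illustrated with a minimal sketch. It is not the project's homegrown script, which the slides do not show. It assumes a TMX body made of `<tu>` elements, each holding one source-language `<tuv>` plus its translations, and folds every unit whose source segment text is identical into a single unit with several target variants. The function name `merge_identical_sources` and the sample sentences are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def merge_identical_sources(tmx_root, src_lang="en"):
    """Merge <tu> elements whose source-language <seg> text is identical.

    Assumption (not from the slides): each <tu> holds exactly one <tuv>
    in the source language; all other <tuv> elements are translations.
    """
    body = tmx_root.find("body")
    merged = {}  # source text -> the first <tu> seen with that text
    for tu in list(body):
        source = None
        for tuv in tu.findall("tuv"):
            if tuv.get(XML_LANG) == src_lang:
                source = tuv.findtext("seg", default="").strip()
        if source is None:
            continue  # no source variant: leave the unit untouched
        if source not in merged:
            merged[source] = tu
            continue
        # Move the translations into the unit that already has this source.
        for tuv in tu.findall("tuv"):
            if tuv.get(XML_LANG) != src_lang:
                merged[source].append(tuv)
        body.remove(tu)
    return tmx_root


SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello.</seg></tuv><tuv xml:lang="ru"><seg>Привет.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Hello.</seg></tuv><tuv xml:lang="ru"><seg>Здравствуйте.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Bye.</seg></tuv><tuv xml:lang="ru"><seg>Пока.</seg></tuv></tu>
</body></tmx>"""

root = merge_identical_sources(ET.fromstring(SAMPLE))
units = root.find("body").findall("tu")
```

After the call, the two units that share the source "Hello." become one unit with one English and two Russian variants, which is the shape a multiple-translation corpus needs for side-by-side queries.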
6. Application and Research
RusLTC is a general-purpose data source for translation studies
and translation education research, including the study of
1. variation and choice in translation;
2. ’translationese’ and the translator interlanguage;
3. interdependence between the translation characteristics
and various metadata (direction and conditions of
translation, source text genre);
4. translation-related “problem areas” or rich points in source
texts;
5. translation quality and translation quality assessment (TQA).
Direct use
• in the curriculum and materials design
• as a teaching and learning aid.
7. RusLTC research: gender asymmetry
in translated texts
1) The same gender asymmetry in male and
female translations as in Russian originals
(based on lexical variety)
2) Sentence length figures for female
translations contradict similar statistics for
originals
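The slide does not say how lexical variety was measured. As an illustration only, the sketch below uses the type-token ratio, a common simple measure. The tokenizer, the function name and the sample sentence are assumptions, and a real comparison would need texts of comparable length because the ratio falls as texts grow.

```python
import re


def type_token_ratio(text):
    """Return distinct word forms divided by total word tokens (0.0 if empty)."""
    # \w+ matches runs of letters and digits, including Cyrillic, in Python 3.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "the cat saw the dog and the dog saw the cat"
ratio = type_token_ratio(sample)  # 5 distinct forms over 11 tokens
```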
8. Research based on RusLTC: splitting in
EN-RU translation
1) types of syntactic structures that undergo
splitting in English-Russian translation:
– coordination with “, and”
– non-restrictive relative clauses
2) most frequent mistakes associated with splitting:
– loss or misinterpretation of semantic relations
between propositions,
– issues with anaphora resolution and
– greater communicative value acquired by upgraded
sentences.
9. Error-tagged part: inter-rater reliability
AIM: to gauge the reliability of mark-up results based on the
proposed error classification and to establish the areas of
disagreement
[Figure: chart on the slide; data labels not recoverable from the transcript]
α=0.734 versus α=0.569
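The slide reports two α values but does not name the coefficient. Assuming, purely for illustration, that α is Krippendorff's alpha over nominal error categories, agreement between two annotators could be computed as in the sketch below. The function name and the toy labels are hypothetical and are not RusLTC data.

```python
from collections import Counter


def krippendorff_alpha_nominal(rater_a, rater_b):
    """Krippendorff's alpha for two raters, nominal labels, no missing data."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = 2 * len(rater_a)  # total number of labels assigned
    # Observed disagreement: each disagreeing unit yields two ordered pairs.
    disagreeing_units = sum(a != b for a, b in zip(rater_a, rater_b))
    d_observed = 2 * disagreeing_units / n
    # Expected disagreement from the pooled label frequencies.
    counts = Counter(rater_a) + Counter(rater_b)
    mismatched_pairs = n * n - sum(c * c for c in counts.values())
    d_expected = mismatched_pairs / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1 - d_observed / d_expected


annotator_1 = ["lexis", "grammar", "lexis", "style"]
annotator_2 = ["lexis", "grammar", "style", "style"]
alpha = krippendorff_alpha_nominal(annotator_1, annotator_2)
```

In the toy data the annotators disagree on one of four spans, so observed disagreement is 0.25 against an expected 0.75, which gives an alpha of about 0.67.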
10. Error statistics analysis to inform translation
didactics
Hypothesis 1: The better one knows L1, the better she
understands the source/the better the transfer skills.
Hypothesis 2: Final-year students make fewer mistakes than
4th-year students
Hypothesis 3: Test translations show better results than routine
translations because students are more motivated to
perform better
Hypothesis 4: The quantitative results of the error annotation
depend on the order of translations in the set (“order
effect”)
11. Use in the classroom
1) Students have online access to:
• their own error-tagged and commented translations;
• peer translations;
• mistake statistics that reflect their individual
progress and difficulties.
12. 2) Students’ rating based on the
quality of final translation
Quality parameters used for consecutive ranking to arrive at relative evaluation:
1. number of critical errors,
2. number of content errors and
3. total number of mistakes.
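One reading of "consecutive ranking" is a lexicographic sort: compare critical errors first, break ties on content errors, then on the total number of mistakes, with fewer being better. The sketch below follows that reading, which is an assumption; the class, the field names and the scores are invented for illustration.

```python
from typing import NamedTuple


class TranslationScore(NamedTuple):
    student: str
    critical: int  # number of critical errors
    content: int   # number of content errors
    total: int     # total number of mistakes


def rank(scores):
    """Order translations best-first by the three parameters in turn."""
    return sorted(scores, key=lambda s: (s.critical, s.content, s.total))


scores = [
    TranslationScore("A", critical=1, content=2, total=9),
    TranslationScore("B", critical=0, content=4, total=12),
    TranslationScore("C", critical=0, content=4, total=10),
    TranslationScore("D", critical=0, content=3, total=15),
]
ranking = [s.student for s in rank(scores)]
```

Here D ranks first despite having the most mistakes overall, because it has no critical errors and the fewest content errors; A ranks last on its single critical error.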
13. 3) Follow students’ individual
progress over the year
(based on the total number of mistakes normalized by the text
size)
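Normalizing by text size makes translations of different lengths comparable. A minimal sketch, assuming text size is counted in tokens and the rate is scaled to mistakes per 100 tokens (both choices are assumptions, as are the function name and the sample figures):

```python
def mistakes_per_100_tokens(total_mistakes, token_count):
    """Return the number of mistakes per 100 tokens of translated text."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 100 * total_mistakes / token_count


# (label, total mistakes, tokens in the translation) for one student
history = [("autumn", 18, 450), ("winter", 12, 400), ("spring", 9, 450)]
progress = [(label, mistakes_per_100_tokens(m, n)) for label, m, n in history]
```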
14. 4) Think of remedial activities
The top ten mistakes in the sample
15. 1) Theory-based exercises utilizing multiple
concordances
• discussing translation strategies, identifying translation problems
and comparing/evaluating solutions
• developing skills to overcome known transfer issues in English-
Russian translation which are due to interlingual typological
differences
2) Corpus-driven exercises to prevent most
common mistakes
• developing L1 competence through building up corpus-querying
and documentary research skills;
• extending the scope of world knowledge through information
search and developing text analysis and text comprehension
aptitude.
5) Design materials and teaching aids
16. Summary
1) Russian Learner Translator Corpus is an available and
extensive source of data for translation studies and
translator education research (http://www.rus-ltc.org/);
2) The error-tagged subcorpus (http://dev.rus-ltc.org/brat/#/rusltc/)
is a way to give students extensive feedback on their translations
3) and a means of accumulating research data on TQA;
4) RusLTC content is used in designing teaching materials.
Thank you!