The document describes a strategy for extracting domain-specific comparable corpora from Wikipedia. The strategy involves identifying comparable articles, building a characteristic vocabulary for the domain of interest, exploring the Wikipedia category graph to select relevant categories, and extracting parallel sentence fragments through similarity modeling and thresholding. The extracted corpora are then used to augment existing training data for an English-Spanish phrase-based machine translation system, demonstrating improvements on in-domain and closely related out-of-domain test sets over the baseline system trained only on Europarl.
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We have released the first Apache release (v0.5.0-incubating) on Mar 5, 2018 and the project plans to release v0.5.2 in Q2, 2018.
We will first give a quick walk-through of features, usages, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth such as DataFrame integration and Spark 2.3 supports in Hivemall.
Power and Clock Gating Modelling in Coarse Grained Reconfigurable SystemsMDC_UNICA
Power reduction is one of the biggest challenges in modern systems and tends to become a severe issue dealing with complex scenarios. To provide high-performance and exibility, designers often opt for coarse-grained recongurable (CGR) systems. Nevertheless, these systems require specic attention to the power problem, since large set of resources may be underutilized while computing a certain task.
This paper focuses on this issue. Targeting CGR devices, we propose a way to model in advance power and clock gating costs on the basis of the functional, technological and architectural parameters of the baseline CGR system. The
proposed flow guides designers towards optimal implementations, saving designer eort and time.
EzPAARSE is open source software that analyses your locally gathered proxy logfiles and provides you with COUNTER-deduplicated, KBART-formatted and geolocalised reports of your users’ accesses to subscribed e-resources. Come and watch us demo it live to understand how it works and learn how to install it in your institution for producing your own enriched measures and indicators.
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We have released the first Apache release (v0.5.0-incubating) on Mar 5, 2018 and the project plans to release v0.5.2 in Q2, 2018.
We will first give a quick walk-through of features, usages, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth such as DataFrame integration and Spark 2.3 supports in Hivemall.
Power and Clock Gating Modelling in Coarse Grained Reconfigurable SystemsMDC_UNICA
Power reduction is one of the biggest challenges in modern systems and tends to become a severe issue dealing with complex scenarios. To provide high-performance and exibility, designers often opt for coarse-grained recongurable (CGR) systems. Nevertheless, these systems require specic attention to the power problem, since large set of resources may be underutilized while computing a certain task.
This paper focuses on this issue. Targeting CGR devices, we propose a way to model in advance power and clock gating costs on the basis of the functional, technological and architectural parameters of the baseline CGR system. The
proposed flow guides designers towards optimal implementations, saving designer eort and time.
EzPAARSE is open source software that analyses your locally gathered proxy logfiles and provides you with COUNTER-deduplicated, KBART-formatted and geolocalised reports of your users’ accesses to subscribed e-resources. Come and watch us demo it live to understand how it works and learn how to install it in your institution for producing your own enriched measures and indicators.
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Databricks
In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes bound to happen. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
Fast and Reliable Apache Spark SQL EngineDatabricks
Building the next generation Spark SQL engine at speed poses new challenges to both automation and testing. At Databricks, we are implementing a new testing framework for assessing the quality and performance of new developments as they produced. Having more than 1,200 worldwide contributors, Apache Spark follows a rapid pace of development. At this scale, new testing tooling such as random query and data generation, fault injection, longevity stress, and scalability tests are essential to guarantee a reliable and performance Spark later in production. By applying such techniques, we will demonstrate the effectiveness of our testing infrastructure by drilling-down into cases where correctness and performance regressions have been found early. In addition, showing how they have been root-caused and fixed to prevent regressions in production and boosting the continuous delivery of new features.
Installation
resilience.io Package Overview
Using the model –step by step
resilience.io Testing Capabilities (and Limitations)
resilience.io Use Examples
Q&A / Interactive Session
FEATool Multiphysics Matlab FEM and CFD Toolbox - v1.6 Quickstart GuideFEATool Multiphysics
FEATool Multiphysics v1.6 Quickstart Guide
FEATool Multiphysics is a fully integrated and easy to use Matlab Multiphysics PDE and FEM Finite Element Analysis simulation toolbox, featuring built-in support for heat transfer, computational fluid dynamics CFD, chemical and reaction engineering, and structural mechanics modeling and simulation.
Visit https://www.featool.com for more information.
For more course tutorials visit
www.newtonhelp.com
ECET 365 Lab 1 Using the Serial Communication Interface in a Microcontroller
ECET 365 Lab 2 Temperature Measuring System using a Microcontroller
A presentation of the on-going work on interoperability within the toolchain. A new domain OSLC KM is introduced, some experiments for reusing models are also presented and, some videos are also used to present some user stories.
For more course tutorials visit www.newtonhelp.com
ECET 365 Lab 1 Using the Serial Communication Interface in a Microcontroller
ECET 365 Lab 2 Temperature Measuring System using a
Geospatial Synergy: Amplifying Efficiency with FME & Esri ft. Peak Guest Spea...Safe Software
Dive deep into the world of geospatial data management and transformation in our upcoming webinar focusing on the powerful integration of FME and Esri technologies. This insightful session comprises two compelling segments aimed at enhancing your geospatial workflows, while minimizing operational hurdles.
In the first segment, guest speaker Jan Roggisch from Locus unveils how Auckland Council triumphed over the challenges of handling large, frequent data updates on ArcGIS Online using FME. Discover the journey from manual data handling to an automated, streamlined process that reduced server downtime from minutes to seconds: setting a new standard for local government organizations.
The second segment, led by James Botterill from 1Spatial, unveils the magic of incorporating ArcPy into your FME workflows. Delve into real-world scenarios where ArcGIS geoprocessing is harmoniously orchestrated within FME using the PythonCaller. Gain insights into raster-vector data conversion, spatial analysis, and a host of practical tips and tricks that empower you to leverage the combined capabilities of FME and Esri for efficient data manipulation and conversion.
Join us to explore the remarkable possibilities that open up when FME and Esri technologies converge – enhancing your ability to manage and transform geospatial data with unprecedented efficiency.
Geospatial Synergy: Amplifying Efficiency with FME & EsriSafe Software
Dive deep into the world of geospatial data management and transformation in our upcoming webinar focusing on the powerful integration of FME and Esri technologies. This insightful session comprises two compelling segments aimed at enhancing your geospatial workflows, while minimizing operational hurdles.
In the first segment, guest speaker Jan Roggisch from Locus unveils how Auckland Council triumphed over the challenges of handling large, frequent data updates on ArcGIS Online using FME. Discover the journey from manual data handling to an automated, streamlined process that reduced server downtime from minutes to seconds: setting a new standard for local government organizations.
The second segment, led by James Botterill from 1Spatial, unveils the magic of incorporating ArcPy into your FME workflows. Delve into real-world scenarios where ArcGIS geoprocessing is harmoniously orchestrated within FME using the PythonCaller. Gain insights into raster-vector data conversion, spatial analysis, and a host of practical tips and tricks that empower you to leverage the combined capabilities of FME and Esri for efficient data manipulation and conversion.
Join us to explore the remarkable possibilities that open up when FME and Esri technologies converge – enhancing your ability to manage and transform geospatial data with unprecedented efficiency.
Benchmarking Commercial RDF Stores with Publications Office DatasetGhislain Atemezing
The slides present a benchmark of RDF stores with real-world datasets and queries from the EU Publications Office (PO). The study compares the performance of four commercial triple stores: Stardog 4.3 EE, GraphDB 8.0.3 EE, Oracle 12.2c and Virtuoso 7.2.4.2 with respect to the following requirements: bulk loading, scalability, stability and query execution.
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Databricks
In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes bound to happen. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
Fast and Reliable Apache Spark SQL EngineDatabricks
Building the next generation Spark SQL engine at speed poses new challenges to both automation and testing. At Databricks, we are implementing a new testing framework for assessing the quality and performance of new developments as they produced. Having more than 1,200 worldwide contributors, Apache Spark follows a rapid pace of development. At this scale, new testing tooling such as random query and data generation, fault injection, longevity stress, and scalability tests are essential to guarantee a reliable and performance Spark later in production. By applying such techniques, we will demonstrate the effectiveness of our testing infrastructure by drilling-down into cases where correctness and performance regressions have been found early. In addition, showing how they have been root-caused and fixed to prevent regressions in production and boosting the continuous delivery of new features.
Installation
resilience.io Package Overview
Using the model –step by step
resilience.io Testing Capabilities (and Limitations)
resilience.io Use Examples
Q&A / Interactive Session
FEATool Multiphysics Matlab FEM and CFD Toolbox - v1.6 Quickstart GuideFEATool Multiphysics
FEATool Multiphysics v1.6 Quickstart Guide
FEATool Multiphysics is a fully integrated and easy to use Matlab Multiphysics PDE and FEM Finite Element Analysis simulation toolbox, featuring built-in support for heat transfer, computational fluid dynamics CFD, chemical and reaction engineering, and structural mechanics modeling and simulation.
Visit https://www.featool.com for more information.
For more course tutorials visit
www.newtonhelp.com
ECET 365 Lab 1 Using the Serial Communication Interface in a Microcontroller
ECET 365 Lab 2 Temperature Measuring System using a Microcontroller
A presentation of the on-going work on interoperability within the toolchain. A new domain OSLC KM is introduced, some experiments for reusing models are also presented and, some videos are also used to present some user stories.
For more course tutorials visit www.newtonhelp.com
ECET 365 Lab 1 Using the Serial Communication Interface in a Microcontroller
ECET 365 Lab 2 Temperature Measuring System using a
Geospatial Synergy: Amplifying Efficiency with FME & Esri ft. Peak Guest Spea...Safe Software
Dive deep into the world of geospatial data management and transformation in our upcoming webinar focusing on the powerful integration of FME and Esri technologies. This insightful session comprises two compelling segments aimed at enhancing your geospatial workflows, while minimizing operational hurdles.
In the first segment, guest speaker Jan Roggisch from Locus unveils how Auckland Council triumphed over the challenges of handling large, frequent data updates on ArcGIS Online using FME. Discover the journey from manual data handling to an automated, streamlined process that reduced server downtime from minutes to seconds: setting a new standard for local government organizations.
The second segment, led by James Botterill from 1Spatial, unveils the magic of incorporating ArcPy into your FME workflows. Delve into real-world scenarios where ArcGIS geoprocessing is harmoniously orchestrated within FME using the PythonCaller. Gain insights into raster-vector data conversion, spatial analysis, and a host of practical tips and tricks that empower you to leverage the combined capabilities of FME and Esri for efficient data manipulation and conversion.
Join us to explore the remarkable possibilities that open up when FME and Esri technologies converge – enhancing your ability to manage and transform geospatial data with unprecedented efficiency.
Geospatial Synergy: Amplifying Efficiency with FME & EsriSafe Software
Dive deep into the world of geospatial data management and transformation in our upcoming webinar focusing on the powerful integration of FME and Esri technologies. This insightful session comprises two compelling segments aimed at enhancing your geospatial workflows, while minimizing operational hurdles.
In the first segment, guest speaker Jan Roggisch from Locus unveils how Auckland Council triumphed over the challenges of handling large, frequent data updates on ArcGIS Online using FME. Discover the journey from manual data handling to an automated, streamlined process that reduced server downtime from minutes to seconds: setting a new standard for local government organizations.
The second segment, led by James Botterill from 1Spatial, unveils the magic of incorporating ArcPy into your FME workflows. Delve into real-world scenarios where ArcGIS geoprocessing is harmoniously orchestrated within FME using the PythonCaller. Gain insights into raster-vector data conversion, spatial analysis, and a host of practical tips and tricks that empower you to leverage the combined capabilities of FME and Esri for efficient data manipulation and conversion.
Join us to explore the remarkable possibilities that open up when FME and Esri technologies converge – enhancing your ability to manage and transform geospatial data with unprecedented efficiency.
Benchmarking Commercial RDF Stores with Publications Office DatasetGhislain Atemezing
The slides present a benchmark of RDF stores with real-world datasets and queries from the EU Publications Office (PO). The study compares the performance of four commercial triple stores: Stardog 4.3 EE, GraphDB 8.0.3 EE, Oracle 12.2c and Virtuoso 7.2.4.2 with respect to the following requirements: bulk loading, scalability, stability and query execution.
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Delivering Micro-Credentials in Technical and Vocational Education and TrainingAG2 Design
Explore how micro-credentials are transforming Technical and Vocational Education and Training (TVET) with this comprehensive slide deck. Discover what micro-credentials are, their importance in TVET, the advantages they offer, and the insights from industry experts. Additionally, learn about the top software applications available for creating and managing micro-credentials. This presentation also includes valuable resources and a discussion on the future of these specialised certifications.
For more detailed information on delivering micro-credentials in TVET, visit this https://tvettrainer.com/delivering-micro-credentials-in-tvet/
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Alberto Barrón-Cedeño - 2015 - A Factory of Comparable Corpora from Wikipedia
1. A Factory of Comparable
Corpora from Wikipedia
Alberto Barrón-Cedeño1, Cristina España-Bonet2,
Josu Boldoba2, and Lluís Màrquez1
1Qatar Computing Research Institute, HBKU, Qatar
2TALP Research Center, UPC, Spain
{albarron, lmarquez}@qf.org.qa
cristinae@cs.upc.edu jboldoba08@gmail.com
BUCC @ ACL – July 2015
2. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
3. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
4. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
5. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
6. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
7. Background
There are tons of articles exploiting Wikipedia as a comparable corpus
Factory of Comparable Corpora BUCC 2015 0/26
8. Background
Nevertheless…
• Little attention is paid to identifying a domain-specific high-quality
comparable corpus
• Domain-specific corpora is a key factor in different tasks, including MT
Factory of Comparable Corpora BUCC 2015 1/26
9. Background
Nevertheless…
• Little attention is paid to identifying a domain-specific high-quality
comparable corpus
• Domain-specific corpora is a key factor in different tasks, including MT
• Wikipedia includes (somehow) all the information necessary to extract
such a resource
Factory of Comparable Corpora BUCC 2015 1/26
10. Background
Nevertheless…
• Little attention is paid to identifying a domain-specific high-quality
comparable corpus
• Domain-specific corpora is a key factor in different tasks, including MT
• Wikipedia includes (somehow) all the information necessary to extract
such a resource
Our aim is to identify those domain-specific comparable corpora from
Wikipedia!
Factory of Comparable Corpora BUCC 2015 1/26
12. Background: Strategy Overview
• Identify comparable articles (easy)
• Build a characteristic vocabulary for the domain of interest (not so easy)
Factory of Comparable Corpora BUCC 2015 2/26
13. Background: Strategy Overview
• Identify comparable articles (easy)
• Build a characteristic vocabulary for the domain of interest (not so easy)
• Explore the Wikipedia categories’ graph to select the subset of
categories in the domain (difficult)
• Brute-force sentence-wise comparison for parallel pairs identification
Factory of Comparable Corpora BUCC 2015 2/26
15. Comparable Corpora
Problem No large collections of comparable texts for all domains and
language pairs exist
Objective To extract high-quality comparable corpora on specific
domains
Pilot language pair English–Spanish
Pilot domains Science, Computer Science, Sports
Currently experimenting on more than 700 domains and 10 languages
Factory of Comparable Corpora BUCC 2015 4/26
16. Comparable Corpora: Characteristic Vocabulary
1 Retrieve every article associated to the top category of the domain
(e.g., Sports)
Factory of Comparable Corpora BUCC 2015 5/26
17. Comparable Corpora: Characteristic Vocabulary
1 Retrieve every article associated to the top category of the domain
(e.g., Sports)
2 Merge the articles’ contents and apply standard and ad-hoc
pre-processing
Factory of Comparable Corpora BUCC 2015 5/26
18. Comparable Corpora: Characteristic Vocabulary
1 Retrieve every article associated to the top category of the domain
(e.g., Sports)
2 Merge the articles’ contents and apply standard and ad-hoc
pre-processing
3 Select the top-k tf-sorted tokens as the characteristic vocabulary
(we consider 10% of the tokens)
Factory of Comparable Corpora BUCC 2015 5/26
19. Comparable Corpora: Characteristic Vocabulary
1 Retrieve every article associated to the top category of the domain
(e.g., Sports)
2 Merge the articles’ contents and apply standard and ad-hoc
pre-processing
3 Select the top-k tf-sorted tokens as the characteristic vocabulary
(we consider 10% of the tokens)
Articles Vocabulary
en es en es
CS 4 130 106 447
Sc 29 3 464 140
Sp 3 10 122 100
Factory of Comparable Corpora BUCC 2015 5/26
20. Comparable Corpora: Graph exploration
Slice of the Spanish Wikipedia category graph departing from categories
Sport and Science (as in Spring 2015)
Sport
Sports
Mountain
sports
MountaineeringMountains
Mountains
of Andorra
Pyrenees
Science
Scientific
disciplines
Natural
sciences
Earth
sciencies
Geology
Geology
by country
Geology
of Spain
Mountains
by country
Mountain ran-
ges of Spain
Mountains of
the Pyrenees
Factory of Comparable Corpora BUCC 2015 6/26
21. Comparable Corpora: Graph exploration
1 Perform a breadth-first search departing from the root category
2 Visit nodes only once to avoid loops and repeating traversed paths
3 Stop at the level when most categories do not belong to the domain
Factory of Comparable Corpora BUCC 2015 7/26
22. Comparable Corpora: Graph exploration
1 Perform a breadth-first search departing from the root category
2 Visit nodes only once to avoid loops and repeating traversed paths
3 Stop at the level when most categories do not belong to the domain
Stopping criterion
Heuristic A category belongs to the domain if its title contains at
least one term from the characteristic vocabulary
Explore until a minimum percentage of the categories in
a tree level belong to the domain
Factory of Comparable Corpora BUCC 2015 7/26
23. Comparable Corpora: Graph exploration
1 Perform a breadth-first search departing from the root category
2 Visit nodes only once to avoid loops and repeating traversed paths
3 Stop at the level when most categories do not belong to the domain
Stopping criterion
Heuristic A category belongs to the domain if its title contains at
least one term from the characteristic vocabulary
Explore until a minimum percentage of the categories in
a tree level belong to the domain
Category pato in Spanish —literally "duck"— refers to a sport rather than an animal!!!
Factory of Comparable Corpora BUCC 2015 7/26
24. Comparable Corpora: Graph exploration
Article pairs selected according to two criteria: 50% and 60%
Articles Distance from the root
50% 60% 50% 60%
en-es en-es en es en es
CS 18,168 8,251 6 5 5 5
Sc 161,130 21,459 6 4 4 4
Sp 72,315 1,980 8 8 3 4
Factory of Comparable Corpora BUCC 2015 8/26
26. Parallelisation: Similarity Models
• Character 3-grams (cosine) [McNamee and Mayfield, 2004]
• Pseudo-cognates (cosine) [Simard et al., 1992]
• Translated word 1-grams in both directions (cosine)
• Length factor [Pouliquen et al., 2003]
0
0.2
0.4
0.6
0.8
1
0 10000 20000 30000 40000 50000 60000 70000 80000
Probable lengths of translations of d
|d| = 30000
es
en
Factory of Comparable Corpora BUCC 2015 10/26
27. Parallelisation: Corpus for Preliminary Evaluation
• 30 article pairs (10 per domain)
• Annotated at sentence level
• Three classes: parallel, comparable, and other
• Each pair was annotated by 2 volunteers mean Cohen’s κ ∼ 0.7
Factory of Comparable Corpora BUCC 2015 11/26
28. Parallelisation: Threshold Definition
c3g cog monoen monoes len
Thres. 0.25 0.30 0.20 0.15 0.90
P 0.28 0.16 0.30 0.26 0.08
R 0.53 0.49 0.46 0.34 0.57
F1 0.36 0.24 0.36 0.30 0.14
Factory of Comparable Corpora BUCC 2015 12/26
29. Parallelisation: Threshold Definition
c3g cog monoen monoes len
Thres. 0.25 0.30 0.20 0.15 0.90
P 0.28 0.16 0.30 0.26 0.08
R 0.53 0.49 0.46 0.34 0.57
F1 0.36 0.24 0.36 0.30 0.14
¯S ¯S·len S · F1 S · F1·len
Thres. 0.25 0.15 0.05 0.05
P 0.27 0.33 0.18 0.32
R 0.50 0.62 0.77 0.65
F1 0.35 0.43 0.29 0.43
S =similarities; · = average
Factory of Comparable Corpora BUCC 2015 12/26
32. Impact: Corpora
in domain out of domain
Training Wikipedia Europarl
Development Wikipedia News commentary
Test Wikipedia/Gnome News commentary
Factory of Comparable Corpora BUCC 2015 15/26
33. Impact: Corpora
in domain out of domain
Training Wikipedia Europarl
Development Wikipedia News commentary
Test Wikipedia/Gnome News commentary
Generation of the Wikipedia dev and test sets
1 Select only sentences starting with a letter and longer than three tokens
2 Compute the perplexity of each sentence pair (with respect to a
Europarl LM)
3 Sort the pairs according to similarity and perplexity
4 Manually select the first k parallel sentences
Factory of Comparable Corpora BUCC 2015 15/26
34. Impact: Corpora Statistics
CS Sc Sp All
c3g 95,715 723,760 334,828 883,366
cog 182,283 1,213,965 451,324 1,430,962
monoen 210,664 1,367,169 461,237 1,638,777
¯S·len 120,835 956,346 389,975 1,160,977
union 577,428 3,847,381 1,181,664 4,948,241
Wikipedia dev 300 300 300 900
Wikipedia test 500 500 500 1500
Gnome 1000 – – –
Factory of Comparable Corpora BUCC 2015 16/26
35. Impact: Phrase-based SMT System
Language model 5-gram interpolated Kneser-Ney discounting, SRILM
Toolkit
Alignments GIZA++ Toolkit
Translation model Moses package
Weights optimization MERT against BLEU
Decoder Moses
Factory of Comparable Corpora BUCC 2015 17/26
36. Impact: Experiments definition
1 In domain
2 In domain
3 Out of domain
Training Wikipedia or Europarl
Test Wikipedia (+Gnome for CS)
Training Wikipedia and Europarl
Test Wikipedia (+Gnome for CS)
Training Wikipedia and Europarl
Test News
Factory of Comparable Corpora BUCC 2015 18/26
38. Impact: Results on Gnome (in domain)
CS Un
c3g 11.08 9.56
cog 18.48 17.66
monoen 19.48 20.58
¯S·len 20.71 20.56
union 22.41 20.63
EP 18.15
EP+c3g 19.78 19.49
EP+cog 21.09 20.14
EP+monoen 21.27 20.66
EP+ ¯S·len 21.58 20.65
EP+union 22.37 21.43
Factory of Comparable Corpora BUCC 2015 20/26
39. Impact: Translation Instances
Better reordering
Source All internet packets have a source IP address and a destination
IP address.
EP Todos los paquetes de internet tienen un origen dirección IP y
destino dirección IP.
EP+union-CS Todos los paquetes de internet tienen una dirección IP de
origen y una dirección IP de destino.
Reference Todos los paquetes de internet tienen una dirección IP de
origen y una dirección IP de destino.
Factory of Comparable Corpora BUCC 2015 21/26
40. Impact: Translation Instances
Awareness of terms (possible overfitting?)
Source Attack of the Killer Tomatoes is a 2D platform video game
developed by Imagineering and released in 1991 for the NES.
EP el ataque de los tomates es un asesino 2D plataforma
vídeo-juego desarrollados por Imagineering y liberados en
1991 por la NES.
union-CS Attack of the Killer Tomatoes es un videojuego de plataformas
desarrollado por Imagineering y lanzado en 1991 para la
Nintendo Entertainment System.
Reference Attack of the Killer Tomatoes es un videojuego de plataformas
en 2D desarrollado por Imagineering y lanzado en 1991 para
el NES.
Factory of Comparable Corpora BUCC 2015 21/26
41. Impact: Translation Instances
Better vocabulary
Source Fractal compression is a lossy compression method for digital
images, based on fractals.
EP Fractal compresión es un método para lossy compresión
digital imágenes , basada en fractals.
EP+union-CS La compresión fractal es un método de compresión con
pérdida para imágenes digitales, basado en fractales.
Reference La compresión fractal es un método de compresión con
pérdida para imágenes digitales, basado en fractales.
Factory of Comparable Corpora BUCC 2015 21/26
44. Final Remarks
• A simple model to extract domain-specific comparable corpora from
Wikipedia
• The domain-specific corpora showed to be useful to feed SMT systems,
but other tasks are possible
Factory of Comparable Corpora BUCC 2015 24/26
45. Final Remarks
• A simple model to extract domain-specific comparable corpora from
Wikipedia
• The domain-specific corpora showed to be useful to feed SMT systems,
but other tasks are possible
• We are currently comparing our model against an IR-based system
[Plamada and Volk, 2012]
• The platform currently operates in more language pairs, including
French, Catalan, German, and Arabic; but it can operate in any
language and domain
Factory of Comparable Corpora BUCC 2015 24/26
46. Final Remarks
• A simple model to extract domain-specific comparable corpora from
Wikipedia
• The domain-specific corpora showed to be useful to feed SMT systems,
but other tasks are possible
• We are currently comparing our model against an IR-based system
[Plamada and Volk, 2012]
• The platform currently operates in more language pairs, including
French, Catalan, German, and Arabic; but it can operate in any
language and domain
• The prototype is coded in Java (and depends on JWPL). We plan to
release it in short!
Factory of Comparable Corpora BUCC 2015 24/26
48. References I
McNamee, P. and Mayfield, J. (2004).
Character N-Gram Tokenization for European Language Text Retrieval.
Information Retrieval, 7(1-2):73–97.
Plamada, M. and Volk, M. (2012).
Towards a wikipedia-extracted alpine corpus.
In The Fifth Workshop on Building and Using Comparable Corpora.
Pouliquen, B., Steinberger, R., and Ignat, C. (2003).
Automatic Identification of Document Translations in Large Multilingual Document Collections.
In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2003), pages
401–408, Borovets, Bulgaria.
Simard, M., Foster, G. F., and Isabelle, P. (1992).
Using Cognates to Align Sentences in Bilingual Corpora.
In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation.
Factory of Comparable Corpora BUCC 2015 26/26