Challenges and Possibilities for Extracting Parallel Corpora from the Web - The translator's dream scenario

Presentation at X Symposium on Social Communication, Santiago de Cuba, January 2007.

Challenges and Possibilities for Extracting Parallel Corpora from the Web
The Translator’s Dream Scenario

Elina Lagoudaki
Imperial College London, UK
e.lagoudaki@imperial.ac.uk

X Symposium on Social Communication, 22-26 January 2007, Santiago de Cuba

Presentation outline

‣ Importance of bilingual parallel corpora
‣ Compilation of corpora
‣ The Web as a repository of texts and its characteristics
‣ Visualising an intelligent agent integrated into a TM system - How will it work?
‣ Quality issues regarding an automatically built corpus from the Web
‣ Existing tools for the automatic extraction of parallel texts from the Web

Importance of parallel corpora

bilingual parallel corpus = collection of pairs of texts (original + translation), bitexts

important for: NLP researchers, lexicographers, translator trainers, translators

✓ comparison of the source text with the corpus in order to find any existing translations for parts of the source text
✓ extraction of bilingual terminology, creation of specialised glossaries
✓ validation of the translator’s choices, by cross-checking them with word-frequency data or by seeing the chosen terms in a variety of contexts simultaneously
✓ consistency in domain-specific terminology
✓ text analysis: either for investigating translation strategies or for acquiring/cultivating linguistic or subject-specific knowledge

Compilation of corpora: THE PROBLEM

a) Ready-made corpora: limited in size and language pairs - not representative of subject domains - usually very costly and restricted by copyright

‣ the Hansard corpus (English-Canadian French / 2.87 million parallel sentence pairs)
‣ the OPUS corpus (60 languages / 30 million words)
‣ the Europarl corpus (11 European languages / over 20 million words)
‣ the UN corpus (English-French-Spanish / 165 million words)

Compilation of corpora: THE PROBLEM

b) Self-built corpora:

1. by keeping an archive of one’s past work, pairing up original texts and their translations, and aligning them
   ‣ it takes years to build a good-sized corpus, even for the most experienced translator
2. by acquiring bilingual texts from various external sources (electronic or not), or by locating and downloading paired webpages from the Web manually
   ‣ a tedious and time-consuming process

The Web as a repository of texts and its characteristics

The Web ranks 2nd among the most useful and most-used resources by translators (ICL TM Survey 2006):
1. in dictionaries/glossaries on CD-ROM
2. on the Internet: online dictionaries/glossaries
3. in hardcopy dictionaries/glossaries
4. on the Internet: in a search engine (e.g. Google)
5. in old translations/glossaries that I have

Unique characteristics:
✓ vast collection of electronic texts (11.5 billion indexed public webpages)
✓ the content is dynamic in nature = always ‘fresh’ content
✓ content available in any language (even minority ones)
✓ content available in any subject domain
✓ content available in any writing style (colloquial, formal, scientific, etc.)

Automatic extraction of parallel texts from the Web: THE SOLUTION

An intelligent tool (agent) that is able to extract parallel texts from the Web and download them into the translator’s local repository quickly, effortlessly, and with little or no human intervention.

“TM system with the ability to locate bilingual parallel texts on my subject on the Web, download them and store them in my TM database automatically for future use or reference”:
Very important: 44% | Somewhat important: 31% | Not very important: 16% | Not at all important: 7% | Don’t know: 2%

ICL TM Survey 2006 - Total number of respondents to this question: 700

Visualising an intelligent agent integrated into a TM system

PROCESS
STEP 1: Location of pairs of bilingual webpages (Challenge no. 1)
STEP 2: Identification of parallel content (mutual translations) (Challenge no. 2)
STEP 3: Download and indexing of bilingual parallel texts in the TM database/index

How will it work?

The tool will crawl the Web using a ‘spider’ (e.g. Google’s Web APIs service), looking for parallel content according to a combination of structure- and content-based criteria.

STEP 1: Location of bilingual pages (that might be mutual translations) within the same website

It will be able to identify language-specific webpages in every publicly available website:
a) by checking for language indicators in the URL paths or file names, e.g. http://europa.eu/abc/panorama/index_en.htm and http://europa.eu/abc/panorama/index_es.htm
b) by parsing the content of the webpages for any language anchors (e.g. “English version”, “français”, etc.)

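The URL-matching heuristic of STEP 1 can be sketched roughly as follows. This is a minimal illustration, not the actual tool: the language-code list, the helper name `find_candidate_pairs`, and the third example URL are assumptions added for the sketch; only the europa.eu pair comes from the slide.

```python
import re

# Assumed set of language indicators to look for in file names;
# a real crawler would use a much larger list of ISO codes.
LANG_CODES = {"en", "es", "fr", "de"}
LANG_PATTERN = re.compile(r"_(%s)(?=\.[a-z]+$)" % "|".join(LANG_CODES))

def find_candidate_pairs(urls):
    """Group URLs by their language-neutral 'template' and emit the
    pairs whose paths differ only in the language indicator."""
    templates = {}
    for url in urls:
        match = LANG_PATTERN.search(url)
        if not match:
            continue  # no language indicator in the file name
        template = LANG_PATTERN.sub("_{lang}", url)
        templates.setdefault(template, {})[match.group(1)] = url
    pairs = []
    for variants in templates.values():
        langs = sorted(variants)
        # every distinct language combination is a candidate bitext pair
        for i in range(len(langs)):
            for j in range(i + 1, len(langs)):
                pairs.append((variants[langs[i]], variants[langs[j]]))
    return pairs

urls = [
    "http://europa.eu/abc/panorama/index_en.htm",
    "http://europa.eu/abc/panorama/index_es.htm",
    "http://europa.eu/abc/contact.htm",
]
print(find_candidate_pairs(urls))
```

The two index pages collapse to the same template and are emitted as one candidate pair, while the unmarked contact page is ignored; criterion (b), scanning page content for language anchors, would catch sites whose URLs carry no such indicator.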
How will it work?

STEP 2: Verification of parallel content

In order to verify that the pairs of webpages are mutual translations, the agent needs to identify and evaluate the similarities between them, at both structure and content level:
a) by checking the structural characteristics of the webpages (HTML mark-up, length and number of paragraphs) to identify any parallel structure
b) by parsing the content of both webpages for instances of lexical equivalence (this presupposes the integration of a bilingual lexicon/glossary into the system, with the help of which the tool will be able to perform the linguistic parsing)

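The two STEP 2 checks can be sketched as two small scoring functions: one comparing the mark-up skeletons of the pages, one scoring glossary-based lexical equivalence. The function names, the toy English-Spanish glossary, and the idea of returning ratios in [0, 1] are illustrative assumptions; the slide does not prescribe a particular similarity measure.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the sequence of opening tags, i.e. the mark-up skeleton."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags

def structure_similarity(html_a, html_b):
    """Ratio in [0, 1] of how similar the two mark-up skeletons are."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()

def lexical_overlap(text_a, text_b, glossary):
    """Fraction of glossary source terms found in text A whose listed
    translation also appears in text B."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    hits = [src for src, tgt in glossary.items() if src in words_a and tgt in words_b]
    candidates = [src for src in glossary if src in words_a]
    return len(hits) / len(candidates) if candidates else 0.0

glossary = {"house": "casa", "white": "blanca"}  # toy bilingual lexicon
a = "<html><body><p>the white house</p></body></html>"
b = "<html><body><p>la casa blanca</p></body></html>"
print(structure_similarity(a, b), lexical_overlap("the white house", "la casa blanca", glossary))
```

A page pair would be accepted as parallel only when both scores clear some threshold; choosing those thresholds is exactly the precision/recall trade-off reported for the experimental tools later in the talk.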
How will it work?

STEP 3: Download of the pairs of parallel webpages to a temporary local folder

This will allow the offline processing of the files, even if the user decides to interrupt the session at any point by going offline.

How will it work?

Offline actions:

‣ Filtering out of non-textual items: after having a validated set of parallel webpages, the agent will strip out any non-textual elements (e.g. images, video, graphics, etc.) from every webpage
‣ Conversion into plain Unicode text (with no indication of font or text formatting)
‣ Tagging: each text file will be tagged with the following information: date of creation of the content, language, source (URL and webpage name, if it exists) and subject (if domain-specific keywords were used)
‣ Full-text indexing in the TM database: the texts will be indexed and aligned according to the segmentation rules and alignment methods followed by each TM system

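The first three offline actions (stripping non-textual content, reducing to plain Unicode text, tagging) might be sketched as below. The class and field names and the placeholder date are illustrative assumptions; empty elements such as images carry no text, so discarding the mark-up removes them automatically, and only content inside script/style blocks needs explicit skipping.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only human-readable text, skipping script/style content."""
    SKIP = {"script", "style"}
    def __init__(self):
        super().__init__()
        self.chunks, self._skip_depth = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def to_plain_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)

def tag_file(text, url, language, subject=None):
    """Attach the metadata the slide lists to the extracted text."""
    return {"text": text, "source": url, "language": language,
            "date": "2007-01-22",  # placeholder creation date
            "subject": subject}

html = "<html><body><img src='x.png'><p>Hello world</p></body></html>"
record = tag_file(to_plain_text(html), "http://example.com/index_en.htm", "en")
print(json.dumps(record))
```

The final action, full-text indexing and alignment, is deliberately left out here because, as the slide notes, it follows whatever segmentation and alignment method each TM system already uses.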
How will it work?

General functionality aspects

The agent will start operating once the TM application is opened, and it will work in the background without interrupting the work of the translator. It will run in regular temporary sessions (e.g. once a week or month), either automatically (i.e. the translator will not have to initiate the process every time) or at the user’s command.

The module will be accessible through the main interface of the TM application, and a separate dialog window will allow users to set the parameters under which the agent will operate.

How will it work?

Such settings will include:

‣ Schedule of temporary sessions
‣ Determination of the size of the database (the agent will stop the session when a predefined size threshold is reached)
‣ Search restrictions:
  a) by websites (the user will be able to restrict the search by listing specific websites that he/she considers more authoritative sources of good-quality content; by default, the agent will search the entire public Web)
  b) by domains (the user will be able to include subject-specific keywords in the search, or import his/her bilingual glossary, if he/she wants to build a domain-specific parallel corpus)

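One way to picture the settings dialog is as a small configuration record mirroring the options above. All names and default values here are assumptions invented for the sketch; the slide specifies only which options exist, not their representation.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSettings:
    # schedule of temporary sessions (run once every N days)
    session_interval_days: int = 7
    # stop the session once the database reaches this size threshold
    max_database_mb: int = 500
    # empty list = search the entire public Web (the default)
    allowed_websites: list = field(default_factory=list)
    # subject-specific keywords for building a domain-specific corpus
    subject_keywords: list = field(default_factory=list)
    # optional imported bilingual glossary
    glossary_path: str = None

# a user restricting the agent to one authoritative site and one domain
settings = AgentSettings(allowed_websites=["europa.eu"],
                         subject_keywords=["fisheries"])
print(settings.session_interval_days, settings.allowed_websites)
```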
Quality issues regarding such a corpus

Arbitrary selection of any parallel webpages on the Web will most probably affect the quality of the corpus:

‣ Restrict the selection of websites in which to look for parallel webpages
‣ Verify the quality of the corpus manually

This automatically generated corpus can serve as reference material or as a ‘read-only’ translation memory, if the translator is too concerned about the quality and does not want to use it as his/her primary TM database.

Existing experimental tools

3 tools have been developed so far (for research purposes only) that have been able to extract parallel corpora from the Web and save them in a local database:

‣ STRAND (Resnik, 1998; Resnik & Smith, 2003)
‣ BITS (Ma & Liberman, 1999)
‣ PTMiner (Chen & Nie, 2000; Kraaij et al., 2003)

Performance results - 1:

‣ STRAND: an English-French corpus of 325 webpage pairs
‣ BITS: a 63 MB English-German parallel corpus
‣ PTMiner: 174/198 MB of English-French corpus extracted

Existing experimental tools

Performance results - 2:

‣ STRAND: 98% recall & 97.4% precision
‣ BITS: 97.1% recall & 99.1% precision
‣ PTMiner: 99% precision

Recall = the proportion of the truly parallel pairs of webpages that the tool retrieved
Precision = the proportion of the retrieved pairs of webpages that were correctly identified as parallel

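As a worked illustration of the two measures (on assumed toy counts, not the actual STRAND/BITS/PTMiner evaluation data):

```python
def recall(parallel_retrieved, parallel_total):
    """Share of the truly parallel pairs that the tool retrieved."""
    return parallel_retrieved / parallel_total

def precision(parallel_retrieved, retrieved_total):
    """Share of the retrieved pairs that are truly parallel."""
    return parallel_retrieved / retrieved_total

# e.g. suppose a site holds 100 truly parallel pairs; the tool retrieves
# 100 candidate pairs, of which 98 are genuinely parallel
print(recall(98, 100), precision(98, 100))  # → 0.98 0.98
```

A high-precision, lower-recall setting misses some bitexts but keeps the corpus clean, which matters most for the translation-memory use case described earlier.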
THANK YOU!

Questions?
