Challenges and Possibilities for Extracting Parallel Corpora from the Web - The translator's dream scenario

    Challenges and Possibilities for Extracting Parallel Corpora from the Web - The translator's dream scenario - Presentation Transcript

    1. Challenges and Possibilities for Extracting Parallel Corpora from the Web: The Translator’s Dream Scenario
       Elina Lagoudaki, Imperial College London, UK (e.lagoudaki@imperial.ac.uk)
       X Symposium on Social Communication, 22-26 January 2007, Santiago de Cuba
    2. Presentation outline
       ‣ Importance of bilingual parallel corpora
       ‣ Compilation of corpora
       ‣ The Web as a repository of texts and its characteristics
       ‣ Visualising an intelligent agent integrated into a TM system: how will it work?
       ‣ Quality issues regarding an automatically built corpus from the Web
       ‣ Existing tools for the automatic extraction of parallel texts from the Web
    3. Importance of parallel corpora
       Bilingual parallel corpus = a collection of pairs of texts (original + translation), also called bitexts
       Important for: NLP researchers, lexicographers, translator trainers, translators
       ✓ comparison of the source text with the corpus in order to find any existing translations for parts of the source text
       ✓ extraction of bilingual terminology, creation of specialised glossaries
       ✓ validation of the translator’s choices, by cross-checking them with word frequency data or by being able to see the chosen terms in a variety of contexts simultaneously
       ✓ consistency in domain-specific terminology
       ✓ text analysis: either for investigating translation strategies or for acquiring/cultivating linguistic or subject-specific knowledge
    4. Compilation of corpora: THE PROBLEM
       a) Ready-made corpora:
          - limited in size and language pairs
          - not representative of subject domains
          - usually very costly and restricted by copyright
       Examples:
       ‣ the Hansard corpus (English-Canadian French / 2.87 million parallel sentence pairs)
       ‣ the OPUS corpus (60 languages / 30 million words)
       ‣ the Europarl corpus (11 European languages / over 20 million words)
       ‣ the UN corpus (English-French-Spanish / 165 million words)
    5. Compilation of corpora: THE PROBLEM
       b) Self-built corpora:
       1. by keeping an archive of one’s past work, pairing up original texts with their translations and aligning them
          ‣ it takes years to build a good-size corpus, even for the most experienced translator
       2. by acquiring bilingual texts from various external sources (electronic or not) or by locating and downloading paired webpages from the Web manually
          ‣ tedious and time-consuming process
    6. The Web as a repository of texts and its characteristics
       The Web ranks 2nd among the resources that translators find most useful and use most (ICL TM Survey 2006):
       1. in dictionaries/glossaries in CD-ROMs
       2. on the Internet: online dictionaries/glossaries
       3. in hardcopy dictionaries/glossaries
       4. on the Internet: in a search engine (e.g. Google)
       5. in old translations/glossaries that I have
       Unique characteristics:
       ✓ vast collection of electronic texts (11.5 billion indexed public webpages)
       ✓ the nature of the content is dynamic = always ‘fresh’ content
       ✓ content available in any language (even minority ones)
       ✓ content available in any subject domain
       ✓ content available in any writing style (colloquial, formal, scientific, etc.)
    7. Automatic extraction of parallel texts from the Web: THE SOLUTION
       An intelligent tool (agent) that is able to extract parallel texts from the Web and download them to the translator’s local repository quickly and effortlessly, with little or no human intervention.

       Survey item: "TM system with the ability to locate bilingual parallel texts on my subject on the Web, download them and store them in my TM database automatically for future use or reference"

       Very important   Somewhat important   Not very important   Not at all important   Don’t know
       44%              31%                  16%                  7%                     2%

       Source: ICL TM Survey 2006. Total number of respondents to this question: 700
    8. Visualising an intelligent agent integrated into a TM system
       PROCESS
       STEP 1: Location of pairs of bilingual webpages (Challenge no. 1)
       STEP 2: Identification of parallel content, i.e. mutual translations (Challenge no. 2)
       STEP 3: Download and indexing of bilingual parallel texts in the TM database/index
    9. How will it work?
       The tool will crawl the Web using a ‘spider’ (e.g. Google’s Web APIs service), looking for parallel content according to a combination of structure-based and content-based criteria.
       STEP 1: Location of bilingual pages (that might be mutual translations) in the same website
       It will be able to identify language-specific webpages in every publicly available website:
       a) by checking for language indicators in the URL paths or the file names,
          e.g. http://europa.eu/abc/panorama/index_en.htm and http://europa.eu/abc/panorama/index_es.htm
       b) by parsing the content of the webpages for any language anchors (e.g. English version, français, etc.)
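Criterion (a) of Step 1 can be illustrated with a minimal Python sketch. It is not the agent described on the slide: the list of language codes, the delimiter rule and the function names `language_neutral_key` and `candidate_pairs` are all assumptions made for illustration. The idea is to mask the language marker in each URL path and group the URLs that become identical once the marker is masked.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately short list of language codes to look for.
LANG_CODES = ("en", "es", "fr", "de")

# Assumption: a code counts as a language indicator only when it is
# delimited by '/', '_', '-' or '.' on both sides, as in "index_en.htm".
_LANG_RE = re.compile(r"(?<=[/_\-.])(" + "|".join(LANG_CODES) + r")(?=[/_\-.])")


def language_neutral_key(url):
    """Return (key, lang) with the first language marker in the path masked, or None."""
    parts = urlsplit(url)
    match = _LANG_RE.search(parts.path)
    if match is None:
        return None
    masked_path = parts.path[:match.start()] + "*" + parts.path[match.end():]
    return parts.netloc + masked_path, match.group(1)


def candidate_pairs(urls):
    """Group URLs that differ only in their language marker."""
    groups = defaultdict(dict)
    for url in urls:
        result = language_neutral_key(url)
        if result is not None:
            key, lang = result
            groups[key][lang] = url
    # Keep only the groups that exist in at least two languages.
    return {key: langs for key, langs in groups.items() if len(langs) > 1}
```

Called on the two europa.eu addresses from the slide, `candidate_pairs` would place them in one group keyed by the masked path. Such a group is only a candidate pair; whether the pages really are mutual translations still has to be verified in Step 2.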
    10. How will it work?
       STEP 2: Verification of parallel content
       In order to verify that the pairs of webpages are mutual translations, the agent needs to identify and evaluate the similarities between them, at both structure and content levels:
       a) by checking the structural characteristics of the webpages (HTML mark-up, length and number of paragraphs) to identify any parallel structure
       b) by parsing the content of both webpages for instances of lexical equivalence (this presupposes the integration of a bilingual lexicon/glossary into the system, with the help of which the tool will be able to perform the linguistic parsing)
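Both checks in Step 2 can be sketched in Python using only the standard library. This is an illustrative simplification, not the method of any of the tools cited later: structural similarity is approximated by comparing the sequences of opening HTML tags with `difflib.SequenceMatcher`, and lexical equivalence by counting how many words known to a toy one-to-one lexicon have their translation on the other page. The names `structure_similarity` and `lexical_coverage`, and the dictionary-shaped lexicon, are assumptions.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect the sequence of opening tags of a page, ignoring its text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_similarity(html_a, html_b):
    """Return a 0..1 similarity ratio between the two pages' tag sequences."""
    sequences = []
    for html in (html_a, html_b):
        parser = TagSequence()
        parser.feed(html)
        sequences.append(parser.tags)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()


def lexical_coverage(text_a, text_b, lexicon):
    """Share of lexicon words found in text_a whose translation occurs in text_b."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    known = [word for word in words_a if word in lexicon]
    if not known:
        return 0.0
    hits = sum(1 for word in known if lexicon[word] in words_b)
    return hits / len(known)
```

A pair of pages would then be accepted as parallel only if both scores exceed thresholds chosen by the implementer. A real bilingual glossary maps a term to several possible translations and inflected forms, which this sketch ignores.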
    11. How will it work?
       STEP 3: Download of the pairs of parallel webpages to a temporary local folder
       This will allow the offline processing of the files, even if the user decides to interrupt the session at any point by going offline.
    12. How will it work?
       Offline actions
       ‣ Filtering out non-textual items: once it has a validated set of parallel webpages, the agent will strip out any non-textual elements (e.g. images, video, graphics, etc.) from every webpage.
       ‣ Conversion into plain Unicode text (with no indication of font or text formatting).
       ‣ Tagging: each text file will be tagged with the following information: date of creation of content, language, source (URL and webpage name, if it exists) and subject (if domain-specific keywords were used).
       ‣ Full-text indexing in the TM database: the texts will be indexed and aligned according to the segmentation rules and alignment methods followed by each TM system.
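The first three offline actions (filtering, conversion to plain text, tagging) can be sketched in Python with the standard-library `html.parser` module. The class `TextExtractor`, the record type `TaggedText` and its field names are hypothetical, and the creation date is simply passed in by the caller as a string. Indexing and alignment are left out because, as the slide notes, they depend on each TM system.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Elements whose content is code or styling rather than readable text.
SKIPPED = {"script", "style"}


class TextExtractor(HTMLParser):
    """Keep only the visible text; images and other media carry no text data."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


@dataclass
class TaggedText:
    """Plain text plus the metadata tags listed on the slide (hypothetical field names)."""
    text: str
    language: str
    source_url: str
    created: str
    subject: str = ""


def to_tagged_text(html, language, source_url, created, subject=""):
    """Strip mark-up and non-textual items, then attach the metadata tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return TaggedText("\n".join(parser.chunks), language, source_url, created, subject)
```

Each run of text becomes its own line in this sketch, so inline elements such as bold words are split from their sentence; a fuller version would join text at block level before the TM system segments it.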
    13. How will it work?
       General functionality aspects
       The agent will start operating once the TM application is opened, and it will work in the background without interrupting the work of the translator.
       It will run in regular temporary sessions (e.g. once a week or once a month), either automatically (i.e. the translator will not have to initiate the process every time) or at the user’s command.
       The module will be accessible through the main interface of the TM application, and a separate dialog window will allow users to set the parameters under which the agent will operate.
    14. How will it work?
       Such settings will include:
       ‣ Schedule of temporary sessions
       ‣ Determination of the size of the database (the agent will stop the session when a predefined size threshold is reached)
       ‣ Search restrictions:
          a) by websites (the user will be able to restrict the search by listing specific websites that he/she considers more authoritative sources of good-quality content; by default, the agent will search the entire public Web)
          b) by domains (the user will be able to include subject-specific keywords in the search or import his/her bilingual glossary, if he/she wants to build a domain-specific parallel corpus)
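The settings listed above could be held in a small configuration object. The following Python sketch is only one possible shape: the class name `AgentSettings`, the field names and the default values (weekly sessions, a 500 MB threshold) are assumptions, not values given in the presentation.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class AgentSettings:
    """Hypothetical container for the user-configurable parameters."""
    session_interval_days: int = 7                # schedule of sessions (assumed default)
    max_database_bytes: int = 500 * 1024 * 1024   # size threshold (assumed default)
    allowed_sites: list = field(default_factory=list)     # empty = entire public Web
    subject_keywords: list = field(default_factory=list)  # empty = no domain filter

    def site_allowed(self, url):
        """Apply the 'by websites' restriction; with no list, every site is allowed."""
        if not self.allowed_sites:
            return True
        host = urlsplit(url).hostname or ""
        return any(host == site or host.endswith("." + site)
                   for site in self.allowed_sites)

    def should_stop(self, current_bytes):
        """Signal that the session must end once the size threshold is reached."""
        return current_bytes >= self.max_database_bytes
```

For instance, `AgentSettings(allowed_sites=["europa.eu"])` would let the agent visit europa.eu and its subdomains only, while the default `AgentSettings()` leaves the search unrestricted, as the slide specifies.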
    15. Quality issues regarding such a corpus
       Arbitrary selection of any parallel webpages on the Web will most probably affect the quality of the corpus. Possible remedies:
       ‣ Restrict the selection of websites in which to look for parallel webpages
       ‣ Verify the quality of the corpus manually
       This automatically generated corpus can serve as reference material or as a ‘read-only’ translation memory, if the translator is too concerned about the quality and does not want to use it as his/her primary TM database.
    16. Existing experimental tools
       Three tools have been developed so far (for research purposes only) that have been able to extract parallel corpora from the Web and save them in a local database:
       ‣ STRAND (Resnik, 1998; Resnik & Smith, 2003)
       ‣ BITS (Ma & Liberman, 1999)
       ‣ PTMiner (Chen & Nie, 2000; Kraaij et al., 2003)
       Performance results (1):
       ‣ STRAND: an English-French corpus of 325 webpage pairs
       ‣ BITS: 63 MB parallel corpus of English-German
       ‣ PTMiner: 174/198 MB of English-French corpus extracted
    17. Existing experimental tools
       Performance results (2):
       ‣ STRAND: 98% recall & 97.4% precision
       ‣ BITS: 97.1% recall & 99.1% precision
       ‣ PTMiner: 99% precision
       Recall = the proportion of the parallel pairs of webpages that were retrieved
       Precision = the proportion of pairs of pages correctly identified as parallel among all bilingual pairs of webpages retrieved
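For reference, the two measures can be computed as follows. This Python sketch uses the standard information-retrieval definitions (precision over the retrieved pairs, recall over the truly parallel pairs); the function name `precision_recall` and the representation of a page pair as a tuple of two identifiers are assumptions, and the figures above come from the cited papers, not from this code.

```python
def precision_recall(retrieved, truly_parallel):
    """Return (precision, recall) for a set of retrieved page pairs.

    retrieved      -- pairs the tool reported as parallel
    truly_parallel -- pairs that a human judged to be mutual translations
    """
    retrieved = set(retrieved)
    truly_parallel = set(truly_parallel)
    correct = retrieved & truly_parallel
    # Precision: how many of the reported pairs are really parallel.
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    # Recall: how many of the really parallel pairs were reported.
    recall = len(correct) / len(truly_parallel) if truly_parallel else 0.0
    return precision, recall
```

A tool that reports three pairs, two of them correct, against a gold standard of four parallel pairs would score a precision of 2/3 and a recall of 2/4.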
    18. THANK YOU! Questions?

    elinalag, 3 years ago

    Presentation at X Symposium on Social Communication

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 1544
      • 1544 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 0
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories