Pre- and post-editing environment for Apertium
Lluís Villarejo
Learning Technologies
March 2012

What is GSoC?
• It's a global program that offers student developers stipends to write code for various open source software projects.
• Running since 2005.
• Inspire young developers to participate in OSS projects.
• Give students more exposure to real-world software development scenarios.
• Get more open source code created and released.
• Help open source projects identify and bring in new developers.

Some participants
• Apache Software Foundation
• Sakai Foundation
• Debian
• Mozilla
• Facebook
• Inclusive Design Institute
• Drupal
• The Linux Foundation
• Creative Commons
• The GNU project
• DocBook project
• Wikimedia Foundation
• GCC
• WordPress
• Gnome
• ...

How does it work?
• Orgs present themselves as mentoring agents.
• Orgs present a list of potential projects and mentors.
• Accepted orgs should try to attract students' interest.
• Students build project proposals.
• Google finances a number of slots for each org (5,000 + 500 USD per slot).
• The project community decides how students are assigned to slots.
• Projects run between the end of May and the end of August.

GSoC 2011 statistics
• $7.2M budget
• 1115 students accepted from 68 countries
• 2096 mentors and co-mentors from 55 countries
• 175 open source organizations
• 18.1% of students have participated in previous years
• 97 countries with student applicants
• 88% overall success rate

Why participate with Apertium?
• Strategically:
  – Apertium is a strategic agent inside UOC.
  – Developing Apertium means further developing internationalization aids for UOC.
  – Attract and onboard new developers for Apertium.
  – Collaboration with Google's open source initiatives.
• Functionally:
  – Opportunity to further develop specific UOC needs with external funding.
  – Capitalize on specific user feedback on translation quality.

The Apertium case
• 20 proposed tasks
• 17 tasks got interest from students [1-9]
  – The pre- and post-editing environment gets 11 students interested.
• The Apertium community ranks the 17 tasks
  – The pre- and post-editing environment ranks 4th.
• Google assigns 9 slots to Apertium (9 × 5,500 = 49,500 USD)
  – Our task goes through and Camille Mougey, from the Grenoble Institute of Technology, is selected.

Pre- and post-editing, why?
• A significant share of the errors you get when translating a document is due to deficiencies in the original.
• Integrating existing resources can ease this burden:
  – Digital knowledge sources (digital dictionaries...)
  – Automatic tools (spell checker, grammar checker, translation memory generation, search & replace...)
• These processes should be integrated naturally into the translation workflow → the need for an integrated web interface to Apertium.
• To improve the system we need access to the human post-editing process.

Pre- and post-editing, features
• Pre- and post-editing web interface integrated with the Apertium translation toolbox.
• Spell checking on source and target languages, through integration with Aspell.
• Grammar checking on source and target languages, through integration with LanguageTool.
• Integration with several external dictionaries.
• Search & replace on source and target languages.
• Ability to deal with formatted text.
• Logging system. All events are logged as they happen, i.e. at the very moment the user inserts or deletes text. This allows a later data mining process to be run on the logs to detect commonly modified structures or vocabulary.
• Translation memory generation, through integration of Maligna.
• PDF translation, through pdftohtml.
• Image translation, through tesseract.
Final report 2010
Final report 2011
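
The logging idea above can be illustrated with a small sketch. This is not the project's actual logging code or log format: the `EditEvent` fields, the `EditLog` class, and the `count_replacements` heuristic (a delete immediately followed by an insert at the same position counts as one replacement) are all assumptions made for illustration.

```python
import time
from collections import Counter
from dataclasses import dataclass


@dataclass
class EditEvent:
    """One post-editing event (hypothetical schema, not Apertium's)."""
    timestamp: float
    action: str    # "insert" or "delete"
    position: int  # character offset in the edited text
    text: str      # the text that was inserted or deleted


class EditLog:
    """Records events at the moment the user inserts or deletes text."""

    def __init__(self):
        self.events = []

    def record(self, action, position, text, timestamp=None):
        if action not in ("insert", "delete"):
            raise ValueError(f"unknown action: {action!r}")
        if timestamp is None:
            timestamp = time.time()
        self.events.append(EditEvent(timestamp, action, position, text))


def count_replacements(events):
    """Count (deleted, inserted) pairs.

    Assumed heuristic: a delete directly followed by an insert at the
    same position is treated as one replacement by the post-editor.
    """
    pairs = Counter()
    for prev, curr in zip(events, events[1:]):
        if (prev.action == "delete" and curr.action == "insert"
                and prev.position == curr.position):
            pairs[(prev.text, curr.text)] += 1
    return pairs


if __name__ == "__main__":
    log = EditLog()
    log.record("delete", 10, "actual")
    log.record("insert", 10, "current")
    log.record("delete", 42, "actual")
    log.record("insert", 42, "current")
    log.record("insert", 60, "!")
    # The most frequent replacement hints at vocabulary worth reviewing.
    print(count_replacements(log.events).most_common(1))
```

Counting such pairs across many sessions is one simple way the logs could surface commonly modified vocabulary; detecting modified structures would need a richer analysis than this sketch attempts.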

Results & lessons learned
• Fully functional environment, goals accomplished.
• Feedback on human post-editing behaviour is available automatically.
• Jointly defined task (flexible framework provided).
• Interest in developing strong empathy with the student.
• Motivated and proactive student.
• Student engagement.
• Very frequent feedback.
• Mentoring team with access to ABSOLUTELY ALL the information regarding the project.

Further work
• Proof of concept accomplished.
• Base platform developed, so further work can be added easily.
• Integration of other resources (more external dictionaries).
• Extension of the resources currently used (additional grammar rules, improved dictionaries, a wider range of formats).
• Mining the logged information to gain deeper knowledge of the human post-editing process.
• Use of this mining process to improve the Apertium translation engine.

GSoC 2012
• Mining the logged information to gain deeper knowledge of the human post-editing process.
• Use of this mining process to improve the Apertium translation engine.
• Post-editing over formatted text.