Improving the quality of a customized SMT system using shared training data
Chris.Wendt@microsoft.com, Will.Lewis@microsoft.com
August 28, 2009
Overview
  • Engine and Customization Basics
  • Experiment Objective
  • Experiment Setup
  • Experiment Results
  • Validation
  • Conclusions
Microsoft’s Statistical MT Engine: Linguistically informed SMT
Microsoft Translator Runtime
Training
Microsoft’s Statistical MT Engine
Adding Domain Specificity
Experiment Objective
Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data.
Experiment Setup
Data pool: TAUS Data Association’s repository of parallel translation data.
Domain: computer-related technical documents. No distinction is made between software, hardware, documentation and marketing material.
Criteria for test case selection:
  • […]
  • Less than 2M segments of parallel training data (above that point it would be valid to train a system using only the provider’s own data)
Chosen case: Sybase
Experiment series: Observe BLEU scores using a reserved subset of the submitted data against systems trained with:
  • 1: General data, as used for www.microsofttranslator.com
  • 2a: Only Microsoft’s internal parallel data, from localization of its own products
  • 2b: Microsoft data + Sybase data
  • 3a: General + Microsoft + TAUS
  • 3b: General + Microsoft + TAUS, with Sybase custom lambdas
Measure BLEU on 3 sets of test documents, with 1 reference, reserved from the submission and not used in training:
  • […]
  • Microsoft
  • General
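The slides do not say which BLEU implementation was used. As a rough illustration of the metric, here is a minimal sketch of unsmoothed corpus-level BLEU with a single reference per segment, matching the "1 reference" setup above. The function names `ngrams` and `corpus_bleu` are hypothetical, and tokenization is assumed to have been done already.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per segment.

    hypotheses, references: lists of token lists, aligned by index.
    Returns a score between 0 and 1 (multiply by 100 for BLEU points).
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any order with no match gives zero

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Scores from different tokenizers or BLEU variants are not directly comparable, so a sketch like this is only useful for comparing systems scored the same way on the same test set.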
System Details
Training data composition: German, Chinese (Simplified)
Sybase does not have enough data to build a system exclusively with Sybase data.
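The experiment series from the setup can be written down as a small lookup of which corpora feed each system. This is only an illustrative sketch: the corpus names come from the slides, while `SYSTEMS`, `CUSTOM_LAMBDAS` and `build_training_set` are hypothetical names, and segments are assumed to be simple (source, target) pairs.

```python
# System IDs and corpus names follow the slides; the structure is illustrative.
SYSTEMS = {
    "1": ["General"],
    "2a": ["Microsoft"],
    "2b": ["Microsoft", "Sybase"],
    "3a": ["General", "Microsoft", "TAUS"],
    "3b": ["General", "Microsoft", "TAUS"],  # same data as 3a
}

# System 3b differs from 3a only in using custom lambdas for Sybase.
CUSTOM_LAMBDAS = {"3b": "Sybase"}


def build_training_set(system_id, corpora, held_out):
    """Pool the corpora for one system, leaving out reserved test segments.

    corpora: dict mapping corpus name to a list of (source, target) pairs.
    held_out: segments reserved for testing, never used in training.
    """
    reserved = set(held_out)
    pooled = []
    for name in SYSTEMS[system_id]:
        pooled.extend(pair for pair in corpora[name] if pair not in reserved)
    return pooled
```

Keeping the reserved segments out of every pooled set is what makes the BLEU comparison across systems fair.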
Experiment Results, measured in BLEU (charts: Chinese, German)
  • More than 8 point gain compared to system built without the shared data.
  • Best results are achieved using the maximum available data within the domain, using custom lambda training.
  • Weight training (lambda training) without diversity in the training data has very little effect. The diversity aspect was somewhat of a surprise for us: Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.
  • Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else.
  • A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available.
  • Small data providers benefit more from sharing than large data providers, but all benefit.
  • This is the best German Sybase system we could have built without TAUS.
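The slides do not define "lambda training". In most statistical MT systems the lambdas are the feature weights of a log-linear model, so the following is a sketch under that assumption, not a description of Microsoft's engine. Here $f$ is the source sentence, $e$ a candidate translation, $h_i$ the feature functions (for example translation model and language model scores) and $\lambda_i$ their weights:

```latex
\hat{e}(f; \lambda) = \arg\max_{e} \sum_{i=1}^{M} \lambda_i \, h_i(e, f)
```

Under the same assumption, lambda training picks the weights that maximize BLEU on a tuning set $(F_{\text{tune}}, R_{\text{tune}})$. "Sybase custom lambdas" would then correspond to a tuning set drawn from Sybase material:

```latex
\hat{\lambda} = \arg\max_{\lambda} \, \mathrm{BLEU}\bigl(\hat{e}(F_{\text{tune}}; \lambda),\; R_{\text{tune}}\bigr)
```

Read this way, the weights decide how much each model component counts for one target, which would fit the observation that tuning helps the lambda target and hurts everyone else.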
Validation: Adobe Polish
Training data (sentences):
  • General: 1.5M
  • Microsoft: 1.7M
  • Adobe: 129K
  • TAUS other: 70K
Even for a language without a lot of training data we can see nice gains by pooling.
Validation: Dell Japanese
Training data (sentences):
  • Microsoft: 3.2M

Improving the quality of a customized SMT system using shared training data

  • 1. Improving the quality of a customized SMT system using shared training data Chris.Wendt@microsoft.com Will.Lewis@microsoft.com August 28, 2009 1
  • 2. Overview Engine and Customization Basics Experiment Objective Experiment Setup Experiment Results Validation Conclusions 2
  • 3. Microsoft’s Statistical MT Engine 3 Linguistically informed SMT
  • 8. Experiment Objective Objective Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data. 8
  • 9.
  • 10.
  • 14. Training data composition (German, Chinese (Simplified)): Sybase does not have enough data to build a system exclusively with Sybase data
  • 15. Experiment Results, measured in BLEU (charts: Chinese, German)
  • 16. Experiment Results, measured in BLEU (charts: Chinese, German)
  • 17. Experiment Results, measured in BLEU (charts: Chinese, German): More than 8 point gain compared to system built without the shared data
  • 18. Experiment Results, measured in BLEU (charts: Chinese, German): Best results are achieved using the maximum available data within the domain, using custom lambda training
  • 19. Experiment Results, measured in BLEU (charts: Chinese, German): Weight training (lambda training) without diversity in the training data has very little effect. The diversity aspect was somewhat a surprise for us. Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.
  • 20. Experiment Results, measured in BLEU (charts: Chinese, German): Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
  • 21. Experiment Results, measured in BLEU (charts: Chinese, German): A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
  • 22. Experiment Results, measured in BLEU (charts: Chinese, German): Small data providers benefit more from sharing than large data providers, but all benefit
  • 23. Experiment Results, measured in BLEU (charts: Chinese, German): This is the best German Sybase system we could have built without TAUS
  • 24. Validation: Adobe Polish. Training data (sentences): General 1.5M, Microsoft 1.7M, Adobe 129K, TAUS other 70K. Even for a language without a lot of training data we can see nice gains by pooling.
  • 28. Dell 172K. Confirms the Sybase results
  • 29. Example
      SRC: The Monitor collects metrics and performance data from the databases and MobiLink servers running on other computers, while a separate computer accesses the Monitor via a web browser.
      1: Der Monitor sammelt Metriken und Leistungsdaten von Datenbanken und MobiLink-Servern, die auf anderen Computern ausführen, während auf ein separater Computer greift auf den Monitor über einen Web-Browser.
      2a: Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
      2b: Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
      3a: Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
      3b: Der Monitor sammelt Kriterien und Performance-Daten aus der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer des Monitors über einen Webbrowser zugreift.
      REF: Der Monitor sammelt Kriterien und Performance-Daten aus den Datenbanken und MobiLink-Servern die auf anderen Computern ausgeführt werden, während ein separater Computer auf den Monitor über einen Webbrowser zugreift.
      Google: Der Monitor sammelt Metriken und Performance-Daten aus den Datenbanken und MobiLink-Server auf anderen Computern ausgeführt, während eine separate Computer auf dem Monitor über einen Web-Browser.
  • 31. Weight training (Lambda training) without diversity in the training data has almost no effect
  • 32. Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
  • 33. A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
  • 34. Best results are achieved using the maximum available data within the domain, using custom lambda training
  • 35. Small data providers benefit more from sharing than large data providers, but all benefit
  • 37. An MT system trained with the combined data can deliver significantly improved translation quality, compared to a system trained only with the provider’s own data plus baseline training.
  • 38. Customization via a separate target language model and lambda training works
  • 39. References
      Chris Quirk, Arul Menezes, and Colin Cherry. Dependency Treelet Translation: Syntactically Informed Phrasal SMT. In Proceedings of ACL, Association for Computational Linguistics, June 2005.
      Microsoft Translator: www.microsofttranslator.com
      TAUS Data Association: www.tausdata.org
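
The slides mention "lambda training" and a "separate target language model" without showing how the weights enter the system. In phrase-based SMT generally, each candidate translation is scored as a weighted sum of feature scores, and the lambdas are those weights. The sketch below illustrates that idea only. The feature names, log-scores, and weight values are all hypothetical and are not taken from the Microsoft engine; the two candidate words are borrowed from the example slide, where only system 3b and the reference use "Kriterien".

```python
# Minimal log-linear scoring sketch. All feature names, scores, and
# weights below are invented for illustration; they are NOT values
# from the system described in the slides.

def loglinear_score(features, lambdas):
    """Weighted sum of feature scores: sum_i lambda_i * h_i."""
    return sum(lambdas[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, lambdas):
    """Return the candidate with the highest weighted score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], lambdas))


# Two candidate renderings of "metrics", with hypothetical log-scores
# from a translation model (tm), a general language model, and a
# provider-specific target language model (custom_lm).
hypotheses = [
    {"text": "Metriken",
     "features": {"tm": -1.0, "general_lm": -2.0, "custom_lm": -6.0}},
    {"text": "Kriterien",
     "features": {"tm": -1.5, "general_lm": -4.0, "custom_lm": -1.0}},
]

# Weights that ignore the provider-specific language model.
general_lambdas = {"tm": 1.0, "general_lm": 1.0, "custom_lm": 0.0}

# Weights tuned toward the provider's own style and terminology.
custom_lambdas = {"tm": 1.0, "general_lm": 0.2, "custom_lm": 1.0}

print(best_hypothesis(hypotheses, general_lambdas)["text"])
print(best_hypothesis(hypotheses, custom_lambdas)["text"])
```

With these made-up numbers, the general weights pick "Metriken" and the custom weights pick "Kriterien". This is one way to read the conclusion that lambda training helps the lambda target and hurts everyone else: the same candidates are available, but the weights decide whose terminology wins.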

Editor's Notes

  1. Two things: (1) Show that pooling works, especially for data owners with less bitext than Microsoft. (2) Customization gives the data a boost.
  2. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  7. System 1: General test set BLEU = 0.1590; Tech* test set BLEU = 0.2890; TAUS Adobe test set BLEU = 0.1940
     System 3b: General test set BLEU = 0.1353; Tech* test set BLEU = 0.3388; TAUS Adobe test set BLEU = 0.3374
     Training Data (lines/words):
     - Gendom: 1520177 lines / 22632777 enu words, 19988095 plk words
     - Tech: 1786035 lines / 22717903 enu words, 21205994 plk words
     - TAUS: 199210 lines / 2439361 enu words, 2306301 plk words
     - TAUS Adobe: 129084 lines / 1664918 enu words, 1512067 plk words
  8. System 1: General test set BLEU = 0.1799; Tech* test set BLEU = 0.3788; TAUS Dell test set BLEU = 0.2672
     System 2a: General test set BLEU = 0.1476; Tech* test set BLEU = 0.3087; TAUS Dell test set BLEU = 0.3949
     System 2b: General test set BLEU = 0.1728; Tech* test set BLEU = 0.4132; TAUS Dell test set BLEU = 0.3264
     System 3a: General test set BLEU = 0.1733; Tech* test set BLEU = 0.4230; TAUS Dell test set BLEU = 0.3989
     System 3b: General test set BLEU = 0.1485; Tech* test set BLEU = 0.3221; TAUS Dell test set BLEU = 0.4243
     Training Data:
     - Gendom: 4348176 lines
     - Tech: 3299908 lines
     - TAUS: 1612637 lines
     - TAUS Dell: 172017 lines
     * = new set of Tech test sentences deduped against entire training set
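
The notes report BLEU on a 0 to 1 scale, while the slides speak of gains in "points". As a rough aid, the sketch below converts differences between systems into points, assuming the common convention that one point equals 0.01 BLEU. The scores are copied from notes 7 and 8 for the provider test sets only; the helper name `gain_in_points` is made up for this illustration.

```python
# Provider test-set BLEU scores copied from editor's notes 7 and 8.
adobe_bleu = {"1": 0.1940, "3b": 0.3374}
dell_bleu = {"1": 0.2672, "3a": 0.3989, "3b": 0.4243}


def gain_in_points(scores, system, baseline):
    """BLEU difference between two systems on a 0-100 scale
    (assumes 1 point = 0.01 BLEU)."""
    return round((scores[system] - scores[baseline]) * 100, 2)


# Customized pooled system (3b) against the general system (1).
print("Adobe 3b vs 1:", gain_in_points(adobe_bleu, "3b", "1"))
print("Dell  3b vs 1:", gain_in_points(dell_bleu, "3b", "1"))

# Effect of custom lambdas alone: 3b against 3a on the Dell test set.
print("Dell  3b vs 3a:", gain_in_points(dell_bleu, "3b", "3a"))
```

On these figures, system 3b is about 14 points above system 1 on the Adobe test set and about 16 points above it on the Dell test set, with roughly 2.5 of the Dell points separating 3b from 3a.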