SlideShare a Scribd company logo
University of Southern California
Information Sciences Institute
Many-to-English Machine Translation
Tools, Data, and Pretrained Models
Thamme Gowda
Information Sciences Institute
and Dept. of Computer Science
University of Southern California
tg@isi.edu
Zhao Zhang
Texas Advanced Computing Center
University of Texas
and NASA Jet Propulsion Lab
Chris Mattmann
Dept. of Computer Science
University of Southern California
and NASA Jet Propulsion Lab
Jonathan May
Information Sciences Institute
and Dept. of Computer Science
University of Southern California
ACL 2021 System Demonstrations (Virtual)
http://rtg.isi.edu/many-eng/
University of Southern California
Information Sciences Institute
2
Overview
• Three tools for machine translation: MTData, NLCodec, RTG
• Task: 500-to-English translation in an opensource way
– Collect a massive (bitext) dataset
– Train a massively multilingual NMT model
• Applications:
– Ready to use translation service; available via docker
– Parent model for transfer learning
University of Southern California
Information Sciences Institute
3
Tools
Focus: reproducibility and scalability
1. MTData: parallel dataset catalog and downloader
– As of June 2021, 120K+ datasets, hundreds of languages; ISO 639-3
– Publicly listed datasets: OPUS, Statmt.org, Paracrawl, ...
2. NLCodec: Vocabulary manager; and database layer
– PySpark backend for large datasets
– NLDb: Efficient storage and retrieval layer; parallelizable
3. Reader Translator Generator (RTG): NMT toolkit based on Pytorch
– Reproducible experiments; a conf.yml per experiment
– All the necessary ingredients for NMT research à production
• pip install mtdata nlcodec rtg
University of Southern California
Information Sciences Institute
4
500→English Translation
• Dataset: 500+ languages
– Dedupe, cleaning, etc ...
– Excluding the known test sets e.g. NewsTest, OPUS-100, ...
è ~474 million sentence pairs; 9 billion tokens on each side
• Model: Transformer: 768d, 9 encoder, 6 decoder,
– Separate BPE vocabularies: 512k source and 64k target embeddings
– Large batches: ~720k toks per step, 200K steps
– Gradient accumulation (5x), Float-16 ops,
and distributed training on 8x A100 GPUs
University of Southern California
Information Sciences Institute
5
Translation Service: BLEU on OPUS-100 Test Set
University of Southern California
Information Sciences Institute
6
Transfer Learning: Fine Tuning
Model BRE-ENG SME-ENG
Baseline 12.7 10.7
500-eng parent 11.8 8.6
Finetuned 22.8 19.1
• E.g., two low-resource langs
BRE: 1.2M ENG toks
SME: 100K ENG toks
• Huge improvements in BLEU!
University of Southern California
Information Sciences Institute
7
Take Away
• Home page: http://rtg.isi.edu/many-eng/
– Demo service, data, models, tutorials ...
– Dataset: http://rtg.isi.edu/many-eng/data-v1.html
https://opus.nlpl.eu/MT560.php [Thanks, Jörg Tiedemann]
– Models: http://rtg.isi.edu/many-eng/models/
• Docker: IMAGE=tgowda/rtg-model:500toEng-v1
docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE
• Integrated to Apache Tika
– Parses html, pdf, epub, docx, ppt, ... runs OCR on images
University of Southern California
Information Sciences Institute
8
Last Slide: Thanks 🙏
• Help bring more languages and datasets into MTdata
– Issues and Pull requests are welcome
• Future work:
MTData currently uses ISO 639-3, which has limitations
– Language ID: (lang, script, region) e.g., BCP-47
• ISO 639 for language names
• ISO 15924 for script names
• ISO 3166-1 for region names
– Script and region can be optional, i.e., assume default values

More Related Content

Similar to 500 languages to English Machine Translation Model

An Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive ResearchAn Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive Research
Larry Smarr
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
Larry Smarr
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
Ola Spjuth
 
CHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformCHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning Platform
Larry Smarr
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
inside-BigData.com
 
The Exascale Computing Project and the future of HPC
The Exascale Computing Project and the future of HPCThe Exascale Computing Project and the future of HPC
The Exascale Computing Project and the future of HPC
inside-BigData.com
 
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
Larry Smarr
 
Larry Smarr - NRP Application Drivers
Larry Smarr - NRP Application DriversLarry Smarr - NRP Application Drivers
Larry Smarr - NRP Application Drivers
Larry Smarr
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
Larry Smarr
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
inside-BigData.com
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
guestd60742
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
Ian Foster
 
AppTek-CLimateGPT-EvryWS20240308-v3.pptx
AppTek-CLimateGPT-EvryWS20240308-v3.pptxAppTek-CLimateGPT-EvryWS20240308-v3.pptx
AppTek-CLimateGPT-EvryWS20240308-v3.pptx
Gérard Chollet
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
Masoud Nikravesh
 
Pioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSCPioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSC
inside-BigData.com
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
Larry Smarr
 
Getting Access to ALCF Resources and Services
Getting Access to ALCF Resources and ServicesGetting Access to ALCF Resources and Services
Getting Access to ALCF Resources and Services
davidemartin
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
Larry Smarr
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research Platform
Larry Smarr
 
Grid Computing In Israel
Grid Computing  In IsraelGrid Computing  In Israel
Grid Computing In Israel
Guy Tel-Zur
 

Similar to 500 languages to English Machine Translation Model (20)

An Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive ResearchAn Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive Research
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
 
CHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformCHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning Platform
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
The Exascale Computing Project and the future of HPC
The Exascale Computing Project and the future of HPCThe Exascale Computing Project and the future of HPC
The Exascale Computing Project and the future of HPC
 
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political...
 
Larry Smarr - NRP Application Drivers
Larry Smarr - NRP Application DriversLarry Smarr - NRP Application Drivers
Larry Smarr - NRP Application Drivers
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
AppTek-CLimateGPT-EvryWS20240308-v3.pptx
AppTek-CLimateGPT-EvryWS20240308-v3.pptxAppTek-CLimateGPT-EvryWS20240308-v3.pptx
AppTek-CLimateGPT-EvryWS20240308-v3.pptx
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
 
Pioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSCPioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSC
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
Getting Access to ALCF Resources and Services
Getting Access to ALCF Resources and ServicesGetting Access to ALCF Resources and Services
Getting Access to ALCF Resources and Services
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research Platform
 
Grid Computing In Israel
Grid Computing  In IsraelGrid Computing  In Israel
Grid Computing In Israel
 

More from Thamme Gowda

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
Thamme Gowda
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
Thamme Gowda
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Thamme Gowda
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
Thamme Gowda
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
Thamme Gowda
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda
 
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
Thamme Gowda
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
Thamme Gowda
 

More from Thamme Gowda (9)

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
 
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
 

Recently uploaded

Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 

Recently uploaded (20)

Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 

500 languages to English Machine Translation Model

  • 1. University of Southern California Information Sciences Institute Many-to-English Machine Translation Tools, Data, and Pretrained Models Thamme Gowda Information Sciences Institute and Dept. of Computer Science University of Southern California tg@isi.edu Zhao Zhang Texas Advanced Computing Center University of Texas and NASA Jet Propulsion Lab Chris Mattmann Dept. of Computer Science University of Southern California and NASA Jet Propulsion Lab Jonathan May Information Sciences Institute and Dept. of Computer Science University of Southern California ACL 2021 System Demonstrations (Virtual) http://rtg.isi.edu/many-eng/
  • 2. University of Southern California Information Sciences Institute 2 Overview • Three tools for machine translation: MTData, NLCodec, RTG • Task: 500-to-English translation in an opensource way – Collect a massive (bitext) dataset – Train a massively multilingual NMT model • Applications: – Ready to use translation service; available via docker – Parent model for transfer learning
  • 3. University of Southern California Information Sciences Institute 3 Tools Focus: reproducibility and scalability 1. MTData: parallel dataset catalog and downloader – As of June 2021, 120K+ datasets, hundreds of languages; ISO 639-3 – Publicly listed datasets: OPUS, Statmt.org, Paracrawl, ... 2. NLCodec: Vocabulary manager; and database layer – PySpark backend for large datasets – NLDb: Efficient storage and retrieval layer; parallelizable 3. Reader Translator Generator (RTG): NMT toolkit based on Pytorch – Reproducible experiments; a conf.yml per experiment – All the necessary ingredients for NMT research à production • pip install mtdata nlcodec rtg
  • 4. University of Southern California Information Sciences Institute 4 500→English Translation • Dataset: 500+ languages – Dedupe, cleaning, etc ... – Excluding the known test sets e.g. NewsTest, OPUS-100, ... è ~474 million sentence pairs; 9 billion tokens on each side • Model: Transformer: 768d, 9 encoder, 6 decoder, – Separate BPE vocabularies: 512k source and 64k target embeddings – Large batches: ~720k toks per step, 200K steps – Gradient accumulation (5x), Float-16 ops, and distributed training on 8x A100 GPUs
  • 5. University of Southern California Information Sciences Institute 5 Translation Service: BLEU on OPUS-100 Test Set
  • 6. University of Southern California Information Sciences Institute 6 Transfer Learning: Fine Tuning Model BRE-ENG SME-ENG Baseline 12.7 10.7 500-eng parent 11.8 8.6 Finetuned 22.8 19.1 • E.g., two low-resource langs BRE: 1.2M ENG toks SME: 100K ENG toks • Huge improvements in BLEU!
  • 7. University of Southern California Information Sciences Institute 7 Take Away • Home page: http://rtg.isi.edu/many-eng/ – Demo service, data, models, tutorials ... – Dataset: http://rtg.isi.edu/many-eng/data-v1.html https://opus.nlpl.eu/MT560.php [Thanks, Jörg Tiedemann] – Models: http://rtg.isi.edu/many-eng/models/ • Docker: IMAGE=tgowda/rtg-model:500toEng-v1 docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE • Integrated to Apache Tika – Parses html, pdf, epub, docx, ppt, ... runs OCR on images
  • 8. University of Southern California Information Sciences Institute 8 Last Slide: Thanks 🙏 • Help bring more languages and datasets into MTdata – Issues and Pull requests are welcome • Future work: MTData currently uses ISO 639-3, which has limitations – Language ID: (lang, script, region) e.g., BCP-47 • ISO 639 for language names • ISO 15924 for script names • ISO 3166-1 for region names – Script and region can be optional, i.e., assume default values