What is mt-chamber?
Koichi Akabe
AHC Lab
2015-04-25
2015-04-25 Koichi Akabe (AHC-Lab) 1 / 20
Building a machine translation system is complicated
[Figure: preprocessing pipeline for an English-Japanese parallel corpus. Two File Readers (EN, JA) feed Tokenizer EN and Tokenizer JA, then a shared Cleaner (removes pairs if len(words) > 80), then Parser EN and Parser JA. Example: "I don't like it." / "僕は嫌いです。" is tokenized to "I/do/n't/like/it/." / "僕/は/嫌い/で/す/。".]
▶ Tokenization (splits into words)
▶ Cleaning (removes long lines, personal info, and garbage)
▶ Syntax parsing
▶ Alignment (pairs words with the same meaning)
▶ Training (extracts translation rules)
▶ Tuning (adjusts parameters)
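As a rough illustration of the cleaning step, the sketch below drops a sentence pair when either side has more than 80 words, following the "Remove pairs if len(words) > 80" rule shown on the slide. The name `clean_pairs`, the token-list input format, and the choice to test both sides of the pair are assumptions made for this example, not part of the original pipeline.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_WORDS = 80  # threshold from the slide: remove pairs if len(words) > 80

Pair = Tuple[List[str], List[str]]


def clean_pairs(pairs: Iterable[Pair], max_words: int = MAX_WORDS) -> Iterator[Pair]:
    """Yield only the pairs where neither side exceeds max_words tokens."""
    for en_words, ja_words in pairs:
        # Dropping the whole pair keeps the EN and JA outputs line-aligned.
        if len(en_words) > max_words or len(ja_words) > max_words:
            continue
        yield en_words, ja_words


corpus = [
    ("I do n't like it .".split(), "僕 は 嫌い で す 。".split()),
    (["word"] * 81, ["語"] * 10),  # EN side is too long, so the pair is removed
    ("Yes , we can !".split(), "我々 に は でき る !".split()),
]
kept = list(clean_pairs(corpus))
print(len(kept))
```

Removing both sides together is what keeps the two language files parallel after cleaning.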
We want:
▶ to write this process more easily.
▶ to run this process faster.
“UNIX Pipeline” can make some processes faster
[Figure: the same File Reader → Tokenizer → Cleaner → Parser pipeline for EN and JA, with the stages connected as a UNIX pipeline.]
▶ ○ Tokenization (splits into words)
▶ ○ Cleaning (removes long lines, personal info, and garbage)
▶ ○ Syntax parsing
▶ △ Alignment (pairs words with the same meaning)
▶ Training (extracts translation rules)
▶ Tuning (adjusts parameters)
How to run it in a multi-thread environment
[Figure: the same File Reader → Tokenizer → Cleaner → Parser pipeline for EN and JA, with stages marked as single-thread or multi-thread.]
A more complicated situation
[Figure: a branching pipeline graph with two File Readers, two Tokenizers, one Cleaner, two Parsers, and several Lower Caser and File Writer branches; the edges are labeled EN or JA.]
Named pipes in UNIX support multi-thread pipelines!
A “named pipe” works as a multi-producer/multi-consumer queue.
$ mkfifo ./tok_queue # create named pipe
$ mkfifo ./cln_queue
$ cleaner < ./cln_queue > ... &
$ cleaner < ./cln_queue > ... &
....
$ tokenizer < ./tok_queue > ./cln_queue &
$ tokenizer < ./tok_queue > ./cln_queue &
....
$ cat ./corpus_data > ./tok_queue
[Figure: cat → tok_queue → tokenizers → cln_queue → cleaners]
Solution?
A simple multi-producer/multi-consumer queue breaks line order
Single-thread (ST) programs require their input in the original order, but multi-thread (MT) programs do not guarantee that order.
[Figure: items 1, 2, 3, 4, 5, ... flow through MT and ST processes connected by queues; after an MT process the items arrive out of order (for example 2, 4, 1, 3).]
Items from faster threads are pushed into the queue first.
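The effect can be reproduced with an in-process queue. The sketch below is only an analogy to the named-pipe setup: it uses Python's `queue.Queue` and `threading` instead of FIFOs, and the helper name `run_workers` and the lowercasing step are placeholders chosen for the example. Each worker pushes its result as soon as it is ready, so nothing ties the output order to the input order.

```python
import queue
import threading


def run_workers(lines, n_threads=4):
    """Process lines with several threads sharing one input and one output queue."""
    in_q = queue.Queue()
    out_q = queue.Queue()

    def worker():
        while True:
            line = in_q.get()
            if line is None:  # sentinel: no more work for this thread
                break
            # The result is pushed as soon as it is ready, whatever its line number.
            out_q.put(line.lower())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for line in lines:
        in_q.put(line)
    for _ in threads:
        in_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return [out_q.get() for _ in range(len(lines))]


lines = [f"LINE {i}" for i in range(1000)]
results = run_workers(lines)
# Every line is processed exactly once, but the output order is not guaranteed.
print(sorted(results) == sorted(line.lower() for line in lines))
```

Only the set of results is stable between runs; a downstream single-thread stage that relies on line order cannot consume `results` directly.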
Multi-thread programs also have to take care of order when they use multiple inputs
Pairs are broken if each thread puts items into the queues independently.
[Figure: two queues (1, 2, 3, 4, 5, ...) pass through MT and ST processes into a two-input MT process, which receives mismatched pairs: 4,1 / 3,3 / 1,2 / 2,4.]
mt-chamber
A framework for multi-thread pipeline processing
mt-chamber guarantees line order and correct pairs
[Figure: items 1, 2, 3, 4, 5, ... flow through MT and ST processes connected by queues; items that finish out of order (4, 3, 1, 2) are delivered to the next process as 1, 2, 3, 4.]
Blocks higher-numbered items until the lower-numbered ones have arrived
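One common way to get this behavior is a reorder buffer keyed by line number. The sketch below is an illustration of that general technique, not mt-chamber's actual implementation; the class name `ReorderBuffer`, its `push` method, and the 0-based sequence numbers are assumptions made for the example.

```python
import heapq


class ReorderBuffer:
    """Release (seq, item) pairs strictly in sequence order."""

    def __init__(self):
        self._heap = []     # min-heap of (seq, item) pairs that arrived early
        self._next_seq = 0  # the sequence number we are waiting for

    def push(self, seq, item):
        """Store one item and return the items that are now ready, in order."""
        heapq.heappush(self._heap, (seq, item))
        ready = []
        # Release items only while the smallest stored number is the expected one.
        while self._heap and self._heap[0][0] == self._next_seq:
            ready.append(heapq.heappop(self._heap)[1])
            self._next_seq += 1
        return ready


buf = ReorderBuffer()
# Arrival order 2, 4, 1, 3 in the slides' 1-based numbering.
arrival_order = [(1, "b"), (3, "d"), (0, "a"), (2, "c")]
output = []
for seq, item in arrival_order:
    output.extend(buf.push(seq, item))
print(output)  # ['a', 'b', 'c', 'd']
```

Items numbered higher than the expected one simply stay in the heap, which is the "blocking" the slide describes.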
mt-chamber guarantees line order and correct pairs
[Figure: two queues (1, 2, 3, 4, 5, ...) pass through MT and ST processes into a two-input MT process, which receives matched pairs: 4,4 / 3,3 / 1,1 / 2,2.]
Makes correct pairs before putting items into the queue
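The pairing idea can be sketched with the standard library: join the two inputs into pairs first, then hand whole pairs to the worker threads. This example swaps in `concurrent.futures.ThreadPoolExecutor` rather than mt-chamber's own scheduler, and `process_pair` is a hypothetical per-pair step.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pair(pair):
    """Hypothetical per-pair step: lowercase the EN side, keep the JA side."""
    en, ja = pair
    return en.lower(), ja


en_lines = ["I don't like it.", "Yes, we can!"]
ja_lines = ["僕は嫌いです。", "我々にはできる!"]

# Pair the lines first, so a worker always receives both sides of the same line.
pairs = list(zip(en_lines, ja_lines))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map returns results in the order of its inputs,
    # regardless of which thread finishes first.
    results = list(pool.map(process_pair, pairs))

print(results)
```

Because each work item already contains both sides, no thread can combine line 4 of one input with line 1 of the other.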
Short example
# ./example_code
Alias KyTeaTokenizer System:command="kytea -notags"
Read:file="./input" > inputdata
KyTeaTokenizer < inputdata > tokenized
Write:file="./output" < tokenized
$ ./mt-chamber.py --threads 20 < ./example_code
[Figure: Read (./input) → System (kytea) → Write (./output)]
Multiple input
Read:file="./en" > EN_raw
Read:file="./ja" > JA_raw
WesternTokenizer < EN_raw > EN_tok
KyTeaTokenizer < JA_raw > JA_tok
LengthCleaner < EN_tok JA_tok > EN_clean JA_clean
Write:file="./en.clean" < EN_clean
Write:file="./ja.clean" < JA_clean
[Figure: Read (./en) → Western Tok and Read (./ja) → KyTea Tok both feed Length Cleaner, which feeds Write (./en.clean) and Write (./ja.clean).]
To define your own commands, write plugins
# plugins/LowerCaser.py
class Command:
    InputSize = 1
    OutputSize = 1
    MultiThreadable = True
    ShareResources = True

    def routine(self, instream):
        return (instream[0].lower(),)
All commands are implemented like this.
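To make the plugin shape concrete, the sketch below exercises such a class outside the framework. It assumes that `routine` receives a tuple with `InputSize` items and returns a tuple with `OutputSize` items, which is inferred from the LowerCaser example and not confirmed by the slides. `PairLowerCaser` and `call_plugin` are hypothetical names introduced only for this illustration.

```python
class LowerCaser:
    """Same shape as the Command class in plugins/LowerCaser.py."""
    InputSize = 1
    OutputSize = 1
    MultiThreadable = True
    ShareResources = True

    def routine(self, instream):
        return (instream[0].lower(),)


class PairLowerCaser:
    """Hypothetical two-input variant: lowercases the first stream only."""
    InputSize = 2
    OutputSize = 2
    MultiThreadable = True
    ShareResources = True

    def routine(self, instream):
        en, ja = instream
        return (en.lower(), ja)


def call_plugin(command, *inputs):
    """Call routine the way this sketch assumes the framework does."""
    assert len(inputs) == command.InputSize
    outputs = command.routine(tuple(inputs))
    assert len(outputs) == command.OutputSize
    return outputs


print(call_plugin(LowerCaser(), "Yes, We Can!"))
print(call_plugin(PairLowerCaser(), "Yes, We Can!", "我々にはできる!"))
```

Keeping `routine` free of shared mutable state is what would make a command like this safe to mark as multi-threadable.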
Experiments
Experiment setup
Measure the performance of tokenization and syntax parsing in each environment.
Read line −→ Tokenize −→ Clean −→ Syntax parse −→ Write line
Thread count comparison:
▶ Unix pipe (each process is single-threaded)
▶ 1 thread (./mt-chamber.py --threads 1)
▶ 5 threads
▶ 10 threads
unsrt-limit comparison:
(“unsrt-limit” is the size of the buffer where higher-numbered items wait for lower-numbered ones so that the output can be sorted)
▶ 100 items
▶ 500 items
▶ 1000 items
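One way to picture the limit is a reorder buffer that refuses items too far ahead of the next expected line. The sketch below is a guess at the idea, not mt-chamber's implementation: it assumes an item numbered `seq` may enter only while `seq < next_seq + limit`, and the names `BoundedReorderBuffer` and `try_push` are hypothetical.

```python
class BoundedReorderBuffer:
    """Reorder buffer that bounds how far ahead a stored item may be.

    Assumption for this sketch: an item numbered seq is accepted only if
    seq < next_seq + limit; otherwise its producer has to wait and retry.
    """

    def __init__(self, limit):
        self.limit = limit
        self.next_seq = 0
        self.waiting = {}  # seq -> item, for items that arrived early

    def try_push(self, seq, item):
        """Return (accepted, ready), where ready lists items now in order."""
        if seq >= self.next_seq + self.limit:
            return False, []
        self.waiting[seq] = item
        ready = []
        while self.next_seq in self.waiting:
            ready.append(self.waiting.pop(self.next_seq))
            self.next_seq += 1
        return True, ready


buf = BoundedReorderBuffer(limit=2)
print(buf.try_push(1, "b"))  # accepted, waits for item 0
print(buf.try_push(2, "c"))  # rejected: too far ahead of item 0
print(buf.try_push(0, "a"))  # accepted, releases items 0 and 1
print(buf.try_push(2, "c"))  # accepted now
```

Under this reading, a small limit bounds memory use but makes fast threads stall behind one slow line, while a larger limit lets them keep working.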
Experiment setup
Corpus:
▶ English-Japanese translation corpus
▶ 10000 sentences before cleaning (after: 9920 sentences)
▶ Japanese tokenizer: KyTea
▶ Syntax parser: Ckylark (both languages)
Experiment results
[Figure: bar chart of time [sec] (axis 0 to 6000) for Unix Pipe, 1 thread, 5 threads, and 10 threads, with unsrt-limit = 100, 500, and 1000.]
▶ mt-chamber with 1 thread is not slower than “Unix Pipe”
▶ unsrt-limit matters for running multi-thread pipelines faster
See also
https://github.com/vbkaisetsu/mt-chamber