2. Building a machine translation system is complicated
[Figure: preprocessing pipeline for an English-Japanese parallel corpus]
File Reader (EN): I don't like it. | What's your... | Yes, we can! | ...
File Reader (JA): 僕は嫌いです。 | ..は何ですか? | 我々にはできる! | ...
Tokenizer EN: I/do/n't/like/it/. | What/'s/your/... | Yes/,/we/can/! | ...
Tokenizer JA: 僕/は/嫌い/で/す/。 | ../は/何/で/す/か/? | 我々/に/は/でき/る/! | ...
Cleaner (removes pairs if len(words) > 80): I/do/n't/like/it/. + 僕/は/嫌い/で/す/。 | Yes/,/we/can/! + 我々/に/は/でき/る/! | ...
Parser EN: I do n't like it . | ...
Parser JA: 僕 は 嫌い で す 。 | ...
▶ Tokenization (splits into words)
▶ Cleaning (removes long lines, personal info, and garbage)
▶ Syntax parsing
▶ Alignment (pairs words that have the same meaning)
▶ Training (extracts translation rules)
▶ Tuning (adjusts parameters)
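The cleaning step in the figure ("Remove pairs if len(words) > 80") can be sketched as below. This is a minimal illustration, not the author's code: the function name `clean_pairs` is hypothetical, and applying the limit to either side of the pair is an assumption, since the slide does not say which side is measured.

```python
MAX_WORDS = 80  # threshold from the slide: "Remove pairs if len(words) > 80"


def clean_pairs(pairs, max_words=MAX_WORDS):
    """Yield only the (en_words, ja_words) pairs whose sides are both short enough.

    Assumption: a pair is dropped when EITHER side exceeds the limit.
    """
    for en_words, ja_words in pairs:
        if len(en_words) > max_words or len(ja_words) > max_words:
            continue  # drop both sides together so EN and JA stay aligned
        yield en_words, ja_words


pairs = [
    ("I do n't like it .".split(), "僕 は 嫌い で す 。".split()),
    (["word"] * 81, ["語"] * 3),  # too long on the EN side, so the pair is removed
    ("Yes , we can !".split(), "我々 に は でき る !".split()),
]
kept = list(clean_pairs(pairs))
print(len(kept))  # 2
```

Dropping the whole pair, rather than one sentence, is what keeps line N of the EN file matched with line N of the JA file in later stages.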
2015-04-25 Koichi Akabe (AHC-Lab) 2 / 20
3. We want:
▶ to write this process more easily.
▶ to run this process faster.
4. “UNIX Pipeline” can make some processes faster
[Figure: the preprocessing pipeline from slide 2]
5. “UNIX Pipeline” can make some processes faster
[Figure: the preprocessing pipeline from slide 2]
▶ ○ Tokenization (splits into words)
▶ ○ Cleaning (removes long lines, personal info, and garbage)
▶ ○ Syntax parsing
▶ △ Alignment (pairs words that have the same meaning)
▶ Training (extracts translation rules)
▶ Tuning (adjusts parameters)
6. How to run it in a multi-threaded environment
[Figure: the preprocessing pipeline from slide 2. Legend: Single-thread / Multi-thread]
7. More complicated situation
[Figure: a larger pipeline with branching EN and JA streams. Stages shown: File Reader, Tokenizer, Cleaner, Parser, Lower Caser, File Writer]
10. A simple multi-producer/multi-consumer queue will break order
Single-thread (ST) programs require their input in the original order, but
multi-thread (MT) programs do not guarantee that order.
[Figure: ST and MT processes connected by queues. Item sequences shown: 1, 2, 3, 4, 5, ...; 1, 2, 3, 4; 2, 4, 1, 3; 4, 3, 1, 2]
Items from faster threads are pushed into the queue first, regardless of their original position.
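The effect can be reproduced with a plain thread-safe queue. The sketch below is only an illustration: the per-item delays in `DELAYS` are made-up values chosen so that item 1 is slow and item 4 is fast, and the worker layout is an assumption, not the setup used in the talk.

```python
import queue
import threading
import time

# Hypothetical processing times in seconds: item 1 is the slowest, item 4 the fastest.
DELAYS = {1: 0.3, 2: 0.1, 3: 0.2, 4: 0.0}


def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:  # sentinel: no more work for this thread
            break
        time.sleep(DELAYS[item])  # simulate uneven processing cost
        out_q.put(item)  # pushed as soon as this thread finishes


def run_unordered(items, n_threads=4):
    in_q, out_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        in_q.put(item)
    for _ in threads:
        in_q.put(None)
    for t in threads:
        t.join()
    return [out_q.get() for _ in items]


result = run_unordered([1, 2, 3, 4])
print(result)  # same items as the input, but no longer in input order
```

Nothing is lost, but the fast item 4 overtakes the slow item 1, so a single-thread stage reading this queue sees the lines in a different order from the input file.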
11. Multi-thread programs also have to care about order if they use
multiple inputs
Pairs will be broken if each thread puts items into queues
independently.
[Figure: MT and ST processes connected by two queues, each carrying items 1, 2, 3, 4, 5, ... Pairs shown: 4,1 / 3,3 / 1,2 / 2,4]
13. mt-chamber guarantees line order and correct pairs
[Figure: ST and MT processes connected by queues. Item sequences shown: 1, 2, 3, 4, 5, ...; 1, 2, 3, 4; 1, 2, 3, 4; 4, 3, 1, 2]
Blocks higher numbered items
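One way to get this behaviour is a queue whose `put` waits until every lower-numbered item has been put. The sketch below illustrates that idea only; the class `OrderedQueue`, the 0-based indices and the delays are assumptions made for this example, and it does not claim to show how mt-chamber is implemented.

```python
import queue
import threading
import time


class OrderedQueue:
    """Queue whose put() blocks until all lower-numbered items have been put."""

    def __init__(self):
        self._queue = queue.Queue()
        self._next_index = 0
        self._cond = threading.Condition()

    def put(self, index, item):
        with self._cond:
            # Block higher-numbered items until it is their turn.
            self._cond.wait_for(lambda: index == self._next_index)
            self._queue.put(item)
            self._next_index += 1
            self._cond.notify_all()

    def get(self):
        return self._queue.get()


def worker(in_q, out_q, delays):
    while True:
        task = in_q.get()
        if task is None:  # sentinel: no more work for this thread
            break
        index, item = task
        time.sleep(delays[index])  # simulate uneven processing cost
        out_q.put(index, item)


def run_ordered(items, delays, n_threads=4):
    in_q, out_q = queue.Queue(), OrderedQueue()
    threads = [
        threading.Thread(target=worker, args=(in_q, out_q, delays))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for task in enumerate(items):  # tag every item with its line number
        in_q.put(task)
    for _ in threads:
        in_q.put(None)
    for t in threads:
        t.join()
    return [out_q.get() for _ in items]


# The first item is the slowest, yet the output keeps the input order.
result = run_ordered([1, 2, 3, 4], delays=[0.3, 0.1, 0.2, 0.0])
print(result)  # [1, 2, 3, 4]
```

Because tasks are handed out in input order, the lowest unfinished index is always held by some running thread, so a blocked `put` is eventually released.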
14. mt-chamber guarantees line order and correct pairs
[Figure: MT and ST processes connected by two queues, each carrying items 1, 2, 3, 4, 5, ... Pairs shown: 4,4 / 3,3 / 1,1 / 2,2]
Makes correct pairs before putting items
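Pairing before putting can be sketched as a small join on the line number: items from either input are held back until the item with the same number arrives from the other input. The function `pair_by_index`, the `"en"`/`"ja"` side tags and the sample arrival order are all hypothetical, chosen for illustration; this is not mt-chamber's actual code.

```python
def pair_by_index(tagged_items):
    """Join two interleaved streams on their line number.

    tagged_items yields (side, index, item) with side "en" or "ja", in any order.
    Yields (index, en_item, ja_item) as soon as both sides of an index have arrived.
    """
    pending = {"en": {}, "ja": {}}
    for side, index, item in tagged_items:
        other = "ja" if side == "en" else "en"
        if index in pending[other]:
            partner = pending[other].pop(index)
            en_item, ja_item = (item, partner) if side == "en" else (partner, item)
            yield index, en_item, ja_item
        else:
            pending[side][index] = item  # wait for the matching item


# Items arrive out of order and interleaved, as they would from independent threads.
arrivals = [
    ("en", 4, "e4"), ("ja", 1, "j1"), ("en", 3, "e3"), ("ja", 3, "j3"),
    ("en", 1, "e1"), ("ja", 2, "j2"), ("en", 2, "e2"), ("ja", 4, "j4"),
]
pairs = list(pair_by_index(arrivals))
print(pairs)
# [(3, 'e3', 'j3'), (1, 'e1', 'j1'), (2, 'e2', 'j2'), (4, 'e4', 'j4')]
```

The pairs may still come out in any order, as in the figure (4,4 / 3,3 / 1,1 / 2,2), but each one holds the EN and JA items with the same line number.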