Journey of Bsdconv


Unicode, Charset, Encoding, Conversion, Detection, Variants


  1. BSDCONV
     Buganini Q
     Since 2009
  2. Charset & Encoding
     - Character set: a collection of characters
     - Encoding: the binary representation of those characters
  3. Charset & Encoding
     Figure: Character Sets
       Unicode (32-bit address space)
       Unicode up to U+10FFFF
       Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
       GB18030
       CNS11643
       CP950
       Latin1
  4. Charset & Encoding
     Figure: Character sets and their encodings
       Character set                     Encoding
       Unicode (32-bit address space)    UTF-32 / UCS4
       Unicode up to U+10FFFF            UTF-8 [1] / UTF-16
       Unicode BMP (up to U+FFFF)        UCS2
       GB18030                           GB18030
       CNS11643                          CNS11643
       CP950                             CP950 (DBCS)
       Latin1                            ISO-8859-1 / EBCDIC-037 [2]
     [1] Could cover more, but is restricted by RFC 3629.
     [2] Also known as IBM-37; some control characters differ from ISO-8859-1.
  5. Encoding :: UTF-32 / UCS4
     - Fixed length: 4 bytes
     - File size *= 4 for an ASCII text file
     - Incompatible with the C-style string convention
     - Endianness concern
  6. Encoding :: UCS2
     - Fixed length: 2 bytes
     - File size *= 2 for an ASCII text file
     - Incompatible with the C-style string convention
     - Endianness concern
     - BMP-only
  7. Encoding :: UTF-16
     - Variable length: 2 bytes, or 4 bytes (surrogate pairs)
     - Surrogates use U+D800..U+DFFF
     - Incompatible with the C-style string convention
     - Endianness concern
     Table: UTF-16 Structure
       2 bytes   ******** ********
       4 bytes   110110** ******** 110111** ********
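The surrogate structure in the table can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python, not bsdconv code; the function name `to_surrogates` is made up for the illustration.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 | (v >> 10)      # 110110** ********  (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)     # 110111** ********  (bottom 10 bits)
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(0x1F600).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Both halves fall inside U+D800..U+DFFF, which is why that range is reserved and never assigned to characters.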
  8. Encoding :: UTF-8
     - Variable length: 1 to 6 bytes in the original design; RFC 3629 restricts it to 1 to 4 bytes
     - Compatible with the C-style string convention
     - Self-synchronizing
     - Endian-neutral
     - Sorting order = code point order
     Table: UTF-8 Structure
       1 byte    0******* (ASCII)
       2 bytes   110***** 10******
       3 bytes   1110**** 10****** 10******
       4 bytes   11110*** 10****** 10****** 10******
       5 bytes   111110** 10****** 10****** 10****** 10******
       6 bytes   1111110* 10****** 10****** 10****** 10****** 10******
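Two of the properties listed for UTF-8, self-synchronization and byte order matching code point order, are easy to demonstrate. The sketch below uses only Python's built-in codec; the helper names `is_continuation` and `resync` are illustrative.

```python
def is_continuation(byte):
    """Continuation bytes always look like 10******."""
    return byte & 0xC0 == 0x80


def resync(data, pos):
    """From an arbitrary offset, skip forward to the start of the next character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Encoded length grows with the code point.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex()}")

# Sorting the encoded bytes gives the same order as sorting by code point.
chars = ["z", "\u00e9", "\u20ac", "\U0001F600", "a"]
assert sorted(chars, key=lambda c: c.encode("utf-8")) == sorted(chars, key=ord)
```

Landing in the middle of a character costs at most a few skipped bytes, because a lead byte can never be mistaken for a continuation byte.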
  9. Encoding :: CNS11643 (全字庫) #issue
     http://www.cns11643.gov.tw/
     - Only used by the Taiwan government
     - NOT a subset of Unicode
     - Not just a charset/encoding
     Table: Examples of the information 全字庫 provides for 「萌」
       Font
       Pronunciation      ㄇㄥˊ / méng
       Radical            艸
       Component          艹日月
       Stroke
       Tra/Sim mapping    萌蕄
  10. Encoding :: CCCII
      - Variants: a variant glyph sits at a different plane
      - Mostly used for library indexing
        強   21 3D 48
        彊   2D 3D 48
        强   33 3D 48
  11. Encoding :: Big5
      - Many incompatible variations (abusing the PUA); no standard tool handles them all
        http://moztw.org/docs/big5/
        Scenario      Dominating encoding
        Microsoft     CP950
        Taiwan BBS    UAO (Unicode-at-Once)
        gov.tw        Big5-2003
        gov.hk        HKSCS (1999, 2001, 2004)
      - Special characters conflict: the second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have a special meaning in certain contexts
  12. Bsdconv :: Decoding and Encoding
      - Alternative to iconv
      Figure: Basic two-phase conversion
        ISO-8859-1:UTF-8
        (from : to)
  13. Bsdconv :: Codecs & Fallback
      - Optionally produce a question mark (U+003F) as the replacement
        Figure: Fallback codec
          UTF-8,3F:ASCII,3F
          (from : to)
      - Transliteration
        Figure: Multiple fallback codecs
          UTF-8:CP936,CP936-TRANS,3F
          (from : to)
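The fallback chain (try the main codec, then a transliteration table, then a question mark) can be imitated in plain Python. This is only an analogy to the comma-separated codec list, not bsdconv's implementation; the `TRANSLIT` table and the function name are hypothetical.

```python
# Hypothetical transliteration table, standing in for a codec such as CP936-TRANS.
TRANSLIT = {"\u00e9": "e", "\u00ef": "i"}


def encode_with_fallback(text, encoding, translit):
    """Encode character by character, trying each fallback in order."""
    out = bytearray()
    for ch in text:
        # Order mirrors "CODEC,CODEC-TRANS,3F": exact, transliterated, then "?".
        for candidate in (ch, translit.get(ch), "?"):
            if candidate is None:
                continue
            try:
                out += candidate.encode(encoding)
                break
            except UnicodeEncodeError:
                continue
    return bytes(out)


print(encode_with_fallback("na\u00efve caf\u00e9 \u2603", "ascii", TRANSLIT))

# The built-in error handler covers only the last step of the chain.
print("na\u00efve caf\u00e9".encode("ascii", errors="replace"))
```

Python's `errors="replace"` corresponds to a single `3F` fallback; the loop above shows why an ordered list of fallbacks is more flexible.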
  15. Big5 5C issue (許功蓋)
      BIG5:BIG5-5C,BIG5
        #               Input                Output
        Big5 literal    "成功"               "成功\"
        ASCII/Hex       "\xA6\xA8\xA5\"      "\xA6\xA8\xA5\\"
      BIG5-5C,BIG5:BIG5
        #               Input                Output
        Big5 literal    "成功\"              "成功"
        ASCII/Hex       "\xA6\xA8\xA5\\"     "\xA6\xA8\xA5\"
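The 5C problem is visible with Python's built-in `big5` codec: the trail byte of 功 is 0x5C, the ASCII backslash. The escaping function below is a sketch of the idea suggested by the table (doubling the backslash after such a character); it is an assumption about what BIG5-5C does, not its actual code.

```python
raw = "成功".encode("big5")
print(raw.hex())
assert raw[-1] == 0x5C   # the second byte of 功 is an ASCII backslash


def escape_trailing_5c(data):
    """Append an extra backslash after every double-byte character ending in 0x5C."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] >= 0x81 and i + 1 < len(data):   # lead byte of a double-byte character
            out += data[i:i + 2]
            if data[i + 1] == 0x5C:
                out.append(0x5C)                    # protect the trail byte from unescaping
            i += 2
        else:                                       # plain single-byte character
            out.append(data[i])
            i += 1
    return bytes(out)


print(escape_trailing_5c(raw).hex())
```

Walking the data by character boundaries matters here: a naive byte-level search for 0x5C would also hit genuine backslashes and lead bytes of other characters.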
  16. Traditional/Simplified Chinese
      - NOT a one-to-one mapping
          Traditional   乾 幹 干
          Simplified    干 干 干
      - Context dependent: 之後 (afterwards)、夜之后 (Queen of the Night)、入夜之後 (after nightfall)
      - Variants: 峰、峯
  17. Project Chvar (1/2)
      https://github.com/buganini/chvar
      Figure: Two-level grouping in Chvar
        Compatibility group
          Canonical group: 签 簽
          Canonical group: 籖 籤
      Table: Canonical Group
                  签    簽    籖    籤
        TW        簽    -     籤    -
        CN        -     签    -     籖
        CP950     簽    -     籤    -
        GB2312    -     签    ×     ×
      Table: Compatibility Group
                  签    簽    籖    籤
        TW        簽    -     簽    簽
        CN        -     签    签    签
        CP950     簽    -     簽    簽
        GB2312    -     签    签    签
  18. Project Chvar (2/2)
      https://github.com/buganini/chvar
      - Normalization: Canonical Equivalence
      - Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
      - Fuzzy character matching: Compatibility Equivalence
      Tables: Canonical Group and Compatibility Group, identical to those on slide 17
  19. Bsdconv :: Phases
      - Traditional Chinese ⇔ Simplified Chinese
      Figure: Conversion with an inter-mapping phase
        UTF-8:ZHTW:UTF-8
        (from : inter : to)
  20. Bsdconv :: Phases
      - Furthermore, phrase mapping
      Figure: Conversion with multiple inter-mapping phases
        UTF-8:ZHTW:ZHTW-WORDS:UTF-8
        (from : inter : inter : to)
  21. Unicode :: Casing
      - IS complicated
        Table: English     Lowercase   Uppercase
                           a           A
                           i           I
        Table: Turkic      Lowercase   Uppercase
                           ı           I
                           i           İ
        Table: French      Lowercase   Uppercase
                           a           A
                           à           A
        Table: Greek       Lowercase   Uppercase
                           σ           Σ
                           ς           Σ
      - Default Case Folding
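Python's string methods apply the locale-independent default case mappings, which makes the pitfalls in the tables easy to reproduce. A minimal sketch; the assertions hold on CPython 3, and none of this is bsdconv-specific.

```python
# Default mappings ignore language: a Turkish reader would expect "i" -> "İ".
assert "i".upper() == "I"
assert "\u0131".upper() == "I"            # dotless ı also maps to I

# One code point can become two.
assert "\u0130".lower() == "i\u0307"      # İ -> i + COMBINING DOT ABOVE
assert "\u00df".upper() == "SS"           # ß -> SS

# Two lowercase forms share one uppercase form.
assert "\u03c3".upper() == "\u03c2".upper() == "\u03a3"   # σ, ς -> Σ

# Case folding is meant for comparison, not for display.
assert "\u03a3".casefold() == "\u03c2".casefold() == "\u03c3"
assert "Stra\u00dfe".casefold() == "STRASSE".casefold()
```

Because upper-casing and then lower-casing does not round-trip, caseless matching is better done by folding both sides than by converting one side.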
  22. Unicode :: Normalization Forms (1/2)
      UAX#15
      - Indexing
      - Identification security: username, domain name
      Table: Canonical Equivalence
        Combining sequence             Ç                     C + ◌̧
        Ordering of combining marks    q+◌̇+◌̣                q+◌̣+◌̇
        Hangul                         가                    ᄀ + ᅡ
        Singleton                      Ω (U+2126 OHM SIGN)   Ω (U+03A9)
  23. Unicode :: Normalization Forms (2/2)
      UAX#15
      Table: Compatibility Equivalence
        Font variants              ℌ       H
        Breaking differences       NBSP    SP
        Cursive forms              ‫ﻧ‬       ‫ﻨ‬
        Circled                    ①       1
        Width, size, rotated       ｶ       カ
                                   ︷      {
        Superscripts/subscripts    ⁹       9
        Squared characters         ㍿      株 + 式 + 会 + 社
        Fractions                  ¾       3 + / + 4
        Others                     dž       d + z + ◌̌
  24. Normalization for fuzzy matching
      UTF-8:UPPER:UTF-8
        Input:  aăⅷDžбⓐᾥ
        Output: AĂⅧDŽБⒶᾭ
      UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
        Input:  ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
        Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
      Table: The four Unicode normalization forms and the transformations
                         Composition   Decomposition
        Canonical        NFC           NFD
        Compatibility    NFKC          NFKD
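The four forms in the table are available in Python's standard `unicodedata` module, so the equivalence examples from the two preceding slides can be verified directly. The helper `fuzzy_key` is a rough, hypothetical stand-in for an NFKD-plus-casefold pipeline; it is not the bsdconv NFKD-CASEFOLD codec.

```python
import unicodedata as ud

# Canonical equivalence: NFC composes, NFD decomposes.
assert ud.normalize("NFD", "\u00c7") == "C\u0327"         # Ç -> C + combining cedilla
assert ud.normalize("NFC", "C\u0327") == "\u00c7"
assert ud.normalize("NFD", "\uac00") == "\u1100\u1161"    # Hangul 가 -> two jamo
assert ud.normalize("NFC", "\u2126") == "\u03a9"          # OHM SIGN -> Greek Omega

# Compatibility equivalence: only the K forms fold these.
assert ud.normalize("NFC", "\u2460") == "\u2460"          # ① is left alone by NFC
assert ud.normalize("NFKC", "\u2460") == "1"
assert ud.normalize("NFKC", "\u210c") == "H"              # ℌ
assert ud.normalize("NFKD", "\u00be") == "3\u20444"       # ¾ -> 3 + FRACTION SLASH + 4
assert ud.normalize("NFKC", "\u337f") == "株式会社"        # ㍿


def fuzzy_key(text):
    """Comparison key: compatibility decomposition followed by case folding."""
    return ud.normalize("NFKD", text).casefold()


assert fuzzy_key("\u210c\u2079") == "h9"
```

A key of this kind suits matching and indexing; it loses information, so it should not replace the stored text.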
  25. Bsdconv :: Cascade
      - Re-encode
      Figure: Cascaded conversions
        UTF-8:ISO-8859-1|BIG5:UTF-8
        (from : to | from : to)
        Input    Output
        ¥x¥_     台北
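The cascade in the figure undoes a wrong decoding: Big5 bytes were read as ISO-8859-1, so the text is re-encoded to ISO-8859-1 to recover the bytes and then decoded as Big5. The same two steps with Python's built-in codecs, as a sketch rather than a reimplementation of bsdconv:

```python
garbled = "\u00a5x\u00a5_"                  # "¥x¥_", what the user sees

# Step 1 (UTF-8:ISO-8859-1): recover the original bytes.
raw = garbled.encode("iso-8859-1")
print(raw.hex())

# Step 2 (BIG5:UTF-8): decode those bytes with the intended charset.
repaired = raw.decode("big5")
print(repaired)
```

This works only because ISO-8859-1 maps every byte to a code point, so the first, wrong decoding lost no information.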
  26. Bsdconv :: Codec argument
      - Other than a question mark
        Figure: Codec argument
          UTF-8,ANY#0121:ASCII,ANY#21
          (from : to)
      - Or more than one character
        Figure: Data list, separated by dots
          UTF-8,ANY#013F.0121:ASCII,ANY#21
          (from : to)
  27. Bsdconv :: Alias
        from/3F                ANY#013F&ERROR
        to/3F                  ANY#3F&ERROR
        from/UTF-8             ASCII,_UTF-8
        inter/NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
        inter/NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
        inter/NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
        filter/01              UNICODE
  29. Bsdconv :: Types
        (01) Unicode
        (02) CNS11643
        (03) Byte
        (04) Chinese components
        (1B) ANSI control sequences
        (00) Bsdconv special characters
  31. Chinese components composition
      https://github.com/buganini/chicomp
      UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
        Input:  功夫不好不要艹我
        Output: 巭孬嫑莪
      UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
        Input:  功夫不好不要艹我
        Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
      UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
        Input:  功夫不好不要艹我
        Output: pu nao yao [uh]2
  32. Bsdconv :: Flags
      - FREE: memory management
      - MARK: identifier
  34. Look-through (1/4)
      Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
        Input (UTF-8 literal):   %u03B1%CE%B2
        Decoder:                 ESCAPE : ...
        Internal data:           [01 03 B1] [03 CE] [03 B2]
      Entity   Unicode   UTF-8 hex
      α        U+03B1    CEB1
      β        U+03B2    CEB2
  35. Look-through (2/4)
        Internal data:           [01 03 B1] [03 CE] [03 B2]
        Encoder:                 ... : PASS#MARK&FOR=1,BYTE
        Internal data:           [01 03 B1] MARK CE B2
  36. Look-through (3/4)
        Internal data:           [01 03 B1] MARK CE B2
        Decoder:                 PASS#UNMARK,UTF-8 : ...
        Internal data:           [01 03 B1] [01 03 B2]
  37. Look-through (4/4)
        Internal data:           [01 03 B1] [01 03 B2]
        Encoder:                 ... : UTF-8
        Internal data:           CE B1 ("α") CE B2 ("β")
        Output (UTF-8 literal):  αβ
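The four look-through steps amount to this: code points that the ESCAPE decoder has already resolved pass through untouched, while leftover raw bytes are collected and decoded as UTF-8. The sketch below reproduces that end result in plain Python; the regular expression and function name are illustrative, and it does not model bsdconv's MARK/UNMARK mechanics.

```python
import re

# %uXXXX is a code point escape, %XX is a raw byte escape, anything else is literal.
TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)", re.DOTALL)


def decode_mixed_escapes(text):
    out = []
    pending = bytearray()            # raw bytes waiting to be decoded as UTF-8

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    for match in TOKEN.finditer(text):
        code_point, byte, literal = match.groups()
        if byte is not None:
            pending.append(int(byte, 16))    # like type 03 (byte) data
        else:
            flush()
            # Already a character (type 01, Unicode): pass it through.
            out.append(chr(int(code_point, 16)) if code_point else literal)
    flush()
    return "".join(out)


print(decode_mixed_escapes("%u03B1%CE%B2"))
```

Bytes are buffered until the next non-byte token so that a multi-byte UTF-8 sequence such as CE B2 is decoded as one character.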
  38. Unicode :: East Asian Width (1/2)
      UAX#11
      Figure: Venn diagram showing the set relations for the six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
  39. Unicode :: East Asian Width (2/2)
      UAX#11
      Table: Examples for Each Character Class and Their Resolved Widths (columns: Narrow, Ambiguous, Wide)
      Table: Width attributes
        Na   Narrow
        N    Neutral, usually treated as Narrow
        W    Wide
        F    Fullwidth
        H    Halfwidth
  40. String width measurement
      echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
      FULL: 2
      HALF: 7
      AMBI: 2
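The same three counters can be approximated with the East Asian Width property exposed by Python's `unicodedata` module. This is a sketch of the classification only, not the WIDTH codec: treating Neutral as half width follows the "usually treated as Narrow" remark in the width attribute table, and the trailing newline that `echo` appends is left out of the input.

```python
import unicodedata
from collections import Counter


def measure(text):
    """Count characters by resolved East Asian Width class."""
    counts = Counter(FULL=0, HALF=0, AMBI=0)
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):          # Wide, Fullwidth
            counts["FULL"] += 1
        elif width == "A":               # Ambiguous: depends on the terminal or font
            counts["AMBI"] += 1
        else:                            # Na, H, and N (assumed narrow here)
            counts["HALF"] += 1
    return counts


result = measure("42(\u02ca_>\u02cb) 紅茶")
for name in ("FULL", "HALF", "AMBI"):
    print(f"{name}: {result[name]}")
```

A display width estimate then needs one decision from the caller: whether ambiguous characters occupy one column or two.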
  41. Chinese charset encoding detection
      https://github.com/buganini/chiconv
      ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
      Score(s) = $SCORE−$IERR∗$COUNT∗0.01 $COUNT
      帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
        ENCODING    SCORE   COUNT   IERR   Score(s)
        UTF-8       19      4       0      4.75
        BIG5        8       3       2      -4.0
        GBK         4       1       4      -36.0
        CCCII       36      9       0      4.0
        UTF-16LE    20      5       2      0.0
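The selection step can be sketched as "score every candidate, keep the maximum". The formula on the slide is flattened and ambiguous, so the function below uses an assumption instead: (SCORE − 10 × IERR) / COUNT, chosen only because it reproduces every Score(s) value in the table. The penalty weight of 10 is therefore inferred, not taken from chiconv.

```python
# (encoding, SCORE, COUNT, IERR) rows copied from the table.
ROWS = [
    ("UTF-8", 19, 4, 0),
    ("BIG5", 8, 3, 2),
    ("GBK", 4, 1, 4),
    ("CCCII", 36, 9, 0),
    ("UTF-16LE", 20, 5, 2),
]


def final_score(score, count, ierr, penalty=10):
    """Average score per counted character, minus a penalty per input error.

    The penalty weight is an assumption inferred from the tabulated results.
    """
    return (score - penalty * ierr) / count


results = {name: final_score(score, count, ierr) for name, score, count, ierr in ROWS}
for name, value in results.items():
    print(f"{name:10s} {value:7.2f}")

best = max(results, key=results.get)
print("detected:", best)
```

Dividing by the count keeps encodings that decode the input into many characters (CCCII here) from winning on raw score alone.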
  42. Khmer legacy font converter
      https://github.com/buganini/khmerconv
      - Issues
        - Encodings without a registered name, bound to fonts
        - Stored in CP1252 or UTF-8
      - Solution
        - Two-pass detection
          - Detect encoding
          - Detect font family (currently not working) (high coverage in SBCS)
        - Algorithm ported from Khmer Converter [3]
      - Khmer Converter
        - Mapping
        - Reordering: visual order vs. Unicode model
        - Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
      [3] http://www.khmeros.info/en/khmer-converter
  46. Terminal transcoding
      https://github.com/buganini/bug5
      - Issues
        - UAO: non-standard Big5 extension
        - Double color hack: ANSI control sequence in the middle of a DBCS character
        - Ambiguous width characters
        - luit/screen cannot help
      - Solution (tl;dr)
        - Big5 to Unicode:
          ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
        - Unicode to Big5:
          UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
  47. Bug5 explained (1/6)
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
        Input (Big5 literal):     ★\xC5\x1B[1m\xE5
        Decoder:                  ANSI-CONTROL,BYTE : ...
        Internal data:            [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
      Entity    Unicode   UTF-8 hex   Big5 hex
      ★         U+2605    E29885      A1B9
      驚        U+9A5A    E9A99A      C5E5
      [         U+005B    5B          5B
      1         U+0031    31          31
      m         U+006D    6D          6D
      (NBSP)    U+00A0    C2A0        -
  48. Bug5 explained (2/6)
        Internal data:            [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
        Inter-conversion:         ... : BIG5-DEFRAG : ...
        Internal data:            [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
  49. Bug5 explained (3/6)
        Internal data:            [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
        Encoder:                  ... : BYTE,PASS#MARK&FOR=1B
        Internal data:            A1 B9 C5 E5 [1B 5B 31 6D] MARK
  50. Bug5 explained (4/6)
        Internal data:            A1 B9 C5 E5 [1B 5B 31 6D] MARK
        Decoder:                  PASS#UNMARK,BIG5 : ...
        Internal data:            [01 26 05] [01 9A 5A] [1B 5B 31 6D]
  51. Bug5 explained (5/6)
        Internal data:            [01 26 05] [01 9A 5A] [1B 5B 31 6D]
        Inter-conversion:         ... : AMBIGUOUS-PAD : ...
        Internal data:            [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
  52. Bug5 explained (6/6)
        Internal data:            [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
        Encoder:                  ... : UTF-8,PASS#FOR=1B
        Output (UTF-8 literal):   ★(NBSP)驚\x1B[1m
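The defragmentation step, moving an ANSI control sequence that was wedged between the two bytes of a Big5 character, can be sketched on raw bytes. This is an illustration of the idea behind BIG5-DEFRAG under simplifying assumptions (only CSI sequences of the form ESC [ ... letter, at most one sequence per split character); the regular expression and function name are not taken from bug5.

```python
import re

# A CSI sequence such as ESC [ 1 m (assumed shape; real terminals accept more).
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def defrag(data):
    """Rejoin Big5 characters split by an escape sequence; emit the sequence after them."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0x81:                                # Big5 lead byte
            seq = CSI.match(data, i + 1)
            if seq and seq.end() < len(data):           # the character is split
                out += bytes([lead, data[seq.end()]])   # lead byte + its trail byte
                out += seq.group()                      # then the control sequence
                i = seq.end() + 1
                continue
            out += data[i:i + 2]                        # intact double-byte character
            i += 2
        else:
            out.append(lead)                            # ASCII, including escape bytes
            i += 1
    return bytes(out)


fixed = defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed.hex())
```

After this reordering the character bytes A1 B9 C5 E5 are contiguous, matching the internal data shown after the BIG5-DEFRAG step, and can be handed to an ordinary Big5 decoder.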
  53. Bsdconv :: Bindings
      - Python/Ruby/Go/Perl/PHP
        https://pypi.python.org/pypi/bsdconv
        https://rubygems.org/gems/ruby-bsdconv
        https://github.com/buganini/go-bsdconv
        https://github.com/buganini/perl-bsdconv
        https://github.com/buganini/php-bsdconv
      - PostgreSQL/MySQL
        https://github.com/buganini/postgres-bsdconv
        https://github.com/buganini/mysql-udf-bsdconv
      - Irssi
        https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
  54. Bsdconv :: GUI
      https://github.com/buganini/gbsdconv
      - Alternative to ConvertZ
      - Text
      - File name
      - File content
      - Meta tag
  55. Thanks
      ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
      https://github.com/buganini/bsdconv