BSDCONV
Buganini Q
Since 2009
Charset & Encoding
Character Set
Collection of characters
Encoding
Binary representation
Unicode (32-bit address space)
Unicode up to U+10FFFF
Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
GB18030, CNS11643, CP950, Latin1
Figure: Character Sets (diagram of nested and overlapping sets; the three Unicode regions are listed from outermost to innermost)
Character set                     Encoding
Unicode (32-bit address space)    UTF-32 / UCS4
Unicode up to U+10FFFF            UTF-8 [1] / UTF-16
Unicode BMP (up to U+FFFF)        UCS2
GB18030                           GB18030
CNS11643                          CNS11643
CP950                             CP950 (DBCS)
Latin1                            ISO-8859-1 / EBCDIC-037 [2]
[1] Could cover more, but restricted by RFC 3629
[2] Also known as IBM-37; some control characters differ from ISO-8859-1
Encoding :: UTF-32 / UCS4
Fixed Length
4 bytes
Filesize *= 4 for ASCII text file
Incompatible with C-style string convention
Endianness concern
Encoding :: UCS2
Fixed Length
2 bytes
Filesize *= 2 for ASCII text file
Incompatible with C-style string convention
Endianness concern
BMP-only
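The size and C-string points above can be checked with a few lines of Python. This is a minimal sketch using only the standard codecs; `sizes` is a hypothetical helper name, and UTF-16LE stands in for UCS2 because the two coincide for BMP-only text.

```python
def sizes(s):
    """Return the encoded length in bytes for UTF-8, UTF-16LE and UTF-32LE."""
    return (len(s.encode("utf-8")),
            len(s.encode("utf-16-le")),   # same bytes as UCS2 for BMP-only text
            len(s.encode("utf-32-le")))

text = "hello"
ucs2 = text.encode("utf-16-le")

print(sizes(text))            # (5, 10, 20): x2 and x4 for ASCII text
# NUL bytes appear inside the data, so C functions such as strlen() stop early.
print(b"\x00" in ucs2)        # True
# Endianness concern: the same character, two byte orders.
print("A".encode("utf-16-le").hex(), "A".encode("utf-16-be").hex())  # 4100 0041
```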
Encoding :: UTF-16
Variable Length
2 bytes / 4 bytes (Surrogate pairs)
Surrogates
Using U+D800..U+DFFF
Incompatible with C-style string convention
Endianness concern
******** ********
110110** ******** 110111** ********
Table: UTF-16 Structure
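The surrogate layout in the table can be sketched as plain bit arithmetic. The function names `to_surrogates` and `from_surrogates` are hypothetical, chosen for this illustration only.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                  # 20 payload bits
    high = 0xD800 | (v >> 10)         # 110110** ********
    low = 0xDC00 | (v & 0x3FF)        # 110111** ********
    return high, low

def from_surrogates(high, low):
    """Inverse of to_surrogates."""
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")        # D83D DE00
```

Code points in the BMP take the single 16-bit form and never reach this function, which is why U+D800..U+DFFF has to stay unassigned.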
Encoding :: UTF-8
Variable Length
1~6 bytes in the original design; 1~4 bytes under RFC 3629
Compatible with C-style string convention
Self-synchronizing
Endian-neutral
Sorting order = Code point order
0******* (ASCII)
110***** 10******
1110**** 10****** 10******
11110*** 10****** 10****** 10******
111110** 10****** 10****** 10****** 10******
1111110* 10****** 10****** 10****** 10****** 10******
Table: UTF-8 Structure
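The table translates directly into an encoder. The following is a sketch for illustration, not bsdconv's implementation: `utf8_encode` and its `allow_long` flag are hypothetical names, and the function does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode(cp, allow_long=False):
    """Encode one code point following the bit layout in the table.

    allow_long=True permits the 5- and 6-byte forms of the original design;
    by default the RFC 3629 limit of U+10FFFF applies.
    """
    limit = 0x7FFFFFFF if allow_long else 0x10FFFF
    if not 0 <= cp <= limit:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])                        # 0******* (ASCII)
    # (largest code point, total bytes, lead-byte prefix)
    for max_cp, n, prefix in ((0x7FF, 2, 0xC0), (0xFFFF, 3, 0xE0),
                              (0x1FFFFF, 4, 0xF0), (0x3FFFFFF, 5, 0xF8),
                              (0x7FFFFFFF, 6, 0xFC)):
        if cp <= max_cp:
            break
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))           # 10******
        cp >>= 6
    return bytes([prefix | cp] + tail[::-1])

print(utf8_encode(0x03B1).hex())                  # ceb1
```

Because every continuation byte starts with 10 and every lead byte does not, a decoder dropped into the middle of a stream can find the next character boundary, which is the self-synchronizing property listed above.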
Encoding :: CNS11643 (全字庫) #issue
http://www.cns11643.gov.tw/
Only used by Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation ㄇㄥˊ / méng
Radical 艸
Component 艹日月
Stroke
Tra/Sim mapping 萌蕄
Table: Examples for some information provided by 全字庫 for「萌」
Encoding :: CCCII
Variants
Variant glyph at different plane
Mostly used for library indexing
強 21 3D 48
彊 2D 3D 48
强 33 3D 48
Encoding :: Big5
Many incompatible variations (abusing the PUA); no standard tool handles them all
http://moztw.org/docs/big5/
Scenario      Dominating encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999, 2001, 2004)
Special characters conflict
The second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
Bsdconv :: Decoding and Encoding
Alternative to iconv
ISO-8859-1:UTF-8
from: ISO-8859-1
to: UTF-8
Figure: Basic two-phase conversion
Bsdconv :: Codecs & Fallback
Optionally produce question mark (U+003F) as replacement
UTF-8,3F:ASCII,3F
from: UTF-8,3F
to: ASCII,3F
Figure: Fallback codec
Transliteration
UTF-8:CP936,CP936-TRANS,3F
from: UTF-8
to: CP936,CP936-TRANS,3F
Figure: Multiple fallback codecs
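The idea of a comma-separated fallback list can be imitated with Python's standard codecs. This is only an analogy under stated assumptions: it does not call bsdconv, `encode_with_fallbacks` is a hypothetical helper, and the codec names are Python's, not bsdconv's.

```python
def encode_with_fallbacks(text, codecs_in_order, last_resort=b"?"):
    """Encode character by character, trying each codec in the given order.

    The first codec that can represent a character wins; if none can,
    last_resort is emitted (0x3F plays the role of the 3F codec).
    """
    out = bytearray()
    for ch in text:
        for name in codecs_in_order:
            try:
                out += ch.encode(name)
                break
            except UnicodeEncodeError:
                continue
        else:
            out += last_resort
    return bytes(out)

print(encode_with_fallbacks("naïve", ["ascii"]))              # b'na?ve'
print(encode_with_fallbacks("naïve", ["ascii", "latin-1"]))   # b'na\xefve'
```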
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
#              Input              Output
Big5 Literal   "成功"             "成功\"
ASCII/Hex      "\xA6\xA8\xA5\"    "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
#              Input              Output
Big5 Literal   "成功\"            "成功"
ASCII/Hex      "\xA6\xA8\xA5\\"   "\xA6\xA8\xA5\"
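The 0x5C collision is easy to observe with Python's built-in `big5` codec. The sketch below assumes that codec's mapping table; `trail_5c_chars` is a hypothetical helper written for this illustration.

```python
def trail_5c_chars(text, codec="big5"):
    """Return the characters whose second Big5 byte is 0x5C (ASCII backslash)."""
    hits = []
    for ch in text:
        raw = ch.encode(codec)
        if len(raw) == 2 and raw[1] == 0x5C:
            hits.append(ch)
    return hits

print(trail_5c_chars("許功蓋"))       # all three characters end in 0x5C
print("成功".encode("big5"))          # b'\xa6\xa8\xa5\\'
```

Any program that treats the byte stream as ASCII, such as a C compiler reading a string literal or a shell, sees the trailing 0x5C of 功 as an escape character.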
Traditional/Simplified Chinese
NOT one-to-one mapping
Traditional 乾幹干
vs.
Simplified 干干干
Context dependent
之後、夜之后、入夜之後
Variants
峰、峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Compatibility group: { Canonical group: 签 簽 } { Canonical group: 籖 籤 }
Figure: Two level grouping in Chvar
         签   簽   籖   籤
TW       簽   -    籤   -
CN       -    签   -    籖
CP950    簽   -    籤   -
GB2312   -    签   ×    ×
Table: Canonical Group
         签   簽   籖   籤
TW       簽   -    簽   簽
CN       -    签   签   签
CP950    簽   -    簽   簽
GB2312   -    签   签   签
Table: Compatibility Group
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization
Canonical Equivalence
Transliteration
Converted
or Canonical Equivalence
or Compatibility Equivalence
Fuzzy character matching
Compatibility Equivalence
Bsdconv :: Phases
Traditional Chinese ⇔ Simplified Chinese
UTF-8:ZHTW:UTF-8
from: UTF-8
inter: ZHTW
to: UTF-8
Figure: Conversion with inter-mapping phase
Bsdconv :: Phases
Furthermore, phrase mapping
UTF-8:ZHTW:ZHTW-WORDS:UTF-8
from: UTF-8
inter: ZHTW
inter: ZHTW-WORDS
to: UTF-8
Figure: Conversion with multiple inter-mapping phases
Unicode :: Casing
IS complicated
Lowercase Uppercase
a A
i I
Table: English
Lowercase Uppercase
ı I
i İ
Table: Turkic
Lowercase Uppercase
a A
à A
Table: French
Lowercase Uppercase
σ Σ
ς Σ
Table: Greek
Default Case Folding
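Default case folding is what Python's `str.casefold()` implements, so the Greek rows above can be checked directly. This is a sketch with Python's locale-independent string methods, not bsdconv's CASEFOLD codec; `same_caseless` is a hypothetical helper.

```python
def same_caseless(a, b):
    """Compare two strings under Unicode default case folding."""
    return a.casefold() == b.casefold()

# Greek: both lowercase sigmas share one uppercase and one folded form.
print("σ".upper(), "ς".upper())               # Σ Σ
print(same_caseless("σ", "ς"))                # True

# Folding may change length: one character folds to two.
print(same_caseless("Straße", "STRASSE"))     # True

# Turkic: the default (locale-independent) mapping pairs I with i, not with ı.
print("I".lower() == "ı")                     # False
```

The Turkic row is the reason a locale-aware tailoring exists on top of the default folding: no single mapping of I is right for both English and Turkish text.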
Unicode :: Normalization Forms (1/2)
UAX#15
Indexing
Identification security
Username, Domain name
Combining sequence Ç C + ◌̧
Ordering of combining marks q+◌̇+◌̣ q+◌̣+◌̇
Hangul 가 ᄀ + ᅡ
Singleton Ω (U+2126) Ω (U+03A9)
Table: Canonical Equivalence
Unicode :: Normalization Forms (2/2)
UAX#15
Font variants ℌ H
Breaking differences NBSP SP
Cursive forms ‫ﻧ‬ ‫ﻨ‬
Circled ① 1
Width, size, rotated
ｶ カ
︷ {
Superscripts/subscripts ⁹ 9
Squared characters ㍿ 株 + 式 + 会 + 社
Fractions ¾ 3 + / + 4
Others dž d + z + ◌̌
Table: Compatibility Equivalence
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбⓐᾥ
Output: AĂⅧDŽБⒶᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
Composition Decomposition
Canonical NFC NFD
Compatibility NFKC NFKD
Table: The four Unicode normalization forms and the transformations
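The four forms are available in Python's standard `unicodedata` module, which makes the equivalence tables easy to verify. The sketch below uses that module rather than bsdconv; `nf` and `fuzzy_key` are hypothetical helper names, and `fuzzy_key` only approximates the NFKD and case-folding part of the fuzzy pipeline (no ZH-FUZZY-TW or KANA-PHONETIC step).

```python
import unicodedata

def nf(form, s):
    """Shorthand for unicodedata.normalize."""
    return unicodedata.normalize(form, s)

def fuzzy_key(s):
    """Compatibility-decompose, case-fold, then decompose again."""
    return nf("NFKD", nf("NFKD", s).casefold())

# Canonical equivalence
print(nf("NFC", "C\u0327") == "\u00C7")             # True: C + cedilla composes
print(nf("NFD", "\uAC00") == "\u1100\u1161")        # True: Hangul syllable decomposes
print(nf("NFD", "q\u0307\u0323") == "q\u0323\u0307")  # True: marks are reordered

# Compatibility equivalence
print(nf("NFKC", "\u2460"))                         # 1
print(nf("NFKC", "\u337F"))                         # 株式会社
print(fuzzy_key("\u210C\u24D0\u2079"))              # ha9
```

NFC and NFD leave compatibility characters such as the circled digit untouched; only the K forms fold them.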
Bsdconv :: Cascade
Re-encode
UTF-8:ISO-8859-1|BIG5:UTF-8
from: UTF-8
to: ISO-8859-1
from: BIG5
to: UTF-8
Figure: Cascaded conversions
Input   Output
¥x¥_    台北
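The cascade above undoes a wrong decoding step: the text is turned back into the bytes it came from, then decoded with the right codec. A rough Python equivalent, assuming Python's `iso-8859-1` and `big5` codecs behave like the bsdconv ones for this input (`reinterpret` is a hypothetical helper):

```python
def reinterpret(text, wrong="iso-8859-1", right="big5"):
    """Undo mojibake: re-encode with the wrongly assumed codec, decode with the right one."""
    return text.encode(wrong).decode(right)

print(reinterpret("\u00a5x\u00a5_"))   # 台北
```

The bytes behind the garbled text are A5 78 A5 5F, which is why each Big5 character showed up as a yen sign followed by an ASCII character.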
Bsdconv :: Codec argument
Other than question mark
UTF-8,ANY#0121:ASCII,ANY#21
from: UTF-8,ANY#0121
to: ASCII,ANY#21
Figure: Codec argument
Or more than one character
UTF-8,ANY#013F.0121:ASCII,ANY#21
from: UTF-8,ANY#013F.0121
to: ASCII,ANY#21
Figure: Data list, separated by dot
Bsdconv :: Alias
Alias                 Expansion
from/3F               ANY#013F&ERROR
to/3F                 ANY#3F&ERROR
from/UTF-8            ASCII,_UTF-8
inter/NFKD            _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter/NFKC            NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter/NFKD-CASEFOLD   NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter/01             UNICODE
Bsdconv :: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
Bsdconv :: Flags
FREE - memory management
MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): %u03B1%CE%B2
Decoder: ESCAPE
Internal data: [01 03 B1] [03 CE] [03 B2]
Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data: [01 03 B1] [03 CE] [03 B2]
Encoder: PASS#MARK&FOR=1,BYTE
Internal data: [01 03 B1 MARK] CE B2
Look-through (3/4)
Internal data: [01 03 B1 MARK] CE B2
Decoder: PASS#UNMARK,UTF-8
Internal data: [01 03 B1] [01 03 B2]
Look-through (4/4)
Internal data: [01 03 B1] [01 03 B2]
Encoder: UTF-8
Internal data: CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
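The net effect of the four steps is that `%uXXXX` escapes become code points right away, while `%XX` escapes are collected as raw bytes and decoded as UTF-8 afterwards. The sketch below reproduces that result in plain Python; it does not model bsdconv's internal typed data or the MARK flag, and `decode_mixed_escape` is a hypothetical name.

```python
import re

# %uXXXX (code point), %XX (raw byte), or any other single character
_TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)", re.DOTALL)

def decode_mixed_escape(s):
    """Decode %uXXXX as code points and runs of %XX as UTF-8 bytes."""
    out = []
    pending = bytearray()             # raw bytes waiting for the UTF-8 decoder

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    for m in _TOKEN.finditer(s):
        uni, byte, lit = m.groups()
        if byte is not None:
            pending.append(int(byte, 16))
        else:
            flush()                   # a non-byte token ends the current byte run
            out.append(chr(int(uni, 16)) if uni is not None else lit)
    flush()
    return "".join(out)

print(decode_mixed_escape("%u03B1%CE%B2"))   # αβ
```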
Unicode :: East Asian Width (1/2)
UAX#11
Narrow (contains Halfwidth)
Wide (contains Fullwidth)
Ambiguous
Neutral
Figure: Venn Diagram Showing the Set Relations for Six Categories
Unicode :: East Asian Width (2/2)
UAX#11
Class   Example   Resolved width
A       Я         Ambiguous
N       ऊ         Narrow
Na      A         Narrow
H       ｶ         Narrow
F       Ａ        Wide
W       カ        Wide
W       咦        Wide
Table: Examples for Each Character Class and Their Resolved Widths
Na Narrow
N Neutral, usually treated as Narrow
W Wide
F Fullwidth
H Halfwidth
Table: Width attributes
String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL: 2
HALF: 7
AMBI: 2
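The same tally can be approximated with the East Asian Width property exposed by Python's `unicodedata`. This is a sketch of the idea, not bsdconv's WIDTH codec: `width_classes` is a hypothetical helper, the grouping of classes into FULL, HALF and AMBI is an assumption based on the tables above, and the trailing newline that `echo` adds is left out.

```python
import unicodedata

def width_classes(s):
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in s:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:                          # Na, H and N are treated as narrow
            counts["HALF"] += 1
    return counts

print(width_classes("42(\u02CA_>\u02CB) 紅茶"))
# {'FULL': 2, 'HALF': 7, 'AMBI': 2}
```

A terminal would then compute the display width as 2 x FULL + HALF, plus 1 or 2 per ambiguous character depending on its configuration.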
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 4.75
BIG5 8 3 2 -4.0
GBK 4 1 4 -36.0
CCCII 36 9 0 4.0
UTF-16LE 20 5 2 0.0
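As an arithmetic check on the table, the sketch below recomputes the Score(s) column from SCORE, COUNT and IERR. Rows with IERR = 0 follow from SCORE / COUNT directly. Rows with IERR > 0 are reproduced when each input error costs 10 points before dividing by COUNT; that weight is inferred from the table rows only and is an assumption, not taken from chiconv, and the 0.01 factor in the formula as printed does not reproduce those rows.

```python
# ENCODING: (SCORE, COUNT, IERR, Score(s)), copied from the table
ROWS = {
    "UTF-8":    (19, 4, 0, 4.75),
    "BIG5":     (8, 3, 2, -4.0),
    "GBK":      (4, 1, 4, -36.0),
    "CCCII":    (36, 9, 0, 4.0),
    "UTF-16LE": (20, 5, 2, 0.0),
}

def per_char_score(score, count, ierr, penalty=10):
    """Average score per character after charging `penalty` per input error.

    penalty=10 is an inferred value that matches the table, not a documented one.
    """
    return (score - ierr * penalty) / count

for name, (score, count, ierr, expected) in ROWS.items():
    print(name, per_char_score(score, count, ierr), expected)

best = max(ROWS, key=lambda n: per_char_score(*ROWS[n][:3]))
print(best)   # UTF-8
```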
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encoding without registered name, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two pass detection
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode Model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: Non-standard Big5 extension
Double color hack
ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Decoder: ANSI-CONTROL,BYTE
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Inter-conversion: BIG5-DEFRAG
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Bug5 explained (3/6)
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Encoder: BYTE,PASS#MARK&FOR=1B
Internal data: A1 B9 C5 E5 [1B 5B 31 6D MARK]
Bug5 explained (4/6)
Internal data: A1 B9 C5 E5 [1B 5B 31 6D MARK]
Decoder: PASS#UNMARK,BIG5
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (5/6)
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Inter-conversion: AMBIGUOUS-PAD
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (6/6)
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Encoder: UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
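The defragmentation step is the part that makes the double color hack decodable: a control sequence sitting between the two bytes of a Big5 character is moved behind the completed character. Below is a minimal sketch of that reordering on raw bytes. It is not bsdconv's BIG5-DEFRAG implementation; `big5_defrag` and the `_CSI` pattern are hypothetical, only CSI sequences (ESC [ ... final byte) are recognized, and the AMBIGUOUS-PAD step is omitted.

```python
import re

# CSI sequence: ESC [, parameter bytes, intermediate bytes, one final byte
_CSI = re.compile(rb"\x1b\[[0-9;?]*[ -/]*[@-~]")

def big5_defrag(raw):
    """Move a CSI sequence that splits a Big5 character to after that character."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if 0x81 <= b <= 0xFE:                     # Big5 lead byte
            m = _CSI.match(raw, i + 1)
            if m and m.end() < len(raw):          # sequence sits inside the character
                out += bytes([b, raw[m.end()]]) + m.group()
                i = m.end() + 1
            else:                                 # ordinary two-byte character
                out += raw[i:i + 2]
                i += 2
        else:
            m = _CSI.match(raw, i)
            if m:                                 # control sequence between characters
                out += m.group()
                i = m.end()
            else:                                 # plain single byte
                out.append(b)
                i += 1
    return bytes(out)

fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed)                  # b'\xa1\xb9\xc5\xe5\x1b[1m'
print(repr(fixed.decode("big5")))
```

After the reordering the byte stream is valid Big5 again, so an ordinary decoder (here Python's `big5` codec) can take over.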
Bsdconv :: Bindings
Python/Ruby/Go/Perl/PHP
https://pypi.python.org/pypi/bsdconv
https://rubygems.org/gems/ruby-bsdconv
https://github.com/buganini/go-bsdconv
https://github.com/buganini/perl-bsdconv
https://github.com/buganini/php-bsdconv
PostgreSQL/MySQL
https://github.com/buganini/postgres-bsdconv
https://github.com/buganini/mysql-udf-bsdconv
Irssi
https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv :: GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv

Presentation of the OECD Artificial Intelligence Review of Germany
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 

Journey of Bsdconv

  • 2. Charset & Encoding. Character set: a collection of characters. Encoding: a binary representation of those characters.
  • 3. Charset & Encoding. Figure: Character Sets. Unicode (32-bit address space); Unicode up to U+10FFFF; Unicode BMP (Basic Multilingual Plane, up to U+FFFF); GB18030; CNS11643; CP950; Latin1.
  • 4. Charset & Encoding. Character sets: Unicode (32-bit address space); Unicode up to U+10FFFF; Unicode BMP (up to U+FFFF); GB18030; CNS11643; CP950; Latin1. Encodings: UTF-32 / UCS4; UTF-8 [1] / UTF-16; UCS2; GB18030; CNS11643; CP950 (DBCS); ISO-8859-1 / EBCDIC-037 [2]. [1] Could cover more, but restricted by RFC 3629. [2] Also known as IBM-37; some control characters differ from ISO-8859-1.
  • 5. Encoding :: UTF-32 / UCS4. Fixed length: 4 bytes. File size ×4 for an ASCII text file. Incompatible with the C-style string convention. Endianness concern.
  • 6. Encoding :: UCS2. Fixed length: 2 bytes. File size ×2 for an ASCII text file. Incompatible with the C-style string convention. Endianness concern. BMP-only.
  • 7. Encoding :: UTF-16. Variable length: 2 bytes, or 4 bytes (surrogate pairs). Surrogates use U+D800..U+DFFF. Incompatible with the C-style string convention. Endianness concern. Table: UTF-16 Structure. 2 bytes: ******** ********; 4 bytes: 110110** ******** 110111** ********.
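The UTF-16 structure table can be made concrete with a small Python sketch. The function name `to_surrogates` is hypothetical and is not part of Bsdconv; it only illustrates how a code point above the BMP is split into the `110110**` and `110111**` halves.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    v = cp - 0x10000              # 20 payload bits remain
    high = 0xD800 | (v >> 10)     # top 10 bits    -> 110110** ********
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits -> 110111** ********
    return high, low


def utf16_be(cp):
    """Encode one code point as big-endian UTF-16 bytes."""
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    high, low = to_surrogates(cp)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

For example, `to_surrogates(0x1F600)` yields `(0xD83D, 0xDE00)`, which matches what Python's own `utf-16-be` codec produces for that character.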
  • 8. Encoding :: UTF-8. Variable length: 1 to 6 bytes in the original design (RFC 3629 restricts it to 4). Compatible with the C-style string convention. Self-synchronizing. Endian-neutral. Sorting order = code point order. Table: UTF-8 Structure. 1 byte: 0******* (ASCII); 2 bytes: 110***** 10******; 3 bytes: 1110**** 10****** 10******; 4 bytes: 11110*** 10****** 10****** 10******; 5 bytes: 111110** 10****** 10****** 10****** 10******; 6 bytes: 1111110* 10****** 10****** 10****** 10****** 10******.
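The UTF-8 structure table translates almost directly into code. The following Python sketch (the name `utf8_encode` is made up for illustration) follows the original 1 to 6 byte layout from the table, so it also accepts values beyond U+10FFFF that RFC 3629 no longer allows.

```python
# (exclusive upper bound, lead-byte marker, number of continuation bytes)
_FORMS = (
    (0x800, 0xC0, 1),        # 110***** 10******
    (0x10000, 0xE0, 2),      # 1110**** 10****** 10******
    (0x200000, 0xF0, 3),     # 11110*** + 3 continuation bytes
    (0x4000000, 0xF8, 4),    # 111110** + 4 continuation bytes
    (0x80000000, 0xFC, 5),   # 1111110* + 5 continuation bytes
)


def utf8_encode(cp):
    """Encode one code point using the original (pre-RFC 3629) UTF-8 layout."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])                      # 0******* (ASCII)
    for limit, lead, n in _FORMS:
        if cp < limit:
            out = [lead | (cp >> (6 * n))]      # payload bits left for the lead byte
            # Remaining bits go into continuation bytes, 6 bits each, high to low.
            out += [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes(out)
    raise ValueError("value does not fit in 31 bits")
```

With this sketch, U+03B1 (α) becomes `CE B1` and U+9A5A (驚) becomes `E9 A9 9A`, the same hex values that appear in the entity tables later in the deck.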
  • 9. Encoding :: CNS11643 (全字庫) #issue http://www.cns11643.gov.tw/ Only used by the Taiwan government. NOT a subset of Unicode. Not just a charset/encoding. Table: Examples of information provided by 全字庫 for「萌」. Font; Pronunciation: ㄇㄥˊ / méng; Radical: 艸; Component: 艹日月; Stroke; Tra/Sim mapping: 萌蕄.
  • 10. Encoding :: CCCII. Variants: variant glyphs at different planes. Mostly used for library indexing. 強 21 3D 48; 彊 2D 3D 48; 强 33 3D 48.
  • 11. Encoding :: Big5. Many incompatible variations (abusing the PUA); no standard tool handles them all. http://moztw.org/docs/big5/ Table (scenario: dominating encoding). Microsoft: CP950. Taiwan BBS: UAO (Unicode-at-Once). gov.tw: Big5-2003. gov.hk: HKSCS (1999, 2001, 2004). Special characters conflict: the second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
  • 12. Bsdconv :: Decoding and Encoding. Alternative to iconv. Figure: Basic two-phase conversion. ISO-8859-1:UTF-8 (from:to).
  • 13. Bsdconv :: Codecs & Fallback. Optionally produce a question mark (U+003F) as replacement. Figure: Fallback codec. UTF-8,3F:ASCII,3F (from:to). Transliteration. Figure: Multiple fallback codecs. UTF-8:CP936,CP936-TRANS,3F (from:to).
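The fallback idea can be imitated in plain Python. This is only a rough analogue of a codec list such as `ASCII,3F`; it does not use Bsdconv, and `encode_with_fallbacks` is a hypothetical helper. Each character is offered to the codecs in order, and a question mark is emitted when none of them accepts it.

```python
def encode_with_fallbacks(text, codecs, last_resort=b"?"):
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for codec in codecs:
            try:
                out += ch.encode(codec)
                break                   # first codec that accepts the character wins
            except UnicodeEncodeError:
                continue
        else:
            out += last_resort          # no codec matched: emit 0x3F
    return bytes(out)
```

With a single codec this behaves like Python's built-in `errors="replace"` handler: `"naïve 台北".encode("ascii", errors="replace")` gives `b"na?ve ??"`. Unlike a transliteration codec, this sketch mixes the output of different encodings, so it shows the fallback order only, not a usable converter.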
  • 14. Encoding :: Big5. Many incompatible variations (abusing the PUA); no standard tool handles them all. http://moztw.org/docs/big5/ Table (scenario: dominating encoding). Microsoft: CP950. Taiwan BBS: UAO (Unicode-at-Once). gov.tw: Big5-2003. gov.hk: HKSCS (1999, 2001, 2004). Special characters conflict: the second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
  • 15. Big5 5C issue (許功蓋). BIG5:BIG5-5C,BIG5. Big5 literal: input "成功", output "成功\". ASCII/hex: input "\xA6\xA8\xA5\", output "\xA6\xA8\xA5\\". BIG5-5C,BIG5:BIG5. Big5 literal: input "成功\", output "成功". ASCII/hex: input "\xA6\xA8\xA5\\", output "\xA6\xA8\xA5\".
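The 5C issue is easy to reproduce with Python's built-in `big5` codec. The sketch below is not Bsdconv code and `trail_5c_offsets` is a hypothetical helper; it assumes lead bytes in the range 0x81..0xFE and reports where a backslash byte is really the second half of a double-byte character.

```python
def trail_5c_offsets(data):
    """Return offsets of 0x5C bytes that are trail bytes of Big5 characters."""
    offsets = []
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            if data[i + 1] == 0x5C:     # looks like "\" but belongs to the character
                offsets.append(i + 1)
            i += 2                      # skip the whole double-byte character
        else:
            i += 1                      # single-byte (ASCII) character
    return offsets


raw = "成功".encode("big5")             # A6 A8 A5 5C
```

Here `raw` ends in 0x5C, so a tool that treats the text as ASCII sees a trailing backslash, which is why an escaping pass doubles it. All three characters of 許功蓋 have 0x5C as their second byte.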
  • 16. Traditional/Simplified Chinese. NOT a one-to-one mapping: Traditional 乾幹干 vs. Simplified 干干干. Context dependent: 之後、夜之后、入夜之後. Variants: 峰、峯.
  • 17. Project Chvar (1/2) https://github.com/buganini/chvar Figure: Two-level grouping in Chvar. 签簽 and 籖籤 are each a canonical group; both belong to one compatibility group. Table: Canonical Group (columns: 签, 簽, 籖, 籤). TW: 簽, -, 籤, -. CN: -, 签, -, 籖. CP950: 簽, -, 籤, -. GB2312: -, 签, ×, ×. Table: Compatibility Group (columns: 签, 簽, 籖, 籤). TW: 簽, -, 簽, 簽. CN: -, 签, 签, 签. CP950: 簽, -, 簽, 簽. GB2312: -, 签, 签, 签.
  • 18. Project Chvar (2/2) https://github.com/buganini/chvar Normalization: Canonical Equivalence. Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence. Fuzzy character matching: Compatibility Equivalence. Table: Canonical Group (columns: 签, 簽, 籖, 籤). TW: 簽, -, 籤, -. CN: -, 签, -, 籖. CP950: 簽, -, 籤, -. GB2312: -, 签, ×, ×. Table: Compatibility Group (columns: 签, 簽, 籖, 籤). TW: 簽, -, 簽, 簽. CN: -, 签, 签, 签. CP950: 簽, -, 簽, 簽. GB2312: -, 签, 签, 签.
  • 19. Bsdconv :: Phases. Traditional Chinese ⇔ Simplified Chinese. Figure: Conversion with inter-mapping phase. UTF-8:ZHTW:UTF-8 (from:inter:to).
  • 20. Bsdconv :: Phases. Furthermore, phrase mapping. Figure: Conversion with multiple inter-mapping phases. UTF-8:ZHTW:ZHTW-WORDS:UTF-8 (from:inter:inter:to).
  • 21. Unicode :: Casing IS complicated. Table: English (lowercase → uppercase): a → A, i → I. Table: Turkic: ı → I, i → İ. Table: French: a → A, à → A. Table: Greek: σ → Σ, ς → Σ. Default Case Folding.
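Some of the casing tables can be checked with Python's built-in string methods, which implement the default (locale-independent) Unicode mappings. The Turkic rule is not applied by default, so the sketch adds a hypothetical helper, `upper_turkic`, as one possible tailoring; it is an illustration, not how Bsdconv handles casing.

```python
def upper_turkic(text):
    """Uppercase with the Turkic i rule: i -> İ (U+0130), ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


# Greek: both the regular and the final sigma share a single uppercase form.
greek_upper = ("\u03c3".upper(), "\u03c2".upper())

# Default case folding is meant for matching, not display: ß expands to "ss".
folded = "Straße".casefold()
```

The default mapping turns `"i".upper()` into `"I"` everywhere, which is wrong for Turkic text and is the reason a separate tailoring is needed.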
  • 22. Unicode :: Normalization Forms (1/2) UAX#15. Indexing. Identification security: username, domain name. Table: Canonical Equivalence. Combining sequence: Ç ↔ C + ◌̧. Ordering of combining marks: q+◌̇+◌̣ ↔ q+◌̣+◌̇. Hangul: 가 ↔ ᄀ + ᅡ. Singleton: Ω (ohm sign) ↔ Ω (Greek capital omega).
  • 23. Unicode :: Normalization Forms (2/2) UAX#15. Table: Compatibility Equivalence. Font variants: ℌ → H. Breaking differences: NBSP → SP. Cursive forms: ﻧ, ﻨ. Circled: ① → 1. Width, size, rotated: ｶ → カ, ︷ → {. Superscripts/subscripts: ⁹ → 9. Squared characters: ㍿ → 株 + 式 + 会 + 社. Fractions: ¾ → 3 + ⁄ + 4. Others: dž → d + z + ◌̌.
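Several rows of the two equivalence tables can be reproduced with Python's standard `unicodedata` module. This is a sketch using the standard library rather than Bsdconv; `all_forms` is a hypothetical convenience function, and code points are written as escapes so that look-alike characters stay distinguishable.

```python
import unicodedata


def all_forms(text):
    """Return the four normalization forms of a string, keyed by form name."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


# Canonical equivalence
cedilla = unicodedata.normalize("NFD", "\u00c7")           # Ç -> C + combining cedilla
reordered = unicodedata.normalize("NFD", "q\u0307\u0323")  # dot below sorts before dot above
hangul = unicodedata.normalize("NFC", "\u1100\u1161")      # jamo pair -> precomposed syllable
ohm = unicodedata.normalize("NFC", "\u2126")               # ohm sign -> Greek omega

# Compatibility equivalence
circled = unicodedata.normalize("NFKC", "\u2460")          # circled one -> "1"
squared = unicodedata.normalize("NFKC", "\u337f")          # squared form -> four ideographs
fraction = unicodedata.normalize("NFKD", "\u00be")         # 3, fraction slash U+2044, 4
```

The canonical forms (NFC, NFD) leave the compatibility characters untouched, which is why fuzzy matching usually needs NFKC or NFKD.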
  • 24. Normalization for fuzzy matching. UTF-8:UPPER:UTF-8 Input: aăⅷDžбⓐᾥ Output: AĂⅧDŽБⒶᾭ. UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8 Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss. Table: The four Unicode normalization forms and the transformations. Canonical: NFC (composition), NFD (decomposition). Compatibility: NFKC (composition), NFKD (decomposition).
  • 25. Bsdconv :: Cascade. Re-encode. Figure: Cascaded conversions. UTF-8:ISO-8859-1|BIG5:UTF-8 (from:to | from:to). Input: ¥x¥_ Output: 台北.
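The cascaded conversion repairs text that was decoded with the wrong charset. The same round trip can be sketched in Python with the built-in `latin-1` and `big5` codecs; `reencode` is a hypothetical helper, not a Bsdconv call.

```python
def reencode(text, wrong="latin-1", right="big5"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)


garbled = "\u00a5x\u00a5_"       # the mojibake from the slide: yen, x, yen, underscore
repaired = reencode(garbled)     # bytes A5 78 A5 5F read as Big5
```

This works because ISO-8859-1 maps every byte to exactly one character, so encoding the garbled string back to ISO-8859-1 recovers the original Big5 bytes without loss.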
  • 26. Bsdconv :: Codec argument. Other than a question mark. Figure: Codec argument. UTF-8,ANY#0121:ASCII,ANY#21 (from:to). Or more than one character. Figure: Data list, separated by dot. UTF-8,ANY#013F.0121:ASCII,ANY#21 (from:to).
  • 28. Charset & Encoding. Figure: Character Sets. Unicode (32-bit address space); Unicode up to U+10FFFF; Unicode BMP (Basic Multilingual Plane, up to U+FFFF); GB18030; CNS11643; CP950; Latin1.
  • 29. Bsdconv :: Types. (01) Unicode; (02) CNS11643; (03) Byte; (04) Chinese components; (1B) ANSI control sequences; (00) Bsdconv special characters.
  • 30. Encoding :: CNS11643 (全字庫) #issue http://www.cns11643.gov.tw/ Only used by the Taiwan government. NOT a subset of Unicode. Not just a charset/encoding. Table: Examples of information provided by 全字庫 for「萌」. Font; Pronunciation: ㄇㄥˊ / méng; Radical: 艸; Component: 艹日月; Stroke; Tra/Sim mapping: 萌蕄.
  • 31. Chinese components composition https://github.com/buganini/chicomp UTF-8:ZH-DECOMP:ZH-COMP:UTF-8 Input: 功夫不好不要艹我 Output: 巭孬嫑莪. UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8 Input: 功夫不好不要艹我 Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ. UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8 Input: 功夫不好不要艹我 Output: pu nao yao [uh]2.
  • 32. Bsdconv :: Flags. FREE: memory management. MARK: identifier.
  • 33. Bsdconv :: Cascade. Re-encode. Figure: Cascaded conversions. UTF-8:ISO-8859-1|BIG5:UTF-8 (from:to | from:to). Input: ¥x¥_ Output: 台北.
  • 34. Look-through (1/4). Input (UTF-8 literal): %u03B1%CE%B2. Decoder ESCAPE produces internal data: [01 03 B1] [03 CE] [03 B2]. Entity table (entity, Unicode, UTF-8 hex): α, U+03B1, CEB1; β, U+03B2, CEB2. Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 35. Look-through (2/4). Internal data: [01 03 B1] [03 CE] [03 B2]. Encoder PASS#MARK&FOR=1,BYTE produces internal data: 01 03 B1 (MARK), CE, B2. Entity table (entity, Unicode, UTF-8 hex): α, U+03B1, CEB1; β, U+03B2, CEB2. Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 36. Look-through (3/4). Internal data: 01 03 B1 (MARK), CE, B2. Decoder PASS#UNMARK,UTF-8 produces internal data: [01 03 B1] [01 03 B2]. Entity table (entity, Unicode, UTF-8 hex): α, U+03B1, CEB1; β, U+03B2, CEB2. Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 37. Look-through (4/4). Internal data: [01 03 B1] [01 03 B2]. Encoder UTF-8 produces: CE B1 ("α"), CE B2 ("β"). Output (UTF-8 literal): αβ. Entity table (entity, Unicode, UTF-8 hex): α, U+03B1, CEB1; β, U+03B2, CEB2. Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
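The net effect of the four look-through steps can be approximated in a few lines of Python. This sketch does not model Bsdconv's typed internal data or the MARK flag; it only reproduces the outcome, namely that `%uXXXX` escapes are taken as code points while runs of `%XX` escapes are taken as raw bytes and decoded as UTF-8. The name `unescape_mixed` is hypothetical.

```python
import re

# Either one %uXXXX escape (a code point) or a run of %XX escapes (raw bytes).
_TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")


def unescape_mixed(text):
    """Decode a string that mixes %uXXXX and UTF-8 percent escapes."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))              # already a code point
        raw = bytes.fromhex(match.group(2).replace("%", ""))
        return raw.decode("utf-8")                           # bytes still need decoding
    return _TOKEN.sub(replace, text)
```

Applied to the slide's input `%u03B1%CE%B2`, the function returns αβ. Invalid UTF-8 in a byte run raises `UnicodeDecodeError` in this sketch.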
  • 38. Unicode :: East Asian Width (1/2) UAX#11. Figure: Venn diagram showing the set relations for six categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral.
  • 39. Unicode :: East Asian Width (2/2) UAX#11. Table: Examples for Each Character Class and Their Resolved Widths (columns: Narrow, Ambiguous, Wide): Я N ऊ Na A A F H カ カ W 咦 W. Table: Width attributes. Na: Narrow. N: Neutral, usually treated as Narrow. W: Wide. F: Fullwidth. H: Halfwidth.
  • 40. String width measurement. echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL Output: FULL: 2, HALF: 7, AMBI: 2.
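A similar count can be sketched with `unicodedata.east_asian_width` from the Python standard library. This is not how the WIDTH codec is implemented; `width_counts` is a hypothetical helper that groups W and F as full, A as ambiguous, and everything else as half. It assumes the two accent-like characters in the slide's sample are U+02CA and U+02CB.

```python
import unicodedata


def width_counts(text):
    """Count characters by East Asian Width class, grouped as on the slide."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):
            counts["FULL"] += 1      # Wide and Fullwidth take two columns
        elif width == "A":
            counts["AMBI"] += 1      # width depends on the terminal or font
        else:
            counts["HALF"] += 1      # Na, H, N: one column
    return counts


sample = "42(\u02ca_>\u02cb) \u7d05\u8336"
```

Combining marks and control characters are counted as half width here, so the sketch is only suitable for simple strings like the sample.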
  • 41. Chinese charset encoding detection https://github.com/buganini/chiconv ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT. Example: 帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:…… Table (ENCODING: SCORE, COUNT, IERR, Score(s)). UTF-8: 19, 4, 0, 4.75. BIG5: 8, 3, 2, -4.0. GBK: 4, 1, 4, -36.0. CCCII: 36, 9, 0, 4.0. UTF-16LE: 20, 5, 2, 0.0.
  • 42. Khmer legacy font converter https://github.com/buganini/khmerconv Issues: encoding without a registered name, bound to fonts; stored in CP1252 or UTF-8. Solution: two-pass detection. Detect encoding. Detect font family (currently not working) (high coverage in SBCS). Algorithm ported from Khmer Converter [3]. Khmer Converter: mapping; reordering (visual order vs. Unicode model). Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]. [3] http://www.khmeros.info/en/khmer-converter
  • 43. Encoding :: Big5. Many incompatible variations (abusing the PUA); no standard tool handles them all. http://moztw.org/docs/big5/ Table (scenario: dominating encoding). Microsoft: CP950. Taiwan BBS: UAO (Unicode-at-Once). gov.tw: Big5-2003. gov.hk: HKSCS (1999, 2001, 2004). Special characters conflict: the second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
  • 44. Unicode :: East Asian Width (1/2) UAX#11. Figure: Venn diagram showing the set relations for six categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral.
  • 45. Unicode :: East Asian Width (2/2) UAX#11. Table: Examples for Each Character Class and Their Resolved Widths (columns: Narrow, Ambiguous, Wide): Я N ऊ Na A A F H カ カ W 咦 W. Table: Width attributes. Na: Narrow. N: Neutral, usually treated as Narrow. W: Wide. F: Fullwidth. H: Halfwidth.
  • 46. Terminal transcoding https://github.com/buganini/bug5 Issues: UAO, a non-standard Big5 extension; double color hack (ANSI control sequence in the middle of a DBCS character); ambiguous width characters; luit/screen cannot help. Solution (tl;dr). Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B. Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F.
  • 47. Bug5 explained (1/6). Input (Big5 literal): ★\xC5\x1B[1m\xE5. Decoder ANSI-CONTROL,BYTE produces internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]. Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 48. Bug5 explained (2/6). Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]. Inter-conversion BIG5-DEFRAG produces internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]. Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 49. Bug5 explained (3/6). Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]. Encoder BYTE,PASS#MARK&FOR=1B produces internal data: A1, B9, C5, E5, 1B 5B 31 6D (MARK). Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 50. Bug5 explained (4/6). Internal data: A1, B9, C5, E5, 1B 5B 31 6D (MARK). Decoder PASS#UNMARK,BIG5 produces internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]. Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 51. Bug5 explained (5/6). Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]. Inter-conversion AMBIGUOUS-PAD produces internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]. Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 52. Bug5 explained (6/6). Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]. Encoder UTF-8,PASS#FOR=1B produces output (UTF-8 literal): ★(NBSP)驚\x1B[1m. Entity table (entity, Unicode, UTF-8 hex, Big5 hex): ★, U+2605, E29885, A1B9; 驚, U+9A5A, E9A99A, C5E5; [, U+005B, 5B, 5B; 1, U+0031, 31, 31; m, U+006D, 6D, 6D; (NBSP), U+00A0, C2A0, -. Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 54. Bsdconv :: GUI https://github.com/buganini/gbsdconv Alternative to ConvertZ. Text; file name; file content; meta tag.