zhang_xzhi
   2011-11-20
Popular belief:
plain text = ASCII = 8 bits per character ???
   American Standard Code for Information
    Interchange
   7 bits, 128 characters.
   Examples
      Value       Meaning
       7          BEL
       48         0
       63         ?
       65         A
       97         a
   0x00 to 0x1F    control characters     32
   0x20 to 0x7E    printable characters   95
   0x7F            delete (DEL)
@Test
public void testASCII() throws Exception {
      String oldString = "abcd";
      // Convert the string to bytes, then back to a string.
      True(testString2Byte2String(oldString, Encoding.US_ASCII));
      oldString = "abcd信之";
      False(testString2Byte2String(oldString, Encoding.US_ASCII));
}
String -> Bytes -> String
encoding = US-ASCII
bytes[] =
61626364 // hexadecimal
97,98,99,100, // unsigned integers
oldString = abcd
newString = abcd
equals = true


String -> Bytes -> String
encoding = US-ASCII
bytes[] =
616263643F3F
97,98,99,100,63,63,
oldString = abcd信之
newString = abcd??
equals = false


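ASCII cannot represent Chinese characters, so each one is encoded as 63 ('?'). When text shows up as ???, the cause is very often an encoding problem.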
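The test above calls testString2Byte2String without showing it. Below is a minimal sketch of what such a round-trip check might look like. The class and method names are hypothetical, not the author's helper. It relies on String.getBytes(Charset), which substitutes the charset's replacement byte ('?' for US-ASCII) for any character it cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the slides' testString2Byte2String helper.
class RoundTrip {
    // Encodes s with cs, decodes the bytes again with cs, and reports
    // whether the decoded string equals the original.
    static boolean roundTrips(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);          // unmappable characters become the replacement byte
        String decoded = new String(bytes, cs);
        return decoded.equals(s);
    }

    public static void main(String[] args) {
        String mixed = "abcd\u4FE1\u4E4B";      // "abcd" followed by U+4FE1 and U+4E4B
        System.out.println(roundTrips("abcd", StandardCharsets.US_ASCII));
        System.out.println(roundTrips(mixed, StandardCharsets.US_ASCII));
        System.out.println(roundTrips(mixed, StandardCharsets.UTF_16BE));
    }
}
```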
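1 byte = 8 bits

ASCII leaves 1 bit of each byte unused.

Hence all kinds of code pages: single-byte code pages, double-byte code pages.

Chaos!
Not a standardized extension of ASCII

Not one single character set

Usually compatible with ASCII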
   Also known as Latin-1
   8-bit single-byte coded graphic character sets
   Primary target: Western European languages
   Compatible with ASCII

   ISO/IEC 8859 is a joint ISO and IEC series of
    standards for 8-bit character encodings.
   ISO 8859-1 is the most widely used part of the series.
String -> Bytes -> String
encoding = ISO-8859-1
bytes[] =
61626364
97,98,99,100,
oldString = abcd
newString = abcd
equals = true

String -> Bytes -> String
encoding = ISO-8859-1
bytes[] =
616263643F3F
97,98,99,100,63,63,
oldString = abcd信之
newString = abcd??
equals = false
String -> Bytes -> String
encoding = ISO-8859-1
bytes[] =
B1
177,
oldString = ±
newString = ±
equals = true
   Uni: one, single
   Unicode: the ultimate single encoding scheme
   16 bits per character, 2^16 = 65536 characters in total ???
Popular belief
A  0100 0001

In Unicode
A     U+0041

Code point: an abstract number that says nothing about storage.
   Unicode defines a code space of 1,114,112 code
    points in the range
    0x000000 to 0x10FFFF.
   It is divided into 17 Unicode planes.
   Plane 0: Basic Multilingual Plane (BMP)
    0000-FFFF
   Planes 1-16: Supplementary Planes
    10000-10FFFF
   Not every bit pattern is a valid code point, and some planes
    (planes 3-13 as of 2011) have no code points assigned yet.
   Programs are often poorly tested with Supplementary Plane
    characters, so avoid them where possible. In practice they
    rarely come up anyway. ^_^ (ASCII art)
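To make the plane layout concrete, here is a small sketch that derives the plane number from a code point. The class and method names are illustrative only. Character.isValidCodePoint and Character.isSupplementaryCodePoint are standard JDK methods.

```java
class PlaneDemo {
    // Each plane holds 0x10000 code points, so the plane number is
    // simply the code point shifted right by 16 bits.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside 0x000000..0x10FFFF: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(plane(0x0041));   // 'A', in the BMP
        System.out.println(plane(0x4FE1));   // a CJK ideograph, also in the BMP
        System.out.println(plane(0x1F600));  // a supplementary-plane code point
        System.out.println(Character.isSupplementaryCodePoint(0x1F600));
    }
}
```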
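16-bit Unicode Transformation Format

U+0000 - U+D7FF        U+E000 - U+FFFF
Encoded as a single 16-bit code unit.

U+10000 - U+10FFFF
Encoded as two 16-bit code units, a surrogate pair.
lead surrogates          0xD800 - 0xDBFF
trail surrogates         0xDC00 - 0xDFFF

U+D800 - U+DFFF
These code points are reserved to support UTF-16.

Self-synchronizing: decoding does not have to start at the beginning of the file.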
   Original code point: range U+10000 - U+10FFFF
   Subtract 0x10000: range 0x00000 - 0xFFFFF (20 bits)
   High 10 bits: range 0 - 0x3FF
    lead surrogate = 0xD800 + high 10 bits
    range 0xD800 - 0xDBFF
   Low 10 bits: range 0 - 0x3FF
    trail surrogate = 0xDC00 + low 10 bits
    range 0xDC00 - 0xDFFF
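The surrogate arithmetic above can be sketched directly in code. The class and method names are made up for illustration. The JDK's own Character.toChars performs the same conversion and can be used to cross-check the result.

```java
class SurrogateDemo {
    // Splits a supplementary code point into a lead and trail surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // 20-bit value, 0x00000..0xFFFFF
        char lead = (char) (0xD800 + (offset >>> 10));    // high 10 bits
        char trail = (char) (0xDC00 + (offset & 0x3FF));  // low 10 bits
        return new char[] { lead, trail };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK implementation.
        char[] jdk = Character.toChars(0x1F600);
        System.out.println(pair[0] == jdk[0] && pair[1] == jdk[1]);
    }
}
```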
Character set vs. encoding

The string "abcd"
00 61 00 62 00 63 00 64    (big-endian)
61 00 62 00 63 00 64 00    (little-endian)

Unicode Byte Order Mark
FE FF    (big-endian)
FF FE    (little-endian)
String -> Bytes -> String
encoding = UTF-16BE
bytes[] =
0061006200630064
0,97,0,98,0,99,0,100,
oldString = abcd
newString = abcd
equals = true

String -> Bytes -> String
encoding = UTF-16LE
bytes[] =
6100620063006400
97,0,98,0,99,0,100,0,
oldString = abcd
newString = abcd
equals = true
String -> Bytes -> String
encoding = UTF-16BE
bytes[] =
00610062006300644FE14E4B
0,97,0,98,0,99,0,100,79,225,78,75,
oldString = abcd信之
newString = abcd信之
equals = true
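The byte orders and the byte order mark can be reproduced with the JDK's standard charsets. This is a small sketch with illustrative class and method names. Note that Java's plain UTF-16 charset writes a big-endian BOM when encoding, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16ByteOrder {
    // Formats bytes as uppercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String encodeHex(String s, Charset cs) {
        return hex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("abcd", StandardCharsets.UTF_16BE)); // no BOM, high byte first
        System.out.println(encodeHex("abcd", StandardCharsets.UTF_16LE)); // no BOM, low byte first
        System.out.println(encodeHex("abcd", StandardCharsets.UTF_16));   // starts with the BOM FE FF
    }
}
```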



Wasted space
Compatibility with legacy documents
Americans
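   Self-synchronizing.
   Variable length: 1 to 4 bytes can represent every valid code point.
   Every valid ASCII byte sequence is also valid UTF-8.
   The first byte of each sequence indicates the sequence's length.
   The first byte of a sequence (never starts with bits 10) is clearly
    distinguishable from the remaining bytes (always start with bits 10).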
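A short sketch of how a decoder might read the length from the first byte, under the bit patterns just described. The names are illustrative, and the check is deliberately simplified: it looks only at the leading bits and does not reject lead bytes that the UTF-8 specification forbids, such as 0xC0 or 0xF5.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Returns the sequence length announced by a lead byte, 0 for a
    // continuation byte (10xxxxxx), or -1 if the bits match no pattern.
    static int sequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx: plain ASCII
        if ((b & 0xC0) == 0x80) return 0;   // 10xxxxxx: continuation byte
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        // 'a' followed by U+4FE1, which needs three bytes in UTF-8.
        byte[] bytes = "a\u4FE1".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02X -> %d%n", b & 0xFF, sequenceLength(b));
        }
    }
}
```

Because continuation bytes are recognizable on their own, a reader dropped into the middle of a stream can skip bytes that return 0 until it reaches the next lead byte.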
   Fixed length: every code point is encoded as 32 bits.
   Supports random access to the character at a given index in a string.
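To illustrate the random-access point, the sketch below expands a Java string into one int per code point, which is effectively UTF-32 held in memory, and contrasts it with indexing the UTF-16 string directly. The class and method names are illustrative.

```java
class CodePointIndex {
    // One int per code point: a fixed-width, UTF-32-like representation.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";       // 'a', U+1F600, 'b'
        int[] cps = toCodePoints(s);

        // Fixed width: the third character is simply element 2.
        System.out.println(Integer.toHexString(cps[2]));

        // UTF-16: index 2 lands on the trail surrogate of U+1F600.
        System.out.println(Integer.toHexString(s.charAt(2)));

        // Finding the third code point in UTF-16 requires walking the string.
        int index = s.offsetByCodePoints(0, 2);
        System.out.println(Integer.toHexString(s.codePointAt(index)));
    }
}
```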
   UTF-8
   UTF-16
   UTF-32
   UTF-1
   UTF-5
   UTF-6
   UTF-7
   UTF-9
   UTF-18
   Guojia Biaozhun Kuozhan (GBK, "National Standard Extended")
   Each character is encoded as 1 byte or 2 bytes; compatible with ASCII.
   count = 22046
   gbkLength = 44092
   utf_8_length = 65981
   utf_16_length = 44092
   utf_32_length = 88184
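The text behind these numbers is not included in the slides. As a rough sketch of how such a comparison could be made, the code below measures a short two-character sample instead. The class and method names are illustrative. GBK and UTF-32 are not guaranteed to exist on every Java runtime, so they are probed with Charset.isSupported first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Number of bytes the string occupies in the given encoding.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String text = "\u4FE1\u4E4B";   // sample only: U+4FE1 and U+4E4B
        System.out.println("count = " + text.codePointCount(0, text.length()));
        System.out.println("utf_8_length = " + encodedLength(text, StandardCharsets.UTF_8));
        // UTF-16BE is used so that no byte order mark is counted.
        System.out.println("utf_16_length = " + encodedLength(text, StandardCharsets.UTF_16BE));
        for (String name : new String[] { "GBK", "UTF-32BE" }) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " length = " + encodedLength(text, Charset.forName(name)));
            }
        }
    }
}
```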
   Many early Internet protocols support only 7-bit encodings (ASCII);
    8-bit encodings are not supported.
   Lossless transmission of binary data.
   8-bit clean
   Its characters exist in most encodings
   Printable
   Padding '='
    If you see a trailing =, think
    Base64 first.
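The padding behaviour is easy to see with the JDK's java.util.Base64 class. This is a minimal sketch and the wrapper class name is illustrative. Base64 turns every 3 input bytes into 4 output characters, so inputs whose length is not a multiple of 3 are padded with '='.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Padding {
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.US_ASCII));
    }

    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(encode("a"));    // 1 byte:  two padding characters
        System.out.println(encode("ab"));   // 2 bytes: one padding character
        System.out.println(encode("abc"));  // 3 bytes: no padding
        System.out.println(decode(encode("abcd")));
    }
}
```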
   Similar to Base64.
   Each byte is split into two halves of 4 bits each.
   Each half is mapped by value to 0-F.
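The nibble mapping described above might be written as follows. This is a sketch with illustrative names, not code from the slides.

```java
class HexEncoding {
    private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    // Maps each byte to two hex digits: high 4 bits first, then low 4 bits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0x0F]);
            sb.append(DIGITS[b & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[] { 0x61, 0x62, 0x63, 0x64 }));
        System.out.println(toHex(new byte[] { (byte) 0xB1 }));
    }
}
```

The output is always exactly twice as long as the input, compared with Base64's 4 characters for every 3 bytes.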
   Reserved characters
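   <head> <meta charset="gb2312"> …
   The browser first parses the document as ASCII until it finds the charset,
    then re-parses the document using that charset. Most encodings are
    ASCII-compatible.
   If no charset is specified, the browser can infer it from word-frequency
    statistics for the different encodings.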
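The two-pass idea can be sketched as below. This is a deliberately simplified illustration with hypothetical names, assuming the declaration appears within the first 1024 bytes. Real browsers follow a more elaborate detection procedure.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)", Pattern.CASE_INSENSITIVE);

    // Pass 1: read the start of the page as ASCII and look for charset=...
    static Charset sniff(byte[] html, Charset fallback) {
        int limit = Math.min(html.length, 1024);   // assumed scan window
        String head = new String(html, 0, limit, StandardCharsets.US_ASCII);
        Matcher m = CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    // Pass 2: decode the whole page with the charset that was found.
    static String decode(byte[] html, Charset fallback) {
        return new String(html, sniff(html, fallback));
    }

    public static void main(String[] args) {
        byte[] page = "<head><meta charset=\"utf-8\"><title>\u4FE1\u4E4B</title></head>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
        System.out.println(decode(page, StandardCharsets.ISO_8859_1));
    }
}
```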
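   Wherever strings and bytes are converted into each other, specify the
    charset explicitly to improve portability.
   public String(byte bytes[])
   public byte[] getBytes()
   When no charset is given, these use the platform default charset, which is
    a portability trap.
   Charset.defaultCharset().name();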
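A minimal sketch of the recommended style, using the overloads that take a Charset. The wrapper class and method names are illustrative, and UTF-8 is chosen here only as an example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Always name the charset instead of relying on the platform default.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The default differs between machines and JDK versions.
        System.out.println("default = " + Charset.defaultCharset().name());

        String original = "abcd\u4FE1\u4E4B";
        System.out.println(fromBytes(toBytes(original)).equals(original));
    }
}
```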
   Character, char: 16 bits
   UTF-16
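Because a Java char is a 16-bit UTF-16 code unit, String.length() counts code units rather than characters. The sketch below, with illustrative names, shows the difference for a supplementary-plane code point.

```java
class CharIsUtf16 {
    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4FE1";                  // U+4FE1, fits in one char
        String supplementary = "\uD83D\uDE00";  // U+1F600, needs a surrogate pair
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary));
    }
}
```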