Unicode CJK Compatible Variations

Published in: Technology

  1. Unicode CJK Compatible Variations: taken from the Unihan Database
  2. I once had a task to convert Chinese text
     • Unicode text extracted from a PDF
     • Traditional Chinese converted to Simplified Chinese
     • Saved as a GBK text file for full-text search
     • But some common characters caused errors
     • Once converted to GBK, they come out as a ? character
     $ echo 亮 | iconv -f utf8 -t gbk
     iconv: illegal input sequence at position 0
     $ echo 亮 | perl -MEncode -ne 'print encode q(utf8), decode q(gb2312), encode q(gb2312), decode q(utf8), $_'
     ?
     (The 亮 here is the compatibility ideograph U+F977, not the ordinary U+4EAE; the two render identically.)
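The failure above can be reproduced without iconv. The following is a minimal sketch in Python, assuming its built-in gbk codec rejects U+F977 the same way iconv does; try_gbk is a hypothetical helper name, not something from the slides.

```python
# Compare the compatibility ideograph U+F977 with its unified twin U+4EAE.
compat = "\uf977"   # CJK COMPATIBILITY IDEOGRAPH-F977
unified = "\u4eae"  # the ordinary ideograph that renders identically


def try_gbk(ch: str) -> str:
    """Return the GBK bytes of ch as hex, or a note if GBK cannot encode it."""
    try:
        return ch.encode("gbk").hex()
    except UnicodeEncodeError:
        return "not encodable"


for ch in (compat, unified):
    print(f"U+{ord(ch):04X} -> {try_gbk(ch)}")

# errors="replace" mirrors the "?" that the Perl pipeline produced.
print(compat.encode("gbk", errors="replace"))
```

Escaping the two characters as \uf977 and \u4eae keeps them distinguishable in source code, which the literal glyphs are not.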
  3. There is more than one way to enlighten the world
     http://www.isthisthingon.org/unicode/index.phtml?page=0F&subpage=9
  4. So I declared that they are Korean characters :)
     • That would mean the conversion is impossible
     • But has anyone written about this behavior?
     • I came to learn what CJK Compatibility Ideographs are
     • I wanted a mapping to Simplified Chinese code points
     • One that is ultimately compatible with GBK
     • One site I discovered while writing these slides
  5. Is there any power that can make this happen?
     • A PDF viewer named evince handles it magically
     • I guessed that meant Perl could do it too
     • With help from CPAN modules
     • Unicode::Unihan can be a solution
     • fayland introduced it to me three years ago
     • Can it now do more than extract Pinyin?
  6. The table I want is:
     $ perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(亮\0亮)' | od -x
     0000000 f977 0000 4eae
     0xf977 => 0x4eae
     $ perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(諒\0諒)' | od -x
     0000000 f97d 0000 8ad2
     0xf97d => 0x8ad2
     (In each pair, the first character is the compatibility ideograph and the second is the unified one; they render identically.)
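The two pairs above can be cross-checked with Unicode normalization. This is a sketch in Python rather than the Perl of the slides, and it swaps in a different technique: canonical (NFC) normalization from the standard unicodedata module instead of the Unihan variant data. The helper name unified_form is hypothetical.

```python
import unicodedata


def unified_form(ch: str) -> str:
    """Map a compatibility ideograph to its canonical equivalent via NFC."""
    return unicodedata.normalize("NFC", ch)


# Print each pair in the same "0x.... => 0x...." layout as the slide.
for cp in (0xF977, 0xF97D):
    target = unified_form(chr(cp))
    print(f"0x{cp:04x} => 0x{ord(target):04x}")
```

For these two code points the result agrees with the table on the slide; whether normalization and the variant data agree everywhere else is not something this sketch establishes.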
  7. X, Y and Z variants: I picked the last
     • Tag: kZVariant
     • Status: Provisional
     • Category: Variants
     • Separator: space
     • Syntax: U+2?[0-9A-F]{4}(:k[A-Za-z]+)?
     • Description: The Unicode value(s) for known z-variants of this character.
     • The x-axis represents meaning
     • The y-axis represents abstract shape
     • The z-axis is used for stylistic variations
     http://search.cpan.org/~dankogai/Unicode-Unihan-0.03/Unihan.pm
     But the latest version is no longer 0.03. Thanks, Dan! =>
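The Syntax and Separator lines above are enough to parse a field value. Here is a minimal sketch in Python, assuming the syntax exactly as quoted on the slide; parse_zvariant is a hypothetical helper, and the second sample value (with its kExample tag) is made up purely to exercise the optional suffix.

```python
import re

# The kZVariant syntax from the slide, with "+" escaped for Python's re module.
ENTRY = re.compile(r"U\+(2?[0-9A-F]{4})(?::(k[A-Za-z]+))?")


def parse_zvariant(value):
    """Split a space-separated field value into (code point, source tag) pairs."""
    entries = []
    for token in value.split(" "):
        match = ENTRY.fullmatch(token)
        if match is None:
            raise ValueError(f"not a valid kZVariant entry: {token!r}")
        # group(2) is None when the optional ":kSource" suffix is absent.
        entries.append((int(match.group(1), 16), match.group(2)))
    return entries


print(parse_zvariant("U+4EAE"))
print(parse_zvariant("U+4EAE U+20000:kExample"))  # hypothetical value
```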
  8. But how do I get them in bulk?
     • The quick and dirty way
     • So I forgot the details as soon as it was done
     • What I remember is iterating from F900 to FA00
     • With help from Unicode::String
     use Unicode::Unihan;
     use Encode;
     $uh = Unicode::Unihan->new();
     print $uh->ZVariant(decode(q(utf8), q(亮)));  # input is U+F977
     U+4EAE
     print $uh->ZVariant(decode(q(utf8), q(亮)));  # input is U+4EAE
     U+F977
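Bulk lookup is also possible straight from the Unihan text files, which store one tab-separated record per line (code point, field name, value). The sketch below is Python and works on an in-memory sample; the two sample records simply echo the lookups on the slide and are not copied from a real Unihan release, and load_field is a hypothetical helper.

```python
import io

# Illustrative sample in Unihan's tab-separated layout (assumed content).
SAMPLE = (
    "# comment line\n"
    "U+F977\tkZVariant\tU+4EAE\n"
    "U+4EAE\tkZVariant\tU+F977\n"
)


def load_field(lines, field):
    """Collect {code point: [variant code points]} for one Unihan field."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cp, name, value = line.split("\t", 2)
        if name != field:
            continue
        # Drop any ":kSource" suffix and the "U+" prefix before parsing hex.
        variants = [int(v.split(":")[0][2:], 16) for v in value.split(" ")]
        table[int(cp[2:], 16)] = variants
    return table


zvariants = load_field(io.StringIO(SAMPLE), "kZVariant")
for cp, variants in sorted(zvariants.items()):
    print(f"U+{cp:04X} ->", " ".join(f"U+{v:04X}" for v in variants))
```

Passing an open file handle instead of the StringIO object would apply the same loop to a downloaded data file.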
  9. The loop starts here
     • Unicode::String's special constructor
     • and its special behavior
     $ perl -MUnicode::String=uchr -e 'print uchr($_)->utf8 for 0xf900 .. 0xf9ff'
     • Combining the two powers
     $ perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print uhex $uh->ZVariant(decode(q(utf8), uchr($_)->utf8)) for 0xf900 .. 0xf9ff'
  10. Do I really need to install those modules on the production machine?
     $ perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print qq("), $x = uchr($_)->utf8, qq(" => "), uhex($uh->ZVariant(decode(q(utf8), $x))), qq("\n) for 0xf900 .. 0xfaff' | less
     "豈" => "豈"
     "更" => "更"
     "車" => "車"
     "賈" => "賈"
     ...
     (The left-hand characters are the compatibility code points and the right-hand ones their variants; both sides render identically.)
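The output above reads like a static lookup table that could be shipped without the CPAN modules. A comparable table can be sketched in Python with only the standard library. Note the swapped-in technique: this uses canonical (NFC) normalization rather than kZVariant data, so the two tables are not guaranteed to match entry for entry. The names build_table, TABLE and to_gbk are hypothetical.

```python
import unicodedata


def build_table(first=0xF900, last=0xFAFF):
    """Return {compat code point: unified code point} for an inclusive range."""
    table = {}
    for cp in range(first, last + 1):
        target = unicodedata.normalize("NFC", chr(cp))
        # Code points that normalize to themselves (unassigned ones, or
        # ideographs in this block without a decomposition) are skipped.
        if target != chr(cp):
            table[cp] = ord(target)
    return table


TABLE = build_table()


def to_gbk(text: str) -> bytes:
    """Replace compatibility ideographs, then encode the result as GBK."""
    return text.translate(TABLE).encode("gbk")


print(len(TABLE), "mappings")
print(to_gbk("\uf977\uf97d").hex())
```

Building the table once and applying it with str.translate keeps the per-document work to a single pass. Some targets may still fall outside GBK, in which case to_gbk raises UnicodeEncodeError.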
  11. Thanks!
     The rest is left unsaid ...
