Ruling the Root: CJK Rules for the Root Zone


  1. CJK Rules For The Root Zone. Kenny Huang, Ph.D. (黃勝雄博士). Member, CDNC / CGP; Co-author, RFC 3743 (IETF); Member, Executive Council, APNIC; Member, Board of Directors, TWNIC. huangksh@gmail.com. June 2014.
  2. Problem: CJK Is Complicated. Putting CJK labels in the root zone is even more complicated.
  3. Institutionalized Problem Solving: Structure
  4. Constraints for CJK LGR. Independent tasks: each CJK panel creates an LGR; each LGR includes a repertoire and variants; define label permissions; define variant labels; assign dispositions (Allocatable, Block). Coordination tasks: if an LGR includes Han characters, the variant *mappings* must agree for all the panels, the variant *types* may be different, and the repertoires may be different. (Presented by Lee Han Chuan & IP, Shanghai, 29 May 2014.)
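The coordination constraint on slide 4 can be sketched as a consistency check between two panels' variant tables. This is only an illustration: the dictionary layout, the function name `mapping_conflicts`, the toy panels, and the variant type strings are assumptions, not part of any LGR tooling.

```python
# Hypothetical layout: code point -> {variant code point: variant type}.
def mapping_conflicts(lgr_a, lgr_b):
    """Return shared code points whose variant mappings differ between two LGRs.

    Only the set of variant targets is compared. Variant types may differ,
    and code points present in just one repertoire are ignored, since the
    slide allows both of those differences.
    """
    conflicts = {}
    for cp in lgr_a.keys() & lgr_b.keys():
        targets_a, targets_b = set(lgr_a[cp]), set(lgr_b[cp])
        if targets_a != targets_b:
            # Record the mappings known to only one of the two panels.
            conflicts[cp] = targets_a ^ targets_b
    return conflicts

# Toy panels (invented for illustration).
panel_a = {0x4E00: {0x58F9: "allocatable"}, 0x4EBA: {}}
panel_b = {0x4E00: {0x58F9: "blocked"}}   # same mapping, different type
panel_c = {0x4E00: {}}                     # mapping missing

agree = mapping_conflicts(panel_a, panel_b)
disagree = mapping_conflicts(panel_a, panel_c)
```

Here `panel_a` and `panel_b` agree even though their variant types differ, while `panel_c` is flagged because it lacks the mapping from U+4E00 to U+58F9.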
  5. Overlap Case Illustration. (1) Chinese LGR: 一 (U+4E00) with variants 壹 (U+58F9), 弌 (U+5F0C), and 壱 (U+58F1); dispositions shown: allocate, block, block. (2) Japanese LGR: 一 (U+4E00) with no variants listed. (3) Integrate? The decision has T and F branches, labeled "Integrated Root Zone Label Generation Rules" and "Rejected" (Generation Panel).
  6. High Level Conflict Strategies. Columns: ID, Strategy, Pros, Cons, Rank. Legend (CJK overlap): C: rule Rc; J: rule Rj; K: rule Rk.
     ID 1. Strategy: adopt X, abandon Rcjk. Pros: permit X. Cons: no label rule.
     ID 2. Strategy: adopt X, intersection ∩(Rcjk). Pros: permit X; permit ∩(variants/disp). Cons: rules changed.
     ID 3. Strategy: adopt X, union ∪(Rcjk). Pros: permit X; permit ∪(variants/disp). Cons: rules changed.
     ID 4. Strategy: abandon X and Rcjk. Pros: no conflict. Cons: label not available.
     ID 5. Strategy: adopt rules based on frequency of use. Pros: fair and scientific approach. Cons: rules changed; fair does not necessarily mean appropriate.
  7. Unified CJK LGR Illustration. Chinese LGR: 一 (U+4E00) with variants 壹 (U+58F9), 弌 (U+5F0C), and 壱 (U+58F1); dispositions shown: allocate, block, block. Japanese LGR: 一 (U+4E00) with no variants listed. Integrated LGR (Union): 一 (U+4E00) with variants 壹 (U+58F9), 弌 (U+5F0C), and 壱 (U+58F1); dispositions shown: allocate, block, block. Integrated LGR (Intersection): 一 (U+4E00) with no variants listed.
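The union and intersection outcomes on slide 7 can be reproduced with plain set operations. A minimal sketch, assuming each LGR is reduced to a mapping from code point to its set of variant code points (dispositions are left out, and the function name `integrate` is hypothetical):

```python
# Variant sets for U+4E00 as read from the slide; the Japanese LGR lists none.
chinese_lgr = {0x4E00: {0x58F9, 0x5F0C, 0x58F1}}
japanese_lgr = {0x4E00: set()}

def integrate(lgr_a, lgr_b, mode):
    """Combine the variant sets of code points shared by two LGRs."""
    shared = lgr_a.keys() & lgr_b.keys()
    if mode == "union":
        return {cp: lgr_a[cp] | lgr_b[cp] for cp in shared}
    if mode == "intersection":
        return {cp: lgr_a[cp] & lgr_b[cp] for cp in shared}
    raise ValueError(f"unknown mode: {mode!r}")

union_lgr = integrate(chinese_lgr, japanese_lgr, "union")
intersection_lgr = integrate(chinese_lgr, japanese_lgr, "intersection")
```

The union keeps all three Chinese variants of U+4E00, while the intersection leaves U+4E00 with no variants, matching the two integrated tables on the slide.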
  8. CJK Integration Methodology: Divide & Conquer (D&C). Diagram labels: Unified CJK Rules; Variant Dispositions; Minimal Viable Solution; CJK Rules; Root Zone Admin; Strategic Direction; Plan and Define; CJK Overlap; Resources; JK Overlap; CJ Usage Pattern; CJ Overlap; CK Usage Pattern; CK Overlap; Services; LGR Constraints; Evaluation Method; Diversified CJK Demands; C Demands; J Demands; Requires; Split; Merge.
  9. Splitting Non-overlapping Code Points From Repertoires. Step 1. Source: CJK Han overlap in the IANA IDN Repository. Repertoires (Rc, Rj, Rk for the Chinese, Japanese, and Korean LGRs): C-Han: 19520 (CNNIC/TWNIC); J-Han: 6356 (JPRS); K-Han: 0 (KRNIC). C/J overlap: 6181. No conflict: 13339 (C-Han outside the overlap) + 175 (J-Han outside the overlap) = 13514 unified code points. Problem domain (unsolved overlap): 6181, for which a conflict strategy must be developed.
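The split on slide 9 is a set difference on each side of the overlap. The sketch below assumes repertoires are sets of code points; `split_repertoires` and the toy repertoires are made up for illustration, while the counts are the ones quoted on the slide.

```python
def split_repertoires(c_rep, j_rep):
    """Return (C-only, J-only, overlap) for two repertoires of code points."""
    c_rep, j_rep = set(c_rep), set(j_rep)
    return c_rep - j_rep, j_rep - c_rep, c_rep & j_rep

# Toy repertoires (invented for illustration, not the real panel data).
c_only, j_only, overlap = split_repertoires(
    {0x4E00, 0x58F9, 0x5F0C}, {0x4E00, 0x58F1}
)

# Arithmetic check against the counts on the slide.
C_HAN, J_HAN, CJ_OVERLAP = 19520, 6356, 6181
c_only_count = C_HAN - CJ_OVERLAP               # 13339
j_only_count = J_HAN - CJ_OVERLAP               # 175
resolved_count = c_only_count + j_only_count    # 13514
```

Only the overlap (6181 code points on the slide) is carried forward to the next step; everything else is conflict-free.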
  10. Engineering Design. Step 2. Corpora: TC: Apple News; SC: Sina News; JP: Mainichi News. Computation for word usage and frequency: match the C/J overlap code points against usage and frequency of use; split unused code points; split code points with a low frequency of use. The sample size is statistically significant.
  11. Splitting Unused Code Points from the Overlap. Step 3. C/J overlap data set: 6181. Total unused: 2739. Used in C only: 1927. Used in J only: 203. C/J usage overlap: 1312. Total used: 3442 (1927 + 203 + 1312). Unified code points: 2739 + 203 + 1927 = 4869. Problem domain (unsolved overlap): 1312.
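The usage split on slide 11 can be sketched the same way, given the set of code points each corpus actually uses. The function `split_by_usage` and its toy input are assumptions for illustration; the four counts are taken from the slide.

```python
def split_by_usage(overlap, used_in_c, used_in_j):
    """Classify overlap code points by which corpus uses them."""
    overlap = set(overlap)
    c_used = overlap & set(used_in_c)
    j_used = overlap & set(used_in_j)
    return {
        "unused": overlap - c_used - j_used,
        "c_only": c_used - j_used,
        "j_only": j_used - c_used,
        "both": c_used & j_used,   # still unresolved after this step
    }

# Toy input with small integers standing in for code points.
groups = split_by_usage({1, 2, 3, 4}, used_in_c={1, 2}, used_in_j={2, 3})

# Arithmetic check against the counts on the slide.
UNUSED, C_ONLY, J_ONLY, BOTH = 2739, 1927, 203, 1312
unified = UNUSED + C_ONLY + J_ONLY    # 4869
total_used = C_ONLY + J_ONLY + BOTH   # 3442
```

The counts are consistent: 4869 unified plus 1312 unresolved gives the 6181 code points of the C/J overlap.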
  12. Computing Frequency of Use of Code Points. Step 4. Initial data set: 1312.
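The slides do not say how frequency of use was computed. One plausible reading is sketched below, assuming it is the share of a code point among all non-whitespace characters of a corpus, expressed as a percentage. The function name, the denominator, and the sample text are all assumptions.

```python
from collections import Counter

def frequency_of_use(text, code_points):
    """Percentage of non-whitespace characters taken by each tracked code point."""
    chars = [ch for ch in text if not ch.isspace()]
    counts = Counter(ord(ch) for ch in chars)
    total = len(chars)
    if total == 0:
        return {cp: 0.0 for cp in code_points}
    # Counter returns 0 for code points that never occur in the text.
    return {cp: 100.0 * counts[cp] / total for cp in code_points}

sample = "一人 一大"   # four characters after whitespace is dropped
freq = frequency_of_use(sample, {0x4E00, 0x4EBA})
```

In this sample 一 (U+4E00) accounts for two of the four characters and 人 (U+4EBA) for one. A real run would apply the same function to each news corpus and to the 1312 code points of the usage overlap.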
  13. Top 10 Most Popular Words.
     TC: 的 2774; 人 1005; 在 975; 一 964; 是 960; 不 951; 中 896; 有 883; 大 776; 台 718.
     SC: 日 20942; 月 20315; 人 4430; 国 3754; 中 3521; 被 2791; 称 2340; 地 2226; 南 2152; 生 2027.
     JP: 日 822; 年 496; 国 393; 会 345; 月 325; 人 325; 大 319; 市 253; 本 251; 中 250.
  14. [Chart] Top 20: Chinese Frequency of Use > Japanese Frequency of Use. Series: C-Freq, J-Freq. Y-axis: frequency of use (%). Data labels as extracted: 4.063, 4.1884, 1.7338, 0.6, 0.6094, 0.886, 0.5582, 0.5944, 0.5518, 0.468, 0.7042, 0.4304, 0.6026, 0.4488, 0.36, 0.3562, 0.4026, 0.385, 0.7508, 0.325. Generated data set: 939.
  15. [Chart] Top 20: Chinese Frequency of Use < Japanese Frequency of Use. Series: C-Freq, J-Freq. Y-axis: frequency of use (%). Data labels as extracted: 0.1988, 0.2788, 0.5512, 0.1956, 0.3834, 0.1644, 0.2366, 0.1056, 0.1212, 0.1088, 0.1688, 0.1422, 0.1344, 0.1622, 0.0856, 0.2812, 0.1588, 0.0912, 0.1134, 0.1288. Generated data set: 363.
  16. [Chart] Chinese Frequency of Use = Japanese Frequency of Use. Series: C-Freq, J-Freq. Y-axis: frequency of use (%). Code points on the x-axis: 8FCE, 7D20, 675F, 541B, 846C, 79E9, 82BD, 96C0, 5857, 5353. Data labels as extracted: 0.0222, 0.0144, 0.0112, 0.0056, 0.0056, 0.0022, 0.0012, 0.0012, 0.0012, 0.0012. Generated data set: 10.
  17. Frequency of Use Reassembly. C/J usage overlap data set: 1312. Frequency C > J: 939 (to Rc). Frequency J > C: 363 (to Rj). J = C: 10. Unified code points: 363 + 939 = 1302. Problem domain (unsolved overlap): 10.
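The reassembly on slide 17 amounts to a three-way partition by comparing the two frequencies. A minimal sketch, with a hypothetical function name and invented frequency values; only the counts at the end come from the slide:

```python
def partition_by_frequency(c_freq, j_freq):
    """Split shared code points into (C wins, J wins, tied) by frequency of use."""
    to_c, to_j, tied = set(), set(), set()
    for cp in c_freq.keys() & j_freq.keys():
        if c_freq[cp] > j_freq[cp]:
            to_c.add(cp)
        elif c_freq[cp] < j_freq[cp]:
            to_j.add(cp)
        else:
            tied.add(cp)
    return to_c, to_j, tied

# Invented frequencies (percent), not the measured values.
c_freq = {0x4E00: 0.96, 0x65E5: 0.40, 0x541B: 0.0056}
j_freq = {0x4E00: 0.30, 0x65E5: 0.82, 0x541B: 0.0056}
to_c, to_j, tied = partition_by_frequency(c_freq, j_freq)

# Arithmetic check against the counts on the slide.
FREQ_C_GT_J, FREQ_J_GT_C, USAGE_OVERLAP = 939, 363, 1312
unified = FREQ_C_GT_J + FREQ_J_GT_C     # 1302
unsolved = USAGE_OVERLAP - unified      # 10
```

An exact comparison only ties when both corpora yield identical percentages, which is why so few code points remain in the unsolved set.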
  18. Data Processing & Computation Recap. Filtering process: more than 20K Han code points; splitting non-overlapping leaves the 6181 CJK overlap; splitting unused leaves the 1312 usage overlap; frequency of use computation leaves 10 code points. Logic design: Methodology Review; CJK Coordination; Re-Sampling & Computation; Statistical Justification. The problem domain was effectively reduced.
  19. Future Work. [Histogram] Chinese Frequency of Use minus Japanese Frequency of Use (%), between Rc and Rj; y-axis: number of X within the same range. Mean = 0.034465, S.D. = 0.158477. Open question: redefine the overlap range, possibly expanding it by the standard deviation (?). This requires intensive CJK coordination and deliberation.
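One way to read the "overlap range redefine" idea on slide 19 is to treat code points whose frequency difference lies within some multiple of the standard deviation of zero as still contested, instead of only exact ties. The slide leaves this open, so the band definition, the factor `k`, the use of the population standard deviation, and the toy values below are all assumptions.

```python
import statistics

def contested_band(c_freq, j_freq, k=1.0):
    """Return (mean, sd, contested) for the C minus J frequency differences.

    A code point is contested when |C - J| <= k * sd.
    """
    diffs = {cp: c_freq[cp] - j_freq[cp] for cp in c_freq.keys() & j_freq.keys()}
    values = list(diffs.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population S.D.; the slide does not say which
    contested = {cp for cp, d in diffs.items() if abs(d) <= k * sd}
    return mean, sd, contested

# Toy frequencies with small integers standing in for code points.
c_freq = {1: 0.5, 2: 0.6, 3: 0.4, 4: 2.5}
j_freq = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}
mean, sd, contested = contested_band(c_freq, j_freq)
```

With these toy values only the code point with a large difference falls outside the band. Widening `k` enlarges the set that would need coordination between the panels.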
  20. Re-consider Language Tag. Diagram labels: C tag; J tag; K tag; Policy; TLD registries; IANA/Verisign; provisioning; root server operators; publication; Internet query; distribution masters; root servers; DNS resolvers. Language tag support (sources of language tags): RFC 2860: the name space of language tags is administered by IANA. ISO Standard 639: when a language has both an IANA-registered tag and a tag derived from an ISO registered code, one MUST use the ISO tag. Maintenance Agency: International Information Centre for Terminology (Austria).
  21. Perfection Syndrome. “Engineering isn't about perfect solutions; it's about doing the best you can with limited resources.” Randy Pausch