Lecture 09


  • Software systems are usually built in layers, so upper-level software can rely on facilities provided by the lower layers. The lowest layer of a software system is its operating system. If a facility is provided at a lower level, applications built on top of it can use that facility directly instead of building their own. Likewise, a coding standard can be supported at different levels of a software platform. Unicode, as a newer coding standard, was not supported at the operating system level in older versions of Microsoft Windows, namely the Windows 9x series and Windows ME. Because of this, an additional system software package, the Microsoft Layer for Unicode, was developed on top of the Windows 9x operating systems specifically for Unicode applications, and it offers only a limited set of functions through the Unicode Application Programming Interface (API). A Unicode application can run on these systems only if it is compiled with this additional layer. The Windows 9x/ME platform supports different multi-byte encodings through a mechanism called the code page: each multi-byte encoding is given a designated code page number, and the system remembers the current code page number at run time so that the fonts and related facilities associated with that code page can be loaded for the required locale. Windows NT/2000/XP changed the operating system to support Unicode as the internal coding standard; all other encodings are converted into Unicode. Q: Can Java, which is Unicode based, run on the Windows 9x/ME platform, and why?
  • The Linux operating system, which has its origin in UNIX, uses the UTF-8 format to support Unicode applications. Unicode applications built on top of Linux must be linked with glibc 2.2.2 or later for C language applications, and with XFree86 4.0.3 or later for X Window applications. Documents and similar files are also written in UTF-8, and the locale functions conform to the POSIX API. On Apple machines, wide-character Unicode is supported from Mac OS 9.1 onward. In fact, Mac OS was one of the first operating systems to support Unicode internally. Q: What is the main difference between UTF-8 and Unicode? In terms of the operating system, explain the advantages and disadvantages of using UTF-8 versus Unicode.
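As a starting point for the question above, the following minimal sketch compares how many bytes the same character needs in UTF-8 and in a 16-bit Unicode form. The class and method names are illustrative only and do not come from the lecture.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes needed to store the text in UTF-8 (variable width).
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store the text in UTF-16, big-endian, without a byte order mark.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Latin letter A (U+0041) and the ideograph U+4E00.
        String[] samples = {"A", "\u4E00"};
        for (String s : samples) {
            System.out.println("U+" + String.format("%04X", s.codePointAt(0))
                    + " UTF-8 bytes: " + utf8Length(s)
                    + ", UTF-16 bytes: " + utf16Length(s));
        }
    }
}
```

ASCII text stays one byte per character in UTF-8, which keeps existing UNIX tools and file names working, while a BMP ideograph costs three bytes in UTF-8 against two in a 16-bit form.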
  • The term code page is used in the older Microsoft multi-byte systems to identify the particular coding standard in use. Each coding standard is assigned a code page identifier. For example, Big5 is given code page 950. A document written using a particular code page also carries the code page information, which we call the codeset announcement. The code page identifiers are designated by Microsoft, and a public list of them is available. Even in Windows NT/2000/XP, where the internal code is already Unicode, code page identifiers are still needed for conversions to and from Unicode. The conversion tables are also needed by multi-byte applications developed in the past. Windows NT/2000/XP also provide utility functions that make use of these conversion tables.
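The idea of using a code page identifier to select a conversion table can be sketched in Java as follows. This is only an illustration: the lookup table CHARSET_NAMES and the decode helper are hypothetical, and the sketch assumes the JDK in use provides charsets under the names "windows-1252" and "x-windows-950".

```java
import java.nio.charset.Charset;
import java.util.Map;

class CodePageLookup {
    // Hypothetical table from a Windows code page number to a Java charset name.
    static final Map<Integer, String> CHARSET_NAMES = Map.of(
            1252, "windows-1252",   // Windows Latin-1
            950, "x-windows-950");  // Traditional Chinese (Big5)

    // Converts bytes written under the given code page into a Unicode String.
    static String decode(byte[] data, int codePage) {
        String name = CHARSET_NAMES.get(codePage);
        if (name == null || !Charset.isSupported(name)) {
            throw new IllegalArgumentException("No charset available for code page " + codePage);
        }
        return new String(data, Charset.forName(name));
    }

    public static void main(String[] args) {
        // Byte 0x80 is the euro sign in code page 1252.
        String euro = decode(new byte[] {(byte) 0x80}, 1252);
        System.out.println(String.format("U+%04X", euro.codePointAt(0)));
    }
}
```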
  • The Java programming language uses Unicode internally. However, Java does not require data from different sources to be coded only in Unicode. Because Unicode is a superset of the existing character coding standards, we can always convert a multi-byte encoding to and from Unicode. The multi-byte character sets supported by Java, and the corresponding APIs, are documented in publicly available specifications on the internet. Even the source of a Java program can be encoded in different coding standards. For example, we can write a Java program on a Windows 98 Big5 platform. In this case, any hard-coded Chinese messages need to be converted to Unicode at compile time. Therefore, the Java compiler gives you the option to specify the source encoding as follows: javac -encoding <encoding> <source files> Java represents text with the String type. The getBytes method of String, called with the name of a multi-byte encoding, converts a string to multi-byte data. For example, the statement byte[] utf8Bytes = str.getBytes("UTF-8"); converts the data into UTF-8 code. Conversely, the statement String str = new String(utf8Bytes, "UTF-8"); converts UTF-8 data into Unicode data.
  • Generally speaking, codeset conversion can suffer from the 1-to-0 problem or the 1-to-N problem when a 1-to-1 mapping cannot be provided. However, since Unicode is a superset of all existing national standards, it can guarantee round-trip conversion. Round-trip conversion is defined as follows: suppose a file file1 in codeset A is converted to a file file2 in codeset B, and file2 is then converted back to codeset A as a file named file3. If file3 equals file1, we say that codeset B guarantees round-trip conversion for A.
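The definition above can be turned into a small check. The following is a minimal sketch with illustrative method names; it assumes the JDK provides the Big5 charset for the demonstration in main.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // file1 (bytes in codeset A) -> file2 (Unicode String) -> file3 (bytes in codeset A).
    // Returns true when file3 equals file1.
    static boolean roundTripsThroughUnicode(byte[] file1, Charset codesetA) {
        String file2 = new String(file1, codesetA);
        byte[] file3 = file2.getBytes(codesetA);
        return Arrays.equals(file1, file3);
    }

    // Unicode text -> codeset B -> Unicode text.
    // Fails when codeset B has no code for some character (the 1-to-0 problem),
    // because Java substitutes a replacement byte for unmappable characters.
    static boolean roundTripsThrough(String text, Charset codesetB) {
        return text.equals(new String(text.getBytes(codesetB), codesetB));
    }

    public static void main(String[] args) {
        byte[] big5Data = {(byte) 0xA4, 0x40};
        System.out.println(roundTripsThroughUnicode(big5Data, Charset.forName("Big5")));
        System.out.println(roundTripsThrough("\u4E00", StandardCharsets.ISO_8859_1));
    }
}
```

The second call shows the opposite direction: Latin-1 cannot represent the ideograph U+4E00, so Latin-1 does not guarantee round-trip conversion for Unicode text.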
  • The Java code byte[] my_data = {(byte) 0xA4, 0x40}; puts 2 bytes into the byte array my_data. Then the statement String my_unicode_data = new String(my_data, "Big5"); converts the 2 bytes of data (1 character in Big5) from Big5 code into Java's String type, which is in Unicode. The conversion is done automatically. Note that Java uses the codeset name, not the Microsoft code page number. The getBytes method of the String type can convert a Unicode string to any other supported multi-byte encoding. After the statement byte[] my_b5_data = my_unicode_data.getBytes("Big5"); the variable my_b5_data contains the value 0xA440, which is the code of the character 一 in Big5, whereas after the statement byte[] my_gb_data = my_unicode_data.getBytes("GBK"); the variable my_gb_data contains the value 0xD2BB, which is the code of the character 一 in GB2312.
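The conversions described above can be tried with the following minimal sketch. It uses the Charset overloads of the same String methods so that no checked exception needs handling, and it assumes the JDK provides the Big5 and GBK charsets; the class name and the hex helper are illustrative only.

```java
import java.nio.charset.Charset;

class Big5GbkDemo {
    // Formats a byte array as upper-case hexadecimal, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Multi-byte data -> Unicode String.
    static String decode(byte[] data, String codesetName) {
        return new String(data, Charset.forName(codesetName));
    }

    // Unicode String -> multi-byte data.
    static byte[] encode(String text, String codesetName) {
        return text.getBytes(Charset.forName(codesetName));
    }

    public static void main(String[] args) {
        byte[] myData = {(byte) 0xA4, 0x40};
        String myUnicodeData = decode(myData, "Big5");
        System.out.println(String.format("U+%04X", myUnicodeData.codePointAt(0)));
        System.out.println("Big5: " + hex(encode(myUnicodeData, "Big5")));
        System.out.println("GBK:  " + hex(encode(myUnicodeData, "GBK")));
    }
}
```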
  • For file operations, Java also allows data to be coded in other multi-byte encodings. The slide shows an example in which the input data is interpreted as Big5 and converted automatically into Unicode.
  • Here is an example of output in a multi-byte encoding, in this case Big5. The output is therefore the two Big5 codes 0xAA65 0xB362 for 河豚, even though the program actually wrote the two Unicode values \u6CB3\u8C5A.
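The import and export patterns from the two notes above can be combined into one runnable sketch that writes text to a temporary file in a chosen encoding and reads it back. The class and method names are illustrative, the sketch assumes the JDK provides the Big5 charset, and I/O errors are rethrown as unchecked exceptions only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class EncodedFileDemo {
    // Writes Unicode text to a new temporary file, encoded with the given charset.
    static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoded", ".txt");
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(file.toFile()), charset))) {
                out.write(text);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the first line of the file, decoding its bytes with the given charset.
    static String readFirstLine(Path file, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");
        Path file = writeTemp("\u6CB3\u8C5A", big5);
        System.out.println("bytes on disk: " + file.toFile().length());
        System.out.println("read back equals original: "
                + "\u6CB3\u8C5A".equals(readFirstLine(file, big5)));
    }
}
```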
  • A multilingual application is different from an application for a single language. In a multilingual application, the data (normally the manipulation data) are multilingual in nature. For example, in software that teaches Chinese to English speakers, the display data, such as the software manual, should be in English, whereas the data being manipulated should be Chinese with English explanations, which is bilingual in nature. We have already learned how to write I18N software. It should be pointed out, however, that the primary purpose of I18N is to serve a single language at a time: it is intended to facilitate porting an application from one language or locale to another. Even though many Asian locales also support Latin symbols, these symbols are not treated as part of the Asian scripts. It is nevertheless useful to design a multilingual application using the I18N approach. In the analysis of a multilingual application, we can consider display data and manipulation data separately. If this separation is done properly, the display data can then be designed using the I18N approach.
  • These notes introduce a set of symbols defined in ISO 10646 as ideograph description characters (IDCs). These characters are used to describe the structure of ideographic characters. For instance, the character 峰 is obviously composed of two character components in a left-right structure, and an IDC is used to indicate such a left-right structure. Based on the IDCs, an ideographic composition scheme is then discussed. The composition scheme provides a method to describe a character in terms of its component characters.
  • The ideograph description characters are structural symbols that indicate the positions of the character components used in forming a character; these components are sometimes called the smaller ideograph functional units. There are 12 IDCs in ISO 10646, coded in the range 2FF0 to 2FFB. For example, the symbol ⿰ indicates a left-right structure, as in the character 峰. Note that the characters 2FF2 and 2FF3 describe characters through three component characters, whereas the other 10 symbols require only 2 component characters.
  • Ideographs are usually formed from smaller components such as radicals, ideographs proper (獨立漢字), and ideograph components (漢字部件). These radicals, ideographs proper and ideograph components are sometimes all called character components (部件). The same components may form different characters: for example, the two components 大 and 小 can form two different characters depending on their relative positions. In other words, the components are not the only factor that determines a character; the relative positions of the components are also part of the character formation. It should be pointed out that Chinese has a long tradition of describing characters through their components. For example, when someone gives his name as "zhang" (Putonghua Pinyin), he is likely to explain further that it is 弓長張, not 立早章.
  • In all the current encodings of Chinese characters, each character is considered an independent symbol and is thus given a separate codepoint. Such codepoint assignment has no regard (or very limited regard) for the internal structure or substructures of the characters. In other words, the codepoint assignment is not directly linked to this information. Whenever a new character needs to be supported, an extension to the existing standard must be produced. Even though characters assigned in the same block are arranged in Kang Xi (康熙) radical order, characters are assigned to different blocks, so the radical order cannot be maintained globally. It is in the nature of the Chinese language that new characters are created once in a while. This gives rise to the need to extend the standard indefinitely, which can be very time consuming. Also, it is not practical to assign a codepoint to every character that exists or has ever existed, because some characters are used so rarely that there is rarely any need to exchange them. If the codespace is considered a resource, giving every rarely used character a codepoint and maintaining it throughout the system is not an efficient use of that resource.
  • Because of the limitations of the encoding method, and the practical need to use new characters and existing but rarely used characters, the ISO 10646 working group started to work on the idea of using character components to describe characters. The intention was to use structural symbols and existing characters (used as components) to describe not yet coded characters. The original proposal had 15 structural symbols, but eventually only 12 symbols were accepted and given the code range 2FF0 to 2FFB. The 3 uncoded symbols are not shown here, but their functions are explained. It should be noted that Left_Up_Encompass can be used for characters such as 斗. Yet because such characters are already coded, the symbol did not need to be included.
  • With the 12 structure symbols, an Ideographic Composition Scheme was also introduced to describe a character using an ideograph description sequence (IDS) formed from components and the structure symbols, where the IDCs are considered operators applied to the components according to certain rules. The IDSs can be described formally by a context-free grammar written in the well-known Backus-Naur Form (BNF). Like any grammar, the IDS grammar G is described by four components, as listed above. Let G = {Σ, N, P, S}, where Σ is the set of terminal symbols: coded radicals, coded ideographs, and the 12 IDCs; N is the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}; S = {IDS} is the start symbol of the grammar; and P is a set of rewrite rules, which are shown on the next page.
  • The rewrite rules are listed above. Note that the two ternary IDC symbols require three components and are thus listed separately. Even though the choice between a binary operator and a ternary operator is arbitrary in some cases, once the choice is made there is no ambiguity in processing the IDS. Because every IDC is a prefix operator with a fixed number of operands, this context-free grammar can easily be processed by programs. In other words, the character structure described by this grammar is not ambiguous: it cannot be interpreted in more than one way. This is very important, as it implies that a legal IDS describes only one character. Of course, this statement is true only if the IDCs themselves are not ambiguous, and as we will see later this is not the case for all IDCs.
  • The above gives some examples of IDSs that are very commonly used; there does not seem to be much ambiguity for anyone about what they represent.
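To illustrate how easily this grammar can be processed, here is a minimal recursive-descent checker for the rewrite rules. It is a sketch under simplifying assumptions: any code point outside 2FF0 to 2FFB is accepted as an Ideograph_Component without checking that it really is a coded ideograph or radical, and the input must not contain spaces. The class and method names are illustrative.

```java
class IdsValidator {
    // Binary IDCs: 2FF0, 2FF1 and 2FF4 to 2FFB.
    static boolean isBinary(int cp) {
        return cp == 0x2FF0 || cp == 0x2FF1 || (cp >= 0x2FF4 && cp <= 0x2FFB);
    }

    // Ternary IDCs: 2FF2 and 2FF3.
    static boolean isTernary(int cp) {
        return cp == 0x2FF2 || cp == 0x2FF3;
    }

    // Parses one <IDS1> starting at pos.
    // Returns the index just past it, or -1 if the input ends too early.
    static int parseIds1(int[] cps, int pos) {
        if (pos >= cps.length) {
            return -1;
        }
        int cp = cps[pos];
        // An IDC takes 2 or 3 operands; a plain component takes none.
        int arity = isBinary(cp) ? 2 : isTernary(cp) ? 3 : 0;
        int next = pos + 1;
        for (int i = 0; i < arity && next != -1; i++) {
            next = parseIds1(cps, next);
        }
        return next;
    }

    // A legal <IDS> must start with an IDC and consume the whole input.
    static boolean isLegalIds(String text) {
        int[] cps = text.codePoints().toArray();
        if (cps.length == 0 || !(isBinary(cps[0]) || isTernary(cps[0]))) {
            return false;
        }
        return parseIds1(cps, 0) == cps.length;
    }

    public static void main(String[] args) {
        System.out.println(isLegalIds("\u2FF0\u5F73\u77BF")); // left-right IDC with two components
        System.out.println(isLegalIds("\u83AB\u8A00"));       // two components without an IDC
    }
}
```

Because each operator announces its own number of operands, the checker never needs to backtrack, which is the practical meaning of "no ambiguity in processing" above.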
  • For example, the IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD, ⿻) describes the abstract form of an ideograph with D1 and D2 overlaying each other. But it is not clear how these two components should be overlaid, or whether they should touch each other. For example, when describing the character 巫 using components, everyone understands that there are two components, 从 and 工. Yet their arrangement cannot be described by any IDC other than ⿻. But if you take just the IDS "⿻从工", you cannot tell that the two 人 in 从 should be split and that the vertical bar (丨) should pass between the two 人 without touching them. Nor can you tell that the top horizontal bar (一) should be above 人 and the bottom horizontal bar (一) should be below 人. This indicates that ⿻ has built-in ambiguity. As another example, IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT (IDC-SUR, ⿹) describes the abstract form of an ideograph with D1 at the top right corner of D2, and D2 encompassed by D1. For instance, ⿹ is used to represent the character . Yet we would question the legitimacy of ⿹从工 and ⿹工从, as it is not clear what these sequences represent, although there is no explicit rule against such use.
  • It should be noted that each character can in principle be described by different IDSs, as shown in the above examples. Generally speaking, we can only say which decomposition is the "most commonly used"; we cannot generally claim which one is the "correct" decomposition. The reason is that the decomposition rules themselves can be ambiguous even to the most knowledgeable scholars. For example, the character 章 is normally decomposed into 立 and 早 because the character takes 立 as its radical (in the Kang Xi dictionary), even though 十 is also a radical, as in the character 卓.
  • Each IDS uniquely defines a character, but a character may be described by different IDSs. Taking 章 as an example, it can also be described by 音 and 十. The reason is that the IDCs indicate the relative positions of the components but do not give a precise indication of their sizes. Consequently, an IDS cannot be used for rendering purposes. In other words, to render a character correctly from an IDS, additional information needs to be provided.
  • The term component has been used in many places throughout this subject. In fact, the basic strokes can be considered components, and every character is built from strokes. In practice, however, we look at characters and their components (or decomposition) using a more top-down approach, that is, we look at their substructures from a more functional view. For example, we would first decompose the character 琦 into two components, 王 and 奇. As 王 is already a radical (for classification and indexing), it serves as a functional unit that we would not decompose further. Likewise, 奇 is an ideograph proper (正字), so we do not need to decompose it further even though it can be decomposed into the components 大 and 可. Therefore, we use a practical and recursive approach to define components as follows: all radicals are components; all strokes are components (in fact, all strokes are coded as of spring 2005); and all coded ideographs are components.
  • The above gives some examples of components that might be of use for rarely used characters.
  • IDCs were originally intended for describing un-encoded characters. Their introduction gives an alternative mechanism for describing Chinese characters that are not yet coded. However, IDCs are not limited to describing un-encoded characters. IDCs can reveal the substructures of ideographs. Together, the IDCs and the IDS provide a linear way of describing a character using its components. Thus the IDS is a convenient tool for describing character composition and decomposition. This has an additional educational benefit for the study of Chinese characters. When studying Chinese variants, we can use two IDSs to decompose them. The difference in the substructures or components can pinpoint the specific place where the two characters differ. For example, can be described as ⿰ 氵 ⿱    又 , whereas, can be described by ⿰ 氵 ⿱ 几又
  • On this page we are given a few examples of characters, and we can see how they can be described (or decomposed) using an IDS. Ex 1: 忂 is obviously a left-right structure, so its IDS is ⿰彳瞿. For this decomposition to work, both 彳 and 瞿 must be coded ideographs (or components), and their Unicode values are indeed U+5F73 and U+77BF. Ex 2: 䑑 => ⿰ U+81E3 U+83D0. Ex 3: 䔴 has the IDS ⿱艹⿰祟又, because "祟又" is not a single character in Unicode. The respective Unicode values of these component characters are U+8279, U+795F, and U+53C8. Ex 4: 蠿 is difficult to decompose. Even though it is an obvious top-bottom structure, the upper component of 蠿 is not defined in Unicode. Its description is thus quite cumbersome, as shown here: ⿱⿰⿱⿱⿰幺幺一⿱⿰幺幺一丨⿰虫虫. Parentheses are added here to make the structure easier to see: ⿱(⿰(⿱(⿱(⿰幺幺)一)(⿱(⿰幺幺)一))丨)(⿰虫虫). The Unicode values of the component characters are 幺 (5E7A), 一 (4E00), 丨 (4E28), and 虫 (866B).
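When working through such examples, it helps to list the code points of an IDS, since an IDC and its components are ordinary coded characters. The following small helper is a sketch with illustrative names; it uses only standard String and stream methods.

```java
import java.util.stream.Collectors;

class IdsCodePoints {
    // Renders each code point of the sequence in U+XXXX notation, separated by spaces.
    static String describe(String ids) {
        return ids.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The IDS from Ex 1: the left-right IDC followed by its two components.
        System.out.println(describe("\u2FF0\u5F73\u77BF"));
    }
}
```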

    1. 1. Unicode support status in various platforms (Microsoft Windows) <ul><li>Windows 9x / ME </li></ul><ul><ul><li>Do not support Unicode internally </li></ul></ul><ul><ul><li>Limited Unicode APIs are supported. </li></ul></ul><ul><ul><li>Unicode applications compiled with Microsoft Layer for Unicode can be run on Win9x </li></ul></ul><ul><ul><li>Use code page to support different encodings </li></ul></ul><ul><li>Windows NT / 2000 / XP </li></ul><ul><ul><li>Support Unicode </li></ul></ul><ul><ul><li>Use of wide char (fixed 2 bytes) </li></ul></ul><ul><ul><li>Use UCS-2 </li></ul></ul>
    2. 2. Unicode support status in various platforms (Linux & Mac OS) <ul><li>Linux </li></ul><ul><ul><li>Newer kernels support Unicode </li></ul></ul><ul><ul><li>Requires glibc 2.2.2 and XFree86 4.0.3 or newer </li></ul></ul><ul><ul><li>Uses UTF-8 in most cases, e.g. filesystem </li></ul></ul><ul><ul><li>Set locale to <lang>_<place>.<encoding>, e.g. zh_TW.utf8 </li></ul></ul><ul><ul><li>Enable UTF-8 support in console by executing unicode_start </li></ul></ul><ul><li>Mac OS </li></ul><ul><ul><li>Mac OS 9.1, Mac OS X support Unicode </li></ul></ul><ul><ul><li>16-bit for Unicode character </li></ul></ul>
    3. 3. What is a code page <ul><li>There are a lot of different encodings, e.g. EUC-TW, Big5, Latin-1 etc. </li></ul><ul><li>A code page (code page identifier) is a number to identify a codeset. </li></ul><ul><ul><li>e.g. 950 – Traditional Chinese (Big5) </li></ul></ul><ul><ul><li>e.g. 1252 – Windows Latin-1 </li></ul></ul><ul><ul><li>Other code page identifiers can be found in: </li></ul></ul><ul><ul><li>http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp </li></ul></ul><ul><li>In Windows NT/2000/XP, code page conversion table provides information to convert between different encodings. </li></ul>
    4. 4. Java <ul><li>Java is in Unicode internally. The supported encoding sets are provided by Java library packages rt.jar and i18n.jar </li></ul><ul><li>The supported encoding sets for java.io.* , java.lang.* and java.nio.* API can be found in: </li></ul><ul><li>http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html </li></ul><ul><li>User input/output will be automatically converted between Unicode and the system code page </li></ul><ul><li>Specify the encoding of the source files when compiling. </li></ul><ul><ul><li>javac -encoding <encoding> <source files> </li></ul></ul><ul><li>Convert to other supported encoding: </li></ul><ul><li>e.g. byte[] utf8Bytes = str.getBytes("UTF-8"); </li></ul><ul><li>Convert from other supported encoding: </li></ul><ul><li>e.g. String str = new String(utf8Bytes, "UTF-8"); </li></ul>
    5. 5. Code Conversion <ul><li>Generally codeset conversion cannot provide one-to-one mapping(unless the two character sets are exactly the same) </li></ul><ul><li>Unicode is a superset of every existing national standard => guaranteed round-trip conversion </li></ul><ul><li>Round-trip conversion : Suppose a file file 1 in codeset A is converted to a file file 2 in codeset B and then converted back to codeset A with a file name file 3 . </li></ul><ul><ul><li>If file 3 =file 1 , we say that codeset B guarantees round-trip conversion for A. </li></ul></ul>
    6. 6. Java Code conversion <ul><li>Conversion from multibyte to Unicode </li></ul><ul><ul><li>byte[] my_data = {(byte) 0xA4, 0x40}; </li></ul></ul><ul><ul><li>String my_unicode_data = new String(my_data, "Big5"); </li></ul></ul><ul><ul><li>Where "Big5" is the name of the multibyte encoding. Java needs this name to convert the bytes to Unicode. </li></ul></ul><ul><li>Conversion from Unicode to multibyte </li></ul><ul><ul><li>String my_unicode_data = "\u4E00"; ( 一 ) </li></ul></ul><ul><ul><li>byte[] my_b5_data = my_unicode_data.getBytes("Big5"); </li></ul></ul><ul><ul><ul><li>my_b5_data will have the value 0xA440 </li></ul></ul></ul><ul><ul><li>byte[] my_gb_data = my_unicode_data.getBytes("GBK"); </li></ul></ul><ul><ul><ul><li>my_gb_data will have the value 0xD2BB </li></ul></ul></ul>
    7. 7. <ul><li>Text stream import </li></ul><ul><ul><li>File I = new File("input"); </li></ul></ul><ul><ul><li>FileInputStream tmpin = new FileInputStream(I); </li></ul></ul><ul><ul><li>BufferedReader in = new BufferedReader(new InputStreamReader(tmpin, "Big5")); </li></ul></ul><ul><li>Once the BufferedReader in is established, data can be read using the readLine() method. </li></ul><ul><ul><li>inputStr = in.readLine(); </li></ul></ul>
    8. 8. <ul><li>Text Stream Export </li></ul><ul><ul><li>File o = new File("output.big5"); </li></ul></ul><ul><ul><li>FileOutputStream tmpout = new FileOutputStream(o); </li></ul></ul><ul><ul><li>BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, "Big5")); </li></ul></ul><ul><li>… . </li></ul><ul><li>out.write("\u6CB3\u8C5A"); ( 河豚 ) </li></ul><ul><li>out.close(); </li></ul><ul><li>Output bytes: 0xAA65 0xB362 </li></ul>
    9. 9. Multilingual applications <ul><li>Software teaching Chinese for English people </li></ul><ul><li>Software teaching English for Chinese </li></ul><ul><li>Conceptually separate two types of data in a multilingual application: </li></ul><ul><ul><li>Data related to display of menu/instructions, </li></ul></ul><ul><ul><li>Data related to the processing in the program </li></ul></ul><ul><ul><li>Multilingual application vs. I18n applications </li></ul></ul><ul><ul><li>I18N: data related to display and processing are the same and it is for the same language/convention </li></ul></ul><ul><ul><li>Multilingual applications: Data related to display is for one language(and can be internationalized). Data related to the processing can be multilingual and not necessarily related to the display language. </li></ul></ul><ul><ul><li>Unicode is the most convenient encoding for multilingual applications, but not absolutely necessary </li></ul></ul>
    10. 10. The Ideographic Composition Scheme Used in ISO 10646 <ul><li>Introduction to Ideograph Description Characters(IDCs) </li></ul><ul><li>The ideographic composition scheme </li></ul><ul><li>Application using IDCs </li></ul>
    11. 11. What are Ideograph Description Characters <ul><li>12 structure symbols used to describe the formation of characters using some smaller ideograph functional units such as character components ⿰ ⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻ </li></ul>
    12. 12. Characteristics of Ideographs <ul><li>Ideograph characters are often formed by smaller ideographic elements such as radicals, ideographs proper, and other ideographic components, which we generally call ideograph components </li></ul><ul><li>Natural in the formation of characters </li></ul><ul><li>Examples: 2 components </li></ul><ul><li>=> </li></ul><ul><li>Chinese speakers have long used components to describe characters, especially characters with the same pronunciation </li></ul>
    13. 13. Problems with Ideograph Character Encoding <ul><li>Each character is treated as a different symbol, and thus given a codepoint </li></ul><ul><li>Codepoint assignment in a block does try to follow radical order, but codepoint assignment does not consider the substructures (components). Thus such information is not revealed. </li></ul><ul><li>When a new character is created, codepoint allocation is needed in new blocks, thus radical order cannot be globally maintained. </li></ul><ul><li>Also there is a potentially endless standardization process </li></ul><ul><ul><li>Encoding of rarely used ideograph characters is a waste of resources, both in terms of code space and standardization effort </li></ul></ul>
    14. 14. Introduction of IDCs <ul><li>Work started in 1995 by ISO/IEC SC2/WG2/IRG </li></ul><ul><li>Objective of the original proposal: use coded ideographs and "structure symbols" to describe not yet coded ideographs. </li></ul><ul><li>The original proposal had 15 "Ideograph Structure Symbols" based on a study of Han characters; three of them did not make it into ISO 10646/Unicode: </li></ul><ul><ul><li>Ideograph_Proper( 日 ): Every coded character is considered an ideograph proper, thus not needed </li></ul></ul><ul><ul><li>Left_Up_Encompass: no un-encoded example </li></ul></ul><ul><ul><li>Mirror_Symmetry ( 非 ): left being mirrored to the right, but can be described by Left_to_Right </li></ul></ul><ul><li>The 12 symbols were renamed Ideograph Description Characters </li></ul>
    15. 15. Ideographic Composition Scheme <ul><li>IDS describes a character using its components and indicating the relative positions of the components. </li></ul><ul><li>IDCs are considered operators to the components. </li></ul><ul><li>IDSs can be expressed by a context free grammar through the Backus Naur Form (BNF) . The grammar G has four components: </li></ul><ul><li>Let G = {  , N, P, S}, where </li></ul><ul><ul><ul><li> : the set of terminal symbols — coded radicals, coded ideographs, and the 12 IDCs. </li></ul></ul></ul><ul><ul><ul><li>N:the set of 5 non-terminal symbols </li></ul></ul></ul><ul><ul><ul><ul><li>N={IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component} </li></ul></ul></ul></ul><ul><ul><ul><li>S = {IDS}, which is the start symbol of the grammar </li></ul></ul></ul><ul><ul><ul><li>P: a set of rewrite rules </li></ul></ul></ul>
    16. 16. <ul><li>The following is the set of rewriting rules P: </li></ul><ul><li><IDS> ::= <Binary_Symbol><IDS1><IDS1> | <Ternary_Symbol><IDS1><IDS1><IDS1> </li></ul><ul><li><IDS1> ::= <IDS> | <Ideograph_Component> </li></ul><ul><li><Ideograph_Component> ::= coded_ideograph | coded_radical | coded_component </li></ul><ul><li><Binary_Symbol> ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻ </li></ul><ul><li><Ternary_Symbol> ::= ⿲ | ⿳ </li></ul><ul><li>Note that even though the IDCs are terminal symbols, they are not part of the ideograph components. </li></ul>
    17. 17. Examples
    18. 18. <ul><ul><ul><li>IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID ( IDC-OLD , ⿻ ) : </li></ul></ul></ul><ul><ul><ul><ul><li>The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>⿻ 从工 is an example of an IDS which represents the abstract form of 巫 </li></ul></ul></ul></ul><ul><li>IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT ( IDC-SUR, ⿹ ) : </li></ul><ul><ul><ul><ul><li>The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 on the right top corner of D2, and D2 is encompassed by D1. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>⿹ is an example of an IDS which represents the abstract form of </li></ul></ul></ul></ul>
    19. 19. <ul><li>IDS allows a character to be described by different sequences </li></ul><ul><li>One IDS should describe only one character, yet one character can be described by different IDSs. </li></ul>
    20. 20. <ul><li>IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components, but does not indicate the proportions. </li></ul><ul><li>Not intended for rendering. </li></ul><ul><li>Nesting is natural in ideographs and it is reflected in the IDS scheme </li></ul>
    21. 21. Components <ul><li>Ideographic Components(IRG definition) : </li></ul><ul><li>units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO 10646 (BMP) and some basic elements used to form ideographs. </li></ul><ul><li>Radicals(IRG definition): those ideographic components listed in index pages of KX1, DKW, DJW, HYD. </li></ul><ul><li>ISO extensions: </li></ul><ul><ul><li>Radicals </li></ul></ul><ul><ul><li>Components </li></ul></ul>
    22. 22. <ul><li>28 from GBK and more from IRG </li></ul><ul><li>ISO IRG component sample </li></ul>
    23. 23. Extending the Objectives of IDCs <ul><li>Using coded characters to describe not yet coded ideographs both for representation and exchange </li></ul><ul><li>Limit standardization to only modern characters, and not some rarely used characters </li></ul><ul><li>Learning of character composition (education) </li></ul><ul><li>Revealing substructures of ideograph characters </li></ul><ul><li>Description of ideograph variants </li></ul>
    24. 24. Examples <ul><li>Given characters => IDS? </li></ul><ul><li>忂 䑑 䔄 䔴 蠿 </li></ul><ul><li>Given an IDS => what are the characters </li></ul><ul><li>莫言 </li></ul><ul><li>艹旲言 </li></ul><ul><li>Is the following a legal IDS? </li></ul><ul><li>莫言 艹旲 </li></ul>
    25. 25. Conclusion <ul><li>IDCs were introduced in Unicode 3.0 </li></ul><ul><li>Their use is going beyond the original objective </li></ul><ul><li>Applications based on the IDCs have already been developed, such as in the Hong Kong Glyph Specification. </li></ul><ul><li>IDCs should also be useful in ideograph variant specifications </li></ul><ul><li>Additional search site: </li></ul><ul><ul><li>http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW </li></ul></ul>