BSDCONV
Kuan-Chung Chiu
(buganini at gmail dot com)
Contents
1 Syntax
  1.1 Phases & Cascade
  1.2 Codecs & Fallback
  1.3 Codec argument
2 Type & Flag
  2.1 Type
  2.2 Flag
  2.3 Helper codecs
3 C Programming guide
  3.1 Conversion instance lifecycle
  3.2 Skeleton
  3.3 Output mode
    3.3.1 BSDCONV_HOLD
    3.3.2 BSDCONV_AUTOMALLOC
    3.3.3 BSDCONV_PREMALLOCED
    3.3.4 BSDCONV_FILE
    3.3.5 BSDCONV_FD
    3.3.6 BSDCONV_NULL
    3.3.7 BSDCONV_PASS
  3.4 Counters
  3.5 Memory pool issue

1 Syntax
1.1 Phases & Cascade
There are three types of conversion phases defined in bsdconv: from, inter, and to. The from phase takes a byte sequence and decodes it into a list of code points (except for from/PASS); the to phase encodes the list of code points back into a byte sequence. The inter phase maps code points to code points.
A basic conversion consists of a from phase and a to phase. Codec name lookup is case-insensitive.

    ISO-8859-1 : UTF-8
    from         to
Figure 1: Basic two-phase conversion

Between the from and to phases, there can be an inter phase.

    UTF-8 : UPPER : UTF-8
    from    inter   to
Figure 2: Conversion with inter-mapping phase

There can be more than one inter phase.

    UTF-8 : UPPER : FULL  : UTF-8
    from    inter   inter   to
Figure 3: Conversion with multiple inter-mapping phases

An inter phase can also be used on its own, mostly programmatically.

    HALF
    inter
Figure 4: Standalone inter-mapping phase

Conversions can be cascaded with the pipe symbol. In most cases this is equivalent to a shell pipe, except when codecs that manipulate flags are involved (see section 2.2).

    UTF-8 : BIG5 | BIG5 : UTF-8
    from    to     from   to
Figure 5: Cascaded conversions

ASCII-compatible codecs are designed to exclude the ASCII part and are named _FOO, with the alias FOO ⇒ _FOO,ASCII or ASCII,_FOO.
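The separators described in section 1.1 can be illustrated with a small standalone sketch. It does not use libbsdconv; `count_conversions` and `count_phases` are hypothetical helpers that only count `|` and `:` separators in a conversion string, and the library's real parser (alias expansion, whitespace handling, validation) is assumed to do considerably more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of cascaded conversions in a conversion
 * string, i.e. one more than the number of '|' separators. */
size_t count_conversions(const char *spec)
{
    assert(spec != NULL);
    size_t n = 1;
    for (const char *p = spec; *p != '\0'; p++)
        if (*p == '|')
            n++;
    return n;
}

/* Hypothetical helper: number of phases in the conversion at the given
 * 0-based index, i.e. one more than the number of ':' separators inside
 * that '|'-delimited segment. Returns 0 if the index is out of range. */
size_t count_phases(const char *spec, size_t index)
{
    assert(spec != NULL);
    const char *p = spec;
    for (size_t i = 0; i < index; i++) {
        p = strchr(p, '|');
        if (p == NULL)
            return 0;
        p++; /* step past the '|' */
    }
    size_t n = 1;
    for (; *p != '\0' && *p != '|'; p++)
        if (*p == ':')
            n++;
    return n;
}
```

Under these assumptions, the string of Figure 3 has one conversion with four phases, and the string of Figure 5 has two conversions with two phases each.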
1.2 Codecs & Fallback
A phase consists of one or more codecs, separated by commas. A later codec is used if and only if the earlier codecs fail to consume the incoming data. Once a codec finishes its task, the first codec takes over again for the upcoming data.

    UTF-8 : ASCII , 3F
    from    to
Figure 6: Fallback codec

1.3 Codec argument
Some codecs take arguments, given after the hash symbol.

    UTF-8 : ASCII , ANY#3F
Figure 7: Passing argument to codec

Some codecs take arguments in key-value form. Argument names and values consist of digits, letters, hyphens, and underscores; binary data are represented in hexadecimal form.

    UTF-8 : ASCII , ESCAPE#PREFIX=2575
Figure 8: Passing argument to codec in key-value form

Multiple arguments can be passed by joining them with ampersands.

    UTF-8 : ASCII , ESCAPE#PREFIX=262378&SUFFIX=3B
Figure 9: Passing multiple arguments to codec

A list of data can be passed in dot-separated form.

    ANY#013F.0121 : ASCII
Figure 10: Data list
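To make the key-value argument form of section 1.3 concrete, the following standalone sketch looks up one key in a codec string and decodes its hexadecimal value into bytes. It does not use libbsdconv; `codec_arg_hex` is a hypothetical helper, and it assumes the value is hexadecimal data, which does not hold for every argument (for example, `ESCAPE#FOR=BYTE` carries a name, not hex).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: find `key` in a codec string such as
 * "ESCAPE#PREFIX=2575" and decode its hexadecimal value into out[0..cap).
 * Returns the number of decoded bytes, or -1 if the key is missing, the
 * value is not well-formed hex, or out is too small. */
int codec_arg_hex(const char *codec, const char *key,
                  unsigned char *out, size_t cap)
{
    assert(codec != NULL && key != NULL && out != NULL);
    const char *p = strchr(codec, '#');
    size_t klen = strlen(key);
    while (p != NULL) {
        p++; /* step past '#' or '&' to the start of an argument */
        if (strncmp(p, key, klen) == 0 && p[klen] == '=') {
            const char *v = p + klen + 1;
            size_t n = 0;
            while (*v != '\0' && *v != '&') {
                int hi = hex_value((unsigned char)v[0]);
                /* only look at the second digit if the first is valid */
                int lo = (hi < 0) ? -1 : hex_value((unsigned char)v[1]);
                if (hi < 0 || lo < 0 || n >= cap)
                    return -1;
                out[n++] = (unsigned char)(hi * 16 + lo);
                v += 2;
            }
            return (int)n;
        }
        p = strchr(p, '&'); /* move on to the next argument, if any */
    }
    return -1;
}
```

With the strings from Figures 8 and 9, `PREFIX=2575` decodes to the two bytes `%u`, and `PREFIX=262378&SUFFIX=3B` decodes to `&#x` and `;`, which matches the familiar `%uXXXX` and `&#xXXXX;` escape notations.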
2 Type & Flag
2.1 Type
A code point packet records its type in its first byte.

    ID  Description                 Provider (from)   Consumer (to)
    00  Bsdconv special characters  BSDCONV-KEYWORD   BSDCONV-KEYWORD
    01  Unicode                     Most decoders     Most encoders
    02  CNS11643 [1]                CNS11643          CNS11643
    03  Byte                        BYTE; ESCAPE      BYTE; ESCAPE#FOR=BYTE
    04  Chinese components          inter/ZH-DECOMP   inter/ZH-COMP
    1B  ANSI control sequence       ANSI-CONTROL      -
Table 1: Types and their providers/consumers (just to name a few)

    Entity  Unicode  UTF-8 Hex
    %       U+0025   25
    A       U+0041   41
    ∀       U+2200   E28880

    A∀                                     Input (UTF-8 literal)
    ASCII,BYTE : ...                       Decoder
    01 41 | 03 E2 | 03 88 | 03 80          Internal data
    ... : ASCII,ESCAPE                     Encoder
    41 "A" | 25 45 32 "%E2" | 25 38 38 "%88" | 25 38 30 "%80"
                                           Internal data
    A%E2%88%80                             Output (UTF-8 literal)
Figure 11: Fallback & Type

[1] For the intersection of CNS11643 and Unicode, from/CNS11643 converts to the Unicode type if possible. Conversely, to/CNS11643 converts from the Unicode type if possible.
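The flow of Figure 11 can be traced with a standalone sketch. It does not use libbsdconv: the `struct packet` below is a simplified, hypothetical stand-in that holds one type byte and one payload byte, and the two functions only mimic what the figure shows for `ASCII,BYTE` as a decoder and `ASCII,ESCAPE` as an encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Type IDs taken from Table 1. */
enum { TYPE_UNICODE = 0x01, TYPE_BYTE = 0x03 };

/* Simplified packet (an assumption of this sketch): a type byte followed
 * by a single payload byte. */
struct packet {
    unsigned char type;
    unsigned char value;
};

/* "ASCII,BYTE" decoder sketch: ASCII consumes bytes below 0x80 and emits a
 * Unicode packet; otherwise BYTE, the fallback, emits a byte packet.
 * out must have room for len packets. Returns the number of packets. */
size_t decode_ascii_byte(const unsigned char *in, size_t len,
                         struct packet *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i].type = (in[i] < 0x80) ? TYPE_UNICODE : TYPE_BYTE;
        out[i].value = in[i];
    }
    return len;
}

/* "ASCII,ESCAPE" encoder sketch: ASCII writes Unicode packets in the ASCII
 * range unchanged; otherwise ESCAPE, the fallback, writes byte packets as
 * %XX. Returns the output length without the terminating null byte, or -1
 * if a packet cannot be encoded or the buffer is too small. */
int encode_ascii_escape(const struct packet *in, size_t n,
                        char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t used = 0;
    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (in[i].type == TYPE_UNICODE && in[i].value < 0x80) {
            if (used + 2 > cap) /* one character plus the null byte */
                return -1;
            out[used++] = (char)in[i].value;
        } else if (in[i].type == TYPE_BYTE) {
            if (used + 4 > cap) /* "%XX" plus the null byte */
                return -1;
            used += (size_t)snprintf(out + used, cap - used, "%%%02X",
                                     (unsigned)in[i].value);
        } else {
            return -1;
        }
    }
    out[used] = '\0';
    return (int)used;
}
```

Feeding the bytes 41 E2 88 80 through both functions gives the packets and the output string shown in Figure 11.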
2.2 Flag
A code point packet carries its own flags. Currently there are two flags, FREE and MARK. Flag FREE indicates that the packet buffer needs to be recycled or released; it matters only when programming is involved. Flag MARK is (currently only) added by codec to/PASS#MARK and used by codec from/PASS#UNMARK to identify which packets have already been decoded and need to be passed through in the from phase. The code point packet structure, including flags, is retained within cascaded conversions, but not across a shell pipe. Figure 12 demonstrates the flow of the conversion "ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8".

    Entity  Unicode  UTF-8 Hex
    α       U+03B1   CEB1
    β       U+03B2   CEB2

    %u03B1%CE%B2                     Input (UTF-8 literal)
    ESCAPE : ...                     Decoder
    01 03 B1 | 03 CE | 03 B2         Internal data
    ... : PASS#MARK&FOR=1,BYTE       Encoder
    01 03 B1 (MARK) | CE | B2        Internal data
    PASS#UNMARK,UTF-8 : ...          Decoder
    01 03 B1 | 01 03 B2              Internal data
    ... : UTF-8                      Encoder
    CE B1 "α" | CE B2 "β"            Internal data
    αβ                               Output (UTF-8 literal)
Figure 12: Flag, from/PASS & to/PASS
2.3 Helper codecs
Codec from/bsdconv can be used to input the internal data structure, and codec to/BSDCONV-OUTPUT can be used to inspect types and flags.

3 C Programming guide
3.1 Conversion instance lifecycle

[Flowchart: bsdconv_create() -> bsdconv_init() -> set input/output parameters -> (if last chunk: set flush flag) -> bsdconv() -> collect output -> (if there is a next chunk: continue with the next chunk) -> bsdconv_destroy(), or reuse the instance]
Figure 13: Conversion instance lifecycle
3.2 Skeleton

    #include <bsdconv.h>

    struct bsdconv_instance *ins;
    char *buf;
    size_t len;

    ins = bsdconv_create("UTF-8:UPSIDEDOWN:UTF-8");
    bsdconv_init(ins);
    do {
        buf = bsdconv_malloc(BUFSIZ);
        /*
         * fill data into buf
         * len = filled data length
         */
        ins->input.data = buf;
        ins->input.len = len;
        ins->input.flags |= F_FREE;
        ins->input.next = NULL;
        if (ins->input.len == 0) {
            /* last chunk */
            ins->flush = 1;
        }
        /*
         * set output parameter (see section 3.3)
         */
        bsdconv(ins);
        /*
         * collect output (see section 3.3)
         */
    } while (ins->flush == 0);
    bsdconv_destroy(ins);

For chunked conversion, a fresh input buffer should be allocated for each input chunk so that its content cannot change during conversion. An output buffer with flag FREE is safe to reuse.

3.3 Output mode

    ins->output_mode      Description
    BSDCONV_HOLD          Hold output in memory
    BSDCONV_AUTOMALLOC    Return an output buffer, which should be free()d after use
    BSDCONV_PREMALLOCED   Fill output into the given buffer
    BSDCONV_FILE          Write output to a (FILE *) stream
    BSDCONV_FD            Write output to an (int) file descriptor
    BSDCONV_NULL          Discard output
    BSDCONV_PASS          Pass output to another conversion instance
3.3.1 BSDCONV_HOLD
This is the default output mode after bsdconv_init(). It is usually used together with BSDCONV_AUTOMALLOC or BSDCONV_PREMALLOCED to get squeezed output.

3.3.2 BSDCONV_AUTOMALLOC
The output buffer is allocated dynamically. The actual buffer size is ins->output.len plus the output content length, which is useful when you need room for a terminating null byte.

3.3.3 BSDCONV_PREMALLOCED
If ins->output.data is NULL, the total length of the content to be output is stored in ins->output.len, but the output is still held in memory. Otherwise, bsdconv() fills as much unfragmented data as possible within the buffer size limit specified in ins->output.len.

3.3.4 BSDCONV_FILE
Output is written with fwrite() to the FILE * given in ins->output.data.

3.3.5 BSDCONV_FD
Output is written with write() to the (int) file descriptor given in ins->output.data. Casting through intptr_t (defined in <stdint.h>) is needed to avoid a compiler warning.

3.3.6 BSDCONV_NULL
Output is discarded. This is usually used when evaluating a conversion (see section 3.4).

3.3.7 BSDCONV_PASS
Output packets are passed to the (struct bsdconv_instance *) conversion instance given in ins->output.data.

3.4 Counters
Counters are kept in ins->counter as a linked list with the following structure.

    struct bsdconv_counter_entry {
        char *key;
        bsdconv_counter_t val;
        struct bsdconv_counter_entry *next;
    };

IERR and OERR are mandatory error counters.
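As an illustration of how such a linked list can be read, here is a standalone sketch that walks a list of the same shape. It deliberately does not include <bsdconv.h>: `demo_counter_t`, `struct demo_counter_entry`, `find_counter`, and `reset_counters` are hypothetical names that mirror the structure above, the value type is assumed to be size_t, and the case-sensitive key comparison is an assumption rather than documented library behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local mirror of the counter value type (assumed to be size_t). */
typedef size_t demo_counter_t;

/* Local mirror of the counter list entry shown in section 3.4. */
struct demo_counter_entry {
    char *key;
    demo_counter_t val;
    struct demo_counter_entry *next;
};

/* Hypothetical helper: return a pointer to the value of the counter named
 * `key`, or NULL if the list has no such counter. */
demo_counter_t *find_counter(struct demo_counter_entry *head, const char *key)
{
    assert(key != NULL);
    for (struct demo_counter_entry *e = head; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return &e->val;
    return NULL;
}

/* Hypothetical helper: zero the counter named `key`, or every counter in
 * the list when key is NULL. */
void reset_counters(struct demo_counter_entry *head, const char *key)
{
    for (struct demo_counter_entry *e = head; e != NULL; e = e->next)
        if (key == NULL || strcmp(e->key, key) == 0)
            e->val = 0;
}
```

In a real program the library's own counter functions should be preferred; the sketch only shows that a counter lookup amounts to a linear walk over the list comparing keys.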
There are two APIs to get/reset counter(s):

    bsdconv_counter_t *bsdconv_counter(char *name);

Returns a pointer to the counter value. bsdconv_counter_t is currently defined as size_t.

    void bsdconv_counter_reset(char *name);

Resets the specified counter. If name is NULL, all counters are reset.

3.5 Memory pool issue
In case libbsdconv and your program use different memory pools, bsdconv_malloc() and bsdconv_free() should be used in place of malloc() and free().
