BSDCONV
Kuan-Chung Chiu
(buganini at gmail dot com)

Contents

1 Syntax
  1.1 Phases & Cascade
  1.2 Codecs & Fallback
  1.3 Codec argument
2 Type & Flag
  2.1 Type
  2.2 Flag
  2.3 Helper codecs
3 C Programming guide
  3.1 Conversion instance lifecycle
  3.2 Skeleton
  3.3 Output mode
    3.3.1 BSDCONV_HOLD
    3.3.2 BSDCONV_AUTOMALLOC
    3.3.3 BSDCONV_PREMALLOCED
    3.3.4 BSDCONV_FILE
    3.3.5 BSDCONV_FD
    3.3.6 BSDCONV_NULL
  3.4 Counters
  3.5 Memory pool issue

1 Syntax

1.1 Phases & Cascade

There are three types of conversion phases defined in bsdconv: from, inter, and to. The from phase takes a byte sequence and decodes it into a list of code points (except for from/PASS). The to phase, conversely, encodes the list of code points back into a byte sequence. The inter phase maps code points to code points.
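To make the division of labour between the three phases concrete, here is a minimal stand-alone sketch in C. It does not use libbsdconv. The function names are hypothetical, and the inter step only upper-cases ASCII letters, which is a simplifying assumption, not a description of any real bsdconv codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "from" phase: bytes -> code points.
 * ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF. */
size_t from_latin1(const unsigned char *in, size_t len, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    for (size_t i = 0; i < len; i++)
        cp[i] = in[i];
    return len;
}

/* "inter" phase: code point -> code point.
 * Assumption for this sketch: only ASCII letters are upper-cased. */
void inter_upper_ascii(uint32_t *cp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp[i] >= 'a' && cp[i] <= 'z')
            cp[i] -= 'a' - 'A';
}

/* "to" phase: code points -> bytes.
 * Encodes code points below U+0800 as UTF-8 (1 or 2 bytes each).
 * out must have room for 2 * n bytes. Returns the bytes written. */
size_t to_utf8_short(const uint32_t *cp, size_t n, unsigned char *out)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        assert(cp[i] < 0x800);
        if (cp[i] < 0x80) {
            out[used++] = (unsigned char)cp[i];
        } else {
            out[used++] = (unsigned char)(0xC0 | (cp[i] >> 6));
            out[used++] = (unsigned char)(0x80 | (cp[i] & 0x3F));
        }
    }
    return used;
}
```

Chaining the three functions in order corresponds to a from, inter, to conversion: bytes come in, code points are transformed, bytes go out.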
A basic conversion consists of from and to phases. Codec name lookup is case insensitive.

    ISO-8859-1:UTF-8    (from:to)

Figure 1: Basic two phases conversion

Between the from and to phases, we can have an inter phase.

    UTF-8:UPPER:UTF-8    (from:inter:to)

Figure 2: Conversion with inter-mapping phase

There can be more than one inter phase.

    UTF-8:UPPER:FULL:UTF-8    (from:inter:inter:to)

Figure 3: Conversion with multiple inter-mapping phases

An inter phase can also be used on its own, mostly programmatically.

    HALF    (inter)

Figure 4: Standalone inter-mapping phase

Conversions can be cascaded with the pipe symbol. In most cases this is equivalent to a shell pipe, except when codecs that manipulate flags are used (described in section 2.2).

    UTF-8:BIG5|BIG5:UTF-8    (from:to | from:to)

Figure 5: Cascaded conversions

ASCII-compatible codecs are designed to exclude the ASCII part and are named _FOO, with the alias FOO ⇒ _FOO,ASCII or ASCII,_FOO.
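The structure of a conversion string (cascades separated by a pipe, phases separated by colons) can be illustrated with a small stand-alone splitter. This is only a sketch of the syntax as described above and not how libbsdconv parses it. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split s in place at each delim, storing at most max pieces in out.
 * If there are more pieces than max, the last stored piece keeps the
 * unsplit remainder. Returns the number of pieces stored. */
size_t split_in_place(char *s, char delim, char **out, size_t max)
{
    size_t n = 0;
    assert(s != NULL && out != NULL && delim != '\0');
    while (n < max) {
        out[n++] = s;
        s = strchr(s, delim);
        if (s == NULL || n == max)
            break;
        *s++ = '\0';
    }
    return n;
}

/* Role of the phase at position index among count phases:
 * a lone phase is a standalone inter phase; otherwise the first is
 * from, the last is to, and everything in between is inter. */
const char *phase_role(size_t index, size_t count)
{
    if (count == 1)
        return "inter";
    if (index == 0)
        return "from";
    if (index == count - 1)
        return "to";
    return "inter";
}
```

Splitting first on '|' and then each piece on ':' yields the conversions of a cascade and the phases of each conversion, in order.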
1.2 Codecs & Fallback

A phase consists of one or more codecs, separated by commas. A later codec is used if and only if the earlier codecs fail to consume the incoming data. Once a codec finishes its task, the first codec is tried again for the next data.

    UTF-8:ASCII,3F    (from:to)

Figure 6: Fallback codec

1.3 Codec argument

Some codecs take arguments, given after the hash symbol.

    UTF-8:ASCII,ANY#3F

Figure 7: Passing argument to codec

Some codecs take arguments in key-value form. Argument names and values consist of digits, letters, hyphens and underscores. Binary data is represented in hexadecimal form.

    UTF-8:ASCII,ESCAPE#PREFIX=2575

Figure 8: Passing argument to codec in key-value form

Multiple arguments can be passed by joining them with an ampersand.

    UTF-8:ASCII,ESCAPE#PREFIX=262378&SUFFIX=3B

Figure 9: Passing multiple arguments to codec

A list of data can be passed in dot-separated form.

    ANY#013F.0121:ASCII

Figure 10: Data list
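Since binary argument values are written in hexadecimal, it can help to see what the values in Figures 8 and 9 stand for. The following stand-alone sketch decodes such a value into bytes. hex_decode is a hypothetical helper written for this illustration and is not part of libbsdconv.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hexadecimal string into out (capacity cap bytes).
 * Returns the number of bytes decoded, or -1 on an odd number of
 * digits, a non-hex character, or insufficient capacity. */
long hex_decode(const char *hex, unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(hex != NULL && out != NULL);
    while (hex[0] != '\0') {
        int hi = hex_value(hex[0]);
        int lo = (hex[1] != '\0') ? hex_value(hex[1]) : -1;
        if (hi < 0 || lo < 0 || n == cap)
            return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        hex += 2;
    }
    return (long)n;
}
```

Decoded this way, PREFIX=2575 is the two ASCII characters "%u", PREFIX=262378 is "&#x", and SUFFIX=3B is ";".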
2 Type & Flag

2.1 Type

A code point packet records its type in its first byte.

    ID  Description                 Provider (from)    Consumer (to)
    00  Bsdconv special characters  BSDCONV_KEYWORD    BSDCONV_KEYWORD
    01  Unicode                     Most decoders      Most encoders
    02  CNS11643 [1]                CNS11643           CNS11643
    03  Byte                        BYTE; ESCAPE       BYTE; ESCAPE#FOR=BYTE
    04  Chinese components          inter/ZH_DECOMP    inter/ZH_COMP
    1B  ANSI control sequence       ANSI-CONTROL       -

Table 1: Types and their providers/consumers (just to name a few)

[1] For the intersection of CNS11643 and Unicode, from/CNS11643 converts to the Unicode type if possible. Conversely, to/CNS11643 converts from the Unicode type if possible.

    Entity  Unicode  UTF-8 Hex
    %       U+0025   25
    A       U+0041   41
    ∀       U+2200   E28880

    Input (UTF-8 literal):    A∀
    Decoder ASCII,BYTE:...
    Internal data:            01 41 | 03 E2 | 03 88 | 03 80
    Encoder ...:ASCII,ESCAPE
    Internal data:            "A" 41 | "%E2" 25 45 32 | "%88" 25 38 38 | "%80" 25 38 30
    Output (UTF-8 literal):   A%E2%88%80

Figure 11: Fallback & Type
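The flow of Figure 11 can be imitated with a stand-alone sketch. It does not use libbsdconv. The packet layout is a simplifying assumption (one type byte plus a single value byte), and all names are hypothetical. The type IDs 01 and 03 are the ones listed in Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { DEMO_TYPE_UNICODE = 0x01, DEMO_TYPE_BYTE = 0x03 };

/* Simplified packet: real packets are variable-length byte strings. */
struct typed_packet {
    unsigned char type;
    unsigned char value;
};

/* Decoder in the spirit of "ASCII,BYTE": ASCII consumes bytes below
 * 0x80 as Unicode packets; everything else falls back to BYTE. */
size_t decode_ascii_then_byte(const unsigned char *in, size_t len,
                              struct typed_packet *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i].type = (in[i] < 0x80) ? DEMO_TYPE_UNICODE : DEMO_TYPE_BYTE;
        out[i].value = in[i];
    }
    return len;
}

/* Encoder in the spirit of "ASCII,ESCAPE": Unicode packets are written
 * as-is (this sketch only ever produces ASCII ones); byte packets fall
 * back to a percent escape. Returns the string length written, or 0 if
 * cap is too small. */
size_t encode_ascii_then_escape(const struct typed_packet *in, size_t n,
                                char *out, size_t cap)
{
    size_t used = 0;
    if (cap == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].type == DEMO_TYPE_UNICODE) {
            if (used + 1 >= cap)
                return 0;
            out[used++] = (char)in[i].value;
        } else {
            if (used + 3 >= cap)
                return 0;
            snprintf(out + used, cap - used, "%%%02X",
                     (unsigned)in[i].value);
            used += 3;
        }
    }
    out[used] = '\0';
    return used;
}
```

Feeding the UTF-8 bytes 41 E2 88 80 through the decoder and then the encoder reproduces the packet types and the output string shown in Figure 11.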
2.2 Flag

A code point packet carries its own flags. Currently there are two flags, FREE and MARK. Flag FREE indicates that the packet buffer needs to be recycled or released. It is relevant only when programming is involved. Flag MARK is (currently only) added by codec to/PASS#MARK and used by codec from/PASS#UNMARK to identify which packets have already been decoded and need to be passed through in the from phase. The code point packet structure, including flags, is retained within cascaded conversions but not across a shell pipe. Figure 12 demonstrates the flow of the conversion "ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8".

    Entity  Unicode  UTF-8 Hex
    α       U+03B1   CEB1
    β       U+03B2   CEB2

    Input (UTF-8 literal):    %u03B1%CE%B2
    Decoder ESCAPE:...
    Internal data:            01 03 B1 | 03 CE | 03 B2
    Encoder ...:PASS#MARK&FOR=1,BYTE
    Internal data:            01 03 B1 (MARK) | CE | B2
    Decoder PASS#UNMARK,UTF-8:...
    Internal data:            01 03 B1 | 01 03 B2
    Encoder ...:UTF-8
    Internal data:            "α" CE B1 | "β" CE B2
    Output (UTF-8 literal):   αβ

Figure 12: Flag, from/PASS & to/PASS
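The mark-then-unmark handshake of Figure 12 can be sketched as follows. This is a stand-alone illustration and not libbsdconv code. The struct, the function names and the flag bit values are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define DEMO_F_FREE 0x01u
#define DEMO_F_MARK 0x02u

struct flagged_packet {
    unsigned char type;   /* type ID as in Table 1 */
    unsigned char value;  /* simplified single-byte payload */
    unsigned int flags;
};

/* In the spirit of to/PASS#MARK&FOR=<type>: set MARK on every packet
 * of the given type. Returns how many packets were marked. */
size_t demo_mark_type(struct flagged_packet *p, size_t n, unsigned char type)
{
    size_t marked = 0;
    assert(p != NULL);
    for (size_t i = 0; i < n; i++) {
        if (p[i].type == type) {
            p[i].flags |= DEMO_F_MARK;
            marked++;
        }
    }
    return marked;
}

/* In the spirit of from/PASS#UNMARK: a marked packet is passed through
 * with MARK cleared (returns 1); an unmarked packet is left for the
 * fallback codec to decode (returns 0). Other flags are untouched. */
int demo_pass_unmark(struct flagged_packet *p)
{
    assert(p != NULL);
    if ((p->flags & DEMO_F_MARK) == 0)
        return 0;
    p->flags &= ~DEMO_F_MARK;
    return 1;
}
```

Because the flag travels with the packet, this only works inside a cascaded conversion. A shell pipe would carry plain bytes and lose the mark.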
2.3 Helper codecs

Codec from/bsdconv can be used to input internal data structure, and codec to/bsdconv_stdout can be used to inspect type and flags.

3 C Programming guide

3.1 Conversion instance lifecycle

    1. bsdconv_create()
    2. bsdconv_init()
    3. Set input/output parameters. If this is the last chunk, set the flush flag.
    4. bsdconv()
    5. Collect output.
    6. If there is a next chunk, go back to step 3 with the next chunk.
    7. To reuse the instance, go back to step 2. Otherwise, bsdconv_destroy().

Figure 13: Conversion instance lifecycle
3.2 Skeleton

    #include <bsdconv.h>

    bsdconv_instance *ins;
    char *buf;
    size_t len;

    ins = bsdconv_create("UTF-8:UPSIDEDOWN:UTF-8");
    bsdconv_init(ins);
    do {
        buf = bsdconv_malloc(BUFSIZ);
        /*
         * fill data into buf
         * len = filled data length
         */
        ins->input.data = buf;
        ins->input.len = len;
        ins->input.flags |= F_FREE;
        ins->input.next = NULL;
        if (ins->input.len == 0) { // last chunk
            ins->flush = 1;
        }
        /*
         * set output parameter (see section 3.3)
         */
        bsdconv(ins);
        /*
         * collect output (see section 3.3)
         */
    } while (ins->flush == 0);
    bsdconv_destroy(ins);

For chunked conversion, a new input buffer should be allocated for each chunk so that its content cannot change during conversion. An output buffer with flag FREE is safe to reuse.

3.3 Output mode

    ins->output_mode      Description
    BSDCONV_HOLD          Hold output in memory
    BSDCONV_AUTOMALLOC    Return output buffer, which should be free()d after use
    BSDCONV_PREMALLOCED   Fill output into given buffer
    BSDCONV_FILE          Write output into (FILE *) stream
    BSDCONV_FD            Write output into (int) file descriptor
    BSDCONV_NULL          Discard output
3.3.1 BSDCONV_HOLD

This is the default output mode after bsdconv_init(). It is usually used together with BSDCONV_AUTOMALLOC or BSDCONV_PREMALLOCED to get squeezed output.

3.3.2 BSDCONV_AUTOMALLOC

The output buffer is allocated dynamically. The actual buffer size is ins->output.len plus the output content length, which is useful when you need room for a terminating null byte.

3.3.3 BSDCONV_PREMALLOCED

If ins->output.data is NULL, the total length of the content to be output is put into ins->output.len, but the output is still held in memory. Otherwise, bsdconv() fills as much unfragmented data as possible within the buffer size limit specified in ins->output.len.

3.3.4 BSDCONV_FILE

Output is written with fwrite() to the FILE * given in ins->output.data.

3.3.5 BSDCONV_FD

Output is written with write() to the (int) file descriptor given in ins->output.data. Casting to intptr_t (defined in <stdint.h>) is needed to eliminate a compiler warning.

3.3.6 BSDCONV_NULL

Output is discarded. This is usually used when evaluating a conversion (see section 3.4).

3.4 Counters

Counters are kept in ins->counter as a linked list with the following structure.

    struct bsdconv_counter_entry {
        char *key;
        bsdconv_counter_t val;
        struct bsdconv_counter_entry *next;
    };

IERR and OERR are mandatory error counters.
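To show what working with such a linked list involves, here is a stand-alone sketch of lookup and reset over a list of the same shape. The demo_ names are hypothetical stand-ins and do not come from bsdconv.h. The exact, case-sensitive key comparison is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t demo_counter_t;

/* Same shape as the counter entry structure shown above. */
struct demo_counter_entry {
    char *key;
    demo_counter_t val;
    struct demo_counter_entry *next;
};

/* Return a pointer to the value of the counter called name,
 * or NULL if the list has no such entry. */
demo_counter_t *demo_counter_find(struct demo_counter_entry *head,
                                  const char *name)
{
    assert(name != NULL);
    for (; head != NULL; head = head->next)
        if (strcmp(head->key, name) == 0)
            return &head->val;
    return NULL;
}

/* Reset the counter called name to zero; a NULL name resets all. */
void demo_counter_reset(struct demo_counter_entry *head, const char *name)
{
    for (; head != NULL; head = head->next)
        if (name == NULL || strcmp(head->key, name) == 0)
            head->val = 0;
}
```

Returning a pointer to the value rather than a copy lets the caller both read and update a counter after a single lookup.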
There are two APIs to get and reset counters:

    bsdconv_counter_t *bsdconv_counter(char *name);

Returns a pointer to the counter value. bsdconv_counter_t is currently defined as size_t.

    void bsdconv_counter_reset(char *name);

Resets the specified counter. If name is NULL, all counters are reset.

3.5 Memory pool issue

If libbsdconv and your program use different memory pools, bsdconv_malloc() and bsdconv_free() should be used in place of malloc() and free().
