geniconvtbl(4) File Formats geniconvtbl(4)
NAME
geniconvtbl - geniconvtbl input file format
DESCRIPTION
An input file to geniconvtbl is an ASCII text file that contains an
iconv code conversion definition from one codeset to another codeset.
The geniconvtbl utility accepts the code conversion definition file(s)
and writes code conversion binary table file(s) that can be used in
iconv(1) and iconv(3C) to support user-defined code conversions. See
iconv(1) and iconv(3C) for more detail on the iconv code conversion and
geniconvtbl(1) for more detail on the utility.
The Lexical Conventions
The following lexical conventions are used in the iconv code conversion
definition:
CONVERSION_NAME A string of characters representing the name of
the iconv code conversion. The name should start with one or more
printable ASCII characters, followed by a percent character, '%',
followed by one or more further printable ASCII characters.
Examples: ISO8859-1%ASCII, 646%eucJP, CP_939%ASCII.
NAME A string of characters that starts with an ASCII alphabetic
character or the underscore character, '_', followed by one or
more ASCII alphanumeric or underscore characters.
Examples: _a1, ABC_codeset, K1.
HEXADECIMAL A hexadecimal number. The hexadecimal represen‐
tation consists of an escape character, '0'
followed by the constant 'x' or 'X' and one or
more hexadecimal digits. Examples: 0x0, 0x1,
0x1a, 0X1A, 0x1B3.
DECIMAL A decimal number, represented by one or more
decimal digits. Examples: 0, 123, 2165.
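The four token classes above can be approximated with regular
expressions. The following Python sketch is only a model of the prose
definitions, not the lexer used by geniconvtbl. The pattern names and
the classify helper are hypothetical, and "printable ASCII" is assumed
here to exclude the space character.

```python
import re

# Hypothetical patterns modelling the lexical conventions described above.
# Assumption: "printable ASCII" means the graphic characters 0x21-0x7e.
CONVERSION_NAME = re.compile(r"[\x21-\x7e]+%[\x21-\x7e]+")
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]+")
HEXADECIMAL = re.compile(r"0[xX][0-9A-Fa-f]+")
DECIMAL = re.compile(r"[0-9]+")


def classify(token: str) -> str:
    """Return the first token class whose pattern matches the whole token."""
    for label, pattern in (
        ("HEXADECIMAL", HEXADECIMAL),
        ("DECIMAL", DECIMAL),
        ("NAME", NAME),
        ("CONVERSION_NAME", CONVERSION_NAME),
    ):
        if pattern.fullmatch(token):
            return label
    return "UNKNOWN"
```

The order of the checks matters only in that the narrower classes are
tried before CONVERSION_NAME, whose pattern is the most permissive.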
Each comment starts with '//' and ends at the end of the line.
The following keywords are reserved:
automatic between binary
break condition default
dense direction discard
else error escapeseq
false if index
init input inputsize
map maptype no_change_copy
operation output output_byte_length
outputsize printchr printhd
printint reset return
true
Additionally, the following symbols are also reserved as tokens:
{ } [ ] ( ) ; , ...
The precedence and associativity
The following table shows the precedence and associativity of the
operators allowed in the iconv code conversion definition, from lowest
precedence at the top to highest precedence at the bottom:
Operator (Symbol) Associativity
--------------------------------------------------
Assignment (=) Right
--------------------------------------------------
Logical OR (||) Left
--------------------------------------------------
Logical AND (&&) Left
--------------------------------------------------
Bitwise OR (|) Left
--------------------------------------------------
Exclusive OR (^) Left
--------------------------------------------------
Bitwise AND (&) Left
--------------------------------------------------
Equal-to (==), Left
Inequality (!=)
--------------------------------------------------
Less-than (<), Left
Less-than-or-equal-to (<=),
Greater-than (>),
Greater-than-or-equal-to (>=)
--------------------------------------------------
Left-shift (<<), Left
Right-shift (>>)
--------------------------------------------------
Addition (+), Left
Subtraction (-)
--------------------------------------------------
Multiplication (*), Left
Division (/),
Remainder (%)
---------------------------------------------------
Logical negation (!), Right
Bitwise complement (~),
Unary minus (-)
---------------------------------------------------
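One consequence of the table is that the bitwise operators bind more
loosely than the comparison operators, as in C. The Python sketch below
spells out the grouping with explicit parentheses, because Python
itself ranks these operators differently. The variable names are
illustrative only.

```python
first_byte = 0xa0

# Per the table, '==' has higher precedence than '&', so the definition
# expression  first_byte & 0x80 == 0x80  groups as
# first_byte & (0x80 == 0x80). A true comparison yields the integer 1,
# so this form tests only the lowest bit of first_byte.
as_grouped_by_table = first_byte & int(0x80 == 0x80)

# Parentheses are needed to test the high bit instead.
with_parentheses = int((first_byte & 0x80) == 0x80)
```

For 0xa0 the two groupings disagree, which is why parenthesizing mixed
bitwise and comparison expressions in a definition file is the safer
habit.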
The Syntax
Each iconv code conversion definition starts with CONVERSION_NAME fol‐
lowed by one or more semi-colon separated code conversion definition
elements:
// a US-ASCII to ISO8859-1 iconv code conversion example:
US-ASCII%ISO8859-1 {
// one or more code conversion definition elements here.
:
:
}
Each code conversion definition element can be any one of the following
elements:
direction
condition
operation
map
To have a meaningful code conversion, there should be at least one
direction, operation, or map element in the iconv code conversion defi‐
nition.
The direction element contains one or more semi-colon separated condi‐
tion-action pairs that direct the code conversion:
direction For_US-ASCII_2_ISO8859-1 {
// one or more condition-action pairs here.
:
:
}
Each condition-action pair contains a conditional code conversion that
consists of a condition element and an action element.
condition action
If the condition is met, the corresponding action is executed. If no
condition is met, iconv(3C) will return (size_t)-1 with errno set to
EILSEQ. The condition can be a condition element, the name of a
pre-defined condition element, or the condition literal value, true.
The 'true' condition literal value always yields success and thus the
corresponding action is always executed. The action can likewise be an
action element or the name of a pre-defined action element.
The condition element specifies one or more condition expression
elements. Since each condition element can have a name and can also
exist stand-alone, a pre-defined condition element can be referenced by
its name in any later condition-action pair. To be used in that way,
the corresponding condition element should be defined beforehand:
condition For_US-ASCII_2_ISO8859-1 {
// one or more condition expression elements here.
:
:
}
The name of the condition element in the above example is
For_US-ASCII_2_ISO8859-1. Each condition element can have one or more
condition expression elements. If there is more than one condition
expression element, the elements are checked from top to bottom to see
whether any of them yields true. Any one of the following can be a
condition expression element:
between
escapeseq
expression
The between condition expression element defines one or more comma-sep‐
arated ranges:
between 0x0...0x1f, 0x7f...0x9f ;
between 0xa1a1...0xfefe ;
In the first expression in the example above, the covered ranges are
0x0 to 0x1f and 0x7f to 0x9f, inclusive. In the second expression, the
covered range is the set of sequences whose first byte is between 0xa1
and 0xfe and whose second byte is also between 0xa1 and 0xfe. That is,
the range is applied to each byte separately. The sequence 0xa280, for
example, does not fall within the range because its second byte, 0x80,
is less than 0xa1.
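The per-byte rule can be contrasted with an ordinary numeric
comparison. The following Python sketch is a model of the behaviour
described above, not the geniconvtbl implementation. The in_between
helper is hypothetical, and it assumes that the sequence and both range
ends have the same number of bytes.

```python
def in_between(seq: bytes, low: bytes, high: bytes) -> bool:
    """Model of 'between low...high': every byte must lie in its own range."""
    if not (len(seq) == len(low) == len(high)):
        return False
    return all(lo <= b <= hi for lo, b, hi in zip(low, seq, high))


low, high = bytes.fromhex("a1a1"), bytes.fromhex("fefe")

# 0xa2a1: both bytes lie in 0xa1...0xfe, so the condition is met.
accepted = in_between(bytes.fromhex("a2a1"), low, high)

# 0xa280: the second byte 0x80 is below 0xa1, so the condition is not met,
# even though 0xa280 lies between 0xa1a1 and 0xfefe as a plain integer.
rejected = in_between(bytes.fromhex("a280"), low, high)
numerically_inside = 0xa1a1 <= 0xa280 <= 0xfefe
```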
The escapeseq condition expression element defines an equal-to condi‐
tion for one or more comma-separated escape sequence designators:
// ESC $ ) C sequence:
escapeseq 0x1b242943;
// ESC $ ) C sequence or ShiftOut (SO) control character code, 0x0e:
escapeseq 0x1b242943, 0x0e;
The expression can be any one of the following and can be surrounded by
a pair of parentheses, '(' and ')':
// HEXADECIMAL:
0xa1a1
// DECIMAL
12
// A boolean value, true:
true
// A boolean value, false:
false
// Addition expression:
1 + 2
// Subtraction expression:
10 - 3
// Multiplication expression:
0x20 * 10
// Division expression:
20 / 10
// Remainder expression:
17 % 3
// Left-shift expression:
1 << 4
// Right-shift expression:
0xa1 >> 2
// Bitwise OR expression:
0x2121 | 0x8080
// Exclusive OR expression:
0xa1a1 ^ 0x8080
// Bitwise AND expression:
0xa1 & 0x80
// Equal-to expression:
0x10 == 16
// Inequality expression:
0x10 != 10
// Less-than expression:
0x20 < 25
// Less-than-or-equal-to expression:
10 <= 0x10
// Greater-than expression:
0x10 > 12
// Greater-than-or-equal-to expression:
0x10 >= 0xa
// Logical OR expression:
0x10 || false
// Logical AND expression:
0x10 && false
// Logical negation expression:
! false
// Bitwise complement expression:
~0
// Unary minus expression:
-123
There is a single type available in this expression: integer. The bool‐
ean values are two special cases of integer values. The 'true' boolean
value's integer value is 1 and the 'false' boolean value's integer
value is 0. Also, any integer value other than 0 is a true boolean
value. Consequently, the integer value 0 is the false boolean value.
Any boolean expression yields integer value 1 for true and integer
value 0 for false as the result.
Any literal value shown as an operand in the expression examples above,
that is, DECIMAL, HEXADECIMAL, and boolean values, can be replaced with
another expression. There are a few other special operands that you can
use as well in the expressions: 'input', 'inputsize', 'outputsize', and
variables. input is a keyword pointing to the current input buffer.
inputsize is a keyword pointing to the current input buffer size in
bytes. outputsize is a keyword pointing to the current output buffer
size in bytes. The NAME lexical convention is used to name a variable.
The initial value of a variable is 0. The following expressions are
allowed with the special operands:
// Pointer to the third byte value of the current input buffer:
input[2]
// Equal-to expression with the 'input':
input == 0x8020
// Alternative way to write the above expression:
0x8020 == input
// The size of the current input buffer:
inputsize
// The size of the current output buffer:
outputsize
// A variable:
saved_second_byte
// Assignment expression with the variable:
saved_second_byte = input[1]
The input keyword without index value can be used only with the equal-
to operator, '=='. When used in that way, the current input buffer is
consecutively compared with another operand byte by byte. An expression
can be another operand. If the input keyword is used with an index
value n, it is a pointer to the (n+1)th byte from the beginning of the
current input buffer. An expression can be the index. Only a variable
can be placed on the left hand side of an assignment expression.
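The two uses of input can be modelled as follows. This Python sketch is
illustrative only: input_equals and input_byte are hypothetical
helpers, and the sketch assumes that the comparison covers exactly as
many leading bytes of the input buffer as the other operand occupies,
which this page does not state explicitly.

```python
def input_equals(input_buffer: bytes, operand: bytes) -> bool:
    """Model of 'input == operand': compare the leading bytes one by one."""
    if len(input_buffer) < len(operand):
        return False
    return all(a == b for a, b in zip(input_buffer, operand))


def input_byte(input_buffer: bytes, n: int) -> int:
    """Model of 'input[n]': the (n+1)th byte of the current input buffer."""
    return input_buffer[n]


buffer = bytes([0x80, 0x20, 0x41])
matches = input_equals(buffer, bytes([0x80, 0x20]))  # input == 0x8020
third = input_byte(buffer, 2)                        # input[2]
```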
The action element specifies an action for a condition and can be any
one of the following elements:
direction
operation
map
The operation element specifies one or more operation expression ele‐
ments:
operation For_US-ASCII_2_ISO8859-1 {
// one or more operation expression element definitions here.
:
:
}
If the name of the operation element (For_US-ASCII_2_ISO8859-1 in the
above example) is either init or reset, the element defines the initial
operation or the reset operation of the iconv code conversion:
// The initial operation element:
operation init {
// one or more operation expression element definitions here.
:
:
}
// The reset operation element:
operation reset {
// one or more operation expression element definitions here.
:
:
}
The initial operation element defines the operations that need to be
performed at the beginning of the iconv code conversion. The reset
operation element defines the operations that need to be performed when
a user of the iconv(3C) function requests a state reset of the iconv
code conversion. For more detail on the state reset, refer to
iconv(3C).
The operation expression can be any one of the following three differ‐
ent expressions and each operation expression should be separated by an
ending semicolon:
if-else operation expression
output operation expression
control operation expression
The if-else operation expression makes a selection depending on the
result of a boolean expression. If the result is true, the true task
that follows the 'if' is executed. If the result is false and a false
task is supplied, the false task that follows the 'else' is executed.
There are three different kinds of if-else operation expressions:
// The if-else operation expression with only true task:
if (expression) {
// one or more operation expression element definitions here.
:
:
}
// The if-else operation expression with both true and false
// tasks:
if (expression) {
// one or more operation expression element definitions here.
:
:
} else {
// one or more operation expression element definitions here.
:
:
}
// The if-else operation expression with true task and
// another if-else operation expression as the false task:
if (expression) {
// one or more operation expression element definitions here.
:
:
} else if (expression) {
// one or more operation expression element definitions here.
:
:
} else {
// one or more operation expression element definitions here.
:
:
}
The last if-else operation expression can have another if-else
operation expression as the false task. That other if-else operation
expression can be any one of the above three if-else operation
expressions.
The output operation expression saves the right hand side expression
result to the output buffer:
// Save 0x8080 at the output buffer:
output = 0x8080;
If the size of the output buffer left is smaller than the necessary
output buffer size resulting from the right hand side expression, the
iconv code conversion will stop with E2BIG errno and (size_t)-1 return
value to indicate that the code conversion needs more output buffer to
complete. Any expression can be used for the right hand side expres‐
sion. The output buffer pointer will automatically move forward appro‐
priately once the operation is executed.
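The buffer check can be pictured with a small model. The Python sketch
below is not the geniconvtbl runtime. OutputBuffer and its methods are
hypothetical, and the sketch assumes that a value is written big-endian
using the fewest bytes that can hold it (at least one), which matches
the byte counts in the EXAMPLES section but is not specified on this
page.

```python
import errno


class OutputBuffer:
    """Hypothetical model of the output buffer used by 'output = expr;'."""

    def __init__(self, size: int) -> None:
        self.data = bytearray()
        self.remaining = size  # plays the role of 'outputsize'

    def output(self, value: int) -> None:
        # Assumption: emit the minimal big-endian representation.
        length = max(1, (value.bit_length() + 7) // 8)
        if self.remaining < length:
            # Corresponds to iconv returning (size_t)-1 with errno E2BIG.
            raise OSError(errno.E2BIG, "output buffer too small")
        self.data += value.to_bytes(length, "big")
        self.remaining -= length  # the output pointer moves forward


buf = OutputBuffer(size=3)
buf.output(0x8080)        # writes two bytes, one byte remains
try:
    buf.output(0x1b284a)  # needs three bytes, only one is left
    too_big = False
except OSError as exc:
    too_big = exc.errno == errno.E2BIG
```

The model handles non-negative values only. It leaves the buffer
unchanged when the write does not fit, so the caller could retry with a
larger buffer.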
The control operation expression can be any one of the following
expressions:
// Return (size_t)-1 as the return value with an EINVAL errno:
error;
// Return (size_t)-1 as the return value with an EBADF errno:
error 9;
// Discard input buffer byte operation. This discards a byte from
// the current input buffer and moves the input buffer pointer to
// the 2nd byte of the input buffer:
discard;
// Discard input buffer byte operation. This discards
// 10 bytes from the current input buffer and moves the input
// buffer pointer to the 11th byte of the input buffer:
discard 10;
// Return operation. This stops the execution of the current
// operation:
return;
// Operation execution operation. This executes the init
// operation defined and sets all variables to zero:
operation init;
// Operation execution operation. This executes the reset
// operation defined and sets all variables to zero:
operation reset;
// Operation execution operation. This executes an operation
// defined and named 'ISO8859_1_to_ISO8859_2':
operation ISO8859_1_to_ISO8859_2;
// Direction operation. This executes a direction defined and
// named 'ISO8859_1_to_KOI8_R':
direction ISO8859_1_to_KOI8_R;
// Map execution operation. This executes a mapping defined
// and named 'Map_ISO8859_1_to_US_ASCII':
map Map_ISO8859_1_to_US_ASCII;
// Map execution operation. This executes a mapping defined
// and named 'Map_ISO8859_1_to_US_ASCII' after discarding
// 10 input buffer bytes:
map Map_ISO8859_1_to_US_ASCII 10;
If no init or reset operation is defined in the iconv code conversion
definition, only the system-defined internal init and reset operations
are executed. The execution of the system-defined internal init and
reset operations clears the system-maintained internal state.
There are three special operators that can be used in the operation:
printchr expression;
printhd expression;
printint expression;
The above three operators will print out the given expression as a
character, a hexadecimal number, and a decimal number, respectively, at
the standard error stream. These three operators are for debugging pur‐
poses only and should be removed from the final version of the iconv
code conversion definition file.
In addition to the above operations, any valid expression separated by
a semi-colon can be an operation, including an empty operation, denoted
by a semi-colon alone as an operation.
The map element specifies a direct code conversion mapping by using one
or more map pairs. When used, usually many map pairs are used to repre‐
sent an iconv code conversion definition:
map For_US-ASCII_2_ISO8859-1 {
// one or more map pairs here
:
:
}
Each map element also can have one or two comma-separated map attribute
elements like the following examples:
// Map with densely encoded mapping table map type:
map maptype = dense {
// one or more map pairs here
:
:
}
// Map with hash mapping table map type with hash factor 10.
// Only hash mapping table map type can have hash factor. If
// the hash factor is specified with other map types, it will be
// ignored.
map maptype = hash : 10 {
// one or more map pairs here.
:
:
}
// Map with binary search tree based mapping table map type:
map maptype = binary {
// one or more map pairs here.
:
:
}
// Map with index table based mapping table map type:
map maptype = index {
// one or more map pairs here.
:
:
}
// Map with automatic mapping table map type. If defined,
// the system will assign the best possible map type.
map maptype = automatic {
// one or more map pairs here.
:
:
}
// Map with output_byte_length limit set to 2.
map output_byte_length = 2 {
// one or more map pairs here.
:
:
}
// Map with densely encoded mapping table map type and
// output_byte_length limit set to 2:
map maptype = dense, output_byte_length = 2 {
// one or more map pairs here.
:
:
}
If no maptype is defined, automatic is assumed. If no out‐
put_byte_length is defined, the system figures out the maximum possible
output byte length for the mapping by scanning all the possible output
values in the mappings. If the actual output byte length scanned is
bigger than the defined output_byte_length, the geniconvtbl utility
issues an error and stops generating the code conversion binary ta‐
ble(s).
The following are allowed map pairs:
// Single mapping. This maps an input character denoted by
// the code value 0x20 to an output character value 0x21:
0x20 0x21
// Multiple mapping. This maps 128 input characters to 128
// output characters. In this mapping, 0x0 maps to 0x10, 0x1 maps
// to 0x11, 0x2 maps to 0x12, ..., and, 0x7f maps to 0x8f:
0x0...0x7f 0x10
// Default mapping. If specified, every undefined input character
// in this mapping will be converted to a specified character
// (in the following case, a character with code value of 0x3f):
default 0x3f
// Default mapping. If specified, every undefined input character
// in this mapping will not be converted but directly copied to
// the output buffer:
default no_change_copy
// Error mapping. If specified, during the code conversion,
// if the input buffer contains the byte value, in this case, 0x80,
// iconv(3C) will stop and return (size_t)-1 as the return
// value with errno set to EILSEQ:
0x80 error
If no default mapping is specified, every undefined input character in
the mapping will be treated as an error mapping, and thus iconv(3C)
will stop the code conversion and return (size_t)-1 as the return value
with errno set to EILSEQ.
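The map pair forms can be summarized with a small lookup model. The
Python sketch below is only an illustration for single-byte code
values, not the table format that geniconvtbl generates. The names
build_table, lookup, ERROR, and NO_CHANGE_COPY are hypothetical.

```python
import errno

ERROR = object()           # stands for the 'error' keyword
NO_CHANGE_COPY = object()  # stands for the 'no_change_copy' keyword


def build_table(pairs):
    """Expand map pairs into a dictionary.

    Each pair is (input, output) for a single mapping, (input, ERROR)
    for an error mapping, or (first, last, output) for a multiple mapping.
    """
    table = {}
    for pair in pairs:
        if len(pair) == 3:
            first, last, out = pair
            for offset in range(last - first + 1):
                table[first + offset] = out + offset
        else:
            code, out = pair
            table[code] = out
    return table


def lookup(code, table, default=ERROR):
    """Convert one code value; an absent default behaves like 'error'."""
    result = table.get(code, default)
    if result is NO_CHANGE_COPY:
        return code
    if result is ERROR:
        raise OSError(errno.EILSEQ, "illegal byte sequence")
    return result


# 0x0...0x7f 0x10  and  0x80 error
table = build_table([(0x0, 0x7f, 0x10), (0x80, ERROR)])
```

With this table, lookup(0x7f, table, default=0x3f) gives 0x8f as in the
multiple mapping comment above, an unmapped value such as 0x90 falls
back to the default, and 0x80 raises the modelled EILSEQ error.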
The syntax of the iconv code conversion definition in extended BNF is
illustrated below:
iconv_conversion_definition
: CONVERSION_NAME '{' definition_element_list '}'
;
definition_element_list
: definition_element ';'
| definition_element_list definition_element ';'
;
definition_element
: direction
| condition
| operation
| map
;
direction
: 'direction' NAME '{' direction_unit_list '}'
| 'direction' '{' direction_unit_list '}'
;
direction_unit_list
: direction_unit
| direction_unit_list direction_unit
;
direction_unit
: condition action ';'
| condition NAME ';'
| NAME action ';'
| NAME NAME ';'
| 'true' action ';'
| 'true' NAME ';'
;
action
: direction
| map
| operation
;
condition
: 'condition' NAME '{' condition_list '}'
| 'condition' '{' condition_list '}'
;
condition_list
: condition_expr ';'
| condition_list condition_expr ';'
;
condition_expr
: 'between' range_list
| expr
| 'escapeseq' escseq_list
;
range_list
: range_pair
| range_list ',' range_pair
;
range_pair
: HEXADECIMAL '...' HEXADECIMAL
;
escseq_list
: escseq
| escseq_list ',' escseq
;
escseq : HEXADECIMAL
;
map : 'map' NAME '{' map_list '}'
| 'map' '{' map_list '}'
| 'map' NAME map_attribute '{' map_list '}'
| 'map' map_attribute '{' map_list '}'
;
map_attribute
: map_type ',' 'output_byte_length' '=' DECIMAL
| map_type
| 'output_byte_length' '=' DECIMAL ',' map_type
| 'output_byte_length' '=' DECIMAL
;
map_type: 'maptype' '=' map_type_name ':' DECIMAL
| 'maptype' '=' map_type_name
;
map_type_name
: 'automatic'
| 'index'
| 'hash'
| 'binary'
| 'dense'
;
map_list
: map_pair
| map_list map_pair
;
map_pair
: HEXADECIMAL HEXADECIMAL
| HEXADECIMAL '...' HEXADECIMAL HEXADECIMAL
| 'default' HEXADECIMAL
| 'default' 'no_change_copy'
| HEXADECIMAL 'error'
;
operation
: 'operation' NAME '{' op_list '}'
| 'operation' '{' op_list '}'
| 'operation' 'init' '{' op_list '}'
| 'operation' 'reset' '{' op_list '}'
;
op_list : op_unit
| op_list op_unit
;
op_unit : ';'
| expr ';'
| 'error' ';'
| 'error' expr ';'
| 'discard' ';'
| 'discard' expr ';'
| 'output' '=' expr ';'
| 'direction' NAME ';'
| 'operation' NAME ';'
| 'operation' 'init' ';'
| 'operation' 'reset' ';'
| 'map' NAME ';'
| 'map' NAME expr ';'
| op_if_else
| 'return' ';'
| 'printchr' expr ';'
| 'printhd' expr ';'
| 'printint' expr ';'
;
op_if_else
: 'if' '(' expr ')' '{' op_list '}'
| 'if' '(' expr ')' '{' op_list '}' 'else' op_if_else
| 'if' '(' expr ')' '{' op_list '}' 'else' '{' op_list '}'
;
expr : '(' expr ')'
| NAME
| HEXADECIMAL
| DECIMAL
| 'input' '[' expr ']'
| 'outputsize'
| 'inputsize'
| 'true'
| 'false'
| 'input' '==' expr
| expr '==' 'input'
| '!' expr
| '~' expr
| '-' expr
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '%' expr
| expr '<<' expr
| expr '>>' expr
| expr '|' expr
| expr '^' expr
| expr '&' expr
| expr '==' expr
| expr '!=' expr
| expr '>' expr
| expr '>=' expr
| expr '<' expr
| expr '<=' expr
| NAME '=' expr
| expr '||' expr
| expr '&&' expr
;
EXAMPLES
Example 1: Code conversion from ISO8859-1 to ISO646
ISO8859-1%ISO646 {
    // Use dense-encoded internal data structure.
    map maptype = dense {
        default 0x3f
        0x0...0x7f 0x0
    };
}
Example 2: Code conversion from eucJP to ISO-2022-JP
// Iconv code conversion from eucJP to ISO-2022-JP
#include <sys/errno.h>
eucJP%ISO-2022-JP {
    operation init {
        codesetnum = 0;
    };
    operation reset {
        if (codesetnum != 0) {
            // Emit state reset sequence, ESC ( J, for
            // ISO-2022-JP.
            output = 0x1b284a;
        }
        operation init;
    };
    direction {
        condition { // JIS X 0201 Latin (ASCII)
            between 0x00...0x7f;
        } operation {
            if (codesetnum != 0) {
                // We will emit four bytes.
                if (outputsize <= 3) {
                    error E2BIG;
                }
                // Emit state reset sequence, ESC ( J.
                output = 0x1b284a;
                codesetnum = 0;
            } else {
                if (outputsize <= 0) {
                    error E2BIG;
                }
            }
            output = input[0];
            // Move input buffer pointer one byte.
            discard;
        };
        condition { // JIS X 0208
            between 0xa1a1...0xfefe;
        } operation {
            if (codesetnum != 1) {
                if (outputsize <= 4) {
                    error E2BIG;
                }
                // Emit JIS X 0208 sequence, ESC $ B.
                output = 0x1b2442;
                codesetnum = 1;
            } else {
                if (outputsize <= 1) {
                    error E2BIG;
                }
            }
            output = (input[0] & 0x7f);
            output = (input[1] & 0x7f);
            // Move input buffer pointer two bytes.
            discard 2;
        };
        condition { // JIS X 0201 Kana
            between 0x8ea1...0x8edf;
        } operation {
            if (codesetnum != 2) {
                if (outputsize <= 3) {
                    error E2BIG;
                }
                // Emit JIS X 0201 Kana sequence,
                // ESC ( I.
                output = 0x1b2849;
                codesetnum = 2;
            } else {
                if (outputsize <= 0) {
                    error E2BIG;
                }
            }
            output = (input[1] & 127);
            // Move input buffer pointer two bytes.
            discard 2;
        };
        condition { // JIS X 0212
            between 0x8fa1a1...0x8ffefe;
        } operation {
            if (codesetnum != 3) {
                if (outputsize <= 5) {
                    error E2BIG;
                }
                // Emit JIS X 0212 sequence, ESC $ ( D.
                output = 0x1b242844;
                codesetnum = 3;
            } else {
                if (outputsize <= 1) {
                    error E2BIG;
                }
            }
            output = (input[1] & 127);
            output = (input[2] & 127);
            discard 3;
        };
        true operation { // error
            error EILSEQ;
        };
    };
}
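The control flow of Example 2 can be traced with a short Python model.
This is a sketch for reading the definition, not a replacement for the
generated binary table. The function and constant names are
hypothetical, output buffer limits (E2BIG) and truncated input are not
modelled, and the reset operation is assumed to run once at the end of
the data.

```python
import errno

ESC_ROMAN = b"\x1b(J"  # 0x1b284a, ESC ( J
ESC_0208 = b"\x1b$B"   # 0x1b2442, ESC $ B
ESC_KANA = b"\x1b(I"   # 0x1b2849, ESC ( I
ESC_0212 = b"\x1b$(D"  # 0x1b242844, ESC $ ( D

# (low, high, codesetnum, designator, index of the first byte to emit),
# in the same order as the condition-action pairs of the direction.
RULES = [
    (b"\x00", b"\x7f", 0, ESC_ROMAN, 0),
    (b"\xa1\xa1", b"\xfe\xfe", 1, ESC_0208, 0),
    (b"\x8e\xa1", b"\x8e\xdf", 2, ESC_KANA, 1),
    (b"\x8f\xa1\xa1", b"\x8f\xfe\xfe", 3, ESC_0212, 1),
]


def between(seq: bytes, low: bytes, high: bytes) -> bool:
    """Per-byte range check, as described for the between element."""
    return len(seq) == len(low) and all(
        lo <= b <= hi for lo, b, hi in zip(low, seq, high))


def euc_jp_to_iso2022_jp(data: bytes) -> bytes:
    out = bytearray()
    codesetnum = 0  # operation init
    pos = 0
    while pos < len(data):
        for low, high, num, designator, skip in RULES:
            seq = data[pos:pos + len(low)]
            if between(seq, low, high):
                if codesetnum != num:
                    out += designator  # switch the designated codeset
                    codesetnum = num
                # Masking with 0x7f leaves ASCII bytes unchanged.
                out += bytes(b & 0x7f for b in seq[skip:])
                pos += len(seq)  # discard
                break
        else:
            # 'true operation { error EILSEQ; }'
            raise OSError(errno.EILSEQ, "illegal byte sequence")
    if codesetnum != 0:  # operation reset
        out += ESC_ROMAN
    return bytes(out)
```

For instance, the input bytes 0x41 0xa4 0xa2 0x42 produce 0x41, then
ESC $ B followed by 0x24 0x22, then ESC ( J followed by 0x42.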
FILES
/usr/bin/geniconvtbl
the utility geniconvtbl
/usr/lib/iconv/geniconvtbl/binarytables/*.bt
conversion binary tables
/usr/lib/iconv/geniconvtbl/srcs/*
conversion source files for user reference
SEE ALSO
cpp(1), geniconvtbl(1), iconv(1), iconv(3C), iconv_close(3C),
iconv_open(3C), attributes(5), environ(5)
International Language Environments Guide
NOTES
The maximum number of digits in a HEXADECIMAL or DECIMAL is 128. The
maximum length of a variable name is 255. The maximum nest level is 16.
SunOS 5.10 18 Feb 2003 geniconvtbl(4)