A.2. Foreign Language Representation

XGC User's Guide: Using the Ada Compiler

A.2.1. Latin-1

The basic character set is Latin-1, which is defined by ISO standard 8859, part 1. The lower half (character codes 16#00# ... 16#7F#) is identical to standard ASCII coding, and the upper half (character codes 16#80# ... 16#FF#) represents additional characters. These include the extended letters used by European languages, such as the vowels with umlauts used in German and the extra letter A-ring used in Swedish.
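The relationship between ASCII and Latin-1 can be checked by decoding the same bytes both ways. The sketch below uses Python's built-in `ascii` and `latin-1` codecs purely as an illustration; it is not part of XGC Ada, and the variable names are hypothetical.

```python
# Illustration only: Python's built-in codecs stand in for the character sets.
lower_half = bytes(range(0x00, 0x80))

# Every lower-half code decodes to the same character under both codecs.
same_lower = lower_half.decode("ascii") == lower_half.decode("latin-1")

# Upper-half codes map to extended characters, for example:
u_umlaut = bytes([0xFC]).decode("latin-1")  # small letter u with umlaut
a_ring = bytes([0xC5]).decode("latin-1")    # capital letter A with ring

print(same_lower, hex(ord(u_umlaut)), hex(ord(a_ring)))
```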

For a complete list of Latin-1 codes and their encodings, see the source of library unit Ada.Characters.Latin_1. You may use any of these extended characters freely in character or string literals. In addition, the extended characters that represent letters can be used in identifiers.

A.2.2. Other Eight-Bit Codes

XGC Ada also supports several other eight-bit coding schemes:

Latin-2: Latin-2 letters are allowed in identifiers, with uppercase and lowercase equivalence.
Latin-3: Latin-3 letters are allowed in identifiers, with uppercase and lowercase equivalence.
Latin-4: Latin-4 letters are allowed in identifiers, with uppercase and lowercase equivalence.
IBM PC (code page 437): This code page is the normal default for PCs in the USA. It corresponds to the original IBM PC character set. This set has some, but not all, of the extended Latin-1 letters, and these letters do not have the same encoding as in Latin-1. In this mode, these letters are allowed in identifiers with uppercase and lowercase equivalence.
IBM PC (code page 850): This code page is a modification of 437 extended to include all the Latin-1 letters, but still not with the usual Latin-1 encoding. In this mode, all these letters are allowed in identifiers with uppercase and lowercase equivalence.
Full Upper 8-bit: Any character in the range 16#80# ... 16#FF# is allowed in identifiers, and all are considered distinct. In other words, there are no uppercase and lowercase equivalences in this range. This is useful in conjunction with certain encoding schemes used for some foreign character sets (e.g. the typical method of representing Chinese characters on the PC).
No Upper-Half: No characters in the range 16#80# ... 16#FF# are allowed in identifiers. This gives Ada 83 compatibility for identifier names.
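The difference between Latin-1 and the two PC code pages can be seen by looking up the same letter in each. The sketch below uses Python's built-in `latin-1`, `cp437` and `cp850` codecs as stand-ins for the schemes listed above; the helper `code_of` is hypothetical, and the sketch says nothing about how XGC Ada stores these tables.

```python
# Illustration only: Python's built-in codecs stand in for the coding schemes.
CODECS = ("latin-1", "cp437", "cp850")

def code_of(letter, codec):
    """Return the single-byte code of `letter` in `codec`, or None if absent."""
    try:
        encoded = letter.encode(codec)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None

# Letters are written as escapes so the source stays plain ASCII.
LETTERS = {"e-acute": "\u00e9", "A-ring": "\u00c5", "A-tilde": "\u00c3"}

for name, letter in LETTERS.items():
    row = []
    for codec in CODECS:
        code = code_of(letter, codec)
        row.append(codec + ": " + ("absent" if code is None else "16#%02X#" % code))
    print(name, "->", ", ".join(row))
```

Under these codecs, e-acute has code 16#E9# in Latin-1 but 16#82# in both PC code pages, and A-tilde is absent from code page 437 while present in code page 850.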

For precise data on the encodings permitted, and the uppercase and lowercase equivalences that are recognized, see the file csets.adb in the XGC Ada compiler sources. This file is available only in a full source release of XGC Ada.

A.2.3. Wide Character Encodings

XGC Ada allows wide character codes to appear in character and string literals, and optionally in identifiers, using one of the following encoding schemes:

Brackets Coding

In this encoding, a wide character is represented by the following eight-character sequence:

[ " a b c d " ]

Where a, b, c, d are the four hexadecimal characters (using uppercase letters) of the wide character code. For example, ["A345"] is used to represent the wide character with code 16#A345#. This scheme is compatible with use of the full Wide_Character set, and is also the method used for wide character encoding in the standard ACVC (Ada Compiler Validation Capability) test suite distributions.
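The brackets scheme can be sketched in a few lines. The functions `to_brackets` and `from_brackets` below are hypothetical helpers written for illustration in Python; they follow the eight-character form described above and are not taken from the XGC Ada sources.

```python
import re

def to_brackets(ch):
    """Encode one character as the eight-character brackets sequence."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("code does not fit in four hexadecimal digits")
    return '["%04X"]' % code  # uppercase hexadecimal digits, zero padded

# Matches [ " followed by four uppercase hexadecimal digits and " ]
_BRACKETS = re.compile(r'\["([0-9A-F]{4})"\]')

def from_brackets(text):
    """Replace every brackets sequence in `text` by the character it names."""
    return _BRACKETS.sub(lambda m: chr(int(m.group(1), 16)), text)

encoded = to_brackets(chr(0xA345))
print(encoded, len(encoded))
```

Because the sequence consists only of seven-bit characters, source text encoded this way passes unchanged through any tool that handles plain ASCII.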

Hex Coding

In this encoding, a wide character is represented by the following five-character sequence:

ESC a b c d

Where a, b, c, d are the four hexadecimal characters (using uppercase letters) of the wide character code. For example, ESC A345 is used to represent the wide character with code 16#A345#. This scheme is compatible with use of the full Wide_Character set.
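A minimal sketch of hex coding, assuming ESC is the ASCII escape character (code 16#1B#). The helper names are hypothetical and the code is an illustration in Python, not the compiler's implementation.

```python
import re

ESC = "\x1b"  # ASCII escape character, code 16#1B#

def to_hex_coding(ch):
    """Encode one character as ESC followed by four uppercase hex digits."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("code does not fit in four hexadecimal digits")
    return ESC + "%04X" % code

# ESC followed by exactly four uppercase hexadecimal digits
_HEX_CODING = re.compile(ESC + r"([0-9A-F]{4})")

def from_hex_coding(text):
    """Replace every ESC sequence in `text` by the character it names."""
    return _HEX_CODING.sub(lambda m: chr(int(m.group(1), 16)), text)

encoded = to_hex_coding(chr(0xA345))
print(repr(encoded), len(encoded))
```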

Upper-Half Coding

A wide character with code 16#abcd# whose upper bit is on (in other words, a is in the range 8 to F) is represented as two bytes, 16#ab# and 16#cd#. The second byte may never be a format control character, but is not required to be in the upper half. This method can also be used for Shift JIS or EUC where the internal coding matches the external coding.
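The byte layout of upper-half coding can be sketched as follows. The sketch assumes that any byte with the upper bit set starts a two-byte wide character and that all other bytes stand for themselves; it does not check the rule about format control characters. The function names are hypothetical.

```python
def to_upper_half(code):
    """Split a wide character code with the upper bit set into two bytes."""
    if not 0x8000 <= code <= 0xFFFF:
        raise ValueError("upper-half coding needs a code in 16#8000# .. 16#FFFF#")
    return bytes([code >> 8, code & 0xFF])

def from_upper_half(data):
    """Decode bytes into a list of character codes.

    Assumption: a byte >= 16#80# is the first byte of a two-byte wide
    character; any other byte is an ordinary one-byte character.
    """
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first >= 0x80:
            if i + 1 >= len(data):
                raise ValueError("truncated wide character at end of input")
            codes.append((first << 8) | data[i + 1])
            i += 2
        else:
            codes.append(first)
            i += 1
    return codes

print(to_upper_half(0xA345).hex())
```

Note that the second byte in this example, 16#45#, is in the lower half, which the scheme permits.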

Shift JIS Coding

A wide character is represented by a two-character sequence, 16#ab# and 16#cd#, subject to the restrictions described above for upper-half coding. The internal character code is the corresponding JIS character according to the standard algorithm for Shift JIS conversion. Only characters defined in the JIS code set table can be used with this encoding method.
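The conversion from a Shift JIS byte pair to a JIS code can be sketched as below. This is the commonly published arithmetic for two-byte JIS X 0208 characters, written here in Python as an illustration; the source does not show the compiler's own routine, so treat the function `sjis_to_jis` and its range checks as assumptions.

```python
def sjis_to_jis(s1, s2):
    """Convert a two-byte Shift JIS pair to a 16-bit JIS code (illustrative)."""
    lead_ok = 0x81 <= s1 <= 0x9F or 0xE0 <= s1 <= 0xEF
    trail_ok = 0x40 <= s2 <= 0x7E or 0x80 <= s2 <= 0xFC
    if not (lead_ok and trail_ok):
        raise ValueError("not a two-byte Shift JIS sequence")

    # Fold the two lead-byte ranges into one and double to get the row pair.
    j1 = (s1 - (0x70 if s1 <= 0x9F else 0xB0)) * 2

    # Trail bytes skip 16#7F#, so close that gap first.
    if s2 > 0x7F:
        s2 -= 1

    if s2 >= 0x9E:
        j2 = s2 - 0x7D   # second row of the pair
    else:
        j1 -= 1          # first row of the pair
        j2 = s2 - 0x1F
    return (j1 << 8) | j2

# Hiragana letter A: Shift JIS 16#82# 16#A0#
print("16#%04X#" % sjis_to_jis(0x82, 0xA0))
```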

EUC Coding

A wide character is represented by a two-character sequence 16#ab# and 16#cd#, with both characters being in the upper half. The internal character code is the corresponding JIS character according to the EUC encoding algorithm. Only characters defined in the JIS code set table can be used with this encoding method.
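For two-byte EUC sequences, the usual relationship to the JIS code is that each byte is the corresponding JIS byte with the upper bit set. The Python sketch below illustrates that relationship only; the function names are hypothetical, other EUC forms (such as the single-shift sequences) are not handled, and the source does not confirm that the compiler uses exactly this routine.

```python
def euc_to_jis(e1, e2):
    """Convert a two-byte EUC pair to a 16-bit JIS code (illustrative)."""
    if not (0xA1 <= e1 <= 0xFE and 0xA1 <= e2 <= 0xFE):
        raise ValueError("both bytes must be in the upper half (16#A1# .. 16#FE#)")
    # Clear the upper bit of each byte to recover the JIS row and column bytes.
    return ((e1 & 0x7F) << 8) | (e2 & 0x7F)

def jis_to_euc(jis):
    """Convert a 16-bit JIS code back to its two-byte EUC pair."""
    j1, j2 = jis >> 8, jis & 0xFF
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("not a two-byte JIS code")
    return bytes([j1 | 0x80, j2 | 0x80])

# Hiragana letter A: EUC 16#A4# 16#A2#
print("16#%04X#" % euc_to_jis(0xA4, 0xA2))
```

Since every upper-half byte is consumed as part of a two-byte sequence, a lone upper-half Latin-1 byte has no valid interpretation under this scheme.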

Note: Some of these coding schemes do not permit the full use of the Ada 95 character set. For example, neither Shift JIS nor EUC allows the use of the upper half of the Latin-1 set.
