The GEDCOM Standard Release 5.5
Chapter 3
Using Character Sets in GEDCOM
Introduction
GEDCOM needs to accommodate different character sets to facilitate the sharing of genealogical
data in different languages. To minimize the number of differing standards, we have chosen to have
each system convert its usage to ANSEL, and eventually to UNICODE.
In January 1991, the Unicode Consortium was founded to promote the use of the Unicode standard,
which accommodates nearly all characters in one character set. (See the section "UNICODE (ISO 10646)".)
The Unicode Consortium has agreed to merge Unicode with the ISO 10646 standard, and Unicode will
be a subset of the ISO 10646 international character encoding standard.
Currently, it is difficult to handle the two- and four-byte code sequences (wide characters).
Therefore, until multi-byte handling becomes more common, ANSEL will be used to represent
Latin-based characters.
The GEDCOM Standard does not address the implementation methods for multilingual processing,
such as keyboard arrangements, sorting sequences, or character and graphic representations (font
styles, proportional spacing, and so forth) on the CRT or printers. However, the Unicode standard
has defined formatting characters that indicate the direction of text presentation, along with other
text formatting character codes.
Systems using code pages to support diacritical characters must convert every character with a
code of 128 or above to its ANSEL representation for that code page.
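The conversion described above can be sketched as follows. This is a minimal illustration, not a complete converter: the function name `codepage_to_ansel`, the choice of code page 437 as the default, and the five-entry diacritic table are assumptions made for the example, and the ANSEL byte values should be checked against the ANSEL code table before being relied on. The sketch decomposes each accented character into a base letter plus combining marks, then writes the ANSEL diacritic byte(s) before the base letter.

```python
import unicodedata

# Illustrative subset only: Unicode combining mark -> ANSEL non-spacing byte.
# These values are assumptions for the example; verify them against the ANSEL table.
COMBINING_TO_ANSEL = {
    "\u0300": 0xE1,  # grave accent
    "\u0301": 0xE2,  # acute accent
    "\u0302": 0xE3,  # circumflex
    "\u0303": 0xE4,  # tilde
    "\u0308": 0xE8,  # umlaut / diaeresis
}


def codepage_to_ansel(data: bytes, codepage: str = "cp437") -> bytes:
    """Convert code-page text to ANSEL bytes (diacritic bytes precede the base letter)."""
    # NFD splits an accented letter into a base letter followed by combining marks.
    text = unicodedata.normalize("NFD", data.decode(codepage))
    out = bytearray()
    i = 0
    while i < len(text):
        base = text[i]
        i += 1
        marks = []
        while i < len(text) and unicodedata.combining(text[i]):
            marks.append(text[i])
            i += 1
        for mark in marks:
            code = COMBINING_TO_ANSEL.get(mark)
            if code is None:
                raise ValueError(f"no ANSEL mapping in this sketch for U+{ord(mark):04X}")
            out.append(code)  # diacritic first
        if ord(base) > 127:
            raise ValueError(f"no ANSEL mapping in this sketch for U+{ord(base):04X}")
        out.append(ord(base))  # then the ASCII base letter
    return bytes(out)
```

For example, the code page 437 bytes for "René" (where 0x82 is the accented e) become `b"Ren\xe2e"`: the acute-accent byte followed by a plain "e".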
Most of the genealogy systems developed so far use ASCII, ANSEL, or both. ANSEL
accommodates the set of Latin-based languages, as explained below.
8-Bit ANSEL
The 8-Bit ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set
for Bibliographic Use, Z39.47-1985 copyright) is the preferred character set for GEDCOM. It is
used for all transmissions of information unless another character set is specified.
Using this character set standard makes it possible to preserve the full integrity of the language by
providing a method of using the standard ASCII character set and supplementing it with both non-spacing character modifiers (diacritics) and spacing special characters.
Note: Non-spacing means that the diacritic is printed without advancing the device's print position.
The character being modified is then printed in the same position, resulting in a combined image of
both the character and the diacritic(s).
Storing ANSEL requires storing the non-spacing graphic character(s) preceding the ASCII character
that the diacritic is to modify. The ANSEL standard specifies an extended 8-bit configuration (codes
128 and above) to represent the spacing and non-spacing graphic characters used by most of the Latin-based languages. ANSEL is a superset of ASCII. The standard ASCII characters, including the
control characters, are preserved.
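Because ANSEL stores the diacritic before the character it modifies, while Unicode places combining marks after the base character, a reader has to reorder them. The following is a minimal sketch under stated assumptions: the function name `ansel_to_unicode` is hypothetical, the table covers only five diacritics, and the byte values should be verified against the ANSEL code table. Spacing special characters above 127 are not handled.

```python
import unicodedata

# Illustrative subset only: ANSEL non-spacing byte -> Unicode combining mark.
# These values are assumptions for the example; verify them against the ANSEL table.
ANSEL_TO_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}


def ansel_to_unicode(data: bytes) -> str:
    """Decode ANSEL bytes, moving each diacritic to follow its base letter."""
    out = []
    pending = []  # diacritics seen but not yet attached to a base letter
    for byte in data:
        if byte in ANSEL_TO_COMBINING:
            pending.append(ANSEL_TO_COMBINING[byte])
        elif byte < 128:
            out.append(chr(byte))  # ASCII base letter
            out.extend(pending)    # its diacritics now follow it
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not covered by this sketch")
    if pending:
        raise ValueError("diacritic at end of data with no base letter")
    # NFC composes base + mark into a single precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

With this sketch, `ansel_to_unicode(b"Ren\xe2e")` yields the string "René".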
ANSEL is known by two other names:
- ANSI Z39.47-1985
- American Library Association character set, used in library systems worldwide, including
the MARC (Machine-Readable Catalog) format.
A description of the codes for the ANSEL character set has been reproduced with permission and is
included with the printed version of The GEDCOM Standard. The description of ANSEL codes is
not included in the electronic version. This description may be purchased from:
American National Standards Institute
1430 Broadway
New York, N.Y. 10018
The description of the ANSEL character set standard includes the following:
- An 8-Bit Code Table showing the ASCII and extended ANSEL codes
- An explanation or legend of these codes
- A chart that identifies the ANSEL Non-spacing Graphic Characters
- A chart that identifies the ASCII Control Characters
- A chart that identifies the ASCII Graphic Characters
Character set codes 0 through 127 are the same for 8-Bit ANSEL and 8-Bit ASCII (USA
version, ANSI 8-Bit). Character set codes 128 through 255 are unique to the ANSEL character set.
ASCII (USA Version)
When a language does not need diacritic characters or other special characters, and if you are not
transmitting binary data, you will find it convenient to use ASCII (8-bit USA version) if your
computer already supports it. This is a standard of the American National Standards Institute
(ANSI). Most of the basic printable characters of ANSEL and ASCII (USA version, ANSI 8-Bit)
are identical.
UNICODE (ISO 10646)
The Unicode standard is a new character code designed to encode text for storage in computer files.
It is a subset of the upcoming ISO 10646 standard. The design of the Unicode standard is based on
the simplicity and consistency of today's prevalent character code set, extended ASCII, but
goes far beyond ASCII's limited ability to encode only the Latin alphabet: the Unicode encoding
provides the capacity to encode nearly all of the characters used for written languages throughout the
world. To accommodate the many thousands of characters used in international text, the
Unicode standard uses a 16-bit code set instead of extended ASCII's 8-bit code set. This expansion
provides codes for approximately 65,000 characters. The Unicode standard assigns each character a
unique 16-bit value, and does not use complex modes or escape codes to specify modified characters
or special cases. UNICODE may later adopt a 32-bit code, which should allow for
all character representations. Unicode 16-bit values are written in the form U+nnnn; for example, U+0041
is assigned to the letter A (65 decimal). The Unicode standard includes the Latin alphabet used
for English, the Cyrillic alphabet used for Russian, and the Greek, Hebrew, and Arabic alphabets. Other alphabets used in countries across Europe, Africa, the Indian subcontinent, and Asia, such as
Japanese Kana, Korean Hangul, and Chinese Bopomofo, are included. The largest part of the
Unicode standard is devoted to thousands of unified character codes for Chinese, Japanese, and
Korean ideographs. (See "The Unicode standard", vol. 1 and 2, published by Addison-Wesley
Publishing, for character code standards.)
The Unicode character set environment should eventually contain a set of characters for all
languages. If the Unicode environment is used to produce a GEDCOM transmission, the header
record would also be in Unicode, requiring receiving systems to determine whether the transmission
is Unicode or ASCII before they could interpret the GEDCOM header. This would be done by
reading the first two bytes of the transmission. If the first two bytes are 0x30 and 0x20 then the
transmission will be in either ASCII or ANSEL as determined by the header record. If the first two
bytes are 0x30 and 0x00 then the transmission should be processed as a Unicode transmission.
(Different platforms may reverse the position of the null byte, in which case the test would be for
0x00 and 0x30.)
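The two-byte test described above can be sketched as follows. The function name `sniff_transmission` and the returned labels are assumptions chosen for this illustration. The sketch checks only the byte pairs named in the text; it does not handle a transmission that begins with a byte order mark or any other prefix.

```python
def sniff_transmission(first_bytes: bytes) -> str:
    """Classify a GEDCOM transmission from its first two bytes.

    A transmission starts with the header line "0 HEAD", so the first
    character is the digit 0 (0x30).
    """
    if len(first_bytes) < 2:
        raise ValueError("need at least two bytes to classify the transmission")
    pair = bytes(first_bytes[:2])
    if pair == b"\x30\x20":
        # "0" then a space: 8-bit text; the header's CHAR line says ASCII or ANSEL.
        return "ascii-or-ansel"
    if pair == b"\x30\x00":
        # "0" stored as a 16-bit value with the low byte first.
        return "unicode-low-byte-first"
    if pair == b"\x00\x30":
        # Same 16-bit value on a platform that stores the null byte first.
        return "unicode-high-byte-first"
    return "unknown"
```

A caller would typically read the first two bytes of the file, pass them to this function, and then choose a decoder before parsing the header record.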
How to Change Character Sets
The character set for an entire transmission is specified in the character set line of the header
record.
The example below shows the specification in the header record:
Lvl Tag Value
0 HEAD
1 SOUR PAF
2 VERS 2.1
1 DEST ANSTFILE
1 CHAR ANSEL
The character set change remains in effect until the TRLR record is encountered at the end of the
transmission.
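As an illustration of reading the character set line, the sketch below scans the header record of an already-decoded transmission for a level-1 CHAR line. The function name `header_character_set` is hypothetical, and the fallback to ANSEL when no CHAR line is present is an assumption drawn from the statement that ANSEL is used unless another character set is specified.

```python
def header_character_set(lines, default="ANSEL"):
    """Return the value of the level-1 CHAR line inside the HEAD record."""
    in_head = False
    for raw in lines:
        # A GEDCOM line is: level, tag, optional value, separated by spaces.
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines in this sketch
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # a new level-0 record ends the header
            in_head = tag == "HEAD"
        elif in_head and level == 1 and tag == "CHAR":
            return value.strip()
    return default


example = [
    "0 HEAD",
    "1 SOUR PAF",
    "2 VERS 2.1",
    "1 DEST ANSTFILE",
    "1 CHAR ANSEL",
    "0 TRLR",
]
print(header_character_set(example))  # prints ANSEL
```

Only the header record is searched, so a CHAR tag that happened to appear in a later record would be ignored.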
The UNICODE character set should be used for multi-language support as soon as operating systems
begin providing adequate storage and display support.
For more information about character sets, see the following:
- "Extended Latin Alphabet Coded Character Set for Bibliographic Use." American National
Standards Institute (ANSI), Z39.47-1985.
- "8-Bit ASCII: Structure and Rules." American National Standards Institute (ANSI), X3.134.1-198x.
- "7-Bit and 8-Bit ASCII Supplemental Multilingual Graphic Character Set (ASCII
Multilingual Set)" (manuscript). American National Standards Institute (ANSI), X3.134.2-198x.
- "The Unicode standard", vol. 1 and 2, published by Addison-Wesley Publishing.