Comparing Files: Character Encoding

Choosing a Character Encoding

Definitions

A codepage, or character set, is a collection of characters. Historically, characters from different languages have been divided into different character sets, as computers were able to "address" only a limited number of characters at a time. Thus codepages were defined to support specific languages, or groups of languages with similar writing systems. For instance, codepage 1251 contains the characters used in both the Bulgarian and Russian alphabets.
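To make the idea of a codepage concrete, here is a minimal Python sketch showing that one and the same byte value stands for different characters depending on the codepage used to read it. The codec names `cp1251` and `cp1252` are Python's names for the Windows Cyrillic and Western European codepages; the variable names are illustrative only.

```python
# One byte value, interpreted through two different codepages.
raw = bytes([0xC4])

as_cyrillic = raw.decode("cp1251")  # Windows Cyrillic codepage
as_western = raw.decode("cp1252")   # Windows Western European codepage

print(as_cyrillic)  # Cyrillic capital letter De
print(as_western)   # Latin capital letter A with diaeresis

# Bulgarian and Russian words both fit in codepage 1251, one byte per letter.
for word in ("България", "Россия"):
    encoded = word.encode("cp1251")
    print(word, len(word), len(encoded))
```

The byte itself carries no hint about which codepage was intended, which is why the choice of encoding matters when a file is opened.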

Legacy character encodings use either Single-Byte Character Sets (SBCS) or Multi-Byte Character Sets (MBCS), the latter also known as Double-Byte Character Sets (DBCS). An SBCS contains up to 256 character codes, while a DBCS mixes single-byte and double-byte characters and can represent up to 65,536 characters.
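The difference between the two families can be seen in a short Python sketch. It uses `latin-1` as an example of a single-byte character set and `cp932` (a Shift-JIS variant) as an example of a double-byte one; both are chosen here purely for illustration.

```python
# SBCS: every one of the 256 possible byte values maps to exactly one character.
all_bytes = bytes(range(256))
sbcs_text = all_bytes.decode("latin-1")
print(len(sbcs_text))  # one character per byte

# DBCS: ASCII letters take one byte, Japanese kana take two.
single = "A".encode("cp932")
double = "\u3042".encode("cp932")  # HIRAGANA LETTER A
print(len(single), len(double))

# A mixed string therefore has a byte length that differs from its character count.
mixed = "A\u3042"
print(len(mixed), len(mixed.encode("cp932")))
```

Because character boundaries in a DBCS depend on the preceding bytes, code that steps through such text byte by byte has to track which bytes are lead bytes.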

Modern character encodings are based on Unicode, which assigns a unique number, called a code point, to most of the characters used throughout the world. Each Unicode code point refers unambiguously to a given character. Code points are stored using encodings such as UTF-8 or UTF-16; UTF-16 uses 16-bit code units, with pairs of units for less common characters. Compared to SBCS, Unicode allows for addressing a considerably larger range of characters. Compared to MBCS, Unicode offers a simplified model for working with text.
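As a small illustration, the Python sketch below looks at one character's Unicode code point and at the bytes used to store it under a few encodings. The code point stays the same regardless of how it is stored; only the byte representation changes. The variable names are illustrative.

```python
ch = "\u0414"  # CYRILLIC CAPITAL LETTER DE

code_point = ord(ch)            # the character's Unicode number
utf8 = ch.encode("utf-8")       # two bytes in UTF-8
utf16 = ch.encode("utf-16-le")  # one 16-bit unit in UTF-16
legacy = ch.encode("cp1251")    # one byte in the legacy Cyrillic codepage

print(hex(code_point), utf8, utf16, legacy)

# Characters beyond the first 65,536 code points need two 16-bit units in UTF-16.
emoji = "\U0001F600"
print(hex(ord(emoji)), len(emoji.encode("utf-16-le")), len(emoji.encode("utf-8")))
```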

Selecting a Character Encoding

Inherently Unicode-based, DeltaWalker has built-in functionality for detecting the character encoding of a given text file. Unicode encodings are often easy to detect thanks to a short leading identifier, the byte order mark (two bytes for UTF-16, three for UTF-8), while SBCS encodings don't lend themselves well to auto-detection. If a character encoding is detected incorrectly, many or all characters will appear garbled and unreadable.
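The following Python sketch shows one simple way a leading identifier can be checked, and what happens when legacy text is read with the wrong codepage. It is not DeltaWalker's actual detection logic; the function name `detect_bom` and the table `_BOMS` are hypothetical, and only the byte order mark constants from Python's standard `codecs` module are used.

```python
import codecs
from typing import Optional

# Longer marks are listed first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return a Python codec name if data starts with a byte order mark, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None

# A UTF-16 file with a leading identifier is recognized and decoded cleanly.
unicode_file = codecs.BOM_UTF16_LE + "Привет".encode("utf-16-le")
detected = detect_bom(unicode_file)
decoded = unicode_file.decode(detected)

# A single-byte file has no identifier, so nothing is detected.
legacy_file = "Привет".encode("cp1251")
undetected = detect_bom(legacy_file)

# Reading it with the wrong codepage yields garbled text.
garbled = legacy_file.decode("cp1252")
print(detected, decoded, undetected, garbled)
```

When no identifier is present, a tool has to fall back on heuristics or on a user's explicit choice, which is why a manual encoding selection is useful.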

If DeltaWalker is unable to correctly detect the character encoding of a file, you can select it yourself from one of several places:

As illustrated in these screenshots, DeltaWalker allows you to select a charset encoding either by the language corresponding to that charset, or by the charset name itself. Note that several languages can share the same character encoding, and some encodings have no corresponding language. Therefore, when switching from Languages to Charsets, DeltaWalker will always map the current language to its corresponding charset, but not the other way around.
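The one-way nature of this mapping can be sketched with a small lookup table in Python. The table contents and the function names `charset_for` and `languages_for` are hypothetical and do not reflect DeltaWalker's internal data; they only illustrate why a language determines a charset while a charset does not determine a language.

```python
from typing import Dict, List

# Illustrative table: several languages may point at the same charset.
LANGUAGE_TO_CHARSET: Dict[str, str] = {
    "Bulgarian": "windows-1251",
    "Russian": "windows-1251",
    "German": "windows-1252",
}

def charset_for(language: str) -> str:
    """Each language maps to exactly one charset."""
    return LANGUAGE_TO_CHARSET[language]

def languages_for(charset: str) -> List[str]:
    """A charset may match several languages, or none at all."""
    return sorted(lang for lang, cs in LANGUAGE_TO_CHARSET.items() if cs == charset)

print(charset_for("Russian"))
print(languages_for("windows-1251"))  # more than one candidate
print(languages_for("utf-16"))        # no corresponding language
```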

The Editing preference page allows you to select the default encoding (by language or by character set) for new text files created in DeltaWalker. Unless you override the default encoding in, say, the Set Encoding dialog, a new file will be saved to disk with the default encoding.

See Also