Recognize Coding - GNU Emacs Manual

27.8 Recognizing Coding Systems

Emacs tries to recognize which coding system to use for a given text as an integral part of reading that text. (This applies to files being read, output from subprocesses, text from X selections, etc.) Emacs can select the right coding system automatically most of the time—once you have specified your preferences.

Some coding systems can be recognized or distinguished by which byte sequences appear in the data. However, there are coding systems that cannot be distinguished, not even potentially. For example, there is no way to distinguish between Latin-1 and Latin-2; they use the same byte values with different meanings.

Emacs handles this situation by means of a priority list of coding systems. Whenever Emacs reads a file, if you do not specify the coding system to use, Emacs checks the data against each coding system, starting with the first in priority and working down the list, until it finds a coding system that fits the data. Then it converts the file contents assuming that they are represented in this coding system.
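The same detection can be observed from Lisp. As a small sketch, detect-coding-string reports which coding systems could have produced a given sequence of bytes, ordered by the current priority list. The sample bytes are only an illustration, and the result depends on your own priorities:

```elisp
;; Byte 0xF5 is o-tilde in Latin-1 but o-double-acute in Latin-2,
;; so the bytes alone cannot tell the two coding systems apart.
(let ((bytes (unibyte-string ?a #xF5 ?b)))
  ;; Every coding system that fits the data, most preferred first.
  (message "candidates: %S" (detect-coding-string bytes))
  ;; With a non-nil second argument, only the most preferred one.
  (message "preferred:  %S" (detect-coding-string bytes t)))
```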

The priority list of coding systems depends on the selected language environment (see Language Environments). For example, if you use French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use Czech, you probably want Latin-2 to be preferred. This is one of the reasons to specify a language environment.
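For instance, a Czech user might put something like the following in an init file. This is only a sketch; substitute whichever language environment name applies to you:

```elisp
;; Select the Czech language environment.  Among other things this
;; resets the coding system priority list so that Latin-2 is
;; preferred to Latin-1, as described above.
(set-language-environment "Czech")
```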

However, you can alter the coding system priority list in detail with the command M-x prefer-coding-system. This command reads the name of a coding system from the minibuffer, and adds it to the front of the priority list, so that it is preferred to all others. If you use this command several times, each use adds one element to the front of the priority list.

If you give this command a coding system that specifies the end-of-line conversion type, such as iso-8859-1-dos, it means that Emacs should attempt to recognize iso-8859-1 with priority, and should use DOS end-of-line conversion when it does recognize iso-8859-1.
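The command can also be called from Lisp, for example in an init file. The following sketch assumes you want Latin-1 ahead of Latin-2; the choice of coding systems is purely illustrative:

```elisp
;; Each call pushes one coding system onto the front of the priority
;; list, so the last call wins: Latin-1 ends up ahead of Latin-2.
(prefer-coding-system 'iso-8859-2)
(prefer-coding-system 'iso-8859-1)

;; Naming an end-of-line variant instead would also request DOS
;; end-of-line conversion whenever iso-8859-1 is recognized:
;; (prefer-coding-system 'iso-8859-1-dos)
```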

Sometimes a file name indicates which coding system to use for the file. The variable file-coding-system-alist specifies this correspondence. There is a special function modify-coding-system-alist for adding elements to this list. For example, to read and write all ‘.txt’ files using the coding system chinese-iso-8bit, you can execute this Lisp expression:

     (modify-coding-system-alist 'file "\\.txt\\'" 'chinese-iso-8bit)

The first argument should be file, the second argument should be a regular expression that determines which files this applies to, and the third argument says which coding system to use for these files.

Emacs recognizes which kind of end-of-line conversion to use based on the contents of the file: if it sees only carriage-returns, or only carriage-return linefeed sequences, then it chooses the end-of-line conversion accordingly. You can inhibit the automatic use of end-of-line conversion by setting the variable inhibit-eol-conversion to non-nil. If you do that, DOS-style files will be displayed with the ‘^M’ characters visible in the buffer; some people prefer this to the more subtle ‘(DOS)’ end-of-line type indication near the left edge of the mode line (see eol-mnemonic).
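If you want this only occasionally, one possible approach is to bind the variable around a single visit instead of setting it globally. The command name my-find-file-raw-eol below is hypothetical, not something Emacs provides:

```elisp
;; Hypothetical helper: visit FILE with end-of-line conversion
;; inhibited for this one operation only.
(defun my-find-file-raw-eol (file)
  "Visit FILE without end-of-line conversion, leaving any ^M visible."
  (interactive "fFind file (no EOL conversion): ")
  ;; The binding is in effect only while `find-file' reads the file.
  (let ((inhibit-eol-conversion t))
    (find-file file)))
```

After evaluating the definition, M-x my-find-file-raw-eol prompts for a file name and visits that file.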

By default, the automatic detection of coding system is sensitive to escape sequences. If Emacs sees a sequence of characters that begin with an escape character, and the sequence is valid as an ISO-2022 code, that tells Emacs to use one of the ISO-2022 encodings to decode the file.

However, there may be cases in which you want to read escape sequences in a file as is. In such a case, you can set the variable inhibit-iso-escape-detection to non-nil. Then the code detection ignores any escape sequences, and never uses an ISO-2022 encoding. The result is that all escape sequences become visible in the buffer.

The default value of inhibit-iso-escape-detection is nil. We recommend that you not change it permanently, only for one specific operation. That's because many Emacs Lisp source files in the Emacs distribution contain non-ASCII characters encoded in the coding system iso-2022-7bit, and they won't be decoded correctly when you visit those files if you suppress the escape sequence detection.
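One way to follow that recommendation is to bind the variable temporarily around a single visit. This is a sketch; the command name my-find-file-literal-escapes is hypothetical:

```elisp
;; Hypothetical helper: visit FILE with ISO-2022 escape detection
;; suppressed for this one operation, leaving the global value nil.
(defun my-find-file-literal-escapes (file)
  "Visit FILE so that escape sequences stay visible in the buffer."
  (interactive "fFind file (escape sequences as is): ")
  (let ((inhibit-iso-escape-detection t))
    (find-file file)))
```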

You can specify the coding system for a particular file using the ‘-*-...-*-’ construct at the beginning of a file, or a local variables list at the end (see File Variables). You do this by defining a value for the “variable” named coding. Emacs does not really have a variable coding; instead of setting a variable, this uses the specified coding system for the file. For example, ‘-*-mode: C; coding: latin-1;-*-’ specifies use of the Latin-1 coding system, as well as C mode. When you specify the coding explicitly in the file, that overrides file-coding-system-alist.
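The local variables list form, placed at the end of a file, might look like this in an Emacs Lisp source file. The comment prefix depends on the language of the file; `;;` is assumed here:

```elisp
;; Local Variables:
;; coding: latin-1
;; End:
```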

The variables auto-coding-alist, auto-coding-regexp-alist and auto-coding-functions are the strongest way to specify the coding system for certain patterns of file names, or for files containing certain patterns; these variables even override ‘-*-coding:-*-’ tags in the file itself. Emacs uses auto-coding-alist for tar and archive files, to prevent it from being confused by a ‘-*-coding:-*-’ tag in a member of the archive and thinking it applies to the archive file as a whole. Likewise, Emacs uses auto-coding-regexp-alist to ensure that RMAIL files, whose names in general don't match any particular pattern, are decoded correctly. One of the built-in auto-coding-functions detects the encoding for XML files.
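As an illustration, an element of auto-coding-alist pairs a file-name regular expression with a coding system. The ‘.leg’ extension below is invented for the example:

```elisp
;; Hypothetical rule: decode every file whose name ends in ".leg"
;; as Latin-2, regardless of any coding tag inside the file.
(add-to-list 'auto-coding-alist '("\\.leg\\'" . iso-latin-2))
```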

If Emacs recognizes the encoding of a file incorrectly, you can reread the file using the correct coding system by typing C-x <RET> r coding-system <RET>. To see what coding system Emacs actually used to decode the file, look at the coding system mnemonic letter near the left edge of the mode line (see Mode Line), or type C-h C <RET>.
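The same correction can be made from Lisp. This sketch assumes the current buffer is visiting a file that should have been decoded as Latin-2:

```elisp
;; Reread the visited file, decoding it as Latin-2 this time.
;; This is the command that C-x <RET> r runs.
(revert-buffer-with-coding-system 'iso-latin-2)

;; Report the coding system now recorded for the buffer.
(message "Decoded with: %S" buffer-file-coding-system)
```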

The command unify-8859-on-decoding-mode enables a mode that “unifies” the Latin alphabets when decoding text. This works by converting all non-ASCII Latin-n characters to either Latin-1 or Unicode characters. This way it is easier to use various Latin-n alphabets together. In a future Emacs version we hope to move towards full Unicode support and complete unification of character sets.

Once Emacs has chosen a coding system for a buffer, it stores that coding system in buffer-file-coding-system and uses that coding system, by default, for operations that write from this buffer into a file. This includes the commands save-buffer and write-region. If you want to write files from this buffer using a different coding system, you can specify a different coding system for the buffer using set-buffer-file-coding-system (see Specify Coding).
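For example, to save the current buffer as Latin-1 with Unix line endings from now on, something like the following could be evaluated in that buffer. The coding system is chosen only for illustration:

```elisp
;; Change the coding system used for writing this buffer, then save.
(set-buffer-file-coding-system 'iso-latin-1-unix)
(save-buffer)
```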

You can insert any possible character into any Emacs buffer, but most coding systems can only handle some of the possible characters. This means that it is possible for you to insert characters that cannot be encoded with the coding system that will be used to save the buffer. For example, you could start with an ASCII file and insert a few Latin-1 characters into it, or you could edit a text file in Polish encoded in iso-8859-2 and add some Russian words to it. When you save the buffer, Emacs cannot use the current value of buffer-file-coding-system, because the characters you added cannot be encoded by that coding system.

When that happens, Emacs tries the most-preferred coding system (set by M-x prefer-coding-system or M-x set-language-environment), and if that coding system can safely encode all of the characters in the buffer, Emacs uses it, and stores its value in buffer-file-coding-system. Otherwise, Emacs displays a list of coding systems suitable for encoding the buffer's contents, and asks you to choose one of those coding systems.
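To find out in advance whether this will happen, you can ask Emacs where the first unencodable character is. This is a minimal sketch using unencodable-char-position; the command name my-first-unencodable-char is hypothetical:

```elisp
;; Hypothetical helper: locate the first character, if any, that
;; `buffer-file-coding-system' cannot encode.
(defun my-first-unencodable-char ()
  "Move point to the first character that cannot be saved as is."
  (interactive)
  (let ((pos (and buffer-file-coding-system
                  (unencodable-char-position
                   (point-min) (point-max) buffer-file-coding-system))))
    (if pos
        (progn
          (goto-char pos)
          (message "Cannot encode `%c' with %S"
                   (char-after pos) buffer-file-coding-system))
      (message "Every character can be encoded with %S"
               buffer-file-coding-system))))
```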

If you insert the unsuitable characters in a mail message, Emacs behaves a bit differently. It additionally checks whether the most-preferred coding system is recommended for use in MIME messages; if not, Emacs tells you that the most-preferred coding system is not recommended and prompts you for another coding system. This is so you won't inadvertently send a message encoded in a way that your recipient's mail software will have difficulty decoding. (If you do want to use the most-preferred coding system, you can still type its name in response to the question.)

When you send a message with Mail mode (see Sending Mail), Emacs has four different ways to determine the coding system to use for encoding the message text. It tries the buffer's own value of buffer-file-coding-system, if that is non-nil. Otherwise, it uses the value of sendmail-coding-system, if that is non-nil. The third way is to use the default coding system for new files, which is controlled by your choice of language environment, if that is non-nil. If all of these three values are nil, Emacs encodes outgoing mail using the Latin-1 coding system.
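The order of these four choices can be written out as a single Lisp expression. This is only an illustration of the description above, not the code that Mail mode actually runs. The function name is hypothetical, and using default-value to obtain the default for new files is an assumption:

```elisp
;; Hypothetical illustration of the four-step fallback order.
(defun my-mail-coding-system-sketch ()
  "Return the coding system that the order described above would pick."
  (or
   ;; 1. The buffer's own value, if it has one and it is non-nil.
   (and (local-variable-p 'buffer-file-coding-system)
        buffer-file-coding-system)
   ;; 2. `sendmail-coding-system', if defined and non-nil.
   (bound-and-true-p sendmail-coding-system)
   ;; 3. The default coding system for new files (assumed lookup).
   (default-value 'buffer-file-coding-system)
   ;; 4. Latin-1 as the last resort.
   'iso-latin-1))
```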

When you get new mail in Rmail, each message is translated automatically from the coding system it is written in, as if it were a separate file. This uses the priority list of coding systems that you have specified. If a MIME message specifies a character set, Rmail obeys that specification, unless rmail-decode-mime-charset is nil.

For reading and saving Rmail files themselves, Emacs uses the coding system specified by the variable rmail-file-coding-system. The default value is nil, which means that Rmail files are not translated (they are read and written in the Emacs internal character code).