9.9 LaTeX’s encoding models
For most users it is probably sufficient to know that there exist certain input and output encodings and to have some basic knowledge about how to use them, as described in the previous sections. However, sometimes it is helpful to know the whole story in some detail, either to set up a new encoding or to better understand packages or classes that implement special features. So here is everything you always wanted to know about encodings in LaTeX.
We start by describing the general character data flow within the LaTeX system, deriving from that the base requirements for various encodings and the mapping between them. We then have a closer look at the internal representation model for character data within LaTeX, followed by a discussion of the mechanisms used to map incoming data via input encodings into that internal representation.
Finally, we explain how the internal representation is translated, via the output encodings, into the form required for the actual task of typesetting.
9.9.1 Character data within the LaTeX system
Document processing with the LaTeX system starts by interpreting data present in one or more source files. These data, which represent the document content, are stored in these files in the form of octets representing characters. To correctly interpret these octets, LaTeX (or any other program used to process the file, such as an editor) must know the encoding that was used when the file was written. In other words, it must know the mapping between abstract characters and the octets representing them.
With an incorrect mapping, all further processing is flawed to some extent unless the file contains only characters from a subset common to both encodings.
LaTeX makes one fundamental assumption at this stage: that (nearly) all characters of visible ASCII (decimal 32–126) are represented by the number that they have in the ASCII code table; see Table 9.21 on the next page.
There is both a practical and a TeXnical reason for this assumption. The practical reason is that most 8-bit encodings as well as the UTF-8 encoding usually used today share a common 7-bit plane. The TeXnical reason is that for using TeX efficiently, the majority of the visible portion of ASCII needs to be processed as characters of category “letter” (because only characters with this category can be used in multiple-character command names in TeX) or of category “other” (because TeX will not, for example, recognize the decimal digits as being part of a number if they do not have this category code).
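The two category codes mentioned above can be inspected directly. The following is a minimal sketch, assuming a standard article document with no packages that change category codes; it prints the category code that TeX currently assigns to a letter and to a digit.

```latex
\documentclass{article}
\begin{document}
% \catcode`\A yields the category code currently assigned to "A";
% \the converts that number into typeset digits.
Category code of A: \the\catcode`\A  % typesets 11 (category "letter")

Category code of 7: \the\catcode`\7  % typesets 12 (category "other")
\end{document}
```

Because “A” has category 11, it may appear in a command name such as \LaTeX, whereas “7” with category 12 ends a command name but can be read as part of a number.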
9.9.2 LaTeX’s internal character representation (LICR)
In this section we cover the LICR concepts in some more depth. Technically speaking, text characters are represented internally by LaTeX in one of three ways, each of which is discussed in the following sections.
9.9.3 Input encodings
Since 2018 the default input encoding for LaTeX has been UTF-8 unless explicitly changed in the preamble using the inputenc package. This means that in pdfTeX all UTF-8 characters that can be typeset using the loaded fonts can nowadays be entered in the source document in their natural UTF-8 form, e.g., as “ü” or “ß”, and there is no need to use the LICR representations \"u or \ss for them. This is technically achieved in pdfTeX by mapping the UTF-8 characters to their corresponding LICR objects using \DeclareUnicodeCharacter declarations.
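As an illustration of such a mapping, here is a minimal sketch, assuming pdfTeX, a source file saved as UTF-8, a LaTeX release in which UTF-8 is the default input encoding, and that the declarations in question are made with \DeclareUnicodeCharacter. The mapping for U+2603 and its replacement text “[snowman]” are hypothetical and serve only to show the syntax.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
% A legacy file would instead declare its encoding explicitly, e.g.,
% \usepackage[latin1]{inputenc}

% Hypothetical mapping for illustration: the first argument is the
% Unicode code point in hexadecimal, the second is the replacement
% (normally an LICR object).
\DeclareUnicodeCharacter{2603}{[snowman]}

\begin{document}
% Both lines are intended to produce the same typeset result: the
% first uses UTF-8 input, the second the explicit LICR forms.
Grüße

Gr\"u\ss{}e

% Uses the hypothetical mapping declared in the preamble.
Weather: ☃
\end{document}
```

The mappings for characters such as “ü” and “ß” do not have to be declared by the user; only characters without a predefined mapping need a declaration of this kind.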
9.9.4 Output encodings
As we learned earlier, output encodings define the mapping from the LICR to the glyphs (or constructs built from glyphs) available in the fonts used for typesetting. These mappings are referenced inside LaTeX by two- or three-letter names (e.g., OT1 or T2A). We say that a certain font is in a certain encoding if the mapping corresponds to the positions of the glyphs in the font in question. So what are the exact components of such a mapping?
Characters internally represented by ASCII characters are simply passed on to the font. In other words, TeX uses the ASCII code to select a glyph from the current font. For example, the character “A” with ASCII code 65 results in typesetting the glyph in position 65 in the current font. This is why LaTeX requires that fonts for text contain all such ASCII letters in their ASCII code positions, because there is no way to interact with this basic TeX mechanism (other than to disable it and do everything “manually”). Thus, for visible ASCII, a one-to-one mapping is implicitly present in all output encodings.
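The pass-through for visible ASCII can be made visible with a small test document. This is a minimal sketch, assuming the default fonts of the article class; each of the three lines in the body requests the glyph in position 65 of the current font.

```latex
\documentclass{article}
\begin{document}
% Each line below selects slot 65 of the current font.
A               % character token with ASCII code 65

\symbol{65}     % LaTeX command that selects a slot by number

\char65\relax   % TeX primitive underlying \symbol
\end{document}
```

With a text font that follows the requirement stated above, all three lines typeset the letter “A”.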