Localizations and character encodings

Browsers process text as Unicode internally. However, a way of representing characters in terms of bytes (character encoding) is used for transferring text over the network to the browser. The HTML specification recommends the use of the UTF-8 encoding (which can represent all of Unicode) and regardless of the encoding used requires Web content to declare what encoding was used.

The <meta> element's charset attribute is used to specify the page's character encoding. It must be used within a <head> block.

To specify that a page is using, for example, the UTF-8 character encoding (as per the recommendation), simply place the following line in the <head> block:

<meta charset="utf-8">

Details and browser internals

When the encoding is declared by Web content like the HTML specification requires, Firefox will use that encoding for turning the bytes into the internal representation. Unfortunately, using UTF-8 and declaring that UTF-8 was used was not always the prevalent way of offering Web content. In the 1990s, it was common to leave the encoding undeclared and to use a region-specific encoding that wasn't able to represent all of Unicode.

Firefox needs a fallback encoding that it uses for non-conforming legacy content that doesn't declare its encoding. For most locales, the fallback encoding is windows-1252 (often called ISO-8859-1), which was the encoding emitted by most Windows applications in the 1990s and a superset of the encoding emitted by most UNIX applications in the 1990s as a deployed in the America has and in Western Europe. However, there are locales where Web publishing was common already in the 1990s but the windows-1252 encoding was not suitable for the local language. In these locales, legacy content that doesn't declare its encoding is typically encoded using a legacy encoding other than windows-1252. In order to work with legacy content, some Firefox localizations need a non-windows-1252 fallback encoding.

Unfortunately, this means that the Web-exposed functionality of Firefox differs by locale and it is hard to read legacy content across locales with different fallback encodings. To avoid introducing this problem in locales where Web publishing took off after the adoption of UTF-8, locales that don't have a non-windows-1252 legacy encoding arising from the practices of the 1990s, should leave the fallback encoding at windows-1252 to facilitate reading content cross-locale from the old locales whose fallback encoding is windows-1252. New-authored locale-native UTF-8 content is expected to declare its encoding, in which case the fallback encoding does not participate in the processing of content.

Additionally, there is a small number of locales where in the 1990s there wasn't an obvious single region-specific encoding and heuristic detection among multiple legacy encodings was introduced to Web browsers. This has then had the effect that Web authors have depended on heuristic detection being present, so Firefox still has heuristic detection in these locales.

Finding canonical encoding names

The text below refers to canonical names of encodings. The canonical names are the values to the right of the equals sign in unixcharset.properties.

Specifying the fallback encoding

As of Firefox 28, this section is obsolete, since the preference intl.charset.default no longer exists. The mapping from locales onto fallback encodings is now built into Gecko itself.

The fallback encoding is specified by the preference intl.charset.default in intl.properties. It should be set to the canonical name of the legacy encoding that users of the localizations are most likely to encounter when browsing non-conforming legacy Web content that doesn't declare its encoding. Note that the fallback encoding as defined by the previous sentence does not necessarily need to be able to represent the characters needed for the language of the localization!

The fallback encoding should be left to windows-1252 for Western European locales, North, Central and South American locales, African locales, Central Asian locales and Oceanian locales. It typically needs to be set to something other than windows-1252 for Central and Eastern European locales, Middle Eastern locales and East Asian locales.

In order to avoid the problem of Web authors creating new UTF-8 content without declaring that the content uses UTF-8 and in order to maximize the ability of users to read content cross-locale, do not set the fallback encoding to UTF-8 for any newly-introduced localization. Note that Firefox no longer sends the Accept-Charset HTTP header, so there is no need to consider what gets advertised in Accept-Charset when setting the fallback encoding.

For locales where the fallback encoding is currently ISO-8859-1, it should be changed to windows-1252. ISO-8859-1 is decoded in the exact same way as windows-1252, but Firefox is moving to treating windows-1252 as the preferred label for this encoding in accordance with the Encoding Standard.

For locales where Internet Explorer has more market share than Firefox, the fallback encoding should typically be set to the same value as in Internet Explorer. You can see the fallback encoding a particular browser has by loading a test page. (Be sure to use a browser installation that has its settings left to the defaults when investigating!)

For locales where Firefox has more market share than Internet Explorer, it's probably best not to change the fallback encoding even if it doesn't follow the guidance given above. (For example, the fallback encoding for the Polish, Hungarian and Czech locales should probably continue to be ISO-8859-2 even though IE has a different fallback encoding.)

When in doubt, use windows-1252 as the fallback encoding.

Specifying the heuristic detection mode

The heuristic detection mode is specified by the preference intl.charset.detector in intl.properties. The setting must be left blank for all locales other than Russian, Ukrainian and Japanese. Do not under any circumstances specify the "universal" detector. It is not actually universal despite its name!

Exception for minority languages

If the localization is for minority language and the users are typically literate in the majority language of the region and read Web content written in the majority language very often, it is appropriate to specify the fallback encoding and the heuristic detection mode to be the same as for the localization for the majority language of the region. For example, for a localization for minority language in Russia, it is appropriate to copy the settings from the Russian localization.

Setting some encodings to be more easily selectable from the character encoding menu

The preference intl.charsetmenu.browser.static in intl.properties makes some character encodings more easily available in the Character Encoding menu in the browser. The value should be a comma-separated list of canonical encoding names. The list should include at least the fallback encoding, windows-1252 and UTF-8. For locales where there are multiple common legacy encodings, all those encodings should be included. For example, the fallback encoding for Japanese is Shift_JIS, but there are other legacy encodings: ISO-2022-JP and EUC-JP. Therefore, it makes sense for the list to be Shift_JIS, EUC-JP, ISO-2022-JP, windows-1252, UTF-8.