Localization formats

Warning: This document pertains to the development of Mozilla web sites and not to the development of Gecko-based extensions or applications.

There are four main approaches to web l10n, distinguished by the technology used for the localization logic:

  1. HTML/PHP
  2. .lang
  3. gettext (.po)
  4. wiki (TBD)

The choice of file type depends on several factors:

  • How much content is there to be localized?
  • How often, if at all, will the site be updated after launch?
    • Is this a long-term project with continuous updates to content?
    • Is this a long-term project with stable content?
    • Is this a short-term project?
  • Will the content differ per locale? If so, to what extent?

As every new web-dev project takes shape, a project manager should ask themselves these questions and have the answers ready before starting the web l10n process.

HTML/PHP

Suppose you are designing a project with relatively small translation needs, such as three or four lines of content asking users to update to the next available version of the software. In that case, you may choose to present just the HTML for localization.

We provide an HTML file that lists several pieces of content, such as:

 <H1>Getting Started</H1>

and the localizer translates to

 <H1>Débuter avec Firefox</H1>

The localizer then submits the translated HTML or PHP back to us by either checking in changes to SVN or sending us a patch that Pascal checks in.

Advantages to HTML

  1. Good for small projects
  2. Very simple for web developers
  3. Gives localizers the exact context of translations
  4. A localizer who knows basic HTML can style translations to make sure they display correctly. We can allow slight modifications (e.g. RTL or wider display)
  5. Gives the possibility to customize content per locale
  6. Simple workflow: just put the file in SVN and it can appear on the staging server

Disadvantages to HTML

  1. Very hard for QA
    • If a localizer changes something incorrectly (e.g. accidentally removes a closing tag such as </h1>), the page can break entirely.
  2. Very hard to update automatically, if not impossible.
  3. Can be hard to tell what has changed
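
One way to ease the QA problem described above is to compare the tag structure of the source file with that of the localized file. The following is a minimal sketch in Python (chosen for illustration only; the sites themselves use PHP), built on the standard library's html.parser module. The names TagCollector, tag_sequence, and same_markup are hypothetical and not part of any Mozilla tool.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of opening and closing tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> compare equal.
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def tag_sequence(html):
    """Return the tags found in an HTML fragment, in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def same_markup(source_html, localized_html):
    """True when both fragments contain the same tags in the same order."""
    return tag_sequence(source_html) == tag_sequence(localized_html)


source = "<H1>Getting Started</H1>"
good = "<H1>Débuter avec Firefox</H1>"
broken = "<H1>Débuter avec Firefox"  # closing tag accidentally removed

print(same_markup(source, good))    # True
print(same_markup(source, broken))  # False
```

A strict check like this would also flag the intentional styling changes that localizers are allowed to make, so its output is a prompt for human review rather than a verdict.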

.lang files

Historically, Mozilla has used a gettext-like file to present content for localization. .lang files provide some features that differentiate them from gettext:

  • .lang does not depend on PHP's gettext (.po) library, so if our webdev team sets up a site without gettext support, we still have .lang. (Note that this is a near impossibility, since most sites will be set up with gettext support.)
  • .lang files can be easily edited in a text editor

With this setup, a localizer is given a "[something].lang" file containing all the strings needing localization. That file will have the following structure:

 ;Getting Started
 Débuter avec Firefox

The English content is designated by the semi-colon and the localizer provides the translation underneath. That content is placed into an array that is used by the PHP code later.

 $array["Getting Started"] = "Débuter avec Firefox";

The PHP code searches the array and returns the translation that is associated with the English term used by the web developer.

See the example below.

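 <?php
 // Look up the translation for the English string $str.
 function ___($str) {
     global $array;  // translations loaded from the .lang file
     return $array[$str];
 }
 ?>
 <H1><?php echo ___("Getting Started"); ?></H1>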
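
The loading step described above (English line marked by a semicolon, translation on the next line, both stored in an array) can be sketched as follows. This is an illustrative Python version, not the PHP loader the sites use. The function names parse_lang and translate are hypothetical, and the fallback to the English string for missing entries is an assumption, not documented behavior.

```python
def parse_lang(text):
    """Parse .lang content into a dict mapping English strings to translations.

    Assumes the simple layout shown above: a line starting with ';' holds the
    English string and the next non-empty line holds its translation.
    """
    translations = {}
    english = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(";"):
            # Remember the English string until its translation arrives.
            english = line[1:].strip()
        elif english is not None:
            translations[english] = line
            english = None
    return translations


def translate(translations, english):
    """Return the translation, or the English string if none exists (assumption)."""
    return translations.get(english, english)


sample = ";Getting Started\nDébuter avec Firefox\n"
table = parse_lang(sample)
print(translate(table, "Getting Started"))  # Débuter avec Firefox
```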

Advantages to .lang

  1. Simple workflow: the web developer places the file in SVN and it can appear on the staging server
  2. .lang syntax is like a simplified .po, which many localizers who are familiar with Linux and other projects understand
  3. Mozilla has a basic tool called main.lang checker, which can show any untranslated files to the localizer
  4. No need to compile to a .mo file, so localizers can see their changes more quickly
  5. Produces simple diffs
  6. .lang files are cached, which reduces any slowdown

Disadvantages to .lang

  1. No plural forms
  2. No context for localizers unless you provide good comments
  3. No styling by localizers, even when it is needed
  4. May be slower because the file is not compiled into a binary
  5. Not used as a standard by any other localization project
  6. No tools to validate syntax, so a localizer may introduce accidental errors that cause breakage (the level of breakage depends on the level of error)
  7. Cannot be edited with a .po editor, which most localizers know and love

gettext (.po)

Gettext is a widely-used localization format that uses .po files. With this arrangement, content for localization is presented in the following manner:

 msgid "coming soon"
 msgstr "Bientôt disponible"

where the value in quotes after msgid is the English content, and the value in quotes after msgstr is the translation. A msgstr can be much longer than the one above. For example, below is the entire introduction used for a certificate that was presented to those who downloaded Firefox during the Download Day campaign. The French content runs three lines:

 msgid "certificate_intro"
 msgstr ""
 "Merci de nous avoir aidé à établir ce record du monde ! "
 "Allez-y et montrez-le en téléchargeant et en imprimant votre "
 "certificat personnalisé Firefox 3 Download Day."

Advantages of gettext

  1. gettext has very powerful tools for updating the site's strings (provided you use the actual English strings as msgids, not unique identifiers like certificate_intro)
  2. Very well established, with a large set of powerful tools
  3. Harder to break things, because existing tools will not let localizers edit the l10n file where they shouldn't
  4. Separates the strings available to localizers from the rest of the code, protecting the code from unintended changes

Disadvantages of gettext

  1. The .po file needs to be compiled into a .mo file before the localizer can see changes
  2. Using a regular diff to see changes to a file is sometimes impossible, because the editing tool can save the .po file with a completely different structure or order of entities.
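
When a text diff is unusable because an editor reordered the file, one workaround is to compare entries rather than lines. The following is a rough Python sketch under simplifying assumptions: it only recognizes single-line msgid/msgstr pairs, and the names po_entries and changed_entries, along with the two sample files, are made up for illustration.

```python
import re

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def po_entries(text):
    """Return a dict of msgid -> msgstr for simple single-line entries."""
    return dict(ENTRY.findall(text))


def changed_entries(old_text, new_text):
    """Compare two .po texts by entry, ignoring the order of entries."""
    old, new = po_entries(old_text), po_entries(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical "before" file: one string is still untranslated.
old_po = (
    'msgid "Getting Started"\n'
    'msgstr "Débuter avec Firefox"\n'
    '\n'
    'msgid "coming soon"\n'
    'msgstr ""\n'
)

# Hypothetical "after" file: same entries, reordered by the editing tool,
# with the missing translation filled in.
new_po = (
    'msgid "coming soon"\n'
    'msgstr "Bientôt disponible"\n'
    '\n'
    'msgid "Getting Started"\n'
    'msgstr "Débuter avec Firefox"\n'
)

print(changed_entries(old_po, new_po))
```

Here a line-based diff would report every line as moved, while the entry-based comparison reports only "coming soon" as changed.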

Read more about gettext on Wikipedia and on MDC.

Wiki

Blogging, documentation, and other types of Mozilla content often surface as wikis.

Case study: Download Day

In the above Gettext example, notice how the web developer used "certificate_intro" as the value of the msgid. This is not the actual content that was translated. So, if a localizer wanted to use one of the many powerful Gettext tools, like po-editor, the msgid provides NO CONTEXT for translation or for other localizers to verify translations when QA-ing. This should have been avoided.

In the case of Download Day, someone created entity-like identifiers in the msgid, which we have shown above with the "certificate_intro" key. Then, an en-US repository was created holding the translations to all the entity-like values of msgid. This is very non-standard because it avoids one of the obvious powers of Gettext. When English content is used as the value of the msgid, there is no need to place that content in a special repository.

But in the Download Day example, whenever changes were made to en-US, the web developer had to push those changes to the repositories of all the locales. Localizers had to revisit the en-US repository to find the exact msgid, review the change, and return to their own repository to make changes.

Without the exact content as the msgid, this process invites errors, since localizers have to continually switch back and forth. In this case, the choice to use customized values for msgid was error-prone, onerous, and unfamiliar to localizers who are used to more customary Gettext operations.
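
The need for a separate en-US repository follows from how gettext lookups behave: when no translation is found, the msgid itself is returned. The short sketch below shows this with Python's standard gettext module (used here for illustration; the Download Day site was not written in Python). NullTranslations is a real class with an empty catalog, standing in for a locale whose translation is missing.

```python
import gettext

# NullTranslations has no catalog, so gettext() returns the msgid unchanged.
untranslated = gettext.NullTranslations()

english_msgid = "coming soon"            # English text used as the msgid
identifier_msgid = "certificate_intro"   # entity-like key, as in Download Day

print(untranslated.gettext(english_msgid))     # coming soon
print(untranslated.gettext(identifier_msgid))  # certificate_intro
```

With English text as the msgid, a missing translation degrades to readable English. With an entity-like key, it degrades to the raw identifier, so even en-US needs its own catalog.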