Mork

Mork is a database file format invented by David McCusker for the Mozilla code, because the original Netscape database information was proprietary and could not be released as open source. Starting with Mozilla 1.9, it was phased out in favor of SQLite, a more widely supported file format.

The information on this page was constructed by reading the source code of the mork database in Mozilla and attempting to codify what it parses as faithfully as possible. In a few cases, constructs that the parser supported but were never used in Mozilla's code are left unspecified.

Mork Structure

Mork is a schema-less database format. At its core, it can be viewed as a set of rows (each a collection of name-value pairs) that can be organized into various tables. Rows need not be in a table, nor need they be in only one table. Values are merely an opaque sequence of bytes, so their actual content is dependent on the mork consumer.

Except when parsing values, whitespace ('\b', '\t', '\r', '\n', '\f', ' ', and '\x7f'), line continuations ('\\' followed by a newline), and comments (C++ or C style) can be ignored. In practice, only C++-style comments, and the standard whitespace characters ('\t', ' ', '\r', '\n') appear to be used.

The grammar for mork is as follows:

file = header ( dictionary | table | group | row )*
header = '// <!-- <mdb:mork:z v="1.4"/> -->'

dictionary = '<' ( metaDictionary | alias )* '>'
table = '{' '-'? mid ( metaTable | row | '-' | rowReference)* '}'
row = '[' '-'? mid ( cell | metaRow | '-' )* ']'
group = '@$${' hex '{@' (dictionary|table|row)* ('@$$}' hex '}@' | '@$$}~~}@')

metaDictionary = '<' cell* '>'
metaTable = '{' ( cell | row | rowReference )* '}'
metaRow = '[' cell* ']'
rowReference = hex ( '!' hex )?

cell = '(' ( '^' mid | name ) ( '=' value | '^' mid ) ')'
alias = '(' hex '=' value ')'
mid = hex ( ':' ( name | '^' hex ) )?

hex = [0-9A-Fa-f]+
name = [A-Za-z_?:] [A-Za-z_?:+-]*
value is a string terminated by ')' (not consumed) where '\' quotes the next metacharacter and
  '$' quotes the next two hexadecimal digits as a value (e.g., $20 is a space)

The first line in the file is the header. It is structured as a comment so it can be safely ignored by the parser. The header is:

// <!-- <mdb:mork:z v="1.4"/> -->

This file is therefore version 1.4 of the Zany Mork format for the Message Database. Versions 1.0-1.3 were never used in any publicly-available source code. Other serializations of mork have also never been used, so the Zany Mork version 1.4 format has become the only official Mork format.
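
Because only one header was ever used, a reader can treat it as a quick format check. The following Python sketch assumes the first line must match the header exactly, apart from surrounding whitespace and the line ending; the function name is illustrative and not part of any Mozilla API.

```python
MORK_HEADER = '// <!-- <mdb:mork:z v="1.4"/> -->'


def is_mork_1_4(text: str) -> bool:
    """Return True if the first line of `text` is the Zany Mork 1.4 header."""
    # Take everything up to the first newline and drop a trailing '\r' or spaces.
    first_line = text.split("\n", 1)[0].strip()
    return first_line == MORK_HEADER
```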

After that, the lexical structure of mork is somewhat simple. The file is a collection of top-level structures, of which there are four: dictionaries, tables, rows, and groups (changesets). Dictionaries, tables, and rows all have corresponding metastructures.

Meta-dictionaries are in practice only used to change the scope of alias definitions (the '(a=c)' that you can see in files). Meta-tables are used to establish some facts about the table as well as the default row scope. Meta-rows do not appear to be used at all, although the parser seems to consider setting the charset, row scope, and atom scope.


Some notes on further terminology:

  • When referring to cells or aliases, the first component is the key and the second component is the value.
  • Expanding refers to taking a '^' hex buffer and replacing it with the alias value for the hex key (explained below).
  • A mid has two components: the ID itself, to the left of the colon, and the scope, to the right of the colon.
  • There are two types of meta-rows, those of rows and those of tables. However, since the meta-row of a row never appears to be used, I will generally use 'meta-row' to refer to the meta-row of a table, unless otherwise qualified.

Dictionaries

A dictionary establishes a series of aliases. Inside the meta-dictionary, the only cell we care about is the one whose key is 'a' (for atom scope). Its value establishes which dictionary the aliases are added to. Most files will have just two dictionaries: the column scope dictionary ('c') and the value scope dictionary ('v', the default).

The keys are hexadecimal numbers starting at 0x80, because the values less than 0x80 are theoretically their representative ASCII values.
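
As a rough illustration, the Python sketch below collects the aliases of one dictionary into a per-scope mapping. It is deliberately simplified and makes several assumptions: the text contains no comments, a meta-dictionary applies to every alias in the dictionary, and values are stored raw, without '\' or '$' decoding. The names parse_dictionary, ALIAS_RE, and META_RE are made up for this example.

```python
import re

# An alias is '(' hex '=' value ')'; a backslash may quote a ')' in the value.
ALIAS_RE = re.compile(r"\(([0-9A-Fa-f]+)=((?:\\.|[^)\\])*)\)", re.DOTALL)
# A meta-dictionary such as '<(a=c)>' selects the atom scope.
META_RE = re.compile(r"<\s*\(a=([^)]*)\)\s*>")


def parse_dictionary(text, dictionaries):
    """Add the aliases of one '<...>' dictionary to dictionaries[scope]."""
    meta = META_RE.search(text)
    scope = meta.group(1) if meta else "v"  # 'v' is the default scope
    # Remove the meta-dictionary first: '(a=c)' would otherwise look like
    # an alias, since 'a' is also a valid hexadecimal key.
    body = META_RE.sub("", text)
    target = dictionaries.setdefault(scope, {})
    for key, value in ALIAS_RE.findall(body):
        target[int(key, 16)] = value
    return scope


dictionaries = {}
parse_dictionary("< <(a=c)> (80=DisplayName)(81=PrimaryEmail) >", dictionaries)
parse_dictionary("<(90=Alice)>", dictionaries)
```

After these two calls, dictionaries["c"] holds the column names under keys 0x80 and 0x81, and dictionaries["v"] holds the value alias under 0x90.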

Tables

A table is a set of rows, potentially with a meta-row. The table id is specified using the mid at the beginning of the table. Furthermore, the scope of the mid also serves as the default scope of rows defined within the table.

The meta-row of the table (different from the meta-row of a row) contains a few things you might care about. The first is the kind column (k), where the mid is typically column-scoped (:c). Next is the status column (s), which is defined as a single digit that is the priority (which appears to be unused), a 'u' if it is unique (i.e., the only table of its kind), and a 'v' if it is verbose (which also appears to be unused).
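
The status column can be picked apart with a few lines of Python. This is only a sketch of the description above: the function name and the returned dictionary are invented for illustration, and unknown characters are silently ignored, which is an assumption rather than documented parser behavior.

```python
def parse_table_status(status: str) -> dict:
    """Split a table status value such as '9u' into its three parts."""
    result = {"priority": None, "unique": False, "verbose": False}
    for ch in status:
        if ch in "0123456789":
            result["priority"] = int(ch)  # single-digit priority
        elif ch == "u":
            result["unique"] = True       # only table of its kind
        elif ch == "v":
            result["verbose"] = True
    return result
```

For example, parse_table_status("9u") reports priority 9, unique set, and verbose unset.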

The presence of a row or rowReference in the meta-table indicates that said row is the meta-row of the table: it contains arbitrary properties for the entire table just like a normal row. For example, the message summary format files use a table for each thread, where the meta-rows represent information about the thread in general.

The last important point about tables is the meaning of the minuses. A minus before the table ID indicates that all the rows currently stored in the table should be removed before adding more rows. A minus sign before a row or rowReference indicates that said row should be removed from the table.

Rows

For the most part, a row should not be seen outside of a table; the only time I have seen one outside a table is for table meta-rows. The first mid, as you should expect by now, is the row ID, with the scope being an important part of the ID, in that 1:scope1 is not the same row as 1:scope2. If the scope is not specified, then it is the default scope of the table (if it has one) or 'r' otherwise. These row IDs are global, even if defined in a table. Indeed, defining a row in a table is a shortcut for both defining a row and adding the row to the table.

The presence of an initial minus means to delete all cells before adding new cells. A minus after the mid indicates that the next cell should be deleted. I haven't seen that case though, so it may not be an issue in practice.

The rowReference is a mid that refers to a specific row. If it occurs in a table, you can treat it as adding that row to a table; in a meta-table, it is the same as making that row the meta-row of the table.
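
The relationship between global rows, table membership, and the minus signs can be modelled with a small in-memory store. The Python sketch below is hypothetical: the class and method names are invented here, rows and tables are keyed by (scope, id) tuples, and the rarely seen "delete the next cell" minus inside a row is not modelled.

```python
class MorkStore:
    """Illustrative model of rows and tables; not Mozilla's data structures."""

    def __init__(self):
        self.rows = {}    # (scope, id) -> {column: value}, global to the file
        self.tables = {}  # (scope, id) -> list of row keys, in insertion order

    def define_row(self, row_key, cells, clear_first=False):
        """Update a row; clear_first models the minus before the row ID."""
        row = self.rows.setdefault(row_key, {})
        if clear_first:
            row.clear()
        row.update(cells)

    def add_to_table(self, table_key, row_key):
        """Model a rowReference inside a table: membership only."""
        members = self.tables.setdefault(table_key, [])
        if row_key not in members:
            members.append(row_key)

    def define_row_in_table(self, table_key, row_key, cells, clear_first=False):
        """A row written inside a table both defines it and adds it."""
        self.define_row(row_key, cells, clear_first)
        self.add_to_table(table_key, row_key)

    def remove_from_table(self, table_key, row_key):
        """Model a minus before a row in a table; the row itself survives."""
        members = self.tables.get(table_key, [])
        if row_key in members:
            members.remove(row_key)

    def clear_table(self, table_key):
        """Model a minus before the table ID: drop all current members."""
        self.tables[table_key] = []
```

Keeping rows in one global dictionary is what lets the same row appear in several tables, or in none.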

Groups, a.k.a. Changesets

A group is theoretically an atomic transaction à la SQL's transactions. The equivalent of BEGIN TRANSACTION is the '@$${hex{@' marker (the hex represents the id). COMMIT TRANSACTION is '@$$}hex}@' and ROLLBACK TRANSACTION is '@$$}~~}@'. A brief survey of Mozilla code implies that an aborted transaction is only used if mork loses internal consistency.
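
A naive way to separate committed groups from aborted ones is a regular expression over the raw text, as in the Python sketch below. It assumes that the group markers never occur inside a cell value and that a closing id which does not match the opening id should be treated like an abort; both are assumptions of this example, not statements about Mozilla's parser.

```python
import re

# '@$${' hex '{@' body, closed by '@$$}' hex '}@' (commit) or '@$$}~~}@' (abort).
GROUP_RE = re.compile(
    r"@\$\$\{([0-9A-Fa-f]+)\{@(.*?)@\$\$\}(?:([0-9A-Fa-f]+)\}@|~~\}@)",
    re.DOTALL,
)


def committed_groups(text):
    """Return (id, body) for every group that ends with a matching commit."""
    result = []
    for match in GROUP_RE.finditer(text):
        start_id, body, end_id = match.groups()
        # end_id is None when the group was closed by the abort marker.
        if end_id is not None and int(end_id, 16) == int(start_id, 16):
            result.append((int(start_id, 16), body))
    return result
```

Only the bodies returned here would then be handed to the dictionary, table, and row parsing; aborted groups are skipped.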

Cells and Aliases and Mids

Cells are the core of mork: they are key-value pairs of columns and the particular row's values. You can either manually refer to the key or use an alias to the key, and you can either manually refer to the value or use an alias to the value.

The scope of the mid for keys defaults to the column scope, 'c', and the scope for values defaults to the value scope, 'v'. These defaults are used if the scope is not specified (by having the ':' then the scope).
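
To make the four cell forms concrete, the Python sketch below resolves a single cell against the column and value dictionaries. It is a simplification: explicit scopes on alias references (such as '^80:c') are not handled, literal values are returned undecoded, and the dictionaries argument is assumed to map a scope name to a {number: text} mapping. The names expand_cell and CELL_RE are illustrative.

```python
import re

# cell = '(' ( '^' hex | name ) ( '=' value | '^' hex ) ')'
CELL_RE = re.compile(
    r"\((?:\^([0-9A-Fa-f]+)|([A-Za-z_?:][A-Za-z_?:+-]*))"
    r"(?:=((?:\\.|[^)\\])*)|\^([0-9A-Fa-f]+))\)",
    re.DOTALL,
)


def expand_cell(cell, dictionaries):
    """Return (column, value) for one cell, expanding '^' hex aliases."""
    match = CELL_RE.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a simple cell: {cell!r}")
    key_alias, key_name, literal, value_alias = match.groups()
    # Keys default to the column scope 'c', values to the value scope 'v'.
    if key_alias is not None:
        column = dictionaries["c"][int(key_alias, 16)]
    else:
        column = key_name
    if value_alias is not None:
        value = dictionaries["v"][int(value_alias, 16)]
    else:
        value = literal
    return column, value


aliases = {"c": {0x80: "DisplayName"}, "v": {0x90: "Alice"}}
pair = expand_cell("(^80^90)", aliases)
```

With the aliases above, '(^80^90)' and '(^80=Alice)' both resolve to the same column and value.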

The value of a mork cell is treated as nothing more than a char*, char here in the sense of an octet rather than a string character. Most of the time, however, regular 7-bit ASCII data is put in the cell. Where integer data is used, the convention is to use and store the hexadecimal value for the output.

Aliases are how dictionary values are stored. Their values follow the same semantics as cell values, except that an alias value cannot refer to a mid.

The values of cells and aliases, when not specified by a mid reference, are one of the cases where whitespace is significant. Using the '\' character quotes the next character; at the end of the line, it represents a continuation such as that found in C/C++ code. The '$' character quotes the next two characters (which must be valid hexadecimal digits) and these three in total represent an octet with the value of the digits.
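
These quoting rules translate into a short decoder. The Python sketch below works on the raw bytes between the '=' and the terminating ')' (finding that terminator is left to the caller) and assumes that a backslash followed by a line ending, either '\n' or '\r\n', contributes nothing to the value. The function name is made up for this example.

```python
import re

HEX_PAIR = re.compile(rb"[0-9A-Fa-f]{2}")


def decode_value(raw: bytes) -> bytes:
    """Decode '\\' and '$' quoting in a raw cell or alias value."""
    out = bytearray()
    i = 0
    while i < len(raw):
        ch = raw[i:i + 1]
        if ch == b"\\":
            if raw[i + 1:i + 3] == b"\r\n":
                i += 3                      # continuation with CRLF
            elif raw[i + 1:i + 2] == b"\n":
                i += 2                      # continuation with LF
            else:
                out += raw[i + 1:i + 2]     # quoted character, kept as-is
                i += 2
        elif ch == b"$":
            digits = raw[i + 1:i + 3]
            if HEX_PAIR.fullmatch(digits) is None:
                raise ValueError("'$' must be followed by two hex digits")
            out.append(int(digits.decode("ascii"), 16))
            i += 3
        else:
            out += ch
            i += 1
    return bytes(out)
```

The result stays a byte string, matching the view of values as opaque octets; interpreting it as text in some charset is up to the consumer.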

Mids are unique identifiers represented by the tuple (scope, id); if the ids are the same but the scopes are different, the two mids are different. In many places, the mid's scope has an assumed default value; if the scope component is not present, it assumes this default value.

Morkreader usage

Morkreader is provided as static libraries with both internal and external linkage.

The API that is presented is simple. You pass in a file to ParseFile, select the tables, and then pass in callback functions for the meta-row and row enumeration functions. GetTableIDs is provided for convenience's sake, and is mostly expected to be used only by the morkreader tests. All data stored internally will be destroyed when the MorkReader goes out of scope. More detail about individual functions can be found by reading the documentation included in MorkReader.h.

The test directory contains a program called TestMorkReader. If you just wish to display a mork file, most of the code in that program can be copied for the purpose.