Reading textual data - Archive of obsolete content

This article describes how to read textual data from streams, files and sockets.

In order to read textual data, you need to know which character encoding the data is in. Files and network sockets contain bytes, not characters - to give these bytes a meaning, you need to know the character encoding.
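To illustrate why the encoding matters, the following sketch decodes the same five bytes in two different ways. It is plain JavaScript rather than XPCOM: it assumes a runtime that provides the standard TextDecoder, and decodeLatin1 is a hypothetical helper written only for this example.

```javascript
// The bytes of "café" encoded as UTF-8 (é is the two bytes 0xC3 0xA9).
const bytes = new Uint8Array([0x63, 0x61, 0x66, 0xC3, 0xA9]);

// Hypothetical helper: ISO-8859-1 maps every byte directly to the
// code point with the same numeric value.
function decodeLatin1(byteArray) {
  let out = "";
  for (const b of byteArray) {
    out += String.fromCharCode(b);
  }
  return out;
}

// Same bytes, two interpretations.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);
const asLatin1 = decodeLatin1(bytes);
```

Read as UTF-8 the bytes yield four characters; read as ISO-8859-1 they yield five, with the last character replaced by two unrelated ones.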

XXX: Document nsIUnicharStreamListener (gecko 1.8) XXX: Also document nsIStreamListener here?

Determining the character encoding of data

If you have a network channel (nsIChannel), you can try its contentCharset property. Note that not all channels know the character encoding of their data. In that case you can fall back to the default character encoding stored in preferences (intl.charset.default, a localized pref value).
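The fallback order can be sketched as a small function. chooseCharset is a hypothetical helper, not part of any Mozilla API; it only assumes that the channel object exposes a contentCharset property, and that the caller has already obtained the default encoding from preferences by some other means.

```javascript
// Hypothetical helper: prefer the charset reported by the channel and
// fall back to a caller-supplied default (for example, the value read
// from the intl.charset.default preference).
function chooseCharset(channel, fallbackCharset) {
  let charset = "";
  try {
    charset = channel.contentCharset;
  } catch (e) {
    // Assumption: a getter that throws is treated like "unknown".
  }
  return charset ? charset : fallbackCharset;
}
```

A call such as chooseCharset(channel, defaultCharset) then always returns a usable encoding name, as long as the fallback itself is not empty.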

When reading from a file, the question is harder to answer. Using the system character encoding may work (XXX insert text how to get it), as may falling back to the default character encoding from preferences.

Converting read data

If you read data from nsIScriptableInputStream as described on the file I/O code snippets page, you can convert it to UTF-8:

// sstream is nsIScriptableInputStream
var str = sstream.read(4096);
var utf8Converter = Components.classes["@mozilla.org/intl/utf8converterservice;1"].
    getService(Components.interfaces.nsIUTF8ConverterService);
var data = utf8Converter.convertURISpecToUTF8(str, "UTF-8");

Gecko 1.8 and newer

Reading strings

Starting with Gecko 1.8 (SeaMonkey 1.0, Firefox 1.5), you can use nsIConverterInputStream to read strings from a stream (nsIInputStream). This work was done in bug 295047.

Usage:

var charset = /* Need to find out what the character encoding is. Using UTF-8 for this example: */ "UTF-8";
const replacementChar = Components.interfaces.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER;
var is = Components.classes["@mozilla.org/intl/converter-input-stream;1"]
                   .createInstance(Components.interfaces.nsIConverterInputStream);
is.init(fis, charset, 1024, replacementChar);

Now you can read strings from is:

var str = {};
var numChars = is.readString(4096, str);
if (numChars != 0 /* EOF */)
  var read_string = str.value;

To read the entire stream and do something with the data:

var str = {};
while (is.readString(4096, str) != 0) {
  processData(str.value);
}

Don't forget to close the stream when you're done with it (is.close()). Not doing so can cause problems if you try to rename or delete the file at a later time on some platforms.

Note that you may get fewer characters than you asked for, especially (but not only) at the end of the file (stream).
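The points about closing the stream and about short reads can be combined into one helper. This is a minimal sketch: readAllText is a hypothetical name, and the stream argument is only assumed to behave like an initialized nsIConverterInputStream (readString returns the number of characters read, 0 at EOF).

```javascript
// Hypothetical helper. 'stream' is assumed to behave like an initialized
// nsIConverterInputStream: readString(count, out) stores the text in
// out.value and returns the number of characters read, 0 at EOF.
function readAllText(stream) {
  const parts = [];
  const out = {};
  try {
    // A short read is not EOF; only a return value of 0 ends the loop.
    while (stream.readString(4096, out) != 0) {
      parts.push(out.value);
    }
  } finally {
    // Close the stream even if readString throws.
    stream.close();
  }
  return parts.join("");
}
```

Collecting the chunks in an array and joining them once avoids relying on any single read returning the full 4096 characters.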

Unsupported byte sequences

You can specify what should happen with byte sequences that do not correspond to a valid character. The last (4th) argument to init specifies which character they get replaced with; nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER is U+FFFD REPLACEMENT CHARACTER, which is often a good choice.

If you do not want any replacement, you can specify 0x0000 as replacement character; that way, readString will throw an exception when reaching unsupported bytes.
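The same two policies exist in the standard TextDecoder API, which makes them easy to try outside XPCOM. The sketch below is an analogy only and does not use nsIConverterInputStream; it assumes a runtime that provides TextDecoder.

```javascript
// 0x61 is "a" and 0x62 is "b"; 0xFF can never appear in well-formed UTF-8.
const input = new Uint8Array([0x61, 0xFF, 0x62]);

// Policy 1: substitute U+FFFD for the malformed byte.
const lenient = new TextDecoder("utf-8").decode(input);

// Policy 2: refuse to decode malformed input at all.
let strictError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(input);
} catch (e) {
  strictError = e;
}
```

With replacement the caller still gets the surrounding text; with the strict policy the caller has to handle the exception and decide what to do with the data.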

Reading lines

The nsIUnicharLineInputStream interface provides an easy way to read entire lines from a unichar stream. It can be used like nsILineInputStream, except that it supports non-ASCII characters, and has no problems with charsets with embedded nulls (like UTF-16 and UTF-32).

It can be used like this:

var charset = /* Need to find out what the character encoding is. Using UTF-8 for this example: */ "UTF-8";
var is = Components.classes["@mozilla.org/intl/converter-input-stream;1"]
                   .createInstance(Components.interfaces.nsIConverterInputStream);
// This assumes that fis is the nsIInputStream you want to read from
is.init(fis, charset, 1024, 0xFFFD);
is.QueryInterface(Components.interfaces.nsIUnicharLineInputStream);

if (is instanceof Components.interfaces.nsIUnicharLineInputStream) {
  var line = {};
  var cont;
  do {
    cont = is.readLine(line);

    // Now you can do something with line.value
  } while (cont);
}

The above example reads an entire stream until EOF. See nsIConverterInputStream for nsIConverterInputStream.init() arguments.
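The do...while shape matters because the call that reports "no more lines" still delivers the last line. A minimal sketch of a helper that collects every line follows; readAllLines is a hypothetical name, and lineStream is only assumed to follow the readLine contract used in the example above.

```javascript
// Hypothetical helper. 'lineStream' is assumed to behave like
// nsIUnicharLineInputStream: readLine(out) stores one line in out.value
// and returns false once no further lines follow.
function readAllLines(lineStream) {
  const lines = [];
  const out = {};
  let more;
  do {
    more = lineStream.readLine(out);
    // The final call still delivers a line, so collect before testing.
    lines.push(out.value);
  } while (more);
  return lines;
}
```

Under these assumptions a plain while loop that tests the return value first would silently drop the final line.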

Earlier versions

Reading strings

Earlier versions of Gecko do not provide easy ways to read Unicode data from a stream. You will have to manually read a block of data and convert it using nsIScriptableUnicodeConverter.

For example:

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The character encoding you want, using UTF-8 here */ "UTF-8";

// Now, read from the stream
// This assumes istream is the stream you want to read from
var scriptableStream = Components.classes["@mozilla.org/scriptableinputstream;1"]
                                 .createInstance(Components.interfaces.nsIScriptableInputStream);
scriptableStream.init(istream);
var chunk = scriptableStream.read(4096);
var text = converter.ConvertToUnicode(chunk);

However, you must be aware that this method will not work for character encodings that have embedded null bytes, such as UTF-16 or UTF-32.
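To see where the embedded null bytes come from, the sketch below encodes a short ASCII string as UTF-16LE. toUtf16LeBytes is a hypothetical helper, and the truncation at the first null byte is an assumption about one plausible failure mode, not a description of what the XPCOM code actually does.

```javascript
// Hypothetical helper: encode each UTF-16 code unit of a string as two
// bytes, low byte first (UTF-16LE).
function toUtf16LeBytes(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes.push(unit & 0xFF, unit >> 8);
  }
  return bytes;
}

// "Hi" becomes 0x48 0x00 0x69 0x00: every ASCII character carries a null byte.
const utf16Bytes = toUtf16LeBytes("Hi");

// Assumption: an API that treats a null byte as a terminator keeps only this.
const firstNull = utf16Bytes.indexOf(0);
const survivingBytes = utf16Bytes.slice(0, firstNull);
```

Only one of the four bytes survives in this scenario, which is not enough to decode even the first character.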

Reading lines

There is no easy, general way to read a Unicode line from a stream.

For the limited use case of reading lines from a local file, the following code using nsIScriptableUnicodeConverter works. This code will not work for character encodings that contain embedded nulls, such as UTF-16 and UTF-32.

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The character encoding you want, using UTF-8 here */ "UTF-8";

// This assumes that 'file' is a variable that contains the file you want to read, as an nsIFile
var fis = Components.classes["@mozilla.org/network/file-input-stream;1"]
                    .createInstance(Components.interfaces.nsIFileInputStream);
fis.init(file, -1, -1, 0);

var lis = fis.QueryInterface(Components.interfaces.nsILineInputStream);
var lineData = {};
var cont;
do {
  cont = lis.readLine(lineData);
  var line = converter.ConvertToUnicode(lineData.value);

  // Now you can do something with line

} while (cont);
fis.close();
