Auto-Detecting Text Encoding

As explained in Text Encoding and .NET, determining the encoding of a text file can be tricky. The file may or may not start with a Byte Order Mark (BOM). If it does, its encoding is immediately clear. If it doesn't, there's not much we can do: the file could be encoded as UTF-8 or ANSI, and we have to choose between them.

Byte Order Marks a.k.a. Preambles

Fortunately, the .NET Framework helps us look for a Byte Order Mark through the System.Text.Encoding.GetEncodings() method. The result is an array of EncodingInfo objects. Each of those has a GetEncoding() method, returning the corresponding System.Text.Encoding object. Calling GetPreamble() on that gives us a byte array that contains the Byte Order Mark, called a preamble in .NET-speak.

So it's relatively easy to iterate over all supported encodings, selecting those that have a preamble:

For Each Info As EncodingInfo In Encoding.GetEncodings()
  Dim Preamble() As Byte = Info.GetEncoding().GetPreamble()
  If Preamble.Length > 0 Then
    ' We have an Encoding with a Byte Order Mark
  End If
Next Info

Next, of course, we need to test the contents of a file to see if it starts with one of the known preambles. To do so, we construct a list of encodings that have a preamble, at the same time recording the length of the longest preamble in the list. To determine the encoding of a file, we read its first few bytes (as many as the longest preamble) and compare them to the known preambles. The trick is to test the longest preambles first: the UTF-32 little-endian preamble (FF FE 00 00) starts with the UTF-16 little-endian preamble (FF FE), so testing the shorter one first would misidentify UTF-32 files as UTF-16.
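The steps above can be sketched in C# roughly as follows. This is a minimal illustration and not the actual EncodingDetector source: the class name PreambleSketch and its members Match and MaxPreambleLength are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical sketch; not the real EncodingDetector class.
static class PreambleSketch
{
    // All encodings with a non-empty preamble, longest preamble first.
    static readonly List<Encoding> WithPreamble = Encoding.GetEncodings()
        .Select(info => info.GetEncoding())
        .Where(enc => enc.GetPreamble().Length > 0)
        .OrderByDescending(enc => enc.GetPreamble().Length)
        .ToList();

    // Number of bytes we need from the start of a file to test every preamble.
    public static int MaxPreambleLength =>
        WithPreamble.Count == 0 ? 0 : WithPreamble[0].GetPreamble().Length;

    // Returns the encoding whose preamble starts the byte array, or null if none does.
    public static Encoding Match(byte[] bytes)
    {
        foreach (Encoding enc in WithPreamble)
        {
            byte[] preamble = enc.GetPreamble();
            if (bytes.Length >= preamble.Length &&
                bytes.Take(preamble.Length).SequenceEqual(preamble))
            {
                return enc;
            }
        }
        return null;
    }
}
```

Because the list is sorted by descending preamble length, a buffer starting with FF FE 00 00 is tested against UTF-32 before UTF-16.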

That's not rocket science: it's EncodingDetector.

MOBZystems.Text.EncodingDetector

This class delivers three static methods:

public static string ReadAllText(
  string filename,
  Encoding defaultEncoding,
  out Encoding usedEncoding
)

and

public static Encoding DetectEncoding(byte[] bytes)

and

public static Encoding DetectEncoding(string filename)

ReadAllText() returns the contents of a text file, much like System.IO.File.ReadAllText(), except that this version expects the default encoding to use when there's no Byte Order Mark in the file, and returns the encoding that was used to decode the file in the out-parameter usedEncoding. Normally, you'd supply Encoding.Default (for ANSI) or Encoding.UTF8 (for UTF-8) as the default encoding.

For large files, this method is inefficient: it reads the whole file into a byte array, determines the encoding and uses its Encoding.GetString() method to create the string. This uses approximately two to three times as much memory as there are bytes in the file.

Internally, ReadAllText() uses the first overload of DetectEncoding(), which takes a byte array as its only argument. The start of the byte array is compared to the known preambles and if a match is found, the corresponding encoding is returned. If no BOM is found, the method returns null.
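To make the flow concrete, here is a rough sketch of how a method with this signature could work. It is not the actual implementation: the class name ReadAllTextSketch is hypothetical, and skipping the preamble bytes before decoding is an assumption on my part (Encoding.GetString() does not remove a BOM by itself).

```csharp
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical sketch; not the real EncodingDetector class.
static class ReadAllTextSketch
{
    // Compact preamble matcher: longest preambles are tried first.
    static Encoding DetectEncoding(byte[] bytes) =>
        Encoding.GetEncodings()
            .Select(info => info.GetEncoding())
            .Where(enc => enc.GetPreamble().Length > 0)
            .OrderByDescending(enc => enc.GetPreamble().Length)
            .FirstOrDefault(enc => bytes.Take(enc.GetPreamble().Length)
                                        .SequenceEqual(enc.GetPreamble()));

    public static string ReadAllText(
        string filename, Encoding defaultEncoding, out Encoding usedEncoding)
    {
        // The whole file is held in memory, plus the decoded string.
        byte[] bytes = File.ReadAllBytes(filename);

        Encoding detected = DetectEncoding(bytes);
        usedEncoding = detected ?? defaultEncoding;

        // Assumption: skip the BOM so it does not end up in the returned text.
        int skip = detected == null ? 0 : detected.GetPreamble().Length;
        return usedEncoding.GetString(bytes, skip, bytes.Length - skip);
    }
}
```

The byte array and the resulting string (two bytes per character in .NET) exist at the same time, which is where the memory overhead for large files comes from.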

If you just need to know the encoding of a file, without reading its entire contents, use the second overload of DetectEncoding(). This takes a filename as its argument, and reads in just enough data to distinguish the BOM, if any. This method also returns null if no matching Byte Order Mark was found. You could also use it to determine the encoding of a file before reading it with System.IO.File.ReadAllText(), this time supplying the detected encoding.
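A filename-based detector along these lines might look like the sketch below. Again, this is an illustration under assumptions rather than the real source: FileHeadSketch and ReadWithDetectedEncoding are hypothetical names, and the fallback encoding is whatever the caller chooses.

```csharp
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical sketch; not the real EncodingDetector class.
static class FileHeadSketch
{
    // Encodings with a preamble, longest preamble first.
    static readonly Encoding[] Candidates = Encoding.GetEncodings()
        .Select(info => info.GetEncoding())
        .Where(enc => enc.GetPreamble().Length > 0)
        .OrderByDescending(enc => enc.GetPreamble().Length)
        .ToArray();

    public static Encoding DetectEncoding(string filename)
    {
        int max = Candidates.Length == 0 ? 0 : Candidates[0].GetPreamble().Length;
        byte[] head = new byte[max];
        int total = 0;

        // Read at most 'max' bytes; the file may be shorter than that.
        using (FileStream stream = File.OpenRead(filename))
        {
            int read;
            while (total < max && (read = stream.Read(head, total, max - total)) > 0)
                total += read;
        }

        foreach (Encoding enc in Candidates)
        {
            byte[] preamble = enc.GetPreamble();
            // 'total' guards against matching the zero padding of a short file.
            if (total >= preamble.Length &&
                head.Take(preamble.Length).SequenceEqual(preamble))
            {
                return enc;
            }
        }
        return null;
    }

    // Detect first, then let File.ReadAllText() do the actual decoding.
    public static string ReadWithDetectedEncoding(string filename, Encoding fallback)
    {
        Encoding encoding = DetectEncoding(filename) ?? fallback;
        return File.ReadAllText(filename, encoding);
    }
}
```

Only the first few bytes are read for detection, so this stays cheap even for very large files.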

EncodingDetector keeps track of the supported encodings automatically, creating the list once when needed.

Source code

The full source code of EncodingDetector is available here.