Auto-Detecting Text Encoding
As explained in Text Encoding and .NET, determining the encoding of a text file can be tricky. Basically, the file may or may not start with a Byte Order Mark (BOM). If it does, its encoding is immediately clear. If it doesn't, there's not much we can do: the file could be encoded using UTF-8 or ANSI, and we need to make a choice between them.
Byte Order Marks a.k.a. Preambles
Fortunately, the .NET Framework helps us look for a Byte Order Mark through the System.Text.Encoding.GetEncodings() method. The result is an array of EncodingInfo objects. Each of those has a GetEncoding() method, returning the corresponding System.Text.Encoding object. Calling GetPreamble() on that gives us a byte array that contains the Byte Order Mark, called a preamble in .NET-speak.
So it's relatively easy to iterate over all supported encodings, selecting those that have a preamble:
For Each Info As EncodingInfo In Encoding.GetEncodings()
    Dim Preamble() As Byte = Info.GetEncoding().GetPreamble()
    If Preamble.Length > 0 Then
        ' We have an Encoding with a Byte Order Mark
    End If
Next
Next, of course, we need to test the contents of a file to see if it starts with one of the known preambles. To do so, we construct a list of encodings with a preamble, at the same time measuring the longest preamble in the list. To determine the encoding of a file, we read the first few bytes (the maximum preamble length) of the file and compare them to the known preambles. The trick is to test the longest preambles first.
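The steps above can be sketched roughly like this. Note that this is a simplified illustration, not the actual EncodingDetector internals; the class and member names are made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class PreambleMatcher
{
    // Encodings that have a preamble, sorted longest-first so that, e.g.,
    // UTF-32's four-byte BOM (FF FE 00 00) is tested before UTF-16's
    // two-byte BOM (FF FE), which is a prefix of it.
    static readonly List<Encoding> EncodingsWithPreamble =
        Encoding.GetEncodings()
                .Select(info => info.GetEncoding())
                .Where(enc => enc.GetPreamble().Length > 0)
                .OrderByDescending(enc => enc.GetPreamble().Length)
                .ToList();

    public static Encoding Match(byte[] bytes)
    {
        foreach (Encoding enc in EncodingsWithPreamble)
        {
            byte[] preamble = enc.GetPreamble();
            if (bytes.Length >= preamble.Length &&
                preamble.SequenceEqual(bytes.Take(preamble.Length)))
                return enc;
        }
        return null; // no known Byte Order Mark found
    }
}
```

Sorting longest-first matters because some BOMs are prefixes of others: a file starting with FF FE 00 00 would otherwise be misidentified as UTF-16 instead of UTF-32.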
That's not rocket science: it's EncodingDetector.
This class delivers three static methods:
public static string ReadAllText(string filename, Encoding defaultEncoding,
    out Encoding usedEncoding)
public static Encoding DetectEncoding(byte[] bytes)
public static Encoding DetectEncoding(string filename)
ReadAllText() returns the contents of a text file, much like System.IO.File.ReadAllText(), except that this version expects the default encoding to use when there's no Byte Order Mark in the file, and returns the encoding that was used to decode the file in the out-parameter usedEncoding. Normally, you'd supply Encoding.Default (for ANSI) or Encoding.UTF8 (for UTF-8) as the default encoding.
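A typical call would look something like this (the file name is illustrative, and the exact parameter order is assumed from the description above):

```csharp
using System;
using System.Text;

Encoding usedEncoding;
string text = EncodingDetector.ReadAllText(
    "settings.txt",     // file to read
    Encoding.UTF8,      // default when the file has no Byte Order Mark
    out usedEncoding);
Console.WriteLine("Decoded with: " + usedEncoding.EncodingName);
```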
For large files, this method is inefficient: it reads the whole file into a byte array, determines the encoding, and uses that encoding's GetString() method to create the string. This uses approximately two to three times as much memory as there are bytes in the file.
Internally, ReadAllText() uses the first overload of DetectEncoding(), which takes a byte array as its only argument. The start of the byte array is compared to the known preambles and if a match is found, the corresponding encoding is returned. If no BOM is found, the method returns null.
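When the bytes are already in memory, the first overload can be used directly. A sketch, with an illustrative file name, that also skips the BOM bytes before decoding:

```csharp
using System.IO;
using System.Text;

byte[] bytes = File.ReadAllBytes("data.txt");    // illustrative file name
Encoding encoding = EncodingDetector.DetectEncoding(bytes);
if (encoding == null)
    encoding = Encoding.Default;                 // no BOM: assume ANSI
int bomLength = encoding.GetPreamble().Length;   // 0 for the fallback encoding
string text = encoding.GetString(bytes, bomLength, bytes.Length - bomLength);
```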
If you just need to know the encoding of a file, without reading its entire contents, use the second overload of DetectEncoding(). This takes a filename as its argument, and reads in just enough data to recognize the BOM, if any. This method also returns null if no matching Byte Order Mark was found. You could also use it to determine the encoding of a file before reading it with System.IO.File.ReadAllText(), this time supplying the encoding detected.
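That detect-then-read combination might look like this (a sketch; the file name is illustrative, and UTF-8 is just one reasonable fallback):

```csharp
using System.IO;
using System.Text;

string filename = "data.txt";                    // illustrative file name
Encoding encoding = EncodingDetector.DetectEncoding(filename)
                    ?? Encoding.UTF8;            // fall back when no BOM
string text = File.ReadAllText(filename, encoding);
```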
EncodingDetector keeps track of the supported encodings automatically, creating the list once when needed.
The full source code of EncodingDetector is available here.