public class CharsetToolkit
extends java.lang.Object
Utility class to guess the encoding of a given byte array. The guess is not guaranteed to be correct, especially for 8-bit charsets: there is no way to know for sure which 8-bit charset is used, so in that case we infer that the charset encountered is the same as the default standard charset.
On the other hand, Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files with a Byte Order Marker are easy to identify. For UTF-8 files with no BOM, the encoding is easy to guess if the buffer is large enough.
Tested against a complicated UTF-8 file, Sun's implementation does not render bad UTF-8 constructs as expected by the specification. But with a large enough buffer, the method guessEncoding(int, int, java.nio.charset.Charset) did behave correctly and recognized the UTF-8 charset.
A byte buffer of 4 KB or 8 KB is sufficient to guess the encoding.
Usage:
// guess the encoding (the third argument is the fallback used when an 8-bit charset is recognized)
Charset guessedCharset = CharsetToolkit.guessEncoding(file, 4096, Charset.defaultCharset());

// create a reader with the charset we've just discovered
try (InputStreamReader reader = new InputStreamReader(new FileInputStream(file), guessedCharset)) {
    //...
}
Modifier and Type | Class and Description |
---|---|
static class | CharsetToolkit.GuessedEncoding |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | FILE_ENCODING_PROPERTY |
static java.nio.charset.Charset | ISO_8859_1_CHARSET |
static java.nio.charset.Charset | US_ASCII_CHARSET |
static java.nio.charset.Charset | UTF_16_CHARSET |
static java.nio.charset.Charset | UTF_16BE_CHARSET |
static java.nio.charset.Charset | UTF_16LE_CHARSET |
static java.nio.charset.Charset | UTF_32BE_CHARSET |
static java.nio.charset.Charset | UTF_32LE_CHARSET |
static byte[] | UTF16BE_BOM |
static byte[] | UTF16LE_BOM |
static byte[] | UTF32BE_BOM |
static byte[] | UTF32LE_BOM |
static java.lang.String | UTF8 |
static byte[] | UTF8_BOM |
static java.nio.charset.Charset | UTF8_CHARSET |
static java.nio.charset.Charset | WIN_1251_CHARSET |
Constructor and Description |
---|
CharsetToolkit(byte [] buffer) - Constructor of the CharsetToolkit utility class. |
CharsetToolkit(byte [] buffer, java.nio.charset.Charset defaultCharset) - Constructor of the CharsetToolkit utility class. |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | bytesToString(byte [] bytes, java.nio.charset.Charset defaultCharset) |
static boolean | canHaveBom(java.nio.charset.Charset charset, byte [] bom) |
static java.lang.String | decodeString(byte [] bytes, java.nio.charset.Charset charset) |
static java.nio.charset.Charset | forName(java.lang.String name) |
static java.nio.charset.Charset [] | getAvailableCharsets() - Retrieves all the available Charsets on the platform, among which the default charset. |
static int | getBOMLength(byte [] content, java.nio.charset.Charset charset) |
java.nio.charset.Charset | getDefaultCharset() - Retrieves the default Charset. |
static java.nio.charset.Charset | getDefaultSystemCharset() - Retrieves the default charset of the system. |
static byte [] | getMandatoryBom(java.nio.charset.Charset charset) |
static java.nio.charset.Charset | getPlatformCharset() - Retrieves the platform charset of the system (determined by the "sun.jnu.encoding" property). |
static byte [] | getPossibleBom(java.nio.charset.Charset charset) |
static byte [] | getUtf8Bytes(java.lang.String s) |
static java.nio.charset.Charset | guessEncoding(java.io.File f, int bufferLength, java.nio.charset.Charset defaultCharset) |
java.nio.charset.Charset | guessEncoding(int guess_length) |
java.nio.charset.Charset | guessEncoding(int startOffset, int endOffset, java.nio.charset.Charset defaultCharset) - Guess the encoding of the provided buffer. |
java.nio.charset.Charset | guessFromBOM() |
static java.nio.charset.Charset | guessFromBOM(byte [] buffer) |
CharsetToolkit.GuessedEncoding | guessFromContent(int guess_length) |
CharsetToolkit.GuessedEncoding | guessFromContent(int startOffset, int endOffset) |
static boolean | hasUTF16BEBom(byte [] bom) - Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2). |
static boolean | hasUTF16LEBom(byte [] bom) - Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
static boolean | hasUTF32BEBom(byte [] bom) |
static boolean | hasUTF32LEBom(byte [] bom) |
static boolean | hasUTF8Bom(byte [] bom) - Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors). |
static java.io.InputStream | inputStreamSkippingBOM(java.io.InputStream stream) |
void | setEnforce8Bit(boolean enforce) - If US-ASCII is recognized, enforce returning the default encoding rather than US-ASCII. |
static java.lang.String | tryDecodeString(byte [] bytes, java.nio.charset.Charset charset) |
public static final java.lang.String UTF8
public static final java.nio.charset.Charset UTF8_CHARSET
public static final java.nio.charset.Charset UTF_16_CHARSET
public static final java.nio.charset.Charset UTF_16LE_CHARSET
public static final java.nio.charset.Charset UTF_16BE_CHARSET
public static final java.nio.charset.Charset UTF_32BE_CHARSET
public static final java.nio.charset.Charset UTF_32LE_CHARSET
public static final java.nio.charset.Charset US_ASCII_CHARSET
public static final java.nio.charset.Charset ISO_8859_1_CHARSET
public static final java.nio.charset.Charset WIN_1251_CHARSET
public static final byte[] UTF8_BOM
public static final byte[] UTF16LE_BOM
public static final byte[] UTF16BE_BOM
public static final byte[] UTF32BE_BOM
public static final byte[] UTF32LE_BOM
public static final java.lang.String FILE_ENCODING_PROPERTY
public CharsetToolkit(byte [] buffer)
Constructor of the CharsetToolkit utility class.
Parameters:
buffer - the byte buffer of which we want to know the encoding.
public CharsetToolkit(byte [] buffer, java.nio.charset.Charset defaultCharset)
Constructor of the CharsetToolkit utility class.
Parameters:
buffer - the byte buffer of which we want to know the encoding.
defaultCharset - the default Charset to use in case an 8-bit charset is recognized.
public static java.io.InputStream inputStreamSkippingBOM(java.io.InputStream stream) throws java.io.IOException
Throws:
java.io.IOException
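As an illustration of what skipping a BOM on a stream can look like, here is a minimal sketch. It is not the CharsetToolkit implementation: the class `SkipBomSketch` and the method `skipUtf8Bom` are hypothetical names, and the sketch assumes that only the UTF-8 BOM (EF BB BF) needs to be handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class SkipBomSketch {

    /**
     * Returns a stream positioned after a leading UTF-8 BOM, if one is present.
     * Assumption: only the UTF-8 BOM is considered in this sketch.
     */
    static InputStream skipUtf8Bom(InputStream stream) throws IOException {
        PushbackInputStream in = new PushbackInputStream(stream, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may deliver fewer bytes than requested, so loop until 3 bytes or EOF
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // no BOM: push the bytes back so the caller sees the stream unchanged
            in.unread(head, 0, n);
        }
        return in;
    }
}
```

A pushback buffer is used here so that the sketch does not depend on the underlying stream supporting mark and reset.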
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, enforce returning the default charset rather than US-ASCII.
public java.nio.charset.Charset getDefaultCharset()
public java.nio.charset.Charset guessEncoding(int startOffset, int endOffset, java.nio.charset.Charset defaultCharset)
Guess the encoding of the provided buffer.
If Byte Order Markers are encountered at the beginning of the buffer, we immediately return the charset implied by this BOM. Otherwise, the file would not be a human readable text file.
If there is no BOM, this method tries to discern whether the file is UTF-8 or not. If it is not UTF-8, we assume the encoding is the default system encoding (of course, it might be any 8-bit charset, but usually, an 8-bit charset is the default one).
It is possible to discern UTF-8 thanks to the pattern of characters with a multi-byte sequence.
UCS-4 range (hex.) | UTF-8 octet sequence (binary) |
---|---|
0000 0000-0000 007F | 0....... |
0000 0080-0000 07FF | 110..... 10...... |
0000 0800-0000 FFFF | 1110.... 10...... 10...... |
0001 0000-001F FFFF | 11110... 10...... 10...... 10...... |
0020 0000-03FF FFFF | 111110.. 10...... 10...... 10...... 10...... |
0400 0000-7FFF FFFF | 1111110. 10...... 10...... 10...... 10...... 10...... |
With UTF-8, 0xFE and 0xFF never appear.
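The byte patterns listed above lend themselves to a simple structural check. The following is a minimal sketch of such a check, not the actual CharsetToolkit code: `Utf8PatternSketch` and `looksLikeUtf8` are hypothetical names, and accepting a multi-byte sequence that is cut off by the end of the buffer is an assumption made because a sample buffer may end in the middle of a character. The sketch follows the ranges listed above, including the 5- and 6-byte forms.

```java
/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class Utf8PatternSketch {

    /**
     * Returns true if every byte in buf[start, end) fits the lead-byte and
     * continuation-byte patterns of UTF-8. Pure ASCII input also returns true.
     */
    static boolean looksLikeUtf8(byte[] buf, int start, int end) {
        int i = start;
        while (i < end) {
            int b = buf[i] & 0xFF;
            int continuation; // number of 10...... bytes expected after the lead byte
            if (b < 0x80) {
                continuation = 0;                 // 0.......
            } else if ((b & 0xE0) == 0xC0) {
                continuation = 1;                 // 110.....
            } else if ((b & 0xF0) == 0xE0) {
                continuation = 2;                 // 1110....
            } else if ((b & 0xF8) == 0xF0) {
                continuation = 3;                 // 11110...
            } else if ((b & 0xFC) == 0xF8) {
                continuation = 4;                 // 111110..
            } else if ((b & 0xFE) == 0xFC) {
                continuation = 5;                 // 1111110.
            } else {
                return false;                     // stray 10...... byte, or 0xFE / 0xFF
            }
            i++;
            for (int k = 0; k < continuation; k++, i++) {
                if (i >= end) {
                    return true;                  // assumption: tolerate a sequence truncated by the buffer end
                }
                if ((buf[i] & 0xC0) != 0x80) {
                    return false;                 // expected a continuation byte 10......
                }
            }
        }
        return true;
    }
}
```

For example, the ISO-8859-1 bytes of "café!" fail the check because 0xE9 looks like a 3-byte lead byte but is followed by an ASCII byte, while the UTF-8 bytes of the same text pass. The sketch does not distinguish pure US-ASCII from UTF-8, and it does not reject overlong encodings.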
public static java.lang.String bytesToString(byte [] bytes, java.nio.charset.Charset defaultCharset)
public static java.lang.String decodeString(byte [] bytes, java.nio.charset.Charset charset)
public static java.lang.String tryDecodeString(byte [] bytes, java.nio.charset.Charset charset)
public CharsetToolkit.GuessedEncoding guessFromContent(int guess_length)
public CharsetToolkit.GuessedEncoding guessFromContent(int startOffset, int endOffset)
public java.nio.charset.Charset guessFromBOM()
public static java.nio.charset.Charset guessFromBOM(byte [] buffer)
public java.nio.charset.Charset guessEncoding(int guess_length)
public static java.nio.charset.Charset guessEncoding(java.io.File f, int bufferLength, java.nio.charset.Charset defaultCharset) throws java.io.IOException
Throws:
java.io.IOException
public static java.nio.charset.Charset getDefaultSystemCharset()
public static java.nio.charset.Charset getPlatformCharset()
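To illustrate the difference between the two lookups, here is a minimal sketch of reading the platform charset from the "sun.jnu.encoding" property mentioned in the method summary. It is not the CharsetToolkit code: `PlatformCharsetSketch` and `platformCharset` are hypothetical names, and falling back to `Charset.defaultCharset()` when the property is missing or unsupported is an assumption.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class PlatformCharsetSketch {

    /** Resolves the charset named by "sun.jnu.encoding", or the JVM default charset. */
    static Charset platformCharset() {
        String name = System.getProperty("sun.jnu.encoding");
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: fall through to the default (assumption)
            }
        }
        return Charset.defaultCharset();
    }
}
```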
public static boolean hasUTF8Bom(byte [] bom)
public static boolean hasUTF16LEBom(byte [] bom)
public static boolean hasUTF16BEBom(byte [] bom)
public static boolean hasUTF32BEBom(byte [] bom)
public static boolean hasUTF32LEBom(byte [] bom)
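The BOM checks above amount to comparing the first bytes of the content against a known prefix. The sketch below shows one way to do this with the standard Unicode BOM byte values. It is an illustration under stated assumptions, not the CharsetToolkit code: `BomSketch`, `startsWith`, and `guessNameFromBom` are hypothetical names, the constants are redeclared locally, and testing UTF-32LE before UTF-16LE is a choice of this sketch (FF FE 00 00 also starts with FF FE).

```java
/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class BomSketch {
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    static final byte[] UTF16LE_BOM = {(byte) 0xFF, (byte) 0xFE};
    static final byte[] UTF16BE_BOM = {(byte) 0xFE, (byte) 0xFF};
    static final byte[] UTF32LE_BOM = {(byte) 0xFF, (byte) 0xFE, 0, 0};
    static final byte[] UTF32BE_BOM = {0, 0, (byte) 0xFE, (byte) 0xFF};

    /** Returns true if content begins with the given prefix bytes. */
    static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String guessNameFromBom(byte[] content) {
        if (startsWith(content, UTF8_BOM)) return "UTF-8";
        // UTF-32LE is tested first because its BOM begins with the UTF-16LE BOM
        if (startsWith(content, UTF32LE_BOM)) return "UTF-32LE";
        if (startsWith(content, UTF32BE_BOM)) return "UTF-32BE";
        if (startsWith(content, UTF16LE_BOM)) return "UTF-16LE";
        if (startsWith(content, UTF16BE_BOM)) return "UTF-16BE";
        return null;
    }
}
```

The sketch returns charset names rather than `Charset` instances so that it does not depend on the UTF-32 charsets being installed in the running JVM.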
public static java.nio.charset.Charset [] getAvailableCharsets()
Retrieves all the available Charsets on the platform, among which the default charset.
public static byte [] getUtf8Bytes(java.lang.String s)
public static int getBOMLength(byte [] content, java.nio.charset.Charset charset)
public static byte [] getMandatoryBom(java.nio.charset.Charset charset)
The UTF-8 BOM (UTF8_BOM) is optional, thus it won't be returned by this method.
public static byte [] getPossibleBom(java.nio.charset.Charset charset)
public static boolean canHaveBom(java.nio.charset.Charset charset, byte [] bom)
public static java.nio.charset.Charset forName(java.lang.String name)