CharsetToolkit (Unofficial IntelliJ Community Edition API docs)

java.lang.Object
- com.intellij.openapi.vfs.CharsetToolkit

```
public class CharsetToolkit
extends java.lang.Object
```
Utility class to guess the encoding of a given byte array. The guess is not guaranteed to be correct, especially for 8-bit charsets: there is no reliable way to tell which 8-bit charset was used, so in that case the charset is assumed to be the default charset.

On the other hand, Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker (BOM), are easy to identify. For UTF-8 files with no BOM, the guess is reliable if the buffer is large enough.

When tested against a complicated UTF-8 file, Sun's implementation did not render bad UTF-8 constructs as expected by the specification. With a large enough buffer, however, the method guessEncoding(int, int, java.nio.charset.Charset) behaved correctly and recognized the UTF-8 charset.

A byte buffer of 4 KB or 8 KB is sufficient to guess the encoding.

Usage:
```
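 // guess the encoding; the static method takes a third argument,
 // the charset to assume when an 8-bit charset is recognized
 Charset guessedCharset = CharsetToolkit.guessEncoding(file, 4096, Charset.defaultCharset());

 // create a reader with the charset we've just discovered
 try (InputStreamReader reader = new InputStreamReader(new FileInputStream(file), guessedCharset)) {
   //...
 }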
```
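
The constructors take a byte buffer rather than a file, so a caller who wants the instance methods first needs the leading bytes of the file. The following is a minimal sketch of that step only, assuming Java 11 or later; `HeadReader` and `readHead` are hypothetical names and not part of `CharsetToolkit`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of CharsetToolkit.
final class HeadReader {
    /** Reads at most maxBytes from the start of the file; a shorter file yields a shorter array. */
    static byte[] readHead(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readNBytes(maxBytes); // InputStream.readNBytes(int) exists since Java 11
        }
    }
}
```

The resulting array (for example `readHead(path, 4096)`) is the kind of buffer the `CharsetToolkit(byte [] buffer)` constructor expects.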

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`CharsetToolkit.GuessedEncoding`

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`FILE_ENCODING_PROPERTY`
`static java.nio.charset.Charset`	`ISO_8859_1_CHARSET`
`static java.nio.charset.Charset`	`US_ASCII_CHARSET`
`static java.nio.charset.Charset`	`UTF_16_CHARSET`
`static java.nio.charset.Charset`	`UTF_16BE_CHARSET`
`static java.nio.charset.Charset`	`UTF_16LE_CHARSET`
`static java.nio.charset.Charset`	`UTF_32BE_CHARSET`
`static java.nio.charset.Charset`	`UTF_32LE_CHARSET`
`static byte[]`	`UTF16BE_BOM`
`static byte[]`	`UTF16LE_BOM`
`static byte[]`	`UTF32BE_BOM`
`static byte[]`	`UTF32LE_BOM`
`static java.lang.String`	`UTF8`
`static byte[]`	`UTF8_BOM`
`static java.nio.charset.Charset`	`UTF8_CHARSET`
`static java.nio.charset.Charset`	`WIN_1251_CHARSET`

Constructor Summary

Constructors
Constructor and Description
`CharsetToolkit(byte [] buffer)` Constructor of the `CharsetToolkit` utility class.
`CharsetToolkit(byte [] buffer, java.nio.charset.Charset defaultCharset)` Constructor of the `CharsetToolkit` utility class.

Method Summary

Modifier and Type	Method and Description
`static java.lang.String`	`bytesToString(byte [] bytes, java.nio.charset.Charset defaultCharset)`
`static boolean`	`canHaveBom(java.nio.charset.Charset charset, byte [] bom)`
`static java.lang.String`	`decodeString(byte [] bytes, java.nio.charset.Charset charset)`
`static java.nio.charset.Charset`	`forName(java.lang.String name)`
`static java.nio.charset.Charset []`	`getAvailableCharsets()` Retrieves all the `Charset`s available on the platform, including the default charset.
`static int`	`getBOMLength(byte [] content, java.nio.charset.Charset charset)`
`java.nio.charset.Charset`	`getDefaultCharset()` Retrieves the default Charset
`static java.nio.charset.Charset`	`getDefaultSystemCharset()` Retrieve the default charset of the system.
`static byte []`	`getMandatoryBom(java.nio.charset.Charset charset)`
`static java.nio.charset.Charset`	`getPlatformCharset()` Retrieve the platform charset of the system (determined by "sun.jnu.encoding" property)
`static byte []`	`getPossibleBom(java.nio.charset.Charset charset)`
`static byte []`	`getUtf8Bytes(java.lang.String s)`
`static java.nio.charset.Charset`	`guessEncoding(java.io.File f, int bufferLength, java.nio.charset.Charset defaultCharset)`
`java.nio.charset.Charset`	`guessEncoding(int guess_length)`
`java.nio.charset.Charset`	`guessEncoding(int startOffset, int endOffset, java.nio.charset.Charset defaultCharset)` Guess the encoding of the provided buffer.
`java.nio.charset.Charset`	`guessFromBOM()`
`static java.nio.charset.Charset`	`guessFromBOM(byte [] buffer)`
`CharsetToolkit.GuessedEncoding`	`guessFromContent(int guess_length)`
`CharsetToolkit.GuessedEncoding`	`guessFromContent(int startOffset, int endOffset)`
`static boolean`	`hasUTF16BEBom(byte [] bom)` Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
`static boolean`	`hasUTF16LEBom(byte [] bom)` Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
`static boolean`	`hasUTF32BEBom(byte [] bom)`
`static boolean`	`hasUTF32LEBom(byte [] bom)`
`static boolean`	`hasUTF8Bom(byte [] bom)` Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).
`static java.io.InputStream`	`inputStreamSkippingBOM(java.io.InputStream stream)`
`void`	`setEnforce8Bit(boolean enforce)` If US-ASCII is recognized, return the default encoding rather than US-ASCII.
`static java.lang.String`	`tryDecodeString(byte [] bytes, java.nio.charset.Charset charset)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

UTF8

public static final java.lang.String UTF8

See Also: Constant Field Values

UTF8_CHARSET

public static final java.nio.charset.Charset UTF8_CHARSET

UTF_16_CHARSET

public static final java.nio.charset.Charset UTF_16_CHARSET

UTF_16LE_CHARSET

public static final java.nio.charset.Charset UTF_16LE_CHARSET

UTF_16BE_CHARSET

public static final java.nio.charset.Charset UTF_16BE_CHARSET

UTF_32BE_CHARSET

public static final java.nio.charset.Charset UTF_32BE_CHARSET

UTF_32LE_CHARSET

public static final java.nio.charset.Charset UTF_32LE_CHARSET

US_ASCII_CHARSET

public static final java.nio.charset.Charset US_ASCII_CHARSET

ISO_8859_1_CHARSET

public static final java.nio.charset.Charset ISO_8859_1_CHARSET

WIN_1251_CHARSET

public static final java.nio.charset.Charset WIN_1251_CHARSET

UTF8_BOM
```
public static final byte[] UTF8_BOM
```

UTF16LE_BOM
```
public static final byte[] UTF16LE_BOM
```

UTF16BE_BOM
```
public static final byte[] UTF16BE_BOM
```

UTF32BE_BOM
```
public static final byte[] UTF32BE_BOM
```

UTF32LE_BOM
```
public static final byte[] UTF32LE_BOM
```

FILE_ENCODING_PROPERTY

public static final java.lang.String FILE_ENCODING_PROPERTY

See Also: Constant Field Values

Constructor Detail
- CharsetToolkit
```
public CharsetToolkit(byte [] buffer)
```
  Constructor of the CharsetToolkit utility class.
  
  Parameters:
  
  buffer - the byte buffer whose encoding we want to determine.
- CharsetToolkit
```
public CharsetToolkit(byte [] buffer,
                      java.nio.charset.Charset defaultCharset)
```
  Constructor of the CharsetToolkit utility class.
  
  Parameters:
  
  buffer - the byte buffer whose encoding we want to determine.
  
  defaultCharset - the default Charset to use in case an 8-bit charset is recognized.

Method Detail

inputStreamSkippingBOM

public static java.io.InputStream inputStreamSkippingBOM(java.io.InputStream stream)
                                                  throws java.io.IOException

Throws: java.io.IOException
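
This page gives only the signature, so the following is a sketch of the general technique rather than the library's implementation: wrap the stream, peek at the first bytes, and push them back if they are not a BOM. It handles only the UTF-8 BOM; whether `inputStreamSkippingBOM` covers other BOMs is not stated here. `BomSkipSketch` and `skipUtf8Bom` are hypothetical names, and Java 9 or later is assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper, not part of CharsetToolkit.
final class BomSkipSketch {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, or at the start if there is none. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = pushback.readNBytes(head, 0, head.length); // Java 9+
        boolean hasBom = n == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back to the caller
        }
        return pushback;
    }
}
```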

setEnforce8Bit
```
public void setEnforce8Bit(boolean enforce)
```
If US-ASCII is recognized, return the default encoding rather than US-ASCII. The file may simply contain no characters in the range 128-255 yet, while actually being, or later becoming, a file encoded with the default charset rather than US-ASCII.
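
The effect of the flag can be illustrated with a small standalone sketch. It is an assumption about the decision described above, not the class's actual code; `Enforce8BitSketch`, `isPureAscii` and `resolve` are hypothetical names, and the non-ASCII branch is deliberately simplified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of CharsetToolkit.
final class Enforce8BitSketch {
    /** True if every byte is in the 7-bit range 0-127. */
    static boolean isPureAscii(byte[] buffer) {
        for (byte b : buffer) {
            if (b < 0) return false; // values 128-255 are negative as signed Java bytes
        }
        return true;
    }

    /** Picks US-ASCII or the default charset for a pure-ASCII buffer, depending on the flag. */
    static Charset resolve(byte[] buffer, Charset defaultCharset, boolean enforce8Bit) {
        if (isPureAscii(buffer)) {
            return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
        }
        // Simplification: real detection would look for BOMs and UTF-8 patterns here.
        return defaultCharset;
    }
}
```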

getDefaultCharset

public java.nio.charset.Charset getDefaultCharset()

Retrieves the default Charset

guessEncoding
```
public java.nio.charset.Charset guessEncoding(int startOffset,
                                              int endOffset,
                                              java.nio.charset.Charset defaultCharset)
```
Guesses the encoding of the provided buffer.
If a Byte Order Marker is found at the beginning of the buffer, the charset implied by that BOM is returned immediately, since a human-readable text file would not otherwise begin with those bytes.

If there is no BOM, this method tries to determine whether the file is UTF-8. If it is not, the encoding is assumed to be the default system encoding (it could be any 8-bit charset, but an 8-bit charset is usually the default one).

UTF-8 can be recognized by the bit patterns of its multi-byte sequences:
```
 UCS-4 range (hex.)        UTF-8 octet sequence (binary)
 0000 0000-0000 007F       0.......
 0000 0080-0000 07FF       110..... 10......
 0000 0800-0000 FFFF       1110.... 10...... 10......
 0001 0000-001F FFFF       11110... 10...... 10...... 10......
 0020 0000-03FF FFFF       111110.. 10...... 10...... 10...... 10......
 0400 0000-7FFF FFFF       1111110. 10...... 10...... 10...... 10...... 10......
 
```
With UTF-8, 0xFE and 0xFF never appear.

Returns:

the Charset recognized.
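
The pattern check described above can be sketched in a few lines. This is an illustration that mirrors the table as printed, including its five- and six-byte forms; it does not reject overlong encodings or out-of-range code points, and it is not the library's implementation. `Utf8PatternSketch`, `continuationCount` and `looksLikeUtf8` are hypothetical names.

```java
// Hypothetical illustration, not part of CharsetToolkit.
final class Utf8PatternSketch {
    /** Number of continuation bytes implied by a lead byte, or -1 if it cannot start a sequence. */
    static int continuationCount(int b) {
        if ((b & 0x80) == 0x00) return 0; // 0.......
        if ((b & 0xE0) == 0xC0) return 1; // 110.....
        if ((b & 0xF0) == 0xE0) return 2; // 1110....
        if ((b & 0xF8) == 0xF0) return 3; // 11110...
        if ((b & 0xFC) == 0xF8) return 4; // 111110..
        if ((b & 0xFE) == 0xFC) return 5; // 1111110.
        return -1;                        // stray 10......, 0xFE or 0xFF
    }

    /** True if the buffer has at least one multi-byte sequence and none that breaks the pattern. */
    static boolean looksLikeUtf8(byte[] buffer) {
        boolean sawMultiByte = false;
        int i = 0;
        while (i < buffer.length) {
            int count = continuationCount(buffer[i] & 0xFF);
            if (count < 0) return false;
            if (i + count >= buffer.length) break; // last sequence is cut off by the end of the buffer
            for (int k = 1; k <= count; k++) {
                if ((buffer[i + k] & 0xC0) != 0x80) return false; // continuation must be 10......
            }
            if (count > 0) sawMultiByte = true;
            i += count + 1;
        }
        return sawMultiByte;
    }
}
```

A pure-ASCII buffer returns false here because it contains no multi-byte evidence either way, which is why a fallback charset is still needed.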

bytesToString

public static java.lang.String bytesToString(byte [] bytes,
                                             java.nio.charset.Charset defaultCharset)

decodeString

public static java.lang.String decodeString(byte [] bytes,
                                            java.nio.charset.Charset charset)

tryDecodeString

public static java.lang.String tryDecodeString(byte [] bytes,
                                               java.nio.charset.Charset charset)

guessFromContent

public CharsetToolkit.GuessedEncoding guessFromContent(int guess_length)

guessFromContent

public CharsetToolkit.GuessedEncoding guessFromContent(int startOffset,
                                                       int endOffset)

guessFromBOM

public java.nio.charset.Charset guessFromBOM()

guessFromBOM

public static java.nio.charset.Charset guessFromBOM(byte [] buffer)

guessEncoding

public java.nio.charset.Charset guessEncoding(int guess_length)

guessEncoding

public static java.nio.charset.Charset guessEncoding(java.io.File f,
                                                     int bufferLength,
                                                     java.nio.charset.Charset defaultCharset)
                                              throws java.io.IOException

Throws: java.io.IOException

getDefaultSystemCharset

public static java.nio.charset.Charset getDefaultSystemCharset()

Retrieve the default charset of the system.

getPlatformCharset
```
public static java.nio.charset.Charset getPlatformCharset()
```
Retrieve the platform charset of the system (determined by "sun.jnu.encoding" property)
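
As a rough sketch of what reading that property can look like: the code below looks up "sun.jnu.encoding" and resolves it to a `Charset`. The fallback to `Charset.defaultCharset()` when the property is missing or unsupported is an assumption of this sketch, not documented behaviour of `getPlatformCharset()`; `PlatformCharsetSketch` is a hypothetical name.

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not part of CharsetToolkit.
final class PlatformCharsetSketch {
    static Charset platformCharset() {
        String name = System.getProperty("sun.jnu.encoding");
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name in the property: fall through to the assumed fallback.
        }
        return Charset.defaultCharset(); // assumed fallback
    }
}
```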

hasUTF8Bom
```
public static boolean hasUTF8Bom(byte [] bom)
```
Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).

hasUTF16LEBom
```
public static boolean hasUTF16LEBom(byte [] bom)
```
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

hasUTF16BEBom
```
public static boolean hasUTF16BEBom(byte [] bom)
```
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

hasUTF32BEBom

public static boolean hasUTF32BEBom(byte [] bom)

hasUTF32LEBom

public static boolean hasUTF32LEBom(byte [] bom)
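
The BOM checks above amount to comparing the first bytes of a buffer against fixed signatures. The sketch below uses the standard Unicode BOM byte values; the order of the checks and the names `BomSniffSketch`, `startsWith` and `charsetFromBom` are assumptions of this illustration, not taken from `CharsetToolkit`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of CharsetToolkit.
final class BomSniffSketch {
    /** True if the buffer begins with the given unsigned byte values. */
    static boolean startsWith(byte[] buffer, int... bom) {
        if (buffer.length < bom.length) return false;
        for (int i = 0; i < bom.length; i++) {
            if ((buffer[i] & 0xFF) != bom[i]) return false;
        }
        return true;
    }

    /** Returns the charset implied by a leading BOM, or null if none is recognized. */
    static Charset charsetFromBom(byte[] buffer) {
        if (startsWith(buffer, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        // UTF-32 is tested before UTF-16 because the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
        if (startsWith(buffer, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(buffer, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(buffer, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(buffer, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }
}
```

Note that the bytes FF FE 00 00 are ambiguous: they could also be UTF-16LE text beginning with a NUL character, so any ordering is a judgment call.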

getAvailableCharsets
```
public static java.nio.charset.Charset [] getAvailableCharsets()
```
Retrieves all the Charsets available on the platform, including the default charset.
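
With the standard library alone, a comparable array can be obtained from `Charset.availableCharsets()`. This is a sketch of one way to build such a list, not necessarily how `getAvailableCharsets()` does it; `AvailableCharsetsSketch` is a hypothetical name.

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not part of CharsetToolkit.
final class AvailableCharsetsSketch {
    /** Returns every charset the running JVM supports, ordered by canonical name. */
    static Charset[] availableCharsets() {
        // availableCharsets() returns a SortedMap keyed by canonical charset name.
        return Charset.availableCharsets().values().toArray(new Charset[0]);
    }
}
```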

getUtf8Bytes

public static byte [] getUtf8Bytes(java.lang.String s)

getBOMLength

public static int getBOMLength(byte [] content,
                               java.nio.charset.Charset charset)

getMandatoryBom
```
public static byte [] getMandatoryBom(java.nio.charset.Charset charset)
```
Returns:

the BOM associated with this charset if the charset must have one, or null otherwise. Currently, this applies to the UTF-16xx and UTF-32xx families. UTF-8, on the other hand, may have the BOM UTF8_BOM, but it is optional and so is not returned by this method.

getPossibleBom
```
public static byte [] getPossibleBom(java.nio.charset.Charset charset)
```
Returns:

BOM which can be associated with this charset, or null otherwise. Currently, these are UTF-16xx, UTF-32xx and UTF-8.
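
The difference between a mandatory and a possible BOM can be shown with a small lookup table. The sketch below assumes that "UTF-16xx" and "UTF-32xx" mean the explicit LE and BE variants; that reading, and the names `BomTableSketch`, `mandatoryBom` and `possibleBom`, are assumptions of this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of CharsetToolkit.
final class BomTableSketch {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final Map<Charset, byte[]> MANDATORY = new LinkedHashMap<>();

    static {
        MANDATORY.put(StandardCharsets.UTF_16BE, new byte[]{(byte) 0xFE, (byte) 0xFF});
        MANDATORY.put(StandardCharsets.UTF_16LE, new byte[]{(byte) 0xFF, (byte) 0xFE});
        MANDATORY.put(Charset.forName("UTF-32BE"), new byte[]{0x00, 0x00, (byte) 0xFE, (byte) 0xFF});
        MANDATORY.put(Charset.forName("UTF-32LE"), new byte[]{(byte) 0xFF, (byte) 0xFE, 0x00, 0x00});
    }

    /** BOM the charset must carry, or null; UTF-8 is excluded because its BOM is optional. */
    static byte[] mandatoryBom(Charset charset) {
        byte[] bom = MANDATORY.get(charset);
        return bom == null ? null : bom.clone(); // defensive copy
    }

    /** BOM the charset may carry, or null; adds UTF-8 to the mandatory set. */
    static byte[] possibleBom(Charset charset) {
        if (StandardCharsets.UTF_8.equals(charset)) return UTF8_BOM.clone();
        return mandatoryBom(charset);
    }
}
```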

canHaveBom

public static boolean canHaveBom(java.nio.charset.Charset charset,
                                 byte [] bom)

forName

public static java.nio.charset.Charset forName(java.lang.String name)
