util
Class BytesEncodingChecker

java.lang.Object
  extended by util.BytesEncodingChecker

public class BytesEncodingChecker
extends java.lang.Object

Provides UCS encoding information for a given byte array. The client can determine whether the encoding is UTF-8 or UTF-16; for UTF-16, whether it is big- or little-endian; whether a BOM (byte order mark) is present; and, if so, how many bytes the BOM occupies.

(The range of UCS encodings supported is thus somewhat limited.)

This class also defines a utility method to decode its byte array as a string.
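The following is a minimal sketch of BOM-based detection of the kind described above. It is not this class's implementation. The class name BomSniffSketch and its methods are hypothetical, and the constant values are the standard byte values of U+FEFF, assumed (not confirmed by this page) to match the fields listed below. It does not cover input that has no BOM.

```java
class BomSniffSketch {
    // Standard byte values of the BOM character U+FEFF (assumed to match the class's fields).
    static final int BOM_0 = 0xFE;       // byte 0 of a big-endian 16-bit BOM
    static final int BOM_1 = 0xFF;       // byte 1 of a big-endian 16-bit BOM
    static final int UTF8_BOM_0 = 0xEF;  // UTF-8 form of U+FEFF is EF BB BF
    static final int UTF8_BOM_1 = 0xBB;
    static final int UTF8_BOM_2 = 0xBF;

    /** Returns "UTF-8", "UTF-16BE" or "UTF-16LE" if a BOM is present, otherwise null. */
    static String encodingFromBom(byte[] b) {
        if (b.length >= 3 && u(b[0]) == UTF8_BOM_0
                && u(b[1]) == UTF8_BOM_1 && u(b[2]) == UTF8_BOM_2) {
            return "UTF-8";
        }
        if (b.length >= 2 && u(b[0]) == BOM_0 && u(b[1]) == BOM_1) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && u(b[0]) == BOM_1 && u(b[1]) == BOM_0) {
            return "UTF-16LE"; // the two BOM bytes appear in swapped order
        }
        return null;
    }

    /** Returns the number of bytes occupied by an initial BOM, or zero if there is none. */
    static int bomLength(byte[] b) {
        String enc = encodingFromBom(b);
        if (enc == null) {
            return 0;
        }
        return enc.equals("UTF-8") ? 3 : 2;
    }

    // Java bytes are signed, so mask to compare against the 0x80..0xFF constants.
    static int u(byte x) {
        return x & 0xFF;
    }
}
```

The constants are held as int rather than byte, which is why each byte is masked with 0xFF before comparison.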


Field Summary
static int BOM_0
          Byte 0 (big-endian) of a 16-bit BOM character.
static int BOM_1
          Byte 1 (big-endian) of a 16-bit BOM character.
static int NUL
          NUL byte value.
static int UTF8_BOM_0
          Byte 0 of the UTF-8 representation of a BOM character.
static int UTF8_BOM_1
          Byte 1 of the UTF-8 representation of a BOM character.
static int UTF8_BOM_2
          Byte 2 of the UTF-8 representation of a BOM character.
 
Constructor Summary
BytesEncodingChecker(byte[] bytes)
          Constructs a new checker for the given byte array and determines its encoding parameters.
 
Method Summary
 int countBOMPrefixBytes()
          Returns the number of bytes devoted to an initial BOM, or zero if there is no initial BOM.
 java.lang.String encodingForUCS16()
          Assuming the encoding is known to be some form of UTF-16, this method returns the appropriate encoding name with explicit endianness, i.e. "UTF-16BE" (big-endian) or "UTF-16LE" (little-endian).
 boolean encodingIsKnown16Bit()
          Returns true iff the encoding is some form of UTF-16.
 boolean encodingIsKnownUTF8()
          Returns true iff the encoding is known to be UTF-8: to convert the bytes to a string using this encoding, the first countBOMPrefixBytes() bytes should be omitted.
 java.lang.String getDecodedString(java.lang.String enc, int nskip)
          Returns the string obtained by decoding the byte array associated with this checker, assuming the given encoding, and skipping the specified number of initial bytes.
 boolean utf16IsBE()
          Assuming the encoding is known to be some form of UTF-16, returns true iff it is big-endian.
 boolean utf16IsLE()
          Assuming the encoding is known to be some form of UTF-16, returns true iff it is little-endian.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BOM_0

public static final int BOM_0
Byte 0 (big-endian) of a 16-bit BOM character.

See Also:
Constant Field Values

BOM_1

public static final int BOM_1
Byte 1 (big-endian) of a 16-bit BOM character.

See Also:
Constant Field Values

UTF8_BOM_0

public static final int UTF8_BOM_0
Byte 0 of the UTF-8 representation of a BOM character.

See Also:
Constant Field Values

UTF8_BOM_1

public static final int UTF8_BOM_1
Byte 1 of the UTF-8 representation of a BOM character.

See Also:
Constant Field Values

UTF8_BOM_2

public static final int UTF8_BOM_2
Byte 2 of the UTF-8 representation of a BOM character.

See Also:
Constant Field Values

NUL

public static final int NUL
NUL byte value.

See Also:
Constant Field Values

Constructor Detail

BytesEncodingChecker

public BytesEncodingChecker(byte[] bytes)
Constructs a new checker for the given byte array and determines its encoding parameters.

Method Detail

encodingIsKnown16Bit

public boolean encodingIsKnown16Bit()
Returns true iff the encoding is some form of UTF-16.


utf16IsBE

public boolean utf16IsBE()
Assuming the encoding is known to be some form of UTF-16, returns true iff it is big-endian.


utf16IsLE

public boolean utf16IsLE()
Assuming the encoding is known to be some form of UTF-16, returns true iff it is little-endian.


encodingIsKnownUTF8

public boolean encodingIsKnownUTF8()
Returns true iff the encoding is known to be UTF-8: to convert the bytes to a string using this encoding, the first countBOMPrefixBytes() bytes should be omitted.


encodingForUCS16

public java.lang.String encodingForUCS16()
Assuming the encoding is known to be some form of UTF-16, this method returns the appropriate encoding name with explicit endianness, i.e. "UTF-16BE" (big-endian) or "UTF-16LE" (little-endian): to convert the bytes to a string using this encoding, the first countBOMPrefixBytes() bytes should be omitted.
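A short sketch of why the BOM bytes should be omitted when an explicit-endianness encoding is used. It uses only the standard library, and the class and method names are hypothetical. Java's "UTF-16LE" and "UTF-16BE" decoders do not consume a leading BOM, so if the bytes are not skipped, the decoded string begins with the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

class ExplicitEndiannessSketch {
    /** Decodes the bytes as UTF-16LE, skipping the first nskip bytes. */
    static String decodeLE(byte[] bytes, int nskip) {
        return new String(bytes, nskip, bytes.length - nskip, StandardCharsets.UTF_16LE);
    }
}
```

For the little-endian input FF FE 48 00 69 00, decodeLE(bytes, 2) yields "Hi", whereas decodeLE(bytes, 0) yields a three-character string whose first character is U+FEFF.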


countBOMPrefixBytes

public int countBOMPrefixBytes()
Returns the number of bytes devoted to an initial BOM, or zero if there is no initial BOM.


getDecodedString

public java.lang.String getDecodedString(java.lang.String enc,
                                         int nskip)
                                  throws java.io.UnsupportedEncodingException
Returns the string obtained by decoding the byte array associated with this checker, assuming the given encoding, and skipping the specified number of initial bytes.

Throws:
java.io.UnsupportedEncodingException - if the given encoding is not supported
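A method with this contract could plausibly be written with the standard String constructor that takes an offset, a length and a charset name. The sketch below is an assumption about the general approach, not the class's actual code. DecodeSketch and decode are hypothetical names, and the byte array is passed as a parameter because this sketch has no checker instance.

```java
import java.io.UnsupportedEncodingException;

class DecodeSketch {
    /**
     * Decodes bytes[nskip..] using the named encoding.
     * Throws UnsupportedEncodingException if the encoding name is not supported.
     */
    static String decode(byte[] bytes, String enc, int nskip)
            throws UnsupportedEncodingException {
        return new String(bytes, nskip, bytes.length - nskip, enc);
    }
}
```

A caller would typically pass the encoding name reported by the checker as enc and the value of countBOMPrefixBytes() as nskip.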