2018-09-02

1: How to Decode Any Byte Sequence into a String

<The previous article in this series | The table of contents of this series | The next article in this series>

Neither the 'String' class nor any 'Reader' class is always optimal, or even usable, for decoding a byte sequence into a string. Here is a better way.

Topics


About: The Java programming language

The table of contents of this article


Starting Context


  • The reader has a basic knowledge of Java programming.
  • The reader wants to decode a possibly long and/or possibly error-containing byte sequence into a string (a character array or a 'String' instance).

Target Context


  • The reader will understand how to optimally decode any byte sequence into a string.

Orientation


Hypothesizer 7
When we need to decode a byte sequence into a string, a handy way is to store the whole sequence in a byte array and call a 'String' constructor, specifying the array and the encoding, like this.

@Java Source Code
			String l_inputString = "aϴbΩ";
			String l_encoding = "UTF-8";
			byte [] l_inputBytes = l_inputString.getBytes (l_encoding);
			String l_outputString = new String (l_inputBytes, l_encoding);

However, that isn't always optimal or even feasible. For one thing, the byte sequence may be long, so that storing it whole occupies a large memory space, which isn't favorable unless the computer happens to have unlimited memory; for another, the length of the byte sequence may be unpredictable, which makes allocating the array efficiently difficult; for yet another, the whole sequence may not be decodable successfully, which necessitates identifying the locations of the erroneous parts, which the 'String' constructor doesn't do. In fact, I am mainly thinking of reading a stream (typically an input stream of a file).

Of course, there is the option of using a 'Reader', which decodes the byte sequence, but I am thinking of a case in which that also isn't optimal: I need to read a byte sequence of a specified length. With any 'Reader' instance, we can specify the number of characters to be read, but not the number of bytes, right? . . . Certainly, we could adjust the number of bytes read by re-encoding the read character sequence and checking its byte length, but that doesn't seem optimal because the re-encoding seems wasteful: why should we have to create a copy of the source byte sequence, which has already been read (internally by the 'Reader' instance) and which we don't even want, just in order to check the length?

In fact, there is a class, 'java.nio.charset.CharsetDecoder', which seems usable for my concern. This article is about what it is and how we can use it.


Main Body


1: Why Can't We Just Cut the Byte Sequence into Fixed-Length Pieces and Decode It Piece by Piece?


Hypothesizer 7
When a byte sequence is too long to be favorably placed in memory whole, we cannot help but cut it into pieces and decode it piece by piece. However, just cutting the byte sequence into fixed-length pieces (except the last piece, of course) may chop up the bytes of a character, and such characters cannot be decoded successfully.

As characters have varying lengths in many major encodings (UTF-8, for example), there is no piece length that guarantees that no character is chopped up.
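
For example (a minimal sketch of mine, not part of the test programs of this article; the class name is hypothetical), chopping the 2-byte UTF-8 sequence of 'ϴ' in the middle makes the fragment undecodable, and the 'String' constructor silently substitutes the replacement character:

```java
import java.nio.charset.StandardCharsets;

public class ChoppedCharacterDemo {
	public static void main (String [] a_arguments) {
		// 'ϴ' (U+03F4) occupies 2 bytes in UTF-8: 0xCF 0xB4
		byte [] l_bytes = "ϴ".getBytes (StandardCharsets.UTF_8);
		System.out.println (l_bytes.length); // prints "2"
		// Decoding only the first byte: the fragment is not a valid character,
		// and the 'String' constructor replaces it with U+FFFD
		String l_firstHalf = new String (l_bytes, 0, 1, StandardCharsets.UTF_8);
		System.out.println ((int) l_firstHalf.charAt (0)); // prints "65533" (U+FFFD)
	}
}
```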


2: What I Thought 'java.nio.charset.CharsetDecoder' Would Do


Hypothesizer 7
So, we need a decoder that takes the byte sequence piece by piece and processes each piece properly: prepending any character fragment(s) left over from the previous piece(s) to the current piece, and recognizing and remembering any character fragment at the tail of the current piece.

After I read (somewhat cursorily, I admit) the API document of 'java.nio.charset.CharsetDecoder' with that concern in my mind, I (erroneously) thought that the class would do that, because in my world view, any class with an interface like that class's would do so.

So, I wrote this test program. The test program takes two arguments: the string from which the byte sequence is created, and the piece length in bytes by which the program cuts the byte sequence into pieces. You know, in reality, the byte sequence would be read from a stream, but for the test, I created it from the first argument.

@Java Source Code
package theBiasPlanet.tests.bytesArrayDecodingTest1;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class Test1Test {
	private Test1Test () {
	}
	
	public static void main (String [] a_arguments) throws Exception {
		Test1Test.test (a_arguments [0], Integer.parseInt (a_arguments [1]));
	}
	
	private static void test (String a_inputString, int a_dataBufferSize) throws Exception {
		testCharsetDecoder (a_inputString, a_dataBufferSize);
	}
	
	private static void testCharsetDecoder (String a_inputString, int a_dataBufferSize) throws Exception {
		String l_encoding = "UTF-8";
		byte [] l_inputBytes = a_inputString.getBytes (l_encoding);
		int l_inputBytesLength = l_inputBytes.length;
		ByteBuffer l_newInputBuffer = null;
		CharBuffer l_outputBuffer = CharBuffer.allocate (a_dataBufferSize);
		CoderResult l_decodingResult = null;
		CharsetDecoder l_instanceOfCharsetDecoder = Charset.forName(l_encoding).newDecoder ();
		boolean l_isLastIteration = false;
		for (int l_processedBytesLengthSoFar = 0, l_processedBytesLengthPerIteration = 0; ; ) {
			System.out.println (String.format ("### Decoding from the index, %d.", l_processedBytesLengthSoFar));
			l_processedBytesLengthPerIteration = Math.min (l_inputBytesLength - l_processedBytesLengthSoFar, a_dataBufferSize);
			l_newInputBuffer = ByteBuffer.wrap (l_inputBytes, l_processedBytesLengthSoFar, l_processedBytesLengthPerIteration);
			l_processedBytesLengthSoFar += l_processedBytesLengthPerIteration;
			l_isLastIteration = l_processedBytesLengthSoFar == l_inputBytesLength;
			l_decodingResult = l_instanceOfCharsetDecoder.decode (l_newInputBuffer, l_outputBuffer, l_isLastIteration);
			handleDecodingResult (l_decodingResult, l_outputBuffer);
			if (l_isLastIteration || l_decodingResult.isMalformed () || l_decodingResult.isUnmappable ()) {
				break;
			}
		}
	}
	
	private static void handleDecodingResult (CoderResult a_decodingResult, CharBuffer a_outputBuffer) {
		System.out.println (String.format ("### The decoding result is '%s'.", a_decodingResult.toString ()));
		a_outputBuffer.flip ();
		System.out.println (String.format ("### The output is '%s'.", a_outputBuffer.toString ()));
		a_outputBuffer.clear ();
		System.out.println ("");
	}
}

However, that code gives me this output when I pass "aϴbΩ" and "1" as the first and second arguments.

@Output
### Decoding from the index, 0.
### The decoding result is 'UNDERFLOW'.
### The output is 'a'.

### Decoding from the index, 1.
### The decoding result is 'UNDERFLOW'.
### The output is ''.

### Decoding from the index, 2.
### The decoding result is 'MALFORMED[1]'.
### The output is ''. 

Huh? Although I thought I had done as the API document instructed, the second character, 'ϴ', cannot be decoded . . .


3: What 'java.nio.charset.CharsetDecoder' Really Does


Hypothesizer 7
As I read the API document more carefully, in the explanation of the 'decode' method, it says, "In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation." . . . Um, OK, I had actually done so, for I passed a new ByteBuffer instance for each piece (so the older ByteBuffer instances were left intact) and the underlying byte array was also intact throughout the whole process: any bytes remaining in each input buffer really were preserved, and the class should have been able to read those remaining bytes . . .

So, what's wrong? . . . As I reread the API document even more carefully, it says, "Each invocation of the decode method will decode as many bytes as possible from the input buffer, writing the resulting characters to the output buffer." . . . Of course, . . . wait! Do I have to interpret that sentence like this: 'each invocation of the decode method will decode ONLY bytes from the input buffer, decoding as many bytes as possible and writing the resulting characters to the output buffer'? Really? Logically speaking, decoding bytes from the input buffer doesn't preclude decoding other bytes, but considering the result of the test program, the document seems to mean exactly that.

But then, what is the purpose of passing 'false' as the 'endOfInput' argument? I thought that I was passing 'false' for 'endOfInput' in order to make the class instance remember the character fragment(s): if the class doesn't remember the character fragment(s) anyway and the user has to deal with the character fragment(s) by himself or herself, 'true' or 'false' for 'endOfInput' shouldn't matter . . .

In fact, it really doesn't matter if the byte sequence can be safely presupposed to contain no error: 'false' for 'endOfInput' serves only to avoid returning an error status when the remaining odd bytes could be a character fragment that is completed by a following piece or pieces.
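
That difference can be confirmed concretely (a minimal sketch of mine; the class name is hypothetical): decoding a lone UTF-8 lead byte yields 'UNDERFLOW' with 'false' for 'endOfInput' but 'MALFORMED' with 'true':

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EndOfInputDemo {
	public static void main (String [] a_arguments) {
		// The lone lead byte of the 2-byte UTF-8 sequence of 'ϴ' (0xCF 0xB4)
		byte [] l_fragment = new byte [] {(byte) 0xCF};
		// With 'false', the fragment may still be completed by a later piece
		CoderResult l_resultWithFalse = StandardCharsets.UTF_8.newDecoder ().decode (ByteBuffer.wrap (l_fragment), CharBuffer.allocate (4), false);
		System.out.println (l_resultWithFalse.isUnderflow ()); // prints "true"
		// With 'true', no later piece can complete it, so it is an error
		CoderResult l_resultWithTrue = StandardCharsets.UTF_8.newDecoder ().decode (ByteBuffer.wrap (l_fragment), CharBuffer.allocate (4), true);
		System.out.println (l_resultWithTrue.isMalformed ()); // prints "true"
	}
}
```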

OK, so, I should pass 'false' and have to judge whether there is a character fragment, because I have to deal with the fragment if there is one, but how? . . . The 'decode' method doesn't actually expose that information through the result instance . . . Why? . . . Anyway, that information and the character fragment (the remaining bytes) can be obtained by looking at the state of the input buffer: if the 'position' is not at the 'limit', there is a character fragment, namely the bytes from the 'position' up to the 'limit'.

So, it seems that that class decodes only the current input buffer and merely lets us know about the character fragment, leaving us to deal with the fragment ourselves.
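
Concretely, the fragment check looks like this (a minimal sketch of mine; the class name is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ResidueDetectionDemo {
	public static void main (String [] a_arguments) {
		CharsetDecoder l_decoder = StandardCharsets.UTF_8.newDecoder ();
		// 'a' followed by the first byte of the 2-byte UTF-8 sequence of 'ϴ'
		ByteBuffer l_inputBuffer = ByteBuffer.wrap (new byte [] {(byte) 'a', (byte) 0xCF});
		CharBuffer l_outputBuffer = CharBuffer.allocate (4);
		CoderResult l_decodingResult = l_decoder.decode (l_inputBuffer, l_outputBuffer, false);
		// 'UNDERFLOW', but the 'position' stopped before the 'limit':
		// the bytes in between are the character fragment
		System.out.println (l_decodingResult.isUnderflow ()); // prints "true"
		System.out.println (l_inputBuffer.position ()); // prints "1": only 'a' was consumed
		System.out.println (l_inputBuffer.limit ()); // prints "2"
	}
}
```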


4: Then, How Can We Use 'java.nio.charset.CharsetDecoder'?


Hypothesizer 7
When the 'decode' method is called in the first iteration with 'false' for 'endOfInput', the result status may be 'Underflow', 'Malformed', or 'Unmappable'.

'Underflow' doesn't necessarily mean that the input buffer has really underflowed (that there is a character fragment at the tail); we have to find out whether it has by looking at the state of the input buffer. If it really has underflowed, the character fragment is the bytes from the 'position' up to the 'limit', which we have to prepend to the next piece somehow.

If the result status is 'Malformed' or 'Unmappable', the erroneous bytes start at the 'position' of the input buffer and span the number of bytes returned by the 'length' method of the result instance. How to proceed thereafter is up to us.
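
That error location information can be read like this (a minimal sketch of mine; the class name is hypothetical), using a stray UTF-8 continuation byte as the error:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ErroneousBytesDemo {
	public static void main (String [] a_arguments) {
		// A stray continuation byte (0xB4) between 'a' and 'b'
		ByteBuffer l_inputBuffer = ByteBuffer.wrap (new byte [] {(byte) 'a', (byte) 0xB4, (byte) 'b'});
		CoderResult l_decodingResult = StandardCharsets.UTF_8.newDecoder ().decode (l_inputBuffer, CharBuffer.allocate (4), true);
		// The erroneous bytes start at the 'position' and span 'length ()' bytes
		System.out.println (l_decodingResult.isMalformed ()); // prints "true"
		System.out.println (l_inputBuffer.position ()); // prints "1"
		System.out.println (l_decodingResult.length ()); // prints "1"
	}
}
```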

When the 'decode' method is called in a later iteration with 'false' for 'endOfInput', we have to use the piece that has been prepared in the previous iteration (with the character fragment prepended), dealing with the result in the same way as in the first iteration.

When the 'decode' method is called in the last iteration with 'true' for 'endOfInput', we likewise have to use the piece that has been prepared in the previous iteration. An 'Underflow' result status then really means 'not underflowed', and we can deal with 'Malformed' and 'Unmappable' result statuses in the same way as in the first iteration.

Lastly, to squeeze any remaining output into an output buffer, we call the 'flush' method of the decoder.
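
Putting the above steps together, a whole decoding loop can be sketched like this (a sketch of mine that carries the fragment over by compacting a reused input buffer, not the wrapper class introduced below; the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class CarryOverDecodingDemo {
	// Decodes the whole byte array piece by piece, carrying any character
	// fragment over to the next piece by compacting the input buffer
	public static String decode (byte [] a_inputBytes, int a_pieceSize) throws Exception {
		CharsetDecoder l_decoder = StandardCharsets.UTF_8.newDecoder ();
		// 3 extra slots: a 4-byte UTF-8 character can leave at most 3 fragment bytes
		ByteBuffer l_inputBuffer = ByteBuffer.allocate (a_pieceSize + 3);
		CharBuffer l_outputBuffer = CharBuffer.allocate (a_pieceSize + 3);
		StringBuilder l_result = new StringBuilder ();
		int l_processedBytesLengthSoFar = 0;
		boolean l_isLastIteration = false;
		while (!l_isLastIteration) {
			int l_lengthPerIteration = Math.min (a_inputBytes.length - l_processedBytesLengthSoFar, a_pieceSize);
			l_inputBuffer.put (a_inputBytes, l_processedBytesLengthSoFar, l_lengthPerIteration);
			l_processedBytesLengthSoFar += l_lengthPerIteration;
			l_isLastIteration = l_processedBytesLengthSoFar == a_inputBytes.length;
			l_inputBuffer.flip ();
			CoderResult l_decodingResult = l_decoder.decode (l_inputBuffer, l_outputBuffer, l_isLastIteration);
			if (l_decodingResult.isMalformed () || l_decodingResult.isUnmappable ()) {
				throw new Exception ("The byte sequence cannot be decoded.");
			}
			// 'compact' moves the unconsumed fragment bytes (if any) to the head,
			// so that the next piece is appended after them
			l_inputBuffer.compact ();
			l_outputBuffer.flip ();
			l_result.append (l_outputBuffer);
			l_outputBuffer.clear ();
		}
		l_decoder.flush (l_outputBuffer);
		l_outputBuffer.flip ();
		l_result.append (l_outputBuffer);
		return l_result.toString ();
	}

	public static void main (String [] a_arguments) throws Exception {
		System.out.println (decode ("aϴbΩ".getBytes (StandardCharsets.UTF_8), 1)); // prints "aϴbΩ"
	}
}
```

Note that reusing one buffer with 'compact' avoids copying the whole sequence, at the cost of one extra copy of each piece into the buffer; the wrapper class of this article takes a different approach.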


5: Let's Create a Wrapper Class


Hypothesizer 7
Honestly, I don't think that 'java.nio.charset.CharsetDecoder' (or its document) is nice. So, let's create a wrapper class that behaves as I think the decoder should behave. In fact, I have also created a result class, because I think the result should distinguish whether the input buffer has really underflowed or is complete (there is no remaining byte).

Those classes, 'theBiasPlanet.coreUtilities.bytesArraysHandling.BytesBufferToCharactersBufferDecoder' and 'theBiasPlanet.coreUtilities.bytesArraysHandling.BytesBufferToCharactersBufferDecodingResult', are included in this ZIP file, and how to build the project (a Gradle project) is explained in this article (as that article is of another series that is about developing UNO programs, it contains some instructions that are unnecessary for just building this project: they can be ignored at one's discretion).

This is how to use the wrapper class, again with two arguments: the string from which the byte sequence is created, and the piece length in bytes by which the program cuts the byte sequence into pieces.

@Java Source Code
package theBiasPlanet.tests.bytesBufferToCharactersBufferDecoderTest1;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import theBiasPlanet.coreUtilities.bytesArraysHandling.BytesBufferToCharactersBufferDecoder;
import theBiasPlanet.coreUtilities.bytesArraysHandling.BytesBufferToCharactersBufferDecodingResult;

public class Test1Test {
	private Test1Test () {
	}
	
	public static void main (String [] a_arguments) throws Exception {
		Test1Test.test (a_arguments [0], Integer.parseInt (a_arguments [1]));
	}
	
	private static void test (String a_inputString, int a_dataBufferSize) throws Exception {
		String l_encoding = "UTF-8";
		byte [] l_inputBytes = a_inputString.getBytes (l_encoding);
		int l_inputBytesLength = l_inputBytes.length;
		ByteBuffer l_newInputBuffer = null;
		CharBuffer l_outputBuffer = CharBuffer.allocate (a_dataBufferSize);
		BytesBufferToCharactersBufferDecodingResult l_decodingResult = null;
		BytesBufferToCharactersBufferDecoder l_bytesBufferToCharactersBufferDecoder = new BytesBufferToCharactersBufferDecoder (l_encoding);
		boolean l_isLastIteration = false;
		for (int l_processedBytesLengthSoFar = 0, l_processedBytesLengthPerIteration = 0; ; ) {
			System.out.println (String.format ("### Decoding from the index, %d.", l_processedBytesLengthSoFar));
			l_processedBytesLengthPerIteration = Math.min (l_inputBytesLength - l_processedBytesLengthSoFar, a_dataBufferSize);
			l_newInputBuffer = ByteBuffer.wrap (l_inputBytes, l_processedBytesLengthSoFar, l_processedBytesLengthPerIteration);
			l_processedBytesLengthSoFar += l_processedBytesLengthPerIteration;
			l_isLastIteration = l_processedBytesLengthSoFar == l_inputBytesLength;
			l_decodingResult = l_bytesBufferToCharactersBufferDecoder.decode (l_newInputBuffer, l_outputBuffer, l_isLastIteration);
			handleDecodingResult (l_decodingResult, l_outputBuffer);
			if (l_isLastIteration || l_decodingResult.isMalformed () || l_decodingResult.isUnmappable ()) {
				break;
			}
		}
	}
	
	private static void handleDecodingResult (BytesBufferToCharactersBufferDecodingResult a_decodingResult, CharBuffer a_outputBuffer) {
		System.out.println (String.format ("### The decoding result is '%s' while the inputs residue starting index = '%d'.", a_decodingResult.toString (), a_decodingResult.getInputsResidueStartingIndex ()));
		a_outputBuffer.flip ();
		System.out.println (String.format ("### The output is '%s'.", a_outputBuffer.toString ()));
		a_outputBuffer.clear ();
		System.out.println ("");
	}
}

Having delved into the gist of the behavior of 'java.nio.charset.CharsetDecoder', there is no necessity to delve into the details of those classes, but I will give a brief explanation. 'i_previousInputsResidueBuffer' is the buffer in which the wrapper remembers the remaining bytes from the previous pieces (the remaining bytes may come from multiple pieces if the piece size is less than 3, because a 4-byte UTF-8 character may be split across 3 pieces). In fact, the wrapper doesn't literally prepend the remaining bytes to the next piece (as Java doesn't allow us to prefix anything to an array), but deals with each split character via 'i_previousInputsResidueBuffer'.
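
That maximum residue length of 3 can be confirmed with a small sketch of mine (the class name is hypothetical): feeding only the first 3 bytes of a 4-byte UTF-8 character leaves all 3 bytes unconsumed as the fragment:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class MaximumResidueDemo {
	public static void main (String [] a_arguments) {
		// '𝄞' (U+1D11E) occupies 4 bytes in UTF-8: 0xF0 0x9D 0x84 0x9E
		byte [] l_bytes = "𝄞".getBytes (StandardCharsets.UTF_8);
		System.out.println (l_bytes.length); // prints "4"
		// Feed only the first 3 bytes, as 3 consecutive pieces of size 1 would accumulate
		ByteBuffer l_inputBuffer = ByteBuffer.wrap (l_bytes, 0, 3);
		CoderResult l_decodingResult = StandardCharsets.UTF_8.newDecoder ().decode (l_inputBuffer, CharBuffer.allocate (4), false);
		System.out.println (l_decodingResult.isUnderflow ()); // prints "true"
		System.out.println (l_inputBuffer.remaining ()); // prints "3": the whole 3-byte fragment remains
	}
}
```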


6: The Conclusion


Hypothesizer 7
Now, I seem to understand how to decode any byte sequence into a string.

When the byte sequence is short enough and we are sure that the whole sequence can be decoded successfully, we can just use a constructor of 'String'.

When we are going to read the byte sequence as a stream without having to specify the byte length, we can just use a 'Reader'.

When the byte sequence is too long but no 'Reader' is appropriate for the purpose, we cannot help but cut the sequence into pieces and process it piece by piece. In that case, we can use the 'java.nio.charset.CharsetDecoder' class. It is also useful when the byte sequence may include errors.

However, neither the behavior of that class nor its API document is very nice. So, I have created a wrapper class.


References


<The previous article in this series | The table of contents of this series | The next article in this series>