I decided to check how the getBytes() method of the String class behaves when the code point of a letter is greater than 127. As an example, I took a string with Latin and Cyrillic characters and converted it to an array of bytes:
byte[] bytes = "abcdefghijz<аБ".getBytes();
Received the following bytes:
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 122, 60, -32, -63]
As you can see, because the byte type overflows, the code points а = 224 and Б = 193 wrapped around into the negative range, to -32 and -63 respectively.
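The wrap-around above can be reproduced in isolation. This is only a minimal sketch, assuming the two Cyrillic bytes are 0xE0 and 0xC1 as in the array printed above; the class and method names are made up for illustration. It shows that the bit pattern is unchanged and only the signed decimal view differs:

```java
class SignedByteDemo {
    // Returns the unsigned (0..255) view of a Java byte.
    // Equivalent to Byte.toUnsignedInt(b).
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // The two bytes that printed as -32 and -63 in the question.
        byte[] bytes = {(byte) 0xE0, (byte) 0xC1};
        for (byte b : bytes) {
            // Signed view, unsigned view, and hex view of the same 8 bits.
            System.out.println(b + " -> " + unsigned(b)
                    + " -> 0x" + Integer.toHexString(unsigned(b)));
        }
    }
}
```

Masking with 0xFF is the usual way to get the 0..255 value when a byte has to be shown or compared as an unsigned number.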
Next, I convert the byte array to a char array:
char[] chars = new char[bytes.length];
for (int i = 0; i < bytes.length; i++) chars[i] = (char) bytes[i];
[a, b, c, d, e, f, g, h, i, j, z, <, ?, ?]
As expected, char is undefined for negative values, and the IDE draws question marks.
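For comparison, here is a small sketch of what the cast actually produces, assuming the same two bytes (0xE0 and 0xC1) and assuming they were meant as windows-1251. The class and method names are hypothetical. In Java the cast from byte to char first sign-extends the byte to int and then narrows it, so the result is a defined but unrelated character rather than an undefined one:

```java
import java.nio.charset.Charset;

class ByteToCharDemo {
    // The cast used in the question: byte -> int (sign-extended) -> char.
    static char naiveCast(byte b) {
        return (char) b;
    }

    // Decodes the bytes with an explicitly named charset instead of casting.
    static char[] decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).toCharArray();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE0, (byte) 0xC1};

        // Sign extension turns 0xE0 into the 16-bit value 0xFFE0.
        System.out.println(Integer.toHexString(naiveCast(bytes[0]))); // ffe0

        // Decoding with the charset the bytes were produced in
        // gives the Cyrillic code point U+0430.
        char[] decoded = decode(bytes, "windows-1251");
        System.out.println(Integer.toHexString(decoded[0])); // 430
    }
}
```

Going through new String(bytes, charset) keeps the byte-to-character mapping explicit, which a plain cast cannot do for anything outside ASCII.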
Next, I check how Java handles these chars when writing to a file. First way:
FileOutputStream fos = new FileOutputStream("chars.txt");
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(bytes, 0, bytes.length);
bos.close();
With FileOutputStream, the program writes the Cyrillic letters correctly: instead of question marks, the text file contains the letters а and Б, and when the file is opened in a hex editor, the letters appear as the correct byte values 224 and 193, not the negative numbers the program originally produced.
Second way to write to a file:
FileWriter fos = new FileWriter("chars.txt");
fos.write(chars, 0, chars.length);
fos.close();
With FileWriter, the text file shows question marks instead of the Cyrillic characters, and when the file is opened in a hex editor, each question mark appears as the byte for '?' itself rather than the bytes 224 and 193 that were originally in the string.
This raises the following questions:
1) I expected that FileWriter would correctly write Cyrillic characters that fit into one byte in the standard Windows encoding (windows-1251). Why doesn’t it? I know that FileWriter is a character stream, not a byte stream. But character streams also operate on bytes. If the byte for the symbol а = 224 is set from the start, shouldn’t the program write that same byte to the file at the low level? When viewed in a text editor, something else may be displayed if the encoding is wrong and byte 224 corresponds to some other glyph. But why does FileWriter write a completely different byte?
2) Why does the String class need the getBytes() method if byte in Java has a different range than is traditionally used in computer technology? It seems that in computer technology, no one operates with negative numbers?
Answer 1, authority 100%
byte[] bytes = "abcdefghijz<аБ".getBytes();
This is not quite the full form; the full form looks like this:
byte[] bytes = "abcdefghijz<аБ".getBytes(charset); // charset is the encoding to convert the string to
That is, the bytes produced depend on the encoding used. In the first case, the system default encoding is used (on Russian-locale Windows this is usually Win-1251).
abcdefghijz<аБ: what encoding is it written in? Apparently UTF-8, and that is where the discrepancy comes from.
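To see how much the result depends on the charset argument, here is a rough sketch. It assumes a shortened test string ("z<" plus the two Cyrillic letters, written as Unicode escapes so the encoding of the source file does not interfere); the class name and the encode helper are made up for this illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesDemo {
    // Latin "z<" followed by Cyrillic U+0430 and U+0411, written as escapes
    // so the encoding of the source file itself does not matter.
    static final String TEXT = "z<\u0430\u0411";

    // Encodes TEXT with an explicitly named charset.
    static byte[] encode(String charsetName) {
        return TEXT.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // getBytes() without arguments uses the platform default charset,
        // so this line can print different bytes on different machines.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + Arrays.toString(TEXT.getBytes()));

        // One byte per Cyrillic letter.
        System.out.println("windows-1251: " + Arrays.toString(encode("windows-1251")));

        // Two bytes per Cyrillic letter.
        System.out.println("UTF-8: " + Arrays.toString(TEXT.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Passing the charset explicitly makes the output independent of the machine the program runs on, which the no-argument getBytes() does not guarantee.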
On the second question:
It seems that in computer technology no one operates with negative numbers?
I’ll leave that on your conscience. A byte is a byte, even in Africa; whether it is shown as a positive or a negative number (in decimal) is just a matter of presentation.
OutputStream works with a ready-made array of raw bytes, whereas OutputStreamWriter encodes the char data it is given into bytes according to the Charset set when it was created. FileWriter inherits from OutputStreamWriter, and according to its documentation:
The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.
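Following that advice, a sketch of an OutputStreamWriter built on a FileOutputStream with an explicit charset might look like this. It assumes windows-1251 is the intended file encoding; the temporary file, the class name and the helper method are hypothetical. It writes both properly decoded chars and the chars obtained by casting signed bytes, then reads back what landed in the file:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class WriterCharsetDemo {
    // Writes chars through an OutputStreamWriter with an explicit charset,
    // then returns the bytes that actually ended up in the file.
    static byte[] writeAndReadBack(char[] chars, String charsetName) {
        try {
            Path file = Files.createTempFile("chars", ".txt"); // scratch file for the demo
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream(file.toFile()), Charset.forName(charsetName))) {
                w.write(chars, 0, chars.length);
            }
            byte[] written = Files.readAllBytes(file);
            Files.delete(file);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Properly decoded Cyrillic chars (U+0430, U+0411).
        char[] good = {'\u0430', '\u0411'};
        // Chars produced by casting the signed bytes, as in the question.
        char[] broken = {(char) (byte) 0xE0, (char) (byte) 0xC1};

        System.out.println(Arrays.toString(writeAndReadBack(good, "windows-1251")));   // [-32, -63]
        System.out.println(Arrays.toString(writeAndReadBack(broken, "windows-1251"))); // [63, 63]
    }
}
```

The second result is the writer substituting '?' (byte 63) for chars that have no mapping in the target charset, which matches the question marks seen in the hex editor.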
P.S. I advise you to carefully consider my remark:
abcdefghijz<аБ: what encoding is it written in?
One more update
char and byte are not identical concepts. To make this clear, let’s take a purely Russian letter, Ж:
- in Unicode it is 0x04 0x16 (two bytes)
- in Win-1251 –
- in KOI-8 –
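The byte values for each encoding can be printed rather than looked up. A minimal sketch, assuming the letter is the one with code point U+0416 (matching the 0x04 0x16 pair above), that "Unicode" here means UTF-16 big-endian, and that "KOI-8" refers to the KOI8-R variant; the class and helper names are illustrative only:

```java
import java.nio.charset.Charset;

class LetterBytesDemo {
    // Hex dump of one string encoded with the named charset.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF)); // mask to print as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String letter = "\u0416"; // one character, one code point
        // Same character, different byte sequences depending on the charset.
        for (String name : new String[] {"UTF-16BE", "windows-1251", "KOI8-R", "UTF-8"}) {
            System.out.println(name + ": " + hex(letter, name));
        }
    }
}
```

The char stays the same in every iteration; only the byte representation chosen by the charset changes.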
Now let’s imagine there is a set of bytes and we are reading it. The key question: how do we know which characters are in this set of bytes? That’s right, we cannot know without a priori knowledge of the encoding (or rather, we can, but only by means of certain inferences and heuristics, such as here). Accordingly, the reader must be given, as input, a table for converting bytes to characters (char). This is called conversion.
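The same point can be shown in the reading direction. This sketch assumes the two bytes from the question (224 and 193) and three arbitrarily chosen charsets; the class and method names are hypothetical. It prints code points instead of the characters themselves so that the console encoding does not distort the result:

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Interprets the same raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE0, (byte) 0xC1}; // 224 and 193 from the question
        for (String name : new String[] {"windows-1251", "KOI8-R", "ISO-8859-1"}) {
            StringBuilder codePoints = new StringBuilder();
            for (char c : decode(raw, name).toCharArray()) {
                codePoints.append(String.format("U+%04X ", (int) c));
            }
            // Identical bytes, different characters for each conversion table.
            System.out.println(name + " -> " + codePoints.toString().trim());
        }
    }
}
```

Each charset acts as the conversion table mentioned above: without naming one, the two bytes have no single correct reading.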