I decided to check how the getBytes() method of the String class behaves when a character's code point is greater than 127. As an example, I took a string containing Latin and Cyrillic characters and converted it to a byte array:
byte[] bytes = "abcdefghijz<аБ".getBytes();
Received the following bytes:
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 122, 60, -32, -63]
As you can see, because the byte type is signed, the windows-1251 codes а = 224 and Б = 193 overflow into the negative range as -32 and -63, respectively.
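To illustrate, here is a minimal sketch (the windows-1251 charset name is assumed to be available, as it is in standard JDKs) showing how a signed byte stores 224 as -32 and how the unsigned value can be recovered:

```java
import java.nio.charset.Charset;

public class SignedBytes {
    public static void main(String[] args) {
        // Use windows-1251 explicitly so the result does not depend on the platform default.
        byte[] bytes = "аБ".getBytes(Charset.forName("windows-1251"));
        for (byte b : bytes) {
            // byte is signed in Java: 224 is stored as -32, 193 as -63.
            // Masking with 0xFF (or Byte.toUnsignedInt) recovers the unsigned value.
            System.out.println(b + " -> " + (b & 0xFF));
        }
        // prints: -32 -> 224
        //         -63 -> 193
    }
}
```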
Next, I convert the byte array to char:
char[] chars = new char[bytes.length];
for (int i = 0; i < bytes.length; i++)
    chars[i] = (char) bytes[i];
I receive:
[a, b, c, d, e, f, g, h, i, j, z, <, ?, ?]
As expected, casting a negative byte to char sign-extends it (-32 becomes U+FFE0), which no longer corresponds to any Cyrillic letter, so the IDE draws question marks instead.
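For comparison, the lossless way back from bytes to text is to decode with the same charset rather than casting byte-by-byte. A sketch (charset name assumed available in the JDK):

```java
import java.nio.charset.Charset;

public class DecodeBytes {
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] bytes = "abcdefghijz<аБ".getBytes(cp1251);

        // Casting a negative byte to char sign-extends it:
        // (char) -32 becomes U+FFE0, which has nothing to do with 'а'.
        char wrong = (char) bytes[bytes.length - 2];
        System.out.printf("wrong: U+%04X%n", (int) wrong);  // wrong: U+FFE0

        // The correct way back is to decode with the same charset:
        String restored = new String(bytes, cp1251);
        System.out.println(restored);  // abcdefghijz<аБ
    }
}
```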
Next, I check how Java handles these values when writing to a file. First way:
FileOutputStream fos = new FileOutputStream("chars.txt");
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(bytes, 0, bytes.length);
bos.close();
When using FileOutputStream, the Cyrillic letters come out correctly: the text file shows the letters а and Б instead of question marks, and opening the file in a hex editor shows the correct byte values 224 and 193, not the negative numbers the program originally printed.
Second way to write to the file:
FileWriter fos = new FileWriter("chars.txt");
fos.write(chars, 0, chars.length);
fos.close();
When using FileWriter, the text file shows question marks instead of the Cyrillic characters, and a hex editor shows the question mark's byte 63 (not the 224 and 193 that were originally in the string).
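As a sketch of how one might get the windows-1251 bytes into the file deliberately, a FileOutputStream can be wrapped in an OutputStreamWriter with an explicit charset (the file name here is just an example):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteWithCharset {
    public static void main(String[] args) throws IOException {
        Charset cp1251 = Charset.forName("windows-1251");
        // FileWriter uses the default encoding; to control the bytes,
        // wrap a FileOutputStream in an OutputStreamWriter instead.
        try (Writer w = new OutputStreamWriter(new FileOutputStream("chars.txt"), cp1251)) {
            w.write("аБ");
        }
        byte[] written = Files.readAllBytes(Paths.get("chars.txt"));
        // Now the file contains 0xE0 0xC1 (224, 193), not '?' (63).
        System.out.println((written[0] & 0xFF) + " " + (written[1] & 0xFF));  // 224 193
    }
}
```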
This raises the following questions:
1) I expected that FileWriter would correctly output Cyrillic characters, which fit into one byte in the standard Windows encoding (windows-1251). Why doesn't it? I know that FileWriter is a character stream, not a byte stream, but character streams also operate on bytes at some level. If the byte of the character а is initially 224, shouldn't the program write that same byte to the file at a low level? A text editor might display something else if its encoding is wrong and byte 224 maps to some other glyph, but why does FileWriter write a completely different byte?
2) Why does the String class need the getBytes() method if Java's byte has a different range than the one traditionally used in computing? After all, it seems that in computer technology no one operates with negative byte values?
Answer 1, authority 100%
byte[] bytes = "abcdefghijz<аБ".getBytes();
This is not quite the full form; the full form is:
byte[] bytes = "abcdefghijz<аБ".getBytes(charset); // charset is the string encoding
That is, the bytes you get depend on the encoding used. In the first case, the default encoding set in the system is taken (usually Win-1251 on Russian Windows).
The string abcdefghijz<аБ itself: what encoding is your source file written in? Apparently UTF-8, and that is where the discrepancies come from.
On the second question:
It seems that in computer technology no one operates with negative numbers?
I'll leave that to your conscience. A byte is a byte anywhere; whether to display it as a positive or a negative decimal number is merely a matter of representation.
Update
OutputStream works with a ready-made array of raw bytes and writes it as-is, whereas OutputStreamWriter converts char data into bytes according to the Charset specified when it was created. FileWriter inherits from OutputStreamWriter, and according to the documentation:
The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.
P.S. I advise you to think carefully about my earlier remark:
The string abcdefghijz<аБ – what encoding is it written in?
One more update
char and byte are not identical concepts. To make this clear, take the Russian letter Ж (capital):
- in Unicode it is U+0416, i.e. bytes 0x04 0x16 in UTF-16BE (two bytes)
- in Win-1251 it is 0xC6 (1 byte)
- in KOI8-R it is 0xF6 (1 byte; 0xD6 is the lowercase ж)
etc.
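This can be checked directly in Java (note that in KOI8-R the capital Ж is 0xF6, while 0xD6 is the lowercase ж):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ZheEncodings {
    public static void main(String[] args) {
        String zhe = "Ж";  // U+0416
        print("UTF-16BE", zhe.getBytes(StandardCharsets.UTF_16BE));       // 04 16
        print("windows-1251", zhe.getBytes(Charset.forName("windows-1251"))); // C6
        print("KOI8-R", zhe.getBytes(Charset.forName("KOI8-R")));         // F6
    }

    static void print(String name, byte[] bytes) {
        StringBuilder sb = new StringBuilder(name + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}
```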
Now imagine there is a set of bytes and we are reading it. The key question: how do we know which characters this set of bytes represents? Right: without a priori knowledge of the encoding, we cannot (strictly speaking, we can sometimes guess it by heuristics, such as here), so the reader must be given, as input, a table for converting bytes to characters/char. This is what a charset does.
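To see the ambiguity concretely: the same single byte decodes to different characters under different tables. A minimal sketch:

```java
import java.nio.charset.Charset;

public class DecodingAmbiguity {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xC6 };  // one byte, no intrinsic meaning
        // The same byte decodes to different characters depending on the table used:
        System.out.println(new String(data, Charset.forName("windows-1251"))); // Ж
        System.out.println(new String(data, Charset.forName("KOI8-R")));       // ф
    }
}
```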