I decided to check how the getBytes() method of the String class behaves when a character's code point is greater than 127. As an example, I took a string containing Latin and Cyrillic characters and converted it to a byte array:
byte[] bytes = "abcdefghijz<аБ".getBytes();
Received the following bytes:
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 122, 60, -32, -63]
As you can see, because the byte type is signed, the windows-1251 codes а = 224 and Б = 193 overflow into the negative range as -32 and -63, respectively.
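To illustrate, here is a minimal sketch (the windows-1251 charset name is assumed to be available, as it is in standard JDKs) showing how a signed byte stores 224 as -32 and how the unsigned value can be recovered:

```java
import java.nio.charset.Charset;

public class SignedBytes {
    public static void main(String[] args) {
        // Use windows-1251 explicitly so the result does not depend on the platform default.
        byte[] bytes = "аБ".getBytes(Charset.forName("windows-1251"));
        for (byte b : bytes) {
            // byte is signed in Java: 224 is stored as -32, 193 as -63.
            // Masking with 0xFF (or Byte.toUnsignedInt) recovers the unsigned value.
            System.out.println(b + " -> " + (b & 0xFF));
        }
        // prints: -32 -> 224
        //         -63 -> 193
    }
}
```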
Next, I convert the byte array to char:
char[] chars = new char[bytes.length];
for (int i = 0; i < bytes.length; i++)
    chars[i] = (char) bytes[i];
I receive:
[a, b, c, d, e, f, g, h, i, j, z, <, ?, ?]
As expected, casting a negative byte to char sign-extends it (-32 becomes U+FFE0), which no longer corresponds to any Cyrillic letter, so the IDE draws question marks instead.
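For comparison, the lossless way back from bytes to text is to decode with the same charset rather than casting byte-by-byte. A sketch (charset name assumed available in the JDK):

```java
import java.nio.charset.Charset;

public class DecodeBytes {
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] bytes = "abcdefghijz<аБ".getBytes(cp1251);

        // Casting a negative byte to char sign-extends it:
        // (char) -32 becomes U+FFE0, which has nothing to do with 'а'.
        char wrong = (char) bytes[bytes.length - 2];
        System.out.printf("wrong: U+%04X%n", (int) wrong);  // wrong: U+FFE0

        // The correct way back is to decode with the same charset:
        String restored = new String(bytes, cp1251);
        System.out.println(restored);  // abcdefghijz<аБ
    }
}
```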
Next, I check how Java handles these values when writing to a file. First way:
FileOutputStream fos = new FileOutputStream("chars.txt");
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(bytes, 0, bytes.length);
bos.close();
When using FileOutputStream, the Cyrillic letters come out correctly: the text file shows the letters а and Б instead of question marks, and opening the file in a hex editor shows the correct byte values 224 and 193, not the negative numbers the program originally printed.
Second way to write to the file:
FileWriter fos = new FileWriter("chars.txt");
fos.write(chars, 0, chars.length);
fos.close();
When using FileWriter, the text file shows question marks instead of the Cyrillic characters, and a hex editor shows the question mark's byte 63 (not the 224 and 193 that were originally in the string).
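As a sketch of how one might get the windows-1251 bytes into the file deliberately, a FileOutputStream can be wrapped in an OutputStreamWriter with an explicit charset (the file name here is just an example):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteWithCharset {
    public static void main(String[] args) throws IOException {
        Charset cp1251 = Charset.forName("windows-1251");
        // FileWriter uses the default encoding; to control the bytes,
        // wrap a FileOutputStream in an OutputStreamWriter instead.
        try (Writer w = new OutputStreamWriter(new FileOutputStream("chars.txt"), cp1251)) {
            w.write("аБ");
        }
        byte[] written = Files.readAllBytes(Paths.get("chars.txt"));
        // Now the file contains 0xE0 0xC1 (224, 193), not '?' (63).
        System.out.println((written[0] & 0xFF) + " " + (written[1] & 0xFF));  // 224 193
    }
}
```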
This raises the following questions:
1) I expected that FileWriter would correctly output Cyrillic characters, which fit into one byte in the standard Windows encoding (windows-1251). Why doesn't it? I know that FileWriter is a character stream, not a byte stream, but character streams also operate on bytes at some level. If the byte of the character а is initially 224, shouldn't the program write that same byte to the file at a low level? A text editor might display something else if its encoding is wrong and byte 224 maps to some other glyph, but why does FileWriter write a completely different byte?
2) Why does the String class need the getBytes() method if Java's byte has a different range than the one traditionally used in computing? After all, it seems that in computer technology no one operates with negative byte values?
Answer 1, authority 100%
byte[] bytes = "abcdefghijz<аБ".getBytes();
This is not quite the full form; the full form is:
byte[] bytes = "abcdefghijz<аБ".getBytes(charset); // charset is the string encoding
That is, the bytes you get depend on the encoding used. In the first case, the default encoding set in the system is taken (usually Win-1251 on Russian Windows).
The string abcdefghijz<аБ itself: what encoding is your source file written in? Apparently UTF-8, and that is where the discrepancies come from.
On the second question:
It seems that in computer technology no one operates with negative numbers?
I'll leave that to your conscience. A byte is a byte anywhere; whether to display it as a positive or a negative decimal number is merely a matter of representation.
Update
OutputStream works with a ready-made array of raw bytes and writes it as-is, whereas OutputStreamWriter converts char data into bytes according to the Charset specified when it was created. FileWriter inherits from OutputStreamWriter, and according to the documentation:
The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.
P.S. I advise you to think carefully about my earlier remark:
The string abcdefghijz<аБ – what encoding is it written in?
One more update
char and byte are not identical concepts. To make this clear, take the Russian letter Ж (capital):
- in Unicode it is U+0416, i.e. bytes 0x04 0x16 in UTF-16BE (two bytes)
- in Win-1251 it is 0xC6 (1 byte)
- in KOI8-R it is 0xF6 (1 byte; 0xD6 is the lowercase ж)
etc.
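This can be checked directly in Java (note that in KOI8-R the capital Ж is 0xF6, while 0xD6 is the lowercase ж):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ZheEncodings {
    public static void main(String[] args) {
        String zhe = "Ж";  // U+0416
        print("UTF-16BE", zhe.getBytes(StandardCharsets.UTF_16BE));       // 04 16
        print("windows-1251", zhe.getBytes(Charset.forName("windows-1251"))); // C6
        print("KOI8-R", zhe.getBytes(Charset.forName("KOI8-R")));         // F6
    }

    static void print(String name, byte[] bytes) {
        StringBuilder sb = new StringBuilder(name + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}
```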
Now imagine there is a set of bytes and we are reading it. The key question: how do we know which characters this set of bytes represents? Right: without a priori knowledge of the encoding, we cannot (strictly speaking, we can sometimes guess it by heuristics, such as here), so the reader must be given, as input, a table for converting bytes to characters/char. This is what a charset does.
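To see the ambiguity concretely: the same single byte decodes to different characters under different tables. A minimal sketch:

```java
import java.nio.charset.Charset;

public class DecodingAmbiguity {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xC6 };  // one byte, no intrinsic meaning
        // The same byte decodes to different characters depending on the table used:
        System.out.println(new String(data, Charset.forName("windows-1251"))); // Ж
        System.out.println(new String(data, Charset.forName("KOI8-R")));       // ф
    }
}
```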