What is `\U`?

I came across the following puzzle:

string str1 = "\U0010FADE";
string str2 = "\U0000FADE";
Console.WriteLine(str1.Length);
Console.WriteLine(str2.Length);

As it turns out, this prints 2 and 1. What is going on here?

I only know the lowercase \u, which must be followed by exactly four hexadecimal digits.

In the MSDN documentation for char, \U is not listed, which seems logical, since the result is clearly not a char.

For strings it is mentioned, but I still do not understand it:

The escape code \udddd (where dddd is a four-digit number) represents the Unicode character U+dddd. Eight-digit Unicode escape codes are also recognized: \Udddddddd.

Another source says it is needed to form surrogate pairs, but again without further explanation:

\Uxxxxxxxx - Unicode escape sequence for character with hex value xxxxxxxx (for generating surrogates)

So what does \U do, and why is the second string not a surrogate pair but just a single character?

I tried running it on ideone, but the characters it printed were not the ones whose codes are given in the source. That may just be a flaw in ideone, though.


Answer 1, Authority 100%

The documentation is entirely correct. The \Udddddddd syntax simply inserts the Unicode character with code dddddddd into a string constant. That character may need a surrogate pair and occupy two UTF-16 code units, or it may be an ordinary character that occupies a single code unit.

ECMA-334

7.4.2 Unicode character escape sequences

A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are
processed in identifiers (§7.4.3), character literals (§7.4.5.5), and
regular string literals (§7.4.5.6). A Unicode escape sequence is not
processed in any other location (for example, to form an operator,
punctuator, or keyword).

unicode-escape-sequence ::
\u hex-digit hex-digit hex-digit hex-digit
\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

A Unicode character escape sequence represents the single Unicode code point
formed by the
hexadecimal number following the “\u” or “\U” characters. Since C#
uses a 16-bit encoding of Unicode code points in character and string
values, a Unicode code point in the range U+10000 to U+10FFFF is
represented using two Unicode surrogate code units. Unicode code
points above U+FFFF are not permitted in character literals. Unicode
code points above U+10FFFF are invalid and are not supported.

In the first case the code point (U+10FADE) lies in the range U+10000 to U+10FFFF, so it is represented by two code units. In the second case the code point (U+FADE) is below U+10000, so a single code unit is enough.
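To make the split into code units concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The class SurrogateSketch and the helper ToUtf16 are hypothetical names introduced only for illustration; the hand-computed result is compared against the built-in char.ConvertFromUtf32.

```csharp
using System;

static class SurrogateSketch
{
    // Hypothetical helper: encodes one code point as UTF-16 code units by hand.
    public static char[] ToUtf16(int codePoint)
    {
        bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || isSurrogate)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        // Code points up to U+FFFF fit in a single 16-bit code unit.
        if (codePoint <= 0xFFFF)
            return new[] { (char)codePoint };

        // Above U+FFFF: subtract 0x10000 and split the remaining 20 bits.
        int offset = codePoint - 0x10000;
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new[] { high, low };
    }

    static void Main()
    {
        foreach (int cp in new[] { 0x10FADE, 0xFADE })
        {
            char[] units = ToUtf16(cp);
            string hex = string.Join(" ",
                Array.ConvertAll(units, u => ((int)u).ToString("X4")));
            bool matches = new string(units) == char.ConvertFromUtf32(cp);
            Console.WriteLine("U+{0:X}: {1} code unit(s): {2}; matches built-in: {3}",
                cp, units.Length, hex, matches);
        }
    }
}
```

For U+10FADE this gives the two code units DBFE DEDE, and for U+FADE the single code unit FADE, which is why the two Length values are 2 and 1.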

In other words, \U0000FADE is equivalent to \uFADE, not to \u0000\uFADE as it might seem at first glance (the latter really does consist of two code units).
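This equivalence can be checked directly. The sketch below is only an illustration: EscapeEquivalenceSketch and Dump are hypothetical names. It prints each code unit as hex, so the result does not depend on how the console renders the characters.

```csharp
using System;

static class EscapeEquivalenceSketch
{
    // Hypothetical helper: renders each UTF-16 code unit as four hex digits.
    public static string Dump(string s)
    {
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            parts[i] = ((int)s[i]).ToString("X4");
        return string.Join(" ", parts);
    }

    static void Main()
    {
        string eightDigit = "\U0000FADE";   // one code point, one code unit
        string fourDigit = "\uFADE";        // the same code point
        string twoEscapes = "\u0000\uFADE"; // two separate characters
        string astral = "\U0010FADE";       // one code point, two code units

        Console.WriteLine(eightDigit == fourDigit);          // True
        Console.WriteLine(Dump(eightDigit));                 // FADE
        Console.WriteLine(Dump(twoEscapes));                 // 0000 FADE
        Console.WriteLine(Dump(astral));                     // DBFE DEDE
        Console.WriteLine(char.IsSurrogatePair(astral, 0));  // True
    }
}
```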
