I came across the following puzzle:
string str1 = "\U0010FADE";
string str2 = "\U0000FADE";
Console.WriteLine(str1.Length);
Console.WriteLine(str2.Length);
As it turns out, this prints 2 and 1. What is going on here?
I only know the lowercase \u, which must be followed by exactly 4 hexadecimal digits.
The MSDN page for char does not list \U, which makes sense, since the result clearly does not always fit in a char.
For strings it is mentioned, but I still do not understand it:
The escape code \udddd (where dddd is a four-digit number) represents the Unicode character U+dddd. Eight-digit Unicode escape codes are also recognized: \Udddddddd.
Elsewhere it says that \U is used to form surrogate pairs, but again without further explanation:
\Uxxxxxxxx - Unicode escape sequence for character with hex value xxxxxxxx (for generating surrogates)
So what does \U actually do, and why does the second string contain not a surrogate pair but only one character?
I tried running it on ideone, but the characters it printed did not match the codes specified in the source code. That may well be a bug in ideone, though.
Answer 1, Authority 100%
The documentation is entirely correct. The syntax \Udddddddd simply inserts the Unicode character with code dddddddd into the string constant. That character may be represented by a surrogate pair and occupy two UTF-16 code units, but it may also be an ordinary character that occupies a single code unit.
7.4.2 Unicode character escape sequences
A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are
processed in identifiers (§7.4.3), character literals (§7.4.5.5), and
regular string literals (§7.4.5.6). A Unicode escape sequence is not
processed in any other location (for example, to form an operator,
punctuator, or keyword).
unicode-escape-sequence::
    \u hex-digit hex-digit hex-digit hex-digit
    \U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit
A Unicode character escape sequence represents the single Unicode code point
formed by the hexadecimal number following the “\u” or “\U” characters. Since C#
uses a 16-bit encoding of Unicode code points in character and string
values, a Unicode code point in the range U+10000 to U+10FFFF is
represented using two Unicode surrogate code units. Unicode code
points above U+FFFF are not permitted in character literals. Unicode
code points above U+10FFFF are invalid and are not supported.
In the first case the code point is above U+FFFF, so it is represented by two code units. In the second case it is not, so a single code unit is enough.
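To make the two cases concrete, here is a minimal sketch that checks the lengths and rebuilds the surrogate pair by hand. The EscapeDemo class and the ToUtf16 helper are hypothetical names introduced only for illustration, and the helper assumes its input is a valid code point outside the surrogate range.

```csharp
using System;

public static class EscapeDemo
{
    // Hypothetical helper: splits a code point into UTF-16 code units by hand.
    // Assumes a valid code point (0..0x10FFFF, not itself a surrogate).
    public static char[] ToUtf16(int codePoint)
    {
        if (codePoint < 0x10000)
            return new[] { (char)codePoint };      // fits in one code unit

        int v = codePoint - 0x10000;               // 20-bit offset
        char high = (char)(0xD800 + (v >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new[] { high, low };
    }

    public static void Main()
    {
        string str1 = "\U0010FADE";
        string str2 = "\U0000FADE";

        Console.WriteLine(str1.Length);                                // 2
        Console.WriteLine(str2.Length);                                // 1
        Console.WriteLine(char.IsSurrogatePair(str1, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(str1, 0).ToString("X")); // 10FADE
        Console.WriteLine(new string(ToUtf16(0x10FADE)) == str1);      // True
    }
}
```

Length counts UTF-16 code units, not code points, which is why the first string reports 2 even though it holds a single character.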
In other words, writing \U0000FADE is equivalent to writing \uFADE, not \u0000\uFADE as it might seem at first glance (the latter really does consist of two code units).
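The equivalence is easy to check directly. This is a small sketch; the EquivalenceDemo class and its Check method are hypothetical names used only for the example.

```csharp
using System;

public static class EquivalenceDemo
{
    // Hypothetical helper: returns true when the eight-digit and four-digit
    // escapes produce the same string and the two-escape form does not.
    public static bool Check()
    {
        string eightDigit = "\U0000FADE";
        string fourDigit = "\uFADE";
        string twoEscapes = "\u0000\uFADE";   // NUL followed by U+FADE

        return eightDigit == fourDigit
            && twoEscapes.Length == 2
            && twoEscapes[0] == '\0'
            && eightDigit != twoEscapes;
    }

    public static void Main()
    {
        Console.WriteLine(Check()); // True
    }
}
```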