author | Roberto Ierusalimschy <roberto@inf.puc-rio.br> | 2019-03-15 13:14:17 -0300 |
---|---|---|
committer | Roberto Ierusalimschy <roberto@inf.puc-rio.br> | 2019-03-15 13:14:17 -0300 |
commit | 1e0c73d5b643707335b06abd2546a83d9439d14c (patch) | |
tree | b80b7d5e2cfeeef888ddf98fcc6276832134c1bf /manual | |
parent | 8fa4f1380b9a203bfdf002c2e9e9e13ebb8384c1 (diff) | |
Changes in the validation of UTF-8

All UTF-8 encoding functionality (including the escape
sequence '\u') accepts all values from the original UTF-8
specification (with sequences of up to six bytes).

By default, the decoding functions in the UTF-8 library do not
accept invalid Unicode code points, such as surrogates. A new
parameter 'nonstrict' makes them accept all code points up to
(2^31)-1, as in the original UTF-8 specification.
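
The behavior described in the commit message can be sketched as follows. This is a minimal illustration, not part of the commit; it assumes a Lua build that contains this change (Lua 5.4 or later). The variable names are hypothetical, and the extra argument is passed positionally, so the name 'nonstrict' does not appear in the calls.

```lua
-- Sketch only: assumes a Lua build that includes this change (5.4 or later).

-- Encoding accepts every value up to 0x7FFFFFFF; the largest value
-- needs six bytes under the original UTF-8 specification.
local big = utf8.char(0x7FFFFFFF)
assert(#big == 6)

-- A surrogate can still be encoded...
local surrogate = utf8.char(0xD800)

-- ...but the decoding functions reject it by default.
local n, pos = utf8.len(surrogate)
assert(not n and pos == 1)      -- failure value plus position of the invalid byte

-- Passing true as the extra (non-strict) argument lifts the check.
assert(utf8.len(surrogate, 1, -1, true) == 1)
assert(utf8.codepoint(surrogate, 1, 1, true) == 0xD800)

-- Overlong sequences are rejected even in non-strict mode.
local overlong = "\xC0\x80"     -- overlong encoding of U+0000
assert(not utf8.len(overlong, 1, -1, true))
```

Note that utf8.len reports failure through its return values, whereas utf8.codepoint and utf8.codes raise an error on an invalid sequence.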
Diffstat (limited to 'manual')
-rw-r--r-- | manual/manual.of | 43 |
1 files changed, 39 insertions, 4 deletions
diff --git a/manual/manual.of b/manual/manual.of
index 1e4ca857..8a8ebad5 100644
--- a/manual/manual.of
+++ b/manual/manual.of
@@ -1004,6 +1004,8 @@ the escape sequence @T{\u{@rep{XXX}}}
 (note the mandatory enclosing brackets),
 where @rep{XXX} is a sequence of one or more hexadecimal digits
 representing the character code point.
+This code point can be any value smaller than @M{2@sp{31}}.
+(Lua uses the original UTF-8 specification here.)
 
 Literal strings can also be defined using a long format
 enclosed by @def{long brackets}.
@@ -6899,6 +6901,7 @@ x = string.gsub("$name-$version.tar.gz", "%$(%w+)", t)
 }
 
 @LibEntry{string.len (s)|
+
 Receives a string and returns its length.
 The empty string @T{""} has length 0.
 Embedded zeros are counted,
@@ -6907,6 +6910,7 @@ so @T{"a\000bc\000"} has length 5.
 }
 
 @LibEntry{string.lower (s)|
+
 Receives a string and returns a copy of this string with all
 uppercase letters changed to lowercase.
 All other characters are left unchanged.
@@ -6915,6 +6919,7 @@ The definition of what an uppercase letter is depends on the current locale.
 }
 
 @LibEntry{string.match (s, pattern [, init])|
+
 Looks for the first @emph{match} of
 @id{pattern} @see{pm} in the string @id{s}.
 If it finds one, then @id{match} returns
@@ -6946,6 +6951,7 @@ The format string cannot have the variable-length options
 }
 
 @LibEntry{string.rep (s, n [, sep])|
+
 Returns a string that is the concatenation of @id{n} copies of
 the string @id{s} separated by the string @id{sep}.
 The default value for @id{sep} is the empty string
@@ -6958,11 +6964,13 @@ with a single call to this function.)
 }
 
 @LibEntry{string.reverse (s)|
+
 Returns a string that is the string @id{s} reversed.
 
 }
 
 @LibEntry{string.sub (s, i [, j])|
+
 Returns the substring of @id{s} that
 starts at @id{i} and continues until @id{j};
 @id{i} and @id{j} can be negative.
@@ -6998,6 +7006,7 @@ this function also returns the index of the first unread byte in @id{s}.
 }
 
 @LibEntry{string.upper (s)|
+
 Receives a string and returns a copy of this string with all
 lowercase letters changed to uppercase.
 All other characters are left unchanged.
@@ -7318,8 +7327,24 @@ or one plus the length of the subject string.
 As in the string library,
 negative indices count from the end of the string.
 
+Functions that create byte sequences
+accept all values up to @T{0x7FFFFFFF},
+as defined in the original UTF-8 specification;
+that implies byte sequences of up to six bytes.
+
+Functions that interpret byte sequences only accept
+valid sequences (well formed and not overlong).
+By default, they only accept byte sequences
+that result in valid Unicode code points,
+rejecting values larger than @T{10FFFF} and surrogates.
+A boolean argument @id{nonstrict}, when available,
+lifts these checks,
+so that all values up to @T{0x7FFFFFFF} are accepted.
+(Not well formed and overlong sequences are still rejected.)
+
 
 @LibEntry{utf8.char (@Cdots)|
+
 Receives zero or more integers,
 converts each one to its corresponding UTF-8 byte sequence
 and returns a string with the concatenation of all these sequences.
@@ -7327,14 +7352,15 @@ and returns a string with the concatenation of all these sequences.
 }
 
 @LibEntry{utf8.charpattern|
-The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xF4][\x80-\xBF]*}
+
+The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xFD][\x80-\xBF]*}
 @see{pm},
 which matches exactly one UTF-8 byte sequence,
 assuming that the subject is a valid UTF-8 string.
 
 }
 
-@LibEntry{utf8.codes (s)|
+@LibEntry{utf8.codes (s [, nonstrict])|
 
 Returns values so that the construction
 @verbatim{
@@ -7347,7 +7373,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.codepoint (s [, i [, j]])|
+@LibEntry{utf8.codepoint (s [, i [, j [, nonstrict]]])|
+
 Returns the codepoints (as integers) from all characters in @id{s}
 that start between byte position @id{i} and @id{j} (both included).
 The default for @id{i} is 1 and for @id{j} is @id{i}.
@@ -7355,7 +7382,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.len (s [, i [, j]])|
+@LibEntry{utf8.len (s [, i [, j [, nonstrict]]])|
+
 Returns the number of UTF-8 characters in string @id{s}
 that start between positions @id{i} and @id{j} (both inclusive).
 The default for @id{i} is @num{1} and for @id{j} is @num{-1}.
@@ -7365,6 +7393,7 @@ returns a false value plus the position of the first invalid byte.
 }
 
 @LibEntry{utf8.offset (s, n [, i])|
+
 Returns the position (in bytes) where the encoding of the
 @id{n}-th character of @id{s}
 (counting from position @id{i}) starts.
@@ -8755,6 +8784,12 @@ You can enclose the call in parentheses if you need to
 discard these extra results.
 }
 
+@item{
+By default, the decoding functions in the @Lid{utf8} library
+do not accept surrogates as valid code points.
+An extra parameter in these functions makes them more permissive.
+}
+
 }
 
 }
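
The effect of widening the lead-byte range in utf8.charpattern from \xF4 to \xFD can be illustrated with a short sketch. It is not part of the patch and assumes a Lua build containing this change (Lua 5.4 or later); the names s and pieces are hypothetical.

```lua
-- Sketch only: assumes a Lua build that includes this change (5.4 or later).

-- One 1-byte, one 2-byte, and one 6-byte sequence.
local s = "a" .. utf8.char(0xE9) .. utf8.char(0x7FFFFFFF)

-- Split the string into byte sequences and record each length in bytes.
local pieces = {}
for ch in string.gmatch(s, utf8.charpattern) do
  pieces[#pieces + 1] = #ch
end

-- The six-byte sequence starts with \xFD, which the widened pattern
-- accepts as a lead byte.
assert(#pieces == 3)
assert(pieces[1] == 1 and pieces[2] == 2 and pieces[3] == 6)
```

With the previous pattern, whose lead-byte class ended at \xF4, the \xFD byte would not start a match, so the last sequence would be skipped by the loop.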