author Russ Cox <rsc@golang.org> 2021-10-07 09:56:29 -0400
committer Russ Cox <rsc@golang.org> 2021-10-11 15:28:50 +0000
commit 702e33717486cb8331db17304f2369ef641da61f
tree a024631a5c2497d89ad7da56979025fe2ac8bf0a /src/regexp/regexp.go
parent 34f7b1f841cc450cc3aba42019e613fd03a84fce
regexp: document and implement that invalid UTF-8 bytes are the same as U+FFFD
What should it mean to run a regexp match on invalid UTF-8 bytes?
The coherent behavior options are:

1. Invalid UTF-8 does not match any character classes,
   nor a U+FFFD literal (nor \x{fffd}).

2. Each byte of invalid UTF-8 is treated identically to a U+FFFD
   in the input, as a utf8.DecodeRune loop might.

RE2 uses Rule 1.
Because it works byte at a time, it can also provide \C to match any
single byte of input, which matches invalid UTF-8 as well.
This provides the nice property that a match for a regexp without \C
is guaranteed to be valid UTF-8.

Unfortunately, today Go has an incoherent mix of these two, although
mostly Rule 2. This is a deviation from RE2, and it gives up the nice
property, but we probably can't correct that at this point.
In particular .* already matches entire inputs today, valid UTF-8 or not,
and I doubt we can break that.

This CL adopts Rule 2 officially, fixing the few places that deviate from it.

Fixes #48749.

Change-Id: I96402527c5dfb1146212f568ffa09dde91d71244
Reviewed-on: https://go-review.googlesource.com/c/go/+/354569
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
Diffstat (limited to 'src/regexp/regexp.go')
-rw-r--r-- src/regexp/regexp.go 8
1 files changed, 7 insertions, 1 deletions
diff --git a/src/regexp/regexp.go b/src/regexp/regexp.go
index bfcf7910cf..af7259c9bf 100644
--- a/src/regexp/regexp.go
+++ b/src/regexp/regexp.go
@@ -20,6 +20,8 @@
 // or any book about automata theory.
 //
 // All characters are UTF-8-encoded code points.
+// Following utf8.DecodeRune, each byte of an invalid UTF-8 sequence
+// is treated as if it encoded utf8.RuneError (U+FFFD).
 //
 // There are 16 methods of Regexp that match a regular expression and identify
 // the matched text. Their names are matched by this regular expression:
@@ -276,7 +278,11 @@ func minInputLen(re *syntax.Regexp) int {
 	case syntax.OpLiteral:
 		l := 0
 		for _, r := range re.Rune {
-			l += utf8.RuneLen(r)
+			if r == utf8.RuneError {
+				l++
+			} else {
+				l += utf8.RuneLen(r)
+			}
 		}
 		return l
 	case syntax.OpCapture, syntax.OpPlus: