re: Document [:ascii:] character class deficiency #4551

jhogberg · 2021-02-25T14:49:40Z

Issue #4544 pointed out that the [:ascii:] character class erroneously matches Latin-1 characters. This PR documents this deficiency since we can't fix it in a backwards-compatible manner.

lukebakken · 2021-02-25T16:39:29Z

lib/stdlib/doc/src/re.xml

+    <p>There is another character class, <c>ascii</c>, that erroneously matches
+      Latin-1 characters instead of the 0-127 range specified by POSIX. This
+      cannot be fixed without altering the behaviour of other classes, so we
+      recommend matching the range with <c>[\x01-\x7f]</c> instead.</p>
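The recommended workaround, an explicit range in place of the named class, can be illustrated with a minimal sketch. It uses Python's `re` module purely as a stand-in: Python has no POSIX named classes, so this shows only how an explicit range behaves, not Erlang's `re`. The helper names are hypothetical. Note that the range as quoted in the diff starts at `\x01` and so excludes NUL, whereas the POSIX 0-127 range would be `[\x00-\x7f]`.

```python
import re

# Explicit 0-127 range, as POSIX specifies for the ascii class.
ASCII_ONLY = re.compile(r"[\x00-\x7f]*")
# Range exactly as quoted in the documentation change; it starts at 0x01.
QUOTED_RANGE = re.compile(r"[\x01-\x7f]*")


def is_ascii_only(text):
    """Return True if every character is in the 0-127 range."""
    return ASCII_ONLY.fullmatch(text) is not None


def matches_quoted_range(text):
    """Return True if every character is in the 1-127 range."""
    return QUOTED_RANGE.fullmatch(text) is not None


print(is_ascii_only("plain text"))      # True
print(is_ascii_only("smörgås"))         # False: ö and å are above 127
print(matches_quoted_range("a\x00b"))   # False: NUL is outside \x01-\x7f
```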


If it's not too much trouble to explain further ... for my own knowledge is this related to locale and the underlying regex library or something else?

ferd · 2021-02-25 (edited)

It's a bit of both. The underlying PCRE library ties the ASCII range to the 'printable' character range, which is only incidentally ASCII if you're using the C locale. For historical reasons, Erlang sets the default locale to Latin-1, which extends the printable range and accidentally extends the ASCII range at the same time.

So it's a change in defaults to match Erlang semantics that triggers a PCRE implementation detail in unclearly documented ways (the C-locale dependence isn't obvious).

lib/stdlib/doc/src/re.xml

The regex library we use can work either in locale-specific mode, or unicode mode. The locale-specific mode uses a pregenerated table to tell which characters are printable, numeric, and so on. For historical reasons, OTP has always used Latin-1 for this table, so characters like `ö` are considered to be letters.

This is fine, but the library has two quirks that don't play well with each other:

* The locale-specific table is always consulted for code points below 256 regardless of whether we're in unicode mode or not, and the `ucp` option only affects code points that aren't defined in this table (zeroed).
* The character class `[:ascii:]` matches characters that are defined in the above table.

This is fine when the regex library is built with its default ASCII table: `[:ascii:]` only matches ASCII characters (by definition) and the library documentation states that `ucp` is required to match characters beyond that with `\w` and friends.

Unfortunately, we build the library with the Latin-1 table so `[:ascii:]` matches Latin-1 characters instead, and we can't change the table since we've documented that `\w` etc work fine with Latin-1 characters, only requiring `ucp` for characters beyond that.

At this point you might be thinking that this is a bug in how the regex library handles `[:ascii:]`. Well, yes, POSIX says it should match all code points between 0-127, but that's misleading since it's only true for strict supersets of ASCII: should `[:ascii:]` match 0x5C if the table is Shift-JIS? It would be just as wrong as matching `ö`. :-(

Why not try to do the right thing and mark ASCII-compatibility for each code point, since (for instance) 0x41 is `A` both in ASCII and Shift-JIS? There's no way to ask a locale whether a code point refers to the same character in ASCII, so the users would need to manually go through the tables after generating them. Happy fun times.

I've settled for documenting this mess since we can't fix this on our end without breaking people's code, and there's not much point in reporting this upstream since it'll either be misleading or far too much work for the user, and PCRE-8.x is nearing the very end of its life.
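The interaction between the two quirks can be sketched with a toy model. This Python is not PCRE's code: the flag values and the names `build_table`, `matches_ascii_class`, and `matches_letter` are hypothetical, and "defined in the table" is modelled simply as a nonzero entry, which is an assumption based on the description above.

```python
# Toy model of the table lookup described above; not PCRE's actual code.
# Flag values and function names are hypothetical.
LETTER = 1
OTHER = 2  # any defined character that is not a letter


def build_table(encoding):
    """Build a 256-entry flag table; 0 means 'not defined in this table'."""
    table = [0] * 256
    for code_point in range(256):
        try:
            char = bytes([code_point]).decode(encoding)
        except UnicodeDecodeError:
            continue  # leave the entry zeroed
        table[code_point] = LETTER if char.isalpha() else OTHER
    return table


def matches_ascii_class(table, code_point):
    """Model of [:ascii:]: matches whatever the table defines."""
    return code_point < 256 and table[code_point] != 0


def matches_letter(table, code_point):
    """Model of the table-driven letter check used below code point 256."""
    return code_point < 256 and bool(table[code_point] & LETTER)


ASCII_TABLE = build_table("ascii")     # stands in for the default build
LATIN1_TABLE = build_table("latin-1")  # stands in for the table OTP uses

O_UMLAUT = 0xF6  # 'ö' in Latin-1
print(matches_ascii_class(ASCII_TABLE, O_UMLAUT))   # False
print(matches_ascii_class(LATIN1_TABLE, O_UMLAUT))  # True
print(matches_letter(LATIN1_TABLE, O_UMLAUT))       # True
```

In this model the same table drives both checks, so switching back to the ASCII table to repair `[:ascii:]` would also stop `ö` from counting as a letter, which is the backwards-compatibility problem described above.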


jhogberg added the team:VM, testing, documentation, and bug labels Feb 25, 2021

jhogberg added this to the OTP-24.0 milestone Feb 25, 2021

jhogberg self-assigned this Feb 25, 2021

jhogberg linked an issue Feb 25, 2021 that may be closed by this pull request

Regular expressions with POSIX character class [:ascii:] matches code points > 127 #4544

Closed

lukebakken reviewed Feb 25, 2021


michallepicki reviewed Feb 25, 2021



jhogberg force-pushed the john/stdlib/document-ascii-class-snafu/GH-4544/OTP-17222 branch from 323bdf3 to 092584b Compare February 26, 2021 07:56

jhogberg added the fix label Feb 26, 2021

jhogberg merged commit 1f365da into erlang:master Feb 26, 2021

josevalim pushed a commit to elixir-lang/elixir that referenced this pull request Feb 27, 2021

Copy upstream documentation change about regex ascii character class deficiency (#10757) erlang/otp#4551

e3d88ae
