re: Document [:ascii:] character class deficiency #4551
lib/stdlib/doc/src/re.xml
Outdated
<p>There is another character class, <c>ascii</c>, that erroneously matches
Latin-1 characters instead of the 0-127 range specified by POSIX. This
cannot be fixed without altering the behaviour of other classes, so we
recommend matching the range with <c>[\x01-\x7f]</c> instead.</p>
If it's not too much trouble to explain further: for my own knowledge, is this related to the locale and the underlying regex library, or to something else?
It's a bit of both. The underlying PCRE library ties the ascii range to the 'printable' character range, which is only incidentally ASCII if you're using the C locale. For historical reasons, Erlang sets the default locale to Latin-1, which extends the printable range and accidentally extends the ascii range at the same time.
So it's a change in defaults, made to match Erlang semantics, that triggers a PCRE implementation detail in a poorly documented way (the C-locale dependence isn't obvious).
The regex library we use can work either in locale-specific mode or in unicode mode. The locale-specific mode uses a pregenerated table to tell which characters are printable, numeric, and so on. For historical reasons, OTP has always used Latin-1 for this table, so characters like `ö` are considered to be letters. This is fine, but the library has two quirks that don't play well with each other:

* The locale-specific table is always consulted for code points below 256 regardless of whether we're in unicode mode or not, and the `ucp` option only affects code points that aren't defined in this table (zeroed).
* The character class `[:ascii:]` matches characters that are defined in the above table.

This is fine when the regex library is built with its default ASCII table: `[:ascii:]` only matches ASCII characters (by definition) and the library documentation states that `ucp` is required to match characters beyond that with `\w` and friends. Unfortunately, we build the library with the Latin-1 table so `[:ascii:]` matches Latin-1 characters instead, and we can't change the table since we've documented that `\w` etc work fine with Latin-1 characters, only requiring `ucp` for characters beyond that.

At this point you might be thinking that this is a bug in how the regex library handles `[:ascii:]`. Well, yes, POSIX says it should match all code points between 0-127, but that's misleading since it's only true for strict supersets of ASCII: should `[:ascii:]` match 0x5C if the table is Shift-JIS? It would be just as wrong as matching `ö`. :-(

Why not try to do the right thing and mark ASCII-compatibility for each code point, since (for instance) 0x41 is `A` both in ASCII and Shift-JIS? There's no way to ask a locale whether a code point refers to the same character in ASCII, so the users would need to manually go through the tables after generating them. Happy fun times.
I've settled for documenting this mess since we can't fix this on our end without breaking people's code, and there's not much point in reporting this upstream since it'll either be misleading or far too much work for the user, and PCRE-8.x is nearing the very end of its life.
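For readers who want to see how the two quirks described above combine, here is a rough toy model in Python. It is not PCRE's code and not how OTP builds its tables: the flag bits, the table layout, and every function name are hypothetical, and "defined in the table" is simplified to "non-zero entry".

```python
# Toy model of a table-driven character classifier.
# NOT PCRE's implementation; all names and the table layout are hypothetical.

LETTER = 1  # hypothetical flag: the code point is a letter
OTHER = 2   # hypothetical flag: any other character the table defines


def build_table(defined_below):
    """Return a 256-entry table; a zero entry means 'not defined here'."""
    table = [0] * 256
    for cp in range(defined_below):
        table[cp] = LETTER if chr(cp).isalpha() else OTHER
    return table


ASCII_TABLE = build_table(128)   # models the library's default table
LATIN1_TABLE = build_table(256)  # models a Latin-1 table like the one OTP uses


def is_letter(cp, table, ucp=False):
    # Quirk 1: entries below 256 that the table defines always win;
    # the ucp flag only matters for code points the table leaves zeroed.
    if cp < 256 and table[cp] != 0:
        return bool(table[cp] & LETTER)
    return ucp and chr(cp).isalpha()


def matches_ascii_class(cp, table):
    # Quirk 2: the "ascii" class is modeled as "defined in the table",
    # so its meaning silently follows whichever table was built in.
    return cp < 256 and table[cp] != 0


O_UMLAUT = 0xF6  # code point of 'ö' in Latin-1 and Unicode
```

In this model, `matches_ascii_class(O_UMLAUT, ASCII_TABLE)` is false while `matches_ascii_class(O_UMLAUT, LATIN1_TABLE)` is true, which mirrors the behaviour reported in the issue. It also shows why swapping the table is not an option: `is_letter(O_UMLAUT, ASCII_TABLE)` becomes false unless `ucp` is set.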
323bdf3 to 092584b
Issue #4544 pointed out that the `[:ascii:]` character class erroneously matches Latin-1 characters. This PR documents this deficiency since we can't fix it in a backwards-compatible manner.
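The workaround in the doc change is an explicit code point range, which does not depend on any character table. As a small sketch of what that range accepts, the snippet below uses Python's `re` module purely for illustration (the PR itself concerns Erlang's `re` module), and `is_ascii_text` is a hypothetical helper name. Note that the range as quoted in the diff starts at 0x01, so it excludes NUL.

```python
import re

# The explicit range quoted in the doc change. It starts at 0x01,
# so the NUL character (0x00) is not part of it.
ASCII_RANGE = re.compile(r"[\x01-\x7f]+")


def is_ascii_text(text):
    """Hypothetical helper: True if every character is in 0x01-0x7f."""
    return ASCII_RANGE.fullmatch(text) is not None
```

With this helper, `is_ascii_text("hello")` is true and `is_ascii_text("h\xf6llo")` is false, because `ö` (0xF6) falls outside the range regardless of locale tables.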