re: Document [:ascii:] character class deficiency #4551

jhogberg

Issue #4544 pointed out that the [:ascii:] character class erroneously matches Latin-1 characters. This PR documents this deficiency since we can't fix it in a backwards-compatible manner.

@jhogberg added the team:VM, testing, documentation, and bug labels Feb 25, 2021
@jhogberg added this to the OTP-24.0 milestone Feb 25, 2021
@jhogberg self-assigned this Feb 25, 2021
<p>There is another character class, <c>ascii</c>, that erroneously matches
Latin-1 characters instead of the 0-127 range specified by POSIX. This
cannot be fixed without altering the behaviour of other classes, so we
recommend matching the range with <c>[\x01-\x7f]</c> instead.</p>
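
To make the recommended range concrete, here is a minimal sketch of which bytes `[\x01-\x7f]` covers. It uses Python's `re` module on byte strings purely as an illustration; it is not Erlang's `re` module, and the helper name `is_in_recommended_range` is hypothetical. Note that the range starts at 0x01, so the NUL byte is excluded even though it lies within 0-127.

```python
import re

# The explicit byte range recommended in the documentation change above.
# Compiled as a bytes pattern so each byte is compared by value.
RECOMMENDED_RANGE = re.compile(rb"[\x01-\x7f]+")


def is_in_recommended_range(data: bytes) -> bool:
    """Return True if every byte of `data` lies in 0x01..0x7F (hypothetical helper)."""
    return RECOMMENDED_RANGE.fullmatch(data) is not None


if __name__ == "__main__":
    print(is_in_recommended_range(b"hello"))                # plain ASCII text
    print(is_in_recommended_range("ö".encode("latin-1")))   # byte 0xF6, outside the range
    print(is_in_recommended_range(b"\x00"))                 # NUL, below the range
```

Unlike a table-driven class, an explicit range does not depend on how the regex library's character tables were built.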

If it's not too much trouble to explain further ... for my own knowledge is this related to locale and the underlying regex library or something else?

@ferd Feb 25, 2021

It's a bit of both. The underlying PCRE library ties the ascii range to the 'printable' character range, which is only incidentally ASCII if you're using the C locale. For historical reasons, Erlang sets the default locale to Latin-1, which extends the printable range and accidentally extends the ascii range at the same time.

So it's a change in defaults to match Erlang semantics that triggers a PCRE implementation detail in unclearly documented ways (the C-locale dependence isn't obvious).

The regex library we use can work either in locale-specific mode,
or unicode mode. The locale-specific mode uses a pregenerated
table to tell which characters are printable, numeric, and so on.

For historical reasons, OTP has always used Latin-1 for this table,
so characters like `ö` are considered to be letters. This is fine,
but the library has two quirks that don't play well with each
other:

* The locale-specific table is always consulted for code points
  below 256 regardless of whether we're in unicode mode or not,
  and the `ucp` option only affects code points whose entries in
  this table are undefined (zeroed).
* The character class `[:ascii:]` matches characters that are
  defined in the above table.

This is fine when the regex library is built with its default ASCII
table: `[:ascii:]` only matches ASCII characters (by definition)
and the library documentation states that `ucp` is required to
match characters beyond that with `\w` and friends.

Unfortunately, we build the library with the Latin-1 table so
`[:ascii:]` matches Latin-1 characters instead, and we can't change
the table since we've documented that `\w` etc work fine with
Latin-1 characters, only requiring `ucp` for characters beyond
that.
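
The interaction described above can be modelled with a toy lookup table. This is a minimal Python sketch under stated assumptions, not PCRE's actual code or data layout: the flag constants, `build_table`, `matches_ascii_class`, and `matches_explicit_range` are hypothetical names, and the rule "a code point is in `[:ascii:]` if its table entry is non-zero" is taken from the description in this message.

```python
import unicodedata

# Hypothetical per-entry flags; a zero entry means "undefined in this table".
LETTER, DIGIT, PRINT, CNTRL = 1, 2, 4, 8


def build_table(encoding):
    """Build a 256-entry toy table by classifying each byte decoded with `encoding`."""
    table = [0] * 256
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is not defined in this encoding: entry stays zeroed
        flags = 0
        if char.isalpha():
            flags |= LETTER
        if char.isdigit():
            flags |= DIGIT
        if char.isprintable():
            flags |= PRINT
        if unicodedata.category(char) == "Cc":
            flags |= CNTRL
        table[byte] = flags
    return table


def matches_ascii_class(table, code_point):
    """Model of the quirk: [:ascii:] matches whatever the table defines."""
    return code_point < 256 and table[code_point] != 0


def matches_explicit_range(code_point):
    """Model of the recommended workaround, [\\x01-\\x7f]."""
    return 0x01 <= code_point <= 0x7F


ASCII_TABLE = build_table("ascii")      # the library's default table, in this model
LATIN1_TABLE = build_table("latin-1")   # the table OTP builds with, in this model

if __name__ == "__main__":
    o_umlaut = 0xF6  # 'ö' in Latin-1
    print(matches_ascii_class(ASCII_TABLE, o_umlaut))   # False
    print(matches_ascii_class(LATIN1_TABLE, o_umlaut))  # True
    print(matches_explicit_range(o_umlaut))             # False
```

In this model the class and the explicit range agree for `ö` only when the table is ASCII, which is why the workaround spells out the range instead of relying on the table.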

At this point you might be thinking that this is a bug in how the
regex library handles `[:ascii:]`. Well, yes, POSIX says it should
match all code points between 0-127, but that's misleading since
it's only true for strict supersets of ASCII: should `[:ascii:]`
match 0x5C if the table is Shift-JIS? It would be just as wrong as
matching `ö`. :-(

Why not try to do the right thing and mark ASCII-compatibility for
each code point, since (for instance) 0x41 is `A` both in ASCII and
Shift-JIS? There's no way to ask a locale whether a code point
refers to the same character in ASCII, so the users would need to
manually go through the tables after generating them. Happy fun
times.

I've settled for documenting this mess since we can't fix this
on our end without breaking people's code, and there's not much
point in reporting this upstream since any fix would either be
misleading or far too much work for the user, and PCRE-8.x is
nearing the very end of its life.
@jhogberg force-pushed the john/stdlib/document-ascii-class-snafu/GH-4544/OTP-17222 branch from 323bdf3 to 092584b February 26, 2021 07:56
@jhogberg added the fix label Feb 26, 2021
@jhogberg merged commit 1f365da into erlang:master Feb 26, 2021
josevalim pushed a commit to elixir-lang/elixir that referenced this pull request Feb 27, 2021

Successfully merging this pull request may close these issues.

Regular expressions with POSIX character class [:ascii:] matches code points > 127