-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revamp sub/3 to resolve most issues with gsub (and sub with "g") #2641
Conversation
… uniq(stream) The primary purpose of this commit (which supercedes PR jqlang#2624) is to rectify most problems with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425 ('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")). This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no longer loops infinitely; however, because the new `gsub` depends critically on match(_;"g"), the behavior when regex == "" is sometimes non-standard. [*1] Since the new sub/3 relies on uniq/1, that has been added as well [*2]. The documentation has been updated to reflect the fact that `sub` and `gsub` are intended to be regular in the second argument. [*3] Also, _nwise/1 has been tweaked to take advantage of TCO. Footnotes: [*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than "aaa" as would be standard. This is nevertheless better than the infinite loop behavior of jq 1.6 in this case. With one exception (as explained in [*2]), the new gsub is implemented as though match/2 behavior is correct. That is, bugs in `gsub` behavior will most likely have their origin in `match/2`. [*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case: gsub("(?=u)"; "u") "qux" "quux" Without this functionality: Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100 *** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u") The root of the problem here is `match`: if `match` is fixed, then gsub would not need `untie`. The addition of `uniq` as a top-level function should be a non-issue relative to general concern about builtins.jq bloat: the line count of the new builtin.jq is significantly reduced overall, and the number of defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110). [*3] See e.g. jqlang#513 (comment)
@itchyny - Please note that the failing checks almost certainly have nothing to do with the changes in this PR. Please also note that in the case of "no match" examples such as |
@itchyny - as expected, the failing tests had nothing to do with this PR. Since “all checks have passed”, it would be a good time to pull, before merging becomes very complex, wouldn’t it? |
docs/content/manual/manual.yml
Outdated
named captures. The named captures are, in effect, presented | ||
as a JSON object (as constructed by `capture`) to | ||
`tostring`, so a reference to a captured variable named "x" | ||
would take the form: `"\(.x)"`. | ||
|
||
example: | ||
- program: 'sub("^[^a-z]*(?<x>[a-z]*).*")' | ||
input: '"123abc456"' | ||
output: '"ZabcZabc"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is invalid and somehow not included in man.test. There is no sub/1
and output
should always be an array. Since you added examples, would you rebuild man.test to check if the CI passes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per notes elsewhere, this example is very old, and was always wrong. It has been replaced.
docs/content/manual/manual.yml
Outdated
@@ -1428,6 +1428,30 @@ sections: | |||
input: '[{"foo":1, "bar":14}, {"foo":2, "bar":3}]' | |||
output: ['{"foo":2, "bar":3}'] | |||
|
|||
- title: "`uniq(stream)`" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still disagree with this naming, I'm very worried about this confuses the users. Consistency within the builtin functions is much important than external command name conventions. We need to remember that uniq
is a stream oriented and unique
is an array oriented. There're already filters having both array and stream versions like first
so that's good.
body: | | ||
|
||
`gsub` is like `sub` but all the non-overlapping occurrences of the regex are | ||
replaced by the string, after interpolation. | ||
replaced by `tostring`, after interpolation. If the second argument is a stream | ||
of jq strings, then `gsub` will produce a corresponding stream of JSON strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding an example will help users understand this sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example added in 5373191
@@ -75,6 +75,45 @@ gsub( "(.*)"; ""; "x") | |||
"" | |||
"" | |||
|
|||
gsub( ""; "a"; "g") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick but inconsistent spacing.
src/builtin.jq
Outdated
# in mind that s could be a stream | ||
def sub($re; s; $flags): | ||
. as $in | ||
| (reduce uniq(match($re; $flags)) as $edit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following patch isn't enough to fix the duplicate bug?
diff --git a/src/builtin.c b/src/builtin.c
index 9b2d9a2..03ab4d5 100644
--- a/src/builtin.c
+++ b/src/builtin.c
@@ -930,7 +930,7 @@ static jv f_match(jq_state *jq, jv input, jv regex, jv modifiers, jv testmode) {
match = jv_object_set(match, jv_string("string"), jv_string(""));
match = jv_object_set(match, jv_string("captures"), jv_array());
result = jv_array_append(result, match);
- start += 1;
+ start = (const UChar*)(input_string+region->end[0]+1);
continue;
}
$ ./jq -nc '"qqqqqqux" | match("(?=u)"; "g")'
{"offset":6,"length":0,"string":"","captures":[]}
def _nwise(a; $n): if a|length <= $n then a else a[0:$n] , _nwise(a[$n:]; $n) end; | ||
def _nwise($n): _nwise(.; $n); | ||
def _nwise($n): | ||
def n: if length <= $n then . else .[0:$n] , (.[$n:] | n) end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use omitting else .
by flipping the condition.
def _nwise($n): def _nwise: if length > $n then .[:$n], (.[$n:] | _nwise) end; _nwise;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being able to omit else
is not in jq 1.6, so my reason for sticking with the 1.6 syntax is that it should be easy for someone familiar with 1.6 to understand code that might well be "copied and pasted" into a jq program, or indeed a jaq program.
src/builtin.jq
Outdated
| [reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair | ||
({}; . + $pair) | s ] as $inserts | ||
| reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix]) | ||
| .previous = ($edit | .offset + .length ) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent tab indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. This seems to me to be a case of apples and oranges. The first reduce
occurs in the form [reduce ... ] as $_
. Anyway, in my view, it's fine to put a reduce
(or an if..then..else..end
) on one line if its fits. But feel free to tweak :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed you're mixing space and tab indentations...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. Two tabs in builtin.jq removed in pkoppstein@3015ec8
| reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix]) | ||
| .previous = ($edit | .offset + .length ) ) | ||
| .result[] + $in[.previous:] ) | ||
// $in; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing // $in
passes the tests, but this is necessary. Would you add a test that doesn't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
↑ I thought this but this is actually unnecessary!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see detailed notes elsewhere regarding the three variants of this line. Assuming you want gsub fixed in the next release, I'd suggest merging and then tweaking, otherwise this "branch" will fall too far behind and be quite difficult to manage later as there are so many files involved, two or three of which are the subject of many commits.
builtin.c: f_match: Zero-width match : ensure '"qux" | match("(?=u)"; "g")' matches just once rm uniq/1 as it is no longer needed In manual.yml, replace nonsensical sub/1 example
Correct one gsub example, and add another
Remove two tab characters
src/builtin.c
Outdated
@@ -930,7 +930,8 @@ static jv f_match(jq_state *jq, jv input, jv regex, jv modifiers, jv testmode) { | |||
match = jv_object_set(match, jv_string("string"), jv_string("")); | |||
match = jv_object_set(match, jv_string("captures"), jv_array()); | |||
result = jv_array_append(result, match); | |||
start += 1; | |||
// ensure '"qux" | match("(?=u)"; "g")' matches just once | |||
start = (const UChar*)(input_string+region->end[0]+1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is tab indentation here but we prefer spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (in 832734b); also fixed the editor setting.
untabify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@itchyny wrote:
Thanks. I manually forced the restarting of the checking process, and now there's a sign "All checks have passed" again. But github also tells me:
|
@pkoppstein Can you kick the ci again (by rebasing again?) to make sure you can merge? |
@itchyny - As expected, my git/github skills are not up to the task. In fact, I may have made things worse. Using GitHub Desktop, I clicked on "Update from upstream/master" However, GitHub tells me now that "This branch [gsub] is not behind the upstream jqlang:master". |
@pkoppstein That's good. However , if you want to drop the merge commit, you can use |
@itchyny wrote:
Apparently, good enough for you to merge into jqlang: I'll try to avoid messing things up in the meantime. If something needs adjusting, please feel free to make changes: "Maintainers are allowed to edit this pull request." |
If you don't mind, I'll merge this PR. Would you rephrase the PR description (which contains the outdated descriptions about uniq)? Note that |
@itchyny wrote:
Since I don’t have the permissions to do so, that would be great. I think it would be best if you made any adjustments when doing so. I will undoubtedly just make a mess of things. Tx. |
I think there is a bug in this implementation: If you use the "1" | gsub("1*"; "2")
# "22" EDIT: actually, that is a bug with |
Interestingly, 7 of the 8 regex engines at regex101.com indicate that "1" ~ /1*/ twice, evidently because Perhaps the the moral of the story is that users should avoid using "g" in cases like this. |
The primary purpose of this commit (which supercedes PR #2624) is to rectify most problems with
gsub
(and alsosub
with the "g" option), in particular fix #1425 ('\b'), fix #2354 (lookahead), and fix #2532 (regex == "^(?!cd ).*$|^cd ";"")).This commit also partly resolves #2148 and also resolves #1206 in that
gsub
no longer loops infinitely; however, because the newgsub
depends critically on match(_;"g"), the behavior when regex == "" is sometimes non-standard. [*1]Since the new sub/3 relies on uniq/1, that has been added as well [*2].
The documentation has been updated to reflect the fact that
sub
andgsub
are intended to be regular in the second argument. [*3]Also, _nwise/1 has been tweaked to take advantage of TCO.
Footnotes:
[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than "aaa" as would be standard. This is nevertheless better than the infinite loop behavior of jq 1.6 in this case.
With one exception (as explained in [*2]), the new gsub is implemented as though match/2 behavior is correct. That is, bugs in
gsub
behavior will most likely have their origin inmatch/2
.[*2]
uniq/1
adopts the Unix/Linux name and semantics; it is needed for the following test case:gsub("(?=u)"; "u")
"qux"
"quux"
Without this functionality:
Test #23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")
The root of the problem here is
match
: ifmatch
is fixed, then gsub would not needuntie
.The addition of
uniq
as a top-level function should be a non-issue relative to general concern about builtins.jq bloat: the line count of the new builtin.jq is significantly reduced overall, and the number of defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).[*3] See e.g. #513 (comment)