Feature request: support for streaming input delimited by null characters #2659

svdb0 · 2023-07-05T19:12:58Z

Historically, many Linux command line tools took their input — in particular file names — as newline terminated strings, or produced newline terminated output strings.
When an input or output entry includes an internal newline character, such an entry can be interpreted as multiple entries.
This will cause the tool processing these entries to misbehave, and can pose a security risk.
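
The ambiguity is easy to reproduce. The following is a minimal sketch, assuming a find that supports -print0 (GNU and BSD find do); the scratch directory and the file name are made up for illustration.

```shell
# Create a scratch directory holding ONE file whose name contains a newline.
tmp=$(mktemp -d)
touch "$tmp/a
b"

# Newline-terminated listing: counting lines sees two entries.
newline_count=$(( $(find "$tmp" -type f | wc -l) ))

# NUL-terminated listing: counting terminators sees one entry.
nul_count=$(( $(find "$tmp" -type f -print0 | tr -cd '\0' | wc -c) ))

echo "newline-delimited entries: $newline_count"
echo "NUL-delimited entries:     $nul_count"

rm -rf "$tmp"
```

Any consumer that splits the first listing on newlines would treat the single file as two separate paths.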

Increasingly, many Linux command line tools now have the ability to use null bytes instead of newlines as terminator in input and output.
Examples of such tools are Bash (read, mapfile), grep, cut, head, tail, sort, uniq, sed, find, xargs, env, wc, du, stat, file, id, tar, rsync, and sha256sum.
In addition, a few pseudo-files in /proc/, i.e. /proc/<pid>/cmdline and /proc/<pid>/environ, use null bytes as terminators.

It would be nice if jq too could be used in shell pipelines where null characters are used as terminators.
Issue #1990 addressed this for output generated by jq.
As far as I have been able to find, there is no similar functionality yet for reading null terminated data as input to jq.

This feature request is for the addition of a variant of --raw-input which uses null bytes as terminators instead of newline characters.
For the purpose of this feature request, I will refer to it as --raw-input0.

A common use case for --raw-input0 would be to securely read a file listing into jq as input:

find / -print0 | jq --raw-input0

In some cases, the following construct could be used instead of --raw-input0:

jq --raw-input --slurp 'split("\u0000")[]'

However, this does not stream its input, and hence would often be less suitable in combination with long-running commands like the find / -print0 statement above.
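
Until such an option exists, one streaming workaround is to split on NUL in the shell and hand each entry to jq as a string argument. This is only a sketch: it assumes Bash (for read -d '') and jq on the PATH, it starts one jq process per entry and is therefore slow, and the function name nul_to_json is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read NUL-terminated strings from stdin and print
# each one as a JSON string, one per line, without waiting for EOF.
nul_to_json() {
  local entry
  # IFS= and -r keep whitespace and backslashes intact; -d '' splits on NUL.
  while IFS= read -r -d '' entry; do
    # --arg passes the entry to jq as a string; -n tells jq not to read
    # input, -c prints compact output. stdin is redirected so jq can never
    # consume the stream that the loop is reading.
    jq -nc --arg s "$entry" '$s' < /dev/null
  done
}

# Usage: two entries, the second containing an internal newline.
# printf '%s\0' 'one' $'two\nlines' | nul_to_json
```

Like jq itself, this still cannot represent entries that are not valid UTF-8.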

emanuele6 · 2023-07-05T19:26:10Z

This would not be very useful since jq only supports valid JSON strings, that is, UTF-8 encoded Unicode strings. UNIX paths can contain any byte except NUL, including sequences that are not valid UTF-8.

A common use case for --raw-input0 would be to securely read a file listing into jq as input:
find / -print0 | jq --raw-input0

It does not work and it is not "secure".

bash-5.1$ touch $'\x96\xdd' foo bar
bash-5.1$ ls
''$'\226\335'   bar   foo
bash-5.1$ find . | jq -rR
.
./foo
./��
./bar
bash-5.1$ find . | jq -rR | sed -n l
.$
./foo$
./\357\277\275\357\277\275$
./bar$

Adding support for something like this will be useful only when jq supports byte strings, which, if ever, I assume will only happen after jq 1.7.

svdb0 · 2023-07-05T21:31:47Z

I am aware of the fact that UNIX file names can contain invalid UTF-8 sequences.

And while I would like it if jq could work with any byte string, I'd argue that jq erroring out on a string it can't deal with is vastly preferable to it interpreting one input as two. In many cases it is perfectly acceptable to be unable to process all file names, while silently confusing file names, with all the security implications that brings, is not.

I was however not aware that jq silently replaces byte sequences which do not represent valid UTF-8 characters.
This is in fact the sort of surprise I wanted to avoid.
I will submit a new feature request to allow jq to produce an error (catchable as with the error function) when an invalid byte sequence is encountered. I would argue that it should be on by default. [Edit: Created ticket #2660]
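
Until jq can report such input itself, invalid byte sequences can be rejected before they reach jq. This is a minimal sketch, assuming an iconv binary is available; the function name is_valid_utf8 is hypothetical, and this is a pre-filter in the shell, not jq behaviour.

```shell
# Hypothetical helper: succeed only if stdin is valid UTF-8.
# iconv exits non-zero when it meets a sequence it cannot convert.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The two bytes from the file name shown earlier (octal 226 335)
# do not form valid UTF-8.
if printf './\226\335' | is_valid_utf8; then
  echo "valid"
else
  echo "invalid UTF-8, refusing to pass to jq"
fi
```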

The occasion for wanting this feature, however, was not because I expected to be dealing with untrusted or malformed input — in fact, I would be very surprised if my particular input could ever have newlines.
Had this feature existed, I would have used it as a defensive 'best practice' measure, in the same spirit in which I would write "$var" in Bourne shell script within quotes, even when I do not foresee how $var could ever contain anything else than digits.

Also, please look beyond my example of processing the output of find, or even filenames at all.
Shell script is not the ideal way to deal with complex data structures, and if you want to pass your data — which might be valid UTF-8 while containing newlines — to a different piece of software, then it would be very reasonable to use jq to turn it into JSON.
Using null bytes as terminators is an easy way to pass such a stream of data to jq.
And whether you're just forwarding data from some standard tool, or from something you wrote yourself, or whether you generate it in your script using printf '%s\0' statements, --raw-input0 would be your friend.

Regardless of what other ways there might be to solve each instance of this problem, it would still be a useful tool in the toolbox.

nicowilliams · 2023-07-06T16:30:12Z

I agree that we need this, and we can have this independently of adding support for a binary type.

wader · 2023-07-06T17:27:19Z

So when zero-byte-separated input is enabled, there would be two modes, with and without -R raw input? Without -R, would a zero byte behave the same as whitespace does now between JSON values?

Example if treated as whitespace:

$ echo -ne '"a"\x00\x00\n "b"' | jq --raw-input0
"a"
"b"

Example if split only on zero byte:

$ echo -ne '"a"\x00\x00\n "b"' | jq --raw-input0
"a"
"" # or error?
"b" # skip whitespace around JSON?

Example with -R:

$ echo -ne 'a\x00\x00\n b' | jq -R --raw-input0
"a"
""
"\n b"

So --raw-input0 combined with -R would disable newline separation?

svdb0 · 2023-07-06T17:53:13Z

What I had in mind is that you'd use --raw-input0 instead of --raw-input/-R, never both at once.
--raw-input0 would work exactly like --raw-input, except that when using --raw-input0 the null character fulfills the role which the newline character does in --raw-input.

You would be using --raw-input0 with strings as input — or byte sequences if that is ever implemented — not with JSON values.

If your input consists of JSON values, then there would be no need to delimit them with null bytes; normal whitespace would work just fine.
Characters like newlines that you want to include in your input strings would be within the double quotes of a JSON string, escaped as \n.
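
As a small illustration of that point, here is a sketch (assuming jq is on the PATH; the sample values and the variable name json_values are made up) in which several JSON values are separated by a space, a newline, or nothing at all, and one string carries a newline as the escape \n:

```shell
# Four JSON texts: a space, a newline, and no separator at all between them.
# Inside the single quotes, \n is a literal backslash followed by n, which is
# the JSON escape for a newline within the second string.
json_values=$(printf '%s' '"a" "b\nc"
[1,2]{"k":"v"}' | jq -c .)

# jq prints each value it parsed on its own line.
printf '%s\n' "$json_values"
```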

nicowilliams · 2023-07-06T18:03:58Z

So when zero-byte-separated input is enabled, there would be two modes, with and without -R raw input? Without -R, would a zero byte behave the same as whitespace does now between JSON values?

Is there a need to NUL-terminate JSON texts? Things like find .. -print0 are always about unstructured text that might contain newlines, but JSON texts are self-terminating, so we don't ever need a NUL terminator to disambiguate them.

Are there tools that can produce NUL-terminated JSON texts? If not, then I think the answer to your question is that NUL-terminated input processing has to imply or require -R.

wader · 2023-07-06T18:04:15Z

@svdb0 Ok that makes sense, thanks for clarifying

itchyny · 2023-07-10T03:10:45Z

This feature request makes me concerned that the -0 short option may confuse users about which stream it applies to.

nicowilliams · 2023-07-10T03:23:08Z

This feature request makes me concerned that the -0 short option may confuse users about which stream it applies to.

Yes! See #2683.

I think we want to a) remove -0/--nul-output, b) add -0 meaning "NUL-delimited inputs".

itchyny · 2023-07-10T03:26:03Z

c) keep --nul-output but remove -0, with no short flag for input either.

nicowilliams · 2023-07-10T03:52:43Z

c) keep --nul-output but remove -0, with no short flag for input either.

Yes, that is also an option. Probably the least disruptive one now.

nicowilliams · 2023-07-10T05:06:14Z

d) no -0, only --nul-output and --raw-input0.

svdb0 · 2023-07-10T12:50:48Z

I used the option name --raw-input0 as a placeholder, but to me it seemed like a good name, as it is easy to remember — meaning 'as --raw-input but null-terminated' — and also intuitive if you're familiar with similarly named parameters in some other tools (find -print0, rsync --from0, xz --files0, du --files0-from, wc --files0-from).

Following the same reasoning, and also for symmetry and the resulting intuitiveness, I would suggest that the null delimited version of --raw-output be named --raw-output0 instead of --nul-output.

One more reason for getting rid of --nul-output, is that there is also a --null-input (-n), with a completely different meaning. This can only be confusing.

emanuele6 added the feature request label Jul 5, 2023

nicowilliams mentioned this issue Jul 10, 2023

Feature request: NUL-delimited output #1271

Closed

svdb0 mentioned this issue Jul 10, 2023

Consider removing -0/--nul-output before releasing 1.7 #2683

Closed

emanuele6 mentioned this issue Sep 8, 2023

Zero byte separator to mimic find -print0, pipe to xargs? #1658

Closed
