Feature request: support for streaming input delimited by null characters #2659

Open
svdb0 opened this issue Jul 5, 2023 · 13 comments

@svdb0

svdb0 commented Jul 5, 2023

Historically, many Linux command line tools took their input — in particular file names — as newline terminated strings, or produced newline terminated output strings.
When an input or output entry includes an internal newline character, such an entry can be interpreted as multiple entries.
This will cause the tool processing these entries to misbehave, and can pose a security risk.
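
A minimal sketch of the ambiguity, assuming Bash and a `find` that supports `-print0`; the directory and file names are made up for illustration. A single file whose name contains a newline is counted as two entries in a newline-terminated listing, but as one in a NUL-terminated listing:

```shell
#!/usr/bin/env bash
# Create a scratch directory holding exactly one file.
dir=$(mktemp -d)
touch "$dir/"$'first\nsecond'   # one file; its name contains a newline

# Newline-terminated listing: counting lines reports two entries.
newline_count=$(find "$dir" -type f | wc -l)

# NUL-terminated listing: counting terminators reports one entry.
nul_count=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c)

echo "newline-delimited entries: $((newline_count))"   # prints 2
echo "NUL-delimited entries:     $((nul_count))"       # prints 1
rm -rf "$dir"
```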

Increasingly, many Linux command line tools now have the ability to use null bytes instead of newlines as terminator in input and output.
Examples of such tools are Bash (read, mapfile), grep, cut, head, tail, sort, uniq, sed, find, xargs, env, wc, du, stat, file, id, tar, rsync, and sha256sum.
In addition, a few pseudo-files in /proc/, i.e. /proc/<pid>/cmdline and /proc/<pid>/environ, use null bytes as terminators.

It would be nice if jq too could be used in shell pipelines where null characters are used as terminators.
Issue #1990 addressed this for output generated by jq.
As far as I have been able to find, there is no similar functionality yet for reading null terminated data as input to jq.

This feature request is for the addition of a variant of --raw-input which uses null bytes as terminators instead of newline characters.
For the purpose of this feature request, I will refer to it as --raw-input0.

A common use case for --raw-input0 would be to securely read a file listing into jq as input:

find / -print0 | jq --raw-input0

In some cases, the following construct could be used instead of --raw-input0:

jq --raw-input --slurp 'split("\u0000")[]'

However, this does not stream its input, and hence would often be less suitable in combination with long-running commands like the find / -print0 statement above.

@emanuele6
Member

emanuele6 commented Jul 5, 2023

This would not be very useful since jq only supports valid JSON strings, i.e. UTF-8 encoded Unicode strings. UNIX paths can contain any byte except NUL, including invalid UTF-8 sequences.

A common use case for --raw-input0 would be to securely read a file listing into jq as input:

find / -print0 | jq --raw-input0

If you are using find . ! -name $'*\n*' | jq -nR '[ inputs | select(endswith(".png")) ]' or find . -print0 | jq -sR '(. / "\u0000")[] | select(indices("/") | length == 4)' or similar to filter your UNIX paths with jq, stop doing that.

It does not work and it is not "secure".

bash-5.1$ touch $'\x96\xdd' foo bar
bash-5.1$ ls
''$'\226\335'   bar   foo
bash-5.1$ find . | jq -rR
.
./foo
./��
./bar
bash-5.1$ find . | jq -rR | sed -n l
.$
./foo$
./\357\277\275\357\277\275$
./bar$

Adding support for something like this will be useful only when jq supports byte strings, which, if ever, I assume will only happen after jq 1.7.

@svdb0
Author

svdb0 commented Jul 5, 2023

I am aware of the fact that UNIX file names can contain invalid UTF-8 sequences.

And while I would like jq to work with any byte string, I'd argue that jq erroring out on a string it can't handle is vastly preferable to it interpreting one input as two. In many cases it is perfectly acceptable to be unable to process every file name, while silently confusing file names, with all the security implications that entails, is not.

I was however not aware that jq silently replaces byte sequences which do not represent valid UTF-8 characters.
This is in fact the sort of surprise I wanted to avoid.
I will submit a new feature request to allow jq to produce an error (catchable as with the error function) when an invalid byte sequence is encountered. I would argue that it should be on by default. [Edit: Created ticket #2660]
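
Until jq itself can report such input, the "fail loudly" behaviour can be approximated outside jq. The following is only a rough sketch: it assumes Bash and an `iconv` that exits with a non-zero status on malformed UTF-8, and `is_valid_utf8` and `filter_valid_utf8` are hypothetical helper names. It spawns one `iconv` process per record, so it is slow on large listings.

```shell
# Hypothetical helper: succeeds only if its argument is valid UTF-8.
# Assumption: iconv returns a non-zero status for malformed input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Read NUL-terminated records on stdin; pass valid ones through unchanged
# (still NUL-terminated) and report the rest on stderr.
filter_valid_utf8() {
  local rec
  while IFS= read -r -d '' rec; do
    if is_valid_utf8 "$rec"; then
      printf '%s\0' "$rec"
    else
      printf 'skipping record with invalid UTF-8: %q\n' "$rec" >&2
    fi
  done
}
```

For example, `find . -print0 | filter_valid_utf8` could be placed in front of the `--raw-input --slurp` construct shown earlier in this thread.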

The occasion for wanting this feature, however, was not because I expected to be dealing with untrusted or malformed input — in fact, I would be very surprised if my particular input could ever have newlines.
Had this feature existed, I would have used it as a defensive 'best practice' measure, in the same spirit in which I write "$var" within quotes in Bourne shell scripts, even when I do not foresee how $var could ever contain anything other than digits.

Also, please look beyond my example of processing the output of find, or even filenames at all.
Shell script is not the ideal way to deal with complex data structures, and if you want to pass your data — which might be valid UTF-8 while containing newlines — to a different piece of software, then it would be very reasonable to use jq to turn it into JSON.
Using null bytes as terminators is an easy way to pass such a stream of data to jq.
And whether you're just forwarding data from some standard tool or from something you wrote yourself, or generating it in your script using printf '%s\0' statements, --raw-input0 would be your friend.

Regardless of what other ways there might be to solve each instance of this problem, it would still be a useful tool in the toolbox.

@nicowilliams
Contributor

I agree that we need this, and we can have this independently of adding support for a binary type.

@wader
Member

wader commented Jul 6, 2023

So when zero-byte-separated input is enabled, would there be two modes, with and without -R raw input? Without -R, would a zero byte behave the same as whitespace does now between JSON values?

Example if treated as whitespace:

$ echo -ne '"a"\x00\x00\n "b"' | jq --raw-input0
"a"
"b"

Example if split only on zero byte:

$ echo -ne '"a"\x00\x00\n "b"' | jq --raw-input0
"a"
"" # or error?
"b" # skip whitespace around JSON?

Example with -R:

$ echo -ne 'a\x00\x00\n b' | jq -R --raw-input0
"a"
""
"\n b"

So --raw-input0 combined with -R would disable newline separation?

@svdb0
Author

svdb0 commented Jul 6, 2023

What I had in mind is that you'd use --raw-input0 instead of --raw-input/-R, never both at once.
--raw-input0 would work exactly like --raw-input, except that when using --raw-input0 the null character fulfills the role which the newline character does in --raw-input.

You would be using --raw-input0 with strings as input — or byte sequences if that is ever implemented — not with JSON values.

If your input consists of JSON values, then there would be no need to delimit them with null bytes; normal whitespace would work just fine.
Characters like newlines that you want to include in your input strings would be within the double quotes of a JSON string, escaped as \n.
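
The record splitting described here can be emulated in plain Bash, which may help pin down the intended semantics. This is a sketch of the proposal, not of any existing jq behaviour; `--raw-input0` remains hypothetical. It feeds the same bytes as the `-R` example earlier in the thread through a `read -d ''` loop, treating NUL as the only terminator and newlines as ordinary data. The `|| [[ -n $rec ]]` part keeps a final record that lacks a terminator, mirroring how `--raw-input` keeps a last line without a trailing newline.

```shell
# Emulates the record splitting proposed for the hypothetical --raw-input0:
# NUL ends a record; newlines are ordinary data.
records=()
while IFS= read -r -d '' rec || [[ -n $rec ]]; do
  records+=("$rec")
done < <(printf 'a\0\0\n b')

printf 'record count: %d\n' "${#records[@]}"   # prints: record count: 3
printf '%q\n' "${records[@]}"                  # shell-quoted view of each record
```

The three records are `a`, the empty string, and a newline followed by ` b`, which matches the output suggested for the `-R` case.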

@nicowilliams
Contributor

So when zero-byte-separated input is enabled, would there be two modes, with and without -R raw input? Without -R, would a zero byte behave the same as whitespace does now between JSON values?

Is there a need to NUL-terminate JSON texts? Things like find .. -print0 are always about unstructured text that might contain newlines, but JSON texts are self-terminating, so we don't ever need a NUL terminator to disambiguate them.

Are there tools that can produce NUL-terminated JSON texts? If not, then I think the answer to your question is that NUL-terminated input processing has to imply or require -R.
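
A quick illustration of the self-terminating point, using made-up sample values: jq parses objects, arrays and strings written back to back with no separator at all. Bare numbers and literals such as `true` are the exception and still need whitespace between them.

```shell
# Four JSON texts written back to back, with no separator between them.
count=$(printf '%s' '{"a":1}{"b":2}[3]"x"' | jq -c '.' | wc -l)
echo "values parsed: $((count))"   # prints: values parsed: 4
```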

@wader
Member

wader commented Jul 6, 2023

@svdb0 Ok that makes sense, thanks for clarifying

@itchyny
Contributor

itchyny commented Jul 10, 2023

This feature request makes me concerned that the -0 short option may confuse users about which stream it applies to.

@nicowilliams
Contributor

This feature request makes me concerned that the -0 short option may confuse users about which stream it applies to.

Yes! See #2683.

I think we want to a) remove -0/--nul-output, b) add -0 meaning "NUL-delimited inputs".

@itchyny
Contributor

itchyny commented Jul 10, 2023

c) keep --nul-output but remove -0, with no short flag for input either.

@nicowilliams
Contributor

c) keep --nul-output but remove -0, with no short flag for input either.

Yes, that is also an option. Probably the least disruptive one now.

@nicowilliams
Contributor

d) no -0, only --nul-output and --raw-input0.

@svdb0
Author

svdb0 commented Jul 10, 2023

I used the option name --raw-input0 as a placeholder, but to me it seemed like a good name, as it is easy to remember — meaning 'as --raw-input but null-terminated' — and also intuitive if you're familiar with similarly named parameters in some other tools (find -print0, rsync --from0, xz --files0, du --files0-from, wc --files0-from).

Following the same reasoning, and also for symmetry and the resulting intuitiveness, I would suggest that the null delimited version of --raw-output be named --raw-output0 instead of --nul-output.

One more reason for getting rid of --nul-output is that there is also a --null-input (-n) option with a completely different meaning. Having both can only be confusing.
