ER/RFC: Schema Inference #748
Comments
I must say, this is very nice! Here's a handy and simple schema inference program I use:
What's handy about this is that it outputs jq path expressions. It needs a bit of work (to deal with object key names that need quoting because they aren't ident-like). What should we call this? Should your and my schema utils be in 1.5, or in a module? I think this is almost a killer app for jq...
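The program referred to in this comment is not reproduced here, but the idea it describes can be sketched. The following Python snippet (all names are invented for illustration; the original is a jq program) walks a JSON document and emits a jq path expression for each leaf together with the type of the value found there:

```python
# Illustrative sketch only: emit jq-style path expressions for each leaf.
# This is NOT the program from the comment; it just shows the output shape
# (".key", "[]" for array elements) that makes such a tool handy.

def paths(value, prefix=""):
    """Yield (jq_path, type_name) pairs for every leaf in `value`."""
    if isinstance(value, dict):
        for k, v in value.items():
            # NOTE: assumes ident-like keys; non-ident keys would need the
            # quoted form .["key with spaces"], as the comment points out.
            yield from paths(v, prefix + "." + k)
    elif isinstance(value, list):
        for v in value:
            yield from paths(v, prefix + "[]")
    else:
        yield (prefix or ".", type(value).__name__)

doc = {"name": "jq", "tags": ["json", "cli"], "stars": 1}
for p, t in sorted(set(paths(doc))):
    print(p, t)
```

Deduplicating with `set` collapses all elements of an array onto a single `[]` path, which is what makes the output read like a schema rather than a dump.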
Using your program (with
Using the 'schema' def in schema.jq at
The following code generates a simple JSON Schema according to http://json-schema.org/latest/json-schema-validation.html:
For example, for this input:
the generated schema is:
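The jq code referred to in this comment is not reproduced here, but a minimal sketch of the same idea is easy to give. The following Python function (invented for illustration; the original is in jq) derives a bare-bones JSON Schema, using only the basic validation keywords `type`, `items`, `properties`, and `required`, from a single instance:

```python
# Minimal sketch in the spirit of the comment: infer a simple JSON Schema
# (basic validation keywords only) from one JSON instance.

def to_json_schema(v):
    if v is None:
        return {"type": "null"}
    if isinstance(v, bool):          # check bool before number: bool is an int in Python
        return {"type": "boolean"}
    if isinstance(v, (int, float)):
        return {"type": "number"}
    if isinstance(v, str):
        return {"type": "string"}
    if isinstance(v, list):
        items = [to_json_schema(x) for x in v]
        # naive choice: describe the array by its first element's schema
        return {"type": "array", "items": items[0] if items else {}}
    return {
        "type": "object",
        "properties": {k: to_json_schema(x) for k, x in v.items()},
        "required": sorted(v.keys()),
    }
```

A real generator would also merge the schemas of all array elements rather than taking the first, but this shows the shape of the output.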
@fadado Beautiful! Can I borrow that for jq?
This is partly an enhancement request, and partly a request for comments.
When confronted with a collection of JSON entities, it is often helpful to
know whether there is an implicit schema, and if so, what it is. Even in the
case of a single JSON document, it is often useful to have a structural
overview, e.g. for navigation.
Spark SQL can infer a schema from a collection of JSON
entities. It can be printed using printSchema().
The example given in the O'Reilly Spark book on p. 172 is the pair of records:
Using the proposed schema.jq below, we find:
This is equivalent to the schema inferred by Spark SQL, except that:
- this facilitates the use of the inferred schemas, e.g. for integrity
checking, while obviating the need for a special pretty-printer, since
the results produced by the jq pretty-printer are eminently readable;
- whereas the proposed schema inference engine regards nulls as placeholders
without any particular structural significance.
As illustrated by the above example, the absence of a key in an object also
has no particular structural significance for either the Spark SQL inference
engine or the one proposed here.
Three noteworthy features of the proposed schema inference engine are:
a) the introduction of "scalar" as an extended type, e.g. ["scalar"] is the extended type signifying an array of 0 or more elements of scalar type;
b) the introduction of "JSON" as an extended type, e.g. ["JSON"] is the extended type signifying an array of 0 or more elements of any type;
c) arrays are only characterized by the extended type of their elements.
Thus, the following JSON object conforms to the above-mentioned schema:
See also #243
schema.jq