ER/RFC: Schema Inference #748

pkoppstein · 2015-04-11T03:03:52Z

This is partly an enhancement request, and partly a request for comments.

When confronted with a collection of JSON entities, it is often helpful to
know whether there is an implicit schema, and if so, what it is. Even in the
case of a single JSON document, it is often useful to have a structural
overview, e.g. for navigation.

Spark SQL can infer a schema from a collection of JSON
entities. It can be printed using printSchema().

The example given in the O'Reilly Spark book on p. 172 is the pair of records:

{"name": "Holden"}

{"name": "Sparky the Bear", "lovesPandas": true, "knows": {"friends": ["holden"]} }

Using the proposed schema.jq below, we find:

$ jq -s -f schema.jq spark.json
{
  "knows": {
    "friends": [
      "string"
    ]
  },
  "lovesPandas": "boolean",
  "name": "string"
}

This is equivalent to the schema inferred by Spark SQL except that:

a) the schema inference engine proposed here always produces a JSON entity;
   this facilitates the use of the inferred schemas, e.g. for integrity
   checking, while obviating the need for a special pretty-printer, since
   the results produced by the jq pretty-printer are eminently readable;

b) the Spark SQL schema adds "nullable" and "containsNull" annotations,
   whereas the proposed schema inference engine regards nulls as placeholders
   without any particular structural significance.

As illustrated by the above example, the absence of a key in an object also
has no particular structural significance for either the Spark SQL inference
engine or the one proposed here.

Three noteworthy features of the proposed schema inference engine are:

a) the introduction of "scalar" as an extended type, e.g. ["scalar"] is the extended type signifying an array of 0 or more elements of scalar type;

b) the introduction of "JSON" as an extended type, e.g. ["JSON"] is the extended type signifying an array of 0 or more elements of any type;

c) arrays are characterized solely by the extended type of their elements.

Thus, the following JSON object conforms to the above-mentioned schema:

{"name": "Paddington", "lovesPandas": null, "knows": {"friends": ["holden", "Sparky"]}}
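
The conformance claim can be checked mechanically. What follows is a minimal Python sketch, not part of schema.jq: the name `conforms` is hypothetical, and the sketch assumes that null conforms to every extended type, that a missing key is acceptable, and that a key absent from the schema is rejected. That last rule is an assumption the description above does not settle.

```python
# Extended-type schema inferred from the two Spark records above.
SCHEMA = {
    "knows": {"friends": ["string"]},
    "lovesPandas": "boolean",
    "name": "string",
}

def conforms(value, schema):
    """Return True if a parsed JSON value fits an extended-type schema."""
    if value is None or schema == "JSON":
        return True  # nulls are placeholders; "JSON" admits anything
    if schema == "null":
        return False  # only null fits, and value is not null here
    if schema == "boolean":
        return isinstance(value, bool)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if schema == "string":
        return isinstance(value, str)
    if schema == "scalar":
        return isinstance(value, (bool, int, float, str))
    if isinstance(schema, list):
        # [T]: an array of zero or more elements that each fit T
        return isinstance(value, list) and all(conforms(v, schema[0]) for v in value)
    if isinstance(schema, dict):
        # Assumption: keys not mentioned in the schema do not conform.
        return isinstance(value, dict) and all(
            k in schema and conforms(v, schema[k]) for k, v in value.items()
        )
    return False

paddington = {
    "name": "Paddington",
    "lovesPandas": None,
    "knows": {"friends": ["holden", "Sparky"]},
}
print(conforms(paddington, SCHEMA))    # True
print(conforms({"name": 42}, SCHEMA))  # False
```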

schema.jq

# Version 0.1

# This module defines three filters:
#   typeof/0 returns the extended-type of its input;
#   typeUnion(a;b) returns the union of the two specified extended-type values;
#   schema/0 returns the typeUnion of the extended-type values of the entities in the input array, if the input is an array;
#   otherwise it simply returns the "typeof" value of its input.

# Each extended type can be thought of as a set of JSON entities,
# e.g. "number" for the set of JSON numbers, and ["number"] for the
# set of JSON number-valued arrays including [].

# The extended-type values are always JSON entities.
# The possible values are:
# "null", "boolean", "string", "number";
# "scalar" for any combination of non-null scalars;
# [T] where T is an extended type;
# an object all of whose values are extended types;
# "JSON" signifying that no other extended-type value is applicable.

# The extended-type values are defined recursively:
# The extended-type of a scalar value is its JSON type.
# The extended-type of a non-empty array of values all of which have the same JSON type, t, is [t], and similarly for ["scalar"], and ["JSON"].
# The extended-type of [] is ["null"], signifying that the type of the array elements is indeterminate.
# The extended-type of an object is an object with the same keys, but the values of which are the extended-types of the corresponding values.

# typeUnion(a;b) returns the least extended-type value that subsumes both a and b.
# For example:
#  typeUnion("number"; "string") yields "scalar";
#  typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"};
#  typeUnion("null"; t) yields t for any valid extended type, t.

def typeUnion(a;b):
  def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar";
  a as $a | b as $b
  | if $a == $b then $a
    elif ($a | scalarp) and ($b | scalarp) then "scalar"
    elif $a == "JSON" or $b == "JSON" then "JSON"
    elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ]
    elif ($a|type) == "object" and ($b|type) == "object" then
      ((($a|keys) + ($b|keys)) | unique) as $keys
      | reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) )
    elif $a == "null" or $a == null then $b
    elif $b == "null" or $b == null then $a
    else "JSON"
    end ;

def typeof:
  def typeofArray:
    if length == 0 then ["null"]
    else [reduce .[] as $item (null; typeUnion(.; $item|typeof))]
    end ;
  def typeofObject:
    reduce keys[] as $key ( . ; .[$key] = (.[$key] | typeof) ) ;

  . as $in
  | type
  | if . == "string" or . == "number" or . == "null" or . == "boolean" then .
    elif . == "object" then $in | typeofObject
    else $in | typeofArray
    end ;

def schema:
  if type == "array" then reduce .[] as $x ("null";  typeUnion(.; $x|typeof))
  else typeof
  end ;
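
For readers who want to experiment with the union rule outside jq, here is a rough Python transcription of the three filters. It is a sketch only: the names `type_union`, `type_of` and `schema` are mine, Python's `None` stands in for jq's `null`, and the branch order is copied from the jq definitions above rather than derived independently.

```python
SCALARS = {"boolean", "string", "number", "scalar"}

def type_union(a, b):
    """Least extended type that subsumes both a and b."""
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and a in SCALARS and b in SCALARS:
        return "scalar"
    if a == "JSON" or b == "JSON":
        return "JSON"
    if isinstance(a, list) and isinstance(b, list):
        return [type_union(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        # A key missing on one side contributes None, which the other side absorbs.
        return {k: type_union(a.get(k), b.get(k)) for k in sorted(set(a) | set(b))}
    if a == "null" or a is None:
        return b
    if b == "null" or b is None:
        return a
    return "JSON"

def type_of(value):
    """Extended type of a single parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before number: bool is an int in Python
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {k: type_of(v) for k, v in value.items()}
    if not value:
        return ["null"]  # empty array: element type is indeterminate
    element = None
    for item in value:
        element = type_union(element, type_of(item))
    return [element]

def schema(value):
    """Union over the elements of an array, else the type of the value itself."""
    if isinstance(value, list):
        result = "null"
        for item in value:
            result = type_union(result, type_of(item))
        return result
    return type_of(value)

records = [
    {"name": "Holden"},
    {"name": "Sparky the Bear", "lovesPandas": True, "knows": {"friends": ["holden"]}},
]
print(schema(records))
```

Applied to the two Spark records, this yields the same object as the jq run shown earlier, apart from key order in the printed dictionary.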


nicowilliams · 2015-06-29T17:24:39Z

I must say, this is very nice!

Here's a handy and simple schema inference program I use:

[path(..) | ["",(.[]|if type=="number" then "[]" else . end)]] |
 sort | unique | .[] | join(".") | sub("\\.\\[";"[") | sub("^\\[";".[")

What's handy about this is that it outputs jq path expressions. It needs a bit of work (to deal with object key names that need quoting because they aren't ident-like). What should we call this?
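
The idea behind the one-liner (collect every path, replace array indices with a wildcard, deduplicate) can be sketched in Python as follows. This is an illustration under my own naming (`path_shapes`, `render`, `outline` are hypothetical), not a port of the jq program: it writes an array step directly after the key, so it prints `.[].covers[]` where the program above prints `.[].covers.[]`, it prints `.` for the root, and like the original it does not quote keys that are not identifier-like.

```python
def path_shapes(value, prefix=()):
    """Yield the path to every node, with array indices collapsed to None."""
    yield prefix
    if isinstance(value, dict):
        for key, child in value.items():
            yield from path_shapes(child, prefix + (key,))
    elif isinstance(value, list):
        for child in value:
            # None marks "any index"; object keys are always strings,
            # so the marker cannot collide with a real key.
            yield from path_shapes(child, prefix + (None,))

def render(shape):
    """Render a path shape as a jq-style path expression."""
    text = "".join("[]" if part is None else "." + part for part in shape)
    if not text:
        return "."  # the root itself
    return "." + text if text.startswith("[") else text

def outline(value):
    """Sorted, de-duplicated list of path expressions found in value."""
    return sorted({render(shape) for shape in path_shapes(value)})

doc = [
    {"id": "x", "covers": ["TORSO"]},
    {"id": "y", "flags": []},
]
for line in outline(doc):
    print(line)
```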

Should your and my schema utils be in 1.5, or in a module? I think this is almost a killer app for jq...

EDIT: formatting.

pkoppstein · 2015-06-29T17:55:39Z

Using your program (with \ properly escaped) on armor.json at https://github.com/CleverRaven/Cataclysm-DDA/blob/master/data/json/items/armor.json:

.[]
.[].//
.[].ammo
.[].bashing
.[].bashing_protection
.[].category
.[].charges_per_use
.[].color
.[].coverage
.[].covers
.[].covers.[]
.[].cut
.[].cutting
.[].description
.[].encumbrance
.[].environmental_protection
.[].flags
.[].flags.[]
.[].id
.[].initial_charges
.[].material
.[].material.[]
.[].material_thickness
.[].max_charges
.[].name
.[].name_plural
.[].note
.[].phase
.[].power_armor
.[].price
.[].properties
.[].properties.[]
.[].properties.[].[]
.[].qualities
.[].qualities.[]
.[].qualities.[].[]
.[].revert_to
.[].snippet_category
.[].snippet_category.[]
.[].snippet_category.[].id
.[].snippet_category.[].text
.[].storage
.[].symbol
.[].techniques
.[].techniques.[]
.[].to_hit
.[].turns_per_charge
.[].type
.[].use_action
.[].use_action.activate_msg
.[].use_action.deactive_msg
.[].use_action.need_sunlight
.[].use_action.type
.[].volume
.[].warmth
.[].weight

Using the 'schema' def in schema.jq at
https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed#file-schema-jq
the result is:

$ jq -r -f schema.jq /tmp/armor.json
{
  "//": "string",
  "ammo": "string",
  "bashing": "number",
  "bashing_protection": "number",
  "category": "string",
  "charges_per_use": "number",
  "color": "string",
  "coverage": "number",
  "covers": [
    "string"
  ],
  "cut": "number",
  "cutting": "number",
  "description": "string",
  "encumbrance": "number",
  "environmental_protection": "number",
  "flags": [
    "string"
  ],
  "id": "string",
  "initial_charges": "number",
  "material": "JSON",
  "material_thickness": "number",
  "max_charges": "number",
  "name": "string",
  "name_plural": "string",
  "note": "string",
  "phase": "string",
  "power_armor": "boolean",
  "price": "number",
  "properties": [
    [
      "string"
    ]
  ],
  "qualities": [
    [
      "scalar"
    ]
  ],
  "revert_to": "string",
  "snippet_category": [
    {
      "id": "string",
      "text": "string"
    }
  ],
  "storage": "number",
  "symbol": "string",
  "techniques": [
    "string"
  ],
  "to_hit": "number",
  "turns_per_charge": "number",
  "type": "string",
  "use_action": "JSON",
  "volume": "number",
  "warmth": "number",
  "weight": "number"
}

fadado · 2017-03-17T12:46:32Z

> This is partly an enhancement request, and partly a request for comments.
>
> When confronted with a collection of JSON entities, it is often helpful to
> know whether there is an implicit schema, and if so, what it is.

The following code generates a simple JSON Schema according to http://json-schema.org/latest/json-schema-validation.html:

def isobject:
    type == "object"
;
def isarray:
    type == "array"
;
def isscalar:
    type| . == "null" or . == "boolean" or . == "number" or . == "string"
;

def schema:
    { "type": type } +
    if isobject then
        if length == 0 then null
        else
            . as $object |
            { "properties": (
                reduce keys_unsorted[] as $name (
                    {};
                    . + {($name): ($object[$name] | schema)}
                )
              )
            }
        end
    elif isarray then
        if length == 0 then null
        else
            { "items": (
                if all(isscalar) and (map(type) | unique | length) == 1 then
                    { "type": (.[0] | type) }
                elif length == 1 then
                   .[0] | schema 
                else
                    reduce .[] as $item (
                        [];
                        .[length] = ($item | schema)
                    )
                end
              )
            }
        end
    else null end # scalar
;

For example, for this input:

{
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumber": [
    {
      "location": "home",
      "code": 44
    }
  ]
}

the generated schema is:

{
  "type": "object",
  "properties": {
    "address": {
      "type": "object",
      "properties": {
        "streetAddress": {
          "type": "string"
        },
        "city": {
          "type": "string"
        }
      }
    },
    "phoneNumber": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string"
          },
          "code": {
            "type": "number"
          }
        }
      }
    }
  }
}

nicowilliams · 2017-03-17T15:46:58Z

@fadado Beautiful! Can I borrow that for jq?

fadado · 2017-03-18T08:22:50Z

> @fadado Beautiful! Can I borrow that for jq?

Of course! I have also written a validator, but it depends entirely on other jq modules I have written, so it cannot be separated from them. In any case, both the generator and the validator are in my GitHub JBOL repository.

nicowilliams added the feature request: jq function library label Jun 29, 2015

nicowilliams added the contribution label Jul 27, 2015

pkoppstein mentioned this issue Oct 2, 2015

Succinct expression for the following in jq? #972

Closed
