Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ER/RFC: Schema Inference #748

Open
pkoppstein opened this issue Apr 11, 2015 · 5 comments
Open

ER/RFC: Schema Inference #748

pkoppstein opened this issue Apr 11, 2015 · 5 comments

Comments

@pkoppstein
Copy link
Contributor

This is partly an enhancement request, and partly a request for comments.

When confronted with a collection of JSON entities, it is often helpful to
know whether there is an implicit schema, and if so, what it is. Even in the
case of a single JSON document, it is often useful to have a structural
overview, e.g. for navigation.

Spark SQL can infer a schema from a collection of JSON
entities. It can be printed using printSchema().

The example given in the O'Reilly Spark book on p. 172 is the pair of records:

{"name": "Holden"}

{"name": "Sparky the Bear", "lovesPandas": true, "knows": {"friends": ["holden"]} }

Using the proposed schema.jq below, we find:

$ jq -s -f schema.jq spark.json
{
  "knows": {
    "friends": [
      "string"
    ]
  },
  "lovesPandas": "boolean",
  "name": "string"
}

This is equivalent to the schema inferred by Spark SQL except that:

  1. the schema inference engine proposed here always produces a JSON entity;
    this facilitates the use of the inferred schemas, e.g. for integrity
    checking, while obviating the need for a special pretty-printer, since
    the results produced by the jq pretty-printer are eminently readable.
  2. the Spark SQL schema adds "nullable" and "containsNull" annotations,
    whereas the proposed schema inference engine regards nulls as placeholders
    without any particular structural significance.

As illustrated by the above example, the absence of a key in an object also
has no particular structural significance for either the Spark SQL inference
engine or the one proposed here.

Three noteworthy features of the proposed schema inference engine are:

a) the introduction of "scalar" as an extended type, e.g. ["scalar"] is the extended type signifying an array of 0 or more elements of scalar type;

b) the introduction of "JSON" as an extended type, e.g. ["JSON"] is the extended type signifying an array of 0 or more elements of any type;

c) arrays are only characterized by the extended type of their elements.

Thus, the following JSON object conforms to the above-mentioned schema:

{"name": "Paddington", "lovesPandas": null, "knows": {"friends": ["holden", "Sparky"]}}

See also #243

schema.jq

# Version 0.1

# This module defines three filters:
#   typeof/0 returns the extended-type of its input;
#   typeUnion(a;b) returns the union of the two specified extended-type values;
#   schema/0 returns the typeUnion of the extended-type values of the entities in the input array, if the input is an array;
#   otherwise it simply returns the "typeof" value of its input.

# Each extended type can be thought of as a set of JSON entities,
# e.g. "number" for the set of JSON numbers, and ["number"] for the
# set of JSON number-valued arrays including [].

# The extended-type values are always JSON entities.
# The possible values are:
# "null", "boolean", "string", "number";
# "scalar" for any combination of non-null scalars;
# [T] where T is an extended type;
# an object all of whose values are extended types;
# "JSON" signifying that no other extended-type value is applicable.

# The extended-type values are defined recursively:
# The extended-type of a scalar value is its JSON type.
# The extended-type of a non-empty array of values all of which have the same JSON type, t, is [t], and similarly for ["scalar"], and ["JSON"].
# The extended-type of [] is ["null"], signifying that the type of the array elements is indeterminate.
# The extended-type of an object is an object with the same keys, but the values of which are the extended-types of the corresponding values.

# typeUnion(a;b) returns the least extended-type value that subsumes both a and b.
# For example:
#  typeUnion("number"; "string") yields "scalar";
#  typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"};
#  typeUnion("null", t) yields t for any valid extended type, t.

def typeUnion(a;b):
  def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar";
  a as $a | b as $b
  | if $a == $b then $a
    elif ($a | scalarp) and ($b | scalarp) then "scalar"
    elif $a == "JSON" or $b == "JSON" then "JSON"
    elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ]
    elif ($a|type) == "object" and ($b|type) == "object" then
      ((($a|keys) + ($b|keys)) | unique) as $keys
      | reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) )
    elif $a == "null" or $a == null then $b
    elif $b == "null" or $b == null then $a
    else "JSON"
    end ;

def typeof:
  def typeofArray:
    if length == 0 then ["null"]
    else [reduce .[] as $item (null; typeUnion(.; $item|typeof))]
    end ;
  def typeofObject:
    reduce keys[] as $key ( . ; .[$key] = (.[$key] | typeof) ) ;

  . as $in
  | type
  | if . == "string" or . == "number" or . == "null" or . == "boolean" then .
    elif . == "object" then $in | typeofObject
    else $in | typeofArray
    end ;

def schema:
  if type == "array" then reduce .[] as $x ("null";  typeUnion(.; $x|typeof))
  else typeof
  end ;
@nicowilliams
Copy link
Contributor

I must say, this is very nice!

Here's a handy and simple schema inference program I use:

[path(..) | ["",(.[]|if type=="number" then "[]" else . end)]] |
 sort | unique | .[] | join(".") | sub("\\.\\[";"[") | sub("^\\[";".[")

What's handy about this is that it outputs jq path expressions. It needs a bit of work (to deal with object key names that need quoting because they aren't ident-like). What should we call this?

Should your and my schema utils be in 1.5, or in a module? I think this is almost a killer app for jq...

EDIT: formatting.

@pkoppstein
Copy link
Contributor Author

Using your program (with \ properly escaped) on armor.json at https://github.com/CleverRaven/Cataclysm-DDA/blob/master/data/json/items/armor.json:

.[]
.[].//
.[].ammo
.[].bashing
.[].bashing_protection
.[].category
.[].charges_per_use
.[].color
.[].coverage
.[].covers
.[].covers.[]
.[].cut
.[].cutting
.[].description
.[].encumbrance
.[].environmental_protection
.[].flags
.[].flags.[]
.[].id
.[].initial_charges
.[].material
.[].material.[]
.[].material_thickness
.[].max_charges
.[].name
.[].name_plural
.[].note
.[].phase
.[].power_armor
.[].price
.[].properties
.[].properties.[]
.[].properties.[].[]
.[].qualities
.[].qualities.[]
.[].qualities.[].[]
.[].revert_to
.[].snippet_category
.[].snippet_category.[]
.[].snippet_category.[].id
.[].snippet_category.[].text
.[].storage
.[].symbol
.[].techniques
.[].techniques.[]
.[].to_hit
.[].turns_per_charge
.[].type
.[].use_action
.[].use_action.activate_msg
.[].use_action.deactive_msg
.[].use_action.need_sunlight
.[].use_action.type
.[].volume
.[].warmth
.[].weight

Using the 'schema' def in schema.jq at
https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed#file-schema-jq
the result is:

$ jq -r -f schema.jq /tmp/armor.json
{
  "//": "string",
  "ammo": "string",
  "bashing": "number",
  "bashing_protection": "number",
  "category": "string",
  "charges_per_use": "number",
  "color": "string",
  "coverage": "number",
  "covers": [
    "string"
  ],
  "cut": "number",
  "cutting": "number",
  "description": "string",
  "encumbrance": "number",
  "environmental_protection": "number",
  "flags": [
    "string"
  ],
  "id": "string",
  "initial_charges": "number",
  "material": "JSON",
  "material_thickness": "number",
  "max_charges": "number",
  "name": "string",
  "name_plural": "string",
  "note": "string",
  "phase": "string",
  "power_armor": "boolean",
  "price": "number",
  "properties": [
    [
      "string"
    ]
  ],
  "qualities": [
    [
      "scalar"
    ]
  ],
  "revert_to": "string",
  "snippet_category": [
    {
      "id": "string",
      "text": "string"
    }
  ],
  "storage": "number",
  "symbol": "string",
  "techniques": [
    "string"
  ],
  "to_hit": "number",
  "turns_per_charge": "number",
  "type": "string",
  "use_action": "JSON",
  "volume": "number",
  "warmth": "number",
  "weight": "number"
}

@fadado
Copy link

fadado commented Mar 17, 2017

This is partly an enhancement request, and partly a request for comments.

When confronted with a collection of JSON entities, it is often helpful to
know whether there is an implicit schema, and if so, what it is.

The following code generates a simple JSON Schema according to http://json-schema.org/latest/json-schema-validation.html:

def isobject:
    type == "object"
;
def isarray:
    type == "array"
;
def isscalar:
    type| . == "null" or . == "boolean" or . == "number" or . == "string"
;

def schema:
    { "type": type } +
    if isobject then
        if length == 0 then null
        else
            . as $object |
            { "properties": (
                reduce keys_unsorted[] as $name (
                    {};
                    . + {($name): ($object[$name] | schema)}
                )
              )
            }
        end
    elif isarray then
        if length == 0 then null
        else
            { "items": (
                if all(isscalar) and (map(type) | unique | length) == 1 then
                    { "type": (.[0] | type) }
                elif length == 1 then
                   .[0] | schema 
                else
                    reduce .[] as $item (
                        [];
                        .[length] = ($item | schema)
                    )
                end
              )
            }
        end
    else null end # scalar
;

For example, for this input:

{
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumber": [
    {
      "location": "home",
      "code": 44
    }
  ]
}

the generated schema is:

{
  "type": "object",
  "properties": {
    "address": {
      "type": "object",
      "properties": {
        "streetAddress": {
          "type": "string"
        },
        "city": {
          "type": "string"
        }
      }
    },
    "phoneNumber": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string"
          },
          "code": {
            "type": "number"
          }
        }
      }
    }
  }
}

@nicowilliams
Copy link
Contributor

@fadado Beautiful! Can I borrow that for jq?

@fadado
Copy link

fadado commented Mar 18, 2017

@fadado Beautiful! Can I borrow that for jq?

Of course! I have also written a validator, but it depends entirely on the modules of jq I have written; it is impossible to separate it. In any case, generator and validator are in my GitHub JBOL repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants