JSON schema validation


#1

It is great that NoSQL/Couchbase does not require schema definition before writing a document. But it has the drawback that the database does not enforce consistency across docs. So I devised the following method for adding this control to our architecture.

The idea is that each bucket, besides its regular docs, also stores the schema of those docs. Then the application can use those schemas as it sees fit. It could check every doc against its schema before writing it to CB, or it could randomly check a subset of docs each night. Or only check important docs that have the potential to break something bad.

I am looking for feedback/suggestions on this. Others may copy as they want.

Standards

  • All regular (persisted to disk) documents in Couchbase store their value in JSON format (www.json.org).

  • Every document contains two text fields that state the schema for that document. The fields are docSchema and docSchemaVersion.

  • All documents contain a docType field that is used to distinguish logical document sets within a bucket.

  • Each Couchbase bucket contains a set of special documents that hold the data schemas for all the other documents in that bucket. The names (keys) of these documents are docSchemas and docSchemasNNNN (where NNNN is 0001, 0002… for previous versions).

  • Each of the schema documents contains a set of JSON schemas (http://json-schema.org), one for each schema in that bucket. The schemas define the list of fields permitted in a document and optionally other constraints such as required fields, non-null fields, numeric ranges for a field, keywords allowed, etc.

  • When a schema changes, the change is backward compatible so new documents match the previous schemas.

  • More than one docType can share a schema. So “worker” and “manager” documents can use the same “person” schema.

Not Addressed in this Standard

  • Memory-only (memcached) buckets that are not written to disk. This data may be more free-form and not governed by these standards.

Example

Bucket: HR_RECORDS

Document: key = jsmith, value = { "docSchema":"utd-hr-person", "docSchemaVersion":"1.0.4", "docType":"worker", "lName":"smith", "fName":"john", "city":"waltham"} 

Document: key = bjones, value = { "docSchema":"utd-hr-person", "docSchemaVersion":"1.0.4", "docType":"manager", "lName":"jones", "fName":"bob", "city":"newton"} 

Document: key = devops, value = { "docSchema":"utd-hr-dept", "docSchemaVersion":"2.2.2", "docType":"department", "name":"Development Operations", "location":"230 Third Ave, Waltham"}
 
Document: key = docSchemas, value = { 
    "utd-hr-person" : {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "utd-hr-person | 1.0.4",
    "description": "The JSON schema for a person who works at UpToDate",
    "type": "object",
      "properties": {"docSchema": {"type": "string"},
        "docSchemaVersion": {"type": "string"},
        "docType": {"type": "string"},
        "lName": {"type": "string"},
        "fName": {"type": "string"},
        "city": {"type": "string"}
      },
      "required": ["docSchema","docSchemaVersion","docType","lName"]
    },
    "utd-hr-dept": {
      "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "utd-hr-dept | 2.2.2",
    "description": "The JSON schema for a company department at UpToDate",
    "type": "object",
      "properties": {"docSchema": {"type": "string"},
        "docSchemaVersion": {"type": "string"},
        "docType": {"type": "string"},
        "name": {"type": "string"},
        "location": {"type": "string"}
      },
      "required": ["docSchema","docSchemaVersion","docType","name"]
    }
    } 

Document: key = doc-schemas-0001, value = similar to above schema, but perhaps with fewer fields


#2

Hi @eben, please engage on this.


#3

Hi @chuck.connell,

FYI, in Couchbase Server 4.5 there is a new N1QL command ‘infer’, which uses a random sample of documents to perform schema inferencing to produce a schema for the documents compliant with the json-schema.org draft standard. It has some extensions, since the standard is designed to be prescriptive, while our schema is descriptive of what was found. For example, we can describe how frequently a field is seen, if it’s not present in every document. Also, when documents are very dissimilar (as can often happen in buckets), we generate multiple schemas for the different ‘flavors’ of document.

Please take a look at the ‘infer’ command and see if it could help you with implementing this proposal.

Regards,

-Eben


#4

Wow! This is very cool. I will try it out. I’m not sure it exactly addresses our goal, since we want to state up front what the doc schemas should be, based on data models from our whiteboards. But “infer” does sound useful in various ways.

Thanks.


#5

Yes, we need schema validation. Let’s continue to work together on this and get something standardized. Adding @till to this thread.