Schema for Clojure(Script) Data Shape Declaration and Validation

tl;dr: We open-sourced Schema to get many of the benefits of type systems in Clojure with less hassle. In the future, we will use Schema declarations to do awesome things like auto-generate Objective-C classes (take a peek). Join the discussion on Hacker News and tell us what you think.

One of the touted benefits of functional programming is that it produces code that is easier to understand and more reusable. The reasoning behind this claim goes as follows: well-written functional code consists of many small pure functions. Since these functions are free of side effects, you only need to understand a function's inputs and outputs to understand its behavior. Unfortunately, working out what a function does and what its inputs and outputs look like can be a non-trivial task. For instance, consider a function update-share-counts that takes two arguments, share-counts and updates.
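A minimal sketch of such a function follows. Only the names update-share-counts, share-counts, and updates come from the text; the body and the data shapes it implies are assumptions made for illustration.

```clojure
;; Hypothetical body: assumes share-counts is a map and updates is a
;; sequence of two-element pairs. Neither shape is stated in the signature.
(defn update-share-counts
  [share-counts updates]
  (reduce
   (fn [counts [k delta]]
     ;; Add delta to the existing count for k, treating a missing count as 0.
     (update-in counts [k] (fnil + 0) delta))
   share-counts
   updates))
```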

This is a pure function. But what in the world does a share-counts data structure look like and what are updates? In this specific case, you have to actually read the body of the function and infer the nature of the inputs; in more complex real-world examples, you might not be able to tell from the function itself and have to track down calls to the function to figure out the inputs and outputs. In a large code base, this can be a real maintainability nightmare.

Now, the author of this function could definitely have written a doc-string for the function, explaining the 'shape' that share-counts and updates ought to have:
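As an illustration, such a doc-string might read as follows. The shapes it describes (a map from URL strings to integer counts, and a sequence of url/delta pairs) are assumptions for this sketch, not something the function itself enforces.

```clojure
;; Hypothetical doc-string; the described shapes are assumed for illustration.
(defn update-share-counts
  "Applies updates to share-counts and returns the new counts.

   share-counts: a map from URL string to integer share count.
   updates: a sequence of [url delta] pairs, where delta is an integer
            to add to the count stored under url."
  [share-counts updates]
  (reduce
   (fn [counts [url delta]]
     (update-in counts [url] (fnil + 0) delta))
   share-counts
   updates))
```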

This is less than ideal for a number of reasons:

  • Code often changes faster than documentation, so the doc-string will likely become outdated
  • There's no way to automatically check at compile time or runtime that inputs or outputs match this specification
  • If the share-counts data shape is used throughout the namespace, it becomes very repetitive to specify in each function
  • There are many natural language descriptions of the same shape, making it difficult for a team to adopt uniform standards or get good at quickly 'parsing' this shape description

To easily understand this function, it's not enough to know that it's free of side effects and mutation; we also want to understand how to use it: what kind of data should I provide, and what will it give me back? Ideally, the way we express this should be declarative, and when we really want to, we should be able to have the function check that its inputs and outputs match these expectations.

Don't you really just want static types?

Many readers will rightfully wonder: don't you just want your language to be statically typed? If so, there's a great Clojure project out there, as well as other JVM languages (like Scala), for you. While having types as a first-class citizen in Clojure would address much of what we're asking for above, type systems aren't the only way to do so.

Static typing can be very useful and certainly provides documentation as well as some form of compile-time safety, but full-blown static typing comes with many undesirable drawbacks. We don't want to rehash the static vs. dynamic typing arguments (see here for a good overview), but suffice it to say that one of the strengths of Clojure is that (almost) everything can be manipulated as data, and this promotes substantial reuse when utilized correctly. A common drawback of statically-typed languages is that they often yield type coupling, where reuse of code becomes difficult because a type unnecessarily constrains use.

For example, consider a Clojure function that takes a map containing a :first-name key and a :last-name key and adds a :name key that concatenates them. In a statically-typed language, the idiomatic way to do this might be to declare a User (case) class with firstName and lastName fields (or have an interface for Nameable things), and use of the function must then be limited to objects which are of that class or bother to implement that interface. Conceptually though, there are many places in a large code base where you might want this operation (for instance, working with contact information from external networks). You don't want to limit by type, but by the attribute that your data is a map that has :first-name and :last-name keys. In Clojure(Script), you might write this as:
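A minimal sketch of such a function, assuming the full name is the first and last names joined by a single space (the separator is an assumption):

```clojure
;; Works on any map with :first-name and :last-name keys, whatever else it holds.
(defn with-full-name
  [m]
  (assoc m :name (str (:first-name m) " " (:last-name m))))
```

Because the function only looks at two keys, the same code applies to a user record, a contact imported from an external network, or any other map carrying those keys.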

While it's possible to get at something like this with things like structural typing, it's not what type systems are optimized for, nor do they make it easy.

Introducing Schema

At Prismatic, one of the issues we were running into with a large Clojure(Script) codebase was the difficulty of documenting the kind of data functions took and what they returned. Documentation aside, when the contract of a function is broken at runtime, Clojure often might not fail at all, and it's up to the developer to track a nil halfway through the codebase to find the root cause.

For these reasons, we built the Schema library for declaring and validating data shapes. Schema isn't a full-blown type system, but a lightweight, flexible DSL for describing data requirements, more appropriate to how Clojure functions should delimit use. For example, our with-full-name example would be written as:
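A sketch of that declaration, assuming the library's schema.core namespace is aliased as s and that s/Str is used to describe strings. The exact schema shown here is an illustration rather than the original listing.

```clojure
;; Requires the prismatic/schema library on the classpath.
(require '[schema.core :as s])

;; s/defn is a drop-in for defn that accepts schema annotations after :-.
;; The input must have string values under :first-name and :last-name;
;; the s/Any s/Any entry permits any additional keys and values.
(s/defn with-full-name
  [m :- {:first-name s/Str :last-name s/Str s/Any s/Any}]
  (assoc m :name (str (:first-name m) " " (:last-name m))))
```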

The :- in the parameter vector says that m satisfies the schema on the right-hand side; in this case, that m is a map that has strings under the :first-name and :last-name keys, plus any number of arbitrary key-value mappings (the s/Any maps to s/Any part).

An extensible protocol

At the heart of Schema is a protocol that knows how to check whether data satisfies a schema and how to explain where input data and a schema mismatch. The Schema library extends the protocol to many of the built-in data types to provide descriptions of data that read relatively naturally:
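A few illustrative schemas, sketched with schema.core's validate and check functions. The Person schema and all sample data are made up for this example.

```clojure
(require '[schema.core :as s])

;; Leaf schemas: validate returns the value when it matches
;; and throws when it does not.
(s/validate s/Num 42)
(s/validate s/Keyword :likes)

;; A vector containing one schema describes a homogeneous sequence.
(s/validate [s/Num] [1 2 3])

;; A map with a schema as its key describes any number of matching entries.
(s/validate {s/Keyword s/Int} {:likes 10 :shares 2})

;; Keyword keys are required by default; s/optional-key relaxes that.
(def Person {:name s/Str (s/optional-key :age) s/Int})
(s/validate Person {:name "Ada"})

;; check returns nil on a match, and otherwise a data structure
;; describing where the value and the schema disagree.
(s/check Person {:name "Ada" :age 36}) ; nil
(s/check Person {:name 42})            ; non-nil description of the mismatch
```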

Schemas are still just data

Since we've extended the Schema protocol to Clojure core data structures, you can think of Schemas as data that you can combine and use Clojure code to manipulate.
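As a small sketch of this idea, ordinary map functions such as assoc and merge can build new schemas from existing ones. The schema names below are hypothetical.

```clojure
(require '[schema.core :as s])

;; A base schema, defined as a plain Clojure map.
(def Named {:first-name s/Str :last-name s/Str})

;; assoc adds a required key, producing a new schema.
(def NamedWithEmail (assoc Named :email s/Str))

;; merge with an s/Any entry opens the schema to arbitrary extra keys.
(def OpenNamed (merge Named {s/Any s/Any}))

(def ada {:first-name "Ada" :last-name "Lovelace" :age 36})

(s/check Named ada)     ; non-nil: map schemas are closed, so :age is rejected
(s/check OpenNamed ada) ; nil: the extra key is allowed
```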

Returning to our earlier example with update-share-counts, many of the kinds of data passed to this function might be re-used throughout a namespace (as types are re-used in a code base). We can simply define those schemas and use them throughout our code:
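One way this might look, assuming that share counts are a map from URL strings to integers and that updates are url/delta pairs. The schema names ShareCounts and Updates, and both shapes, are hypothetical.

```clojure
(require '[schema.core :as s])

;; Named schemas that can be reused across a namespace (shapes assumed).
(def ShareCounts
  "Map from URL string to integer share count."
  {s/Str s/Int})

(def Updates
  "Sequence of [url delta] pairs."
  [[(s/one s/Str "url") (s/one s/Int "delta")]])

;; The schema after the function name describes the return value.
(s/defn update-share-counts :- ShareCounts
  [share-counts :- ShareCounts
   updates :- Updates]
  (reduce
   (fn [counts [url delta]]
     (update-in counts [url] (fnil + 0) delta))
   share-counts
   updates))
```

The signature now states the shapes that the doc-string version could only describe in prose, and the same two names can annotate every other function that handles this data.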

Flexible validation

By default, putting schemas on your functions doesn't do anything other than add documentation (which is no small thing). There are several ways to turn on validation for specific parts of your code. You can use the with-fn-validation macro to turn it on for any block:
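A sketch of the difference, reusing a schematized with-full-name. The catch clause names clojure.lang.ExceptionInfo, so this version is specific to JVM Clojure.

```clojure
(require '[schema.core :as s])

(s/defn with-full-name
  [m :- {:first-name s/Str :last-name s/Str s/Any s/Any}]
  (assoc m :name (str (:first-name m) " " (:last-name m))))

;; Validation is off by default, so a bad input slips through
;; and the missing :last-name silently becomes part of the result.
(def unchecked (with-full-name {:first-name "Ada"}))

;; Inside with-fn-validation, the same call throws before the body runs.
(def checked
  (try
    (s/with-fn-validation
      (with-full-name {:first-name "Ada"}))
    (catch clojure.lang.ExceptionInfo e
      :schema-error)))
```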

The most common use of validation will probably be as a fixture in your test namespace, so that all the functions called by your tests check schemas. In JVM Clojure with clojure.test, this can be accomplished by adding

(use-fixtures :once schema.test/validate-schemata)

to the bottom of your test file. Another great use case is to declare schematized defrecords, and use validation to ensure your data is up to snuff before persisting it or passing it over the wire.
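A minimal sketch of a schematized record, assuming schema.core's s/defrecord macro. The Share record and its fields are hypothetical.

```clojure
(require '[schema.core :as s])

;; s/defrecord mirrors defrecord but accepts schema annotations on fields.
(s/defrecord Share
  [url :- s/Str
   n   :- s/Int])

;; Checking an instance against the record type also checks its fields.
(s/check Share (->Share "http://example.com" 3))       ; nil
(s/check Share (->Share "http://example.com" "three")) ; non-nil
```

Running a check like this just before writing a record to a datastore or serializing it keeps malformed data from leaving the process.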

Sharing Schemas between Clojure and ClojureScript

The Schema library supports both Clojure and ClojureScript. While Schema supports platform-specific features (like primitives on the JVM or prototype chains in JS), many schemas can actually be shared as data between the two. For instance, backend and frontend can share a set of cross-platform schemas describing the API inputs and outputs.

Conclusion and Future Work

We've found Schema to be a productivity win for our Clojure teams and a significant time-saver in tracking down runtime issues. We have many plans for broader use of Schema, including:

  • Generating Objective-C domain classes from schemas (check out a preview of this code)
  • Generating core.typed form annotations from function schemas to get compile-time shape validation
  • Generating Avro specifications from schemas