Software Engineering at Prismatic

A few weeks ago, we gave a keynote at ClojureWest about our approach to building large real-time systems (slides available here). Since the audience of the talk was mostly Clojure hackers and enthusiasts, the talk focused on how we leverage Clojure to build our backend systems. Future posts will dive into specific libraries we've built for storage and aggregation, streaming graph computations, and fast numerical optimization and machine learning. In this post, we're going to unpack our higher-level software engineering philosophy: Use functional programming to build fine-grained, flexible abstractions. Compose these abstractions to express problem-specific logic. Avoid large, monolithic frameworks like Hadoop.

But first, some relevant context about Prismatic: the backend team consists of four engineers, two of whom are founders. The kinds of things Prismatic builds for the backend include web crawlers, social graph analysis over millions of entities, learned topic models, and real-time relevance ranking. We have the standard 'big data' challenges: managing and storing massive streaming document collections, social network signals, and user events. In addition to managing a lot of data, we also have the more difficult challenge of actually doing something useful with all this data. For instance, our relevance ranking system must maintain an online clustering of related docs and real-time indices of entities and social cues, and then return personally ranked feeds directly to the user.

What lets us build these components quickly is a combination of these principles, but more relevant is our team composition: It's small, but extremely talented. Between the four of us, we have three CS PhDs, the author of one of the most popular open-source NLP tools, a former Microsoft Research fellow, and a Google code jam finalist. The reason this is important is that on any given day, you may be building a service which knits together many disparate libraries and suddenly you need to dip down deep into the weeds and modify a critical library that everyone uses. There isn't room for people to be just API clients or not have a deep understanding of computer science.

Prefer Libraries to Frameworks

We're strong believers in the abstraction principle: code is good to the extent that it consists of re-usable abstractions. This yields a smaller code base, which, all things equal, is less error prone, simpler to understand, and easier to extend. For this reason, we prefer libraries over frameworks. The distinction can be fuzzy, but your code calls a library and a framework follows the Hollywood principle: Don't call us, we'll call you. In the JVM world, Tomcat and Hadoop are frameworks, while Jetty and Lucene are libraries.

Frameworks can be convenient so long as your needs don't deviate from how the framework wants to do things. Problems arise when you run into framework inflexibility and need to adapt or decompose its functionality. So if you have a standard batch MapReduce problem, Hadoop may do the trick. But if your needs change, say you want your computation to be streaming rather than batch, you have to dive deep into the framework to alter functionality. In principle, you should be able to use pieces of Hadoop functionality a la carte and compose them as you'd like. But that's generally not how frameworks are written.

We believe that functionality should be packed into finer-grained libraries and applications should compose these pieces. We tend to build our own libraries simply because much of the functionality of open-source code is locked up into monolithic frameworks and not easily re-usable. While rolling your own libraries has a higher upfront cost, in the long run you get a lot of flexibility in your codebase and gain a deeper understanding of the components you work with.

Prefer Functional Programming (FP) to Imperative/Object-oriented style

There are several reasons for the growth of FP: immutable data structures are better suited for dealing with multi-core concurrency and pure functions are easier to understand, easier to verify, and (as a consequence) less error-prone. These points are all accurate, but the most important advantage of FP is that having functions as first class citizens facilitates substantially richer abstractions, and therefore better libraries. Especially on the JVM, many popular open-source projects (Hadoop, Lucene, etc.) were written in Java and are limited in their re-usability in part because the underlying language can't elegantly encode rich abstractions. In a future post on storage and aggregation, we'll look at the specific example of buffering and flushing in key-value storage.

Our backend language choice is Clojure on the JVM. There are other great functional languages on the JVM, like Scala, but what we like about Clojure is that it has a small, simple core focused around processing data represented in its excellent set of persistent data structures (or their mutable java.util equivalents, when the need strikes). While we make heavy use of the core of Clojure, we don't use its concurrency primitives (atoms, refs, STM, etc.) because a function like pmap doesn't have enough fine grained control for our needs. We opt instead to build our own concurrency abstractions in Clojure on top of the outstanding java.util.concurrent package.

Roadmap

Here's a sneak peak of some individual libraries we'll be writing about soon.

  • Store: our library of storage and aggregation abstractions. We have a core key-value storage protocol with implementations for many storage engines, and abstracted operations on this protocol for common large-system needs including buffering, checkpointing, and flushing.
  • Graph: our real-time stream processing library. Graph separates an abstract specification of a streaming computation from how the computation is executed, and also allows for fine-grained, real-time monitoring of processing.
  • Flop & Optimize: our fast primitive array manipulation and numerical optimization libraries. These libraries are written in pure Clojure and their performance matches optimized Java without sacrificing expressiveness or succinctness; for instance, a state-of-the-art numerical optimization is implemented in < 180 lines of Clojure.