2020-10-21

Significance of Immutability, Extensibility, and Versioning in Persistent Data

Writing a program is all fun and games until someone persists your data. Your user data gets saved in the database, persisted! Some folks use your API, persisted! Your website's JS files cached, BAM, persisted! They're actually gonna be alright until you need to change something. When it's impossible to do without breaking change, emergency meeting!

Data persistence is a commitment. Commitments are usually fine. Let's make a database table for users, we need an id, username, and a password. Also, let's add user display_name. And then the table gets filled. That's you committing that that's the schema of the user table.

Someone on your team then decides that the code needs some refactoring. The database table users is not well-named and you need to add some other fields to that database, role, is_verified. You then decide to split the database to users and user_authentications, the former including display_name, role, the latter has username, is_verified, because the former and the latter's relationship will be one-to-many, supporting multiple authentication methods.

Write the service and database migration, test it in a sandbox, migrate, deploy! But, had you know you will need multiple authentication methods, would you start the same way? Some people would say YAGNI, others would say "let's prepare for the future (read: keep things open for extension)". Either way, when commitments change a lot, things get meddlesome, and that's quite not fine.

Data persistence is not just about the information stored in the database though. It's everything that your users interface with. If you're providing some Restful APIs, the actual list of paths, the schema, the responses, are the persisted data. If your users are actual humans, data are your software's perceivables. If you're making CPU chips, the information about the instruction set, the shape of the chip, the flops, are the data that are persisted in the form of user manual. Ok, that is reaching, but you know what I mean.

Fortunately, while I'm on the journey, making software, treating them as art, I found some tricks that help with these commitments, namely immutability, extensibility, and versioning.

Immutability and Extensibility

A broken clock is right twice a day. That's how immutability is so magical. It's reliable. Immutable data are cachable. Immutable data types and contracts are... well, reliable.

Take HTML5 syntax for example. A node is either a doctype notation, an element, CDATA, or text node. An element is either <tag_name>content</tag_name> or <tag_name />. A comment is . There's CDATA which I have never used. The rest is text node. That's how those are stored and read (a.k.a. persisted). That's how to make it never introduce breaking change.

HTML5's element syntax is incredibly powerful. It opens an opportunity to be extended. HTML spec designers can create near endless types of supported elements. Now there are <div>, <span>, and <h1>, in the future <fingerprint> tag may be introduced. With HTML5's element syntax, any software that reads HTML will not break if a new element is introduced. If you don't update your browser in 5 years, and it loads a document that uses a relatively new element introduced 2 months ago, it can simply ignore those unknown elements instead of getting confused and crashing because of unsupported syntax.

This element syntax is extensible. An HTML specification writer can add any element they want because the options for naming their new element is practically infinite. The options being infinite allows immutability in tag names. The concept of <select> or <input>, for example, will probably never change, except if extremely needed, like security reasons. It is because there's no need for competition over tag name, the resource is simply abundant. So these concepts of immutability and extensibility reinforce each other.

Immutability and extensibility are really useful for a persistent data format. HTML5, machine-readable log file, ODT format, CRDT format on transport, et cetera.

A hypothetical study case: machine-readable log file. (Disclaimer: This scenario below is all made up)

This particular software needs to produce log files. A log file needs to be machine-readable for visualization. The log file also needs to be preserved for later review. This software is in a prototype phase and currently, the log only captures global-level uncaught exceptions the NodeJS style.

To make the log file extensible, it uses JSON Lines with a t field for tagging.

{"t": "uncaught-exception", "stacktrace": "RangeError: Maximum call stack size exceeded \n      at WriteStream (..."}
{"t": "uncaught-exception", "stacktrace": "RangeError: Maximum call stack size exceeded \n      at WriteStream (..."}

When there's another feature to be added into the log file, it's just another value of t. For example, the software turned to be GC, and the developers need to log the time it starts the sweep.

{"t": "sweep", "duration": 210, "heapsize-old": 13212057, "heapsize-new": 12013457 }
{"t": "uncaught-exception", "stacktrace": "RangeError: Maximum call stack size exceeded \n      at WriteStream (..."}
{"t": "uncaught-exception", "stacktrace": "RangeError: Maximum call stack size exceeded \n      at WriteStream (..."}

To make it immutable, never change the meaning of sweep and uncaught-exception. Maybe deprecate it someday, like HTML5 spec deprecates <frame>, <applet>, or <big>, but never change the meaning of a tag.

Easy. Painless. There is no worry the log file consumer will crash from an extension. The probability of that happening is minuscule.

Versioning as Documentation and Directive

A browser needs to be updated regularly to be able to read the latest HTML tag. When a persistent data format changes, the consumer needs to be adjusted so it can fully comprehend the new data format. Therefore it is very important for the persistent data format designer to communicate changes to the consumer's creator.

Versioning is very useful in communicating changes. Let's take a versioning schema which has two parts, the major number and the minor number. Major number increase indicates breaking change, minor number increase indicates addition. We talked about immutability, but when a breaking change is needed, this versioning schema is powerful enough to communicate that.

Version is best when embedded in the persistent-data. This would act as a directive for the consumer to know beforehand whether the consumer can read it and how to read it. In HTML this is the DOCTYPE declaration. Another example with a similar would be shebang. It's not exactly versioning but it functions similarly.

// versioning
// format: x.y
// x: major
// y: minor

Example
1.0 <- initial version: features include: "uncaught-exception"
1.1 <- a feature added: "sweep" 
1.2 <- a feature added: "regular-heapsize-check"
2.0 <- breaking change: "sweep" payload definition and format is changed
2.1 <- a feature added: "non-main-stack-created", "non-main-stack-destroyed"
3.0 <- breaking change: "regular-heapsize-check" definition and format is changed

The versioning history example above tells a story about the progression of the log file. There are two breaking changes introduced, changes in the sweep and regular-heapsize-check tag definition. There are also a couple of additions, sweep, regular-heapsize-check, non-main-stack-created, and non-main-stack-destroyed.

The version can be embedded in the file as a file header. The header is the lines above --- while the data is the lines below it.

version: 2.0
---
{ "t": "sweep", "time": 1603285857926, "stackId": 0, "duration": 210, "heapsize-old": 13212057, "heapsize-new": 12013457 }
{ "t": "sweep", "time": 1603285906613, "stackId": 0, "duration": 210, "heapsize-old": 13212057, "heapsize-new": 12013457 }

Because the version is embedded in the header, the consumer can read the version before it reads the body. It can know that a log file is of version 2.0 and proceed to use 2.0 parser for it. The consumer implementor can also get creative, supporting multiple major versions by having multiple parsers in it, 1.x, 2.x, and 3.x.

Another case where version helps is when a persistent-data consumer is receiving many things and they only want a certain format with a certain version to be processed.

In conclusion, versioning as a persistent-data format is super useful for documentation and directive, complementing immutability and extensibility.