Parallels of Antifragility in Software Space

Antifragile is Nassim Nicholas Taleb’s term for things that gain from stress; he coined it and popularized it in the book of the same name. When we look at software closely, it could not be more different. Software is fundamentally fragile. How, then, would the idea of antifragility fit into the software space?

Before going into the topic, I’d like to underline that this article is just an observation of the idea at a glance. Antifragility is an abstract and malleable concept. It is broad, but not everything fits inside it. A number of ideas from the book can be drawn parallel to software, yet the software world differs enough from the physical world that some ideas need to be tweaked and twisted to make sense. That doesn’t mean the concept can’t be applied at all.

What I want to put into this article are things I’ve observed while designing and building complex software. My workplace is in the game backend business. If you hear gamers mention game marketplaces, matchmaking, voice chat, loot boxes, or analytics, that’s us. Essentially, we’re building the parts of games that are not the game itself. I’ve moved around within the company and tackled some memorable engineering incidents, although de jure my job is designing and building a desktop application. So I know a few things from the infrastructure level up to company leadership, and I’ve seen patterns there that resemble the fragile-antifragile phenomena.

Software is Fragile

As you might have often heard, software, or computer programs, are 0s and 1s plus rules that are commonly agreed upon. What rules? Take Morse code, for example. In Morse code, • – is A, – • • • is B, and so on. People can communicate through Morse code because they agree that a certain sequence of dots and dashes maps to a certain letter. Modern computers work on the same principle; instead of dots and dashes, they work with 0s and 1s.
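As a toy sketch in TypeScript (the table and names are mine, and the Morse table is deliberately partial), the agreement might look like this:

```typescript
// A partial, agreed-upon table: sequences of dots and dashes map to letters.
const MORSE: Record<string, string> = {
  ".-": "A",
  "-...": "B",
  "-.-.": "C",
};

// Communication works only because both parties hold the same table.
const decode = (sequence: string): string => MORSE[sequence] ?? "?";

console.log(decode(".-"));   // "A"
console.log(decode("---x")); // "?": no agreed rule, no meaning
```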

A rule is always one possibility out of many. From all interpretations of a bit sequence, one is chosen. “Let’s agree that 1 is true”, for example. Building with rules is suppressing chaos. Via Positiva. Each rule is stacked on top of several others, creating a skyscraper of potential energy should it collapse.

In contrast, a living organism is antifragile. A living organism might have a complete ecosystem as one of its parts. The animal kingdom has redundant systems to self-heal and grow. Cells have their own food cycle, forming a sub-ecosystem. Cells also have a mechanism for apoptosis, programmed self-destruction. One function of apoptosis is reducing the chance of developing cancer. Thus the animal, the supersystem of the cell, has its life prolonged as a result.

At a larger scope, the ecosystem consists of elephants, bananas, mushrooms, spiders, programmers, and other organisms. Each member of the ecosystem is one of its subsystems. The subsystems are redundant and sometimes eat or trample each other in order to survive. The ecosystem is chaotic, but unless touched by an armageddon-level force majeure, the redundant nature of the subsystems provides self-stabilization to the ecosystem, much like an elephant’s organs (the subsystems) do for the elephant (the system).

Despite this fragile nature, how did we get to the point where computers are highly advanced?

On the Path to Antifragility

In the world of pure software (running without human intervention), unexpected extreme occurrences, or what Taleb calls black swans, will almost always topple the stack of rules. To avoid accidents, the effect of black swans on software must be reduced to zero; that is, the software must be made robust. So we turn to physics and math.

Physics and math are robust, so they make reliable tools for robustifying software. 1+1 will always be 2, multiplication is always faster than factorization, and so on. Pure math (derived from the laws of physics) and logic should be used as the foundation for the rules.

Correct, complete, and finite

To avoid black swans at the lower level, the policy is “eliminate all but the green ones”. Multiple versions of scenarios, called branches, are woven into the software. At this level, the policy is to “eliminate all edge cases”, or to be complete. Extreme Via Negativa. To explain what I mean by “eliminate all edge cases”, let’s take an example: an HTTP client’s handling of HTTP status codes.

HTTP is a protocol used for client-server communication. Two parties are involved: the HTTP client and the HTTP server. A client makes a request to a server, which then gives a response back to the client. The status code is a value the server uses to communicate the status of a response.

An HTTP status code is a three-digit number. The first digit indicates the class of the status, while the other two digits represent the categorization within each class. The classes are 1xx, 2xx, 3xx, 4xx, and 5xx. As an example, 5xx is the server-error class: 500 means “internal server error”, 501 means “not implemented”, 502 means “bad gateway”, and so on up to 511. So what happens if the server returns 512, a non-standard status code? In this case, the HTTP client must treat it as a server error because it is a 5xx status code. What happens if the server returns 999? In that case, the HTTP client must pass the response as-is to the client’s client. Since an HTTP status code consists of three digits, every number from 0 to 999 has a handler, even though the non-standard ones are treated differently. The control flow branches are handled in a complete manner.
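As a minimal sketch of such complete, class-based handling in TypeScript, with made-up names and a made-up fallback policy rather than any real client’s API:

```typescript
// Every possible three-digit status (and anything else) lands in some branch.
type Outcome =
  | { kind: "informational" }   // 1xx
  | { kind: "success" }         // 2xx
  | { kind: "redirect" }        // 3xx
  | { kind: "client-error" }    // 4xx
  | { kind: "server-error" }    // 5xx, including non-standard codes like 512
  | { kind: "pass-through"; status: number }; // anything outside 1xx-5xx

function classify(status: number): Outcome {
  switch (Math.floor(status / 100)) {
    case 1: return { kind: "informational" };
    case 2: return { kind: "success" };
    case 3: return { kind: "redirect" };
    case 4: return { kind: "client-error" };
    case 5: return { kind: "server-error" };
    default: return { kind: "pass-through", status }; // e.g. 999: hand it over as-is
  }
}
```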

Imagine if an HTTP client were simply not complete: a browser that crashes every time the server misbehaves. That is a total nightmare, not a world I want to live in.

Completeness is easy to achieve with finite things. In recent years, programming languages have evolved better mechanisms to prevent programming mistakes. One finding from research in the functional programming field is the tagged union data type. A tagged union is a tool for describing a value that can have one of several meanings or shapes. For example, it can record past occurrences in a data structure with a finite set of variants. Let’s see a more concrete analogy below.

At the end of a day of work, you write a report. A supervisor looks for a red flag in the report. If there is none, there’s nothing to be done. But if there is a red flag, the supervisor takes mitigation action; the action can be anything, depending on the content of the report. Pay attention to the report: regardless of its content, it falls into two major categories, red flag and no red flag. This major category is the decisive factor in what the supervisor does next. The report is a tagged union. It records your work of the day, the past occurrences, into two major variants: “red flag” and “no red flag”. The rest of the report is additional information that might or might not affect the supervisor’s action. The supervisor’s action is the analogy for the code or program that receives the tagged union.

Several functional-flavored programming languages even have a mechanism to make sure programmers do not forget to handle every variant of a tagged union. Some of them actively narrow the set of possible variants as the program flows, a mechanism called flow-sensitive typing. This topic is also often discussed as “making illegal states unrepresentable”. (See this awesome video: Domain Modelling Made Functional.)
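Here is the report analogy as a hedged TypeScript sketch; the names Report and superviseReport are invented for illustration:

```typescript
// The day's report, recorded as a tagged union with two variants.
type Report =
  | { tag: "red-flag"; details: string } // mitigation needed
  | { tag: "no-red-flag" };              // nothing to be done

function superviseReport(report: Report): string {
  switch (report.tag) {
    case "red-flag":
      // Flow-sensitive typing: in this branch the compiler knows
      // `report.details` exists, because the variant has been narrowed.
      return `mitigate: ${report.details}`;
    case "no-red-flag":
      return "nothing to be done";
    default: {
      // Exhaustiveness check: adding a third variant to Report without
      // handling it above makes this assignment fail to compile.
      const impossible: never = report;
      return impossible;
    }
  }
}
```

If a third variant were ever added to Report, the compiler would point at the default branch until someone handles it, which is exactly the completeness guarantee described above.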

In project management and product design, fields of software that are more human, completeness is crucial in managing deliveries. There is a practice called incremental delivery, in which features are delivered piece by piece instead of all in one go. The classic reason is physics: the number of features a software development team can produce within a certain period is finite. Whether a delivery design is complete or not makes the difference between scope-cutting and corner-cutting.

Say a nationwide car dealer and repair business is rolling out an app for customers to schedule a car service. Imagine if, in the first delivery, the app only lets the customer book. It does not let the customer see their booked schedule, nor does it remind them of their booking. There will be operational problems, such as forgotten appointments or slots that are accidentally “unavailable at that date” because the same customer double-booked. Problems in the software product can hit non-software operational costs and profits, all because of a wrong cut.

With this, I conclude that neglecting to keep things finite, correct, and complete is a recipe for a self-made negative black swan in a software-related business.

Also, think about this! Taleb believes that Via Negativa, defining a thing or reaching a goal by eliminating what it isn’t and what it should not be, tends toward antifragility. On the contrary, Via Positiva, defining a thing or reaching a goal by adding what it is and what it should be, tends to fragilize. Defined finiteness is easy to reach in software, and it closes the huge gap between the Via Positiva and Via Negativa approaches: being finite means being indexable by humans, so every scenario, every item, is easier to keep within reach of one’s cognitive context.

The antifragile - layering, isolation, experimentation

The closest parallel of Taleb’s path to antifragility in the software space is layering, isolation, and experimentation (a form of extracting goods out of the random). These apply at the higher levels of software: the organization of code, the organization of objects, and the organization of software developers. I can think of three examples that have something in common: abstraction in code, runtime sandboxing, and startup culture.

Abstraction in code separates concerns (layering) between callers and callees, isolates concerns from one another (isolation), and treats each other as random by separating implementation (experimentation) from specification (interface between layers).

Runtime sandboxing separates the business of the sandboxed process from the business of the host (layering), essentially sandboxing errors (experimentation/randomness) in a box (isolation) while providing total protection to the host.

A startup nowadays is born out of good-enough ideas (experimentation) and capital (interface between layers). Most startups are given capital by investors. The investors invest in the end goal and the key performance indicator but omit most of the execution details (layering). The investors also have methods of securing their total capital regardless of the investment’s status (isolation).

The naturally fragile (complex code, complex runtime systems, the relations between startups, investors, and society) are tamed by the naturally antifragile (layering and isolating experimentations). Just as with life, ecosystems, elephants, organs, and cells, the software space gains from layered systems: a supersystem can filter the impact from its subsystems by letting in success (positive black swans) while isolating and dampening failures (negative black swans).

Redundancy

Redundancy is another path to antifragility. For every computer operation that has a side effect, such as a network call, a peripheral buffer read, or a storage drive read, there is a case of the classic time-of-check to time-of-use (TOCTOU) problem.

TOCTOU’s analogy goes as follows. You’re eating from a can of Pringles, chewing chips. You notice there is only one chip left in the can. Your mom calls from the kitchen, so you look over toward her to figure out what she wants. At the same time, your hand reaches for the last chip. But contrary to the fact you believed a few moments earlier, your hand cannot find a chip. The last chip you saw a couple of milliseconds ago is gone. Your sibling stole it.

That’s TOCTOU: the fact at the time of check (the look) differs from the fact at the time of use (the grab). Unlike a Pringles-eating human, software won’t cope unless you tell it to. So the redundancy here is a mechanism that, despite what the software sees, anticipates the scenario where the fact at time of use is not as ideal as it was at time of check.
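A minimal sketch of that mechanism in TypeScript on Node.js, with a hypothetical readConfig helper: rather than checking first and acting later, the act itself handles the non-ideal fact at time of use:

```typescript
import { readFile } from "node:fs/promises";

// No "does the file exist?" check up front: the file could vanish between
// the check and the read anyway. The read itself is the check.
async function readConfig(path: string): Promise<string | undefined> {
  try {
    return await readFile(path, "utf8");
  } catch {
    // The "last chip was stolen" branch: the fact at time of use
    // differs from whatever we believed at time of check.
    return undefined;
  }
}
```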

There are other redundant mechanisms that prevent unwanted errors (robust) or even speed things up (antifragile): retries, unit tests, database indexes, data backups, RAID, all kinds of locks, all kinds of server scaling strategies, eventually-consistent earth-wide distributed databases, consensus algorithms like Raft, decentralized solutions such as Bitcoin and IPFS, and many more. I won’t get into the details of each of those.
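To pick just one from that list, here is what a retry with exponential backoff might look like; the name withRetry and the policy numbers are arbitrary illustrations:

```typescript
// Retry an unreliable side-effecting operation, waiting longer each time.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation(); // the first success wins
    } catch (error) {
      lastError = error;
      // Redundancy in time: back off 100 ms, 200 ms, 400 ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError; // all the redundancy was consumed
}
```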

Chaos Engineering

One famous thing about Netflix’s engineering is its use of chaos engineering. Not only does Netflix have a backup system that reliably handles failure, it also uses a tool called Chaos Monkey. Chaos Monkey actively kills server-side machines at random. It is counter-intuitive that killing our own machines helps with reliability. Examining it a little closer suggests it is probably inspired by the overcompensating trait of regeneration in living organisms: the backup system is designed to overcompensate for the random failures.
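A toy simulation of the idea (not Netflix’s actual tooling; every name here is invented):

```typescript
type Replica = { id: number; alive: boolean };

const replicas: Replica[] = [1, 2, 3].map((id) => ({ id, alive: true }));

// The "monkey": kill one replica at random, on purpose.
function chaosMonkey(): void {
  const victim = replicas[Math.floor(Math.random() * replicas.length)];
  victim.alive = false;
  console.log(`chaos: killed replica ${victim.id}`);
}

// The redundancy: detect dead replicas and overcompensate by restarting them.
// A system that survives this loop daily also survives real failures.
function supervise(): void {
  for (const replica of replicas) {
    if (!replica.alive) {
      replica.alive = true;
      console.log(`supervisor: restarted replica ${replica.id}`);
    }
  }
}

chaosMonkey();
supervise();
```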

That is automated redundancy at work.

Trial and Error - Not A Direct Parallel

Taleb also sees trial and error as a better method than teleological approaches. We see that trial and error works at the higher levels, where randomness happens, as in the startup and investment culture discussed earlier. But at the lower levels, the lower you go, the less randomness there is, and the rarer it is to gain from trial and error.

Put coarsely, you cannot just try random code until a certain scenario works; you have to form an intelligible sentence for the machine to understand. Likewise, when presented with an array of tools, libraries, or APIs, you can’t just try them all and hope one works the way you want. You must read what those tools, libraries, or APIs are meant to do. Trial and error at the very low level is just like brute-forcing a password: it yields zero to negative gain.

High Cohesion, Low Coupling

In designing a library, an API, or any other usable component, there is a principle called high cohesion, low coupling. Let A be the usable and B be the user, so B uses A or, in other words, B depends on A. In this context, cohesion means A allows B to change in a way that adds value to the whole. Coupling means that when A changes, B must change too. High cohesion means A lets B change freely, while low coupling means that when A changes, B’s need to change is minimal to zero. Here’s a link to my favorite read about this.
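A hedged illustration with invented names: B couples only to a narrow specification of A, so A’s implementation can change while B’s need to change stays near zero:

```typescript
// The specification: the only thing B is coupled to.
interface Clock {
  now(): Date;
}

// A's implementation can change freely (system clock, fake clock for tests,
// a network time source) without B changing at all.
class SystemClock implements Clock {
  now(): Date {
    return new Date();
  }
}

// B: coupled to the interface, not to the implementation.
function greet(clock: Clock): string {
  return clock.now().getHours() < 12 ? "Good morning" : "Good day";
}

console.log(greet(new SystemClock()));
```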

Having observability into the dependency tree of the software you maintain is crucial when you have to make a change. Pick a node that needs changing and you can trace where the side effects will spread. If possible, pick a tool that can automatically validate your changes, or that acts as a redundant backup system should the change cause failures.

The fragilistas of the software space are those who do not care about the side effects their changes cause across the dependency tree; those who think changes in a lower layer have the same tail as changes in the surface layer. Fragilista, by the way, is Taleb’s term for someone who does not know about fragility and acts in ways that cause fragility, sometimes at the cost of others.

Mistaking a Phenomenon for Theory

Speaking of fragilistas and dependency trees, there is a popular example: cargo cult programming.

Cargo cult programming is the ritualistic inclusion of code or methods that serve no real purpose. Code and methods are included blindly, without knowing the real purpose behind them. At best, it bloats the software; at worst, it causes errors that are difficult to debug. Both cost everyone time and money. It produces unnecessary lines of code, unnecessary processes, unnecessary dependencies, and unnecessary instructions run by your CPU.

For example, TDD (test-driven development) is an approach where software requirements are converted into tests before the software is fully developed. When TDD is “successful in project A”, it does not follow that TDD will be “successful in project B”. TDD might be very helpful in making sure a driver for a custom-made hardware controller works correctly, but not very useful when building a bunch of stateless UI components. The inability to differentiate “TDD is effective for A” from “TDD is effective!” is dangerous.

Another example: approaches for building a SaaS such as Netflix can’t be blindly applied to a different kind of project, say an IaaS like AWS. The interface is different, the scale of users is different, the pacing of the team is different, the priority of a good graphical user interface over low-level functionality is different. Blindly applying the working principles of a successful but different project is dangerous.

My strong opinion on this: every single discipline, guideline, or tenet that is not a derivation of pure math or physics must be regarded as a phenomenon rather than a trusted theory. One plus one is always two, but defaulting to a relational database will not always work for you.

Symptoms, not Root Cause - The Consequence of Playing Tall

Building a tall system incurs a price: the fragilization of knowledge organization. A broken piece of the system causes a chain reaction, amplified by the inevitable dependency stack. Isolating the damage only makes the difference between a crash and an error message, something of trivial value.

Getting rid of what you don’t want (Via Negativa) is a viable way to build a tall system. It minimizes bad content going in, but the execution will most likely be imperfect. Once a tall system is established, the bad content embedded in it is a ticking time bomb. When that embedded bad content becomes a problem, it may be stuck so deep that it is hard to notice, let alone diagnose.

Abstraction and indirection, the things that make a tall system bearable to operate and maintain, can be the same factors that fragilize the debuggability of the system. The more abstract and indirect a concept or a piece of content, the harder it is to dissect; what you see is no longer what you have. This is a tug-of-war between operability and knowledge organization: efforts to antifragilize operability and maintainability fragilize the knowledge organization in the process.

Fixing a deep problem requires more than Via Negativa, because what people tend to do is patch the symptoms while forgetting that the root cause might lie somewhere completely different. Fixing the symptom adds another abstraction or indirection; while it adds to the overall integrity, it also adds more components to understand.

My word of advice: Trace the symptoms back to the root cause and take note of its existence and properties. If it is important enough to be fixed, fix it! Otherwise, when fixing is just too expensive, acknowledge that it exists so that things are not added on top of it. If the whole chunk is not valuable enough to keep, then getting rid of it might be a good idea.

To make a future fix of the system cheaper, one can invest in documenting the reason for each abstraction layer. Provided the writer and the reader of the documentation are equally literate, tracing back to the root cause becomes significantly cheaper. Comment on the why!

Of course, simplification and the avoidance of overengineering help with tracing back and debugging too.

Less is more, but not enough is not enough

“Less is more” is a mantra frequently used to argue for building only what is necessary, removing the complexity that can amplify black swans. But there is a limit to simplification on any one front of software development; pushed beyond that limit, it blows up another front.

What do I mean by front? Software development involves different aspects, like the actual software, the knowledge organization, the sales, the project management, the user interface, the different factions of stakeholders, etc. Each stake, each aspect, is what I call a front.

I’ve seen people fighting over simplification in many directions. When the ask is for things to be “simple”, what does “simple” really mean?

If a programming language is simple in itself, the user must do a lot to achieve what they want. If a programming language is meant to be “simple to write”, the language must compensate with a lot of machinery underneath.

The same goes for a user-facing application. If simple means as few steps as possible for users to do their task, the application must abstract a lot of things to compensate. If simple means a thin layer of abstraction, then the user has a lot to do to finish their task.

Pushing simplification too far might hurt many other fronts severely.

A file explorer on an OS is designed to interact with files stored both in local storage and somewhere on the network. But on the UI front, there is a push to minimize the user’s attention to what’s behind the scenes, down to eliminating any indicator that differentiates local files from network files.

The consequence? Users complain that some files are just slow to open, some files are missing from the explorer half of the time, and some files can’t be opened intermittently. Those are errors from network failures, but the users might not realize there is any interaction with the network whatsoever. It is a leaky abstraction, and I strongly opine that a confusing leaky abstraction is a useless abstraction.

On Simplification For the Sake of Simplification

This is what I observe: simplification for the sake of simplification leads to oversimplification. While this problem is most apparent in the layer closest to humans, the UI, it is not exclusively a UI problem.

Any strong, authoritative agent of change who understands only one front out of all of them in a software development team will likely cause oversimplification. If you are such an agent trapped in that pitfall, you will likely not realize you’re doing this, because you don’t know that you don’t know. That’s why you need to be as wise as you can.

The solution is easy to state but not easy to implement: get to know the dependency tree, the fragility. A simple UI is not always cheap to build; you have to know when your design needs an abstraction. Abstraction is expensive.

This can also happen at the lower levels. If you are an agent at a lower level insisting on building an abstraction, either for the sake of your user (less work for the user) or yourself (less work for yourself), pay attention to how much work will be needed in the future and which options you lock yourself out of.

That latter consideration, the options you lock yourself out of, is about avoiding the abstraction inversion problem.

My ideal user experience design to remove confusion from a leaky abstraction is to simply:

  • Tell the user what to do depending on the situation

  • Explain why some steps are necessary, and

  • Describe what the software has done to ease the experience

And those need to be done in that order. The how (the first point) must be clear and noticeable. The why (the second and third points) is optional, meaning the user can opt to “learn more” at their convenience. A small but contrasting tooltip, for example, is not intrusive, yet enough to make the user aware that the explanations are there. The tooltip can then expand into a new window or web page that explains the entirety of the problem.

AWS CloudHSM is one example that does abstraction right. CloudHSM is hard, yes, but it is not easy on AWS’ side either. AWS cannot abstract CloudHSM away for non-power users, due to the inherent nature of the problem it solves, where a slight “ease-of-use” beyond the limit leads to a breach of the user’s security. Despite that, the documentation provides plenty of information for users to understand what it takes to make fair use of hardware-level security in the cloud, including the steps to verify, with an open-source tool, that AWS is doing the abstraction properly.

A final word on simplicity: a complex problem cannot be solved with a simple solution. You can only build the solution as a ramp that eases the learning curve, so that people can reach the top more easily, if not faster.

Optionality

The book Antifragile also names optionality as a major factor in antifragility. There are a huge number of parallels to optionality in the software space:

  • In company strategy, optionality is a flexible company direction.

  • In marketing, optionality is having a broad potential market and customer base.

  • In budgeting, optionality is having an emergency fund.

  • In multi-product management, optionality is the balance between integration and independence between products.

  • In product design, optionality is broad usage possibilities despite a limited feature set.

  • In time management, optionality is the unannounced buffer time that provides a safety net for customers and developers alike.

  • In vendor management, optionality is avoiding vendor lock-in.

  • In protocols, optionality is extensibility.

  • In language mastery, optionality is polyglotism.

  • In system design and dependency management, optionality is high cohesion and low coupling.

There are a lot more examples, but notice how these items from varying fronts resemble each other. I won’t explain here the technicalities of how optionality yields antifragility. But one thing is certain: do not trade optionality away for more trivial matters.

Literacy: The Ultimate Optionality

I have a favorite definition of software development: it is an art of spell-casting through electrical media that builds on top of itself. Software development is about literature as much as it is about tinkering. Language grows more important as the state of the software space grows more complex. Every crazy thing you find in software discourse is the result of either bad acts or illiteracy. Literacy is one of the most important aspects of software in general.

In communicating states and instructions, it is very important to differentiate “is” from “maybe”, or “most” from “all”, for example. While this is easy for computers, it is not as easy for the human mind. But I dare say, even though I did not major in psychology, that the human mind can be trained in language, specifically in logic and probability. I would even argue that literacy is an important foundation of civilization, given that literacy is a requirement for properly written laws, and that lawyers thrive with literacy as their arsenal.

This point goes back to one of the subsection titles in Antifragile, “I love randomness”. Exploration of, and exposure to, random software-related literature, or any kind of literature, is significant for the growth of one’s literacy. The combination of reading, writing, and exchanging feedback on any piece of literature is more effective than reading alone, however intensive. Of course, one should not be completely random: there should be a small amount of filtering of what goes into your catalog of knowledge, and a huge amount of unbiased criticism, both of the literature and of yourself, should you need to unlearn something you previously knew that turns out to be wrong.

Software companies and leaders should encourage people to improve their literacy, because being literate ultimately leads to recognizing more options and more hidden fragility. And so I detest actors who profit from keeping people, including themselves, illiterate, because it will eventually drag us all down.

Why is this important?

Peering through the lens of antifragility gives us insight into aspects we didn’t see before. The dependency tree and its relation to change, for example, is a good parallel to Taleb’s concept of concavity and the exponential black swan, which we may have disregarded while making changes to software. Although imperfect, it gives us an analogy for why things like redundancy work. We know how those mechanisms work, but antifragility can explain why.

What I wrote is not all of Taleb’s concept of antifragility, nor is it all of antifragility’s parallels in the software space. There is still much to explore. Going to the extreme, for example, there is a class of software that is antifragile in nature: software based on artificial neural networks. Artificial neural networks are modeled on how the brain works. They work by restructuring themselves using a huge number of input-output pairs, just like our brains. And they are hurt by too much “stress”, which in this case is overtraining.

Today we see software (and everything else) from the perspective of antifragility. What I hope is that tomorrow we will see software from yet another perspective. This is just one small contribution among many from many other people, which I hope will help with the advancement of software in general.