By Oluwatobi Ojomu
Distributed systems form the underlying infrastructure of modern computing, powering everything from global e-commerce sites to critical payment settlements. Building and supporting these systems is incredibly demanding, not because of hardware failures or programming-language misuse, but because of the intrinsic nature of distribution itself.
Distributed systems run across multiple nodes, involve concurrency, and experience partial failures and race conditions that are challenging to detect using conventional testing techniques.
For decades, researchers and practitioners have increasingly turned to formal verification, a branch of computer science grounded in mathematical logic, to prove a system correct before releasing it into a production environment.
This approach moves well beyond traditional unit or integration testing. While a traditional approach only checks that a system behaves correctly for a bounded set of test cases, formal verification provides mathematical proof that the system behaves correctly in all possible scenarios.
As an experienced software engineer with a 5-year history of building scalable backend and distributed systems, I have witnessed first-hand the weaknesses of conventional techniques when it comes to considering edge cases, which can lead to global-scale outages.
A classic example was a distributed scheduler I worked on at a logistics company, which intermittently lost data during cross-region failover. The failure occurred only under an extraordinarily rare network partition, one not easily reproduced in a staging environment.
That incident pushed our team to reassess how we reasoned about system correctness, and we eventually incorporated formal verification tooling into our software development workflow.
Formal verification starts with building a model of the system: an abstract representation of its state transitions and of the invariants the software is expected to uphold. Such invariants can be requirements like “a task is never scheduled on more than one node simultaneously” or “every completed task must have started from a queued status.”
After building such a model, an engineer formalizes those invariants as mathematical assertions. The next step is to use automated theorem provers or model checkers (tools such as TLA+ with its TLC model checker, Coq, or Isabelle/HOL) to explore all possible execution traces of the system and confirm that those assertions hold in every one.
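To make the idea concrete, here is a minimal sketch, in Python rather than a real specification language, of what an explicit-state model checker does under the hood. The model, the stale-view protocol, and all names are invented for illustration: two nodes each claim a single task based on a possibly stale local view, and a breadth-first search over every reachable state checks the invariant “a task is never scheduled on more than one node.”

```python
from collections import deque

# Hypothetical toy model. State = (claimed, views), where `claimed` is the
# set of nodes holding the task and `views` is each node's (possibly stale)
# belief about whether the task is taken.

def transitions(state):
    """Yield every successor state a single node step can produce."""
    claimed, views = state
    for node in (0, 1):
        # A node claims the task if its stale local view says it is free.
        if node not in claimed and not views[node]:
            yield (claimed | frozenset({node}), views)
        # A node refreshes its view from the real claimed set.
        refreshed = list(views)
        refreshed[node] = bool(claimed)
        yield (claimed, tuple(refreshed))

def check(initial, invariant):
    """Breadth-first search over all reachable states, TLC-style."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state  # counterexample found
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # invariant holds in every reachable state

initial = (frozenset(), (False, False))
bad = check(initial, lambda s: len(s[0]) <= 1)
print("counterexample:", bad)
```

Because both nodes start with a view of “free,” the search finds the interleaving where each claims the task before refreshing, exactly the double-scheduling bug the invariant forbids. Real model checkers do this at vastly larger scale, with state-space reduction techniques this sketch omits.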
The value of this approach in a distributed environment lies in its ability to explore thousands, sometimes millions, of non-deterministic situations involving message delay, packet loss, and out-of-order execution. These are precisely the situations that expose the limits of traditional testing, and formal methods are designed to confront them directly.
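That style of exploration can be imitated in a few lines. The hypothetical sketch below (the versioned-update scenario is invented, not any real tool's API) enumerates every drop pattern and delivery order for two messages and shows that a receiver which ignores version numbers can end up holding stale state:

```python
from itertools import permutations

# Hypothetical example: a writer sends versioned updates v1 then v2.
# A naive receiver applies whatever arrives last, ignoring versions.
messages = [(1, "v1"), (2, "v2")]  # (version, payload)

def deliveries(msgs):
    """Every subset of messages (loss) in every order (reordering)."""
    for mask in range(2 ** len(msgs)):
        survivors = [m for i, m in enumerate(msgs) if mask & (1 << i)]
        yield from permutations(survivors)

def naive_apply(schedule):
    state = None
    for _version, payload in schedule:
        state = payload  # last write wins, version ignored
    return state

# Find any schedule where both messages arrive yet the state ends stale.
violations = [s for s in deliveries(messages)
              if len(s) == 2 and naive_apply(s) != "v2"]
print(violations)
```

The single violating schedule is the reordered delivery (v2 before v1), the kind of trace a conventional test suite rarely generates but an exhaustive search finds immediately.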
Integrating formal verification into a production workflow is not a drop-in change; it calls for a wholesale shift in how engineering teams approach problems. At first, we dedicated our efforts to modeling small, discrete units, for example the leader election algorithm in our cluster leadership service.
Using TLA+, we discovered latent bugs in our timeout logic that would have allowed multiple nodes to claim leadership simultaneously, a failure mode that, a few months earlier, had caused service instability and outages in production. The verified design not only behaved as intended but proved resilient, with no need for runtime patching or ad hoc recovery procedures.
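The timeout bug we found was specific to our system, but the general shape is easy to illustrate. In the hypothetical sketch below (the lease length, clock-skew bound, and tick model are all invented), sweeping candidate follower timeouts flags every value that lets a skewed follower start a new election while the old leader's lease is still live:

```python
# Hypothetical lease-based election parameters (illustrative only).
LEASE = 10      # leader holds the lease for 10 ticks
MAX_SKEW = 3    # a follower's clock may run up to 3 ticks fast

def two_leaders_possible(timeout):
    # With maximal skew, a follower may fire its election timer after
    # (timeout - MAX_SKEW) real ticks; if that precedes lease expiry,
    # the old and new leaders overlap.
    return timeout - MAX_SKEW < LEASE

unsafe = [t for t in range(1, 20) if two_leaders_possible(t)]
print(unsafe)  # timeouts that violate the single-leader invariant
```

A model checker performs the same kind of sweep over every parameter and interleaving at once, which is how a timeout constant that “looked generous” can be proven unsafe before it ever reaches production.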
One of the first lessons from this work is that formal verification should not be seen as a replacement for conventional testing but as a complementary, high-assurance technique, especially for critical parts of a system that cannot tolerate failure.
Coordination, consensus, replication, and fault-recovery components benefit most from this approach. These are the components where flaws are not only harder to identify but also harder to fix after deployment.
As time passed, our team moved from verifying individual algorithms to verifying service interactions. For example, we used model checking to ensure consistency of a distributed queue across multiple data centres.
This effort led us to tighten our retry and acknowledgment semantics, which eliminated a race condition that had been producing duplicate work. The assurance conveyed by mathematical verification reduced the operational workload of our Quality Assurance teams and cut post-deployment failures by more than 60% over the following year.
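Tightening retry and acknowledgment semantics typically means making redelivery safe. A minimal sketch of that pattern, with an invented broker model and message ids, shows why deduplication eliminates the duplicate-work race:

```python
# Hypothetical at-least-once queue: if an ack is lost, the broker
# redelivers the message, so the consumer may see the same id twice.

def consume(deliveries, dedupe):
    """Process a delivery stream; return how many units of work ran."""
    done, work_count = set(), 0
    for msg_id in deliveries:        # redeliveries repeat the same id
        if dedupe and msg_id in done:
            continue                 # duplicate: acknowledge, skip the work
        work_count += 1              # perform the (non-idempotent) work
        done.add(msg_id)
    return work_count

# The ack for message 7 is lost, so the broker redelivers it.
deliveries = [7, 8, 7]
print(consume(deliveries, dedupe=False))  # duplicate work performed
print(consume(deliveries, dedupe=True))   # each message processed once
```

The dedupe set stands in for whatever durable idempotency store a real system would use; the point is that the invariant “each message is processed exactly once” is restored even when the network replays deliveries.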
Formal verification poses its own challenges, no doubt. The level of abstraction required to build checkable models can be difficult for engineers accustomed to thinking in concrete implementations.
Learning formal specification languages can also be a formidable undertaking. To lower this barrier, we committed to internal workshops and brought in domain experts to work alongside our engineering group, fostering a culture where correctness is valued alongside velocity.
One success story that highlights the long-term benefits of formal methods is the design of a distributed ledger for tracking high-value shipments. The ledger had to meet exceptionally high consistency and durability requirements, since even an isolated state fault would erode customer confidence.
Using a formal model of the ledger's replication protocol, we were able to offer a mathematical guarantee that every replica agreed on the ledger's entire history, no matter what failure mode occurred. This level of assurance not only met internal stakeholders' requirements but also provided a compelling point of confidence for external security auditors.
The case for incorporating formal verification into production engineering is that it challenges the widespread assumption that bugs are an inevitable by-product of software construction. While it is neither feasible nor necessary to formally verify every line of code, selectively applying these methods to key components can yield dramatic improvements in dependability and maintainability.
Formal verification replaces trial-and-error confidence with mathematically defined confidence. It is a careful approach that demands attention to detail, fastidiousness, and a deep understanding of both the system and its domain. For teams willing to invest, the benefits go well beyond bug reduction; they include the ability to build systems that can truly be trusted.
Reflecting on my experience with formal verification, I believe its role in the industry's future is not just promising but necessary. As our systems grow to ever greater scales, our methods for assuring them must scale too, not by ad hoc means, but through principled engineering grounded in mathematics.