SCC. Fault Tolerance: Just Make It Two
--
Last time we talked about how automating container management with Kubernetes is, in some ways, similar to how Web3 works. You have independent, self-sufficient (up to a certain point) nodes (containers) and a protocol (Kubernetes) that regulates the way they communicate, keeps their common databases and states updated, and makes sure crucial applications stay operational even if some of the nodes go offline.
Just one more step is required to turn this picture into a global vision of tomorrow’s web. Containers operate in a common virtual environment and still require some tweaking to communicate with each other, share data, and perform routine tasks such as installing updates, running diagnostics, relaunching after a failure, or rebooting on a schedule. Web3 decentralized networks, by contrast, are designed to be far more interoperable and resilient.
Let’s focus on the latter today. A system’s ability to keep performing even if some of its parts are not working correctly is called fault tolerance. The concept can be applied to any system (hardware, software, production, or business), but here we’ll mainly talk about keeping critical applications working and ensuring continuity.
A good example that makes the importance of continuity apparent is any online editing tool (for example, Google Docs). How often do you have to press Ctrl+S to save the document? Almost never! The app does it for you. Moreover, you can close the tab, open the document on another device, and continue editing. What’s more, you can do it together with a coworker.
The current state of collaborative tools is an excellent illustration of how the user experience should feel in terms of continuity. It takes a handful of technologies to make that happen. Without them, you’d constantly lose data (whenever the connection breaks or the browser or OS hangs), have to save documents manually, and live through the nightmare of merging edits from multiple authors. An important thing to note here is that fault tolerance means, first and foremost, an uninterrupted experience for the end user. Even if something goes wrong (which happens all the time), it should not affect the user in any way.
If a system has a single point of failure, the go-to solution is to increase redundancy. As in the container example, there can be two or more instances of the same app or server, so if one goes down, another takes over its workload without the users noticing. That, of course, means spending more resources (roughly twice the CPU and storage capacity to run a replica of the app and its database). Complex applications have more than one mission-critical service that has to be replicated to increase the overall system’s fault tolerance, which puts more pressure on developers, who have to spend resources (burn money) as “insurance”.
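The "just make it two" idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular orchestrator does it: the function names (`call_with_failover`, `broken_replica`, `healthy_replica`) are hypothetical, and a replica is modeled as a plain function that raises `ConnectionError` when it is offline.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is down; remember the error and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


# Two hypothetical instances of the same service: one offline, one healthy.
def broken_replica(request: str) -> str:
    raise ConnectionError("replica A is offline")


def healthy_replica(request: str) -> str:
    return f"handled {request!r} on replica B"


result = call_with_failover([broken_replica, healthy_replica], "save document")
print(result)  # handled 'save document' on replica B
```

The caller never sees the failure of the first replica, which is the whole point: the cost is that both replicas have to exist and stay in sync, even though only one of them does useful work at any moment.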
Web3 applications approach this in different ways, from hardcore redundancy (essentially removing the single point of failure completely) to more flexible solutions. Bitcoin and Ethereum are classic examples of the former: each full node keeps its own copy of the whole chain of transactions, so nothing is lost if any of the nodes goes offline or turns malicious. Modern protocols aim to be less resource-demanding by lowering the level of redundancy and compensating with sophisticated load balancing, network design, and incentives.
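To get a feel for the trade-off between redundancy and cost, here is a rough back-of-the-envelope sketch. It assumes that replicas fail independently and with the same probability, which is an idealization (real outages are often correlated), and the 5% failure probability is an illustrative number, not a measurement. The function names are hypothetical.

```python
def availability(node_failure_probability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    return 1.0 - node_failure_probability ** replicas


def replicas_needed(node_failure_probability: float, target: float) -> int:
    """Smallest number of replicas whose combined availability reaches `target`."""
    replicas = 1
    # The chance that all replicas are down at once must not exceed 1 - target.
    while node_failure_probability ** replicas > 1.0 - target:
        replicas += 1
    return replicas


# Illustrative assumption: each node is unavailable 5% of the time.
for n in (1, 2, 3, 5):
    print(f"{n} replica(s): availability {availability(0.05, n):.6f}")

print(replicas_needed(0.05, 0.999))  # 3
```

Under these assumptions, each extra replica adds less availability than the previous one while costing just as much, which is why keeping a full copy on every node is the expensive end of the spectrum.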
Redundancy is not the only trick to make a system fault tolerant. We’ll discuss some specific solutions implemented by Super Protocol later in the series.