High Availability: When Five Nines is Enough
Over the last month, we’ve journeyed through the key components of modern infrastructure design from the ground up: virtualization turns discrete pieces of hardware into a software abstraction, containers slice those resources into semi-isolated environments, and Kubernetes manages multiple containers.
These technologies enabled modern public clouds and lowered the entry barrier: instead of building all the required infrastructure, including expensive hardware, and spending time setting up, maintaining, and managing it, you can simply spin up a service at your preferred cloud provider and focus on your product. Developers can build and ship new services much faster, and can scale them “on the go”.
What did we miss here? The end-user experience! All this sophisticated software, and all the time and effort of brilliant programmers, goes into creating tools that serve as the foundation for building even better things, ultimately to make people’s lives better. Today’s topic, high availability, is a vital part of providing this experience.
From a technical standpoint, high availability is closely related to fault tolerance. While the latter strives to guarantee 100% uptime (meaning the service is always on and accessible), the former aims a bit lower, at “five nines” (99.999% of the time), leaving a slim chance that the service won’t respond to users’ requests, for example, while a backup container with a copy of the app is loading.
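To get a feel for what those nines actually buy you, here is a quick back-of-the-envelope sketch that converts an availability target into an annual downtime budget (the constants and function name are illustrative; real SLAs define their measurement windows much more precisely):

```python
# Rough downtime budgets implied by common availability targets.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year allowed by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {downtime_minutes_per_year(target):.2f} min/year")
```

Five nines works out to roughly five minutes of downtime per year, which is why it is a demanding, but not impossible, target for most services.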
Fault tolerance accounts for critical failures and is mostly required for crucial parts of the infrastructure. Take financial dealings: if a service responsible for transaction processing went down even for a millisecond, the damage could run into billions. Elsewhere, high availability is important, yet putting too much effort into keeping a service alive would be overkill: if some part of the app’s frontend takes a millisecond (or even a second) longer to load, most users won’t notice.
Deciding which services must be fault tolerant and which will do fine with high availability, and then achieving those targets without wasting resources and effort, is one of the key problems for any systems architect.
This task is less trivial than it might seem. For the frontend page mentioned above, a one-millisecond downtime might look acceptable, but what if that page processes user inputs for a financial service? Each case calls for an individual decision, which sometimes requires investigating how the services are connected to and depend on one another.
In the Web2 world, the heavy lifting is partially done by the service provider: public cloud vendors have to honor their SLAs (service level agreements) because nobody wants to pay for an unreliable product. The Web3 services market works differently: independent agents (providers) compete with each other for rewards (for example, mining) based on the incentive mechanics of a particular protocol.
Early incentive designs assumed that the risk of losing the reward (and the stake required to become a service provider in most early protocols, such as Filecoin) would deter agents from providing low-quality service. The basic idea was: punish bad actors and give their stake to the users affected by the outage as compensation.
Unfortunately, this does not cover most of the real damage an app could suffer: loss of valuable data, reputation, and users’ trust. What’s more, it means developers have to ensure the high availability of their services on their own, which takes extra effort and resources and leads to more convoluted architectures.
When it comes to Web3 developer adoption, the quality and reliability of IaaS (Infrastructure as a Service) must match, if not exceed, what developers have become accustomed to in Web2.
Coming up next: how Super Protocol solves this and other bottlenecks to create an unstoppable and secure Web3 native cloud!