Reliability of Cloud Native Applications : Sanket Daru

Cloud native applications are all the rage today. Every CIO wants to migrate their legacy applications to a modern cloud environment. The cost and end-user experience benefits of the cloud weigh in on every key IT decision that a CIO makes. However, the cloud is not meant to run exclusive workloads: multi-tenancy provides the economies of scale that allow cloud operators to offer cost-effective products and services. Over the past few decades, architects and engineers have mastered the art of designing reliable systems meant to run in a traditional setup. The key question today’s CIOs ask, therefore, is how to approach reliability in cloud native applications.

What is Reliability?

The English definition of the word reliability is “the quality of being able to be trusted or believed because of working or behaving well” ^[1]. The emphasis is on being trusted to work or behave well. In the real world, we tend to praise everyday objects that can be trusted to work well. For example, a car that doesn’t break down without warning and needs only routine maintenance is a reliable car. It is then natural for us to expect software applications to be reliable as well.

So how exactly do we build reliable software applications? To answer this question, we first need to answer another: what makes up a typical software application?

A typical software application is made up of the hardware that powers the system and the software that runs on top of the hardware to provide the intended functions of the system. Hardware includes the CPU, RAM, hard drive, motherboard, network interface cards, etc. Software includes the OS, system services, system utilities, end-user applications, etc.

How Do Faults Affect Reliability of Cloud Native Applications?

All hardware has a pre-defined lifetime during which it is supposed to deliver the advertised capabilities while needing only timely maintenance. When a piece of hardware fails (for example, because it has reached its end of life or because one of its sub-components has failed), it needs to be either repaired or replaced. A hardware failure whose effect can be easily mitigated, and which therefore doesn’t lead to the unavailability of the entire system, is known as a fault.

When designing an application, we always aim to make it fault tolerant, so that a fault doesn’t compromise the availability of the entire system. For example, we can set up hard disks in a RAID configuration to overcome the effects of a single hard disk failure.
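The same idea of redundancy can be illustrated at the application level. The following is a minimal sketch, not a RAID implementation: it assumes each redundant copy of the data is reachable through a read function and that a hardware-style fault surfaces as an `OSError`. The names `read_with_failover`, `failed_disk` and `healthy_disk` are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant copy fails to answer."""


def read_with_failover(replicas: Sequence[Callable[[], bytes]]) -> bytes:
    """Return the data from the first replica that responds without error."""
    errors: List[OSError] = []
    for read in replicas:
        try:
            return read()
        except OSError as exc:
            # Assumption: an I/O error stands in for a hardware fault.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def failed_disk() -> bytes:
    # Hypothetical replica that simulates a broken disk.
    raise OSError("disk 0 is unreachable")


def healthy_disk() -> bytes:
    # Hypothetical replica that still serves the data.
    return b"payload"


if __name__ == "__main__":
    # The fault on the first disk is masked by the second copy.
    print(read_with_failover([failed_disk, healthy_disk]))
```

As long as one copy survives, the caller never notices the fault. Only when every copy fails does the fault turn into a failure of the whole read.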

From the early years of designing and building systems until about a couple of decades ago, building in redundancy to counter hardware faults was enough to have reliable systems. Typical corporate IT systems were used by a few hundred users, and high availability was reserved for absolutely mission-critical software systems only. Over time, hardware has become more reliable, and its cost has also decreased.

With the advent of cloud computing, anyone today can own and run an online business without the up-front capital investment needed to set up a data centre. Operators of cloud computing systems share the underlying hardware amongst many different customers (multi-tenancy) to achieve economies of scale. However, as the underlying hardware is now under constant use, it is going to reach its intended end of life sooner, making hardware faults a much more common occurrence in the cloud computing world. Such faults will impact the reliability of cloud native applications.

Therefore, any architect or developer designing & building a system to be run on the cloud must account for hardware faults more meticulously today, even if they know that their own system is not particularly taxing on the underlying hardware.

Effect of Failures on Reliability of Cloud Native Applications

Faults are usually seen as random events, i.e. we know they will happen, but there is no simple means of pinpointing when. Because we anticipate faults, we can build the needed redundancy or fault tolerance into the system being developed. However, there are certain failures that are harder to anticipate and hence can lead to a failure of the entire system. For example, a software bug that gets triggered only on a certain edge condition.

Such failures are routinely attributed to the software that implements the functionality of the system. Failure-causing bugs of this kind are hard to detect, and they lie dormant until they wreak havoc. Unlike faults, such failures cannot be overcome simply by applying fault tolerance measures.

On the reactive front, it usually takes additional engineering effort to code in checks and balances to overcome such errors. On the preventive front, the software release process needs stronger software quality assurance to try to uncover such failure-causing bugs before they make their way to the live system.
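Both fronts can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function `average_latency_ms` is hypothetical, the empty list is taken to be the dormant edge condition, and returning `0.0` for it is an assumed policy rather than a recommendation. The explicit checks are the reactive part; the edge-case assertions at the bottom are the kind of test a release process could run as the preventive part.

```python
from typing import Sequence


def average_latency_ms(samples: Sequence[float]) -> float:
    """Average of latency samples in milliseconds."""
    if not samples:
        # Reactive check: without it, an empty list divides by zero,
        # a bug that stays dormant until the first idle interval.
        return 0.0
    if any(sample < 0 for sample in samples):
        # Reject input that can never be valid instead of averaging it.
        raise ValueError("latency cannot be negative")
    return sum(samples) / len(samples)


def run_edge_case_checks() -> None:
    """Preventive checks that exercise the edge conditions explicitly."""
    assert average_latency_ms([]) == 0.0
    assert average_latency_ms([10.0, 20.0]) == 15.0
    try:
        average_latency_ms([-1.0])
    except ValueError:
        pass
    else:
        raise AssertionError("negative latency was accepted")


if __name__ == "__main__":
    run_edge_case_checks()
```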

Cloud native, or made-for-cloud, software faces even more challenges on this front. Cloud operations themselves rely extensively on software, which in turn might have hidden bugs that cause a failure. Such a failure may compromise the reliability of cloud native applications.

The cloud environment may provide much more bang for the buck; however, it is also much more prone to failures owing to the automation software and specialized tools that run the cloud operations. For example, an automated activity to scale the capacity of one of the AWS services hosted in the main AWS network triggered unexpected behavior from many clients inside the internal network, leaving a large number of popular services unreachable ^[2].

Therefore, any architect or developer designing & building a system to be run on the cloud must account for failures from their own software as well as those arising from the underlying cloud environment.
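One common way to account for short-lived failures in the underlying cloud environment is to retry a call with exponential backoff and random jitter, so that many clients do not hammer a struggling service in lockstep. This is one possible technique, not the only one, and the sketch below makes several assumptions: `TransientError` is a hypothetical marker for errors that are worth retrying, and the default attempt count and delays are illustrative values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a timeout)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on every attempt, up to max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Random jitter spreads retries from many clients over time.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the behaviour can be tested without real waiting. Errors that are not transient, such as a bug in the caller’s own code, pass straight through and are not retried.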

How to Ensure Reliability on the Cloud with Humans in the Mix?

Even with all the software-backed automation, a lot of system touch points still need human intervention to continue to function reliably. And humans, even with the best of intentions, make mistakes at times. For example, in 2017 a typo in a command entered by an Amazon S3 engineer took more servers offline than intended and caused a widespread outage of services across many different AWS customers ^[3].

Therefore, it makes sense to design and develop administrative tools that guide operators on what is expected of them and warn them when they are about to make a mistake. It also makes sense to provide a safe place for operators to explore all parts of the system, so that operations can be tried and tested in a sandbox before being applied to a live instance.
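A guardrail of this kind can be sketched as follows. This is an illustration under assumptions, not a description of any vendor’s tooling: the function `remove_servers`, the `max_fraction` limit and the dry-run default are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


class UnsafeOperation(Exception):
    """Raised when a request looks like an operator mistake."""


@dataclass
class RemovalPlan:
    to_remove: List[str]
    executed: bool


def remove_servers(
    fleet: List[str],
    requested: List[str],
    max_fraction: float = 0.25,
    dry_run: bool = True,
) -> RemovalPlan:
    """Remove servers from `fleet`, refusing requests that look unsafe."""
    targets = list(dict.fromkeys(requested))  # drop duplicates, keep order
    unknown = [name for name in targets if name not in fleet]
    if unknown:
        raise UnsafeOperation(f"unknown servers (typo?): {unknown}")
    if len(targets) > max_fraction * len(fleet):
        raise UnsafeOperation(
            f"refusing to remove {len(targets)} of {len(fleet)} servers"
        )
    if dry_run:
        # Default: only report what would happen, change nothing.
        return RemovalPlan(to_remove=targets, executed=False)
    for name in targets:
        fleet.remove(name)
    return RemovalPlan(to_remove=targets, executed=True)


if __name__ == "__main__":
    fleet = [f"srv-{i}" for i in range(20)]
    print(remove_servers(fleet, ["srv-1", "srv-2"]))  # dry run only
```

Making the dry run the default means the operator has to opt in to the destructive path, and the size limit caps how much damage a single mistyped argument can do.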

Monitoring data comes in very handy here, and graphical dashboards that present it can go a long way in helping operators keep the lights on for cloud native applications.

In on-cloud deployment models, it is imperative to have dedicated sandboxes where developers, QA engineers and admins can try out different services, test the behaviour of developed software under different conditions, run automated test suites to find hidden failure-causing bugs, and try out configuration changes safely. Most cloud vendors provide data-rich dashboards that help with managing system availability, thereby improving the reliability of cloud native applications.

Therefore, any architect or developer designing & building a system to be run on the cloud must understand what metrics to emit & how, so that the cloud vendor’s dashboard can pick them up and help the operations team do their job diligently.
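As a rough sketch of the “how”, one widely used approach is to write each metric as a single structured JSON line to standard output, where a log or metrics agent can collect it. The record layout below (`metric`, `value`, `unit`, `dimensions`, `timestamp`) is an assumption made for this example; the exact schema a given cloud vendor’s dashboard expects has to be taken from that vendor’s documentation.

```python
import json
import sys
import time
from typing import IO, Dict, Optional


def emit_metric(
    name: str,
    value: float,
    unit: str = "count",
    dimensions: Optional[Dict[str, str]] = None,
    stream: Optional[IO[str]] = None,
    now: Optional[float] = None,
) -> str:
    """Write one metric as a JSON line and return the line that was written."""
    record = {
        # Seconds since the epoch; `now` can be injected for testing.
        "timestamp": now if now is not None else time.time(),
        "metric": name,
        "value": value,
        "unit": unit,
        # Dimensions let a dashboard slice the metric, e.g. per service.
        "dimensions": dimensions or {},
    }
    line = json.dumps(record, sort_keys=True)
    out = stream if stream is not None else sys.stdout
    out.write(line + "\n")
    return line


if __name__ == "__main__":
    emit_metric("checkout.errors", 3, dimensions={"service": "checkout"})
    emit_metric("checkout.latency", 182.5, unit="ms")
```

Keeping one metric per line and using stable field names makes the output easy for both machines and humans to read during an incident.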

Conclusion

Building reliable systems in the age of on-cloud/made for cloud/cloud native has become more challenging. However, if we stick to the basics of good engineering practices of building fault tolerant systems, rigorously testing the system to find hidden bugs and helping humans to do their job more efficiently, building reliable systems may not be that difficult after all!

Bibliography

[1]: https://dictionary.cambridge.org/dictionary/english/reliability

[2]: https://www.pcmag.com/news/heres-why-a-vital-aws-region-went-down-on-dec-7

[3]: https://www.theguardian.com/technology/2017/mar/03/typo-blamed-amazon-web-services-internet-outage