3 February, 2022

Why Reliability and Availability are crucial to any Cloud project

Do you know why it is necessary to pursue high reliability in machine learning projects? We explain the reasons in this post.

Reliability and availability are the big forgotten features. They always take a back seat when discussing what any application, software or infrastructure should be like. However, both are essential: it is useless to offer a good design if our development has availability failures or is unreliable. This post discusses both concepts, their definitions and their importance. We will also offer recommendations for achieving High Availability (which we will talk about later) and optimum reliability.

What is reliability?

Reliability is a characteristic that, in the context of software and applications, refers to the probability that a solution will function faultlessly in a given environment over a given period. Reliable software has no defects or downtime and works correctly in every case. Reliability is therefore the most relevant characteristic of any application: if the software is unreliable, users will leave, and all other features become irrelevant.

Google coined the concept of Site Reliability Engineering (SRE) to address these issues. This model makes it possible to determine which new features to launch, and when, thanks to:

- Service Level Agreement (SLA): an agreement established between provider and client on quantifiable parameters such as responsibilities, uptime or response capacity.
- Service Level Objective (SLO): a specific target framed within an SLA, measured with a particular metric (an SLI).
- Service Level Indicator (SLI): a direct measure of the performance of a service.

Logs and trace entries are our best allies for finding problems and triggering alarms that keep our system up and running. It is therefore recommended to audit monitoring periodically and to delete unused dashboards, alerts, traces and logs. This mitigates clutter and ensures proper monitoring. A short sketch of how an SLI measurement can be checked against an SLO follows.
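To make the relationship between these three terms concrete, here is a minimal sketch in Python. The 99.9% target, the request counts and the function names are hypothetical illustrations, not figures from this post: an availability SLI is computed from request counts and compared against an SLO, yielding the remaining error budget.

```python
# Minimal sketch: checking an availability SLI against an SLO.
# The SLO target and the request counts below are hypothetical.

SLO_TARGET = 0.999  # e.g. "99.9% of requests succeed", as agreed in the SLA

def availability_sli(successful: int, total: int) -> float:
    """SLI: the measured fraction of requests served successfully."""
    return 1.0 if total == 0 else successful / total

def error_budget_remaining(sli: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached).

    The error budget is the allowed failure rate (1 - SLO); it is spent
    by the observed failure rate (1 - SLI).
    """
    return 1.0 - (1.0 - sli) / (1.0 - slo)

if __name__ == "__main__":
    sli = availability_sli(successful=999_500, total=1_000_000)  # 99.95%
    budget = error_budget_remaining(sli)
    print(f"SLI: {sli:.4%}  SLO: {SLO_TARGET:.1%}  budget left: {budget:.0%}")
    if budget < 0:
        print("SLO breached: pause feature launches, prioritise reliability.")
```

In SRE practice it is exactly this error budget that drives the launch decision mentioned above: while budget remains, new features can ship; once it is spent, reliability work takes priority.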
Ensuring High Availability (HA)

The concept of reliability is linked to the concept of availability. Availability refers to the percentage of time that an infrastructure, system or solution remains operational to fulfil its purpose under normal circumstances. Availability problems in a system have a direct impact on its reliability.

Achieving High Availability is not easy, although it may seem otherwise. It is an issue that has to be analysed before designing our solution. For example, if a service needs to keep running even when an entire region is down, it is important to design it using groups of resources spread across different regions, together with an automatic failover that is triggered when a region goes down. In other words, single points of failure should be eliminated, since a single component that cannot be accessed can cause a global outage. How this is solved depends on where we deploy our applications (App Engine, GKE, Cloud Run...), but the general recommendation is that any infrastructure solution should always be designed with this point in mind.

Another aspect to take into account is scaling bottlenecks. If, for example, a workload needs more CPU cores than any single machine offers, vertical scaling alone will never achieve HA; at some point the load has to be distributed horizontally.

[Figure: example of an HA cluster in GKE]

RTO for High Availability

Another essential value for an infrastructure to be considered Highly Available is its "restoration" value. It is necessary to calculate how long a system can be down without it being critical for the business. This period is known as the Recovery Time Objective (RTO). It expresses the time during which an organization can tolerate the downtime of its applications, and the associated drop in service level, without affecting business continuity.

Whenever system downtime occurs, a certain amount of data is lost: specifically, all the information produced between the last backup and the crash. Knowing how much data we are prepared to lose, or to reintroduce into the system, indirectly influences the availability of the system. This value is known as the Recovery Point Objective (RPO) and, unlike the RTO, it relates not to recovery time but to the amount of data to be restored.

Preparing for events

If we anticipate that our system will see traffic peaks during specific periods, it is necessary to prepare our application; investing time in this preparation will avoid a significant loss of visits and revenue. Anticipating what the spike will look like is easier for projects that already have some track record; for new projects, it has to be an estimate. In any case, it is essential to ensure that the system has sufficient computing capacity to handle the peak. In addition to adding a buffer, it is recommended to load test the system with the expected mix of user requests, so that the estimated handling capacity matches the actual one (a sketch of such a test follows this section). Also conduct exercises in which your operations team rehearses mock outages, practising your response procedures and the collaborative incident management procedures between teams.
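Since the post closes by mentioning Locust for load testing, here is a minimal sketch of what such a test could look like. The paths and the 3:1 browse-to-checkout ratio are hypothetical placeholders; the point is to encode the expected mix of user requests rather than hammering a single URL.

```python
# Minimal Locust sketch (pip install locust): simulates the expected
# mix of user requests ahead of an anticipated traffic peak.
# The paths and the 3:1 browse/checkout ratio are hypothetical.
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    # Each simulated user pauses 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task(3)  # browsing assumed to be ~3x more frequent than checkout
    def browse_catalog(self):
        self.client.get("/catalog")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "demo"})

# Run, for example, with:
#   locust -f locustfile.py --host https://staging.example.com \
#          --users 500 --spawn-rate 50
# and compare the sustained throughput against your estimated capacity.
```

Running the test against an environment sized like production, with user counts somewhat above the expected peak, is what validates the buffer mentioned above.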
Anticipating incidents

The last epic to consider in terms of reliability is disaster recovery. When we talk about High Availability, it is essential to have a recovery strategy, because incidents do happen. It is therefore necessary to be prepared, with good procedures in place, so that the recovery time (RTO) stays low. To ensure that the system is well designed, it is required to:

- Determine the recovery points (RP).
- Establish a good disaster test plan, simulating such situations frequently to stress the system.

The centerpiece is a well-designed CI/CD (Continuous Integration/Continuous Delivery) pipeline for building automated development, deployment and release testing strategies, with tools such as Cloud Build, a serverless solution with a high Service Level Agreement (SLA).

Automate to improve Availability

System availability can be approached from different perspectives: traffic, use cases, the platforms where it is deployed... But if there is one thing to focus on, it is this: reduce manual work and automate as much as possible.

Automating is useful, but it is not a miracle. It entails up-front development and configuration costs and, later on, maintenance costs. It can also pose risks to the reliability of a system. That is why some tasks and jobs should simply be eliminated before process automation is undertaken.

There are a few main areas where this can be done with configurable automation provided by Google:

- Identity management: Cloud Identity and Google IAM
- Cluster management: Google Kubernetes Engine
- Relational databases: Cloud SQL
- Data warehouse: BigQuery
- API management: Apigee
- Google Cloud services and tenant provisioning: Cloud Deployment Manager, Terraform, Cloud Foundation Toolkit
- CI/CD pipelines with automatic deployment: Cloud Build
- Canary analysis to validate deployments: Kayenta

There is also AutoML for automatic training of Machine Learning models and, finally, JMeter and Locust for load testing to guarantee performance. To close, a sketch of what automating one of these reliability tasks can look like is shown below.
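As a deliberately simplified illustration of automation applied to availability, here is a minimal Python sketch of the cross-region health check and failover discussed in the High Availability section. The region names, endpoints and timeout are hypothetical placeholders, and in a real deployment this role is normally played by a managed global load balancer rather than a custom script.

```python
# Minimal sketch of an automated cross-region failover check.
# Region names and endpoints are hypothetical; real setups usually
# delegate this job to a managed global load balancer.
import urllib.error
import urllib.request

# Health-check endpoints, one per region, in order of priority.
REGION_ENDPOINTS = [
    ("europe-west1", "https://eu.example.com/healthz"),
    ("us-central1", "https://us.example.com/healthz"),
]

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """A region counts as healthy if its endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def pick_active_region() -> str | None:
    """Return the first healthy region, so that no single region is a
    single point of failure for the whole service."""
    for region, url in REGION_ENDPOINTS:
        if is_healthy(url):
            return region
    return None  # total outage: escalate to incident response

if __name__ == "__main__":
    active = pick_active_region()
    if active is None:
        print("No healthy region found; triggering incident response.")
    else:
        print(f"Routing traffic to {active}.")
        # A real failover would update DNS or load balancer config here.
```

Run periodically (for example from Cloud Scheduler or a cron job), a check like this turns the "automatic failover triggered by a region going down" requirement into routine, repeatable work instead of a manual emergency procedure.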