Ensuring solution and operations reliability

How do you ensure that a solution is reliable? Part of it occurs in the design: making sure that common changes, like increased traffic, are handled elastically.

However, part of it is also in planning to monitor the service and to notice and respond to unplanned events. Some of those activities require human intelligence. For this reason, operations reliability spans both the technical and procedural domains.

Site Reliability Engineering, or SRE, is Google's approach to DevOps. It is a comprehensive system that involves changing the culture around how maintenance occurs. One central idea is dividing aspects of operations into separate layers for clarity.

Here's a tip: you ought to know something about each of these layers, and most importantly, you should be able to distinguish between them. For example, monitoring is not incident response.


They are related, though: a Stackdriver alert is triggered by monitoring and begins the incident response, which is composed mainly of procedures. Qualities are often where our goals start, but figuring out how to measure them quantitatively enables data-driven operations. It can be difficult to determine exactly what to measure, because what's easily measured is sometimes not a good indicator of customer interest. Speaking of alerts, at Google we have the concept of alerting for the right reason.


Often, alerts are designed to signify some metric passing some limit. The question is whether that metric or trigger is something the customer cares about. We need to alert on some technical measures, but if something is directly causing the customer frustration, that should also be an alert, or perhaps replace a more technical one. Make sure you know the difference between black-box monitoring and white-box monitoring.


Black-box monitoring and white-box monitoring are frequently misunderstood. In the cloud architect context, the difference has to do with the assumptions you can make when designing your monitoring framework.

In black-box monitoring, you're not supposed to know or think about the inner workings of the application; all you can see is the user interface or the API. So the only assumptions you're allowed to make concern those interactions. Black-box monitoring is very good for validating user experience, so you end up monitoring things like the latency between request and response.

In white-box monitoring, the application is assumed to be known to you. Its inner workings are transparent, so you can use that special knowledge when defining the tests. A good example: suppose you knew that under certain conditions a critical resource would get oversubscribed, and you had designed the system for resiliency. You might flood the interface to trigger that state, as if the service were under attack, to see whether the resiliency worked as expected. That's white-box monitoring, where the tests can focus on the inner workings and not just the UI.
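The distinction can be sketched in code. This is a minimal illustration, not a Stackdriver API: `send_request`, the SLO value, and the queue watermark are all hypothetical names chosen for the example.

```python
import time

# Black-box probe: only the external interface is visible, so the check
# measures what a user would experience -- here, round-trip latency.
def blackbox_probe(send_request, slo_latency_s=0.5):
    """Return True if one round trip completes within the latency SLO."""
    start = time.monotonic()
    send_request()  # opaque call; no knowledge of the app's internals
    return (time.monotonic() - start) <= slo_latency_s

# White-box probe: internal state is known, so the check can target a
# specific mechanism -- here, whether a critical queue is approaching the
# oversubscription point the system was designed to tolerate.
def whitebox_probe(queue_depth, capacity, high_watermark=0.8):
    """Return True while the internal queue stays below its watermark."""
    return queue_depth / capacity < high_watermark
```

The black-box probe can only time the interaction; the white-box probe inspects state that an outside observer could never see.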


In practice, of course, you need both kinds. Here's an example: CPU utilization may or may not indicate user satisfaction. Round-trip delay or the frequency of request errors might be a better measure of the user's experience.

- What metrics are you using?

- Can you define metrics that relate directly to user experience and service objectives?

- What are the watermarks or alert levels at which human processes are engaged? How are you setting those values?

- When do they need to be revisited and updated?

- How do you know they're related to important events?

- Know how to use Trace and Debug.
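To make the "metrics that relate directly to user experience" question concrete, here is a small sketch of deriving user-facing indicators (error rate and 95th-percentile latency) from raw request records instead of machine-level stats like CPU. The record field names are illustrative, not from any real API.

```python
# Derive user-facing service indicators from per-request records.
def service_indicators(requests):
    """requests: non-empty list of dicts with 'latency_ms' and 'error' (bool)."""
    latencies = sorted(r["latency_ms"] for r in requests)
    # Index of the 95th-percentile sample (simple nearest-rank style).
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "error_rate": sum(r["error"] for r in requests) / len(requests),
        "p95_latency_ms": latencies[p95_index],
    }
```

Either number crossing a watermark you've chosen is a far better reason to engage humans than a raw CPU figure, because each maps directly to something the customer feels.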


Consider the tools that Stackdriver replaces. It's not just the collection of alternate tools that's the issue, but how you use them together. The individual tools are not integrated or designed to work together, so a lot of manual procedures and massaging of data are required to combine them. With Stackdriver, the integration is by design, so that work disappears. Stackdriver is also multi-cloud, able to manage projects across GCP and AWS.

Another useful idea: people don't plan to fail, they fail to plan. Put another way, the only time you have to prepare for emergencies is before they happen. Once the emergency is occurring, it's too late to prepare.


You can design a great technical solution, but if it doesn't include human processes, then it might not be adaptive and resilient. Easy buttons are tools and processes that automate common actions.


A playbook is a list of what to do when. So here's a general rule: for every alert, you should have a play in the playbook. What are the differences between a dashboard, an alert, and an incident response? A dashboard is a display for monitoring a system.


It's commonly tailored to the application. An alert occurs when a condition is met, such as a metric crossing above a particular value for a given duration. The alert is the notification; it could be just a warning, or it could signal an incident that needs to be handled immediately. An incident response consists of the steps you would take when a problem occurs. This might be written up in a playbook.
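The "above a value for a given duration" condition, and the rule that every alert should have a play, can be sketched as follows. This is illustrative logic only; the class, the alert name, and the playbook entries are invented for the example, not Stackdriver features.

```python
from collections import deque

# Sketch of an alert condition: fire only when every sample in a sliding
# window exceeds the threshold, i.e. the metric has stayed high for the
# whole duration rather than briefly spiking.
class DurationAlert:
    def __init__(self, threshold, duration_samples):
        self.threshold = threshold
        self.window = deque(maxlen=duration_samples)

    def observe(self, value):
        """Feed one metric sample; return True when the alert should fire."""
        self.window.append(value)
        return (len(self.window) == self.window.maxlen
                and all(v > self.threshold for v in self.window))

# "For every alert, a play in the playbook" -- entries are illustrative.
PLAYBOOK = {
    "latency_above_slo": ["check recent deploys", "page on-call if unresolved"],
}
```

A brief spike resets nothing visible to users, so requiring the full window to stay above the threshold avoids paging humans for transient noise.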


Find a lab, such as a Qwiklabs lab, that uses Logging, Trace, and Debug to identify and solve an application problem. This will give you a sense of their value and of how these components work together.


There's a lab like this in the Architecting with GCP infrastructure class. Google's approach focuses on transparency, involving the customer in the solution, and blamelessness. Assigning blame locates the root cause in a person or an organization instead of getting to the real technical or procedural issue so that it can be fixed.


If blame has been assigned, there's a high likelihood that the process has been prematurely suspended without really addressing the problem.


- What are the people supposed to do?

- What decisions or actions are they supposed to make or take?

- Are these documented?


As mentioned, the metrics are not sufficient without meetings to review them, evaluate them, make decisions, and take actions. In cases where timing is critical, you'll want the playbook and easy buttons (supporting automation) to increase the speed and consistency of incident response. Here's another tip: when something goes wrong with a cloud resource, give yourself or your team a limited time to solve it.


For example, if a VM starts behaving incorrectly, see if it's something that's easily fixed. If not, set the VM aside and replace it. Perform your diagnostics and debugging after the instance is replaced.
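The timeboxed-repair rule can be sketched as a small decision loop. This is a hedged illustration of the process, not real tooling: `try_quick_fix` and `replace_instance` stand in for whatever diagnostic and replacement actions (or easy buttons) your team actually has.

```python
import time

# Sketch of the timeboxed-repair rule: give a misbehaving resource a fixed
# budget for quick fixes, then replace it and debug the old one offline.
# `try_quick_fix` and `replace_instance` are hypothetical callables.
def remediate(try_quick_fix, replace_instance, budget_s=300.0, retry_s=30.0):
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if try_quick_fix():
            return "fixed"       # easy fix worked within the time budget
        time.sleep(retry_s)      # wait before the next attempt
    replace_instance()           # budget exhausted: swap in a fresh instance
    return "replaced"            # diagnose the old one after service recovers
```

The point is that the budget is decided in advance, not during the incident: restoring service by replacement takes priority over understanding the failure in place.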

By: Mutasem Elayyoub