—A look at how to deliver reliable software.
In Observability: Deliver Reliable Software Faster, Marcelo Boeira asks “How does one ensure code works the way it was designed to?”. Indeed.
He looks to control theory for an answer.
Control theory in control systems engineering deals with the control of continuously operating dynamical systems in engineered processes and machines. The objective is to develop a control model for controlling such systems using a control action in an optimum manner without delay or overshoot and ensuring control stability. — Wikipedia
And focuses on observability:
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. — Wikipedia
The focus of observability is to introduce machinery to monitor system performance during use. This has a distinct advantage over tests, which are confined to the build environment. I can buy this.
The application Marcelo has in mind requires a priori knowledge of user behaviour. I base this on the motivating example for observability. In this example, a “collector” gathers information relevant to the feature “my users need to be able to search at any time”.
The collector gathers data on the observable outcome of having this need fulfilled. That is, it monitors search times and frequency and leads to the discovery that “my users usually search 10 times per second from 8am to 10pm”. The collector drives the construction of alarms and triggers for when behaviour lies outside the expected norms. A recipe is provided for developing the collector and triggers.
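As a concrete illustration, here is a minimal sketch of what such a collector and trigger might look like. Everything in it, including the class name, the expected rate, the tolerance, and the 8am to 10pm window, is my own guess at an implementation, not code from Marcelo's post.

```python
import time
from collections import deque
from datetime import datetime

class SearchCollector:
    """Hypothetical collector: records search events and raises an alarm
    when the observed rate drifts outside the expected norm."""

    def __init__(self, expected_rate=10.0, tolerance=0.5, window_seconds=60):
        self.window_seconds = window_seconds
        self.expected_rate = expected_rate      # e.g. "10 searches per second"
        self.tolerance = tolerance              # allowed relative deviation
        self.events = deque()                   # timestamps of observed searches

    def record_search(self, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.events.append(now)
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0] < now - self.window_seconds:
            self.events.popleft()

    def check_alarm(self):
        """Return an alert message if the current rate lies outside the norm,
        but only during the hours where a norm is known (8am to 10pm)."""
        hour = datetime.now().hour
        if not (8 <= hour < 22):
            return None  # no expectation outside the observed usage window
        rate = len(self.events) / self.window_seconds
        low = self.expected_rate * (1 - self.tolerance)
        high = self.expected_rate * (1 + self.tolerance)
        if rate < low or rate > high:
            return f"search rate {rate:.1f}/s outside expected [{low:.1f}, {high:.1f}]"
        return None
```

A monitoring loop would call record_search on every search request and check_alarm periodically, wiring the returned message into whatever alerting channel is in use.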
But what might an implementation look like? An answer to this question lies in Capturing and enhancing in situ system observability for failure detection. This paper describes Panorama, a system for
detecting complex production failures by enhancing observability (a measure of how well a component’s internal states can be inferred from its external interactions) … when a component becomes unhealthy, the issue is likely observable through its effects on the execution of some, if not all, other components.
The only components considered are processes and threads. Components may be observers and subjects. An observer reports status on a subject.
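In that model, an observation need be little more than a record of who saw what about whom. Here is a rough sketch of the idea with names of my own choosing; Panorama's actual types are not written this way.

```python
from dataclasses import dataclass, field
from enum import Enum
import time

class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

@dataclass
class Observation:
    observer: str    # component that made the observation, e.g. a calling thread
    subject: str     # component being observed, e.g. the process it called into
    status: Status   # how the interaction looked from the observer's side
    timestamp: float = field(default_factory=time.time)
```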
The authors of Capturing and enhancing in situ system observability for failure detection focus on detecting unhealthy systems from the client and caller perspectives. These perspectives are critical in detecting gray failures:
a system is defined to experience gray failure when at least one app makes the observation that the system is unhealthy, but the observer observes that the system is healthy.
The advantage of Panorama is its use of aspect-oriented programming to create perspectives between clients and callers to detect gray failures in components. The focus on gray states, not just clearly failed states, and the focus on detection are critical.
Can you argue that Panorama doesn’t require a priori knowledge? I think so. Panorama uses a bounded-look-back majority algorithm to determine the health of a system. The full description is in the paper, but it uses current status to determine health. It replaces Marcelo’s notion of an expected number of searches during a specific period of time with the question of whether a search succeeded when requested. That’s a better position overall because you don’t have to worry about changes in user behaviour.
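The paper has the actual algorithm; my loose reading of it is a majority vote over each observer's most recent reports on a subject, something like the sketch below, which reuses the Observation and Status types from the earlier sketch. The look-back bound and the tie-breaking here are my guesses, not the paper's.

```python
from collections import defaultdict

def judge_health(observations, look_back=5):
    """Rough sketch of a bounded-look-back majority decision: for each
    observer, keep only its most recent `look_back` reports on the subject,
    then let the per-observer verdicts vote on overall health.
    Assumes the Observation/Status sketch above."""
    by_observer = defaultdict(list)
    for obs in sorted(observations, key=lambda o: o.timestamp, reverse=True):
        if len(by_observer[obs.observer]) < look_back:
            by_observer[obs.observer].append(obs.status)

    verdicts = []
    for statuses in by_observer.values():
        unhealthy = sum(1 for s in statuses if s is Status.UNHEALTHY)
        verdicts.append(Status.UNHEALTHY if unhealthy * 2 > len(statuses)
                        else Status.HEALTHY)

    bad = sum(1 for v in verdicts if v is Status.UNHEALTHY)
    return Status.UNHEALTHY if verdicts and bad * 2 > len(verdicts) else Status.HEALTHY
```

The point of the sketch is that nothing in it encodes expected user behaviour; health is judged only from how recent interactions actually went.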
Am I being too harsh in saying that a priori knowledge of user behaviour is a dangerous criterion for developing triggers and alerts? Maybe. The application Marcelo has in mind may embody a truth regarding how many searches occur during a specified time of day. I’m skeptical that it's a better solution even if that’s true.
The patterns of observability section is worth reading. It has applications outside of Panorama.