At Banzai Cloud we develop a container management platform that provisions and orchestrates hundreds of Kubernetes clusters across six different cloud providers and hybrid clouds every day. Pipeline makes it easy for users to run applications on Kubernetes, but under the hood it is an insanely complex system, so it's mission critical for us to have real-time information about what's happening inside the platform, and to have access to historical data to investigate incidents our customers may have had.
Instrumentation 🔗︎
The textbook definition of instrumentation refers to the ability to monitor or measure performance, diagnose errors, collect and write trace information. It’s a pretty concise and accurate description, but also a really modest one, because it doesn’t go into details about the importance and practical applications of instrumentation.
Instrumentation — as explained in the definition above — provides certain information about an application’s state. Unlike information such as CPU usage, memory consumption or network latency, which is managed by the OS or can be measured by an external monitoring tool, instrumented applications provide information about themselves. Developers embed instrumentation code (or, as it's commonly put, “instrument the code”) into the application’s code that either actively sends information (eg. distributed traces) to an external service or exposes it (eg. metrics) for an external service to fetch.
While it has an overhead (in terms of both development and application performance), instrumentation can provide an awful lot of data about an application, a lot more than any external measurement tools. Not only can it provide more data, but can also give context to the data which is often even more important. For example, unusually high memory usage in itself does not necessarily mean that the application misbehaves, it might simply mean that people are finally using it. Similarly, low resource consumption can be the result of a misconfigured HTTP router, serving 404 pages to everyone.
Dev and Ops 🔗︎
A very important aspect of instrumentation is that it affects both developers and operators. Operators use the real-time information to operate the application and prevent incidents, while developers use the collected data to understand how the application works in a production environment and investigate past incidents.
Since developers are responsible for instrumenting the code, it’s important that they understand both Dev and Ops use cases.
Instrumenting code 🔗︎
As mentioned above, instrumentation means embedding code into the application. Unfortunately, developers are often motivated by the wrong reasons, which leads to poorly implemented and inefficient instrumentation:
- Developers start adding instrumentation to the entire application as a result of a past incident
- Instrumentation is not part of the development process and gets implemented at the last minute before a feature is rolled out
In an ideal case developers think of instrumentation as part of the development process. For example, the business might require certain metrics from day one to decide if a feature is successful or not. On the other hand, there is absolutely no need to add instrumentation for everything in the first step. Instrumenting code is not always easy, so an iterative process for introducing it is usually more efficient and productive.
Another common “mistake” is hardcoding the chosen logging/metrics/tracing framework into the application. Like everything else (eg. database layer, ORM, etc) these frameworks/libraries are also external dependencies and we need to treat them that way, especially when they are used close to the business logic layer of an application.
One way to make sure that no external dependencies sneak into our business logic is abstracting them away. For example, if there is a complex business calculation that requires logging, add a simple logger interface around the chosen library. That way the business logic remains independent and testable.
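As an illustration, a minimal sketch of such an abstraction might look like the following (the package, interface and type names below are made up for the example, not taken from any particular library):

```go
package billing

// Logger is the only logging contract the business logic knows about;
// any logging library can be adapted to it.
type Logger interface {
	Info(msg string, fields map[string]interface{})
	Error(msg string, fields map[string]interface{})
}

// Calculator contains the complex business calculation and depends on
// the abstraction only, so it stays independent and easy to test.
type Calculator struct {
	logger Logger
}

func (c Calculator) Calculate(amount int) int {
	c.logger.Info("calculating invoice", map[string]interface{}{"amount": amount})

	// ... the actual business calculation goes here ...

	return amount
}
```

In tests the business logic can receive a no-op or fake logger, while in production a thin adapter wires the interface to the chosen library.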
Another common solution is using middleware. Middleware is perfect for measuring the latency of a process or for logging its beginning and end, and it can often be added at the transport (HTTP, gRPC) level, increasing the coverage of instrumented code.
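A minimal sketch of such a middleware at the HTTP level might look like this (the Logger interface is the same hypothetical abstraction as in the previous sketch):

```go
package middleware

import (
	"net/http"
	"time"
)

// Logger is the same minimal logging abstraction as in the previous sketch.
type Logger interface {
	Info(msg string, fields map[string]interface{})
}

// Instrument wraps an http.Handler and logs the beginning, end and latency
// of every request without touching the business logic behind it.
func Instrument(next http.Handler, logger Logger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		logger.Info("request started", map[string]interface{}{"path": r.URL.Path})

		next.ServeHTTP(w, r)

		logger.Info("request finished", map[string]interface{}{
			"path":    r.URL.Path,
			"latency": time.Since(start).String(),
		})
	})
}
```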
Either way, the important thing here is to stay as independent from external dependencies as possible.
TL;DR Show me the code 🔗︎
The practices explained below contain simple and short, sometimes even dumb examples. You can find real examples in our codebase or in Modern Go Application, which is an example project where we test new things. It contains a full working example with Prometheus, Grafana and Jaeger.
Logging 🔗︎
Let’s start with logging. It’s fair to say that logging is the most widespread of all instrumentation techniques in developer circles. Maybe that’s the reason why there are so many logging libraries for Go (we counted 40 at the time of writing).
However, writing good logs is not as simple as we might think. Developers usually learn this at their own expense when they actually want to search logs as part of an investigation. The standard library logger offers a simple interface to developers, which might be enough for simple CLI applications, but not for a robust and distributed service.
Consider the following example:
```go
log.Printf("saving user %s failed: %v", username, err)
// Output: saving user john.doe failed: cannot connect to database
```
Good luck searching the logs for the reason why John Doe couldn’t save his profile. Not to mention it’s impossible to set alerts for errors like this. Obviously, you can write regular expressions for all possible error log messages, but that’s not efficient at all. The solution, offered by many logging libraries, is to add levels and structured context to log events:
```go
log.Error("saving user failed", log.Fields{
	"user":   username,
	"reason": err,
})
// Output in dev: ERROR saving user failed user=john.doe reason=cannot connect to database
// Output in prod: {"level": "error", "msg": "saving user failed", "user": "john.doe", "reason": "cannot connect to database"}
```
This is called structured logging and improves the ability to search logs a lot. In an appropriate log analyzer you can search for a specific user, specific levels, etc. It’s also much easier to set alerts for errors.
Another common practice that helps with investigating past incidents is correlating log events. For example, adding a timestamp (libraries and log collectors usually do that by default) determines the order in which events happened, which helps in finding the root cause of an error or the last successful step. But that’s usually not enough: we also want to filter events that are related to each other (for example, events that happened during a single HTTP request). Therefore we can use a correlation ID that we simply pass to the logger as context:
```go
log.Error("saving user failed", log.Fields{
	"cid":    someCorrelationID,
	"user":   username,
	"reason": err,
})
```
This begs the question: how are we supposed to carry this correlation ID everywhere in the application? Go has us covered here as well with its context package and so-called context propagation. The idea is to pass a context variable along each function call, which can carry arbitrary information across API layers.
```go
// The correlation ID is attached to the context at the edge of the application
// (eg. in an HTTP middleware) and travels down the call chain.
ctx := contextWithCorrelationID(context.Background())
foo(ctx)

func foo(ctx context.Context) {
	log.Error("saving user failed", log.Fields{
		"cid":    getCorrelationID(ctx),
		"user":   username,
		"reason": err,
	})
}
```
That way every log event that happened during a single HTTP request should have the same correlation ID in its context. It’s pretty useful when the investigation starts from a single error event. With a correlation ID, you can get the rest of the logs.
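The contextWithCorrelationID and getCorrelationID helpers above are not part of the standard library; here is a minimal, library-agnostic sketch of how they could be implemented on top of the context package (the key type and the uuid dependency are arbitrary choices for the example):

```go
package correlation

import (
	"context"

	"github.com/google/uuid" // any unique ID generator works here
)

// contextKey is an unexported type so other packages cannot collide with it.
type contextKey struct{}

// contextWithCorrelationID attaches a new correlation ID to the context,
// typically at the edge of the application (eg. in an HTTP middleware).
func contextWithCorrelationID(ctx context.Context) context.Context {
	return context.WithValue(ctx, contextKey{}, uuid.New().String())
}

// getCorrelationID reads the correlation ID back from the context;
// it returns an empty string if none was set.
func getCorrelationID(ctx context.Context) string {
	cid, _ := ctx.Value(contextKey{}).(string)

	return cid
}
```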
The above examples are library agnostic on purpose. There are plenty of logging libraries out there, use the one which you like. We prefer using Logur with Logrus.
See Modern Go Application for more sophisticated examples and better correlation ID handling.
Error handling 🔗︎
Error handling deserves its own chapter. Not only because diagnosing errors is one of the primary goals of instrumentation, but also because error handling in Go is different.
If you come from other languages (like Java, C#, PHP, etc) there is usually a global exception handler that catches unhandled exceptions and either logs them or sends them to an exception tracking service. Although there is a similar concept in Go (called panics), it’s only supposed to be used for errors that are unrecoverable for the application. Most errors, however, are recoverable from the application standpoint. For example, a database connection error can result in an internal error response for the user, but should not terminate the application.
In Go, error handling is rather a convention than an actual language feature. Errors are values and are usually returned as the last value from a function:
```go
func foo() (string, error) {
	return "", errors.New("something went wrong")
}
```

The returned error can then be handled like this:

```go
bar, err := foo()
if err != nil {
	// handle the error
}
```
The convention says that an error should either be handled (whatever that means) or returned to the caller. At the end of the call chain, however, the “caller” is the user who initiated the request. Business errors (eg. validation errors) should obviously be returned, but internal errors (like a database connection error) shouldn’t be leaked to the user, but rather forwarded to the operators to act on and possibly to the developers for investigation.
A common practice is to just log these errors:
```go
logrus.Error(err)
```

But that loses a lot of contextual information about the error. That's why we decided to introduce a new, simple interface that we use everywhere for unhandled/internal errors:

```go
type ErrorHandler interface {
	Handle(err error)
}
```
Then we can handle errors at the edges of the application like this:
```go
if !isUserError(err) {
	errorHandler.Handle(err)

	err = errors.New("something went wrong")
}

return err
```
The error handler implementation can do whatever you want. It can still log the error or send it to an exception tracking service.
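For illustration, a trivial implementation that forwards errors to a structured logger (reusing the hypothetical Logger abstraction sketched earlier) can be this small:

```go
// Logger is the minimal structured logging abstraction sketched earlier.
type Logger interface {
	Error(msg string, fields map[string]interface{})
}

// logErrorHandler is a simple ErrorHandler implementation that logs
// unhandled errors; a real one might send them to an error tracking service.
type logErrorHandler struct {
	logger Logger
}

func (h logErrorHandler) Handle(err error) {
	h.logger.Error("unhandled error", map[string]interface{}{"reason": err.Error()})
}
```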
Under the hood, we use a library called Emperror. Emperror provides various error handler implementations and helpers. You can easily send errors to any of the well-known error tracking services (Stackdriver, Airbrake, Bugsnag, Sentry, Rollbar to name a few) with it, or simply log them (using Logrus or Logur for example).
Using it is quite simple: all you need to do is choose and configure a handler:
```go
package main

import (
	"emperror.dev/emperror"
	logrushandler "emperror.dev/handler/logrus"
	"github.com/sirupsen/logrus"
)

func main() {
	// Create a new error handler
	errorHandler := logrushandler.New(logrus.New())

	// Register the error handler as a panic recover function
	defer emperror.HandleRecover(errorHandler)

	// Use the error handler for internal errors in eg. HTTP handlers
}
```
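To connect the dots, here is a sketch of how the error handler could be used at the edge of an HTTP API (saveUser and isUserError are hypothetical helpers, and ErrorHandler is the interface introduced above):

```go
func saveUserHandler(errorHandler ErrorHandler) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := saveUser(r.Context(), r.FormValue("username"))
		if err != nil {
			if !isUserError(err) {
				// Internal error: notify the operators, hide the details from the user
				errorHandler.Handle(err)
				http.Error(w, "something went wrong", http.StatusInternalServerError)

				return
			}

			// Business error: safe to return to the user
			http.Error(w, err.Error(), http.StatusBadRequest)

			return
		}

		w.WriteHeader(http.StatusCreated)
	}
}
```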
Metrics 🔗︎
Prometheus has become a de facto standard in the Cloud Native ecosystem, and we use it to monitor our managed platform installations, but customers often want to use their own metric collection solution. Initially we used the Prometheus client libraries for Go and imported additional libraries whenever necessary, but that was really hard to manage, so we decided to look around for a vendor-independent solution. Fortunately, OpenCensus was released to the public around the same time, and after a short evaluation we decided to go with it.
Finding the right metrics really depends on the business. A good rule of thumb is to start with monitoring HTTP requests: it’s easy to add and can be used for lots of things, from HTTP status code checks to alerting on a high number of failed login attempts.
Here is how you can do it with OpenCensus:
```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register an exporter (eg. Prometheus)
	exporter, err := prometheus.NewExporter(prometheus.Options{})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(exporter)

	// Register stat views
	_ = view.Register(
		ochttp.ServerRequestCountView,
		ochttp.ServerRequestBytesView,
		ochttp.ServerResponseBytesView,
		ochttp.ServerLatencyView,
		ochttp.ServerRequestCountByMethod,
		ochttp.ServerResponseCountByStatusCode,
	)

	http.ListenAndServe(":8080", &ochttp.Handler{Handler: myHTTPHandler()})
}
```
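Custom metrics are recorded the same way: define a measure, register a view for it, then record data points wherever the event happens. Here is a rough sketch for the failed login counter mentioned above (the measure and view names are made up for this example):

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// failedLogins is a custom measure; the name and description are made up.
var failedLogins = stats.Int64("myapp/failed_logins", "Number of failed login attempts", stats.UnitDimensionless)

func main() {
	// A view aggregates the measure so that exporters can expose it
	err := view.Register(&view.View{
		Name:        "myapp/failed_logins",
		Description: "Number of failed login attempts",
		Measure:     failedLogins,
		Aggregation: view.Count(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Record a data point wherever the event happens in the code
	stats.Record(context.Background(), failedLogins.M(1))
}
```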
See Modern Go Application for custom metrics and more sophisticated examples.
Distributed tracing 🔗︎
Distributed tracing is getting more and more attention these days. The spread of distributed systems created the need for correlating tracing information across several services.
Similarly to metrics, we use OpenCensus for distributed tracing. Although there is no definitive standard tool (like Prometheus for monitoring and metric collection), there are several solutions (Jaeger, Zipkin, etc) out there.
Just like in the case of metrics, adding tracing at the HTTP layer is a good start:
```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/jaeger"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/trace"
)

func main() {
	// Register an exporter (eg. Jaeger)
	exporter, err := jaeger.NewExporter(jaeger.Options{})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)

	http.ListenAndServe(":8080", &ochttp.Handler{Handler: myHTTPHandler()})
}
```
OpenCensus also provides integration with other components, like Go’s database/sql or Redigo.
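For example, the database/sql integration (the ocsql package) can wrap an existing driver so that queries are recorded as spans. A minimal sketch, assuming a PostgreSQL driver and a made-up connection string:

```go
package main

import (
	"database/sql"
	"log"

	"contrib.go.opencensus.io/integrations/ocsql"
	_ "github.com/lib/pq" // the concrete driver is up to you
)

func main() {
	// Register a traced wrapper around the postgres driver
	driverName, err := ocsql.Register("postgres", ocsql.WithAllTraceOptions())
	if err != nil {
		log.Fatal(err)
	}

	// Queries made through db will show up as spans on the context they receive
	db, err := sql.Open(driverName, "postgres://localhost:5432/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```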
For complex operations it might make sense to add custom spans to the trace:
```go
package mybusiness

import (
	"context"

	"go.opencensus.io/trace"
)

func myProcess(ctx context.Context) error {
	// Start a child span; the returned context carries it to downstream calls
	ctx, span := trace.StartSpan(ctx, "myProcess")
	defer span.End()

	doSomething(ctx)

	return nil
}
```
Notice that the function receives a context parameter, which is used to create a so-called child context that gets passed on to the next function. This is the same context propagation concept explained in the Logging section.
Conclusion 🔗︎
This post gave you a peek under the hood of how we observe the platforms managed by Pipeline. Those who use the free developer version of the platform — https://try.pipeline.banzai.cloud — often ask us how we know about problems within their clusters and manage to reach out to them within minutes when something goes wrong (customers are different: there we have SLAs). Now you know — we follow the practices above, and all metrics, logs and traces are wired into the different systems we use, with alerts configured and sent out to Slack, PagerDuty and other channels.