Advanced logging on Kubernetes

The content of this page hasn't been updated for years and might refer to discontinued products and projects.

Update: Logging operator v3 (released March, 2020) 🔗︎

We’re constantly improving the logging-operator based on feature requests of our ops team and our customers. The main features of version 3.0 are:

Log routing based on namespaces

Excluding logs

Select (or exclude) logs based on hosts and container names

Logging operator documentation is now available on the Banzai Cloud site.

Check The Kubernetes logging operator reloaded post for details.

In this blog we’ll continue our series about Kubernetes logging, and cover some advanced techniques and visualizations pertaining to collected logs. Just to recap, with our open source PaaS, Pipeline, we monitor and collect/move a large number of the logs for the distributed applications we push to Kubernetes. We are expending a lot of effort to monitor large and federated clusters, and to automate these with Pipeline, so that our users receive out of the box monitoring and log collection for free. This blog will build on previous posts, so you may want to read those as well.

Logging series:
Centralized logging under Kubernetes
Secure logging on Kubernetes with Fluentd and Fluent Bit
Advanced logging on Kubernetes

Monitoring series:
Monitoring Apache Spark with Prometheus
Monitoring multiple federated clusters with Prometheus - the secure way
Application monitoring with Prometheus and Pipeline
Building a cloud cost management system on top of Prometheus
Monitoring Spark with Prometheus, reloaded

Collecting logs 🔗︎

Logs are usually collected for debugging and auditing purposes. However, since the emergence of microservices and containerization, it’s become increasingly time-consuming to manually check and analyze logs. This is one of the main reasons why structured logging is becoming an industry standard. Nearly every language has it’s own logging library to help developers create clean, well structured logs. Under ideal circumstances, you design logs in the project’s design phase, the same way you architect its other components. Because we’ll be focusing on Kubernetes and Fluentd, there are several logging libraries that can help us.

Logging libraries with Fluentd support
Go - Fluent (official)
Go - Logrus
Python - Fluent (official)
Java - Fluent (official)

Needless to say, we often don’t have the luxury of rewriting a legacy application to produce structured logs - this is when a little tooling comes in handy.

Parsing logs 🔗︎

Sometimes we can’t change the logging framework of our application, and we need to use a standard output with clear text messages. These parsers can be custom plugins like the MySQL slow-query log or a simple Regexp like the apache access log. Note that fluentd and fluent-bit can parse raw inputs, as well. To get the hang of it, you can start with some of the official options.

kibana-dashboard

Example Apache parse config

expression /^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$/
time_format %d/%b/%Y:%H:%M:%S %z

The problem, now, is how to identify the different log messages; applying regexp to each message would cause enormous overhead. Instead of that, we’ll modify the metadata to route different types of messages to different filters.

In the Fluent ecosystem, log messages are routed by their Tag label. Tags are dot separated hierarchical metadata labels.

Filtering messages based on app labels 🔗︎

Kubernetes helps us filter messages based on app labels; because we’re running Pods on our cluster, we’re able to add metadata to containers. One of the most common labels is app or application. As fluent-bit enriches the log messages with the Kubernetes metadata we can use that to identify our log. Changes to the routing tags of fluentd messages can be handled with a great plugin called rewrite_tag_filter. Make sure you name your tags carefully to make efficient log flows. Naming is done in accordance with specific needs, however, an acceptable generic template we can use is: $app_name.$cluster_name.$other_information.

Global variable is a useful Helm feature you can read about here

Here’s an example of a relabel config that uses the app label and cluster global variable:

<match kubernetes.**>
  @type rewrite_tag_filter
  <rule>
    key $['kubernetes']['labels']['app']
    pattern ^(.+)$
    tag $1.{{ .Values.global.cluster }}
  </rule>
</match>

Example raw tag

var.log.containers.monitor-apache-5b5b9dd84c-g2gx8_default_prometheus-apache-3d1c87b16947cdd686f5964ca668041eefe4f8c0b88bc21a77a52a6b7abd1b0f.log

Example rewritten tag

apache.test.kubernetes

Good, now we have a nice, clean tag, and we can apply our parsing configuration.

<filter apache.*>
  @type parser
  format /^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)$/
  time_format %d/%b/%Y:%H:%M:%S %z
  key_name log
</filter>

Store logs 🔗︎

After we’ve parsed logs we can ingest them into an analyzer like ElasticSearch. There are several benefits to storing logs in this way; you can create dashboards (with Kibana) for huge amounts of data. With simple queries you can check the logs from any pod or even from multiple clusters. If you don’t want to analyse your logs you can simply use file or S3 output.

Elasticsearch is a distributed, RESTful search and analytics engine.

We used a custom Helm chart from Lalamove when we installed Elasticsearch.

[~]$ git clone https://github.com/lalamove/helm-elasticsearch
[~]$ helm install helm-elasticsearch

Note, to use Elasticsearch for real-life workloads, default resource limits must be significantly increased: at least a 1 GB heap size with a 2 GB memory limit.

In a default configuration there are three Master, two Data and two Client pods. You can fine tune these settings as needed. For testing purposes one Master, one Client and two data nodes is sufficient.

[~]$ kubectl get po                                                                                                                                                                                        *[master]
NAME                                  READY     STATUS    RESTARTS   AGE
elasticsearch-client-654d4894-6nlvq   1/1       Running   0          2m
elasticsearch-client-654d4894-wwk46   1/1       Running   0          2m
elasticsearch-data-0                  1/1       Running   0          4m
elasticsearch-data-1                  1/1       Running   0          4m
elasticsearch-master-0                1/1       Running   0          6m
elasticsearch-master-1                1/1       Running   0          7m
elasticsearch-master-2                1/1       Running   0          6m

After a successful deployment you can check the cluster status on the REST API.

[~]$ curl http://elasticsearch:9200/_cluster/health

{
  "cluster_name": "quarrelsome-elk",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 7,
  "number_of_data_nodes": 2,
  "active_primary_shards": 0,
  "active_shards": 0,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

If we configured our fluentd output to elasticsearch, we can check incoming indices as well.

[~]$ curl  http://elasticsearch:9200/_cat/indices?v

health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-2018.04.21 QgGbsv3AQMSwVb2hsOzW-Q   5   1       3873            0      4.1mb          2.1mb
green  open   logstash-2018.04.22 CLc8clbnQcW5hBDxrsXNNA   5   1     129811         2827    102.3mb         56.2mb
green  open   .kibana             Y3y0eh7qQ5iBQ92oRk03rw   1   1          2            0     12.4kb          6.2kb

Aggregating different log types into the same index can cause problems, so we recommend that you use index/logtype naming.

Here is an example configuration of fluentd with Elasticsearch output with log type prefixes.

<match **>
  @id elasticsearch
  @type elasticsearch
  @log_level info
  include_tag_key true
  host elasticsearch
  port 9200
  logstash_prefix ${tag}
  <buffer>
    @type file
    path /var/log/fluentd-buffers/elasticsearch.all.buffer
    flush_mode interval
    flush_interval 5s
    flush_thread_count 3
    retry_max_interval 60
    retry_forever
    retry_type exponential_backoff
    chunk_limit_size 20
    queue_limit_length 5
    overflow_action block
  </buffer>
</match>

Unfortunately, we don’t have the space to introduce you to every feature of Elasticsearch, but there are plenty of useful How-tos on its official page.

Kibana 🔗︎

There is a great Open Source tool from Elastic, named Kibana, that allows you to visualize your data. With it, you can spot anomalies and trends in a user friendly way.

Setting up kibana from the stable Helm repository is easy.

[~]$ helm install stable/kibana --name kibana --set env.ELASTICSEARCH_URL=http://elasticsearch:9200

kibana-dashboard

Monitoring legacy applications 🔗︎

Most of the applications we push to Kubernetes with Pipeline are legacy applications that are not monitoring friendly. These are mostly applications where you can’t create custom metric endpoints for a Prometheus exporter, but which have log messages that contain all the necessary information. To expose this information to Prometheus, you can use the prometheus input plug-in to fluentd (which we also use to monitor our logging deployments), then custom metrics can be set-up via a filter plugin.

Here is a simple example filter that counts response code by method type

<filter message>
  @type prometheus
  <metric>
    name response_code_count
    type counter
    desc The total number of requests labeled by status code and method
    key code
    <labels>
      method ${method}
      code ${code}
    </labels>
  </metric>
</filter>

Example metrics output

# TYPE response_code_count counter
# HELP response_code_count The total number of requests labeled by status code and method
response_code_count{code="200",method="GET"} 6
response_code_count{code="201",method="DELETE"} 1

That’s it. Afterward, you can use this data the same way you might use exporter output. You can even create dashboards and alerts like we did in our Prometheus related posts.

Conclusion 🔗︎

The emerging Kubernetes ecosystem has made it much easier to handle monitoring and logging in a standardized way. In this post we’ve tried to demonstrate the power of these tools without modifying a single bit of your application. We hope you enjoyed it, and we’ll continue our efforts to provide out of box solutions for your Kubernetes deployment.

Charts and configurations are available in our helm chart repository

http://kubernetes-charts.banzaicloud.com/

We also use our own container images with plugins pre-installed

Flunt-bit - upstream master branch build
Fluentd - stable build with plugins

Related resources

Tap into Kubernetes logs with the Logging Operator

article

The logging operator 3.0 release

article

Understanding Kubernetes cluster events

article