
It may not be a well-known fact, but the Kubernetes API server can proxy HTTP connections between a client and any service running on the cluster. A simple kubectl command exposes this proxy:

$ kubectl proxy
Starting to serve on 127.0.0.1:8001

We use this for demo purposes, or when we don’t want to expose APIs publicly but still need to access them from our computers. Moreover, this proxy can transport WebSocket connections as well; a common example is the kubectl logs command. Although these use cases are pretty common, there is an edge case that bit us when we wanted to use WebSocket connections together with query parameters.

The purpose behind proxying WebSockets 🔗︎

One Eye provides both metrics and logs for Kubernetes containers, and it’s only natural to want to tail the logs of a given container. Since the previous release - 0.3.0 - One Eye supports Loki as its log backend. Loki offers a built-in HTTP API that supports the LogQL query language. Additionally, it provides a streaming endpoint to tail the results of a query through a WebSocket. We only have to access this endpoint through the Kubernetes API server. Sounds simple, right?

Architectural overview 🔗︎

Let’s take a look at a simplified architectural map to better understand the querying journey.

Architecture

  • CLI Proxy A small Go application that runs on the local computer. It’s a simple proxy to the Kubernetes API server’s endpoint. Since the UI is JavaScript running inside a browser, in this case it runs on the same host as the CLI.

  • Kubernetes API Server The entry point to the Kubernetes cluster.

  • Ingress Ingress server, which proxies the requests to a particular backend component. We use nginx-ingress at the moment.

  • Loki The Loki backend, which collects all the logs in the cluster and provides HTTP & gRPC APIs.

About LogQL 🔗︎

The Loki API endpoint requires a LogQL query provided as HTTP query args to filter the logs that match the given terms.

There are two conventional ways to initiate Loki queries:

  • One is to call the /loki/api/v1/query endpoint over HTTP with query args, which returns results in a simple HTTP response
  • The other is to use the /loki/api/v1/tail endpoint, where the query looks much the same, but the HTTP connection is upgraded to a WebSocket and the live connection is continuously fed with the most recent data (much like POSIX tail)

Both query endpoints require a LogQL query that’s provided in HTTP query args, like this:

query={pod="one-eye-event-tailer-0"}&limit=10
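
If you build these query strings from code rather than by hand, Go’s net/url package takes care of the escaping for you. Here is a minimal, hypothetical sketch (not taken from the One Eye code base) that produces an equivalent, percent-encoded query string:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    v := url.Values{}
    v.Set("query", `{pod="one-eye-event-tailer-0"}`)
    v.Set("limit", "10")
    // Encode() percent-escapes the braces and quotes and sorts the keys
    fmt.Println(v.Encode()) // limit=10&query=%7Bpod%3D%22one-eye-event-tailer-0%22%7D
}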

If you are not familiar with LogQL, check out our post Loki, the tail and grep for Kubernetes.

Crack some eggs 🔗︎

With a running CLI proxy on port 8080 we were able to call Loki services over HTTP.

We tried this with the most basic check: a readiness check. It worked well.

$ curl -i localhost:8080/loki/ready
HTTP/1.1 200 OK
Content-Length: 6
Content-Type: text/plain; charset=utf-8
Date: Tue, 30 Jun 2020 09:02:05 GMT
Server: nginx/1.17.8
X-Content-Type-Options: nosniff

ready

The next step was to grab some labels from the Loki server.

curl localhost:8080/loki/loki/api/v1/label | jq

Please note the doubled /loki path segment:

  • The first /loki belongs to the ingress, which makes the ingress route the request to the Loki backend
  • The second /loki is actually part of the Loki service’s HTTP API

{
  "status": "success",
  "data": [
    "__name__",
    "app",
    "app_kubernetes_io_component",
    "app_kubernetes_io_instance",
    "app_kubernetes_io_name",
    "chart",
    "container",
    "container_id",
    "controller_revision_hash",
    "heritage",
    "host",
    "name",
    "namespace",
    "pod",
    "pod_id",
    "pod_template_hash",
    "prometheus",
    "release",
    "statefulset_kubernetes_io_pod_name"
  ]
}

We ran some more complex queries as well, which all came out fine.

curl 'localhost:8080/loki/loki/api/v1/query_range?query=\{name="loki"\}&limit=1' 2>/dev/null | jq

Notice the escaped curly brackets (\{ \}). Without the escapes, curl’s URL globbing would try to interpret them.

{
  "status": "success",
  "data": {
    "resultType": "streams",
    "result": [
      {
        "stream": {
          "namespace": "default",
          "pod": "one-eye-loki-0",
          "pod_id": "78a4c513-06b9-449d-aca6-10a1266fc116",
          "statefulset_kubernetes_io_pod_name": "one-eye-loki-0",
          "app": "loki",
          "container": "loki",
          "container_id": "c7a9c62eb8e5f5629a04aecf54c39156b4903ecbc5e93ac729de335dbfcf6dd8",
          "controller_revision_hash": "one-eye-loki-7874db668c",
          "host": "kind-control-plane",
          "name": "loki",
          "release": "one-eye-loki"
        },
        "values": [
          [
            "1593529265837448400",
            "{\"stream\":\"stderr\",\"logtag\":\"F\",\"message\":\"level=info ts=2020-06-30T15:01:05.8367595Z caller=table_manager.go:292 msg=\\\"synching tables\\\" expected_tables=116\"}"
          ]
        ]
      }
    ],
    "stats": {
      "summary": {
        "bytesProcessedPerSecond": 14001628,
        "linesProcessedPerSecond": 43852,
        "totalBytesProcessed": 53321,
        "totalLinesProcessed": 167,
        "execTime": 0.0038082
      },
      "store": { ... }
    }
  }
}

Everything worked fine during our first few attempts; next, we wanted to use the /tail endpoint with a WebSocket connection to continuously get the most recent logs.

WebSocket tools 🔗︎

We debugged the HTTP API endpoints with simple tools like curl and other small one-liners. Querying over a WebSocket, however, is a bit trickier than just handing a URL to the shell.

curl 🔗︎

The WebSocket protocol requires a number of handshake headers, so to make WebSocket requests with curl we scripted a simple bash function that takes care of them all. We tested it against another WebSocket application, and it worked well. It looked like this:

curlws() {
    # zsh only: expose the regex match groups in the bash-style BASH_REMATCH array
    setopt local_options BASH_REMATCH;
    local uri="${1}"
    # split the URI into an optional protocol part, the host and the rest
    [[ "${uri}" =~ ^((http|https)[:]//)?([^/]+)(.*)$ ]] || return
    local protocol=${BASH_REMATCH[2]:-http://}
    # send the WebSocket handshake headers and keep the connection open (-N)
    curl -N \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Host: ${BASH_REMATCH[4]}" \
    -H "Origin: $protocol${BASH_REMATCH[4]}" \
    -H "Sec-WebSocket-Key: $(head -c 16 /dev/urandom | base64)" \
    -H "Sec-WebSocket-Version: 13" \
    $@ --output - ;
}

The setopt local_options BASH_REMATCH; line is required on zsh to populate the bash-style BASH_REMATCH array.

javascript 🔗︎

Another way of testing WebSocket connections was with the built-in JavaScript console in Chrome. It should be noted that it’s possible to initiate a WebSocket connection in JS with just a couple of lines of code.

let socket = new WebSocket('ws://localhost:8080/loki/loki/api/v1/tail?query={name="loki"}')
socket.onopen = socket.onclose = socket.onmessage = socket.onerror = console.log

The issue 🔗︎

The first probes of /tail queries resulted in an error…

curlws 'http://localhost:8080/loki/loki/api/v1/tail?query=\{name="loki"\}'

…with this response in the body, without HTTP status.

�-�parse error : syntax error: unexpected $end

The javascript-websocket powered /tail request failed too.

JS Error

Our first thought was that the problem was hiding somewhere in the CLI proxy - after all, it was our own few lines of Go code - but after several code reviews and additional tests, we decided to check the other components as well.

Discovering the problem 🔗︎

After checking the response headers and the body message, the response looked like a successful WebSocket connection upgrade: the Sec-Websocket-Accept header is defined as a server response header in the WebSocket specification.

curlws 'http://localhost:8080/loki/loki/api/v1/tail?query=\{name="loki"\}' -i
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Tue, 30 Jun 2020 08:16:35 GMT
Connection: upgrade
Sec-Websocket-Accept: CTk59tUEO+X3dzYN3yElMnoK4Cc=
Upgrade: websocket

�-�parse error : syntax error: unexpected $end%
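
As a side note, the Sec-Websocket-Accept value is easy to verify yourself: per RFC 6455 it is derived from the client’s Sec-WebSocket-Key. A small Go sketch of the derivation (our own illustration, not code from any of the components involved):

import (
    "crypto/sha1"
    "encoding/base64"
)

// secWebSocketAccept computes the Sec-WebSocket-Accept value a server should
// return for a given Sec-WebSocket-Key, as defined in RFC 6455
func secWebSocketAccept(clientKey string) string {
    const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
    sum := sha1.Sum([]byte(clientKey + wsGUID))
    return base64.StdEncoding.EncodeToString(sum[:])
}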

The response body also contained some suspicious control characters, which turn out to be frame header bytes from the WebSocket base framing protocol. We won’t go into much detail here, but feel free to check out RFC 6455 if you want to dig deeper.

curlws 'http://localhost:8080/loki/loki/api/v1/tail?query=\{name="loki"\}' 2>/dev/null | hexdump -C
00000000  88 2d 03 f3 70 61 72 73  65 20 65 72 72 6f 72 20  |.-..parse error |
00000010  3a 20 73 79 6e 74 61 78  20 65 72 72 6f 72 3a 20  |: syntax error: |
00000020  75 6e 65 78 70 65 63 74  65 64 20 24 65 6e 64     |unexpected $end|
0000002f
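
Those leading bytes are actually meaningful when read against RFC 6455: 0x88 is a final Close frame, 0x2d is the 45-byte payload length, and 0x03 0xf3 is close code 1011 (internal server error), followed by the error text. A tiny Go sketch (our own illustration) that decodes the header:

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    // the first four bytes of the frame from the hexdump above
    header := []byte{0x88, 0x2d, 0x03, 0xf3}

    fin := header[0]&0x80 != 0                        // final fragment
    opcode := header[0] & 0x0f                        // 0x8 = Close frame
    payloadLen := header[1] & 0x7f                    // 45 bytes, unmasked
    closeCode := binary.BigEndian.Uint16(header[2:4]) // 1011 = internal server error

    fmt.Println(fin, opcode, payloadLen, closeCode) // true 8 45 1011
}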

All signs pointed to the HTTP-to-WebSocket connection upgrade having worked well, and to the error message in the response body having been returned by Loki.

Checking the Loki endpoint directly 🔗︎

To make sure Loki handles WebSocket queries properly, we requested its endpoint directly through a Kubernetes port-forward to its service. This way we bypassed all the proxy services and could focus on the Loki component alone.

Debug Loki

kubectl port-forward svc/one-eye-loki 3100:3100

Loki service port 3100 forwarded to localhost 3100

curlws 'http://localhost:3100/loki/api/v1/tail?query=\{name="loki"\}&limit=10' -i

Note the single loki URL; our request goes directly to the Loki service, skipping the ingress.

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: qGEgH3En71di5rrssAZTmtRTyFk=

�~�{"streams":[{"stream":{"app":"loki","container":"loki","container_id":"809458c447120553ec55bf7159fa1b5eab71e1d541077633d91f5511f48b14d5","controller_revision_hash":"one-eye-loki-7874db668c","host":"k3d-k3s-default-server","name":"loki","namespace":"default","pod":"one-eye-loki-0","pod_id":"89f61a2a-5baa-44bd-be2e-d6e4a7b79156","release":"one-eye-loki","statefulset_kubernetes_io_pod_name":"one-eye-loki-0"},"values":[["1593507930108519100","{\"stream\":\"stderr\",\"logtag\":\"F\",\"message\":\"level=info ts=2020-06-30T09:05:30.1080634Z caller=table_manager.go:292 msg=\\\"synching tables\\\" expected_tables=116\"}"]]},{"stream":{"app":"loki","container":"loki","container_id":"809458c447120553ec55bf7159fa1b5eab71e1d541077633d91f5511f48b14d5","controller_revision_hash":"one-eye-loki-7874db668c","host":"k3d-k3s-default-server","name":"loki","namespace":"default","pod":"one-eye-loki-0","pod_id":"89f61a2a-5baa-44bd-be2e-d6e4a7b79156","release":"one-eye-loki","statefulset_kubernetes_io_pod_name":"one-eye-loki-0"},"values":[["1593508049970165600","{\"stream\":\"stderr\",\"logtag\":\"F\",\"message\":\"level=info ts=2020-06-30T09:07:29.9696504Z caller=table_manager.go:292 msg=\\\"synching tables\\\" expected_tables=116\"}"]]}]}
���^C

Our test tail query worked like a charm: it even held the connection and kept receiving new data chunks until we terminated it with Ctrl+C. Loki was seemingly functioning well.

Checking Loki through the ingress service 🔗︎

The ingress service might also have been the culprit, so in the next step, we checked the ingress endpoint against the test query.

Debug Ingress

nginx-ingress also produces nice access logs, which helped a lot in discovering this issue. With a simple log tail, we were able to examine the requests hitting it.

kubectl logs -l app=nginx-ingress,app.kubernetes.io/component=controller -f
kubectl port-forward svc/one-eye-nginx-ingress-controller 8000:80

Ingress service port 80 forwarded to localhost 8000

curlws 'http://localhost:8000/loki/loki/api/v1/tail?query=\{name="loki"\}&limit=2' -i
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Tue, 30 Jun 2020 09:13:01 GMT
Connection: upgrade
Sec-Websocket-Accept: qGEgH3En71di5rrssAZTmtRTyFk=
Upgrade: websocket

�~d{"streams":[{"stream":{"app":"loki","container":"loki","container_id":"809458c447120553ec55bf7159fa1b5eab71e1d541077633d91f5511f48b14d5","controller_revision_hash":"one-eye-loki-7874db668c","host":"k3d-k3s-default-server","name":"loki","namespace":"default","pod":"one-eye-loki-0","pod_id":"89f61a2a-5baa-44bd-be2e-d6e4a7b79156","release":"one-eye-loki","statefulset_kubernetes_io_pod_name":"one-eye-loki-0"},"values":[["1593508211818521600","{\"stream\":\"stderr\",\"logtag\":\"F\",\"message\":\"level=error ts=2020-06-30T09:10:11.8182047Z caller=tailer.go:97 org_id=fake traceID=7939cf98b55a337f msg=\\\"Error writing to tail client\\\" err=\\\"rpc error: code = Internal desc = transport: transport: the stream is done or WriteHeader was already called\\\"\"}"]]},{"stream":{"app":"loki","container":"loki","container_id":"809458c447120553ec55bf7159fa1b5eab71e1d541077633d91f5511f48b14d5","controller_revision_hash":"one-eye-loki-7874db668c","host":"k3d-k3s-default-server","name":"loki","namespace":"default","pod":"one-eye-loki-0","pod_id":"89f61a2a-5baa-44bd-be2e-d6e4a7b79156","release":"one-eye-loki","statefulset_kubernetes_io_pod_name":"one-eye-loki-0"},"values":[["1593508289700192000","{\"stream\":\"stderr\",\"logtag\":\"F\",\"message\":\"level=info ts=2020-06-30T09:11:29.6975065Z caller=table_manager.go:292 msg=\\\"synching tables\\\" expected_tables=116\"}"]]}]}
��^C%

Our test query results still looked OK, and the corresponding nginx-ingress log line looked like this. We could see the URL path /loki/loki/api/v1/tail as well as the query args, query={name=\x22loki\x22}&limit=2.

127.0.0.1 - - [30/Jun/2020:09:27:49 +0000] "GET /loki/loki/api/v1/tail?query={name=\x22loki\x22}&limit=2 HTTP/1.1" 101 1245 "-" "curl/7.64.1" 267 3.778 [default-one-eye-ui-80] [] 10.42.0.20:80 0 3.777 101 478b2dfd61a5a38bd6ef2e0d57226ad3

Requesting Loki through the Kubernetes API server 🔗︎

The next step was to run the test query proxied through the Kubernetes API server, the same way the CLI proxy does.

Debug API server

kubectl proxy --port 8001

Running application-level gateway between localhost:8001 and the Kubernetes API Server.

The URL required by the API server is a bit more complex. Its first part names the destination service, /api/v1/namespaces/default/services/one-eye-nginx-ingress-controller:80, and its second part is the URL to be proxied, /loki/loki/api/v1/tail?query=\{name="loki"\}; the two are joined by the /proxy path segment, which tells the API server to act as a proxy.

curlws 'localhost:8001/api/v1/namespaces/default/services/one-eye-nginx-ingress-controller:80/proxy/loki/loki/api/v1/tail?query=\{name="loki"\}' -i

The result was another error.

HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Tue, 30 Jun 2020 09:23:29 GMT
Connection: upgrade
Sec-Websocket-Accept: wrWIcq4LOt2fDoU9xsJxH0zBMx0=
Upgrade: websocket

�-�parse error : syntax error: unexpected $end

When we had a look at the nginx-ingress logs, we realized that there weren’t any query arguments.

10.42.0.1 - - [30/Jun/2020:09:28:24 +0000] "GET /loki/loki/api/v1/tail HTTP/1.1" 101 47 "-" "curl/7.64.1" 283 0.006 [default-one-eye-ui-80] [] 10.42.0.20:80 0 0.006 101 aa853734e432cd429a242169045c1881

Verifying the error 🔗︎

So, it looked like no query args were getting transferred through the Kubernetes API server proxy on WebSocket connections, and the (empty) Loki LogQL query was failing and returning an error. The error message was a little misleading and vague; a simple emptiness check with a relevant error message would have helped us a lot.

That meant that we should receive the same error message when calling the previously port-forwarded Loki service directly without any query args.

curlws 'http://localhost:3100/loki/api/v1/tail' -i

Please note the single loki URL, since our request is skipping the ingress.

And the result was the same! Now we had enough information about the issue to do something about it.

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: bFKsYQ4fN7yQPofaObI65dx1mek=

�-�parse error : syntax error: unexpected $end

Possible solutions overview 🔗︎

Of course, we did a lot of googling and found several related open tickets.

Update: a solution to this issue has since been submitted, but is yet to be approved: Fixed bug where query string was lost on connection upgrade without using request location

Due to the inactive status of these tickets, we decided to work around the problem. It was clear that we needed to transfer the query args in a different way, so we figured out two possible ways of handling the problem:

  • transferring the query args in request headers
  • building a new URL with the query args (somehow) encoded into the URL path, and restoring them once the request is past the API server proxy (which we named composition/decomposition)

We considered transferring the query args in request headers, since it might have been the cleaner solution, but some of the tickets mentioned missing headers as well, so we decided it was better to be safe than sorry. Rewriting the URL path, on the other hand, could get messy fast, so we tried to keep the composition process as close to the decomposition as possible. At first we thought we could do the magic in the Ingress configuration, but nginx-ingress proved difficult to customize. This forced us to extend the LogQL request proxy-chain with our backend server - which we already had there for unrelated reasons - from where we could easily restore the URLs.

The new architectural design looked like this: New Architecture

  • The CLI Proxy composites the URL
  • The One-Eye-Backend decomposites the URL
  • We also had some issues with Loki base path handling (shout out to the Loki devs for the quick response on that), which had us doing some proxy path rule rewriting at the Ingress level. With the chosen solution (proxying the Loki requests and responses through the One Eye backend) we could redesign these rewrites and simply place them in the backend.
  • Any element of the proxy-chain was able to proxy standard LogQL queries this way (except when the Kubernetes API server is called directly)

An in-depth exploration of the chosen solution 🔗︎

Compositing query args on the CLI Proxy 🔗︎

This function receives a URL and builds a new URL from it, where the query args are stored in the URL path. It returns the original URL when an error occurs.

import (
    "encoding/base64"
    "fmt"
    "net/url"
)
...
const QueryArgsURLPart string = "queryargs"
...
func paramsToPath(uri *url.URL) *url.URL {
    // cloning the url
    result, err := url.Parse(uri.String())
    if err != nil {
        return uri
    }
    if len(result.RawQuery) > 0 {
        // composite all the query args into the path
        result.Path += fmt.Sprintf("/%s/%s/", QueryArgsURLPart, base64.URLEncoding.EncodeToString([]byte(result.RawQuery)))
        // clear the query args to avoid future problems
        result.RawQuery = ""
    }
    return result
}
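
A quick sanity check of what this produces, using the same test query we send with curl later in the post (a hypothetical snippet that builds on the function above):

func main() {
    in, err := url.Parse(`http://localhost:8080/loki/loki/api/v1/tail?query={name="loki"}&limit=2`)
    if err != nil {
        panic(err)
    }
    out := paramsToPath(in)
    fmt.Println(out.Path)
    // /loki/loki/api/v1/tail/queryargs/cXVlcnk9e25hbWU9Imxva2kifSZsaW1pdD0y/
    fmt.Println(out.RawQuery) // empty
}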

We call URL composition only for WebSocket requests where the query args aren’t empty:

...
    // only websocket connections
    if req.Header.Get("Connection") == "Upgrade" &&
        req.Header.Get("Upgrade") == "websocket" &&
        // only with query args present
        len(req.URL.RawQuery) > 0 {
        req.URL = paramsToPath(req.URL)
    }
...

After implementing the workaround in the CLI proxy and requesting our test Loki tail query again, the ingress log looked like this:

curlws 'localhost:8080/loki/loki/api/v1/tail?query=\{name="loki"\}&limit=2'
10.42.0.1 - - [30/Jun/2020:12:58:17 +0000] "GET /loki/loki/api/v1/tail/queryargs/cXVlcnk9e25hbWU9Imxva2kifSZsaW1pdD0y/ HTTP/1.1" 101 1242 "-" "curl/7.64.1" 350 1.923 [default-one-eye-ui-80] [] 10.42.0.20:80 0 1.923 101 e4425bd0a2c4462ca1746e6757e5cc86

Our URL composition worked well: the URL /loki/loki/api/v1/tail?query=\{name="loki"\}&limit=2 was changed to /loki/loki/api/v1/tail/queryargs/cXVlcnk9e25hbWU9Imxva2kifSZsaW1pdD0y/. We were able to verify the encoded part with a quick bash command:

echo "cXVlcnk9e25hbWU9Imxva2kifSZsaW1pdD0y" | base64 -d
query={name="loki"}&limit=2

Decompositing query args on One-Eye-Backend 🔗︎

This function restores the original Loki tail query URL, or returns the URL untouched if an error occurs.

import (
    "encoding/base64"
    "fmt"
    "net/url"
    "regexp"
)

const QueryArgsURLPart string = "queryargs"

var (
    QueryArgsRegxPattern string = fmt.Sprintf("/%s/(%s)/", QueryArgsURLPart, "[^/]*")
    QueryArgsRegx               = regexp.MustCompile(QueryArgsRegxPattern)
)

func PathToParams(uri *url.URL) *url.URL {
    if QueryArgsRegx.MatchString(uri.String()) {
        // cloning the url
        result, err := url.Parse(uri.String())
        if err != nil {
            return uri
        }

        // checking the existence of the composited query args
        subResults := QueryArgsRegx.FindStringSubmatch(result.String())
        if len(subResults) != 2 {
            return uri
        }

        // decomposite the query args
        decodedArgs, err := base64.URLEncoding.DecodeString(subResults[1])
        if err != nil {
            return uri
        }

        // restore the original query args
        result.RawQuery = string(decodedArgs)

        // remove the composited query args from the path
        result.Path = QueryArgsRegx.ReplaceAllLiteralString(result.Path, "")

        return result
    }

    return uri
}

The decomposition function can be called on every request: it only acts on composited URLs, and returns the original URL if there is nothing to decomposite (the regex doesn’t match). The real implementation was more complex, of course, but the basic idea was that, this way, our One Eye backend restored the original form of the URL and the Loki service received the original request.
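
To give an idea of where such a restore step could live, here is a hypothetical sketch (not the actual One Eye backend code) that wires the PathToParams function above into a Go reverse proxy sitting in front of Loki; the upstream address is made up:

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // hypothetical upstream; in our setup this would be the Loki service address
    lokiURL, err := url.Parse("http://one-eye-loki:3100")
    if err != nil {
        log.Fatal(err)
    }

    proxy := httputil.NewSingleHostReverseProxy(lokiURL)
    origDirector := proxy.Director
    proxy.Director = func(req *http.Request) {
        origDirector(req)
        // restore any composited query args before the request reaches Loki
        req.URL = PathToParams(req.URL)
    }

    log.Fatal(http.ListenAndServe(":8081", proxy))
}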

After we released all the workaround changes, Loki’s /tail endpoint became usable through the CLI proxy.

Summary 🔗︎

Overall, we had ambivalent feelings about applying a workaround to a feature as basic as proxying a request. On one hand, it seemed unnecessary, on the other hand, it was a great exercise that sent us digging into the depths of Kubernetes.

In other words: this wouldn’t have been a big deal if it had happened outside Kubernetes, with a regular nginx, or if there had been more informative error handling in Loki.

A few thoughts about the components:

Kubernetes: the Kubernetes API server proxy issue (with HTTP query args) still exists, despite several issue tickets.

Nginx: we love Nginx, but nginx-ingress is quite cumbersome to customize, which is what led us to move the rewrites to the One Eye backend.

Loki: the base-path URL handling problem was a bug, and I’m sure it will be fixed, but the unhelpful error message made the problem hard to discover. The Grafana/Loki HTTP API was very useful though.