Request errors on GLB

This document collects resources for debugging request errors on the Google External HTTP(S) Load Balancer (GLB).

Basic Tools

See the introduction for global load balancing on GAP.

Metric description
Monitoring -> Dashboards -> Google Cloud Load Balancers -> pick yours (e.g. k8s-um-mobil-engage-me-push-api…)

Log description
Error codes
Logging -> Logs Explorer -> query: resource.type="http_load_balancer"

Suggested approach

The GLB logs should contain an entry for every failed request. Filter by status code (e.g. 502, 504, see below…) to find the requests, then look at the jsonPayload.statusDetails field of each entry. Check the descriptions of the relevant errors.

Filter:

resource.type="http_load_balancer"
httpRequest.status="502"
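When several status codes or one specific error detail are of interest, the filter can be assembled programmatically. The following is a minimal sketch, not an official client: build_glb_filter is a hypothetical helper name, and the jsonPayload.statusDetails field name is assumed from the GLB logging documentation. The returned string is meant to be pasted into the query box.

```python
from typing import Iterable, Optional


def build_glb_filter(statuses: Iterable[int], status_details: Optional[str] = None) -> str:
    """Build a Cloud Logging filter for GLB request logs (hypothetical helper)."""
    codes = sorted(set(statuses))
    if not codes:
        raise ValueError("at least one status code is required")

    # Separate lines in a Logging query are combined with an implicit AND.
    lines = ['resource.type="http_load_balancer"']
    if len(codes) == 1:
        lines.append(f"httpRequest.status={codes[0]}")
    else:
        lines.append("httpRequest.status=(" + " OR ".join(str(c) for c in codes) + ")")

    # Assumption: the error detail is stored under jsonPayload.statusDetails.
    if status_details:
        lines.append(f'jsonPayload.statusDetails="{status_details}"')
    return "\n".join(lines)
```

For example, build_glb_filter([502, 504]) yields the resource type line followed by httpRequest.status=(502 OR 504).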

If the request has no log entry on the nginx/application side, it very likely failed before it reached the nginx container, or nginx/the application cut off the connection because of an improper keepalive timeout.

An additional step to verify that the pods are in fact working is to open a shell in the pod (in k9s or with kubectl exec) and call the endpoints directly (if curl or a custom client exists in the image), or to start a one-off pod from an image that has curl and call the pod using its pod IP or hostname.
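If the image has a Python interpreter but no curl, the same direct call can be sketched with the standard library. This is only an illustration under that assumption: probe_pod is a hypothetical helper, and the host, port and path passed to it are placeholders to be replaced with the pod's real values.

```python
import http.client
from typing import Optional, Tuple


def probe_pod(host: str, port: int, path: str = "/", timeout: float = 5.0) -> Tuple[Optional[int], str]:
    """Send one GET straight to a pod; return (status, reason) or (None, error text)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection can be closed cleanly
        return response.status, response.reason
    except (OSError, http.client.HTTPException) as exc:
        # Covers refused connections, timeouts and malformed responses.
        return None, f"{type(exc).__name__}: {exc}"
    finally:
        conn.close()
```

A call such as probe_pod("10.0.0.12", 8080, "/healthz") (all three values are placeholders) returns the HTTP status when the pod answers, and None plus the exception text when the connection is refused or times out.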

Most common issues:

backend_connection_closed_before_data_sent_to_client or connection refused errors on the pods
- usually has to do with the keepalive timeout
502s on shutdown/restart
- GLB configuration is eventually consistent, so after the shutdown signal the endpoint stays in load balancing for several seconds. If the application stops immediately on SIGTERM, many failing requests (502) may be reported. It is recommended to add a 5-30s delay before the application stops accepting new connections on shutdown signals (a node example)
- when the router-log-sidecar is used, its delay should be longer than the application's delay so that nginx does not cut off connections
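The shutdown delay can be sketched as follows. This is a minimal illustration using Python's standard library HTTP server, not the setup of any particular service: make_sigterm_handler, main, the port 8080 and the 15s value are all assumptions chosen for the example.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SHUTDOWN_DELAY_SECONDS = 15.0  # assumption: any value in the 5-30s range


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_sigterm_handler(server, delay=SHUTDOWN_DELAY_SECONDS):
    """Return a signal handler that keeps serving for `delay` seconds, then stops."""
    timers = []

    def on_sigterm(signum, frame):
        if timers:  # a repeated SIGTERM does not restart the countdown
            return
        # shutdown() blocks if called from the thread running serve_forever(),
        # so it is scheduled on a separate Timer thread.
        timer = threading.Timer(delay, server.shutdown)
        timer.daemon = True
        timers.append(timer)
        timer.start()

    return on_sigterm


def main():
    server = ThreadingHTTPServer(("0.0.0.0", 8080), Handler)  # port is a placeholder
    signal.signal(signal.SIGTERM, make_sigterm_handler(server))
    server.serve_forever()  # returns once the delayed shutdown() has run
    server.server_close()
```

Calling main() starts the server. The delay has to fit inside the pod's termination grace period (30s by default in Kubernetes), otherwise the process is killed before the delay elapses.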

It is important to note that GLB connects to the pod IPs directly, without going through additional layers. This is done through a Zonal NEG, which has its own health checks instead of relying on the k8s probes. Keep this in mind when checking the health of the backends: a pod that passes its k8s probes can still be unhealthy from the GLB's point of view.

Keepalive timeout

The GLB's keepalive timeout is fixed by Google at 600s, so every backend needs to set its own keepalive timeout to something higher (e.g. 620s). Some frameworks (e.g. nginx) allow limiting the number of requests accepted on a connection. This should be unlimited or very high to avoid closing the connection too early.

Every framework should have a setting for this, although the naming may differ. See GLB docs. Some framework defaults:

nginx: 75s
- config
apache: 5s
- config
node: 5s
- koa
- fastify
spring-boot: 30s
- examples
akka-http: 60s
- idle-timeout
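The rule that a backend's keepalive timeout must exceed the GLB's 600s can be checked with a few lines. A minimal sketch: unsafe_backends is a hypothetical helper name, and the values are the defaults listed above, in seconds.

```python
GLB_KEEPALIVE_SECONDS = 600  # fixed by Google, as stated above

# Framework defaults as listed above, in seconds.
FRAMEWORK_DEFAULTS = {
    "nginx": 75,
    "apache": 5,
    "node": 5,
    "spring-boot": 30,
    "akka-http": 60,
}


def unsafe_backends(timeouts, glb_timeout=GLB_KEEPALIVE_SECONDS):
    """Return the backends whose keepalive timeout is not strictly above the GLB's."""
    return {name: seconds for name, seconds in timeouts.items() if seconds <= glb_timeout}
```

With the defaults above, every listed framework is reported as unsafe, which is why the timeout has to be configured explicitly. A backend set to 620s is not reported.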