Common alerts

This document describes the most common GAP alerts and, where applicable, how to resolve them.

JobStatusFailed
DeploymentReplicasUnavailable
HTTPRequestErrors
ContainerTerminatedOOMKilled
HPAReachesMaxReplicas
Resources

JobStatusFailed

An alert will be fired when a job has failed for any reason. Note that the failed Job must be manually deleted for the alert to be resolved.

Cause 1

Kubernetes terminated a pod created from a job because it exceeded its time limit.

Cause 2

Kubernetes terminated a pod created from a job because it exceeded its memory limit.

Note: for administrative reasons, GAP keeps one instance of the failed job available for investigation. This can be tuned if necessary.

Cause 3

The process running in the job has finished with a non-zero exit code.

Resolution 1

Example alert:

Failed job in
Failed job example-job-1595511600 (example-application) in namespace example-namespace.
- Labels: alertname=JobStatusFailed endpoint=http-metrics instance=10.132.8.218:8080 job=kube-state-metrics job_name=example-job-1595511600 label_applicationName=example-application namespace=example-namespace pod=kube-state-metrics-5cbdf44f9f-mtxzx prometheus=gap-system/prometheus service=kube-state-metrics severity=warning

Use kubectl describe to determine the termination reason:

kubectl describe job example-job-1595511600 -n example-namespace
... some output omitted ...
Events:
  Type     Reason            Age    From            Message
  ----     ------            ----   ----            -------
  Normal   SuccessfulCreate  2m27s  job-controller  Created pod: example-job-1595522040-6qh89
  Normal   SuccessfulDelete  27s    job-controller  Deleted pod: example-job-1595522040-6qh89
  Warning  DeadlineExceeded  27s    job-controller  Job was active longer than specified deadline

If the Events section is empty (displayed as Events: <none>), the failed job instance is too old to determine the reason for the failure.
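As a complement to reading the Events section, the failure reason can also be read from the Job's status conditions, which are stored on the Job object itself. The following is a minimal sketch, not GAP tooling: `job_failure_reason` and `explain_reason` are hypothetical helper names, and the mapping to the causes in this document is an assumption based on the standard Kubernetes reasons `DeadlineExceeded` and `BackoffLimitExceeded`.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# job_failure_reason JOB NAMESPACE
# Prints the reason of the Job's "Failed" condition, if one is recorded.
job_failure_reason() {
  "$KUBECTL" get job "$1" --namespace "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
}

# explain_reason REASON
# Hypothetical helper: translates a reason into a hint that points at the causes above.
explain_reason() {
  case "$1" in
    DeadlineExceeded)     echo "time limit exceeded (Cause 1)" ;;
    BackoffLimitExceeded) echo "pod failed repeatedly; check pod status and logs (Cause 2 or 3)" ;;
    "")                   echo "no Failed condition recorded" ;;
    *)                    echo "unrecognized reason: $1" ;;
  esac
}
```

With cluster access, the two helpers can be combined as `explain_reason "$(job_failure_reason example-job-1595511600 example-namespace)"`.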

Job execution in GAP is currently configured to allow only one running instance per cronjob. This means that while a long-running job is active, further executions of the same cronjob are not scheduled.

To debug a problematic job, follow these steps:

1. Increase the activeDeadlineSeconds for the cronjob to a high value (e.g. 86400).
   1. Temporarily: use k9s to edit the manifest of the cronjob in question.
   2. Permanently: add or update the activeDeadlineSeconds value in the gap.yaml and execute a deploy.
2. Once the new value takes effect, it applies to the next job execution. Use k9s to open a shell in the long-running pod and investigate.
   1. Have a look at LaaS Kibana for application logs.
   2. Have a look at the pod logs with k9s.
3. Use the kill command to suspend and resume the process:
   1. to pause the process: kill -STOP 1
   2. to resume the process: kill -CONT 1
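For the temporary variant of the first step, the deadline can also be raised with kubectl instead of editing the manifest in k9s. This is a sketch under assumptions: the function names are hypothetical, and the field path `spec.jobTemplate.spec.activeDeadlineSeconds` is the standard Kubernetes CronJob location, which GAP-generated cronjobs are assumed to follow.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# deadline_patch SECONDS
# Builds a JSON merge patch that sets activeDeadlineSeconds on the job template.
deadline_patch() {
  printf '{"spec":{"jobTemplate":{"spec":{"activeDeadlineSeconds":%d}}}}' "$1"
}

# raise_cronjob_deadline CRONJOB NAMESPACE SECONDS
# Applies the patch; like the k9s edit, this is a temporary change.
raise_cronjob_deadline() {
  "$KUBECTL" patch cronjob "$1" --namespace "$2" --type merge -p "$(deadline_patch "$3")"
}
```

For example, `raise_cronjob_deadline example-job example-namespace 86400` would apply the value suggested above; the cronjob name here is illustrative.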

Resolution 2

Create a patch file with an increased memory limit for the given cronjob.

Resolution 3

Look into the job logs to find out why the process exited with a non-zero code, e.g. misconfiguration or issues when connecting to external services.

Cleanup

The failed job object has to be deleted, otherwise Prometheus will keep sending alerts about it.

kubectl can be used to determine job execution durations:

% kubectl get job -n example-namespace
NAME                                                     COMPLETIONS   DURATION   AGE
example-job-1594976400                                   0/1           7d21h      7d21h
example-job-1594993800                                   1/1           4s         23m
example-job-1594994400                                   1/1           5s         13m
example-job-1594995000                                   1/1           3s         3m26s

Increase the limit by setting activeDeadlineSeconds to a higher value in the gap.yaml for the affected cronjob.
Delete the failed job instance, otherwise Prometheus will keep sending alerts about it:
```
kubectl delete job example-job-1594976400 --namespace example-namespace
```
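To spot cleanup candidates in a longer job list, the table printed by `kubectl get job` can be filtered on the COMPLETIONS column. This is a minimal sketch with a hypothetical function name; note that a job showing 0/1 may simply still be running, so the result is a list of candidates to check, not a list of confirmed failures.

```shell
# incomplete_jobs
# Reads `kubectl get job` table output on stdin and prints the names of jobs
# whose COMPLETIONS column (e.g. 0/1) shows fewer successes than required.
incomplete_jobs() {
  awk 'NR > 1 { split($2, c, "/"); if (c[1] + 0 < c[2] + 0) print $1 }'
}
```

Usage, assuming cluster access: `kubectl get job -n example-namespace | incomplete_jobs`.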

DeploymentReplicasUnavailable

Cause 1

Kubernetes is currently unable to maintain the requested number of pods. Based on the configured metrics, the Horizontal Pod Autoscaler might have drastically increased the required number of replicas in a deployment. If the cluster already has a high number of pods running, Kubernetes may decide to provision new Nodes (Virtual Machines) to be able to launch the new pods. This might take 5 to 20 minutes.

Cause 2

The manifest of a deployment was edited via the Kubernetes API (with kubectl or k9s) in a way that leaves Kubernetes unable to start the desired number of pods (e.g. by specifying a wrong image for one of the pods).

Resolution 1

If the root cause is a scaling issue, Kubernetes should resolve the problem automatically by adding new Nodes to the cluster. In this case, it is only a matter of time before the alert resolves. If the alert is not resolved within 20 minutes, please notify the Cloud Platform team in the #team-tooling room on Slack.
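While waiting, the progress of the deployment can be followed from the command line. The sketch below wraps `kubectl rollout status` with a timeout that mirrors the 20 minutes mentioned above; the function name and the default timeout are assumptions made for illustration.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_rollout DEPLOYMENT NAMESPACE [TIMEOUT]
# Blocks until the deployment's replicas are available or the timeout expires.
wait_for_rollout() {
  "$KUBECTL" rollout status "deployment/$1" --namespace "$2" --timeout="${3:-20m}"
}
```

A non-zero exit status from `wait_for_rollout example-app-web example-namespace` would suggest that the replicas did not become available in time.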

Resolution 2

If the problem is with the deployment itself, the affected pods need to be inspected for errors.

Using kubectl to inspect the affected pods:

# get the list of affected pod(s)
% kubectl get pod --selector app=example-app-web -n example-namespace
NAME                                          READY   STATUS             RESTARTS   AGE
example-app-web-74d64bc79f-qnr6g              2/2     Running            0          3h33m
example-app-web-7f6dc8fd78-zgs5f              1/2     InvalidImageName   0          13m

# describe pod
kubectl describe pod example-app-web-7f6dc8fd78-zgs5f -n example-namespace

...  output omitted

Events:
  Type     Reason         Age                     From                                                  Message
  ----     ------         ----                    ----                                                  -------
  Warning  InspectFailed  6m27s (x60 over 13m)  kubelet, gke-gap-staging-baseline-pool-0c60d758-cbbc  Failed to apply default image tag "eu.gcr.io/ems-gap-images/example-app:latest@sha256:invalidhash": couldn't parse image reference "eu.gcr.io/ems-gap-images/example-app:latest@sha256:invalidhash": invalid reference format
  Warning  Failed         94s (x62 over 13m)    kubelet, gke-gap-staging-baseline-pool-0c60d758-cbbc  Error: InvalidImageName
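When a deployment has many pods, the `kubectl get pod` table can be narrowed down to the pods that are not fully ready. This is a small sketch with a hypothetical function name; it only compares the two numbers in the READY column and prints the pod name together with its STATUS.

```shell
# not_ready_pods
# Reads `kubectl get pod` table output on stdin and prints "NAME STATUS"
# for every pod whose READY column (e.g. 1/2) is not complete.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}
```

Usage, assuming cluster access: `kubectl get pod --selector app=example-app-web -n example-namespace | not_ready_pods`.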

Related Resources

Production Grafana dashboards
Staging Grafana dashboards

HTTPRequestErrors

The alerts

HighInbound4xxErrorRate
HighInbound5xxErrorRate
IngressRequestErrors5xx
IngressRequestErrors4xx
HighOutbound4xxErrorRate

Symptoms for inbound alerts

These alerts trigger when the inbound 4xx or 5xx HTTP error rate exceeds 5% over the last 5 minutes. The condition must hold for one minute before the alert fires.

Symptoms for ingress alerts

The alert triggers when the ratio of 4xx or 5xx HTTP requests to all HTTP requests is greater than or equal to 5% for a given ingress over the last 5 minutes. The condition must hold for one minute before the alert fires.

Symptoms for outbound alerts

These alerts trigger when the outbound 4xx HTTP error rate exceeds 10% over the last 5 minutes. The condition must hold for one minute before the alert fires.
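The threshold arithmetic behind these alerts can be illustrated with a short calculation. This is only a sketch of the ratio check, not the actual alert rule: the function names and the sample request counts are invented for illustration.

```shell
# error_percent ERRORS TOTAL
# Prints the share of error requests as a percentage with two decimals.
error_percent() {
  awk -v e="$1" -v t="$2" 'BEGIN { if (t == 0) print "0.00"; else printf "%.2f\n", 100 * e / t }'
}

# exceeds_threshold ERRORS TOTAL THRESHOLD
# Succeeds (exit status 0) when the error percentage is strictly above THRESHOLD.
exceeds_threshold() {
  awk -v e="$1" -v t="$2" -v th="$3" 'BEGIN { exit !(t > 0 && 100 * e / t > th) }'
}
```

For example, 7 errors out of 120 requests is about 5.83%, so `exceeds_threshold 7 120 5` succeeds, while 6 out of 120 is exactly 5% and does not exceed the inbound threshold. The ingress alerts use "greater than or equal to", so the same 5% would count there.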

5xx Error Alerts

Cause 1

There is a sudden spike in the number of requests and the current number of pods is unable to handle the load.

Cause 2

The application itself is producing errors.

Resolution 1

If the Horizontal Pod Autoscaler is configured for the deployment, it should scale it up automatically. If it’s not configured, scale up the application manually (by editing the replicas value in the deployment manifest using k9s).

Temporary solution: scale up the deployment with kubectl or k9s. For more information, see the Scaling ops-guide.

Permanent solution: create a patch file for the specific deployment with the desired replica count, and then start a deploy.
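The temporary solution can be sketched with `kubectl scale`. The function name and the replica count in the usage line are illustrative assumptions; if a Horizontal Pod Autoscaler manages the deployment, it may change the replica count again afterwards.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# scale_deployment DEPLOYMENT NAMESPACE REPLICAS
# Sets the replica count of a deployment directly (a temporary change).
scale_deployment() {
  "$KUBECTL" scale deployment "$1" --namespace "$2" --replicas="$3"
}
```

Usage, assuming cluster access: `scale_deployment example-app-web example-namespace 4`.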

Resolution 2

Examine the application logs in the LaaS Kibana, and/or debug the application.

Related Grafana dashboards

Have a look at the Requests, Latency and Errors panels in the Traffic row on the Deployment dashboard for the given deployment.

Production Deployment dashboard
Staging Deployment dashboard

ContainerTerminatedOOMKilled

Cause

The pod was killed by Kubernetes because it exceeded its specified memory limit.

Resolution

Increase the memory limit for the affected deployment in gap.yaml.
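Before raising the limit, it can help to confirm that the last container termination really was an OOM kill. The sketch below reads the `lastState.terminated.reason` field of the pod's container statuses; the function names are hypothetical, and the pod name in the usage line is only an example.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# last_termination_reasons POD NAMESPACE
# Prints the last termination reason of each container, separated by spaces.
last_termination_reasons() {
  "$KUBECTL" get pod "$1" --namespace "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# was_oom_killed "REASONS"
# Succeeds when the space-separated reason list contains OOMKilled.
was_oom_killed() {
  case " $1 " in
    *" OOMKilled "*) return 0 ;;
    *)               return 1 ;;
  esac
}
```

Usage, assuming cluster access: `was_oom_killed "$(last_termination_reasons example-app-web-74d64bc79f-qnr6g example-namespace)" && echo "OOMKilled"`.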

Related resources

Managing Resources for Containers - How Pods with resource limits are run

HPAReachesMaxReplicas

If your Horizontal Pod Autoscaler (HPA) reaches the maximum number of replicas that you defined in your gap.yaml for your deployment under the autoscaling key (or in a custom component), you will get this alert.

Cause

This type of alert can be triggered by any activity on the deployment that causes it to autoscale (e.g. high request load on your application).

Resolution

Find out the root cause of the autoscaling and, if it is justified, increase the maximum number of HPA replicas for your deployment (see more under auto-scaling).
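To check how close an HPA is to its limit, the current and maximum replica counts can be read with kubectl. This is a sketch under assumptions: the function names are hypothetical, and the HPA name in the usage line is assumed to match the deployment name, which may not hold in your setup.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# hpa_replicas HPA NAMESPACE
# Prints "CURRENT MAX" for the given HorizontalPodAutoscaler.
hpa_replicas() {
  "$KUBECTL" get hpa "$1" --namespace "$2" \
    -o jsonpath='{.status.currentReplicas} {.spec.maxReplicas}'
}

# hpa_at_max CURRENT MAX
# Succeeds when the current replica count has reached the configured maximum.
hpa_at_max() {
  [ "$1" -ge "$2" ]
}
```

Usage, assuming cluster access: `hpa_at_max $(hpa_replicas example-app-web example-namespace) && echo "at maximum"`. The command substitution is left unquoted on purpose so that the two numbers become separate arguments.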

Related Resources

Production Grafana dashboards
Staging Grafana dashboards

Resources

Documentation site
Production GAP Grafana
Staging GAP Grafana
k9s - Kubernetes CLI for managing Kubernetes clusters
Using Kubernetes with k9s
gap.yaml features
Highly recommended alerts
