Common alerts
This document describes the most common GAP alerts and, where applicable, how to resolve them.
- JobStatusFailed
- DeploymentReplicasUnavailable
- HTTPRequestErrors
- ContainerTerminatedOOMKilled
- HPAReachesMaxReplicas
- Resources
An alert will be fired when a job has failed for any reason.
Note that the failed Job must be manually deleted for the alert to be resolved.
Kubernetes terminated a pod that had been created from a job because it exceeded its time limit.
Kubernetes terminated a pod that had been created from a job because it exceeded its memory limit.
Note: for administrative reasons, GAP keeps one instance of the failed job available for investigation. This can be tuned if necessary.
The process running in the job has finished with a non-zero exit code.
Example alert:
Failed job in
Failed job example-job-1595511600 (example-application) in namespace example-namespace.
- Labels: alertname=JobStatusFailed endpoint=http-metrics instance=10.132.8.218:8080 job=kube-state-metrics job_name=example-job-1595511600 label_applicationName=example-application namespace=example-namespace pod=kube-state-metrics-5cbdf44f9f-mtxzx prometheus=gap-system/prometheus service=kube-state-metrics severity=warning
Use kubectl describe to determine the termination reason:
kubectl describe job example-job-1595511600 -n example-namespace
... some output omitted ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 2m27s job-controller Created pod: example-job-1595522040-6qh89
Normal SuccessfulDelete 27s job-controller Deleted pod: example-job-1595522040-6qh89
Warning DeadlineExceeded 27s job-controller Job was active longer than specified deadline
If the Events section is empty (displayed as Events: <none>), the failed job instance is too old to determine the reason for the failure.
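The failure reason can also be read from the Job's status conditions instead of its events. The following is a minimal sketch, assuming the Job object still exists; `job_failure_reason` is a hypothetical helper name and the Job and namespace names are the example values used above. A Job that ran past its deadline reports DeadlineExceeded, while BackoffLimitExceeded only says that its pods kept failing, so the pod itself still has to be inspected to tell a memory kill from a non-zero exit code.

```shell
# Print the reason recorded on the Job's "Failed" condition
# (for example DeadlineExceeded or BackoffLimitExceeded).
# Hypothetical helper: $1 is the Job name, $2 is the namespace.
job_failure_reason() {
  kubectl get job "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
}

# Usage (needs cluster access):
#   job_failure_reason example-job-1595511600 example-namespace
```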
Job execution in GAP is currently configured to allow only one running instance per cronjob. This means that while a long-running job is active, further executions of the same cronjob are not scheduled.
To be able to debug a problematic job, follow these steps:
- Increase the activeDeadlineSeconds for the cronjob to a high value (e.g.: 86400)
- Temporarily: use k9s to edit the manifest of the cronjob in question
- Permanently: add/update the activeDeadlineSeconds value in the gap.yaml and execute a deploy
- Once the new values become effective, the next job execution will be affected. Use k9s to open a shell into the long running pod to investigate
- Have a look at LaaS Kibana for application logs
- Have a look at POD logs with k9s
- The kill command can be used to suspend and resume a process:
- to pause a process: kill -STOP 1
- to resume a process: kill -CONT 1
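As an alternative to editing the manifest in k9s, the temporary deadline change from the first step can be applied with `kubectl patch`. This is a sketch under assumptions: `deadline_patch` is a hypothetical helper, `example-job` is an assumed cronjob name, and the change is expected to last only until the manifest is applied again.

```shell
# Build a JSON merge patch that sets activeDeadlineSeconds on the
# Job template of a CronJob. $1 is the deadline in seconds.
deadline_patch() {
  printf '{"spec":{"jobTemplate":{"spec":{"activeDeadlineSeconds":%d}}}}' "$1"
}

# Usage (needs cluster access; "example-job" is an assumed CronJob name):
#   kubectl patch cronjob example-job -n example-namespace \
#     --type merge -p "$(deadline_patch 86400)"
```

Only executions started after the patch pick up the new deadline.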
Create a patch file with an increased memory limit for the given cronjob.
Look into the job logs to find out why the process exited with a non-zero exit code, e.g. misconfiguration or issues when connecting to external services.
The failed job object has to be deleted, otherwise Prometheus will keep sending alerts about it.
Kubectl can be used to determine job execution durations:
% kubectl get job -n example-namespace
NAME COMPLETIONS DURATION AGE
example-job-1594976400 0/1 7d21h 7d21h
example-job-1594993800 1/1 4s 23m
example-job-1594994400 1/1 5s 13m
example-job-1594995000 1/1 3s 3m26s
Increase the limit by setting the activeDeadlineSeconds to a higher value in the gap.yaml for the affected cronjob.
Delete the failed job instance, otherwise Prometheus will keep sending alerts about it
kubectl delete job example-job-1594976400 --namespace example-namespace
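To find which job instance to delete, the `kubectl get job` listing can be filtered for jobs that never completed. A small sketch, with `jobs_without_completion` as a hypothetical helper name; a job that is still running also shows 0 completions, so the output is a list of candidates to check, not a list of confirmed failures.

```shell
# Read `kubectl get job` table output on stdin and print the names of
# Jobs whose COMPLETIONS column starts with "0/" (no successful run).
# A still-running Job also matches, so treat the result as candidates.
jobs_without_completion() {
  awk 'NR > 1 && $2 ~ /^0\// { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get job -n example-namespace | jobs_without_completion
```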
Kubernetes is currently unable to maintain the requested number of pods. Based on the configured metrics, the Horizontal Pod Autoscaler might have drastically increased the required number of replicas in a deployment. When the cluster already runs a high number of pods, Kubernetes may have to provision new Nodes (Virtual Machines) before it can launch new pods. This might take 5 to 20 minutes.
Editing the manifest of a deployment via the Kubernetes API (with kubectl or k9s) can leave Kubernetes unable to start the desired number of pods (e.g. when a wrong image is specified for one of the pods).
If the root cause is a scaling issue, Kubernetes should resolve the problem automatically by adding new Nodes to the cluster. In this case, it’s only a matter of time before the alert resolves. If the alert doesn’t get resolved within 20 minutes, please notify the Cloud Platform team in the #team-tooling room on Slack.
If the problem is with the deployment itself, the affected pods need to be inspected for errors.
Using kubectl to inspect the affected pods:
# get the list of affected pod(s)
% kubectl get pod --selector app=example-app-web -n example-namespace
NAME READY STATUS RESTARTS AGE
example-app-web-74d64bc79f-qnr6g 2/2 Running 0 3h33m
example-app-web-7f6dc8fd78-zgs5f 1/2 InvalidImageName 0 13m
# describe pod
kubectl describe pod example-app-web-7f6dc8fd78-zgs5f -n example-namespace
... output omitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InspectFailed 6m27s (x60 over 13m) kubelet, gke-gap-staging-baseline-pool-0c60d758-cbbc Failed to apply default image tag "eu.gcr.io/ems-gap-images/example-app:latest@sha256:invalidhash": couldn't parse image reference "eu.gcr.io/ems-gap-images/example-app:latest@sha256:invalidhash": invalid reference format
Warning Failed 94s (x62 over 13m) kubelet, gke-gap-staging-baseline-pool-0c60d758-cbbc Error: InvalidImageName
Production Grafana dashboards
Staging Grafana dashboards
- HighInbound4xxErrorRate
- HighInbound5xxErrorRate
- IngressRequestErrors5xx
- IngressRequestErrors4xx
- HighOutbound4xxErrorRate
These alerts trigger when the inbound error rate of 4xx or 5xx HTTP requests exceeds 5% over the last 5 minutes (the condition must hold for one minute).
The alert triggers when the ratio of 4xx or 5xx HTTP requests to all HTTP requests for a given ingress is greater than or equal to 5% over the last 5 minutes (the condition must hold for one minute).
These alerts trigger when the outbound error rate of 4xx HTTP requests exceeds 10% over the last 5 minutes (the condition must hold for one minute).
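The threshold arithmetic behind these alerts can be checked by hand from request counts. The sketch below is an illustration only, not the actual alert rule: `error_ratio_reached` is a hypothetical helper, and it uses the "greater than or equal to" comparison from the ingress alert description.

```shell
# Succeed (exit 0) when errors/total, as a percentage, is at or above
# the threshold. $1 = error count, $2 = total count, $3 = threshold in %.
# Compares e*100 >= p*t to avoid floating-point division.
error_ratio_reached() {
  awk -v e="$1" -v t="$2" -v p="$3" \
    'BEGIN { exit !(t > 0 && e * 100 >= p * t) }'
}

# Usage: 6 errors out of 100 requests against the 5% threshold
#   error_ratio_reached 6 100 5 && echo "threshold reached"
```

With no requests at all (a total of 0) the check fails, so an idle service never counts as reaching the threshold.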
There is a sudden spike in the number of requests and the current number of pods is unable to handle the load.
There are errors in the application itself.
If the Horizontal Pod Autoscaler is configured for the deployment, it should scale it up automatically. If it’s not configured, scale up the application manually (by editing the replicas value in the deployment manifest using k9s).
Temporary solution: Scale up the deployment with kubectl or k9s. For more information, look at Scaling ops-guide.
Permanent solution: create a patch file for the specific deployment with the desired replica count, and then start a deploy.
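The temporary scale-up can be done with `kubectl scale`. A minimal sketch: `scale_deployment` is a hypothetical wrapper, and the deployment name `example-app-web` and the replica count are assumed example values. If a Horizontal Pod Autoscaler manages the deployment, it will adjust the replica count again on its own.

```shell
# Temporarily scale a Deployment to a fixed replica count.
# $1 = Deployment name, $2 = namespace, $3 = desired replicas.
scale_deployment() {
  kubectl scale deployment "$1" -n "$2" --replicas="$3"
}

# Usage (needs cluster access; name and count are example values):
#   scale_deployment example-app-web example-namespace 4
```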
Examine the application logs in the LaaS Kibana, and/or debug the application.
Have a look at the Requests, Latency and Errors panels in the Traffic row on the Deployment dashboard for the given deployment.
- production Deployment dashboard
- staging Deployment dashboard
The pod was killed by Kubernetes because it exceeded its specified memory limit.
Increase the memory limit for the affected deployment in gap.yaml.
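Before raising the limit, it can help to confirm which container was killed. The sketch below reads the last termination reason from the pod status; `last_termination_reasons` is a hypothetical helper and the pod name in the usage line is an assumed example.

```shell
# Print the last termination reason of each container in a pod
# (OOMKilled when a container exceeded its memory limit).
# $1 = pod name, $2 = namespace.
last_termination_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Usage (needs cluster access; the pod name is an example value):
#   last_termination_reasons example-app-web-74d64bc79f-qnr6g example-namespace
```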
Managing Resources for Containers - How Pods with resource limits are run
If your Horizontal Pod Autoscaler (HPA) reaches the maximum number of replicas that you defined in your gap.yaml for your deployment under the autoscaling key (or in a custom component), you will get this alert.
This alert can be triggered by any activity on the deployment that causes it to autoscale (e.g. high request load on your application).
Find out the root cause of your autoscaling and if it’s relevant, increase the maximum number of HPA replicas of your deployment (see more under auto-scaling).
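To see how close a deployment is to its ceiling, the HPA's current and maximum replica counts can be compared. This is a sketch, not the alert's actual expression: `hpa_at_max` is a hypothetical helper, and the HPA name in the usage line is an assumption.

```shell
# Succeed (exit 0) when the HPA's current replica count has reached
# its configured maximum. $1 = HPA name, $2 = namespace.
hpa_at_max() {
  replicas=$(kubectl get hpa "$1" -n "$2" \
    -o jsonpath='{.status.currentReplicas} {.spec.maxReplicas}')
  current=${replicas% *}   # text before the space
  max=${replicas#* }       # text after the space
  [ "$current" -ge "$max" ]
}

# Usage (needs cluster access; the HPA name is an assumed example):
#   hpa_at_max example-app-web example-namespace && echo "HPA is at its maximum"
```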
Production Grafana dashboards
Staging Grafana dashboards
- Documentation site
- Production GAP Grafana
- Staging GAP Grafana
- k9s - Kubernetes CLI for managing Kubernetes clusters
- Using Kubernetes with k9s
- gap.yaml features
- Highly recommended alerts