Cloud Platform Alerts Playbook

UnableAllocateIP:

Check the Grafana Cluster health overview dashboard, Pod count per namespace panel, and look at the autoscaling trend. If the count keeps climbing, somebody has probably misconfigured something. If it is already coming down from a spike, the alert will self-resolve. If nothing obvious shows up there, open the Nodepools dashboard and compare the Standard and Baseline pool CPU requests and node counts against the number of IPs available. Also go to K9s, switch to all namespaces, and press Shift+A to sort pods by age; this shows which namespace is scaling out the nodes. Ask the owning team whether it is intentional (performance tests, for example). If the count did not skyrocket (it increased by about 5, say), wait it out; if it is growing consistently, ask the team what is going on.
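The rule of thumb above can be sketched as a small helper. This is only an illustration, not part of our tooling: the function name, the sample values and the default threshold of 5 are assumptions taken from the text, and the input is whatever pod counts you read off the Grafana panel.

```python
def classify_pod_trend(pod_counts: list[int], small_increase: int = 5) -> str:
    """Classify a series of pod counts (oldest first) read off the Grafana panel."""
    if len(pod_counts) < 2:
        return "not enough data"
    first, last, peak = pod_counts[0], pod_counts[-1], max(pod_counts)
    if last < peak:
        # Already coming down from the spike: the alert should self-resolve.
        return "recovering"
    if last - first <= small_increase:
        # Small bump only: wait it out.
        return "wait"
    # Still at the peak and well above the starting point: ask the owning team.
    return "ask the team"


print(classify_pod_trend([100, 180, 150]))  # recovering
print(classify_pod_trend([100, 102, 104]))  # wait
print(classify_pod_trend([100, 150, 220]))  # ask the team
```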

A solution that worked once: we increased the node alert on staging, then increased ip_count in Terraform under clusters/staging-euw3. Before that, we queried kube_node_info in Prometheus with a =~ regex match for the baseline and standard labels to see how many nodes belong to each pool.
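The kube_node_info check could look roughly like the query built below. This is a minimal sketch: it assumes the pool name is part of the node name (as is usual for GKE node names) and matches on the `node` label. The playbook does not record which label was actually used, so adjust the label and regex as needed.

```python
def node_count_query(pool: str, label: str = "node") -> str:
    """Build a PromQL query counting kube_node_info series for one node pool."""
    # Assumption: the pool name appears somewhere in the value of `label`.
    return f'count(kube_node_info{{{label}=~".*{pool}.*"}})'


for pool in ("baseline", "standard"):
    # Paste the printed query into Prometheus / Thanos Query.
    print(node_count_query(pool))
```

Compare the two counts with the ip_count configured for the cluster.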

ContainerOOMKilled:

Check the limits and requests of the pod.
Check on Grafana, deployments or statefulset dashboard, trend for the memory usage.
Check the node of the pod in K9s: open the node view (:node) and use the / search to find it.
- Under the conditions you can see MemoryPressure is false or true.
- You can check the events under Allocated resources.
- Check the memory percentage of the node.
Grafana node usage dashboard: https://monitoring-staging.gservice.emarsys.com/d/nodeusage/node-usage?orgId=1. Check the free memory/CPU of the node in Thanos Query (node_memory_MemFree_bytes, for example). Also check the memory/CPU usage of the pod. Just start typing in Thanos Query and it will autocomplete your queries.
Nowadays we get this a lot because of thanos getting OOM killed.
If a prometheus pod gets this in an infinite loop we might need to manually delete the wal files from the prometheus container that gets OOMKilled again and again. Shelling into the container via K9s and deleting everything in the wal folder inside the prometheus folder should solve this issue.
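When comparing usage with limits, note that the metrics report bytes while the pod spec uses Kubernetes quantities such as 512Mi. The helper below is a minimal sketch for that comparison. The function names are made up for illustration, and it only handles the common memory suffixes.

```python
# Multipliers for the common Kubernetes memory quantity suffixes.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
         "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    # Try two-letter suffixes first so 'Gi' is not mistaken for 'G'.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * UNITS[suffix])
    return int(quantity)  # no suffix: plain bytes


def usage_ratio(used_bytes: int, limit: str) -> float:
    """Fraction of the memory limit currently in use."""
    return used_bytes / parse_memory(limit)


print(parse_memory("512Mi"))  # 536870912
print(usage_ratio(480 * 2**20, "512Mi"))  # 0.9375
```

A ratio that stays close to 1.0 before each kill suggests the limit is simply too low for the workload.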

CertificateNotReady:

This could be a no-op: with the new version of cert-manager it should resolve automatically. Still, check the CNAME and notify the team if something is wrong with it. You can find the CNAME (domain) in the Ingress object in the namespace given by the exported_namespace of the alert. Check for typos, and also check the systec_requests CNAME requests. After the CNAME request is accepted, go to the Certificate object and delete it if it is not re-created, and watch the CertificateRequest object to see what is happening. Check the ingresses in the namespace and run the host command against the domain; the team might not have registered the CNAME. As a last resort, delete the ingress, but only very carefully and after verification (check the commit history of the owning team's repo).

OverseerRequestError:

With this alert we can see which destinations the overseer requests are failing for. Run the query in the description of the alert to find the otel collector DaemonSet instances that report the metrics from the failing overseers' nodes. From those we can find the nodes on which the failing overseers are running, e.g. by going to K9s, listing pods in all namespaces and filtering for the pod IP. K9s then shows which node the otel collector pod is running on.

If the errors come from all sorts of nodes (standard, baseline, fixip), it is some network problem. The nodes might all have to be cordoned and drained, but beware of White List, Fix IP and Cluster Components: don't cordon all of them at the same time. If the errors come from a single node, you can cordon and drain it. Try kubectl debug node as well; it might be helpful.
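The triage above can be written down as a small decision helper. This is a sketch for illustration only: the function name and the node-to-pool mapping are hypothetical, and the fallback branch for several nodes in one pool is our own addition, not something the playbook prescribes.

```python
def suggest_action(failing_nodes: dict[str, str]) -> str:
    """failing_nodes maps node name -> pool name (e.g. 'standard', 'baseline', 'fixip')."""
    if not failing_nodes:
        return "no failing nodes"
    if len(failing_nodes) == 1:
        return "cordon and drain the single node"
    if len(set(failing_nodes.values())) > 1:
        # Errors spread over several pools point to a network problem.
        return "network problem: cordon and drain in batches, never all at once"
    # Assumption: this case is not covered by the playbook.
    return "several nodes in one pool: investigate further"


print(suggest_action({"node-a": "standard"}))
print(suggest_action({"node-a": "standard", "node-b": "fixip"}))
```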

AppUnknown error:

Check Argo CD and filter by Unknown. If a lot of applications are Unknown, GitHub might be down. If just one app is affected, it could be a config issue. Check the logs of the argo-server and argo-repo-server deployments in the HQ cluster.

prometheus/alertmanager_notifications_failed_total/counter for (ems-gap-stage|ems-gap-production) with metric labels {reason=<clientError|contextCanceled|contextDeadlineExceeded>} is above the threshold of 0.000 with a value of (X):

Check the timeframe in the google alert’s page whose link is included in the pagerduty notification, and go to the Cloud logging page here for stage and here for prod and look for resource.labels.container_name="alertmanager" in the query with the correct timeframe (the links should already have the query included).

Note that if you get some error log there with something like notify retry canceled after 17 attempts: unexpected status code 429, then we are being rate-limited by Pagerduty, soon after there should be a Notify success log with the same alert with the same label_application_name field value.
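If you export the log entries, the 429 pattern can be checked mechanically. The sketch below assumes each entry is a dict with a `message` and a `label_application_name` key and that entries are in chronological order. That shape is an assumption for illustration; the real Cloud Logging export will look different and needs to be mapped first.

```python
def delivered_after_rate_limit(entries: list[dict]) -> set[str]:
    """Return application names whose 429 failure was later followed by a success."""
    rate_limited: set[str] = set()
    recovered: set[str] = set()
    for entry in entries:  # entries must be in chronological order
        app = entry["label_application_name"]
        if "unexpected status code 429" in entry["message"]:
            rate_limited.add(app)
        elif "Notify success" in entry["message"] and app in rate_limited:
            recovered.add(app)
    return recovered


sample = [
    {"message": "notify retry canceled after 17 attempts: unexpected status code 429",
     "label_application_name": "app-a"},
    {"message": "Notify success", "label_application_name": "app-a"},
]
print(delivered_after_rate_limit(sample))  # {'app-a'}
```

Applications that were rate-limited but never show up in the result are the ones worth a closer look.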

Deployment/StatefulSet/Daemonset/ReplicasUnavailable:

A Deployment/StatefulSet/DaemonSet has some replicas but not all of them (for example it has 3 but 5 are required), or they are not healthy. There is a grace period of 5 or 10 minutes for a pod to start; if it is not ready by then, it is counted as unavailable. Check the events and logs of the failing pods. It might also be worth checking the ReplicaSet. kube-dns takes a long time to complete a rollout restart on prod (about 50 minutes last time); as it is managed by GKE, we cannot optimize the restart strategy.
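The grace period logic can be illustrated as follows. This is a minimal sketch, not the actual alert rule: the function name is hypothetical and the default of 10 minutes is one of the two values mentioned above.

```python
from datetime import datetime, timedelta


def counts_as_unavailable(started_at: datetime, ready: bool, now: datetime,
                          grace: timedelta = timedelta(minutes=10)) -> bool:
    """A pod counts as unavailable only if it is still not ready after the grace period."""
    return not ready and now - started_at > grace


now = datetime(2024, 1, 1, 12, 0)
print(counts_as_unavailable(now - timedelta(minutes=15), False, now))  # True
print(counts_as_unavailable(now - timedelta(minutes=3), False, now))   # False
```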

AP Whitelist Internal Pool/Fix IP general pool IP change:

Look up the IP from the alert under IP addresses in Google Cloud. If the IP is in the gap-production-whitelist-internal-ip name range, it is okay; if it is from baseline, for example, it is a problem.
Check that the IP name and the nodes correlate.
https://console.cloud.google.com/networking/addresses/list?project=ems-gap-stage
https://console.cloud.google.com/networking/addresses/list?project=ems-gap-production

Oldest unacked message age for ems-gap-production laas-sink-subscription is above the threshold of 900.000:

Go to Pub/Sub in production and check the oldest unacked message in the laas sink and the mobile laas sink; it is probably Systec doing something.
Check the charts at the bottom of https://console.cloud.google.com/monitoring?project=ems-gap-production&timeDomain=1h
Enabling the gap-fallback-log-router log router might help.

Image pull failure:

We see these during upgrades, when pulling all the images takes longer; it will self-resolve. It might also be a node network issue (correlate with GAP Network error) or a firewall misconfiguration. This can be verified by pulling from the registry locally. If that works, try to ping the registry from a debug container (one that has ping and telnet).

Job status failed:

On the K9s jobs page, describe the pod, or check the logs in LaaS. Clean up the failed job to resolve the alert.

IngressRequestErrorsHigh:

Check the Nginx error log (Nginx access logs are in LaaS) and the application logs. The webs might have a bug or a dependency problem. It might also be that there are no backends: the webs were scaled down but requests are still coming in.

An uptime check on ems-gap-production:

Ingress not reachable from the outside. Might happen because of nginx upgrade. Might be some Google thing.

Check the Ingress controller dashboard; if there is no traffic, it is a huge incident.
Check whether the ingress is running, and running correctly. Determine whether it is a config issue or a Google issue; check the Google status page.
Check the pods, and try to make health check requests to the services.

KubePersistentVolumeFillingUp:

It can happen every 14 days because of Thanos compact; check it on the StatefulSet dashboard in Grafana.

Unacked messages for ems-gap-production laas-sink-subscription is above the threshold and Oldest unacked message age for ems-gap-production laas-sink-subscription is above the threshold:

Check the metrics explorer dashboard to understand what is the scale of the logging increase. Systec should be able to help, and they should already be alerted as well. If during working hours notify @sep at #private-infra-support.

Also you can use this grafana dashboard to tell if something goes wrong: https://grafana.service.emarsys.net/d/NJpSLB9nz/laas-monitoring?orgId=1&from=now-30m&to=now&refresh=1m (needs VPN and emarsys AD login with shortname and AD password), the top left graph is the incoming logs for the pipeline and you can even filter it to stage/prod for gap (and same for the gap-me pubsubs).

quota-usage-alert-low or quota-usage-alert-high

GCP quota page for prod, where the quotas can be adjusted: https://console.cloud.google.com/iam-admin/quotas?referrer=search&project=ems-gap-production

PendingPod

Might happen during cluster upgrades, otherwise go into k9s and investigate the status of the pod etc.

Kyverno Alerts

KyvernoHighNumberOfPolicyFailures, KyvernoLowNumberOfAdmissionReviewPasses and KyvernoLowNumberOfAdmissionRequests

The KyvernoHighNumberOfPolicyFailures alert indicates that one or more of the policies we have, such as require-run-as-nonroot, are failing against e.g a workload trying to run as root.

The latter two can indicate that a lower than normal number of objects is passing through the admission controller to be let in by the validation policies applied to them. They can be an indicator of a more cluster-wide issue, or of the Kyverno components not being in good shape. See the section below for the debugging steps.

Debugging

Information sources for debugging the above three Kyverno alerts:

Observing the Kyverno Metrics Grafana dashboard under the cluster dashboards folder.
Port forwarding the kyverno-policies-ui and looking through the UI, the error logs should also be combined there.
Checking the logs of the kyverno controller pods, especially the admission controllers.

KyvernoChangeInNumberOfActivePolicies

Possibly we forgot to update the number of actual policies in the alert config; we should have received the relevant alert at any rate. Otherwise a policy might have been accidentally deleted.

Istio Alerts

Check the Istio Control Plane Dashboard under the Cluster folder in Grafana.

Google-managed Prometheus Alerts

CollectorExcessiveRestarts

The CollectorExcessiveRestarts alert fires when one of the collector Pods in the gmp-system namespace has restarted more than 10 times in the past hour, indicating that the Pod is entering a CrashLoopBackOff status.

This can happen either when:

collector Pod is OOMKilled (the limits are managed by Google)
Writing to the time series database fails

The root cause we have identified so far in both cases is that there are containers on GAP that expose overly large Prometheus metrics on their metrics endpoint. For example, the size of the ingress-nginx metrics grows monotonically over time.

The only remedy we have found so far is to identify the workloads exposing too many metrics and restart them. Istio metrics are scraped via the otel-collectors in the istio-system namespace.