SLI metric collection

The current configuration approach is a starting point. We will iterate to make metric collection easier to configure as we learn from the teams adopting it.

Instrumenting your code

For more information please see the Telemetry Quick Start guide.

Gap.yaml

If your application is instrumented to emit metrics, an OTel collector sidecar must be added to your deployment to collect the metrics and push them to the metrics storage. You can configure the metric collector sidecar(s) as in the example below.

name: your-application-name
... # usual properties
metricCollectorSidecar:
  enabled: true
deployments:
  ... # define your deployments

The metricCollectorSidecar property on the root level is the global configuration.
- It is applied to all deployments, unless a deployment has its own metricCollectorSidecar property, which overrides the global setting.
Properties:
- enabled: (optional) set to true to enable telemetry metric collection, which injects a sidecar into your deployments. On the root level this is a global setting that enables it for all deployments (defaults to false). If you enable this feature, you must either place an OTel configuration file per environment at gap/<environment>/config/otel.yaml or, when setting it globally, define the exporters via metricCollectorSidecar.config.
- image: the image of the OTel collector sidecar; defaults to eu.gcr.io/ems-gap-images/otel-collector:latest. If you need more plugins, you can use the OTel Collector Contrib image eu.gcr.io/ems-gap-images/otel-collector-contrib instead.
- configMap: name of the ConfigMap containing the OpenTelemetry collector configuration; defaults to my-application-otel-config
- config: (optional) exporter configuration for the OTel collector sidecar
  - exporters: (required) at least one of the following must be set:
    - googlemanagedprometheus: (optional) settings for the googlemanagedprometheus exporter. Any setting supported by the exporter is allowed; see the exporter's documentation.
    - debug: (optional) settings for the debug exporter. Any setting supported by the exporter is allowed; see the exporter's documentation.
- jobShutdownDelay: (optional, integer, seconds) when set on the root level it applies to every cronjob, unless overridden by a cronjob-level jobShutdownDelay. If you use metric aggregation and the main process in your pod terminates, or your job runs to completion, in less time than the aggregation period, the metrics are not sent to Cloud Monitoring (e.g. you aggregate for 60s but the job completes in 20s). To mitigate this, the shutdown of the metric collector container is delayed. The delay defaults to the cronjob's terminationGracePeriodSeconds setting in gap.yaml (default 30 seconds), and you can override it with a shorter or longer value. Note that terminationGracePeriodSeconds in gap.yaml needs to be set high enough, for two reasons: if e.g. aggregation_interval in the OTel config is 60s while terminationGracePeriodSeconds stays at its default of 30s (which is then also the default jobShutdownDelay), a short-lived pod never sends any metrics; and once terminationGracePeriodSeconds is reached, the entire pod is terminated.
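
As an illustration of the properties above, the following is a minimal sketch of a gap.yaml that sets the exporters globally and delays the collector shutdown for short-lived cronjobs. The project ID my-project, the deployment named worker, and the shape of the deployments entry are assumptions made for illustration and are not taken from the gap.yaml schema. The exporter settings shown (project, verbosity) are regular options of the respective OTel exporters.

```yaml
name: your-application-name
metricCollectorSidecar:
  enabled: true          # inject the OTel collector sidecar into all deployments
  jobShutdownDelay: 70   # seconds; assumes a 60s aggregation interval, so the last batch can still be flushed
  config:
    exporters:
      googlemanagedprometheus:
        project: my-project   # hypothetical GCP project ID
      debug:
        verbosity: detailed   # log every exported metric; useful while testing
deployments:
  # Assumed shape of a deployment entry; adapt it to your existing definitions.
  - name: worker
    metricCollectorSidecar:
      enabled: false     # overrides the global setting for this deployment only
```

With these values, the cronjob's terminationGracePeriodSeconds would have to be larger than 70 seconds (for example 90), otherwise the pod is terminated before the shutdown delay has elapsed.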

OTel configuration

You can either set the exporter configuration with the global metricCollectorSidecar.config option documented above, or use the more advanced method shown below.

A config file named otel.yaml must exist in the gap/staging/config or gap/production/config folder. This file is picked up in the last step of manifest generation, and a ConfigMap named my-application-otel-config is generated from it.

"configMap": "my-application-otel-config"

The content of the config file should be a standard OTel collector configuration like the example below. Manifest creation takes care of creating a ConfigMap from this config and attaching it to the OTel sidecar as its configuration.

# gap/staging/config/otel.yaml
receivers:
  otlp: # Only necessary if using OTLP client library
    protocols:
      grpc:
  statsd: # Only necessary if using StatsD client library
    endpoint: 0.0.0.0:8125 # default
    aggregation_interval: 30s # 60s is the default
    enable_metric_type: false
    is_monotonic_counter: false
    timer_histogram_mapping: # default
      - statsd_type: "histogram" # default
        observer_type: "gauge" # default
      - statsd_type: "timer" # default
        observer_type: "gauge" # default
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 70
    spike_limit_percentage: 25
  statsdhistogram: # Necessary for histogram support with StatsD libraries
 
exporters:
  googlecloud:
    # Google Cloud Monitoring returns an error if any of the points are invalid, but still accepts the valid points.
    # Retrying successfully sent points is guaranteed to fail because the points were already written.
    # This results in a loop of unnecessary retries.  For now, disable retry_on_failure.
    retry_on_failure:
      enabled: false
    project: my-project
  
service:
  pipelines:
    metrics/statsd:
      receivers: [statsd]
      processors: [memory_limiter, statsdhistogram]
      exporters: [googlecloud]
    metrics/otlp:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [googlecloud]
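
While setting this up, it can help to see what the sidecar actually receives. The following is a minimal sketch of a troubleshooting variant, assuming that your sidecar image includes the debug exporter (the gap.yaml properties list it as a supported exporter) and that your application only uses the OTLP client library. It prints the received metrics to the sidecar's log instead of sending them to GCP.

```yaml
# gap/staging/config/otel.yaml (troubleshooting variant, not meant for production)
receivers:
  otlp:
    protocols:
      grpc:
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 70
    spike_limit_percentage: 25
exporters:
  debug:
    verbosity: detailed # print each metric data point to the collector log
service:
  pipelines:
    metrics/otlp:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [debug]
```

Once the expected metrics show up in the log, the debug exporter can be replaced with the googlecloud exporter from the example above.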

Metrics storages

GCP Cloud Monitoring

GCP project

A GCP project is needed to push the metrics to. The currently advised way of getting a GCP project is to ask your Tech Lead. The project needs to have

- billing enabled
- the Cloud Monitoring API (formerly Stackdriver Monitoring API) enabled
- the Cloud Trace API (for tracing) enabled

The project can be either a

- service project
- functional / team project

Workload identity

To enable the metrics collection sidecar to push metrics to GCP Cloud Monitoring you need to create a service account and set up Workload Identity.

While setting up Workload Identity, you need to patch all deployment(s) on which you enabled the metric collection sidecar.

Additionally, you need to grant the following roles to the GCP service account:

- roles/monitoring.metricWriter
- roles/cloudtrace.agent
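
On GKE, Workload Identity links a Kubernetes service account to the GCP service account through an annotation. The following is a minimal sketch of that link. The names my-application, my-namespace, metrics-writer, and my-project are placeholders, and how the service account is referenced from your deployments in gap.yaml is not shown here.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-application            # hypothetical Kubernetes service account
  namespace: my-namespace         # hypothetical namespace
  annotations:
    # Standard GKE Workload Identity annotation pointing at the GCP service account
    iam.gke.io/gcp-service-account: metrics-writer@my-project.iam.gserviceaccount.com
```

For the link to work, the GCP service account also needs a roles/iam.workloadIdentityUser binding for the member serviceAccount:<cluster-project>.svc.id.goog[my-namespace/my-application], where <cluster-project> stands for the project of the GKE cluster.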