SLI metric collection
The current configuration method is a starting point. We will iterate on making metric collection easier to configure as we learn from the teams adopting it.
For more information please see the Telemetry Quick Start guide.
If your application is instrumented to expose metrics, an OTel collector sidecar needs to be added to your deployment to collect the metrics and push them to the metrics storage. You can configure the metric collector sidecar(s) as in the example below.
name: your-application-name
... # usual properties
metricCollectorSidecar:
  enabled: true
deployments:
  ... # define your deployments
- the metricCollectorSidecar on the root level is the global configuration
  - it will be applied to all deployments, except if the deployment has its own metricCollectorSidecar property which overrides the global setting
- properties
  - enabled: (optional) true, if you want to enable the telemetry metric collection, thus injecting a sidecar to your deployments. This is a global setting that will enable it for all deployments (defaults to false). If you enable this feature, you either need to put an otel configuration file per environment at the path gap/<environment>/config/otel.yaml or, when setting it globally, define the exporters via metricCollectorSidecar.config.
  - image: the image of the OTel collector sidecar; defaults to eu.gcr.io/ems-gap-images/otel-collector:latest. Alternatively, you can use the OTel Collector Contrib image, eu.gcr.io/ems-gap-images/otel-collector-contrib, if you need more plugins.
  - configMap: name of the ConfigMap containing the OpenTelemetry collector configuration; defaults to my-application-otel-config
  - config: (optional) configuration for the otel sidecar config exporters
    - exporters: (required) at least one of the following must be set:
      - googlemanagedprometheus: (optional) exporter settings for the googlemanagedprometheus exporter. Any setting is allowed that can be found in the exporter. Docs can be found here.
      - debug: (optional) exporter settings for the debug exporter. Any setting is allowed that can be found in the exporter. Docs can be found here.
  - jobShutdownDelay: (optional, integer seconds) applies to every cronjob when set on the root level, unless overridden by a cronjob-level jobShutdownDelay. If you use metric aggregation and the main process in your pod terminates, or your job runs to completion, in less time than the aggregation period, the metrics will not be sent to Cloud Monitoring (e.g. you aggregate for 60s but the job completes in 20s). To mitigate this, the shutdown of the metric collector container is delayed. The delay defaults to the terminationGracePeriodSeconds cronjob setting in gap.yaml (default 30 seconds), and you can override that default with a shorter or longer delay. Please note that terminationGracePeriodSeconds in gap.yaml needs to be set to a higher value, for two reasons. First, if for example aggregation_interval in the otel config is set to 60s while terminationGracePeriodSeconds is kept at its default of 30s (which is then also the default of jobShutdownDelay), no metric will ever be sent from a short-lived pod. Second, once terminationGracePeriodSeconds is reached, the entire pod is terminated.
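As an illustration of the properties above, here is a minimal sketch of a global sidecar configuration that defines the exporters inline and lengthens the shutdown delay. The value `my-project`, the chosen delay of 70 seconds and the exporter settings (`project`, `verbosity`) are assumptions made for the example; check the exporter documentation for the settings your collector version supports.

```yaml
# gap.yaml (sketch; values are illustrative assumptions)
name: your-application-name
metricCollectorSidecar:
  enabled: true
  # Optional: the contrib image, only if you need the additional plugins.
  image: eu.gcr.io/ems-gap-images/otel-collector-contrib
  # Seconds. Chosen here to be longer than an assumed 60s aggregation
  # interval, so a short-lived job still gets one full flush.
  jobShutdownDelay: 70
  config:
    exporters:
      googlemanagedprometheus:
        project: my-project   # assumed target GCP project
      debug:
        verbosity: detailed   # prints exported metrics to the sidecar log
```

With a delay of 70 seconds, the terminationGracePeriodSeconds of the affected cronjobs would have to be set above 70 as well, since the whole pod is terminated once that period is reached.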
You can either set the exporter configuration with the global metricCollectorSidecar.config option as documented above, or use the more advanced method shown below.
A config file named otel.yaml should exist in the gap/staging/config or gap/production/config folder. This file is picked up in the last step of the manifest generation, and a ConfigMap named my-application-otel-config is generated from it.
"configMap": "my-application-otel-config"
The content of the config file should be a standard OTel collector configuration like the example below. The manifest generation takes care of creating a ConfigMap from this file and attaching it to the OTel sidecar as its configuration.
# gap/staging/config/otel.yaml
receivers:
  otlp: # Only necessary if using OTLP client library
    protocols:
      grpc:
  statsd: # Only necessary if using StatsD client library
    endpoint: 0.0.0.0:8125 #default
    aggregation_interval: 30s # 60s is the default
    enable_metric_type: false
    is_monotonic_counter: false
    timer_histogram_mapping: #default
      - statsd_type: "histogram" #default
        observer_type: "gauge" #default
      - statsd_type: "timer" #default
        observer_type: "gauge" #default
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 70
    spike_limit_percentage: 25
  statsdhistogram: # Necessary for histogram support with StatsD libraries
exporters:
  googlecloud:
    # Google Cloud Monitoring returns an error if any of the points are invalid, but still accepts the valid points.
    # Retrying successfully sent points is guaranteed to fail because the points were already written.
    # This results in a loop of unnecessary retries. For now, disable retry_on_failure.
    retry_on_failure:
      enabled: false
    project: my-project
service:
  pipelines:
    metrics/statsd:
      receivers: [statsd]
      processors: [memory_limiter, statsdhistogram]
      exporters: [googlecloud]
    metrics/otlp:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [googlecloud]
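When metrics do not show up in Cloud Monitoring, it can help to confirm first that the sidecar receives them at all. The following is a minimal sketch of a troubleshooting variant of the file above that sends OTLP metrics to the collector's debug exporter only. It assumes the sidecar image ships the debug exporter and that your application uses the OTLP client library; it is not part of the documented setup.

```yaml
# gap/staging/config/otel.yaml (troubleshooting sketch, assumed variant)
receivers:
  otlp:
    protocols:
      grpc:
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 70
    spike_limit_percentage: 25
exporters:
  debug:
    verbosity: detailed # log every received metric point in the sidecar output
service:
  pipelines:
    metrics/otlp:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [debug]
```

Once the metrics are visible in the sidecar log, switch the pipeline's exporter back to googlecloud.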
We need a GCP project set up to push the metrics to. The currently advised way of creating a GCP project is to ask your Tech Lead. The project needs to have
- billing enabled
- the Cloud Monitoring API (formerly Stackdriver Monitoring API) enabled
- the Cloud Trace API (for tracing) enabled
The type of project can be a
- service project
- functional / team project
To enable the metrics collection sidecar to push metrics to GCP Cloud Monitoring you need to create a service account and set up Workload Identity.
While setting up Workload Identity, you need to patch every deployment on which you enabled the metric collection sidecar.
Additionally, you need to grant the following roles to the GCP service account:
- roles/monitoring.metricWriter
- roles/cloudtrace.agent
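For orientation, the Kubernetes side of a Workload Identity setup can be sketched as follows. This shows plain Kubernetes manifests, not the GAP-specific way of patching deployments, and every name in it (`my-application-metrics`, `my-namespace`, `my-project`, `my-application`) is a hypothetical placeholder. On GKE, the GCP service account additionally needs an IAM binding that grants roles/iam.workloadIdentityUser to the Kubernetes service account.

```yaml
# Kubernetes ServiceAccount linked to a GCP service account (names are hypothetical)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-application-metrics
  namespace: my-namespace
  annotations:
    # GKE Workload Identity annotation pointing at the GCP service account
    iam.gke.io/gcp-service-account: my-application-metrics@my-project.iam.gserviceaccount.com
---
# Patch fragment: run the deployment's pods as that ServiceAccount
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
  namespace: my-namespace
spec:
  template:
    spec:
      serviceAccountName: my-application-metrics
```

Every deployment that has the sidecar enabled would need the same serviceAccountName patch, so that the collector container can authenticate as the GCP service account holding the two roles above.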