Exposing custom metrics to Prometheus

While the platform collects system-level metrics (CPU, RAM usage, etc.) from your deployments by default, it cannot do so for metrics that represent your application’s internal state, or for any other custom metrics you might want to expose and collect automatically. This document explains how to expose these metrics in your application and how to instruct the platform to collect them for you.

Exposing metrics

You must expose your metrics on the /metrics endpoint in Prometheus exposition format.

Libraries that expose metrics from your application already exist for many languages and frameworks. Here’s one for Node.js as an example. You can validate that your metrics endpoint works correctly by sending a simple GET /metrics request to your application.
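If you prefer to see what such a library does under the hood, the following is a minimal sketch that uses only the Python standard library. The metric names, the in-memory METRICS store, and the port are hypothetical and chosen purely for illustration; a real application would more likely rely on an existing client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metric store: name -> (type, help text, value).
METRICS = {
    "example_requests_total": ("counter", "Requests handled so far.", 42),
    "example_queue_depth": ("gauge", "Jobs currently waiting.", 3),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The format expects every line, including the last, to end in a newline.
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; call this from your entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and then requesting GET /metrics on the chosen port should return the rendered text, which is the same check described above.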

Enable metrics collection for your deployments

You can enable metrics collection on a per-deployment basis by setting the collectMetrics flag to true on your deployments in gap.yaml.

  appName: "name-of-your-application"
  namespace: "your-teams-namespace"
  deployments: 
    web:
      collectMetrics: true

Will my metrics be publicly available?

If you set collectMetrics: true for a deployment that has ingress enabled, the /metrics endpoint is automatically excluded from the ingress, so it will not be visible from the public internet.

Another option is to expose these metrics from a separate deployment that has no ingress and is therefore not publicly reachable.

Verifying metrics in Prometheus

You can verify that your exported metrics are being collected by Prometheus by visiting its web UI.
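The same check can also be scripted against the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below assumes you know the base URL of your Prometheus instance (the one used in the usage note is a placeholder) and that your metric carries the app label. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, metric_name, app_label):
    """Build an instant-query URL that selects one metric for one app."""
    query = f'{metric_name}{{app="{app_label}"}}'
    params = urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def has_samples(payload):
    """Return True if a decoded query response contains at least one series."""
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


def is_collected(base_url, metric_name, app_label):
    """Query Prometheus and report whether the metric is being collected."""
    url = build_query_url(base_url, metric_name, app_label)
    with urlopen(url, timeout=10) as response:
        return has_samples(json.load(response))
```

For example, is_collected("http://prometheus.example", "example_metric", "gap-example-project-web") would return True once at least one sample has been scraped, assuming that placeholder URL is replaced with a reachable one.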

Using custom metrics for auto-scaling

By default, your custom metrics are only collected by Prometheus and are not exposed to the autoscalers. To mark a metric as one you would like to autoscale on, add the expose="true" label to it.

Here’s an example metric that would be correctly discovered for pod autoscaling:

example_metric{expose="true"} 1 1586344228
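If you render metric lines by hand rather than through a client library, a small helper along these lines could attach the expose label. This is only a sketch: format_sample is a hypothetical name, and the optional timestamp is left out for brevity.

```python
def format_sample(name, labels, value):
    """Format one sample line in the Prometheus text exposition format."""

    def escape(text):
        # Label values must escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    label_text = ",".join(
        f'{key}="{escape(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Mark the metric as available to the autoscalers.
line = format_sample("example_metric", {"expose": "true"}, 1)
```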

Setting this label provides six aggregated metrics to the autoscalers:

[metric_name]_rate (e.g. example_metric_rate): The rate of change of this metric over the past 2 minutes.
[metric_name]_max: The maximum value of this metric observed in the past 2 minutes.
[metric_name]_min: The minimum value of this metric observed in the past 2 minutes.
[metric_name]_count: The number of data points of this metric observed in the past 2 minutes.
[metric_name]_sum: The sum of all data points of this metric observed in the past 2 minutes.
[metric_name]_avg: The average of all data points of this metric observed in the past 2 minutes.
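To make the six aggregations concrete, the sketch below computes them over a list of (timestamp, value) samples. It is an illustration, not the platform's implementation. In particular, the rate here is assumed to be the change between the first and last sample in the window divided by the elapsed seconds, and the platform may define it differently.

```python
def aggregate(samples, now, window_seconds=120):
    """Aggregate (unix_timestamp, value) samples, oldest first, over a window.

    Returns None when no sample falls inside the window.
    """
    recent = [(ts, v) for ts, v in samples if now - window_seconds <= ts <= now]
    if not recent:
        return None
    values = [v for _, v in recent]
    first_ts, first_value = recent[0]
    last_ts, last_value = recent[-1]
    elapsed = last_ts - first_ts
    return {
        # Assumed definition: per-second change across the window.
        "rate": (last_value - first_value) / elapsed if elapsed > 0 else 0.0,
        "max": max(values),
        "min": min(values),
        "count": len(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }


# Hypothetical samples; the first one is older than 2 minutes and is ignored.
result = aggregate([(0, 1.0), (100, 2.0), (160, 4.0), (220, 6.0)], now=220)
```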

All exported metrics are scoped by pod. The app and applicationName labels generated by our deploy tool are automatically added to the metric, so filter on these labels to avoid scaling on metrics from other applications.

Here is an example of a horizontal pod autoscaler that makes use of one of these metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: [your-deployment-name]
  namespace: [your-team-namespace]
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: [your-deployment-name]
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: "[metric_name]_sum"
        selector:
          matchLabels:
            status_code: "200"
            app: gap-example-project-web
      target:
        type: AverageValue
        averageValue: 1

As usual, the HPA will scale up if the observed value exceeds the target, and down if it falls below the target. For a Pods metric, the observed value used for comparison is the metric averaged across the currently active pods in the deployment, and the target is the value provided in the target.averageValue field.
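As a rough illustration of how the scaling decision plays out, the sketch below follows the replica calculation documented for upstream Kubernetes: per-pod values are averaged, the ratio of that average to the target is taken, and the current replica count is scaled by the ratio and rounded up. The 10% tolerance is the upstream default and is an assumption about your cluster, as is the function name.

```python
import math


def desired_replicas(per_pod_values, target_average,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Estimate the replica count an HPA would request for a Pods metric."""
    current = len(per_pod_values)
    observed_average = sum(per_pod_values) / current
    ratio = observed_average / target_average
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))
```

With two pods each reporting 2.0 against a target of 1, the function returns 4. With four pods each reporting 0.5, it returns 2.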