How to do a canary release using a service mesh

First things first, this should be a last resort. If possible, use feature switches. This is only necessary for extensive changes to the codebase that can’t easily be toggled with a feature switch, e.g. major core library version updates, switching a core library like the HTTP server, etc.

The first thing to decide is how to implement the feature change (let’s call the versions v1 and v2) that will need to be canary released. There are two options: you either make both versions of the code coexist in the codebase on the same branch (coexistence method), or you simply override the old version of the code with the new one (succession method).

Coexistence Method

The requirement for this method is the ability to switch between implementations v1 and v2 with a flag passed to the Docker image, e.g. by an environment variable or command-line arg. The benefit is that this allows development of both v1 and v2 versions even while the canary release is in progress. This is also what causes the weakness: the two versions might diverge this way. E.g. if a change is introduced in v1, care must be taken that it gets introduced in v2 as well.
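
As a rough illustration of such a switch, the fragment below passes the version to the container through an environment variable. It is only a sketch: it uses plain Kubernetes Deployment syntax rather than gap.yaml (whose syntax for environment variables is not covered here), and the variable name IMPLEMENTATION_VERSION is a made-up example that the application would have to read at startup.

```yaml
# Hypothetical sketch: a partial Kubernetes Deployment, not gap.yaml syntax.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-web
spec:
  template:
    spec:
      containers:
        - name: web
          command:
            - /opt/docker/bin/start
          env:
            - name: IMPLEMENTATION_VERSION # hypothetical variable name
              value: "v2" # the v1 deployment would set "v1" here
```

With this approach both deployments can run the same image and differ only in the value of the flag.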

Succession Method

This method works by selecting a last commit for v1. From that point on, the codebase is considered v2, and all the necessary changes can be made in it. The main advantage is simplicity, but introducing changes to v1 becomes very difficult: it involves temporarily rolling back all v2 code changes and repointing the v1 deployment to the new v1.

Preparations

At present, this document primarily outlines the setup of canary releases using the Succession method. With some creative thinking, it should be straightforward to modify these steps for the Coexistence method. Additionally, this guide presumes that the service receives traffic from the ingress controller.

Weight-based canary release can be set up using an Istio VirtualService and DestinationRule subsets. Because we use Istio, a prerequisite is that the service is enrolled in the service mesh.

Setting up new deployment

The initial step involves creating a new v1 deployment. The existing deployment will be transitioned to v2. While this transition may initially appear perplexing, it is intended to facilitate a simplified cleanup process and minimize potential traffic disruptions.

Let us assume you have the following gap.yaml deployment:

# gap/gap.yaml
name: my-service
deployments:
  web:
    ingress:
      enabled: true
    command:
      - /opt/docker/bin/start

Add a new v1 deployment and set up the version labels:

# gap/gap.yaml
name: my-service
deployments:
  web:
    podLabels:
      version: v2
      canary: enabled
    ingress:
      enabled: true
    command:
      - /opt/docker/bin/start
  web-v1:
    podLabels:
      version: v1
      canary: enabled
    ingress:
      enabled: false 
    image:
      repository: sap-ems-base-infra-package-p/gap-images/my-service
      tag: sha256:be6bebfde2273ad254e8ef2a210e3c132a46255195408f675b4187420717f65f
    command:
      - /opt/docker/bin/start

The image version for v1 should point to the last v1 commit. Note that ingress is enabled only for v2. At this point, v1 and v2 should be identical releases, so it doesn’t matter which one serves the traffic. However, enabling ingress on v2 avoids any interruption: the existing deployment (now v2) already receives the traffic, so nothing needs to be rerouted. The canary: enabled label will be used later to set up the Service for weight-based routing.

Don’t forget to apply already existing patches to the v1 deployment as well.

At this point, commit and deploy the changes to stage and production. With this, the service should be ready for the traffic routing setup.

Setting up routing

Add a new DestinationRule and patch the VirtualService (which exists because ingress.enabled is true in gap.yaml):

# gap/gap_canary_destination_rule.yaml
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service--web
spec:
  host: my-service-web.relational-data.svc.cluster.local
  subsets:
    - name: v1
      labels:
        version: v1 # this matches the labels we added as podLabels in gap.yaml
    - name: v2
      labels:
        version: v2 # this matches the labels we added as podLabels in gap.yaml

# gap/staging/gap_patch_canary_virtual_service.yaml
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service--web
spec:
  hosts:
    - my-service-staging.gservice.emarsys.com
    - my-service-web.relational-data.svc.cluster.local
  http:
    - route:
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v1 # this refers to the names in the destination rule
          weight: 0 # weight is a sibling of destination, not a field inside it
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v2 # this refers to the names in the destination rule
          weight: 100

# gap/production/gap_patch_canary_virtual_service.yaml
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service--web
spec:
  hosts:
    - my-service.gservice.emarsys.net
    - my-service-web.relational-data.svc.cluster.local
  http:
    - route:
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v1 # this refers to the names in the destination rule
          weight: 0
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v2 # this refers to the names in the destination rule
          weight: 100

This creates a v1 and a v2 subset of the service based on the version labels on the pods. It also patches the existing VirtualService to be aware of the two subsets. Initially, 100% of the traffic is routed to v2 to maintain the existing behavior, as all traffic is already served by v2. Traffic will be routed to v1 once the necessary configuration is complete.

Next, patch the existing Service so that it matches both deployments. To achieve this, remove the app label from the selector (it differs between the two deployments) and add the applicationName label instead, which matches all deployments of the service (even the workers, if they exist!). To ensure that only the intended deployments are matched by the Service, also add the canary: enabled label, which was introduced earlier for this purpose.

# gap/gap_patch_canary_service.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-web
spec:
  selector:
    app: # this removes the app label
    canary: enabled
    applicationName: my-service

In addition, the authorization policy for v1 also needs to be patched to allow traffic from ingress to the pods:

# gap/gap.yaml
...
deployments:
  ...
  web-v1:
    ...
    authorizationPolicy:
      rules:
      - from:
          - source:
              principals:
                - namespace: ingress-nginx
                  serviceAccountName:
                    - "ingress-nginx"

Upon completion of all necessary modifications, the changes can be deployed to the staging and production environments.

Once all of these changes are deployed, it becomes feasible to adjust the weights assigned to the canary VirtualService, distributing traffic based on the weights between v1 and v2 versions.

Ensure that the v1 version is running the correct image. If not, perform one final update. Configure the weights to route 100% of the traffic to v1 by editing the weights in the VirtualService patch files.
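
As a sketch of that starting position, the route section of the production patch could look as follows once the weights are flipped. It reuses the host and subset names from the patch files above; the staging file would change in the same way.

```yaml
# gap/production/gap_patch_canary_virtual_service.yaml (route section only)
spec:
  http:
    - route:
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v1
          weight: 100 # all traffic goes to the frozen v1 deployment
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v2
          weight: 0 # v2 receives no traffic until the rollout starts
```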

Doing the release

At this point, everything is ready for the canary release.

Implement the necessary changes in the codebase and deploy them to production. Since no production traffic is currently directed to v2, this is safe to do. You can test v2 on staging by changing the weights there.

Upon completion of testing, start increasing the traffic to production v2. Depending on how long it takes to confirm that v2 works at a specific weight, this change can be done either by editing the VirtualService using k9s (for short term), or by editing the VirtualService patch file in the repository (long term).

Progressively increase the weight until 100% traffic is directed to the v2 routes.
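
For illustration, an early step of the rollout might send a tenth of the traffic to v2. The 90/10 split below is an arbitrary example, not a recommended schedule; choose steps that match how quickly you can verify v2 at each weight. Keeping the weights summing to 100 lets them be read as percentages.

```yaml
# gap/production/gap_patch_canary_virtual_service.yaml (route section only)
spec:
  http:
    - route:
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v1
          weight: 90 # example value: most traffic stays on v1
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v2
          weight: 10 # example value: a small share goes to the canary
```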

Cleanup

The cleanup happens in three phases to make sure everything is removed in the right order.

Phase 1

Remove the destination to the v1 subset of the service from the gap/(staging|production)/gap_patch_canary_virtual_service.yaml files. At this point, this should be at weight: 0 anyway. Then, remove the canary patch for the Service (gap/gap_patch_canary_service.yaml) and remove the canary deployment from the gap.yaml files (don’t forget about potential patches for the canary deployment).
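
After this step, the VirtualService patch would be left with a single destination, roughly like the sketch below for production (only the route section is shown). The v2 subset is still referenced here, which is why the DestinationRule has to stay in place until the later phases.

```yaml
# gap/production/gap_patch_canary_virtual_service.yaml (route section only)
spec:
  http:
    - route:
        - destination:
            host: my-service-web.relational-data.svc.cluster.local
            subset: v2 # still defined by the DestinationRule
          weight: 100
```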

Once all this is done, deploy to staging and production. Deleted resources will be automatically removed from staging, but on production you need to either delete them manually or use the Prune option when syncing with Argo CD. With this, the deployment is cleaned up.

Phase 2

Remove the VirtualService patch: gap/(staging|production)/gap_patch_canary_virtual_service.yaml

Deploy to staging and production to make sure nothing else refers to the subsets defined in the DestinationRule.

Phase 3

Remove the DestinationRule: gap/gap_canary_destination_rule.yaml

Deploy to staging and production.

Congrats, you are done with the cleanup and the canary release!