Upgrade Istio
Please note that Istio officially supports jumping at most 2 minor versions at a time.
The guide is the same for staging and production, but staging should be upgraded first so the new version can be tested there.
- Check the behavioral changes that might be introduced with `istioctl x precheck --from-version 1.xx`, replacing the `xx` with the actual version to be checked against.
- Check the changelogs for the new version(s) here for any breaking changes! Please note that according to Istio, it is safe to jump at most 2 minor versions.
- Upgrade the Istio Base and then the Istio CNI. Only the versions linked in the `kustomization.yaml` files need to be changed.
- Bring up the canary version by:
- Test the new version by changing the label `istio.io/rev: default` to `istio.io/rev: canary` and restarting a deployment, for example `gap-docs`. After the new proxy version gets injected, open gap-docs to see if it loads properly, then check the Istio proxy logs to confirm that the traffic gets logged correctly.
- Revert the manually set canary label of the test deployment to `default`, rename `config.json` in the canary dir of gap-gitops to `RENAME-TO-CONFIG-JSON-DURING-UPGRADE`, and bring down the canary by deleting the respective environment’s `istiod-canary` app.
- If all is well, repeat the steps from the 4th step to upgrade the default Istio revision here ⚠️ (DO NOT PRUNE THE OLD ISTIO COMPONENTS IN THE ISTIOD APP).
- Restart some of our own meshed components by hand, such as Ingress-nginx, and do not forget to restart the Overseer daemonset or any other daemonset we may have in the future. (Possible improvement: incorporate these restarts into the restart script.)
- Set the `defaultRevision` property in the Istio Base chart values to the new revision (don’t use `default`, but the exact revision, e.g. `1-21`) and sync the chart.
  - This is to make sure that the `ValidatingWebhookConfiguration` named `istiod-default-validator` points to an existing service (e.g. `admissionReviewVersions.clientConfig.service: istiod-1-21`) and that the `failurePolicy` is `Fail`. Otherwise the validation will not work and invalid configurations are silently admitted into the cluster.
- Update the Kiali revisions here, here and here for stage and here, here and here for prod.
- Update the Istio PDB patch revisions here and here for stage and here #update and here #update for prod
- Make sure all the gateways are injected with the new istio proxies
- ⚠️ Don’t prune the components of the old version in the istiod default app until nothing is running on the old Istio version. After some time, check with the one-liner below to see which pods are still running with the older version proxies.
kubectl get pods -A -o json | jq -r '.items[] | select(.metadata.annotations["istio.io/rev"]=="1-XX" and .status.phase=="Running") | .metadata.name + " " + .metadata.namespace'
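The one-liner prints one `pod namespace` pair per line. To see at a glance how the remaining pods are spread across namespaces, its output can be grouped per namespace. The following is a minimal sketch; `count_by_namespace` is a hypothetical helper name and not part of the existing tooling.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original runbook.
# stdin:  "<pod> <namespace>" lines, as printed by the one-liner above
# stdout: "<namespace> <pod count>" lines, sorted by namespace
count_by_namespace() {
  awk 'NF >= 2 { count[$2]++ } END { for (ns in count) print ns, count[ns] }' | sort
}
```

Pipe the one-liner into it, for example `kubectl get pods -A -o json | jq -r '<filter from the one-liner>' | count_by_namespace`. A short result suggests that a hand-written namespace list is practical.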
Then you can run the following script to restart all the deployments and statefulsets which have pods running with the older proxy versions. Based on the output of the one-liner above, you can give the script a list of namespaces to work over if there are not too many (see the commented example line). Otherwise it can go over all the namespaces.
Be aware that the script below will not restart everything, so we should still check with the one-liner above what remains afterwards, such as long-running jobs. ⚠️ DO NOT PRUNE the old istiod components in the istiod app without making sure that no workload or system component pod is running with the older istiod proxies (e.g. mtls proxies are ok), as pruning the older istiod will break those pods.
Before running the script below, post in #infra-announcements:
⚠️ GAP Stage Service Mesh or ❗ GAP Production Service Mesh
As the last step of the Istio Service Mesh upgrades on the
#!/bin/bash
# Restart deployments and statefulsets that still have running pods on the old Istio revision.
revision="1-XX"
namespaces=$(kubectl get namespaces -o=jsonpath='{.items[*].metadata.name}') # to get all namespaces
# namespaces="mobile-engage mobile-engage-qa" # example
for namespace in $namespaces; do
  echo "---------- Checking namespace: $namespace ----------"
  deployments=$(kubectl get deployments -n "$namespace" -o=jsonpath='{.items[*].metadata.name}')
  for deployment in $deployments; do
    # pods are matched through the app label, which is expected to equal the deployment name
    podsWithRevision=$(kubectl get pods -n "$namespace" -l "app=$deployment" --field-selector=status.phase=Running -o=jsonpath='{.items[*].metadata.annotations.istio\.io\/rev}' | grep -w -- "$revision")
    if [ -n "$podsWithRevision" ]; then
      echo "Restarting deployment: $deployment"
      kubectl rollout restart deployment "$deployment" -n "$namespace"
    fi
  done
  statefulsets=$(kubectl get statefulsets -n "$namespace" -o=jsonpath='{.items[*].metadata.name}')
  for statefulset in $statefulsets; do
    # we use the service.istio.io/canonical-name label for statefulsets because the app label is not present in some statefulsets
    podsWithRevision=$(kubectl get pods -n "$namespace" -l "service.istio.io/canonical-name=$statefulset" --field-selector=status.phase=Running -o=jsonpath='{.items[*].metadata.annotations.istio\.io\/rev}' | grep -w -- "$revision")
    if [ -n "$podsWithRevision" ]; then
      echo "Restarting statefulset: $statefulset"
      kubectl rollout restart statefulset "$statefulset" -n "$namespace"
    fi
  done
done
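The script above only covers deployments and statefulsets, while daemonsets such as Overseer are restarted by hand. As a possible complement, the sketch below takes the `pod namespace` lines printed by the one-liner, looks up each pod's owner, and restarts the owning daemonsets. `DRY_RUN`, `run`, `daemonsets_from_pods` and `restart_daemonsets` are hypothetical names introduced for illustration, and the sketch assumes that a daemonset pod lists its DaemonSet as the first entry in `ownerReferences`.

```shell
#!/bin/bash
# Sketch only: DRY_RUN and the function names below are assumptions,
# not part of the original restart script.

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# stdin:  "<pod> <namespace>" lines from the one-liner
# stdout: unique "<namespace> <daemonset>" lines for pods owned by a DaemonSet
daemonsets_from_pods() {
  local pod namespace owner
  while read -r pod namespace; do
    [ -n "$namespace" ] || continue
    owner=$(kubectl get pod "$pod" -n "$namespace" \
      -o jsonpath='{.metadata.ownerReferences[0].kind} {.metadata.ownerReferences[0].name}' </dev/null)
    case "$owner" in
      "DaemonSet "*) echo "$namespace ${owner#DaemonSet }" ;;
    esac
  done | sort -u
}

# Restart each daemonset found, or only print the commands in dry-run mode.
restart_daemonsets() {
  local namespace daemonset
  daemonsets_from_pods | while read -r namespace daemonset; do
    run kubectl rollout restart daemonset "$daemonset" -n "$namespace" </dev/null
  done
}
```

Feeding the one-liner's output into `restart_daemonsets` prints the restart commands by default. Setting `DRY_RUN=0` would execute them, so review the dry-run output first.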
- Make sure that there are no pods running old version proxies, using the one-liner above. Then the old Istio components can be pruned.
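This final check can be turned into an explicit gate so that pruning is only considered when the one-liner returns nothing. The sketch below is one way to do that; `prune_gate` is a hypothetical name, and the function only inspects the text it receives on stdin.

```shell
#!/bin/bash
# Hypothetical gate, not part of the original runbook.
# stdin: output of the one-liner ("<pod> <namespace>" per line)
# Returns 0 when no pod is left on the old revision, 1 otherwise.
prune_gate() {
  local remaining
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  remaining=$(grep -c . || true)
  if [ "$remaining" -gt 0 ]; then
    echo "NOT safe to prune: $remaining pod(s) still on the old revision" >&2
    return 1
  fi
  echo "No pods on the old revision"
}
```

Pipe the one-liner into `prune_gate` and only continue with pruning when it returns successfully. Pods that finished in the meantime are already excluded by the `Running` phase filter in the one-liner.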