How to shorten deployment time

By default, all applications are configured to deploy with the RollingUpdate strategy. See more info here. You can fine-tune this strategy with the maxSurge and maxUnavailable properties in the gap.yaml for each of your deployments. The other strategy is called Recreate, which shuts down all pods, waits for them to terminate, and only then creates the new ones. This causes an outage for web applications, whose length depends on the boot and termination time of the pods.

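The default value for both properties is 25% of the replica count. Kubernetes converts the percentage to a whole number of pods, rounding down for maxUnavailable and up for maxSurge. So with two replicas and the default settings, a successful deployment finishes in two iterations, e.g.: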

iteration 1: starting 1 new and stopping 1 old pod
iteration 2: starting the second new pod and stopping the second old pod
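
The percentage-to-pod-count conversion can be sketched in a few lines of Python. This is a minimal illustration that assumes the platform follows upstream Kubernetes rounding (maxSurge up, maxUnavailable down); the function name `resolve_rolling_update` is hypothetical and is not part of gap.yaml or any Kubernetes client.

```python
def resolve_rolling_update(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Convert percentage settings into absolute pod counts (surge, unavailable)."""
    # maxSurge rounds up: integer form of ceil(replicas * pct / 100).
    surge = (replicas * max_surge_pct + 99) // 100
    # maxUnavailable rounds down: integer form of floor(replicas * pct / 100).
    unavailable = (replicas * max_unavailable_pct) // 100
    # If both resolve to zero the rollout could never progress, so allow one
    # unavailable pod (assumption: mirrors the upstream controller's fallback).
    if surge == 0 and unavailable == 0:
        unavailable = 1
    return surge, unavailable
```

For two replicas with the 25% defaults, `resolve_rolling_update(2, 25, 25)` returns `(1, 0)`: one extra pod may be started, and an old pod is stopped only once its replacement is ready.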

Please do not use a high value for the maxSurge property if your deployment has a high replica count, can scale to a high replica count, or has high resource requests. If you are unsure whether a value is high for your setup, contact the GAP team.

Rules of thumb

If your deployment has 2-3 replicas, which is a common use case, the only way to make the deployment faster is to set maxSurge and maxUnavailable to 100%. This is not the same as the Recreate strategy, because new pods start immediately while the old ones are terminating. It still causes a service outage, but a shorter one than with Recreate.
If a deployment has 4 or more pods and can tolerate half of them missing for a time without performance impact, we advise using 50% for both maxSurge and maxUnavailable, which results in 2 or 3 iterations, depending on whether the replica count is even or odd. In this case, keep the maxSurge warning above in mind.
Setting these values to 75% guarantees two iterations, but consider the warning above even more carefully.
You can also experiment with these properties based on the needs of your own application.
For high-traffic web applications, it is better to accept a slower rollout with the default settings, or to increase only maxSurge, than to risk an outage due to unavailable pods.
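
The iteration counts quoted in these rules can be reproduced with a rough estimator. The sketch below assumes a simplified model, not the actual controller logic: maxSurge and maxUnavailable share one percentage, and each iteration replaces a batch of pods sized by that percentage rounded down, with a minimum of one pod. The name `estimate_iterations` is made up for illustration.

```python
import math


def estimate_iterations(replicas: int, percent: int) -> int:
    """Rough rollout iteration count when maxSurge == maxUnavailable == percent."""
    if replicas <= 0:
        return 0
    # Pods replaced per iteration: the percentage rounded down, at least one pod.
    batch = max(1, replicas * percent // 100)
    # Number of batches needed to replace every replica.
    return math.ceil(replicas / batch)
```

Under this model, 4 replicas at 50% give 2 iterations, 5 replicas at 50% give 3, any replica count of 2 or more at 75% gives 2, and 100% gives a single iteration.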

Pod starting and stopping time

The other factor in deployment time is how long the pods take to start and stop. This depends entirely on the application's implementation and determines the length of each iteration of the deployment. Either one can drive up the deployment time.

Let’s say a deployment has 4 pods, with maxSurge and maxUnavailable set to 25%, which means 4 iterations. The deployment will take roughly

$$4 \cdot (\max(T_{start}, T_{stop}) + C)$$

where C is some overhead per iteration. Because of this, if either starting or stopping is slow, the whole deployment is slow.
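
As a quick way to plug numbers into this estimate, here is a small Python helper. It only restates the formula; the function name and the sample timings are hypothetical, and real rollouts will deviate from this rough figure.

```python
def estimate_deployment_seconds(iterations: int, t_start: float, t_stop: float, overhead: float) -> float:
    """Rough rollout duration: iterations * (max(T_start, T_stop) + C)."""
    # Each iteration lasts as long as the slower of pod start and pod stop,
    # plus a fixed per-iteration overhead C.
    return iterations * (max(t_start, t_stop) + overhead)
```

With 4 iterations, a 30-second start, a 10-second stop, and 5 seconds of overhead, the estimate is 140 seconds. If the stop time grows to 90 seconds, the estimate rises to 380 seconds even though startup is unchanged.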