How do dynamic workers work in Octopus Cloud?

I’ve noticed some odd behavior with Octopus Cloud and Dynamic Workers. Sometimes it takes a few seconds to lease a dynamic worker and other times it takes over a minute. And it looks like the workers are being re-used. How are dynamic workers supposed to work in Octopus Cloud?

Before diving into the weeds, please have a look at the documentation explaining some of the basics around dynamic workers.

Dynamic Workers are VMs whose lifecycle we manage. Each VM has a Tentacle installed along with some additional tooling to make it useful. We maintain a set of “hot workers” that are ready for a customer to pick up. If we didn’t do that, you’d have to wait for a VM to be provisioned, which with Azure can take anywhere from 3 to 20 minutes.

When you first run a deployment, a worker lease is requested. You can see this at the top of your task log. In my experience, it can take 1 to 5 minutes to acquire a dynamic worker. When an instance requests a lease, the dynamic worker service first checks whether that lease already exists in the database. If the lease doesn’t exist, we kick off the leasing lifecycle. The first thing it does is pull the worker out of the pool so other instances can’t lease it. After that, it tells the dynamic worker VM to trust the Octopus Server’s thumbprint. By default, the dynamic worker VMs do not trust any Octopus Servers. A dynamic worker will either trust one instance or no instances. The workflow looks like this:
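As a rough illustration, the leasing steps described above can be modeled in a few lines of Python. This is a minimal sketch, not the actual dynamic worker service: the class names, the in-memory dictionary standing in for the lease database, and the idea of keying a lease by instance ID alone are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Worker:
    name: str
    # None means the VM trusts no Octopus Server (the default state).
    trusted_thumbprint: Optional[str] = None


@dataclass
class DynamicWorkerService:
    # Hypothetical stand-ins for the hot worker pool and the lease database.
    hot_pool: List[Worker] = field(default_factory=list)
    leases: Dict[str, Worker] = field(default_factory=dict)

    def request_lease(self, instance_id: str, thumbprint: str) -> Worker:
        # Step 1: if a lease already exists for this instance, re-use it.
        existing = self.leases.get(instance_id)
        if existing is not None:
            return existing

        # Step 2: pull a worker out of the pool so no other instance can take it.
        if not self.hot_pool:
            raise RuntimeError("no hot workers available; a VM would need provisioning")
        worker = self.hot_pool.pop(0)

        # Step 3: tell the worker to trust this instance's thumbprint, and only this one.
        worker.trusted_thumbprint = thumbprint
        self.leases[instance_id] = worker
        return worker
```

In this model, the first `request_lease` call for an instance does the slow work of claiming and configuring a worker, while later calls return the existing lease straight away.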

The various states a worker goes through, from our internal documentation:

There is a very real chance that a worker will need to be re-used during the deployment, so it doesn’t make sense to destroy the worker after each step finishes. Our logic re-uses the same worker, which is why subsequent steps in a deployment typically take only a few seconds to acquire a dynamic worker. An interesting way to tell whether you got a fresh worker at the start of a deployment, or whether your instance is re-using the same worker as before, is to check whether Calamari is pushed to it. Calamari gets updated faster than we can update the worker images, so nine times out of ten, Calamari will need to be pushed to a fresh worker.

A worker has to be idle for an hour (60 minutes) before we delete/de-provision the VM. We determine this by looking at the last lease on the worker. No matter what, we will delete/de-provision the VM after 72 hours. The workers in those screenshots will soon be deleted. If something goes wrong and a lease isn’t removed from the worker (for example, because of long-running tasks), we will kill the worker after 24 hours. The lifecycle settings are managed at the instance level.
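To make the three timeouts concrete, here is a minimal sketch of how they might combine into a single decision. The function and parameter names are hypothetical, and two details are assumptions rather than stated behavior: that the 72-hour cap is measured from when the worker was first leased, and that the 24-hour limit is measured from the start of the lease that was never removed.

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_LIMIT = timedelta(minutes=60)       # idle time before de-provisioning
MAX_LIFETIME = timedelta(hours=72)       # hard cap, no matter what
STUCK_LEASE_LIMIT = timedelta(hours=24)  # a lease that was never removed


def should_deprovision(
    now: datetime,
    first_leased_at: datetime,
    lease_started_at: Optional[datetime] = None,
    last_lease_ended_at: Optional[datetime] = None,
) -> bool:
    # The hard cap applies regardless of what the worker is doing.
    if now - first_leased_at >= MAX_LIFETIME:
        return True

    # A lease is still attached: only kill the worker if it looks stuck.
    if lease_started_at is not None:
        return now - lease_started_at >= STUCK_LEASE_LIMIT

    # No active lease: measure idle time from the end of the last lease.
    if last_lease_ended_at is not None:
        return now - last_lease_ended_at >= IDLE_LIMIT

    return False
```

For example, a worker whose last lease ended 30 minutes ago is kept, while one whose last lease ended 60 minutes ago is de-provisioned.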