Server running 3.7.2 HA with 2 server nodes and 1000+ tentacles (mostly listening, some polling). The Check Health task runs every few hours. The task often gets stuck, with the Task Summary page displaying a message like ‘1,031 of 1,032 health checks complete (99%)’ under Task Progress.
The message may be misleading - I downloaded the raw task log, extracted all the Success & Failed lines (see samples below), and found the total count matched what’s expected (in this case 1,032). So it appears that Check Health did finish checking each tentacle, but it might be doing something else after the last check and never got to update the count.
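For anyone who wants to repeat the tally described above, here is a minimal sketch in Python. It assumes each completed health check leaves exactly one line in the raw task log containing the word "Success" or "Failed"; the real log format may differ, and the function name `tally_health_checks` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: one log line per completed check, containing the whole word
# "Success" or "Failed". Adjust the pattern to the actual raw-log format.
RESULT_PATTERN = re.compile(r"\b(Success|Failed)\b")


def tally_health_checks(log_text: str) -> Counter:
    """Count result lines in the raw task log, keyed by outcome."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESULT_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real Octopus log.
    sample = "\n".join([
        "Success: machine-a",
        "Failed: machine-b",
        "Info: checking machine-c",
        "Success: machine-c",
    ])
    counts = tally_health_checks(sample)
    print("Success:", counts["Success"])
    print("Failed:", counts["Failed"])
    print("Total:", sum(counts.values()))
```

If the total matches the ‘n’ shown under Task Progress, every tentacle was checked and the hang is happening after the last check.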
I’ve experienced this many times - the message always says ‘n-1 of n health checks complete (99% done)’. The only way to unstick it is to cancel the task.
Questions - Where and how does it get stuck? How can I resolve or work around it?
Thanks for getting in touch! This is a known issue that we believe is fixed in 3.11.2.
Here is the corresponding GitHub issue: https://github.com/OctopusDeploy/Issues/issues/3203
If you are able to upgrade, please report back on whether it is resolved, as the customers who previously reported this have not yet confirmed that the fix resolved the problem.
Unfortunately, there is no known workaround, so upgrading will be the only way to resolve the issue.
The issue didn’t really get resolved. It completed a few times, but got stuck on all other runs.
I’ve just upgraded Octopus to 3.12.0 and re-ran the job; it still got stuck.
When it gets stuck, it always shows ‘n-1 of n health checks complete (99%)’ and stays in a state where the task can’t be cancelled either. The only way to cancel the task is to restart the server; the task is then rerun automatically, and I have to click ‘Cancel’ before it completes.
This is frustrating! We rely on Check Health to mark servers that temporarily went offline as healthy once they become available again during a Check Health run. It appears that if Check Health doesn’t complete, none of the machine statuses get updated.