Concourse worker balancing

Xing Du
4 min read · Mar 27, 2023

Recently, I received many complaints about concourse builds taking a ridiculously long time, so I decided to figure out what was happening.

An example screenshot illustrates the issue better than I can describe:

Concourse worker delay

This looks like a resource starvation problem, and I quickly checked what metrics concourse emits, hoping to find something like “job queue length” or “job wait time”.

Unfortunately, I was not able to find anything equivalent. However, the load of a concourse worker can be indirectly observed through the number of containers running and the number of volumes streamed on it, both of which are available to me via the Datadog concourse integration.

worker_containers
worker_volumes
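
For a per-worker view outside of dashboards, the same numbers can be pulled from the Datadog query API. A minimal sketch, assuming the metric is named worker_containers as in the chart above and is tagged by host (adjust both to whatever your integration actually emits):

    # Pull per-worker container counts for the last hour from Datadog.
    # Metric and tag names (worker_containers, host) are assumptions based on
    # the charts above; adjust to what your integration emits.
    NOW=$(date +%s)
    curl -s -G "https://api.datadoghq.com/api/v1/query" \
      -H "DD-API-KEY: ${DD_API_KEY}" \
      -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
      --data-urlencode "from=$((NOW - 3600))" \
      --data-urlencode "to=${NOW}" \
      --data-urlencode "query=avg:worker_containers{*} by {host}"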

From the two metrics above, it seems there is a bias in the worker pool: two instances (highlighted above) take on far more work than the other 18 workers (the pool is scaled to 20 instances).

With this bias favoring 2 instances, scaling up the pool won't solve the long-delay problem: newly launched workers won't take load off the 2 hot instances. We need to dig deeper into what caused this bias.

After more digging through the concourse documentation, I found that the default container placement strategy is volume-locality, which aims to reduce network IO but can introduce bias among workers. This appears to be a known issue (issue1, issue2, and more), which led to other container placement strategies as well as the container placement strategy chaining feature in concourse@7.

I quickly verified the configuration and runtime environment variables for our concourse-web pods: we're currently running concourse@6 (deployed via concourse-chart@v13) and we're using the default value, which is volume-locality.
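
A quick way to check is to read the placement strategy off the web pod's environment. A rough sketch, where the namespace, label, and deployment names are placeholders for whatever your chart release uses:

    # Find the concourse web pods (namespace and label are placeholders)
    kubectl -n concourse get pods -l app=concourse-web

    # Dump the placement-related environment; if CONCOURSE_CONTAINER_PLACEMENT_STRATEGY
    # is unset, concourse@6 falls back to the volume-locality default
    kubectl -n concourse exec deploy/concourse-web -- env | grep -i placement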

This setup was done with the intention of reducing the overhead of transferring heavy git-resources. We have some GitHub repos that are huge and have to be deep-cloned. The volume-locality policy can improve job-worker affinity for a given build (similar to how reuseNode works in Jenkins). We had the impression that this configuration is only used for picking a worker for the next step within the same build, while new builds are assigned using something like round-robin. However, after checking the atc source code, the container placement strategy is indeed also used when placing the first step of a new build.

To confirm this hypothesis, I sshed into the instances, made sure concourse-worker was the only non-daemonset pod running on each (so it had "dedicated" hardware resources), then ran top and watched the output for 15 minutes. The output was very consistent; here is an example:

top output: overloaded
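
For reference, the check itself was roughly the following (the node name is a placeholder):

    # Confirm the concourse-worker pod is the only non-daemonset workload on the node
    kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal

    # Then, on the node itself (via ssh), watch the process list sorted by CPU
    top -o %CPU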

The overloaded worker is constantly running git deep clones of large repos, taking all the CPU and starving every other job assigned to the same worker. The top output from a non-overloaded worker looks like this:

top output: normal

I also verified the memory usage pattern: it is roughly the same for overloaded and non-overloaded workers. The issue is a CPU bottleneck, not memory.

Now that we've confirmed the hypothesis, let's talk about fixes. On concourse@6 this is a trade-off: changing the container placement strategy to fewest-build-containers accepts more frequent volume streaming and higher network IO in exchange for a balanced worker pool and horizontal scaling that actually works.
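
With the helm-based deployment, the change itself is small. A sketch, assuming the chart exposes the web flag as concourse.web.containerPlacementStrategy (worth verifying against your concourse-chart version):

    # Switch the web nodes' placement strategy away from the volume-locality default.
    # Under the hood this just sets CONCOURSE_CONTAINER_PLACEMENT_STRATEGY on the web pods.
    helm upgrade concourse concourse/concourse \
      --reuse-values \
      --set concourse.web.containerPlacementStrategy=fewest-build-containers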

If we were on concourse@7, the solution could be nicer, without much of a compromise on network IO:

  • take advantage of the container placement strategy chaining feature to use a compound strategy, which reduces the bias and keeps network IO down at the same time (see the sketch below).
  • turn on the p2p volume streaming feature to take pressure off the web nodes: by default, volume streaming is proxied through the concourse-web nodes, which is a redundant hop.
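
A sketch of what that could look like on the concourse@7 web nodes; both flag names are from memory, so double-check them against the v7 docs for your exact version:

    # Chain strategies: keep volume locality where possible, but break ties with
    # fewest-build-containers so no single worker gets piled on.
    export CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=volume-locality,fewest-build-containers

    # Let workers stream volumes to each other instead of proxying through the web nodes
    # (flag name is an assumption; verify against your concourse version).
    export CONCOURSE_ENABLE_P2P_VOLUME_STREAMING=true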

The container placement strategy change was applied and became effective immediately:

expected volume stream increase
worker container count bias eliminated

Problem solved, and my engineers no longer need to deal with automated pipelines that are stuck for hours.

Learnings

  • concourse observability could really use some improvement. Insights such as per-worker job queue length and wait time should be available as metrics. The data can be pulled from the concourse DB by running your own queries (see the sketch after this list), but it really should be exposed as metrics as well.
  • proactively upgrade your tools to the latest stable version. The solution here would have been nicer if we had been on the latest version of concourse.
  • if you’re building something new, use a tool other than concourse
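
As an example of the kind of query I mean, something along these lines against the concourse Postgres database approximates build wait time; the table and column names (builds, create_time, start_time) are from memory and may differ across versions:

    # Hourly average of "time between build creation and build start" over the last day.
    # Schema details are assumptions from memory; verify against your concourse version.
    psql "$CONCOURSE_DATABASE_URL" -c "
      SELECT date_trunc('hour', create_time) AS hour,
             round(avg(extract(epoch FROM (start_time - create_time)))) AS avg_wait_seconds
      FROM builds
      WHERE start_time IS NOT NULL
        AND create_time > now() - interval '1 day'
      GROUP BY 1
      ORDER BY 1;"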
