Self Hosted Runner (Linux Shell) - Hanging on CLONE - Failed to exec spawn helper

May 17, 2024

Hi all,

In recent months, we have seen sporadic hanging pipelines on the cloning step.

Asking here in the community in case helps others, or if folks have experienced this before. Forgive me if this is issue was already raised (i quickly glanced over existing posts)

We need the self hosted runner, mainly as our tests consume more memory then supported by the regular runners.

We are likely going to add a server reboot, or terminate/create after each pipeline, but would be nice to avoid this for the sake of time(server recreate, boot time etc).

We are currently on version 1.563, we tried bumping versions to latest in the past but we still randomly hit this hung CLONE step.
The only workaround we are aware of, is to reboot our self hosted runner server, and then re-run the pipeline hung step.
We have tried different server specs, tried Debian & ubuntu but all seem to sporadically hang on CLONE step.
It 'seems' to only happen if the runner was used by a previous pipeline (we terminate the compute servers/runners if not in use).
- So fresh runners that have not been used before have never hit this issue before.
We clear the build directory before each run.

The server has a spec of e2-standard-8(8 CPUs, 32GB RAM) - https://cloud.google.com/compute/docs/general-purpose-machines

We also install the following stuff on the server, if useful

# linux apt
apt-get install 
openjdk-17-jdk 
curl 
gnupg 
git 
postgresql 
postgresql-contrib 
redis-tools 
redis 
build-essential 
zlib1g 
zlib1g-dev 
libbz2-dev 
libncurses5-dev 
libncursesw5-dev 
libffi-dev 
libreadline-dev 
libssl-dev 
liblzma-dev 
libsqlite3-dev 
docker-ce 
docker-ce-cli 
containerd.io 
docker-buildx-plugin 
docker-compose-plugin
ca-certificates

# Also
GCP SDK
GCP Ops Monitoring agent
pyenv, python312 + pip
bitbucket self hosted runner
nodejs

Monitoring looks fine, no spikes in memory, CPU etc(except when the server boots for first time), no open file limits reached etc. Linux logs/journalctl look fine around this time.
Im going to enable jvm/java monitoring/metrics for the self-hosted runner, using something like javamelody.
- Maybe enabling that we will see out of memory exceptions, thread exhaustion etc...something
Ill jot down logs below from the server (we run on GCP, X86)

# End of prior pipeline using the runner (this was a success)
[2024-05-17 08:33:45,455] Cancelling timeout
[2024-05-17 08:33:45,456] Completing step with result Result{status=FAILED, error=None}.
[2024-05-17 08:33:45,779] Setting runner state to not executing step.
[2024-05-17 08:33:45,779] Waiting for next step.
[2024-05-17 08:33:45,779] Finished executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted, stepUuid={stepUuid-redacted}}

 

# Next pipeline to use the runner (hangs after the exception is raised below)
[2024-05-17 08:33:49,172] Setting runner state to executing step.
[2024-05-17 08:33:49,172] Executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}
[2024-05-17 08:33:49,173] Getting step StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}.
[2024-05-17 08:33:49,175] Getting environment variables for step.
[2024-05-17 08:33:49,634] Getting oauth token for step.
[2024-05-17 08:33:49,972] Getting all artifacts for step.
[2024-05-17 08:33:49,974] Getting SSH private key.
[2024-05-17 08:33:49,975] Getting known hosts.
[2024-05-17 08:33:50,194] Setting up directories.
[2024-05-17 08:33:50,195] Starting log uploader.
[2024-05-17 08:33:50,197] Setting up step timeout: PT1H
[2024-05-17 08:33:50,198] Starting websocket listening to STEP_COMPLETED events.
[2024-05-17 08:33:50,200] Checking for step completion every PT2M seconds.
[2024-05-17 08:33:50,434] Updating step progress to CLONING.
[2024-05-17 08:33:50,783] Generating clone script.
[2024-05-17 08:33:50,784] Adding log file: /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/tmp/clone11970097551783427306.log
[2024-05-17 08:33:50,785] Executing clone script.
[2024-05-17 08:33:50,789] An error occurred whilst executing script.
java.io.IOException: error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 76 common frames omitted
Wrapped by: java.io.IOException: Cannot run program "/bin/bash" (in directory "/root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build"): error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at java.base/java.lang.Runtime.exec(Runtime.java:594)
at org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:61)
at org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:279)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:336)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.execute(ScriptExecutorServiceImpl.java:94)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.lambda$execute$0(ScriptExecutorServiceImpl.java:49)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:44)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSuccess.subscribeActual(SingleDoOnSuccess.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSubscribe.subscribeActual(SingleDoOnSubscribe.java:41)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableFromAction.subscribeActual(CompletableFromAction.java:46)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at brave.context.rxjava2.internal.TraceContextCompletable.subscribeActual(TraceContextCompletable.java:39)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable.subscribeActual(SingleDelayWithCompletable.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:84)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleZipArray$ZipCoordinator.innerSuccess(SingleZipArray.java:119)
at io.reactivex.internal.operators.single.SingleZipArray$ZipSingleObserver.onSuccess(SingleZipArray.java:170)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:56)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleZipArray.subscribeActual(SingleZipArray.java:63)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableOnErrorComplete$OnError.onComplete(CompletableOnErrorComplete.java:48)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableObserveOn$ObserveOnCompletableObserver.run(CompletableObserveOn.java:89)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.Scheduler$DisposeTask.run(Scheduler.java:608)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:66)
at io.reactivex.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:57)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-17 08:33:51,197] Appending log line to main log.

# And now our pipelines are handing on CLONE step, which requires us to reboot the java agent/and or reboot the compute server
[2024-05-17 08:34:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:36:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:38:45,505] Updating runner state to "ONLINE".


# Checking the build directory, its empty
root@bitbucket-self-hosted-runner-instance-1:~# ls -ltra /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build 
total 8
drwxr-xr-x 7 root root 4096 May 17 08:33 ..
drwxrwxrwx 2 root root 4096 May 17 08:33 .

We add the self-hosted-runner as a startup script on the GCP server, here is the snippet from the python code we use to launch these runners:

instance_start_up_script = "cd /root/atlassian-bitbucket-pipelines-runner/bin ; " \
                           "./start.sh " \
                           "--accountUuid %s " \
                           "--runnerUuid %s " \
                           "--OAuthClientId %s " \
                           "--OAuthClientSecret %s " \
                           "--runtime %s " \
                           "--cleanupPreviousFolders true " \
                           "--scheduledStateUpdateInitialDelaySeconds 60 " \
                           "--scheduledStateUpdatePeriodSeconds 120 " \
                           "--workingDirectory %s" % (bitbucket_runners_creds["account_uuid"],
                                                      bitbucket_runners_creds["runner_uuid"],
                                                      bitbucket_runners_creds["o_auth_client_id"],
                                                      bitbucket_runners_creds["o_auth_client_secret"],
                                                      bitbucket_runners_creds["runtime"],
                                                      bitbucket_runners_creds["working_directory"])

You may be experiencing slow load times on the forum; the team is working to resolve this ASAP.

Forums

Q&A

Community resources

Support

Top groups

Community resources

Support

Learn

Community resources

Support

Events

Community resources

Support

Self Hosted Runner (Linux Shell) - Hanging on CLONE - Failed to exec spawn helper

1 answer

1 accepted

Suggest an answer

Was this helpful?

Thanks!

DEPLOYMENT TYPE

PRODUCT PLAN

TAGS

Atlassian Community Events