Self Hosted Runner (Linux Shell) - Hanging on CLONE - Failed to exec spawn helper

Peter
Contributor
May 17, 2024

Hi all, 

 

In recent months, we have seen pipelines sporadically hanging on the cloning step.

Asking here in the community in case it helps others, or in case folks have experienced this before. Forgive me if this issue was already raised (I quickly glanced over existing posts).

We need the self-hosted runner mainly because our tests consume more memory than the regular runners support.

We are likely going to add a server reboot, or terminate/recreate the server after each pipeline, but it would be nice to avoid that for the sake of time (server recreation, boot time, etc.).

 

  • We are currently on version 1.563; we have tried bumping to the latest version in the past, but we still randomly hit this hung CLONE step.
  • The only workaround we are aware of is to reboot our self-hosted runner server and then re-run the hung pipeline step.
  • We have tried different server specs and tried Debian & Ubuntu, but all seem to sporadically hang on the CLONE step.
  • It 'seems' to only happen if the runner was used by a previous pipeline (we terminate the compute servers/runners when not in use).
    • Fresh runners that have not been used before have never hit this issue.
  • We clear the build directory before each run.
  • The server has a spec of e2-standard-8 (8 vCPUs, 32 GB RAM) - https://cloud.google.com/compute/docs/general-purpose-machines 

    • We also install the following on the server, in case it is useful:
      • # linux apt
        apt-get install -y \
        openjdk-17-jdk \
        curl \
        gnupg \
        git \
        postgresql \
        postgresql-contrib \
        redis-tools \
        redis \
        build-essential \
        zlib1g \
        zlib1g-dev \
        libbz2-dev \
        libncurses5-dev \
        libncursesw5-dev \
        libffi-dev \
        libreadline-dev \
        libssl-dev \
        liblzma-dev \
        libsqlite3-dev \
        docker-ce \
        docker-ce-cli \
        containerd.io \
        docker-buildx-plugin \
        docker-compose-plugin \
        ca-certificates

        # Also
        GCP SDK
        GCP Ops Monitoring agent
        pyenv, python312 + pip
        bitbucket self hosted runner
        nodejs
  • Monitoring looks fine: no spikes in memory, CPU, etc. (except when the server first boots), no open-file limits reached, etc. Linux logs/journalctl look fine around the time of the hang.

  • I'm going to enable JVM/Java monitoring/metrics for the self-hosted runner, using something like JavaMelody.

    • Maybe with that enabled we will see out-of-memory exceptions, thread exhaustion, etc... something.

  • I'll jot down the logs below from the server (we run on GCP, x86):
# End of prior pipeline using the runner (this was a success)
[2024-05-17 08:33:45,455] Cancelling timeout
[2024-05-17 08:33:45,456] Completing step with result Result{status=FAILED, error=None}.
[2024-05-17 08:33:45,779] Setting runner state to not executing step.
[2024-05-17 08:33:45,779] Waiting for next step.
[2024-05-17 08:33:45,779] Finished executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted, stepUuid={stepUuid-redacted}}

 

# Next pipeline to use the runner (hangs after the exception is raised below)
[2024-05-17 08:33:49,172] Setting runner state to executing step.
[2024-05-17 08:33:49,172] Executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}
[2024-05-17 08:33:49,173] Getting step StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}.
[2024-05-17 08:33:49,175] Getting environment variables for step.
[2024-05-17 08:33:49,634] Getting oauth token for step.
[2024-05-17 08:33:49,972] Getting all artifacts for step.
[2024-05-17 08:33:49,974] Getting SSH private key.
[2024-05-17 08:33:49,975] Getting known hosts.
[2024-05-17 08:33:50,194] Setting up directories.
[2024-05-17 08:33:50,195] Starting log uploader.
[2024-05-17 08:33:50,197] Setting up step timeout: PT1H
[2024-05-17 08:33:50,198] Starting websocket listening to STEP_COMPLETED events.
[2024-05-17 08:33:50,200] Checking for step completion every PT2M seconds.
[2024-05-17 08:33:50,434] Updating step progress to CLONING.
[2024-05-17 08:33:50,783] Generating clone script.
[2024-05-17 08:33:50,784] Adding log file: /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/tmp/clone11970097551783427306.log
[2024-05-17 08:33:50,785] Executing clone script.
[2024-05-17 08:33:50,789] An error occurred whilst executing script.
java.io.IOException: error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 76 common frames omitted
Wrapped by: java.io.IOException: Cannot run program "/bin/bash" (in directory "/root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build"): error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at java.base/java.lang.Runtime.exec(Runtime.java:594)
at org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:61)
at org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:279)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:336)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.execute(ScriptExecutorServiceImpl.java:94)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.lambda$execute$0(ScriptExecutorServiceImpl.java:49)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:44)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSuccess.subscribeActual(SingleDoOnSuccess.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSubscribe.subscribeActual(SingleDoOnSubscribe.java:41)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableFromAction.subscribeActual(CompletableFromAction.java:46)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at brave.context.rxjava2.internal.TraceContextCompletable.subscribeActual(TraceContextCompletable.java:39)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable.subscribeActual(SingleDelayWithCompletable.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:84)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleZipArray$ZipCoordinator.innerSuccess(SingleZipArray.java:119)
at io.reactivex.internal.operators.single.SingleZipArray$ZipSingleObserver.onSuccess(SingleZipArray.java:170)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:56)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleZipArray.subscribeActual(SingleZipArray.java:63)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableOnErrorComplete$OnError.onComplete(CompletableOnErrorComplete.java:48)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableObserveOn$ObserveOnCompletableObserver.run(CompletableObserveOn.java:89)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.Scheduler$DisposeTask.run(Scheduler.java:608)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:66)
at io.reactivex.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:57)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-17 08:33:51,197] Appending log line to main log.

# And now our pipelines are hanging on the CLONE step, which requires us to restart the Java agent and/or reboot the compute server
[2024-05-17 08:34:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:36:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:38:45,505] Updating runner state to "ONLINE".

# Checking the build directory, it's empty
root@bitbucket-self-hosted-runner-instance-1:~# ls -ltra /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build
total 8
drwxr-xr-x 7 root root 4096 May 17 08:33 ..
drwxrwxrwx 2 root root 4096 May 17 08:33 .
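
Since thread exhaustion is one of the things we want to rule out, here are a few generic Linux checks worth running on the runner host (a sketch using standard kernel interfaces; nothing here is runner-specific):

```shell
# Generic checks for PID/thread exhaustion on a Linux host.
cat /proc/sys/kernel/pid_max        # system-wide PID ceiling
cat /proc/sys/kernel/threads-max    # system-wide thread ceiling
ps -eLf | wc -l                     # threads currently in use
ulimit -u                           # per-user process limit in this shell
```

If the in-use thread count is anywhere near the ceilings, fork/exec failures become plausible even when memory graphs look healthy.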

 

We add the self-hosted runner as a startup script on the GCP server; here is the snippet from the Python code we use to launch these runners:

instance_start_up_script = (
    "cd /root/atlassian-bitbucket-pipelines-runner/bin ; "
    "./start.sh "
    "--accountUuid %s "
    "--runnerUuid %s "
    "--OAuthClientId %s "
    "--OAuthClientSecret %s "
    "--runtime %s "
    "--cleanupPreviousFolders true "
    "--scheduledStateUpdateInitialDelaySeconds 60 "
    "--scheduledStateUpdatePeriodSeconds 120 "
    "--workingDirectory %s"
) % (
    bitbucket_runners_creds["account_uuid"],
    bitbucket_runners_creds["runner_uuid"],
    bitbucket_runners_creds["o_auth_client_id"],
    bitbucket_runners_creds["o_auth_client_secret"],
    bitbucket_runners_creds["runtime"],
    bitbucket_runners_creds["working_directory"],
)
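
Side note for anyone hitting the same error: on Linux, JDK 13 and later launch child processes through a small helper binary, lib/jspawnhelper, inside the JDK install directory. One known way for a long-running JVM to start failing with "Failed to exec spawn helper" is the JDK being upgraded on disk (for example by unattended-upgrades) underneath the running process, after which only a JVM restart fixes it. A quick, hedged check (the JDK path below is an assumption based on the Debian/Ubuntu openjdk-17-jdk package):

```shell
# Assumption: Debian/Ubuntu layout for the openjdk-17-jdk package.
JDK_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# When did the runner JVM start (empty if no runner is running)?
RUNNER_PID=$(pgrep -f bitbucket-pipelines-runner 2>/dev/null | head -n1)
[ -n "$RUNNER_PID" ] && ps -o lstart= -p "$RUNNER_PID"

# If the helper's mtime is NEWER than the runner's start time, the running
# JVM may be exec'ing a helper from a different JDK build than its own.
stat -c '%y %n' "$JDK_HOME/lib/jspawnhelper" 2>/dev/null \
  || echo "no jspawnhelper under $JDK_HOME"

# Any recent dpkg activity on the JDK packages:
grep -i openjdk /var/log/dpkg.log 2>/dev/null | tail -n 5
```

This would also match the pattern above: fresh runners are fine, and only runners that have been up long enough to see a package upgrade start failing.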

Answer accepted
Theodora Boudale
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
May 21, 2024

Hi Peter,

I suggest removing the startup script from the server and starting the runner by running the preconfigured command manually. Then, monitor the builds to see if you run into the same error. This is to narrow down whether the issue has to do with the way the runner is started or not.
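
For reference, a manual start along those lines might look roughly like this (the flags mirror the startup-script snippet in the question; all values are placeholders to replace with your real credentials and paths):

```shell
# Sketch of a manual runner start; every value below is a placeholder.
cd /root/atlassian-bitbucket-pipelines-runner/bin
./start.sh \
  --accountUuid "{your-account-uuid}" \
  --runnerUuid "{your-runner-uuid}" \
  --OAuthClientId "<client-id>" \
  --OAuthClientSecret "<client-secret>" \
  --runtime linux \
  --cleanupPreviousFolders true \
  --workingDirectory /path/to/working-dir
```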

If you still encounter the same issue, I suggest updating the runner to the latest version and then repeating the same test with that version.

Please feel free to let me know how it goes.

Does the server where these builds are running meet the prerequisites listed here (Supported platforms and minimum requirements)?

Also, I wanted to give you a heads-up that we will soon support 4x and 8x size steps for builds running on Atlassian's infrastructure as well (if you would prefer that eventually). This will be released in June.

You can add yourself as a watcher to the relevant feature request (by selecting the Start watching this issue link) if you'd like to get notified via email when this is released.

Kind regards,
Theodora

Peter
Contributor
June 24, 2024

Thanks @Theodora Boudale, and please forgive my delayed response; I must not have been subscribed to this thread's notifications.

We ended up moving our workloads to Jenkins (compute server agents) for now, which we trigger from our Bitbucket pipeline. No failures in the last 200 builds.

Thanks for all the info. I can confirm some of the above, but due to timing I wasn't able to test everything. It's all very helpful, and I will bookmark it for later, as I'm hopeful we will look into the 4x & 8x sizes later in the year and see if they work for us.

Theodora Boudale
Atlassian Team
June 25, 2024

Hi Peter,

Thank you for the update.

If you would like to look into the 4x & 8x sizes later in the year: we released this feature yesterday for builds running on Atlassian's infrastructure.

Kind regards,
Theodora

Peter
Contributor
June 27, 2024

Thanks Theodora,

I'll take a peek at these.


DEPLOYMENT TYPE
CLOUD
PRODUCT PLAN
PREMIUM