Hi all,
In recent months, we have seen sporadic hanging pipelines on the cloning step.
Asking here in the community in case helps others, or if folks have experienced this before. Forgive me if this is issue was already raised (i quickly glanced over existing posts)
We need the self hosted runner, mainly as our tests consume more memory then supported by the regular runners.
We are likely going to add a server reboot, or terminate/create after each pipeline, but would be nice to avoid this for the sake of time(server recreate, boot time etc).
The server has a spec of e2-standard-8(8 CPUs, 32GB RAM) - https://cloud.google.com/compute/docs/general-purpose-machines
# linux apt
apt-get install
openjdk-17-jdk
curl
gnupg
git
postgresql
postgresql-contrib
redis-tools
redis
build-essential
zlib1g
zlib1g-dev
libbz2-dev
libncurses5-dev
libncursesw5-dev
libffi-dev
libreadline-dev
libssl-dev
liblzma-dev
libsqlite3-dev
docker-ce
docker-ce-cli
containerd.io
docker-buildx-plugin
docker-compose-plugin
ca-certificates
# Also
GCP SDK
GCP Ops Monitoring agent
pyenv, python312 + pip
bitbucket self hosted runner
nodejs
Monitoring looks fine, no spikes in memory, CPU etc(except when the server boots for first time), no open file limits reached etc. Linux logs/journalctl look fine around this time.
Im going to enable jvm/java monitoring/metrics for the self-hosted runner, using something like javamelody.
Maybe enabling that we will see out of memory exceptions, thread exhaustion etc...something
# End of prior pipeline using the runner (this was a success)
[2024-05-17 08:33:45,455] Cancelling timeout
[2024-05-17 08:33:45,456] Completing step with result Result{status=FAILED, error=None}.
[2024-05-17 08:33:45,779] Setting runner state to not executing step.
[2024-05-17 08:33:45,779] Waiting for next step.
[2024-05-17 08:33:45,779] Finished executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted, stepUuid={stepUuid-redacted}}
# Next pipeline to use the runner (hangs after the exception is raised below)
[2024-05-17 08:33:49,172] Setting runner state to executing step.
[2024-05-17 08:33:49,172] Executing step. StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}
[2024-05-17 08:33:49,173] Getting step StepId{accountUuid={accountUuid-redacted}, repositoryUuid={repositoryUuid-redacted}, pipelineUuid={pipelineUuid-redacted}, stepUuid={stepUuid-redacted}}.
[2024-05-17 08:33:49,175] Getting environment variables for step.
[2024-05-17 08:33:49,634] Getting oauth token for step.
[2024-05-17 08:33:49,972] Getting all artifacts for step.
[2024-05-17 08:33:49,974] Getting SSH private key.
[2024-05-17 08:33:49,975] Getting known hosts.
[2024-05-17 08:33:50,194] Setting up directories.
[2024-05-17 08:33:50,195] Starting log uploader.
[2024-05-17 08:33:50,197] Setting up step timeout: PT1H
[2024-05-17 08:33:50,198] Starting websocket listening to STEP_COMPLETED events.
[2024-05-17 08:33:50,200] Checking for step completion every PT2M seconds.
[2024-05-17 08:33:50,434] Updating step progress to CLONING.
[2024-05-17 08:33:50,783] Generating clone script.
[2024-05-17 08:33:50,784] Adding log file: /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/tmp/clone11970097551783427306.log
[2024-05-17 08:33:50,785] Executing clone script.
[2024-05-17 08:33:50,789] An error occurred whilst executing script.
java.io.IOException: error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 76 common frames omitted
Wrapped by: java.io.IOException: Cannot run program "/bin/bash" (in directory "/root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build"): error=0, Failed to exec spawn helper: pid: 11147, exit value: 1
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at java.base/java.lang.Runtime.exec(Runtime.java:594)
at org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:61)
at org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:279)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:336)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.execute(ScriptExecutorServiceImpl.java:94)
at com.atlassian.pipelines.runner.core.service.execute.ScriptExecutorServiceImpl.lambda$execute$0(ScriptExecutorServiceImpl.java:49)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:44)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSuccess.subscribeActual(SingleDoOnSuccess.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDoOnSubscribe.subscribeActual(SingleDoOnSubscribe.java:41)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableFromAction.subscribeActual(CompletableFromAction.java:46)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at brave.context.rxjava2.internal.TraceContextCompletable.subscribeActual(TraceContextCompletable.java:39)
at io.reactivex.Completable.subscribe(Completable.java:2309)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable.subscribeActual(SingleDelayWithCompletable.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:84)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleZipArray$ZipCoordinator.innerSuccess(SingleZipArray.java:119)
at io.reactivex.internal.operators.single.SingleZipArray$ZipSingleObserver.onSuccess(SingleZipArray.java:170)
at brave.context.rxjava2.internal.TraceContextSingleObserver.onSuccess(TraceContextSingleObserver.java:56)
at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:56)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleZipArray.subscribeActual(SingleZipArray.java:63)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:36)
at io.reactivex.Single.subscribe(Single.java:3666)
at brave.context.rxjava2.internal.TraceContextSingle.subscribeActual(TraceContextSingle.java:39)
at io.reactivex.Single.subscribe(Single.java:3666)
at io.reactivex.internal.operators.single.SingleDelayWithCompletable$OtherObserver.onComplete(SingleDelayWithCompletable.java:69)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableOnErrorComplete$OnError.onComplete(CompletableOnErrorComplete.java:48)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
at brave.context.rxjava2.internal.TraceContextCompletableObserver.onComplete(TraceContextCompletableObserver.java:53)
at io.reactivex.internal.operators.completable.CompletableObserveOn$ObserveOnCompletableObserver.run(CompletableObserveOn.java:89)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.Scheduler$DisposeTask.run(Scheduler.java:608)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:264)
at com.atlassian.pipelines.common.trace.rxjava.CopyMdcSchedulerHandler$CopyMdcRunnableAdapter.run(CopyMdcSchedulerHandler.java:74)
at io.reactivex.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:66)
at io.reactivex.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:57)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-17 08:33:51,197] Appending log line to main log.
# And now our pipelines are handing on CLONE step, which requires us to reboot the java agent/and or reboot the compute server
[2024-05-17 08:34:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:36:45,505] Updating runner state to "ONLINE".
[2024-05-17 08:38:45,505] Updating runner state to "ONLINE".
# Checking the build directory, its empty
root@bitbucket-self-hosted-runner-instance-1:~# ls -ltra /root/atlassian-bitbucket-pipelines-runner/bin/../temp../temp/uuid-runner-dir-redacted/build
total 8
drwxr-xr-x 7 root root 4096 May 17 08:33 ..
drwxrwxrwx 2 root root 4096 May 17 08:33 .
We add the self-hosted-runner as a startup script on the GCP server, here is the snippet from the python code we use to launch these runners:
instance_start_up_script = "cd /root/atlassian-bitbucket-pipelines-runner/bin ; " \
"./start.sh " \
"--accountUuid %s " \
"--runnerUuid %s " \
"--OAuthClientId %s " \
"--OAuthClientSecret %s " \
"--runtime %s " \
"--cleanupPreviousFolders true " \
"--scheduledStateUpdateInitialDelaySeconds 60 " \
"--scheduledStateUpdatePeriodSeconds 120 " \
"--workingDirectory %s" % (bitbucket_runners_creds["account_uuid"],
bitbucket_runners_creds["runner_uuid"],
bitbucket_runners_creds["o_auth_client_id"],
bitbucket_runners_creds["o_auth_client_secret"],
bitbucket_runners_creds["runtime"],
bitbucket_runners_creds["working_directory"])
Hi Peter,
I suggest removing the startup script from the server and starting the runner by running the preconfigured command manually. Then, monitor the builds to see if you run into the same error. This is to narrow down whether the issue has to do with the way the runner is started or not.
If you still encounter the same issue, I suggest updating the runner to the latest version and then trying the same with the latest version of the runner.
Please feel free to let me know how it goes.
Does the server where these builds are running meet the prerequisites listed here (Supported platforms and minimum requirements)?
Also, I wanted to give you a heads-up that we will soon support 4x and 8x size steps for builds running on Atlassian's infrastructure as well (if you would prefer that eventually). This will be released in June:
You can add yourself as a watcher to the following feature request (by selecting the Start watching this issue link) if you'd like to get notified via email when this is released:
Kind regards,
Theodora
Thanks @Theodora Boudale, please forgive my delayed response, i must not have been subscribed to this threads notifications/updates.
We ended up moving our workloads to jenkins(compute server agents) for now, which we trigger from our bitbucket pipeline. No failures for the last 200 builds.
Thanks for all the info, i can confirm some of above, but due to timing wasnt able to test all. All very helpful info, and i will bookmark/save for later, as im hopeful we will look into the 4x & 8x sizes later in year, and see if they work for us.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Peter,
Thank you for the update.
If you would like to look into 4x & 8x sizes later in the year, we released this feature yesterday for builds running on Atlassian's infrastructure:
Kind regards,
Theodora
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.