Come for the products,
stay for the community

The Atlassian Community can help you and your team get more value out of Atlassian products and practices.

Atlassian Community about banner
4,299,766
Community Members
 
Community Events
165
Community Groups

Runner stop responding when idle for too long

Context: during EAP, we use the runner only once a day. Once in a while the runner gets offline (running, but offline) for up to 4 hours for infrastructure maintenance.

My problem:

I've noticed that very frequently the runner stop responding to the cloud. On logs I have just "Waiting for next step", no errors. Once I restart the runner container, everything runs as it's has never been offline.

Anyone with similar issue?

8 answers

1 vote
lassian Atlassian Team Apr 13, 2021

Hi All,

Can you all please provide your runner logs when it has "hung" and what version of the runner you are using so we can investigate this.

Kind Regards,
Nathan Burrell

Hi, sorry for the late reply.

I've updated the runner to the latest version on friday (2021-04-16) and couldn't notice this issue on this version.

When it was happening there was no log output, just the normal logs ending with "Waiting for next step".

I'll monitor this runner closely this week and report if the problem shows up again.

Thank you.

Here is the logs from my Runner on v1.107 after it goes "Offline":

[2021-04-17 00:11:18,777] An error occurred whilst listening to RUNNER_UPDATED websocket events.
java.io.EOFException: null
at okio.RealBufferedSource.require(RealBufferedSource.java:65)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException
at io.reactivex.internal.util.ExceptionHelper.wrapOrThrow(ExceptionHelper.java:46)
at io.reactivex.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:93)
at io.reactivex.Completable.blockingAwait(Completable.java:1227)
at com.atlassian.pipelines.runner.core.ApplicationImpl.main(ApplicationImpl.java:47)
Caused by: java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:65)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

yes, I get the same logs. And mine went offline the same date.

Hi @lassian 

It just happened again, pipelines shows "Starting Agent" on web (longer than 1 hour now). The runner shows as ONLINE on admin page.

On the logs we get: a successful pipeline followed by hundreds of health checks.

Despite of not responding, the runner keeps reporting, logging and show on admin as ONLINE until now.

......
[2021-04-20 12:19:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:20:10,015] Updating runner state to "ONLINE".
[2021-04-20 12:20:35,310] Appending log line to main log.
[2021-04-20 12:20:38,309] Appending log line to main log.
[2021-04-20 12:20:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:20:45,309] Appending log line to main log.
[2021-04-20 12:20:46,315] Appending log line to main log.
[2021-04-20 12:21:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:21:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:22:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:22:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:22:53,308] Appending log line to main log.
[2021-04-20 12:22:56,919] Container has state (exitCode: Some(5), OOMKilled Some(false))
[2021-04-20 12:22:56,920] Removing container build
[2021-04-20 12:22:57,144] Not uploading caches. (numberOfCaches: 0, resultOrError: FAILED)
[2021-04-20 12:22:57,144] Not uploading artifacts. (numberOfArtifacts: 0, resultOrError: FAILED)
[2021-04-20 12:22:57,144] Updating step progress to PARSING_TEST_RESULTS.
[2021-04-20 12:22:57,615] Test report processing complete.
[2021-04-20 12:22:57,615] Removing container clone
[2021-04-20 12:22:58,309] Appending log line to main log.
[2021-04-20 12:22:58,409] Removing container clone
[2021-04-20 12:22:58,411] Removing container build
[2021-04-20 12:22:58,413] Updating step progress to COMPLETING_LOGS.
[2021-04-20 12:22:58,878] Shutting down log uploader.
[2021-04-20 12:22:58,878] Tearing down directories.
[2021-04-20 12:22:59,043] Cancelling timeout
[2021-04-20 12:22:59,044] Completing step with result Result{status=FAILED, error=None}.
[2021-04-20 12:22:59,530] Waiting for next step.
[2021-04-20 12:23:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:23:40,015] Updating runner state to "ONLINE".
[2021-04-20 12:24:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:24:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:25:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:25:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:26:10,015] Updating runner state to "ONLINE".
[2021-04-20 12:26:40,015] Updating runner state to "ONLINE".
[2021-04-20 12:27:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:27:40,015] Updating runner state to "ONLINE".
[2021-04-20 12:28:10,014] Updating runner state to "ONLINE".
[2021-04-20 12:28:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:29:10,015] Updating runner state to "ONLINE".
[2021-04-20 12:29:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:30:10,015] Updating runner state to "ONLINE".
[2021-04-20 12:30:40,014] Updating runner state to "ONLINE".
[2021-04-20 12:31:10,015] Updating runner state to "ONLINE".

.....
Like Jorge Cotelo likes this

YESSS. I just got that issue as well. pipelines shows "Starting Agent" on web and the agent is showing online every 30 sec as normal

lassian Atlassian Team Apr 20, 2021

Hi,

I have identified a bug with websockets above and retries and have released a fix, can you please pull the latest version of the runner (ensuring you have v1.110 or greater).

When you receive the error above it will than resubscribe to the websocket I found an edge case where it was not.

Can you update your runners and tell us if your still experiencing these issues with this version.

Kind Regards,

Nathan Burrell

Hi, 

I pulled the latest image. (v1.110) and started the runner again. After two builds I got the same issue @Paulo Antoniassi was getting without the error.

Basically the web shows starting agent but it stays there forever while the logs in the runner still shows [2021-04-21 13:54:00,057] Updating runner state to "ONLINE"

lassian Atlassian Team Apr 21, 2021

Hi Jorge,

Can you provide me the uuid of the step that was stuck at starting agent, and the uuid of the runner.

Kind Regards,
Nathan Burrell

Hi @lassian 


I am using a private repo so I don't think you will be able to have any visual on it. I don't know if @Paulo Antoniassi  still have the issue.

After a few hours the pipeline times out

Hi, 

I've setup "dummy steps" on every commit to keep te runner alive so I don't have anything recent. Mine is a private repository too but I'm pretty sure Atlassian can get interesting logs even for private repos. 

I guess the following one was stuck, but I'm not sure about that cause restarting the runner resumes the step as anything happened, I'm just guessing considering 12 minutes of "build setup".

Step b0394b49-4dc3-4cd9-9e0b-3eabc89350ee (#108)

Repository eabc3499-aff6-4a5e-a5ff-4058c6bc57a2 (b5s/sic718_atendimento)

@lassian , 

Just got an confirmed one, from logs:

{accountUuid={105381f7-a165-4737-899a-c14d91b3b768}, repositoryUuid={eabc3499-aff6-4a5e-a5ff-4058c6bc57a2}, pipelineUuid={a4e15047-11c4-444e-99f8-9ecd1f769d3d}, stepUuid={d130fdd6-26ab-464d-bd41-86a030539f2d}}

 

I had to restart the runner to resume the step.

@Paulo Antoniassi@lassian 
Hello, Any update on this?

I can confirm this issue still happens. I created a few custom pipelines and run them on schedule every 1 hour. After approximately 10 hours the runner stops receiving request and then I have to restart it.
So is not "When idle for too long" it just stop responding and it still shows online on the web client and in the runner itself.

 

Regards

Just wanted to circle back since my original comment.

Since updating to v1.110, my runner has consistently run according to schedule.

Interesting. I am currently running v1.127 and still have the issue. 

thanks @Steven Rupert 

0 votes
lassian Atlassian Team May 10, 2021

Hi,

I have just released a patch to the runner that should stop it from hanging.

Please upgrade the runner and ensure you are running 1.133 and confirm for me if this occurs again.

Kind Regards,

Nathan Burrell

0 votes
lassian Atlassian Team May 05, 2021

Hi,

Can you all please raise a support case and attach the runner logs at the time a step was scheduled onto it but the runner didnt pick it up so we can investigate, and what step was assigned to it (one that is not in queued in the UI).

Kind Regards,
Nathan

Bump. 

Same sort of thing is happening to me.  But mine just hangs during checkout every time.  

I posed an issue for it because it seems a little different (mine never runs):

https://community.atlassian.com/t5/Bitbucket-Pipelines-Runners/Pipeline-Hang-During-Checkout/qaq-p/1673959

Un-bump.

My similar issue is gone.  See the note on the above thread (started working in v1.110).

Hello,

I also experience an issue. But mine comes after an error:

An error occurred whilst listening to RUNNER_UPDATED websocket events.
java.io.EOFException: null
    at okio.RealBufferedSource.require(RealBufferedSource.java:65)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
    at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException
    at io.reactivex.internal.util.ExceptionHelper.wrapOrThrow(ExceptionHelper.java:46)
    at io.reactivex.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:93)
    at io.reactivex.Completable.blockingAwait(Completable.java:1227)
    at com.atlassian.pipelines.runner.core.ApplicationImpl.main(ApplicationImpl.java:47)
Caused by: java.io.EOFException
    at okio.RealBufferedSource.require(RealBufferedSource.java:65)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
    at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

yeah we also reproduced it randomly

I've experienced this issue myself. Currently have a single runner that's scheduled to run weekly Packer image builds. The runner is inactive the rest of the week and consistently fails to run my scheduled build until the runner is restarted.

 

The error I've seen when it fails to run is "ERROR Step exceeded processing limits and has timed out:.

yeah i seems to have the same issue randomly too. never found why

Suggest an answer

Log in or Sign up to answer
TAGS

Community Events

Connect with like-minded Atlassian users at free events near you!

Find an event

Connect with like-minded Atlassian users at free events near you!

Unfortunately there are no Community Events near you at the moment.

Host an event

You're one step closer to meeting fellow Atlassian users at your local event. Learn more about Community Events

Events near you