Create
cancel
Showing results for 
Search instead for 
Did you mean: 
Sign up Log in

Runner is stuck on a step "runner in a non retryable state" error

Jorge Cotelo April 26, 2021

Hello.

I have a new Issue this morning. My pipelines are not running on the web client. when I look in the runner the last error I have is the following.

 



[2021-04-26 15:35:38,650] Runner is in a non retryable state. (state: CompletedStepStateModel{result=SuccessfulResultForCompletedStepStateModel{}})
[2021-04-26 15:35:38,655] Step was already complete skipping.
[2021-04-26 15:35:38,711] Completing step with result Result{status=ERROR, error=Some(Error{key='runner.step.execution-error', message='null', arguments={}})}.
[2021-04-26 15:35:39,020] Translating HTTP 400 BAD_REQUEST response into a com.atlassian.pipelines.stargate.client.core.exceptions.StargateBadRequestException
[2021-04-26 15:35:39,097] An error occurred whilst completing step.
com.atlassian.pipelines.stargate.client.core.exceptions.StargateBadRequestException: Response Summary: HttpResponseSummary{httpStatusCode=400, httpStatusMessage=Bad Request, bodyAsString={"error":{"message":"Bad request","detail":"Attempt to complete an already completed step (existing status=PASSED, new status=ERROR, existing error=null, new error=runner.step.execution-error)","data":{"key":"result-service.step.invalid-state","arguments":{}}}}}
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
|_ checkpoint ⇢ 400 from POST https://api.atlassian.com/ex/bitbucket-pipelines/rest/1.0/accounts/xxxxxxxxx/stepComplete [DefaultWebClient]
Stack trace:
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
at com.atlassian.bitbucketci.client.reactive.ResponseExceptionFactory$ConstructorInvoker.invokeConstructor(ResponseExceptionFactory.java:119)
at io.vavr.CheckedFunction1.lambda$unchecked$43b513dd$1(CheckedFunction1.java:220)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114).

 

Now even if I stop the web client pipeline and then I restart the docker container it goes back to this error. 

I also destroyed the container I started a new one. I this issue persists

7 comments

Comment

Log in or Sign up to comment
Jorge Cotelo April 26, 2021

Nothing gets me out of this.

restarting the container gets back to this issue.
removing the container and creating a new one gets me back to this issue

This is the information of the step that is stuck.

ex.


Getting step StepId{accountUuid={b14cf156-b835-40a8-9fd4-eba8785bb22c}, repositoryUuid={dfc78678-e1b2-4eaf-835c-9adb5687c21b}, pipelineUuid={8d3fca5a-b8e2-447e-b085-25d32996db6f}, stepUuid={d2e3a735-68e4-4dcd-8e1a-24b23311e328}
Mohamed Elhossary April 27, 2021

Can anyone from the Atlassian team please help with the above? We have not been able to use the runner in the last 2 days

cc: @Justin Thomas 

lassian
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
April 27, 2021

Hi Jorge/Mohamed

Apologies for the delay, we have identified a bug that was causing this behaviour and deployed a fix to our backend services.

However the small number of runners that were affected by this are now in an invalid state.

Can you delete and recreate those runners?

That will fix these now corrupt runners.

Kind Regards,
Nathan Burrell

Like Jorge Cotelo likes this
Jorge Cotelo April 28, 2021

Hi @lassian 

 

Thank you. I will do that now and let you know.

Jorge Cotelo April 28, 2021

Hi @lassian 

I have stopped and remove the container, deleted the image and pulled the new one and as soon as I start it then it goes back to the same step and get stuck there:

[2021-04-28 13:57:54,883] Getting step StepId{accountUuid={b14cf156-b835-40a8-9fd4-eba8785bb22c}, repositoryUuid={dfc78678-e1b2-4eaf-835c-9adb5687c21b}, pipelineUuid={8d3fca5a-b8e2-447e-b085-25d32996db6f}, stepUuid={d2e3a735-68e4-4dcd-8e1a-24b23311e328}}.
[2021-04-28 13:57:54,899] Getting oauth token for step.
[2021-04-28 13:57:54,914] Getting environment variables for step.
[2021-04-28 13:57:55,628] Getting all artifacts for step.
[2021-04-28 13:57:55,994] Runner is in a non retryable state. (state: CompletedStepStateModel{result=SuccessfulResultForCompletedStepStateModel{}})
[2021-04-28 13:57:56,000] Step was already complete skipping.
[2021-04-28 13:57:56,021] Completing step with result Result{status=ERROR, error=Some(Error{key='runner.step.execution-error', message='null', arguments={}})}.
[2021-04-28 13:57:56,697] Translating HTTP 400 BAD_REQUEST response into a com.atlassian.pipelines.stargate.client.core.exceptions.StargateBadRequestException
[2021-04-28 13:57:56,722] An error occurred whilst completing step.
com.atlassian.pipelines.stargate.client.core.exceptions.StargateBadRequestException: Response Summary: HttpResponseSummary{httpStatusCode=400, httpStatusMessage=Bad Request, bodyAsString={"error":{"message":"Bad request","detail":"Attempt to complete an already completed step (existing status=PASSED, new status=ERROR, existing error=null, new error=runner.step.execution-error)","data":{"key":"result-service.step.invalid-state","arguments":{}}}}}
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
|_ checkpoint ⇢ 400 from POST https://api.atlassian.com/ex/bitbucket-pipelines/rest/1.0/accounts/%7Bb14cf156-b835-40a8-9fd4-eba8785bb22c%7D/repositories/%7Bdfc78678-e1b2-4eaf-835c-9adb5687c21b%7D/pipelines/%7B8d3fca5a-b8e2-447e-b085-25d32996db6f%7D/steps/%7Bd2e3a735-68e4-4dcd-8e1a-24b23311e328%7D/stepComplete [DefaultWebClient]
Stack trace:
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
at com.atlassian.bitbucketci.client.reactive.ResponseExceptionFactory$ConstructorInvoker.invokeConstructor(ResponseExceptionFactory.java:119)
at io.vavr.CheckedFunction1.lambda$unchecked$43b513dd$1(CheckedFunction1.java:220)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
at reactor.core.publisher.FluxDefaultIfEmpty$DefaultIfEmptySubscriber.onNext(FluxDefaultIfEmpty.java:92)
at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:96)
at reactor.core.publisher.FluxMapFuseable$MapFuseableConditionalSubscriber.onNext(FluxMapFuseable.java:287)
at reactor.core.publisher.FluxFilterFuseable$FilterFuseableConditionalSubscriber.onNext(FluxFilterFuseable.java:330)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1782)
at reactor.core.publisher.MonoCollect$CollectSubscriber.onComplete(MonoCollect.java:152)
at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:252)
at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136)
at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:374)
at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:373)
at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:429)
at reactor.netty.http.client.HttpClientOperations.onInboundNext(HttpClientOperations.java:655)
at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:96)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1526)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1275)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1322)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
[2021-04-28 13:57:56,728] Setting runner state to not executing step.
[2021-04-28 13:57:56,729] Waiting for next step.
[2021-04-28 13:57:56,803] Setting runner state to executing step.
[2021-04-28 13:58:22,524] Updating runner state to "ONLINE".

 As you can see in the logs it pick up that step and then it shows is back online but is not. when we start a new pipeline from the web client it does not work.

The only thing I havent try is to restart the EC2 instance the runner lives in but I guess that job is in a queue and it will keep coming back until we do some manual work?

regards

Jorge Cotelo April 29, 2021

Hello @lassian , @Justin Thomas 

Any update on this?

We have tried everything from destroying the container to destroying the EC2 instance and This step is still stuck there.

The only thing I haven't tried is to delete the runner from bitbucket and creating a new one.

please advise on this issue

Thanks

Justin Thomas
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
May 2, 2021

@Jorge Cotelo Sorry for the late reply. The state of the runner is broken on the Bitbucket side, you will have to delete the existing runner and create a new one.

Jorge Cotelo May 4, 2021

@Justin Thomas  Thank you for the reply. I deleted the runner and created the new one and everything is back.

There is still the Issue mentioned here: 
https://community.atlassian.com/t5/Bitbucket-Pipelines-Runners/Runner-stop-responding-when-idle-for-too-long/qaq-p/1662478

Any update on that?

 

Regards

Justin Thomas
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
May 12, 2021

@Jorge Cotelo The hung bug was a tough one, it's fixed now. Sorry for the trouble.

Mahmut Ali Ozkuran October 24, 2022

Unfortunately bug is still around.

An error occurred whilst completing step.

com.atlassian.pipelines.stargate.client.core.exceptions.StargateBadRequestException: Response Summary: HttpResponseSummary{httpStatusCode=400, httpStatusMessage=Bad Request, bodyAsString={"error":{"message":"Bad request","detail":"Attempt to complete an already completed step (existing status=STOPPED, new status=ERROR, existing error=null, new error=runner.step.execution-error)","data":{"key":"result-service.step.invalid-state","arguments":{}}}}}
Like aurelien.bettini likes this
aurelien.bettini November 2, 2022

Also got the same problem and I had to delete and recreate the runner in bitbucket settings.

Like Mahmut Ali Ozkuran likes this
Mahmut Ali Ozkuran November 2, 2022

Yes nothing solves except recreating the runner.

TAGS
AUG Leaders

Atlassian Community Events