Pipeline fails randomly and abruptly

Hugo Baldwin May 12, 2023

I've got a pipeline with three steps that run in parallel. Two of the steps run unit tests, using Node and Python respectively. The third step uses docker-compose to build the containers needed to run Python integration tests.
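For context, the shape of the configuration is roughly the following. This is a minimal sketch; the step names, images, and commands are placeholders rather than my actual file:

```yaml
# Sketch of the pipeline shape only; names, images and commands are placeholders.
pipelines:
  default:
    - parallel:
        - step:
            name: Node unit tests
            image: node:18
            script:
              - npm ci
              - npm test
        - step:
            name: Python unit tests
            image: python:3.11
            script:
              - pip install -r requirements.txt
              - pytest tests/unit
        - step:
            name: Python integration tests
            image: python:3.11
            services:
              - docker
            script:
              - docker-compose up -d --build
              - pytest tests/integration
```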

The first two steps both typically finish in under a minute and never fail (unless there's a genuine test failure).

The third step typically completes in 5-6 minutes, but it's during this step that I keep hitting the same failure: always after the containers are built, and always while pytest is running the integration tests. The tests themselves never fail; instead, these are typically the last three lines of the build log:

devices/tests/test_views.py::test_create_device_view PASSED [ 22%]
devices/tests/test_views.py::test_create_device PASSED [ 23%]
2023-05-11T14:00:36.474810314Z stdout P devices/tests/test_views.py::test_delete_device

The tests generally reach a different completion percentage every time before the build dies; sometimes almost immediately, sometimes at 80-90% completion.

The Docker log doesn't contain anything unusual, or at least nothing I haven't also seen in the logs of green builds. Here's a sample of the tail end of a build that failed abruptly:

time="2023-05-11T14:00:19.251355162Z" level=error msg="AuthZRequest for HEAD /_ping returned error: authorization denied by plugin pipelines: "
time="2023-05-11T14:00:19Z" level=info msg="Pipelines plugin request authorization." allowed=true method=GET plugin=pipelines uri=/_ping
time="2023-05-11T14:00:19Z" level=info msg="Pipelines plugin request authorization." allowed=true method=GET plugin=pipelines uri=/v1.41/containers/digipipe_test_django/json
time="2023-05-11T14:00:19Z" level=info msg="Container exec request." AttachStderr=true AttachStdin=true AttachStdout=true Detach=false DetachKeys= Privileged=false Tty=false User= plugin=pipelines
time="2023-05-11T14:00:19Z" level=info msg="Pipelines plugin request authorization." allowed=true method=POST plugin=pipelines uri=/v1.41/containers/digipipe_test_django/exec
time="2023-05-11T14:00:19Z" level=info msg="Pipelines plugin request authorization." allowed=true method=POST plugin=pipelines uri=/v1.41/exec/2bc433c5abc8911fc41d9a4e76a32d4d8ac98dd6f3a6c598162dae78f32e7379/start
time="2023-05-11T14:00:36Z" level=info msg="Pipelines plugin request authorization." allowed=true method=GET plugin=pipelines uri=/v1.41/exec/2bc433c5abc8911fc41d9a4e76a32d4d8ac98dd6f3a6c598162dae78f32e7379/json
time="2023-05-11T14:00:36.495477577Z" level=warning msg="cleaning up after shim disconnected" id=67747a1968e7530303ebe16693e13a76f92b9918b11b26e5c1afa19f476e39d2 namespace=moby
time="2023-05-11T14:00:36.524003398Z" level=warning msg="cleanup warnings time=\"2023-05-11T14:00:36Z\" level=info msg=\"starting signal loop\" namespace=moby pid=8808 runtime=io.containerd.runc.v2\n"

I've tried various things: enabling/disabling individual tests, rearranging the Dockerfile, enabling/disabling BuildKit. None of it has got me anywhere. Any help greatly appreciated.

1 answer

Theodora Boudale
Atlassian Team
May 15, 2023

Hi Hugo and welcome to the community!

It's hard to say what is happening without an error message.

Is there any error message before the last three lines of the build log that you shared?

Any error message in the column next to the build log, below the build number?

If not, is there a way to add verbose output to the command that is failing?

If there is a chance that this may be caused by memory issues, you could add the following lines to your yml file, at the beginning of the third step's script:

- while true; do date && ps -aux && sleep 5 && echo ""; done &
- while true; do date && echo "Memory usage in megabytes:" && echo $((`cat /sys/fs/cgroup/memory/memory.memsw.usage_in_bytes | awk '{print $1}'`/1048576)) && echo "" && sleep 5; done &

These commands will show memory usage throughout the step in the build log, so it will be possible to see whether memory usage seems to reach the limit.
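For reference, the arithmetic that second loop performs can be sketched in Python. The cgroup path below is the v1 accounting file used in the shell snippet above, and is an assumption about the build container's cgroup layout:

```python
# Sketch of the memory check the shell loop performs. The cgroup v1 path
# is an assumption about where the build container's memory accounting lives.
CGROUP_V1_PATH = "/sys/fs/cgroup/memory/memory.memsw.usage_in_bytes"


def bytes_to_megabytes(n: int) -> int:
    # "Megabytes" here means MiB (1048576 bytes), matching the shell snippet.
    return n // 1048576


def read_usage_mb(path: str = CGROUP_V1_PATH) -> int:
    # The file contains a single integer: current memory + swap usage in bytes.
    with open(path) as f:
        return bytes_to_megabytes(int(f.read().split()[0]))
```

A step that repeatedly logs values approaching the step's memory limit just before the abrupt failure would point strongly at an out-of-memory kill.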

You could also try debugging the step locally with Docker, as per the steps outlined in this document, and see if the same issue occurs there as well, or not.
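As a rough illustration of that approach, it boils down to running your build image locally with the same memory cap a regular Pipelines step gets (4 GB), mounting your clone, and executing the step's script commands by hand. The exact flags are in the linked document; the image name and mount paths here are placeholders:

```shell
# Illustrative only; image name and paths are placeholders.
# The --memory flags mimic the memory cap of a regular Pipelines step.
docker run -it --rm \
  --memory=4g --memory-swap=4g --memory-swappiness=0 \
  -v "$(pwd)":/workspace -w /workspace \
  python:3.11 /bin/bash
```

If the step also fails abruptly under the same memory cap locally, that narrows the problem down to resource limits rather than the Pipelines environment itself.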

Kind regards,
Theodora
