Fix Found for heartbeat problem

Peter Kahn February 15, 2013

Hi, I'm posting about issue BSP-7948 to share the cause and the fix in the hope that it helps others. Please feel free to close this immediately.

Fix:

Changing the JVM wrapper settings fixed the problem.

Approach

  • Set up a two-node system
  • Set up a dummy job that runs forever (a cmd shell running an infinite loop)
  • Used a stress tool to max out server disk, CPU and memory; this did not reproduce the heartbeat problem
  • Reviewed the logs and found a pattern (see Situation below): the agent's wrapper repeatedly tries to restart the JVM, and on Windows it complains that it cannot load wrapper.dll
  • Fixed the DLL problem by running all Windows agents with an x86 JDK
  • Fixed the JVM restart problem by increasing the time allowed for the JVM ping. Resulting wrapper settings, with wrapper.java.command pointing at the x86 JDK:
{code}
wrapper.java.command=d:\Java\x86\jdk1.6.0_26\bin\java.exe
wrapper.ping.timeout=900
wrapper.ping.interval=30
{code}

References

  • https://answers.atlassian.com/questions/61674/remote-agent-crashing-when-being-farmed
  • http://wrapper.tanukisoftware.com/doc/english/prop-ping-timeout.html

Problem:

Agent builds stop, and their results are lost, after a missed heartbeat. For months this appeared to be a server problem. Re-evaluate:

  • Determine whether we can replicate the problem in a micro environment under load
  • Check the agent logs for a pattern

Situation:

Server log:
{code}
2013-02-12 11:12:22,785 INFO ActiveMQ Session Task DefaultErrorHandler
Recording error: Agent General Purpose Windows X64 (vmwbuild-03) went offline while building TRUNK-QUICKCOMPILEWINDOWS-2876.
The build results will not be saved. : TRUNK-QUICKCOMPILEWINDOWS
{code}

Agent log:
{code}
ERROR | wrapper | 2013/02/12 11:08:54 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR | wrapper | 2013/02/12 11:08:55 | JVM did not exit on request, terminated
STATUS | wrapper | 2013/02/12 11:09:10 | Launching a JVM...
ERROR | wrapper | 2013/02/12 11:09:34 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR | wrapper | 2013/02/12 11:09:34 | JVM did not exit on request, terminated
STATUS | wrapper | 2013/02/12 11:09:39 | Launching a JVM...
ERROR | wrapper | 2013/02/12 11:10:08 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR | wrapper | 2013/02/12 11:10:08 | JVM did not exit on request, terminated
STATUS | wrapper | 2013/02/12 11:10:13 | Launching a JVM...
ERROR | wrapper | 2013/02/12 11:10:42 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR | wrapper | 2013/02/12 11:10:42 | JVM did not exit on request, terminated
STATUS | wrapper | 2013/02/12 11:10:47 | Launching a JVM...
ERROR | wrapper | 2013/02/12 11:11:16 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR | wrapper | 2013/02/12 11:11:16 | JVM did not exit on request, terminated
STATUS | wrapper | 2013/02/12 11:11:21 | Launching a JVM...
INFO | jvm 105 | 2013/02/12 11:11:31 | Wrapper (Version 3.2.3) http://wrapper.tanukisoftware.org
INFO | jvm 105 | 2013/02/12 11:11:31 | Copyright 1999-2006 Tanuki Software, Inc. All Rights Reserved.
...
INFO | jvm 105 | 2013/02/12 11:11:32 | WARNING - Unable to load the Wrapper's native library 'wrapper.dll'.
INFO | jvm 105 | 2013/02/12 11:11:32 | The file is located on the path at the following location but
INFO | jvm 105 | 2013/02/12 11:11:32 | could not be loaded:
INFO | jvm 105 | 2013/02/12 11:11:32 | d:\bamboo-agent-home\bin\..\lib\wrapper.dll
INFO | jvm 105 | 2013/02/12 11:11:32 | Please verify that the file is readable by the current user
INFO | jvm 105 | 2013/02/12 11:11:32 | and that the file has not been corrupted in any way.
INFO | jvm 105 | 2013/02/12 11:11:32 | One common cause of this problem is running a 32-bit version
INFO | jvm 105 | 2013/02/12 11:11:32 | of the Wrapper with a 64-bit version of Java, or vica versa.
INFO | jvm 105 | 2013/02/12 11:11:32 | This is a 64-bit JVM.
INFO | jvm 105 | 2013/02/12 11:11:32 | Reported cause:
INFO | jvm 105 | 2013/02/12 11:11:32 | D:\bamboo-agent-home\lib\wrapper.dll: Can't find dependent libraries
INFO | jvm 105 | 2013/02/12 11:11:32 | System signals will not be handled correctly.
INFO | jvm 105 | 2013/02/12 11:11:32 |
INFO | jvm 105 | 2013/02/12 11:11:41 | Agent bootstrap using baseUrl: http://vmbamboo-01:8085/bamb

{code}
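
The pattern from the Approach steps (a hung JVM followed by repeated relaunches, plus the wrapper.dll load warning) can be spotted mechanically across many agent logs. The following is a minimal Python sketch, not part of the original fix: the function names are hypothetical, the restart-loop rule is my own heuristic, and the message fragments are copied from the wrapper 3.2.3 log excerpt above, so other wrapper versions may word them differently.

```python
import re
from collections import Counter

# Message fragments copied from the wrapper 3.2.3 log excerpt in this post.
# ASSUMPTION: other wrapper versions may word these messages differently.
PATTERNS = {
    "hung": "JVM appears hung",
    "startup_failed": "Startup failed: Timed out waiting for a signal from the JVM",
    "launch": "Launching a JVM...",
    "native_lib_failed": "Unable to load the Wrapper's native library",
    "bitness_hint": "This is a 64-bit JVM",
}

# Wrapper log lines look like: LEVEL | source | yyyy/mm/dd hh:mm:ss | message
LINE_RE = re.compile(
    r"^\s*(\w+)\s*\|\s*([^|]+?)\s*\|\s*(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*\|\s?(.*)$"
)


def scan_wrapper_log(lines):
    """Count restart-loop and native-library symptoms in wrapper log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything not in wrapper log format
        message = match.group(4)
        for key, fragment in PATTERNS.items():
            if fragment in message:
                counts[key] += 1
    return counts


def looks_like_wrapper_restart_loop(counts):
    """Hypothetical heuristic: at least one hang followed by repeated relaunches."""
    return counts["hung"] >= 1 and counts["launch"] >= 2
```

To scan a real log, pass an open file, for example `scan_wrapper_log(open(path, encoding="utf-8", errors="replace"))`; a hit on both `native_lib_failed` and `bitness_hint` suggests the 32/64-bit mismatch the wrapper describes.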

Answer accepted
Peter Kahn February 15, 2013

Heartbeat problems can occur when the agent's wrapper restarts the local JVM because the JVM did not answer its ping in time. We can avoid these restarts by relaxing the schedule the wrapper uses to check the JVM:

{code}
wrapper.ping.timeout=900
wrapper.ping.interval=30
{code}
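
To confirm which ping values an agent actually runs with, a small parser for the wrapper configuration can help. This is a sketch under several assumptions: it handles only plain key=value lines (it ignores #include directives and environment-variable expansion), the fallback values of a 30 s timeout and a 5 s interval are the defaults as I understand the Tanuki documentation (verify them for your wrapper version), and the function names are hypothetical. On a Bamboo agent the file is typically conf\wrapper.conf under the agent home, which is also an assumption.

```python
# Fallbacks used when a key is absent. ASSUMPTION: these match the Tanuki
# documented defaults; check the docs for your wrapper version.
ASSUMED_DEFAULTS = {"wrapper.ping.timeout": 30, "wrapper.ping.interval": 5}


def read_wrapper_properties(text):
    """Parse plain key=value lines, skipping blanks and lines starting with '#'.

    Note: '#include' directives and %VAR% expansion are NOT handled.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def ping_settings(props):
    """Return (timeout, interval) in seconds, using the assumed defaults if absent."""
    timeout = int(props.get("wrapper.ping.timeout", ASSUMED_DEFAULTS["wrapper.ping.timeout"]))
    interval = int(props.get("wrapper.ping.interval", ASSUMED_DEFAULTS["wrapper.ping.interval"]))
    return timeout, interval


def describe(props):
    """Summarize how patient the wrapper will be with an unresponsive JVM."""
    timeout, interval = ping_settings(props)
    return f"pings every {interval}s, waits up to {timeout}s for each reply"
```

For example, running `describe(read_wrapper_properties(text))` on a config containing the two settings above returns "pings every 30s, waits up to 900s for each reply".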

