Crowd Bootstrap Failed - The Crowd database is being updated by another instance

Mystech August 16, 2017

I've recently upgraded to Crowd 3.0.0. Upon starting Crowd, it fully loads, but shortly afterwards it gives the error:

ERROR [atlassian.scheduler.core.JobLauncher] Scheduled job with ID 'com.atlassian.crowd.manager.cluster.ClusterSafetyManager-checkSafety' failed
java.lang.RuntimeException: The Crowd database (jdbc:postgresql://localhost:5432/crowd) is being updated by another instance. The instance IP is 127.0.0.1. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

I have a single-server deployment, not a clustered deployment. I did find an article about a similar issue in Confluence, but the resolutions weren't helpful for Crowd.

I also note that I'm connecting via localhost in my database connection string, while the error reports the instance IP as 127.0.0.1; of course, both of those are the same host. I tried updating my connection string to replace localhost with 127.0.0.1 in case there was an issue there, but either way I get the same error (the only difference being that localhost above is replaced with 127.0.0.1).

This seems to be a situation where the Crowd Server is blocking itself, but I'm not sure where to go or how to troubleshoot.

 

6 answers

0 votes
Marcin Kempa
Atlassian Team
September 8, 2017

This problem may be related to the following issue: CWD-4974.

@Mystech @mimacom International GmbH @Sistemas Startic @Michael Schneider, could you check if this is the case in your setup?

Thanks,

Marcin Kempa

0 votes
Marcin Kempa
Atlassian Team
September 7, 2017

@Mystech @mimacom International GmbH @Michael Schneider @Sistemas Startic

Could you tell us a little bit more about what the upgrade process looks like in your organisation? Are you using the Crowd WAR distribution by any chance?

After analysing one of the cases with the exact same problem, I noticed that the Crowd web context is started twice, but for different paths (context roots):

localhost-startStop-1 INFO [ContainerBase.[Catalina].[localhost].[/]] Initializing Spring root WebApplicationContext

and some time later in the logs:

localhost-startStop-1 INFO [ContainerBase.[Catalina].[localhost].[/crowd]] Initializing Spring root WebApplicationContext

which is then followed by two (or more) log entries such as:

Starting Crowd Server, Version: 3.0.0 (Build:#865 - 2017-08-14)

occurring shortly after each other (there are no entries about Crowd being stopped in between).

This indicates that even though there is only one Crowd Java process, as returned by

$ ps -ef | grep java | grep -v grep | wc -l

there are two (or possibly more) Crowd applications deployed on the same application server (Tomcat by default in the Crowd standalone distribution).

Therefore, since running more than one instance of Crowd against the same database can become a problem, I would like to ask you to check for the aforementioned log entries and verify whether you are by any chance running more than one Crowd application deployed to the same application server (one can check that by connecting to the Tomcat Java process via jconsole, or with a log-based check like the sketch below).
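
A rough sketch of that log-based check (the catalina.out path assumes the default standalone/Tomcat layout, so adjust it to your installation):

# Count how many times a Crowd web context was initialized since the last Tomcat start;
# more than one hit per startup suggests the application is deployed more than once
grep -c "Initializing Spring root WebApplicationContext" <CROWD_INSTALL_DIR>/apache-tomcat/logs/catalina.out

# Show the context roots ([/], [/crowd], ...) the message was logged for
grep "Initializing Spring root WebApplicationContext" <CROWD_INSTALL_DIR>/apache-tomcat/logs/catalina.out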

I would greatly appreciate your reply.

0 votes
lpater
Atlassian Team
August 24, 2017

Hi all,

I've raised CWD-4964 to track this issue. We'll release Crowd 3.0.1 shortly with a fix.

In the meantime you can disable the check using the system property documented in CWD-4964; this should prevent the problem from occurring.
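
For the standalone distribution, a JVM system property is usually passed in via setenv.sh; a rough sketch, with the property name left as a placeholder (the real name is documented in CWD-4964):

# In <CROWD_INSTALL_DIR>/apache-tomcat/bin/setenv.sh (create it if it is not there),
# substitute the property name documented in CWD-4964 for the placeholder, then restart Crowd
CATALINA_OPTS="-D<property.name.from.CWD-4964>=true ${CATALINA_OPTS}"
export CATALINA_OPTS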

lpater
Atlassian Team
August 25, 2017

A new version of Crowd, 3.0.1, is now available and contains a fix for this problem.

0 votes
Sistemas Startic August 24, 2017

Same problem here.

I have this topology

APPS----ssl-----[NGINX------8090----CROWD]

nginx, Crowd, and Postgres are on the same server, with Postgres listening on localhost only. After upgrading from 2.7.3 to 3.0, I receive this error a few seconds after Crowd starts up:

The Crowd database (jdbc:postgresql://127.0.0.1:5432/crowddb) is being updated by another instance. The instance IP is 127.0.0.1. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

After that error I checked the cwd_cluster_safety table and there is only one record, from 127.0.0.1. Of course, there is only one Crowd instance.

Maybe the nginx reverse proxy is causing this?

I just upgraded my Crowd instance following the Automatic Database Upgrade method, but I had to roll back because of that bug.

Please tell me how to disable that scheduled job as well.

Thank you.

0 votes
lpater
Atlassian Team
August 18, 2017

Hi,

As Ann mentioned, this is triggered by a check that tries to determine whether multiple Crowd instances are accessing the same database, and shuts Crowd down in order to prevent data corruption.

This check was introduced in 3.0.0. The way it works is that every 2 minutes the instance reads a value from the 'cwd_cluster_safety' table, compares it with the value it expects to find, and writes a new value there.

If the value read from the table differs from the one it previously wrote, the instance shuts down with the message you mentioned. This shouldn't happen if only a single instance is running.
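
Roughly, each check cycle amounts to something like the following. This is only an illustration of the description above, not the actual Crowd code; the table and column names come from this thread, and the connection details are placeholders:

# 1. read the token currently stored in the database
psql -U crowd -d crowd -c "select entry_value from cwd_cluster_safety where entry_key = 'clusterSafetyToken';"
# 2. compare it with the token this instance wrote two minutes ago; a mismatch means some
#    other process has written to the table, so the instance logs the error and shuts down
# 3. otherwise, write a fresh random token for the next cycle
psql -U crowd -d crowd -c "update cwd_cluster_safety set entry_value = '<new-random-uuid>' where entry_key = 'clusterSafetyToken';"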

I'd advise double-checking that there isn't a leftover instance still running, and that no other process is accessing the database and changing it in this manner. Try running `ps aux` as a superuser to verify the process list and search for Crowd instances (there should be exactly one). Enabling database query logging and checking the queries made against the table might also help you discover the culprit.
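
For example, something along these lines (this assumes PostgreSQL, as in the original post, and 9.4 or later for ALTER SYSTEM; adjust the database name and credentials to your environment):

# See which clients are currently connected to the Crowd database
psql -U postgres -c "select pid, usename, client_addr, application_name, state from pg_stat_activity where datname = 'crowd';"

# Temporarily log all data-modifying statements, then watch the PostgreSQL log for writes
# to cwd_cluster_safety and note which client they come from
psql -U postgres -c "alter system set log_statement = 'mod';"
psql -U postgres -c "select pg_reload_conf();"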

If you still have issues, please reach out to Atlassian Support, so that we can investigate further.

Mystech August 18, 2017

Yesterday while testing I rebooted the server just to make sure there was nothing in memory. I have confirmed by running ps aux as superuser that there is only one process in the list, but I did not go so far as enabling database query logging.

Just wanted to make a note here that it's definitely not an issue of multiple instances or the same instance running twice.

mimacom International GmbH August 24, 2017

Hi there,

 

I'm facing the same issue but I can fix it by simply rebooting the instance.

It seems to occur after I run my Ansible playbook, which is used to set up Crowd and other servers.

I checked this by printing the contents of "cwd_cluster_safety" before and after the playbook run.

 

Before:

crowd=# select * from cwd_cluster_safety;
entry_key | entry_value | node_id | ip_address | entry_timestamp
--------------------+--------------------------------------+---------------+---------------+-----------------
clusterSafetyToken | ca6bea73-5a84-45ba-a825-27093f91a204 | NOT_CLUSTERED | 172.18.136.20 | 1503562271442
(1 row)


After:

crowd=# select * from cwd_cluster_safety;
entry_key | entry_value | node_id | ip_address | entry_timestamp
--------------------+--------------------------------------+---------------+---------------+-----------------
clusterSafetyToken | 55137567-37d8-4811-8c3b-4002bf3cd2b7 | NOT_CLUSTERED | 172.18.136.20 | 1503562391439
(1 row)

 

The value "entry_value" immediately changes after the playbook run, but the instance is running without any error messages.

When I run my playbook again, the entry in the database changes again:

crowd=# select * from cwd_cluster_safety;
entry_key | entry_value | node_id | ip_address | entry_timestamp
--------------------+--------------------------------------+---------------+---------------+-----------------
clusterSafetyToken | e58872e6-30fc-4b44-99da-48e01041de6d | NOT_CLUSTERED | 172.18.136.20 | 1503562918315
(1 row)

 

Additionally, when I tail the logs I can see an error occurring, and Crowd seems to restart itself (without spawning a new Linux process).

 

Filtered logs:

2017-08-24 10:22:09,640 ContainerBackgroundProcessor[StandardEngine[Catalina]] ERROR [atlassian.event.internal.AsynchronousAbleEventDispatcher] There was an exception thrown trying to dispatch event [com.atlassian.plugin.event.events.PluginFrameworkShutdownEvent@256fdd8e] from the invoker [com.atlassian.plugin.event.impl.MethodSelectorListenerHandler$1$1@68ff6095]
java.lang.RuntimeException: java.lang.NullPointerException

[...]

2017-08-24 10:22:09,664 ContainerBackgroundProcessor[StandardEngine[Catalina]] INFO [com.atlassian.crowd.startup] Stopping Crowd
2017-08-24 10:23:11,433 Caesium-1-2 ERROR [crowd.manager.cluster.ClusterSafetyManager] The Crowd database (jdbc:postgresql://localhost:5432/crowd) is being updated by another instance. The instance IP is 172.18.136.20. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

 

Linux processes, there is always one:

[root@mima-crowdtest-01 rewe]# ps -ef | grep java | grep -v grep | wc -l
1

 

And it is the same process as before I ran my playbook, because it has been running for 30 minutes:

[root@mima-crowdtest-01 rewe]# ps -o etime= -p 23986
30:20

 

It's important to note that Ansible reports it didn't change anything, on all three runs!

 

This brings me to the question: what exactly is "entry_value", where does it come from, why does it seem to change at runtime, and why does Crowd seem to restart itself?

 

Cheers,
Remo

0 votes
AnnWorley
Atlassian Team
August 17, 2017

All my test instances are using localhost in the JDBC connection string, so I think we are OK there.

Crowd thinks another instance is connecting to its database. We call this state a "cluster panic".

Please make sure there is only one entry in the table cwd_cluster_safety by running this query:

select * from cwd_cluster_safety;

There should only be one record in that table. If there are two, shut down Crowd, back up the database, and delete one of the rows. If you would like the syntax to delete a row, please post the results of the query.

In my testing, I had only one record in the cwd_cluster_safety table, and it persisted when I shut down Crowd. Because it is a test instance, I experimented by deleting the only row; Crowd re-populated it on startup and seems to be running fine. I have to emphasize: since this seems to be your production instance, please back up the database before running any SQL against it.
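
For reference, that kind of cleanup would look roughly like the following, only with Crowd stopped first; the connection details and the WHERE clause are placeholders, so base the actual delete on what the SELECT returns:

# 1. back up the Crowd database first
pg_dump -U postgres crowd > crowd-backup.sql
# 2. inspect the table
psql -U postgres -d crowd -c "select * from cwd_cluster_safety;"
# 3. only if there is more than one row: delete the stale one, e.g. by its ip_address
psql -U postgres -d crowd -c "delete from cwd_cluster_safety where ip_address = '<stale-node-ip>';"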

Mystech August 17, 2017

Thanks Ann.  There's only one entry in the cwd_cluster_safety table.  I have a backup of the database from before the upgrade, so I tried the same thing yesterday before posting -- I shut down Crowd and just deleted the one entry from the cwd_cluster_safety table.  On startup it repopulated, but I was still getting the same error.

AnnWorley
Atlassian Team
August 17, 2017

If this is on Linux, please check that there is no other instance running (or not quite shut down):

ps aux | grep java

On Windows, the processes can be listed by running TASKLIST.

The way I read the error message, it sounds like the other instance updating the database is on 127.0.0.1 (the local machine).

Mystech August 17, 2017

It's a Linux server. I do have other Java applications on that server (I have other Atlassian products), but Crowd is not running. Running the grep for Java apps shows Confluence and Jira, but not Crowd. Running the grep for crowd only returns the grep command itself.

When I perform the upgrade for Crowd, my first step is to stop previous instances and move the install directory so I don't inadvertently start it up or start up the wrong instance.

I'm also running Crowd in the foreground for testing so I can make sure nothing else is starting up (it happens whether I run it in the foreground or not though).

AnnWorley
Atlassian Team
August 17, 2017

If you don't ever plan to run Crowd Data Center (clustered instance) and you are confident you will not accidentally spin up a test instance against the database, then we can disable the scheduled job that checks the cluster safety table.

If you are interested in that strategy, please let me know.

It will have to be done directly in the database. I will need to do some testing to see which job it is (I have to delete a value, spin up two instances, and make sure the second one starts despite the first one being pointed at the database).

In the cwd_cluster_job table there are two jobs that I need to test one at a time, to see which one (or whether both) needs to be disabled: "clusterMessageReaperJob" and "clusterNodeInformationPrunerJob". I am betting on "clusterMessageReaperJob".
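
For reference, that table can be queried the same way as the safety table (connection details are placeholders; adjust them to your setup):

# List the scheduled jobs Crowd has registered in the database
psql -U postgres -d crowd -c "select * from cwd_cluster_job;"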

It will be later today or tomorrow morning before I finish testing if you want me to try it.

Mystech August 17, 2017

Thanks Ann! That works for me. We're not going to run it as a clustered instance, and our process really won't start two instances up at the same time.

AnnWorley
Atlassian Team
August 18, 2017

You should be getting an email from the Atlassian support portal shortly.

Michael Schneider August 22, 2017

@AnnWorley We have the same issue as well. Can you provide me with the necessary information to disable the job?

We run Crowd behind a proxy and configured it to listen only on 127.0.0.1. But as seen in the DB table cwd_cluster_safety, it records the IP address of the public interface. Could this cause the error?

lpater
Atlassian Team
August 22, 2017

Hi Michael,

The IP address shouldn't really matter. It's put into that table just for informational purposes, so that when the problem occurs we can display it. The only case that should trigger the error is multiple instances writing to the same database.

Have you verified there are no other Crowd instances running that might cause the issue, or an old instance that's still active? 

Are you also using postgres? Which version?

To diagnose further, please try enabling additional logging by adding the line:

log4j.logger.com.atlassian.crowd.manager.cluster=TRACE

to your <CROWD_INSTALL_DIR>/crowd-webapp/WEB-INF/classes/log4j.properties file, and restart Crowd. This should output extra information about the check to the logs. The entries should look like this:

2017-08-22 13:49:25,236 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token b0c5ea83-a9e1-4c13-b18c-58ccba84234b
2017-08-22 13:51:25,235 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token b0c5ea83-a9e1-4c13-b18c-58ccba84234b
2017-08-22 13:51:25,248 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 306538af-397f-404b-8e28-95275f64e7b8
2017-08-22 13:53:25,235 Caesium-1-3 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token 306538af-397f-404b-8e28-95275f64e7b8
2017-08-22 13:53:25,246 Caesium-1-3 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 1c4b60f6-8280-48bb-a548-6ec7a6c9cb9c

If you can, please raise a support ticket so that we can investigate further.

Michael Schneider August 22, 2017

Sorry, I can't raise a support ticket as we just have a "Starter License" in place.

2017-08-22 14:10:37,342 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 23e0ffba-56c6-4daf-9555-00edf2757d58
2017-08-22 14:11:12,366 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 9a0900fc-4cc3-4a36-9eb2-21e08141ec04
2017-08-22 14:12:37,342 Caesium-1-3 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token 23e0ffba-56c6-4daf-9555-00edf2757d58
<Crowd runs in error mode>
2017-08-22 14:13:12,362 Caesium-1-2 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token 9a0900fc-4cc3-4a36-9eb2-21e08141ec04
2017-08-22 14:13:12,384 Caesium-1-2 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 24edaf4e-b87f-44a1-8627-ef24b52869b9
2017-08-22 14:15:12,362 Caesium-1-3 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token 24edaf4e-b87f-44a1-8627-ef24b52869b9
2017-08-22 14:15:12,381 Caesium-1-3 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token 7a63b64b-3270-4893-917c-4c695723547c
2017-08-22 14:17:12,362 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Doing a cluster safety check with token 7a63b64b-3270-4893-917c-4c695723547c
2017-08-22 14:17:12,377 Caesium-1-1 TRACE [crowd.manager.cluster.ClusterSafetyManager] Writing cluster safety token f08a6e59-a594-4c9d-b7b6-c16761596277

 We're using PostgreSQL 9.5.8 on Ubuntu 16.04.

ps -ef | grep java

shows only one process.

lpater
Atlassian Team
August 22, 2017

Thanks for the logs. They do seem to show two instances of the ClusterSafetyManager running at once, which could be causing the issue.

Is this a fresh installation of Crowd? Which distribution have you downloaded (zip, tar.gz, or war)? Did you make any changes to any config files in the Crowd installation directory (specifically web.xml)? Are you using any custom integrations or add-ons? Which JVM are you running?

I'd also be grateful if you could send the full logs to lpater@atlassian.com.

mimacom International GmbH August 24, 2017

@AnnWorley Could you please send us the information as well? Thank you very much.
