We are deploying Windows agents using the script provided in the KB article:
However, in our case these agents are being deployed on machines that have never had the agent installed. Upon deployment, approximately 30-40% of agents report back as having invalid/unhealthy tokens; the other 60-70% are fine. All of them are deployed with the same script, many on the same subnet as machines that work without issue. Because we are getting an “Unhealthy” error message, communication is clearly taking place, and we have ruled out firewall issues. If we first revoke the token, some machines then report back in with valid tokens, but this happens for less than 1% of the affected machines. For the others, uninstalling and reinstalling the client multiple times, manually deleting files and restarting the machines, resolves the issue for about 80% of the affected machines. The remaining machines still fail to acknowledge the tokens.
What is confounding is that, as stated above, reinstalling the agents multiple times and manually deleting the installation folders sometimes works right away and other times only works after a number of attempts. We cannot find a pattern to the problem, or even to the seemingly random resolution. We wanted to see if anyone else has encountered this issue.
As a follow-up, we have noticed that deleting the .bak file as well as the .cfg file resolves the issue for some of the problematic clients. Initially we tried editing the .cfg file to blank out the token string, and we noticed two things. First, we needed to reset permissions on that file to give the svc account running the automated client deployment read/write access. Second, removing the .bak file was also critical, because otherwise the old "bad" token string would return.
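The cleanup described above can be sketched in Python. This is a minimal, hypothetical sketch: the install directory `C:\ProgramData\ExampleAgent` and the file names `agent.cfg` / `agent.cfg.bak` are placeholders, not the agent's real paths, and it assumes the agent service has already been stopped. It removes both files in one pass so the backup cannot restore the old token, and it reports any file the running account is not allowed to delete (the permissions problem mentioned above) instead of failing silently. It does not fix ACLs or revoke anything on the server.

```python
from pathlib import Path


def purge_token_files(config_dir, file_names, dry_run=False):
    """Delete the agent config file and its backup in one pass.

    Hypothetical helper: config_dir and file_names are placeholders to be
    replaced with the real install path and file names for your agent.
    Both files must go: if only the .cfg is cleared, the .bak copy can
    restore the old token.

    Returns (removed, denied): paths that were deleted (or would be, when
    dry_run is True) and paths the current account could not delete.
    """
    removed, denied = [], []
    for name in file_names:
        path = Path(config_dir) / name
        if not path.exists():
            continue  # nothing to clean up for this name
        if dry_run:
            removed.append(path)  # report only, leave the file in place
            continue
        try:
            path.unlink()
            removed.append(path)
        except PermissionError:
            # The account running the cleanup lacks rights on this file;
            # its permissions must be fixed before the purge can succeed.
            denied.append(path)
    return removed, denied


if __name__ == "__main__":
    # Hypothetical install directory and file names; adjust before use.
    removed, denied = purge_token_files(
        r"C:\ProgramData\ExampleAgent",
        ["agent.cfg", "agent.cfg.bak"],
        dry_run=True,
    )
    print("would remove:", [str(p) for p in removed])
    print("access denied:", [str(p) for p in denied])
```

Running it with `dry_run=True` first shows which files would be removed; any path listed under "access denied" needs its permissions corrected for the deployment account before the cleanup is rerun.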
We are still trying to work out why the last few clients are still giving us issues. Not sure whether we are somehow missing steps with them.
One thing I forgot to mention in my original post: for any client that shows an "unhealthy" token message, you should "revoke" that token, even though that is not an intuitive thing to do. The reason is that the server may hold a record for that client that does not match the one on the client. Not sure why, but this did resolve the issue for 1-2 of the clients. I assume the initial attempt somehow created an entry on the server that needed to be removed server side.
Nonetheless, we are still working through the rest of the clients to see whether deleting the .cfg and .bak files, along with revoking the token on the server, is the ultimate solution.