Best practices for monitoring tool test design

Hello, Community! We have a SaaS product where multiple tenants are hosted within a single instance.

What is the best practice for implementing automated checks from a monitoring tool when behavior can differ from one tenant to the next? If you run a pass/fail check against a single test tenant and the test passes, another tenant could still be having issues with the same component. The fact that the check did not fail in the test tenant shows that the root cause of the other tenant's issue is not a system-wide incident. Is this method of checking sufficient? The alternative would be to gather statistics and run the check against the average of the data points collected from all tenants in the system. In that case, one tenant with a tenant-specific issue could skew the average, causing the check to fail erroneously.

Thanks for any guidance you may have :)

1 answer

1 vote

Hi @Corey Garretson,

Happy to help!

Statuspage itself wouldn't be able to perform those checks, but it could certainly work in tandem with another service, like Pingdom, that would! Another option would be to use Opsgenie, again paired with a service that runs pass/fail tests on different components, and then alert based on the results. That being said, I would recommend that you look into services like Pingdom or Datadog to run the pass/fail tests on your components, and then use Statuspage or Opsgenie to notify the associated parties so that action can be taken on the error.

Please let me know if you have any follow up questions!


Hi @Shivam Naik, thanks for your reply! We already use another service that runs pass/fail tests on our components, and we pipe the outputs to Statuspage to update status automatically. My question is more about best practices for configuring those tests. Is running the checks against a single test tenant sufficient, or is it better to gather stats from across all tenants and have the monitoring tool (Pingdom/Datadog) report a failure for any check where the group average exceeds a threshold?
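To make the concern about averages concrete, here is a minimal sketch in Python. The tenant names, latency values, and the 500 ms threshold are all made up for illustration and do not come from any real monitoring tool. It shows how one tenant with a tenant-specific problem can push a group average over a threshold, while a median over the same data is barely affected.

```python
from statistics import mean, median

# Hypothetical per-tenant response times in milliseconds (illustrative data only).
latencies_ms = {
    "tenant-a": 120,
    "tenant-b": 130,
    "tenant-c": 110,
    "tenant-d": 125,
    "tenant-e": 5000,  # one tenant with a tenant-specific problem
}

THRESHOLD_MS = 500  # assumed alert threshold, not taken from any real configuration


def average_check(values, threshold):
    """Pass (True) when the group mean is at or below the threshold."""
    return mean(values) <= threshold


def median_check(values, threshold):
    """Pass (True) when the group median is at or below the threshold."""
    return median(values) <= threshold


values = list(latencies_ms.values())
print("mean:", mean(values), "passes:", average_check(values, THRESHOLD_MS))
print("median:", median(values), "passes:", median_check(values, THRESHOLD_MS))
```

With this data the mean is 1097 ms, so the average-based check fails even though four of the five tenants are healthy, while the median is 125 ms and the median-based check passes. Whether a median, a percentile, or something else is the right statistic depends on what the check is meant to detect.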


Hi @Corey Garretson,

Thank you for the clarification!

I think with Statuspage it would be best to run the check against a single test tenant so that you can alert immediately when it fails. The grouped approach could work, but I believe it would rely more on Pingdom/Datadog correctly detecting when the results cross the threshold. Both options could work, but for what Statuspage does on its own, the single-tenant test would be the better option for updating Components: it lets you at least send out a message stating that an irregularity was found and that testing is under way to assess whether other tenants are affected.
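As a rough illustration of how the two approaches could be combined, here is a small Python sketch that takes per-tenant pass/fail results and separates a tenant-specific problem from a likely system-wide one. Everything in it is an assumption for the sake of the example: the `classify` function, the status labels, and the 50% cut-off are hypothetical and are not part of Statuspage, Pingdom, or Datadog.

```python
def classify(results, systemwide_fraction=0.5):
    """Classify per-tenant check results.

    results: dict mapping tenant name -> bool (True means the check passed).
    systemwide_fraction: assumed share of failing tenants at or above which
        the problem is treated as system-wide (hypothetical default).
    Returns a (label, failed_tenants) tuple.
    """
    if not results:
        raise ValueError("no tenant results supplied")

    failed = sorted(tenant for tenant, ok in results.items() if not ok)
    if not failed:
        return ("operational", failed)
    if len(failed) / len(results) >= systemwide_fraction:
        return ("system-wide", failed)
    return ("tenant-specific", failed)


# One failing tenant out of four stays below the 50% cut-off.
print(classify({"a": True, "b": True, "c": False, "d": True}))
# Every tenant failing is treated as a system-wide incident.
print(classify({"a": False, "b": False}))
```

Counting failing tenants rather than averaging a metric keeps one unhealthy tenant from tripping a system-wide alert, while still surfacing that tenant by name.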

Please let me know if you have any follow up questions!


Thanks!


