Best practices for monitoring tool test design

Corey Garretson November 8, 2022

Hello, Community! We have a SaaS product where multiple tenants are hosted within a single instance. 

What is the best practice for implementing automated checks from a monitoring tool when behavior can differ from one tenant to the next? If you run a pass/fail check against a given test tenant and it passes, another tenant could still be having issues with the same component. The passing check in the test tenant would at least indicate that the other tenant's issue is not a system-wide incident. Is this method of checking sufficient? The alternative would be to gather statistics and run a check against the average of the data points collected from all tenants in the system. In that case, a single tenant with a tenant-specific issue could skew the average and cause the check to fail erroneously.
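To make the concern about averages concrete, here is a minimal Python sketch. The tenant names, latency values, and the 500 ms threshold are all made up for illustration, and the median check is only a point of comparison, not something any particular monitoring tool is known to do by default.

```python
from statistics import mean, median

# Hypothetical per-tenant response times in milliseconds (made-up data).
latencies_ms = {
    "tenant_a": 120,
    "tenant_b": 130,
    "tenant_c": 110,
    "tenant_d": 125,
    "tenant_e": 4000,  # one tenant with a tenant-specific problem
}

THRESHOLD_MS = 500  # assumed alert threshold


def average_check_fails(samples, threshold):
    """Fail when the mean across all tenants exceeds the threshold."""
    return mean(samples.values()) > threshold


def median_check_fails(samples, threshold):
    """Fail when the median across all tenants exceeds the threshold."""
    return median(samples.values()) > threshold


def tenants_over_threshold(samples, threshold):
    """List the individual tenants whose own value exceeds the threshold."""
    return sorted(t for t, value in samples.items() if value > threshold)


print(average_check_fails(latencies_ms, THRESHOLD_MS))     # mean is 897 ms
print(median_check_fails(latencies_ms, THRESHOLD_MS))      # median is 125 ms
print(tenants_over_threshold(latencies_ms, THRESHOLD_MS))  # only tenant_e
```

With this data the average-based check fails even though four of the five tenants are healthy, while the per-tenant view shows that only one tenant is over the threshold.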

Thanks for any guidance you may have :)

1 answer

1 vote
Shivam Naik
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
November 10, 2022

Hi @Corey Garretson ,

Happy to help!

Statuspage itself wouldn't be able to perform those checks, but it could certainly work in tandem with another service, like Pingdom, that would! Another option would be to use Opsgenie, again with a service that tests for pass/fail on different components, and then alert based on that. That being said, I would recommend that you look into services like Pingdom or Datadog to establish that pass/fail test between components, and then use Statuspage or Opsgenie to notify the associated parties so that action can be taken on the error.

Please let me know if you have any follow up questions!

Corey Garretson November 10, 2022

Hi @Shivam Naik , thanks for your reply! We already use another service that runs pass/fail tests on our components, and we pipe the outputs to Statuspage to update status automatically. My question is more around best practices for configuring those tests. Is running the checks against a single test tenant sufficient, or is it better to gather stats from across all tenants and have your Pingdom/Datadog monitoring tool read out a failure for any check where the group average exceeds a threshold? 

Shivam Naik
Atlassian Team
November 11, 2022

Hi @Corey Garretson ,

Thank you for the clarification!

I think with Statuspage it would be best to run the checks against a single test tenant so that you can alert immediately when that check fails. The grouped approach could work, but I believe it would rely more on Pingdom/Datadog correctly assessing when results cross the threshold in order to alert properly. Both options could work, but for what Statuspage does on its own, the single-tenant test would be the better option for notifying on Components: it at least lets you distribute a message stating that an irregularity was found and that testing is under way to assess whether other tenants are affected.
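As a rough illustration of how the two approaches might be combined, the Python sketch below takes the result of the single test-tenant check plus pass/fail results gathered across tenants and decides what kind of message to send. The function name, the status labels, and the 50% cutoff are hypothetical and are not Statuspage, Pingdom, or Datadog settings.

```python
def classify_scope(test_tenant_passed, tenant_results, systemwide_fraction=0.5):
    """Return (status, failing_tenants) from pass/fail check results.

    test_tenant_passed: bool result of the single test-tenant check.
    tenant_results: dict mapping tenant name -> True if its check passed.
    systemwide_fraction: assumed share of failing tenants that counts as
        a system-wide incident (hypothetical default).
    """
    failing = sorted(t for t, ok in tenant_results.items() if not ok)
    fraction = len(failing) / len(tenant_results) if tenant_results else 0.0

    if fraction >= systemwide_fraction and failing:
        # Many tenants failing at once: treat as system-wide.
        status = "system-wide"
    elif not test_tenant_passed:
        # Test tenant failed but few others did: post a notice and keep
        # checking whether other tenants are affected.
        status = "investigating"
    elif failing:
        # Test tenant is fine, but a few tenants have their own issue.
        status = "tenant-specific"
    else:
        status = "operational"
    return status, failing


print(classify_scope(False, {"a": True, "b": True, "c": True}))
print(classify_scope(True, {"a": True, "b": True, "c": False}))
```

The idea is that a failed test-tenant check triggers the immediate notice, while the cross-tenant results determine whether that notice is escalated to a system-wide incident or narrowed to specific tenants.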

Please let me know if you have any follow up questions!

Corey Garretson November 11, 2022

Thanks!
