Introduction
At the end of 2021, I had one of the most sleepless nights of my career. This story is about a problem we faced with the SLA calculation feature in Jira Service Management, which pushed our PostgreSQL cluster to its limits and caused a storage overflow. Here’s what happened, what I learned, and how we fixed it to prevent future issues.
The Calm Before the Storm
It started as a regular workday managing our Jira Service Management system. This system handles SLA calculations for many projects, and it lets you easily adjust SLA calendars and track service tickets. However, making these adjustments on large projects can be risky.
On that day, we needed to adjust the SLA calendar for two of our biggest projects, each with around 1.8 million tickets. The changes seemed simple, but they triggered a massive recalculation process that produced an enormous volume of updates in the database.
The Late Warning
Just 15 minutes after starting this process, our system administrators got a critical alert:
alertname = Disk free space less than 5%
alias = db01.company.local
mountpoint = /var/lib/
priority = P2
severity = warning
Annotations:
description = db01.company.local
summary = Free space on /var/lib/ < 5%
In those roughly 10-15 minutes, the server's 1.2 TB storage filled up. The SLA recalculations generated write-ahead log (WAL) segments faster than PostgreSQL could checkpoint and recycle them, and that runaway WAL growth is what exhausted the disk and put our PostgreSQL cluster in a critical state.
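In hindsight, a simple recurring check on WAL growth would have shown the trend well before the 5% alert. Here is a minimal sketch, assuming PostgreSQL 10 or later (where pg_ls_waldir() exists), the /var/lib/pgsql/14 layout used below, and a superuser or pg_monitor account:

# Total size and number of WAL segments, as seen from inside PostgreSQL
psql -c "SELECT pg_size_pretty(sum(size)) AS wal_size, count(*) AS segments FROM pg_ls_waldir();"

# The same picture from the OS side
du -sh /var/lib/pgsql/14/data/pg_wal
df -h /var/lib/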
A Risky Fix Under Pressure
With a crisis at hand and pressure from customers, I had to take a quick, risky action. Using pg_controldata and pg_archivecleanup, I tried to free up disk space:
/usr/pgsql-14/bin/pg_controldata -D /var/lib/pgsql/14/data/ > pgwalfile.txt
/usr/pgsql-14/bin/pg_archivecleanup -d /var/lib/pgsql/14/data/pg_wal $(cat pgwalfile.txt | grep "Latest checkpoint's REDO WAL file" | awk '{print $6}')
The first command dumps the cluster's control data so I could find the latest checkpoint's REDO WAL file; the second removes every WAL segment older than that file, quickly freeing up space. I then restarted the PostgreSQL cluster, hoping it would stabilize the system.
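A slightly safer variation of the same idea is to do a dry run first: pg_archivecleanup's -n flag only prints the segments it would remove, so you can sanity-check the cutoff before deleting anything. A sketch using the same paths as above:

# Find the oldest WAL segment that must be kept (the latest checkpoint's REDO WAL file)
REDO_WAL=$(/usr/pgsql-14/bin/pg_controldata -D /var/lib/pgsql/14/data/ | grep "Latest checkpoint's REDO WAL file" | awk '{print $6}')

# Dry run: list the segments that would be removed, without touching anything
/usr/pgsql-14/bin/pg_archivecleanup -n /var/lib/pgsql/14/data/pg_wal "$REDO_WAL"

# Only after reviewing that list, run the destructive version
# /usr/pgsql-14/bin/pg_archivecleanup -d /var/lib/pgsql/14/data/pg_wal "$REDO_WAL"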
Lessons Learned
Thankfully, the quick fix worked for the moment. Over the weekend, our team worked hard to fix the misconfigurations that caused the problem. Here are some important lessons I learned:
1. Know Your Hardware Limits: Never forget that physical hardware has limits. Make sure you have enough disk space and resources to handle unexpected spikes in usage.
2. Configure WAL Properly: Set up your WAL settings and log rotation carefully. Good WAL management is key for database stability and performance; see the configuration sketch after this list.
3. Track Changes and Have a Rollback Plan: When making changes to a large system, track them and always have a rollback plan. Never execute changes without a safety net.
4. Do Disaster Recovery Training: Regularly practice disaster recovery (DR) and high availability (HA) training. Be ready for worst-case scenarios and ensure your team can respond effectively.
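On lesson 2, WAL behaviour mostly comes down to a handful of sizing and retention parameters. The values below are a rough sketch for PostgreSQL 13+ and are purely illustrative, not recommendations for any particular workload:

# Illustrative WAL sizing and retention settings; size them to your disk and replication setup
psql -c "ALTER SYSTEM SET max_wal_size = '4GB';"             # soft cap on WAL between automatic checkpoints
psql -c "ALTER SYSTEM SET wal_keep_size = '2GB';"            # past WAL kept for standbys that are not using slots
psql -c "ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';"  # stop replication slots from retaining WAL without limit
psql -c "SELECT pg_reload_conf();"                           # all three parameters can be reloaded without a restart
# If archive_mode is on, also watch for failing archive_command runs: WAL is retained until archiving succeeds

The last setting is worth stressing: max_slot_wal_keep_size defaults to -1 (unlimited), so a single abandoned replication slot can fill a disk on its own.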
Conclusion
The quick fix helped us in that emergency, but I do not recommend such risky methods. Instead, follow best practices and refer to comprehensive guides like the one from Percona on PostgreSQL WAL retention and clean-up, and the official PostgreSQL documentation.
By sharing this story, I hope to remind my fellow system admins and DBAs to respect the limits of physical hardware and be careful with database configurations. Always be prepared, and remember: overlooking these details can lead to sleepless nights.
For more detailed guidance, check out these resources:
• Percona: PostgreSQL WAL Retention and Cleanup
• PostgreSQL Documentation on WAL
By learning from challenging experiences, we can build more resilient and reliable systems.
P.S. Or better yet, move away from on-prem to the cloud :)
Gonchik Tsymzhitov
Solution architect | DevOps
Cyprus, Limassol