We ran into a tricky problem this week. Most of our back-end data is stored in a database called PostgreSQL (aka “Postgres”). Postgres is a fantastic open-source database used by tons of companies around the globe. Postgres is normally rock-solid, but like any other complex system it can run into trouble, which is what happened to us.
To minimize the maintenance burden on our team, we use an AWS service (“RDS”) that manages the database for us. RDS replicates our database across multiple data centers and backs it up to safe locations. (This configuration is common and ensures that FieldClock can operate with negligible downtime even if our primary data center becomes unavailable.) A trade-off for this simplicity is that we don’t have access to the actual filesystem underneath Postgres.
In 15 years of working with Postgres, I’ve never run into a corrupted production database – but late last week we found that one of our database files was corrupt. This was likely caused by a hardware failure somewhere in AWS infrastructure. The good news is that there was no threat of data loss. (We back up religiously.) The bad news was that the corrupted file was interfering with routine maintenance that keeps the database snappy. Left unfixed, the database would have gotten slower over time and eventually (way in the future) stopped working.
Because we don’t have access to the actual filesystem, we couldn’t fix the corrupt file directly. We opened a ticket with AWS and their technicians dug in. The optimal solution would be to restore the corrupt file in the filesystem, but AWS was unable to do this in a timely manner. We eventually gave up on restoring the actual file and rebuilt the affected table in our database using our backup data. This solved the problem, but required some downtime.
At this point, we should have been out of the woods – but the downtime caused an unexpected side-effect…
FieldClock apps keep track of sync success rates. If sync fails too many times, the apps send additional data to help resolve the problem. The downtime caused all mobile clients to start sending this extra data at once, which amounted to an unplanned stress test on our system. This resulted in the Admin Site slowness and sync errors that many users experienced yesterday.
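For the technically curious, here is a minimal sketch of the kind of failure tracking described above. It is an illustration only, not FieldClock's actual app code: the `SyncTracker` class, the threshold of three, the payload field names, and the choice to count consecutive failures (rather than compute a success rate) are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical threshold: how many failed syncs in a row count as "too many".
FAILURE_THRESHOLD = 3


@dataclass
class SyncTracker:
    """Tracks consecutive sync failures for a single client (illustrative only)."""

    failure_threshold: int = FAILURE_THRESHOLD
    consecutive_failures: int = 0

    def record_result(self, succeeded: bool) -> None:
        # A successful sync resets the counter; a failure increments it.
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def should_send_diagnostics(self) -> bool:
        # Once the threshold is reached, the client starts attaching extra data.
        return self.consecutive_failures >= self.failure_threshold

    def build_payload(self, records: list[dict], diagnostics: dict) -> dict:
        # The normal payload carries only the records; the larger diagnostic
        # payload is added only after repeated failures.
        payload: dict = {"records": records}
        if self.should_send_diagnostics():
            payload["diagnostics"] = diagnostics
        return payload
```

Under these assumptions, an outage that outlasts a few sync attempts pushes every client past the threshold at roughly the same moment. When the server comes back, they all send the larger payload together, which is the kind of load spike we saw.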
Ultimately we were able to resolve the “stress test” by optimizing some code as well as tweaking some server configurations to better handle the syncing clients. The final tweaks took place yesterday evening and the system has been healthy since.
In the process of solving the problems above, we found several places to optimize our existing infrastructure and code as well as some points of failure that need better automated monitoring and alerts for our team. The end result is that the FieldClock system is stronger now than it was before the incident.
We realize this is more technical information than most of our users care about, but we want to be totally transparent about what goes on “behind the scenes”. We are deeply appreciative of you, our users, and of your trust in our system. It is important to know that there was never a threat of data loss at any point. The worst-case scenario is what we went through, which was slower-than-usual Admin Site performance and an increased rate of sync errors in the mobile apps.
~josh