Admin site 'slowness'

Incident Report for FieldClock

Postmortem

We ran into a tricky problem this week. Most of our back-end data is stored in a database called PostgreSQL (aka “Postgres”). Postgres is a fantastic open-source database used by tons of companies around the globe. Postgres is normally rock-solid, but like any other complex system it can run into trouble, which is what happened to us.

Background

To minimize the maintenance burden on our team, we use an AWS service (“RDS”) that manages the database for us. RDS replicates our database in multiple data centers and backs up to safe locations. (This configuration is common and ensures that FieldClock can operate with negligible downtime even if our primary data center becomes unavailable.) The trade-off for this simplicity is that we don’t have access to the actual filesystem underneath Postgres.

The Problem

In 15 years of working with Postgres, I’ve never run into a corrupted production database – but late last week we found that one of our database files was corrupt. This was likely caused by a hardware failure somewhere in AWS infrastructure. The good news is that there was no threat of data loss. (We back up religiously.) The bad news was that the corrupted file was interfering with routine maintenance that keeps the database snappy. Left unfixed, the database would have gotten slower over time and eventually (way in the future) stopped working.

Because we don’t have access to the actual filesystem, we couldn’t fix the corrupt file directly. We opened a ticket with AWS and their technicians dug in. The optimal solution would have been to restore the corrupt file in the filesystem, but AWS was unable to do this in a timely manner. We eventually gave up on restoring the actual file and rebuilt the affected table in our database using our backup data. This solved the problem, but required some downtime.

At this point, we should have been out of the woods – but the downtime caused an unexpected side-effect…

Oh no, another problem!

FieldClock apps keep track of sync success rates. If sync fails too many times, the apps send additional data to help resolve the problem. The downtime caused all mobile clients to start sending excess data, which caused an unplanned stress test on our system. This resulted in admin site slowness and sync errors that many users experienced yesterday.
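
For the technically curious, here is a rough sketch of why a short outage can multiply load like this. It is not our actual client code: the threshold, the payload sizes, and the names (`SyncClient`, `FAILURE_THRESHOLD`, and so on) are made up for illustration, and it counts consecutive failures as a simplified stand-in for the success-rate tracking the apps really do.

```python
from dataclasses import dataclass

# All values below are hypothetical, chosen only to illustrate the effect.
FAILURE_THRESHOLD = 3        # failed syncs before extra data is attached
NORMAL_PAYLOAD_KB = 5        # size of a routine sync request
DIAGNOSTIC_PAYLOAD_KB = 200  # size of the extra troubleshooting data


@dataclass
class SyncClient:
    """A simplified mobile client that tracks its recent sync failures."""
    consecutive_failures: int = 0

    def next_payload_kb(self) -> int:
        """Return the size of the next sync request for this client."""
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            # Too many failures: attach extra data to help diagnose the problem.
            return NORMAL_PAYLOAD_KB + DIAGNOSTIC_PAYLOAD_KB
        return NORMAL_PAYLOAD_KB

    def record_result(self, success: bool) -> None:
        """A success resets the counter; a failure increments it."""
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


def total_load_kb(clients: list[SyncClient]) -> int:
    """Total data the server receives if every client syncs once."""
    return sum(client.next_payload_kb() for client in clients)


clients = [SyncClient() for _ in range(1000)]
load_before = total_load_kb(clients)

# Simulate downtime: every client fails enough syncs to cross the threshold.
for _ in range(FAILURE_THRESHOLD):
    for client in clients:
        client.record_result(success=False)

load_after = total_load_kb(clients)
print(f"before outage: {load_before} KB, after outage: {load_after} KB")
```

With these made-up numbers, every client crosses the threshold during the same outage, so the first round of syncs after the server comes back is far heavier than normal. A single client having trouble is harmless; all of them at once is a stress test.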

Ultimately we were able to resolve the “stress test” by optimizing some code as well as tweaking some server configurations to better handle the syncing clients. The final tweaks took place yesterday evening and the system has been healthy since.

Whatever doesn’t kill us…

In the process of solving the problems above, we found several places to optimize our existing infrastructure and code as well as some points of failure that need better automated monitoring and alerts for our team. The end result is that the FieldClock system is stronger now than it was before the incident.

In Short

We realize this is more technical information than most of our users care about, but we want to be totally transparent about what goes on “behind the scenes”. We deeply appreciate you, our users, and your trust in our system. It is important to know that there was never a threat of data loss at any point. The worst-case scenario is what we went through, which was slower-than-usual Admin Site performance and an increased rate of sync errors in the mobile apps.

‌~josh

Posted Jun 25, 2020 - 09:45 PDT

Resolved

The system has been healthy since our final tweaks yesterday. This incident is now resolved. Thanks to all our users for their patience while we resolved it!

We will write a report shortly with the technical details for those who are interested.

Posted Jun 25, 2020 - 08:44 PDT

Update

Our overnight updates made the general system healthier, but we're still working through a backlog of large sync payloads that devices are sending. We're working on this situation and will have it resolved ASAP.

Posted Jun 24, 2020 - 15:37 PDT

Monitoring

We've undertaken some maintenance steps on our database hosts that should improve our overall performance. We'll be keeping an eye on the system to verify whether this resolves the intermittent slowness that people have been experiencing.

Posted Jun 23, 2020 - 23:54 PDT

Update

We are continuing to work on a fix for this issue.

Posted Jun 23, 2020 - 20:02 PDT

Update

We believe we have a solution to the root problem and will be implementing a fix tonight after end-of-day for our US West Coast users. The API will be offline starting shortly after 8pm PDT. The maintenance is expected to take 1-2 hours.

Posted Jun 23, 2020 - 11:49 PDT

Identified

We've identified an issue with our database host that is preventing a maintenance task from running correctly. The task repeatedly starting and stopping is what is causing the intermittent slowness visible on the admin site (and likely occasional sync errors). AWS's technicians are working with us to resolve this and we will have it sorted out ASAP.

We're making some tweaks to mitigate the slowness when it does occur. If you do happen to experience a long page load or job finalization, please try again after a minute or so and all should work smoothly.

Thanks for your patience as we fix this!

Posted Jun 23, 2020 - 08:23 PDT

Investigating

The Admin site is experiencing longer-than-usual load times. We are looking into the issue and expect to have it resolved shortly.

Posted Jun 22, 2020 - 08:06 PDT

This incident affected: Admin Site and API.