On Friday evening our Admin Site and API became unreliable for several hours. They were never fully down, but they were never fully up either: they would work for several minutes at a time, and then show error messages for several minutes at a time.
Note: If you’re new to FieldClock, this might be the first “retrospective” you’ve read. We understand that many of our customers are not computer programmers and do not have full-time IT staff, so we try to convey the technical details in an approachable way. If you have any questions, please reach out to our service team and we’ll be happy to clarify. You’re a partner with us in this venture, and it is paramount to be transparent when incidents happen.
As part of FieldClock’s robust data storage system, our production database is replicated across multiple datacenters. Each change you make on the admin site, or sync from the mobile apps, is written to our primary database and then copied to replicas in other datacenters. This keeps your data very safe because there are always at least three copies of it in three different secure locations.
This distributed approach also offers performance benefits because we can spread database traffic across multiple instances. (If all requests went to the primary database, it would be very slow, so a “read only” request should be handled by a replica whenever possible.) Unfortunately, the options available in our early years with AWS always left us with replicas that we couldn’t use for “read only” requests. Last year, AWS launched a new database service that would let us keep the same layout we had while making every replica readable, which would be a win for system performance.
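For the technically curious, here is a rough sketch of what “read/write splitting” looks like. This is illustrative pseudologic, not our actual code; the connection objects and the simple SELECT check are stand-ins for what a real database driver does.

```python
# Illustrative sketch of read/write splitting (not FieldClock's actual code).
# `primary` and `replicas` stand in for real database connections.
import random


class RoutingConnection:
    """Send writes to the primary database; spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql, params=()):
        if self._is_read_only(sql):
            # Any readable replica can answer a read, spreading out the load.
            conn = random.choice(self.replicas) if self.replicas else self.primary
        else:
            # Writes must go to the primary so the change can be copied out.
            conn = self.primary
        return conn.execute(sql, params)

    @staticmethod
    def _is_read_only(sql):
        # Simplified check: real drivers classify statements more carefully.
        return sql.lstrip().upper().startswith("SELECT")
```

The payoff is in the first branch: the more replicas that are actually readable, the fewer requests land on the primary.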
Last weekend we had a scheduled maintenance event where we took FieldClock API offline and migrated our database to this new system (and performed some other TLC that was needed). This should have been a like-for-like swap (like Indiana Jones swiping the golden idol from the pedestal), but it didn’t work out that smoothly (also like Indiana Jones).
We started noticing performance problems Monday morning, and they worsened as the week went on. Most of our customers start their workweek on Sunday or Monday, so the background calculations our system runs to keep every screen rendering quickly with accurate information grow throughout the week. That workload is far heavier on Friday than on Monday, and by Friday we were seeing serious performance degradation.
On the networking side, our “load balancers” spread traffic across many servers. This process is automated, and the load balancers launch new servers or retire unhealthy ones according to rules we’ve established. As performance worsened on Friday, our load balancers were launching more servers to handle the load, but they were also aggressively retiring servers that looked “unhealthy”. Because the slow database made otherwise-healthy servers respond slowly to health checks, servers were being retired almost as fast as they were launched. This resulted in the incredibly inconsistent site performance where the Admin Site would be fast and snappy one moment, and then unresponsive the next.
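To make the “looked unhealthy” part concrete, here is a toy version of a health-check rule. The threshold and return values are made up for illustration; they are not our actual load balancer configuration.

```python
# Toy health-check rule (illustrative only, not our real configuration).
# A server is retired after too many consecutive failed checks.
from collections import defaultdict

UNHEALTHY_THRESHOLD = 3  # consecutive failures before a server is retired
_failures = defaultdict(int)


def record_check(server, responded_in_time):
    """Track consecutive failed checks and decide the server's fate."""
    if responded_in_time:
        _failures[server] = 0
        return "healthy"
    _failures[server] += 1
    if _failures[server] >= UNHEALTHY_THRESHOLD:
        # A slow database can push healthy servers over this line,
        # causing them to be retired even though they aren't the problem.
        return "retire"
    return "suspect"
```

When the database is slow, every server answers its checks slowly, so they all drift toward “retire” together, which is exactly the flapping behavior we saw.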
We eventually tracked the root cause of the performance difference to an undocumented change in default parameters between our old database structure and our new one. Despite both systems running the same database version on the same server size, there were nuanced differences in their configurations that drastically reduced performance. We rolled out changes that fixed the immediate issue, and we will continue tuning until everything is optimized.
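The fix going forward is mechanical: compare every parameter between the old and new configurations instead of trusting that defaults match. A minimal sketch of that kind of check, with hypothetical parameter names standing in for real database settings:

```python
# Sketch of a config-diff check (parameter names are hypothetical examples).
def diff_parameters(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


old_cfg = {"max_connections": 5000, "work_mem": "64MB"}
new_cfg = {"max_connections": 5000, "work_mem": "4MB"}  # quietly different default
```

Running `diff_parameters(old_cfg, new_cfg)` surfaces the quiet difference instead of letting it hide until Friday’s peak load.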
We’re always about self-improvement, and we’ve taken a number of lessons from this incident: our load balancer rules have room for improvement, we now have a checklist of parameters to verify even when official documentation says things should be the same, and we found opportunities to optimize our pay-calculation performance.
Despite the bumpy migration, this database change will be a win for FieldClock customers as we can provide better performance without increasing our hosting bills.
I’m proud to say that since 2019 we have provided 99.93% uptime (including scheduled maintenance events and Friday’s incident). We’ll never be perfect, but that won’t stop us from trying. Our confidence in our stability is why we offer guaranteed uptime as part of our commitment to members of our Loyalty Program. If you haven’t looked into this, please reach out to our customer service team for more information.
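If you’re curious what an uptime percentage means in practice, the arithmetic is simple. The 99.93% figure is from this post; the per-year breakdown below is just illustration.

```python
# What an uptime percentage allows in downtime over one year (illustration).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(uptime_percent, hours=HOURS_PER_YEAR):
    """Hours of downtime consistent with the given uptime percentage."""
    return hours * (1 - uptime_percent / 100)
```

At 99.93% uptime, that works out to roughly six hours of downtime per year, including scheduled maintenance.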