On Friday evening our Admin Site and API became unreliable for several hours. They were never fully down, but they were never fully up either: they would work for several minutes at a time, and then show error messages for several minutes at a time.
Note: If you’re new to FieldClock, this might be the first “retrospective” you’ve read. We understand that many of our customers are not computer programmers and do not have full-time IT staff, so we try to convey the technical details in an approachable way. If you have any questions, please reach out to our service team and we’ll be happy to clarify. You’re a partner with us in this venture, and it is paramount to be transparent when incidents happen.
As part of FieldClock’s robust data storage system, our production database is replicated across multiple datacenters. Each change you make on the admin site, or sync from the mobile apps, is written to our primary database and then copied to replicas in other datacenters. This keeps your data very safe because there are always at least three copies of it in three different secure locations.
This distributed approach also offers performance benefits because we can spread database traffic across multiple instances. (If all requests went to the primary database, it would be very slow, so a “read only” request should be handled by a replica whenever possible.) Unfortunately, the options available in our early years with AWS always left us with replicas that we couldn’t use for “read only” requests. Last year, AWS launched a new database service that would let us keep the same layout we had while making every replica readable, which would be a win for system performance.
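For the technically curious, here is a rough sketch of what “read/write splitting” looks like. This is illustrative pseudologic, not our actual code; the connection objects and the simple SELECT check are stand-ins for what a real database driver does.

```python
# Illustrative sketch of read/write splitting (not FieldClock's actual code).
# `primary` and `replicas` stand in for real database connections.
import random


class RoutingConnection:
    """Send writes to the primary database; spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql, params=()):
        if self._is_read_only(sql):
            # Any readable replica can answer a read, spreading out the load.
            conn = random.choice(self.replicas) if self.replicas else self.primary
        else:
            # Writes must go to the primary so the change can be copied out.
            conn = self.primary
        return conn.execute(sql, params)

    @staticmethod
    def _is_read_only(sql):
        # Simplified check: real drivers classify statements more carefully.
        return sql.lstrip().upper().startswith("SELECT")
```

The payoff is in the first branch: the more replicas that are actually readable, the fewer requests land on the primary.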
Last weekend we had a scheduled maintenance event where we took FieldClock API offline and migrated our database to this new system (and performed some other TLC that was needed). This should have been a like-for-like swap (like Indiana Jones swiping the golden idol from the pedestal), but it didn’t work out that smoothly (also like Indiana Jones).
We started noticing performance problems Monday morning, and they worsened as the week went on. Most of our customers start their workweek on Sunday or Monday, so the background calculations our system runs to keep every screen rendering quickly with accurate information grow throughout the week. That workload is far heavier on Friday than on Monday, and by Friday we were seeing serious performance degradation.
On the networking side, our “load balancers” spread traffic across many servers. This process is automated, and the load balancers launch new servers or retire unhealthy ones according to rules we’ve established. As performance worsened on Friday, our load balancers were launching more servers to handle the load, but they were also aggressively retiring servers that looked “unhealthy”. Because the slow database made otherwise-healthy servers respond slowly to health checks, servers were being retired almost as fast as they were launched. This resulted in the incredibly inconsistent site performance where the Admin Site would be fast and snappy one moment, and then unresponsive the next.
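To make the “looked unhealthy” part concrete, here is a toy version of a health-check rule. The threshold and return values are made up for illustration; they are not our actual load balancer configuration.

```python
# Toy health-check rule (illustrative only, not our real configuration).
# A server is retired after too many consecutive failed checks.
from collections import defaultdict

UNHEALTHY_THRESHOLD = 3  # consecutive failures before a server is retired
_failures = defaultdict(int)


def record_check(server, responded_in_time):
    """Track consecutive failed checks and decide the server's fate."""
    if responded_in_time:
        _failures[server] = 0
        return "healthy"
    _failures[server] += 1
    if _failures[server] >= UNHEALTHY_THRESHOLD:
        # A slow database can push healthy servers over this line,
        # causing them to be retired even though they aren't the problem.
        return "retire"
    return "suspect"
```

When the database is slow, every server answers its checks slowly, so they all drift toward “retire” together, which is exactly the flapping behavior we saw.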
We eventually tracked the root cause of the performance difference to an undocumented change in default parameters between our old database structure and our new one. Despite both systems running the same database version on the same server size, there were nuanced differences in their configurations that drastically reduced performance. We rolled out changes that fixed the immediate issue, and we will continue tuning until everything is optimized.
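The fix going forward is mechanical: compare every parameter between the old and new configurations instead of trusting that defaults match. A minimal sketch of that kind of check, with hypothetical parameter names standing in for real database settings:

```python
# Sketch of a config-diff check (parameter names are hypothetical examples).
def diff_parameters(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


old_cfg = {"max_connections": 5000, "work_mem": "64MB"}
new_cfg = {"max_connections": 5000, "work_mem": "4MB"}  # quietly different default
```

Running `diff_parameters(old_cfg, new_cfg)` surfaces the quiet difference instead of letting it hide until Friday’s peak load.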
We’re always about self-improvement, and we’ve taken a number of lessons from this incident: our load balancer rules have room for improvement, we now have a checklist of parameters to verify even when official documentation says things should be the same, and we found opportunities to optimize our pay-calculation performance.
Despite the bumpy migration, this database change will be a win for FieldClock customers as we can provide better performance without increasing our hosting bills.
I’m proud to say that since 2019 we have provided 99.93% uptime (including scheduled maintenance events and Friday’s incident). We’ll never be perfect, but that won’t stop us from trying. Our confidence in our stability is why we offer guaranteed uptime as part of our commitment to members of our Loyalty Program. If you haven’t looked into this, please reach out to our customer service team for more information.
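If you’re curious what an uptime percentage means in practice, the arithmetic is simple. The 99.93% figure is from this post; the per-year breakdown below is just illustration.

```python
# What an uptime percentage allows in downtime over one year (illustration).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(uptime_percent, hours=HOURS_PER_YEAR):
    """Hours of downtime consistent with the given uptime percentage."""
    return hours * (1 - uptime_percent / 100)
```

At 99.93% uptime, that works out to roughly six hours of downtime per year, including scheduled maintenance.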