Elevated API error rate
Incident Report for FieldClock
Postmortem

This incident was detected and fixed quickly, but it did cause a brief API outage, so it is worthy of some additional information. We appreciate that our users put their trust in us, and we strive to be fully transparent so we can keep earning that trust.

Background

Our API service is hosted on many virtual “instances” on AWS. Over the course of each day, instances are put into service or taken out of service based on various needs (such as increased or decreased demand, or deploying new code). When each instance starts up, the first thing it does is check our dependencies (i.e. other software packages we use) for any security updates and apply them if found.

The Incident

A certain amount of time is allotted for the security updates as well as any other startup procedures. In the day before the incident, an update for one of our dependencies was published. The new update took much longer than normal to install, and the combined startup time exceeded the allotted time. As a result, new API instances failed to launch.
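
To illustrate the failure mode described above, here is a minimal Python sketch of a startup time budget. The step names, the durations, and the 300-second limit are hypothetical values chosen for illustration; they are not FieldClock's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class StartupStep:
    name: str
    seconds: float  # hypothetical duration of this step


def fits_startup_budget(steps, allotted_seconds):
    """Return (ok, total), where ok is True if the combined time fits the budget."""
    total = sum(step.seconds for step in steps)
    return total <= allotted_seconds, total


# Hypothetical numbers, for illustration only.
ALLOTTED = 300
normal = [
    StartupStep("security updates", 60),
    StartupStep("other startup procedures", 120),
]
slow = [
    StartupStep("security updates", 400),  # an unusually slow update
    StartupStep("other startup procedures", 120),
]

print(fits_startup_budget(normal, ALLOTTED))  # (True, 180)
print(fits_startup_budget(slow, ALLOTTED))    # (False, 520)
```

In this sketch, one slow step is enough to push the combined time past the limit, even though every other step is unchanged.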

Automated alarms went off as our API fleet was dwindling, but we weren’t able to identify the root cause before the impact was noticeable to our users. As soon as we identified the timeout problem, we allotted more time and the API fleet came back to “green” within minutes.
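
The dwindling fleet can be pictured with a small simulation. This is only a sketch under assumed numbers: the fleet size, the number of instances retired per cycle, and the function name `simulate_fleet` are hypothetical and do not reflect real FieldClock data.

```python
def simulate_fleet(initial, cycles, retired_per_cycle, launch_succeeds):
    """Return the fleet size after each cycle of routine instance replacement.

    Each cycle retires some instances. Replacements only join the fleet
    when launch_succeeds is True (i.e. startup finishes within the limit).
    """
    size = initial
    history = []
    for _ in range(cycles):
        retired = min(retired_per_cycle, size)
        size -= retired
        if launch_succeeds:
            size += retired  # replacements come up and rejoin the fleet
        history.append(size)
    return history


# Hypothetical fleet of 10 instances, 2 retired per cycle.
print(simulate_fleet(10, 4, 2, launch_succeeds=True))   # [10, 10, 10, 10]
print(simulate_fleet(10, 4, 2, launch_succeeds=False))  # [8, 6, 4, 2]
```

When launches fail, routine retirements are never offset by replacements, so the fleet shrinks steadily until the startup limit is raised.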

Next Steps

We are proud of the uptime and stability of the FieldClock platform. Though this outage was resolved quickly, we take it very seriously and will be reviewing our procedures to avoid a similar incident in the future.

As always, we thank you for your patience during this incident! If you have any questions, please reach out to Support and we’ll be happy to provide any additional information we can.

Thanks,
~The FieldClock Team

Posted Jul 14, 2021 - 12:19 PDT

Resolved
This incident has been resolved.
Posted Jul 14, 2021 - 11:57 PDT
Monitoring
The configuration error has been resolved and API servers are now functioning properly. We're keeping an eye on the situation and implementing a permanent fix for the root cause.
Posted Jul 14, 2021 - 11:19 PDT
Identified
We think we've identified the issue. API instances are failing to launch properly due to a mismatched configuration. We're working on a solution now.
Posted Jul 14, 2021 - 11:10 PDT
Investigating
We are aware that the API is responding with "unavailable" errors to a high percentage of requests right now. We are looking into the issue and will have it resolved ASAP.

Mobile apps will continue to work fine in offline mode but may experience sync errors.
Posted Jul 14, 2021 - 10:53 PDT
This incident affected: Admin Site, Employee Portal, and API.