This particular incident was detected and fixed quickly, but it did cause a brief API outage, so it warrants some additional detail. We appreciate that our users put their trust in us, and we strive to be fully transparent so we can keep earning that trust.
Our API service is hosted on many virtual “instances” on AWS. Over the course of each day, instances are put into service or taken out of service as needs change (for example, increased or decreased demand, or deploying new code). When an instance starts up, the first thing it does is check our dependencies (i.e. the other software packages we use) for security updates and apply any it finds.
A fixed amount of time is allotted for these security updates along with all other startup procedures. In the past day, an update for one of our dependencies was published. The new update took much longer than normal to install, and the combined startup time exceeded the allotted limit. As a result, new API instances failed to launch.
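For readers curious about the mechanics, the failure mode above can be sketched as a startup script running its update step under a hard time budget. This is a hypothetical illustration, not our actual tooling; the 300-second budget and the command names are made up for the example.

```shell
#!/bin/sh
# Hypothetical sketch of an instance startup script with a hard time budget.
# STARTUP_TIMEOUT and UPDATE_CMD are illustrative, not FieldClock's real config.
STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-300}"   # seconds allotted for startup work

# Stand-in for "check dependencies for security updates and apply them";
# in production this would be a package-manager invocation.
UPDATE_CMD="sleep 1"

# `timeout` kills the command and exits with status 124 once the budget is
# exceeded -- so a single unusually slow update fails the whole launch.
if timeout "$STARTUP_TIMEOUT" sh -c "$UPDATE_CMD"; then
    echo "startup updates finished within budget"
else
    echo "startup exceeded ${STARTUP_TIMEOUT}s budget; instance fails to launch" >&2
    exit 1
fi
```

The quick fix described below maps to raising `STARTUP_TIMEOUT`; a longer-term mitigation might decouple the update step from the launch health check entirely.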
Automated alarms went off as our API fleet dwindled, but we weren’t able to identify the root cause before the impact became noticeable to our users. As soon as we identified the timeout problem, we allotted more time for startup and the API fleet returned to “green” within minutes.
We are proud of the uptime and stability of the FieldClock platform. Though this outage was resolved quickly, we take it very seriously and will be reviewing our procedures to avoid a similar incident in the future.
As always, we thank you for your patience during this incident! If you have any questions, please reach out to Support and we’ll be happy to provide any additional information we can.
Thanks,
~The FieldClock Team