Yesterday our API was knocked offline for about three hours. The root cause turned out to be in the datacenter itself (not a bug in our code), but we are treating the incident as an opportunity to improve our processes and mitigate the effects of similar incidents in the future.
FieldClock is hosted on AWS. AWS provides services in various Regions, and each Region has multiple Availability Zones (AZs) within it. Our HQ is in central Washington, and our infrastructure is largely hosted in the US West (Oregon) AWS Region, spread across multiple AZs within that Region.
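As a rough sketch of that idea (server and AZ names here are illustrative, not our actual configuration), spreading servers evenly across AZs can be as simple as round-robin assignment, so losing any one AZ only takes down a fraction of capacity:

```python
# Illustrative sketch: distribute servers round-robin across Availability
# Zones so that no single AZ holds all of them.
from itertools import cycle

def spread_across_azs(servers, azs):
    """Assign each server to an AZ in round-robin order; returns {server: az}."""
    return dict(zip(servers, cycle(azs)))
```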
The goal of this design is to make us resilient in the event AWS has problems with a specific AZ. AWS is generally very reliable, but we always plan for the worst-case scenario so that we can provide the highest level of service for our users.
At 11:25am (PDT), the US West (Oregon) AWS Region started experiencing networking issues that affected both internet connectivity and load balancers. (A “load balancer” is a network device that decides which server handles each inbound request.) This meant that when users loaded the admin site or synced the mobile apps, their requests couldn’t reach our servers.
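For readers curious what that means in practice, here is a toy sketch of the round-robin idea behind a load balancer (AWS load balancers are managed services, so this is purely illustrative, not how our traffic is actually handled):

```python
# Toy load balancer: hands each inbound request to the next server in turn.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)

    def route(self, request):
        # In real life the balancer would forward `request` to this server;
        # here we just return which server was chosen.
        return next(self._servers)
```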
At 12:10pm, AWS identified a specific AZ that was affected and advised that we route around it. We took down our servers in the affected zone, but found that we were unable to launch new servers to compensate.
After investigating further, we found the problem preventing server launches. We use an AWS service called “Secrets Manager” to manage secrets (seems straightforward, right?). Secrets Manager allows us to securely store passwords, keys, and other data that needs to be protected but also needs to be used by our apps. Whenever a new server launches, it fetches the secret info it needs from this service. Although Secrets Manager is not tied to any specific AZ, it was in fact affected by the networking problems in our region and was not returning the secrets our servers needed. We made some changes to mitigate this, and servers started launching consistently.
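The fetch-secrets-at-launch pattern can be sketched like this (an illustrative sketch, not our production code; in practice `fetch` would wrap an AWS SDK call such as boto3’s `get_secret_value`, and the retry counts and delays are assumptions):

```python
# Illustrative sketch: retry a secret fetch with exponential backoff so a
# brief network blip at boot doesn't abort the whole server launch.
import time

def fetch_secret_with_retry(fetch, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the launch fail loudly
            time.sleep(base_delay * (2 ** attempt))
```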
As of about 2:30pm, our service was back up, but we were still seeing intermittent errors. We tracked these down to some database calls that were still routed through the unhealthy AZ; once we rerouted those calls to avoid it, the errors stopped.
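That fix amounts to AZ-aware routing for database calls. A minimal sketch of the idea, with hypothetical AZ and hostname values (not our actual endpoints):

```python
# Illustrative sketch: choose a database endpoint outside any unhealthy AZ.
def pick_database_endpoint(endpoints_by_az, unhealthy_azs):
    """`endpoints_by_az` maps AZ name -> endpoint hostname (names hypothetical).
    Returns the first endpoint whose AZ is healthy."""
    for az, endpoint in endpoints_by_az.items():
        if az not in unhealthy_azs:
            return endpoint
    raise RuntimeError("no healthy database endpoint available")
```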
Our infrastructure was designed to mitigate the effects of an event like this one. We have servers in several zones, our databases are hosted in multiple zones, and our backups are spread across regions. That planning allowed us to be back online several hours before AWS finished correcting its networking issues. Our response time was good, but we always want to be better.
During the incident we identified ways to further spread out our internal dependencies so that we can “turn off” a malfunctioning AZ. We’re implementing those changes immediately so that, if an identical event were to happen today, we could be up and running even faster than before.
While 100% uptime is not realistic for any service, we try very hard to get as close as we can to that number. We know that our users have time-critical workflows and it’s important that our service is available. We thank you for choosing FieldClock and for your patience when the unavoidable occurs.