Yesterday our API was knocked offline for about three hours. The root cause turned out to be in the datacenter itself (not a bug in our code), but we are treating the incident as an opportunity to improve our processes and mitigate the effects of similar incidents in the future.
FieldClock is hosted on AWS. AWS provides services in various Regions, and each Region has multiple Availability Zones (AZs) within it. Our HQ is in central Washington, and our infrastructure is largely hosted in the US West (Oregon) AWS Region, spread across multiple AZs within that Region.
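As a rough sketch of that idea (server and AZ names here are illustrative, not our actual configuration), spreading servers evenly across AZs can be as simple as round-robin assignment, so losing any one AZ only takes down a fraction of capacity:

```python
# Illustrative sketch: distribute servers round-robin across Availability
# Zones so that no single AZ holds all of them.
from itertools import cycle

def spread_across_azs(servers, azs):
    """Assign each server to an AZ in round-robin order; returns {server: az}."""
    return dict(zip(servers, cycle(azs)))
```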
The goal of this design is to make us resilient in the event AWS has problems with a specific AZ. AWS is generally very reliable, but we always plan for the worst-case scenario so that we can provide the highest level of service for our users.
At 11:25am (PDT), the US West (Oregon) AWS Region started experiencing networking issues that affected both internet connectivity and load balancers. (A “load balancer” is a network device that decides which server handles each inbound request.) This meant that when users loaded the admin site or synced the mobile apps, their requests couldn’t reach our servers.
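For readers curious what that means in practice, here is a toy sketch of the round-robin idea behind a load balancer (AWS load balancers are managed services, so this is purely illustrative, not how our traffic is actually handled):

```python
# Toy load balancer: hands each inbound request to the next server in turn.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)

    def route(self, request):
        # In real life the balancer would forward `request` to this server;
        # here we just return which server was chosen.
        return next(self._servers)
```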
At 12:10pm, AWS identified a specific AZ that was affected and advised that we route around it. We took down our servers in the affected zone, but found that we were unable to launch new servers to compensate.
After investigating further, we found the problem preventing server launches. We use an AWS service called “Secrets Manager” to manage secrets (seems straightforward, right?). Secrets Manager allows us to securely store passwords, keys, and other data that needs to be protected but also needs to be used by our apps. Whenever a new server launches, it fetches the secret info it needs from this service. Although Secrets Manager is not tied to any specific AZ, it was in fact affected by the networking problems in our region and was not returning the secrets our servers needed. We made some changes to mitigate this, and servers started launching consistently.
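The fetch-secrets-at-launch pattern can be sketched like this (an illustrative sketch, not our production code; in practice `fetch` would wrap an AWS SDK call such as boto3’s `get_secret_value`, and the retry counts and delays are assumptions):

```python
# Illustrative sketch: retry a secret fetch with exponential backoff so a
# brief network blip at boot doesn't abort the whole server launch.
import time

def fetch_secret_with_retry(fetch, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the launch fail loudly
            time.sleep(base_delay * (2 ** attempt))
```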
As of about 2:30pm, our service was back up, but we were still seeing intermittent errors. We tracked these down to some database calls that were still routed through the unhealthy AZ; once we rerouted those calls to avoid it, the errors stopped.
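That fix amounts to AZ-aware routing for database calls. A minimal sketch of the idea, with hypothetical AZ and hostname values (not our actual endpoints):

```python
# Illustrative sketch: choose a database endpoint outside any unhealthy AZ.
def pick_database_endpoint(endpoints_by_az, unhealthy_azs):
    """`endpoints_by_az` maps AZ name -> endpoint hostname (names hypothetical).
    Returns the first endpoint whose AZ is healthy."""
    for az, endpoint in endpoints_by_az.items():
        if az not in unhealthy_azs:
            return endpoint
    raise RuntimeError("no healthy database endpoint available")
```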
Our infrastructure was designed to mitigate the effects of an event like this one. We have servers in several zones, our databases are hosted in multiple zones, and our backups are spread across regions. That planning allowed us to be back online several hours before AWS finished correcting its networking issues. Our response time was good, but we always want to be better.
During the incident we identified ways to further spread out our internal dependencies so that we can “turn off” a malfunctioning AZ. We’re implementing those changes immediately so that, if an identical event were to happen today, we could be up and running even faster than before.
While 100% uptime is not realistic for any service, we try very hard to get as close as we can to that number. We know that our users have time-critical workflows and it’s important that our service is available. We thank you for choosing FieldClock and for your patience when the unavoidable occurs.