As many of you know, we had an issue today where the admin site would have trouble displaying jobs lists and other views. It took us a few hours to figure out and solve the root cause, so the right thing to do is let you know what was going on behind the scenes.
Despite going out of our way to make FieldClock appear simple, there’s a lot going on under the surface. Employees are being clocked in and out, pieces are being recorded, equipment and quality notes are being logged – it’s a lot of data! To keep the admin site and mobile apps as snappy as possible, we do a lot of calculations before you even load a page. For example, the Jobs List shows how many employees are on the clock and how many have participated at each job, but we don’t calculate those numbers at the moment you load the page. Automated systems keep track of active jobs and recalculate those numbers regularly, so we can show them to you immediately when you load the Jobs List.
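For the curious, the idea can be sketched in a few lines. This is a simplified illustration, not our actual code – all the names (`Job`, `recalculate_job_stats`, `job_stats_cache`) are hypothetical:

```python
# Sketch of the precomputation idea: job stats are recalculated ahead of
# time by a background worker and cached, so loading the Jobs List is
# just a cheap lookup. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Job:
    id: int
    clocked_in: set = field(default_factory=set)    # employees currently on the clock
    participants: set = field(default_factory=set)  # employees who have worked this job

# Cache of precomputed numbers, keyed by job id.
job_stats_cache: dict = {}

def recalculate_job_stats(job: Job) -> None:
    """Run on a schedule by a background worker, not at page load."""
    job_stats_cache[job.id] = {
        "on_the_clock": len(job.clocked_in),
        "participated": len(job.participants),
    }

def jobs_list_view(jobs: list) -> list:
    """Rendering the Jobs List only reads the cache -- no heavy work."""
    return [{"job": j.id, **job_stats_cache[j.id]} for j in jobs]

job = Job(id=1, clocked_in={10, 11}, participants={10, 11, 12})
recalculate_job_stats(job)
print(jobs_list_view([job]))  # [{'job': 1, 'on_the_clock': 2, 'participated': 3}]
```

The trade-off is that the numbers are only as fresh as the last recalculation – which is why that recalculation code matters so much, as you’ll see below.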
Over the weekend we updated the code that does those calculations, and the new code was a bit “eager”: it recalculated all active jobs since the beginning of time. Most companies finalize jobs regularly, which is good, but some accounts have very old jobs that are still active. This caused a flurry of jobs to get updated, which then snowballed when all of those job updates were sent to mobile devices.
In a normal hour, we transmit about one hour’s worth of activity – but this morning, as the US West Coast got to work, some clients’ devices synced months’ or even years’ worth of job data. This set off automated alarms, which woke up our developers and let them know there was a problem. We were able to identify the recalculation catalyst quickly, and we disabled it while we looked at performance.
Unfortunately, the initial flood of data caused one of our database “poolers” to stop handling connections properly. A “pooler” is a server that sits between our API app and our database and facilitates communications. The API talks to the pooler, which talks to the database… except for the one pooler that got overwhelmed and just stopped talking. This didn’t pose any threat of data loss or corruption, but it meant that a small percentage of database queries (i.e. the queries routed through this sick pooler) would fail. Since rendering a view on the Admin Site requires many database queries, the problem was most evident there: pages would be slow to render or only render partially. A side effect of this pooler’s condition was that it stopped sending the appropriate error messages to our monitoring system, so we didn’t have obvious alarm bells ringing to show us where the specific problem was.
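To picture why only *some* queries failed, here’s a toy sketch of the routing described above. It’s purely illustrative – the pooler names and classes are made up, and real poolers (like PgBouncer) are far more sophisticated:

```python
# Toy model of the setup above: the API never talks to the database
# directly; each query is routed through one of several poolers. If one
# pooler silently stops responding, only the fraction of queries routed
# through it fail. All names here are hypothetical.
import random

class Pooler:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def query(self, sql: str) -> str:
        if not self.healthy:
            # The sick pooler: accepts the query but never answers.
            raise TimeoutError(f"{self.name} stopped talking")
        return f"rows for: {sql}"

# Three poolers in the fleet; one has silently gone bad.
fleet = [Pooler("pooler-a"), Pooler("pooler-b"), Pooler("pooler-c", healthy=False)]

def run_query(sql: str) -> str:
    # Queries are spread across the fleet, so roughly a third of them
    # would hit the sick pooler and fail.
    return random.choice(fleet).query(sql)

# The fix: identify the problem child and remove it from the fleet,
# after which every query path works again.
fleet[:] = [p for p in fleet if p.healthy]
```

A page that issues dozens of queries only needs one of them to land on the bad pooler to render incompletely – which matches the partially rendered admin pages we saw.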
Our devs worked non-stop to sort out the issue. The first focus was on app code and the database itself to make sure we didn’t have a problem there. Once we established that the problem was in our networking stack, we were quickly able to identify the problem child and remove it from the fleet. After that, all of our status lights turned green and remained that way. A short while later, we re-enabled our (now fixed) recalculation code so that job numbers would update properly. Since then, our error logs have been completely empty and we’re able to get back to our regular work of making FieldClock even better. :)
As is usually the case with these scenarios, having all eyes on the problem led us to spot other little things and improve the system as a whole. While investigating the site reliability stutters, we rolled out several changes that make syncing more efficient and will lead to a faster, more robust system.
I’ll wrap this up with a hearty THANK YOU to our customers. We know these events, however rare, are annoying when they happen. FieldClock is a team effort and our customers are part of the team – we couldn’t do it without the feedback and support of every member of the team.
~josh