At approximately 6:30 AM PST, our load balancers went offline. We have server monitoring tools that check our public IPs from a remote location every minute, so we were aware of the issue immediately, but it took a while to figure out what was wrong and get them back online. Total downtime was about 90 minutes.
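The remote check behind that monitoring is conceptually simple. Here's a minimal sketch of that kind of once-a-minute probe – a plain TCP connect against a public IP. The function name, port, and timeout are illustrative, not our actual tooling:

```python
import socket

def is_reachable(host, port=80, timeout=2.0):
    """Attempt a TCP connection to a public IP from a remote location.

    Returns True if something answers within the timeout, False otherwise.
    (Illustrative sketch only; the real monitoring tooling is not shown.)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False
```

A scheduler would call this every minute for each monitored IP and alert on the first False result.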
At first we thought it was a network problem, as happens from time to time, but quickly noticed that a few public IPs we have that are not behind load balancers were fine. Ok, so it’s the load balancers then? We have a double load balancer setup with automatic failover, so if one dies, the other takes over. We’ve tested this and it works great. But the chances of both dying at the same time? Impossible. But we couldn’t access their public IPs. That was mildly depressing.
But then we remembered they also have internal “management” IPs for admins to log in to. We tunneled into the network and were able to log in to both load balancers, but after looking around a bit, there was nothing obviously wrong. They were online, they could see the internet, they just weren’t passing traffic through to the servers behind them. After a bit of banging head upon keyboard, we checked the load balancer logs and saw a message we’d never seen before, repeating over and over every minute:
Apr 19 07:09:11 : Machine rebooting too often – Passifying
Passifying? Ok, well at least now we could see that the load balancers had been rebooting themselves for some reason. They weren’t rebooting anymore – the message was simply being output every minute so that an admin looking at the log files couldn’t miss it – so we did a manual reboot on one to see what would happen.
Upon reboot, they would pass traffic through for about 20-30 seconds, then go boom. We managed to check the load stats before they went offline, and saw they were at 100%. Well that’s not good.
So the problem was that our load balancers were extremely overloaded. We just added pinging to our tracking code, which means visitors to your web sites talk to our tracking servers a lot more often than they used to. We knew this would happen, of course, so we activated the new tracking code last Tuesday for all sites – well before the actual release (Friday night) – so we could see the effect it had on our tracking servers. The load went up significantly, on both the load balancers and the tracking servers, but not enough to be concerned about. It was still at a very manageable level.
But at that point, our tracking servers were not actually logging the pings – they were receiving them, but discarding them. That makes a difference. We made our full release on Friday night, so it ran over the weekend and we watched it very closely. Things seemed great. Of course, weekend traffic is quite a bit lower than during the week. Monday is the biggest traffic day of the week for most sites, so it’s the biggest day of the week for our service too.
So Monday morning rolls around, and once most of the US was awake, the spike climbed quite a bit higher than it did over the weekend. The load balancers started getting timeouts when talking to the tracking servers, which means they were keeping connections open a lot longer than they normally would. That kind of thing spirals out of control quickly – hence the problem this morning.
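Why slow backends take a load balancer down is just queueing arithmetic (Little’s law): the number of connections held open at once is roughly the arrival rate times how long each connection stays open. A toy sketch, with made-up rates rather than our real traffic numbers:

```python
# Little's law sketch: open connections ~= arrival rate x hold time per connection.
# The rates below are illustrative, not our actual traffic figures.
def open_connections(arrival_rate_per_s, hold_time_s):
    return arrival_rate_per_s * hold_time_s

# Healthy backend answering in ~50 ms: a modest number of concurrent connections.
normal = open_connections(1000, 0.05)    # 50 concurrent

# Backend timing out at, say, 10 s: same traffic, 200x the open connections.
degraded = open_connections(1000, 10.0)  # 10000 concurrent
```

Same request rate, but once responses slow from milliseconds to a timeout, the connection table balloons by orders of magnitude – which is exactly the spiral we hit.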
When our database servers process traffic every minute, they keep a log of what they just did – how many actions were processed, how many pings, how long the process took, etc. This is extremely useful data. So we took a peek at this and saw that pings were accounting for over 80% of the traffic being processed. If pings are 80% of traffic, the remaining 20% is our old baseline, so total traffic is five times what it used to be – in other words, the pinging functionality increased the hits to our tracking servers by 400%, overnight. That’s a little more than we were expecting. We were expecting maybe a 200% increase at the most.
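The 80%-to-400% conversion is worth spelling out, since the two numbers look unrelated at first glance:

```python
# Pings were over 80% of all processed traffic (from the database logs).
ping_share = 0.80
baseline_share = 1 - ping_share          # the old, non-ping traffic: 20%

# Total traffic relative to the old baseline: 1 / 0.20 = 5x,
# i.e. a 400% increase over what we handled before pinging.
increase_pct = round((1 / baseline_share - 1) * 100)
print(increase_pct)                      # -> 400
```

So "pings are 80% of traffic" and "traffic went up 400%" are the same fact stated two ways.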
In an effort to get things back online as quickly as possible, we changed the tracking code so that it sends half as many total pings and skips the initial 15 second quick-check ping. After uploading that to the CDN, we rebooted the load balancers again; they came online and have stayed online since, with a load of about 65%. Looking at the database logs now, the pings are more in line with our initial expectations – about a 200% increase over normal traffic.
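To make the mitigation concrete, here’s a sketch of what a configurable ping schedule might look like. The 15 second quick-check comes from the post; the interval and ping counts are hypothetical placeholder values, and `ping_schedule` is an invented helper, not our actual tracking code:

```python
def ping_schedule(base_interval_s=30, max_pings=20, quick_check=True):
    """Return the offsets (seconds after page load) at which the tracker pings.

    Hypothetical sketch: base_interval_s and max_pings are placeholder values.
    """
    times = []
    if quick_check:
        times.append(15)  # the initial 15-second quick-check ping
    times += [base_interval_s * i for i in range(1, max_pings + 1)]
    return times

# Normal config: quick-check plus the full run of pings.
normal = ping_schedule()

# Emergency config pushed to the CDN: half the pings, no quick-check.
reduced = ping_schedule(max_pings=10, quick_check=False)
```

The point of parameterizing the schedule is that a cut like this becomes a config change uploaded to the CDN rather than a code rewrite.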
We want to ping more often though, so we’ll be tweaking all sorts of things over the course of the day and probably throughout the week, to get the most pings possible with the least amount of load on our servers. We’ll post an update later to let you know how it’s going.
This is another one of those unfortunate cases that aren’t really testable until they’re live. We knew there would be a big spike on our end, and we thought we had it under control, but we just didn’t know quite how big it was going to be until the system was fully live and all of our sites were sending us Monday traffic. But for now, things are stable, and that’s what’s important.