Around 12pm PST, Clicky was 100% offline for about 30 minutes. This means there is a bit of missing data from everyone’s stats today.
So what happened? Network activity was much higher than normal, enough to overload our load balancers and cause them to crash. Before that happened, we did notice the site seeming to get slower and slower over a few hours leading up to the crash, and load on our tracking servers (behind the load balancers) was almost twice what it normally is. We initially thought we may have been under attack because the network traffic was so much higher than normal, but this does not appear to be the case.
Someone on Twitter suggested it may be related to increased activity due to so-called “Cyber Monday”, a shopping holiday invented by the retail industry in 2005 to get you to spend more money. This seems plausible, but I would expect most of today’s extra browsing to be on major sites like Amazon, which are way too big for us to track, so we probably wouldn’t be affected by it.
A similar thing happened back in April when we first added “pinging” support to our tracking code. Pinging is what lets us determine your bounce rate and time-on-site values much more accurately than most services, because our tracking code continually talks to our servers while a visitor sits on one page. Back then, we rolled out this feature over the weekend, but when Monday rolled around, the extra spike in traffic from these pings was so high that the same thing happened – load balancers went boom. We were pinging too aggressively, it seemed. Some quick detective work showed that the pings were accounting for 80% of the traffic we were now logging – a 400% increase, in other words, basically overnight. This was quite a bit higher than we had anticipated, so we changed the code to roughly cut in half how many pings it sends – they now only account for about 50% of our total incoming “hits”.
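As a quick check on the arithmetic above, here is a minimal sketch of how a ping share of total hits translates into a traffic increase. The function name is made up for illustration, and it assumes that everything other than pings is ordinary pre-existing traffic.

```javascript
// Hypothetical helper: given the fraction of all hits that are pings,
// return the percentage increase over the ping-free baseline.
function increaseFromPingShare(pingShare) {
  // If pings are `pingShare` of the total, the original traffic is the
  // remaining (1 - pingShare), so total = original / (1 - pingShare).
  const multiplier = 1 / (1 - pingShare);
  return (multiplier - 1) * 100;
}

// Pings at 80% of all hits means total traffic is 5x the baseline.
console.log(Math.round(increaseFromPingShare(0.8))); // 400
```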
The point is, making this change saved our skin then, and today it has done the same thing. After we crashed, we quickly updated the tracking code to disable pinging and threw it on the CDN. Within minutes, the load balancers were happily chugging along again. This does mean for the time being, your bounce rate and time-on-site values will be back to non-awesome mode, but if it’s a choice between that and Clicky offline, we choose the former.
This change is only temporary though. We’re evaluating how to make it more efficient. One complaint people had when we first released the new pinging is that lots of people open tabs in the background and leave them there for a long time before actually viewing the page. So not only does this make your time-on-site values perhaps higher than they should be, it also leads to increased load on our end to track all these excess pings.
So right now, for our own stats, we’re testing a new version of the tracking code that only starts tracking a page view and hence starts pinging when one of the following events occurs:
- Mouse movement
- Key press
- Page scroll
- Page coming into focus (when a tab is loaded in the background then displayed later, this will fire)
We think monitoring for these events before logging a page view and starting the pinging will lead to much more accurate traffic data, and quite a bit less load on our end. There may be some odd cases where none of the aforementioned events occur on a “valid” page view, but we’d guess they account for less than 1% of your traffic. Of course, if anyone has feedback or ideas on other events we should be listening for, we’re all ears.
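The event-gated approach described above could be sketched roughly as follows. This is not our actual tracking code. `startOnFirstActivity` is a made-up name, the event names are our guess at the natural mapping for the four triggers listed, and the callback is a placeholder for "log the page view and start pinging".

```javascript
// Hypothetical helper: call `start` exactly once, on the first event from
// any of the given [target, eventType] pairs.
function startOnFirstActivity(sources, start) {
  let started = false;
  const handler = () => {
    if (started) return;
    started = true;
    // Stop listening everywhere so later events cost nothing.
    for (const [target, type] of sources) {
      target.removeEventListener(type, handler);
    }
    start();
  };
  for (const [target, type] of sources) {
    target.addEventListener(type, handler);
  }
}

// Browser wiring; skipped when there is no DOM (e.g. under Node).
if (typeof window !== "undefined" && typeof document !== "undefined") {
  startOnFirstActivity(
    [
      [document, "mousemove"], // mouse movement
      [document, "keydown"],   // key press
      [window, "scroll"],      // page scroll
      [window, "focus"],       // background tab brought into view
    ],
    () => {
      // Placeholder: log the page view and begin pinging here.
    }
  );
}
```

Taking the targets as an argument keeps the helper independent of the real `document` and `window`, which makes it easy to exercise with stand-in objects.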
Ideally we’d always log a page view immediately when it is loaded “in focus”, but JavaScript does not provide a way to check the current focus status, only to detect events when the focus changes. This is an extremely frustrating limitation but we must live with it.
Update
It turns out there is a new method with HTML5, document.hasFocus(), which is a way to determine on demand whether the current document is in focus. It is not yet supported in Opera, and Chrome always returns “true” even when it shouldn’t (bug filed), but Firefox 3+, IE7+, and Safari 5 (at least, haven’t tested in Safari 4) all support it properly. The test tracking code we are running on our blog and on getclicky.com has been updated to use this new method instead of relying on other events (mouse movement etc).
Relying on events led to about a 5% drop in visitors being tracked, so we knew this wouldn’t work. Somehow, we discovered the hasFocus() method, which has VERY little discussion online. Most people must not know about it yet. But anyways, this is what our new test code does:
- If the browser supports document.hasFocus()…
  - If the current document has focus at the time the tracking code is executed, a page view is logged immediately and pinging starts. (Since Chrome has a bug where this method always returns true even when it shouldn’t, Chrome users will always be logged immediately. All other modern browsers, except Opera, will work properly.)
  - If the current document does NOT have focus, we set up an event listener to wait for the “onfocus” event to fire, at which point the page view is logged and pinging starts. This will apply to anyone who opens a page on your site in a background tab or window: we won’t log their visit until they actually start viewing your page. This will help alleviate the extra load of “pings” on our end, and will give you more accurate usage data for your web site.
- If the browser does NOT support document.hasFocus()… (currently only Opera, and old, outdated browsers)
  - Page view is logged immediately
  - Pinging starts immediately
  - (This is the way it used to work by default anyways)
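The decision logic above can be sketched like this. It is an illustration only, not the code we are testing. `trackWhenViewed` and its return labels are hypothetical, `doc` and `win` stand in for `document` and `window`, and the sketch assumes `addEventListener` is available, which is not true of the older IE versions mentioned earlier.

```javascript
// Hypothetical helper: log now if the page is in focus (or focus cannot be
// checked), otherwise wait for the window's "focus" event.
function trackWhenViewed(doc, win, logPageViewAndStartPinging) {
  if (typeof doc.hasFocus !== "function") {
    // No hasFocus() support: behave the old way and log right away.
    logPageViewAndStartPinging();
    return "immediate-unsupported";
  }
  if (doc.hasFocus()) {
    // Page is being viewed at load time.
    logPageViewAndStartPinging();
    return "immediate-focused";
  }
  // Opened in a background tab or window: defer until it gains focus.
  const onFocus = () => {
    win.removeEventListener("focus", onFocus);
    logPageViewAndStartPinging();
  };
  win.addEventListener("focus", onFocus);
  return "deferred";
}
```

In a browser this would be called as `trackWhenViewed(document, window, callback)`. The return value exists only to make the chosen branch visible when experimenting.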
We think this way of doing things will work great, and if all goes well from our testing, we will deploy it to our CDN and it will become the new default tracking code.
By the way, pinging is still disabled in our public tracking code. Once this new code is tested and deployed, pinging will start again, and your bounce / time-on-site values will return to their normal state.