It’s been a rough week. I wanted to explain what has been happening recently with our CDN, and talk about all of the problems we’ve had with CDNs in general. If you can stomach a novel, you’ll discover the good news that it’s been resolved to the point where we don’t foresee any further issues.
In June, we decided to move away from our home brew CDN and get a real one, because we were outgrowing it and it was becoming a real pain to manage amongst other things.
The main requirement was that we needed support for HTTPS with our own domain name (static.getclicky.com). There are surprisingly few CDNs out there that offer this service without selling your soul and first born child. Most CDNs only let you use a generic sub-domain of their own domain to get HTTPS, such as omg-secure.somecdn.net. This is fine if the assets on the CDN are only for your own web site, but that obviously is not the case with us.
Literally the only two we could find that offered this feature at a reasonable price were CloudFlare and MaxCDN, so we decided to test these out. We also wanted to try one of the enterprise level ones, just to see the difference in performance. For this we chose the 800lb gorilla that is Akamai.
MaxCDN offers HTTPS for $99 setup + $99/month, on top of the normal bandwidth costs. Very reasonable. The service was perfectly fine, but they only have locations in the US and Europe. This is definitely a majority of our market but we wanted Asia too. Well, they do offer Asia, but you have to upgrade to their enterprise service, NetDNA, for considerably more money. It was still less than what we were paying for our home brew CDN though, so I decided to try it.
This was one of the worst days I’ve ever had. I didn’t know when the transition was occurring, because I had to submit a ticket for it and then just wait. When they finished it, they let me know, but they messed up the configuration so HTTPS didn’t work. (They forgot the chain file. If you know how certificates work, that’s kind of important.) It was several hours before I realized this, however, because DNS hadn’t propagated yet – I was still hitting their old servers for a while, which were still working fine. By the time I realized there was a problem, the damage had already been done to anyone who was tracking a secure site. Not to mention it completely broke our web site for our Pro+ members, since they get the HTTPS interface by default and none of the assets were loading for them. I immediately emailed them to get it fixed, and pointed the domain back to our old CDN so HTTPS would work in the meantime. But they never actually got it fixed. I don’t know what the problem was. We had a lot of back and forth, but it was clear this was not going to work.
Next was Cloudflare. I’d met the founders at TechCrunch Disrupt the previous September, they’re great. Thing is, they’re not technically a pure CDN. You point your DNS to them, and then all of your site’s traffic passes through their network. They automatically cache all of your static resources on their servers, and then “accelerate” your HTML / dynamic content. Accelerating means requests to your server pass through their network directly to speed them up, but they don’t cache the actual HTML – it just gets to you faster because the route is optimized.
All in all it’s a fantastic service, and I’d be all for it, but they didn’t (and still don’t) support wildcard DNS – which is another do-or-die feature for us because of our white label analytics service. But their rock star support guy, John, told me they could set up a special integration with us where we could just point a sub-domain to them to act as a traditional CDN. Well, it was worth trying because there weren’t any other options at this price level, especially since HTTPS only costs $1/month on top of their normal pricing, and they have servers in Asia too. It seemed too good to be true really. How could they be doing this for such a great price and have such good support? I’m pretty sure John doesn’t sleep. No matter what time I email him, I seem to have a reply in minutes.
Anyways, the service worked great. We had it live for a week or two. At some point there was a problem that caused us to move back to our home brew CDN, although I don’t recall what it was exactly. Overall I was happy and planned to test it again in the future, but I still had Akamai to test.
Akamai is what the big boys use. Facebook, etc. I knew it was good, but also expensive. However, I figured it was worth it if the service was as good as I expected it to be. They literally have thousands of data centers, including in South America and Africa, which very few CDNs have, and my speed tests on their edge servers were off the charts. Using just-ping.com, which tests response time from over 50 locations worldwide, I could barely find a single location that had higher than 10ms response time. Ridiculous to say the least.
They gave us a 90 day no commitment trial to test their service, which was appreciated. Their sales and engineer team were great. Very professional, timely, and helpful. But man did I hate their control panel. It was nothing short of the most confusing interface I have ever laid eyes on. I had no idea how to do anything, and I’m usually the guy who figures that kind of thing out.
They walked me through a basic setup, but then I discovered the next thing I didn’t like: any changes you want to make take 4 hours to deploy. What if you screw something up? That’s gonna be a nail biting 4 hour ball of stress waiting for it to get fixed.
I never actually got to really test their service because I was just too scared of screwing it up. A few weeks had passed and I had forgotten how to configure anything. My patience was wearing thin, as our custom CDN continued to deteriorate and I was dealing with other junk too. There’s always a thousand things going on around here.
John from Cloudflare continued to email me to ask how our testing was going with these other services. He was confident Cloudflare would meet our needs. I was pretty sure too, just hadn’t made up my mind yet. But I decided to go back to them because I didn’t have much other choice.
That was early August and, well, we’ve been with them ever since. No problems at all. Great service. Overall I have nothing but good things to say.
Well, it turns out there was a problem. A few weeks ago, our “pull” server (that they pull static files from) crashed, and at the same time our tracking code stopped being served. It was fixed quickly but… How could this be? They should be caching everything from this server, right?
I emailed them about it and they weren’t sure how the server crashing would affect cached files being served. But unless the cache expired at the exact same time as the crash, something was definitely up.
I did some digging and finally ended up “watch”ing the ifconfig output on the pull server, which shows bandwidth usage amongst other things. We were pushing almost 3MB per second of data out of that thing. Hmm, that doesn’t seem right.
I renamed the tracking code file as a quick test, and sure enough, suddenly Cloudflare wouldn’t serve it. Put it back, bam, it worked.
Clearly this file was not being cached. But why? Well, it wasn’t their fault. The problem was the rather strange URL for our tracking code. Instead of e.g. static.getclicky.com/track.js, the URL is just static.getclicky.com/js. This is one of those “Why the hell did I ever do that” type things, but is too late to change now with almost 400,000 sites already pointing to it.
I emailed them about this and only then discovered that they cache based on file extension, not mime type or cache headers, which we of course properly serve. I wish I had known this beforehand, but wish in one hand shit in the other, see which one fills up first.
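To make the extension issue concrete, here is a rough sketch of what an extension-only caching rule looks like. This is not Cloudflare’s actual code; the function name and the list of extensions are made up for illustration, and the /track.js URL is the hypothetical “sane” URL mentioned above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical allow-list. The real list of cached extensions is an assumption.
CACHEABLE_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".gif", ".ico"}


def is_cached_by_extension(url: str) -> bool:
    """Decide cacheability from the URL path's extension alone.

    Mime type and cache headers are deliberately ignored, which is the
    behavior described in the paragraph above.
    """
    path = urlsplit(url).path
    return PurePosixPath(path).suffix.lower() in CACHEABLE_EXTENSIONS


print(is_cached_by_extension("http://static.getclicky.com/track.js"))  # True
print(is_cached_by_extension("http://static.getclicky.com/js"))        # False
```

The path /js has no extension at all, so a rule like this never treats it as a static file, no matter what headers our server sends with it.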
At this point I knew I needed to do something. Since this single file was not being cached properly, it relied 100% on the single pull server being online at all times. I should have made it my #1 priority but with only a single 5 minute outage in 2 months, I somehow convinced myself I could think about it for a while. This was a big mistake on my part and I apologize profusely for it – it won’t happen again. I could have spent a few grand with Dyn to get failover immediately to give us a safeguard until I found the right (affordable) solution, but I didn’t (more on this in a second). I’m really sorry and I won’t compromise our reliability like that again. Clearly it was not worth it.
So anyways, the same day I discovered this caching issue, the server crashed… again. I got it fixed quickly, and as a quick precaution I set up another server and round robin DNS to serve both IPs, so in case one crashed, there’d be a backup. There was no monitoring/failover on this config, but if DNS serves multiple IPs for a domain, theoretically the requester is supposed to fall back on the second one if the first one fails. I had never actually tested this scenario, but it was just an interim fingers-crossed fix until I got a real solution in place.
And then the server crashed again… and I discovered this did not work as I hoped (surprise).
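For the curious, this is roughly the behavior I was hoping every requester would have. It’s a minimal sketch, not anyone’s real client code: the function names are invented, the connect step is faked so it runs without a network, and the IPs are reserved documentation addresses, not our servers.

```python
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str],
                    connect: Callable[[str], bool]) -> Optional[str]:
    """Try each address DNS returned, in order, and return the first that works."""
    for address in addresses:
        try:
            if connect(address):
                return address
        except OSError:
            # Connection refused, timed out, etc. Move on to the next address.
            continue
    return None


# Pretend the first pull server is down (hypothetical addresses).
DEAD = {"192.0.2.1"}


def fake_connect(address: str) -> bool:
    if address in DEAD:
        raise ConnectionRefusedError(address)
    return True


print(first_reachable(["192.0.2.1", "192.0.2.2"], fake_connect))  # 192.0.2.2
```

A requester that only ever tries the first address it is handed gets nothing whenever that address happens to be the dead one, which is why round robin on its own is not failover.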
Ok, so we need failover on this, like yesterday. This is now my #1 priority. Our DNS provider, Dyn, offers this feature, but what I hate about their implementation is the restrictions they place on the TTL (time to live), which is how long DNS will cache a query for. Obviously the TTL should be fairly short for maximum uptime, but the max they allow you to set with failover is 7.5 minutes. A TTL that short means a lot more queries, and with our level of traffic, this would increase our bill by several thousand dollars a month, which is a bit steep for my liking. Not to mention the expensive monthly base fee just to have this feature enabled in the first place.
I finally came up with a plan though. I found another DNS provider, DNSMadeEasy.com, that offers monitoring/failover for very reasonable pricing and no restrictions on TTL. I specifically emailed them about this like 4 times to confirm it would work exactly as I expected. However I can’t just transfer getclicky.com to be hosted there, because we’re in a contract with Dyn (sigh). So I was going to set up a different domain on their servers, and then using CNAMEs, point Cloudflare to pull files from that domain, instead of the sub-domain we were using for getclicky.com.
That was yesterday. “Great!”, I said to myself. “I’ll set it up first thing tomorrow because it’s almost midnight!”
And then this morning………. that’s right, the freaking server crashed again. My phone was on silent by accident and I slept in, so for almost 2 hours our tracking code was only being served for about 75% of requests (because DNS IP fallback does work some of the time, it seems). Hence, more problems this morning.
ARGH. I screamed at my computer and just about burned down my house I was so mad. I had come up with a plan that I knew would work and was going to implement it first thing the next day, but the server crashes in the meantime and here I am in bed, blissfully dreaming of puppies and unicorns, unaware of any problems because my STUPID PHONE IS ON SILENT. WHY. ME.
But the good news is, today, I got this all set up. Monitoring/failover is now live on our pull servers, and they are checked every 2 minutes – so if there is a problem with any of them, DNS will stop serving that IP to Cloudflare within 2 minutes at the most, and I verified it works properly by intentionally killing a server. And the TTL is only 5 minutes, so the absolute maximum amount of time there could potentially be a problem for any individual person is 7 minutes. And we added a third pull server, so at the most this would only affect about 1/3 of requests, and even then, for a maximum of 7 minutes.
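Here’s the back-of-the-envelope math behind those numbers, as a tiny sketch. The function names are mine, and the 1/3 figure assumes requests are spread evenly across the pull servers, which is an assumption rather than something I’ve measured.

```python
from fractions import Fraction


def worst_case_minutes(check_interval_min: int, ttl_min: int) -> int:
    """Longest a dead IP can keep being used.

    Worst case: the server dies right after a health check (so it takes a
    full interval to notice), and a resolver cached the dead IP just before
    it was pulled (so it keeps it for a full TTL).
    """
    return check_interval_min + ttl_min


def affected_share(total_servers: int, failed_servers: int = 1) -> Fraction:
    """Share of requests landing on dead servers, assuming an even spread."""
    return Fraction(failed_servers, total_servers)


print(worst_case_minutes(2, 5))  # 7
print(affected_share(3))         # 1/3
```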
(Note: Above I was complaining about Dyn’s 7.5 minute max TTL, and here I am with a 5 minute one. Well, this one’s a bit different because only Cloudflare’s servers talk to it, so the total queries generated are quite small. The real issue is we’re also going to be doing this same thing in order to “load balance the load balancers” (really?), because we’re adding two more of them this week. Using failover on this is what would be really expensive, so we’re avoiding that by using another DNS provider for it, and we figure we might as well do all of that monitoring and failover in one place. Load balancers are stable and reliable, so the TTL will be a bit higher – and even if not, their pricing is considerably cheaper than Dyn’s, so it’s all good).
On top of all that, Cloudflare desperately wants to “fix” this caching “problem” on their end too. (I say “problem” in quotes because their service is working exactly as they designed it to work, I just didn’t know ahead of time that caching was based on file extension only). They are working on a solution that will allow us to rewrite URLs on their end so that their servers will see the tracking code file as something that ends with a .js file extension and hence cache it properly, without us having to make any changes on our end. Once that’s live, even if all 3 of our pull servers were offline (knock on wood), it should have zero impact because that stupid legacy URL file will actually be cached!
So that, my friends, is as short a summary as I can write about everything we’ve been through with CDNs.
And on top of all this, we also made an update to the tracking code on Nov 1 that caused issues for some of you. This update has been reverted but that was the last thing we needed with the CDN also causing issues at the same time. [Update: And there was a small network hiccup at our data center on Nov 9 that caused a short outage. Worst week ever.]
So I don’t really feel like we have earned your money this month (and to think, it’s only the 8th…). If anyone wants a refund, send us an email and we’ll happily refund you a full month of service.
No matter what, know that I value the quality of our service above anything else and will always do everything in my power to make sure it works flawlessly. This has been a horrible week, but as of now the CDN should not impact anyone.
Thanks for reading and (hopefully) understanding.