Two complaints we receive fairly often are that too many bots get logged, and that backups on Friday night are annoying. Well, here I am on a Friday night letting you know that things are looking up!
Bots
Let me first be clear that this problem is not unique to Clicky. Most bots don't interact with JavaScript, so most are not logged by JavaScript-based trackers. We also have a fairly big regular expression that aims to filter out any that do execute JavaScript, and it works pretty well. I think we are definitely one of the best at filtering out bots already, but the complaints keep coming in. People see it as a defect of Clicky, even though it affects every tracker. And the bots keep getting trickier.
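To give a flavor of what that kind of filtering looks like, here's a toy PHP sketch. Our real expression is far larger, and the tokens below are just a few well-known crawler names used for illustration, not our actual pattern:

<?php
// Toy version of user-agent based bot filtering. The real expression
// is much bigger; these tokens are just common, well-known crawlers.
$bot_pattern = '#(googlebot|bingbot|msnbot|slurp|crawler|spider)#i';

function is_known_bot($user_agent, $pattern) {
    // preg_match returns 1 on a match, 0 otherwise; cast to bool.
    return (bool) preg_match($pattern, $user_agent);
}

// Example: this visitor would be filtered out before logging.
var_dump(is_known_bot('Mozilla/5.0 (compatible; Googlebot/2.1)', $bot_pattern)); // bool(true)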
Both Microsoft and Google have started sending out bots in disguise in the last year or two, the theory being that they're verifying your content doesn't appear differently when your web site thinks a visitor is a regular person instead of a crawler. These bots have "real" user agents, so you can't tell they're bots from that alone. However, a few people pointed out something unique: their user agent is always Windows XP / MSIE 6.0, and they always report a screen resolution of 1024×768. That alone is not enough to filter out a visitor, since chances are good that someone still on IE6 has a real dinosaur of a computer on their hands. But since Clicky tracks organizations, we can dig deeper. When we look up the organization info for these visitors, if it's Google or Microsoft, we can be 99.9% confident it's a bot. (Because if either of these extremely rich companies seriously still has employees using computers this horrible… well, they should be sued.)
The problem, however, was that we didn't look up a visitor's organization until after that visitor was inserted into the database. But tonight I re-arranged some things, and now we check for those three unique factors (XP, IE6, 1024×768) before inserting anything. If we have a match, we look up the organization immediately and pull a little preg_match("#(microsoft|google)#i", $organization) magic out of our hats, and if it returns true: BAM. Not logged.
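For the curious, the whole check boils down to something like the sketch below. This is a minimal illustration, not our actual code: the helper lookup_organization() is a made-up stand-in for a real organization lookup, and the user agent substrings are simply the standard tokens that XP ("Windows NT 5.1") and IE6 ("MSIE 6.0") report:

<?php
// Hypothetical sketch of the new pre-insert bot check. Helper names
// like lookup_organization() are stand-ins, not actual Clicky code.

// Stand-in for the real organization lookup (e.g. a GeoIP/org database).
function lookup_organization($ip) {
    return 'Microsoft Corp'; // pretend result for this example
}

function is_disguised_bot($user_agent, $resolution, $ip) {
    // Only the XP + IE6 + 1024x768 combo warrants the extra lookup.
    $suspicious = strpos($user_agent, 'Windows NT 5.1') !== false  // XP
               && strpos($user_agent, 'MSIE 6.0') !== false        // IE6
               && $resolution === '1024x768';
    if (!$suspicious) {
        return false;
    }

    // Suspicious combo: look up the organization *before* inserting
    // the visitor, instead of afterward as we used to.
    $organization = lookup_organization($ip);

    // Google or Microsoft as the org means we're 99.9% sure it's a bot.
    return (bool) preg_match('#(microsoft|google)#i', $organization);
}

// A matching visitor is simply never inserted into the database.
$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
var_dump(is_disguised_bot($ua, '1024x768', '192.0.2.1')); // bool(true)

The key change is just the ordering: the organization lookup happens before the database insert for this one suspicious fingerprint, so matching visitors never touch the database at all.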
There will still be bots that sneak through, I'm sure of that. However, Google and Microsoft seem to be the biggest "problems", and I've never seen an obvious bot from either of them that did not have XP / IE6 / 1024. They might change that in the future to make our lives more difficult again, but for now I'm confident this will eliminate almost all of the bots we've been logging that shouldn't be getting logged. Yay!
Backups
We do full database backups every Friday night starting at 10pm Pacific (GMT -7), during which traffic processing is halted. As the databases grow in size, these take longer and longer. We were investigating improvements to this process earlier this week, and I realized I was not setting a flag that would basically cut the backup time in half. That was a horrible oversight on my part, but I'll own up to it, and it has now been fixed. Most databases complete their backup in 1-2 hours, but some that are 3-4 years old were getting near the 4 hour mark. Now the max any of them should take is about 2 hours, and most should finish in an hour or less.
But wait, there's more! We're going to be moving to a new database engine in the near future (goal: 3 months) that will be much more backup-friendly. We won't have to halt processing at all while backups are taking place. That will be a nice change. ^_^
Good night!