One of the biggest complaints we receive is how long it can take to look up historical data that spans a range of more than a couple of days. I want to explain why this was and what steps we have taken to fix this.
Previously we had only been storing summary data on a per-day basis. So if you wanted to view an entire month’s worth of searches for, say, July 2009, our system would have to do 31 queries (one for each day) to get all of the searches for the entire month, then sum all that data together before outputting anything. This is inefficient and could take quite a while.
To fix this, we have added systems that also summarize all of your data on both a weekly and monthly basis. So now if you do a query for July 2009 searches, it only has to do one database query, instead of 31. However, because there is more data in that single query output, the time it takes will be longer than just 1/31 of the original. For the average site it will be about 10-20% of the original size of data. (This is relevant because the more data it has to read off the disk, the longer it takes to get it). This means for the average site, large date range queries will take just 10-20% of the original time that they used to. That’s a huge improvement!
The same thing applies to full week (Sunday – Saturday) queries. Instead of having to do seven queries per week, it only has to do one. And all of this works fine for spanning multiple weeks and months also.
(We’ve been running scripts in the background the past week or so that have been calculating summarizing this data into weekly and monthly chunks, all the way back to Jan 1. All servers are done with this, except db7, which should be done within 24 hours. At the end of each day going forward, it adds your daily totals onto the week and month that that date belongs to. This means that for data prior to Jan 1, these optimizations will not apply).
What if your date range doesn’t fit exactly into full month or full week increments?
No problemo. We have designed this system to be very flexible, using the most optimized queries possible for whatever query you throw at it. Let’s pretend instead of July 2009 (1-31), you queried for May 31 – July 15.
Here’s how our new system deals with this query. This is a bit technical, but it’s worth a read to learn how the system optimizes the queries, which will also allow you to create date ranges that are as optimized as possible.
First, it figures out all of the individual days in this date range. In this case, there are 46 full days. This is how many queries would have to be done under the old system.
Next, it checks if there are any full months in this date range. In this case there is one (June 2009), so it removes all of June from the daily queries, and adds that month to the monthly queries. We are now down to 17 queries, from 46.
It then checks if there are any full weeks (Sunday – Saturday) in the remaining single days. In this example, July 5-11 is the only full week. So it removes those 7 days from the daily queries, and adds that week to the weekly queries.
We are now at 11 queries – 1 monthly, 1 weekly, and 9 daily. This is down from 46 daily queries originally, which is ~75% less queries that need to be run, which will result in much faster response times. Hope you enjoy.