Databases - All up and running well

Is your server running well? How is it performing compared to yesterday or last week? Overall it might seem to be all OK, with CPUs sitting comfortably at 20% and the disks whirring softly in their rack...Yet someone’s not happy! Would you know who? What’s different about today?

How's that database server going?

At the most fundamental level - “How’s it going?” can be answered with a simple “Yes - it’s going” or “No - it’s not”.

Over time, aggregating these answers gives us the often cited statistic: Server Uptime. A server being “up 99.99% of the time” is of course only part of a more complex picture. A server being performant and responsive to a workload serving hundreds of users and thousands of tasks is harder to measure, and certainly harder to summarise into a single statistic. Let’s consider then not just “is the server going”, but “is it performing all tasks well - right now”.

Database monitoring and profiling

Conventional wisdom states that if there’s a problem affecting performance, it’s likely to be caused by a large or complex procedure. Monitoring solutions therefore focus only on procedures taking a long time to run. Procedures that are quick aren’t tracked at all, even if they’re run thousands of times by hundreds of people.

Another frequent reason for tracking only long-running transactions is the impracticality of logging everything in detail. Logging thousands of short transactions would generate so much data that the logging itself would only add to performance issues.

The result of this conventional wisdom and pragmatism is that for all logged SQL taking more than a few seconds to run - there are another 10,000 small blocks ignored.

Death by “A Thousand Cuts”

Small doesn’t necessarily mean unimportant. Let’s consider there’s a small procedure, which is currently taking 0.3 seconds to run. That might seem fast, until we find it’s ran 1,000 times! Suddenly, 0.3 seconds becomes five minutes, and that’s far longer than most people would want to sit around waiting!

If we knew that it should normally take 0.003 seconds, then by comparison 0.3 seconds suddenly looks very slow indeed! However as noted above, many monitoring and profiling solutions will not record any of this as each discrete run is still considered trivial.

Running counters of execution statistics

Microsoft SQL Server does however record running totals of CPU use, execution time and also disk I/O statistics. These are maintained in an internal system dynamic management view: “Execution Query Statistics”.

Each individual part of every procedure across all databases has its own running total. Over time, run-counts, CPU use and total execution times are incremented each time a procedure is run, similar to the odometer of a car.

When looking at these totals, it’s often the case that the biggest consumers of CPU aren’t the long-running processes you might expect (like month-end KPI reports). Instead they’re often innocuous things - though run very frequently.

The limitation of the running-totals stored in “Execution Query Statistics” is that the statistics are aggregated from the point in time the server first starts-up. On a server with 99.99% uptime we can expect that generally we’ll be looking at running sums that have had plenty of time to become very large numbers. Considering the example above of a change in execution times from 0.003 seconds to 0.3 seconds; this will not have a detectable impact on a very large running sum for quite a while!

But what is happening right now?

While the totals for CPU and execution duration can’t show us procedures that are only recently performing poorly - we can work out what’s going on if we take a few snapshots over time.

By recording a snapshot of execution counters periodically, the subtracted difference from one snapshot to the next will show the actual CPU used for the intervening period.

In practice, taking snapshots is easy, and also quick thanks to the handy inclusion by Microsoft of a timestamp showing the last run-time for each query. It’s easy therefore to snapshot new totals for just queries run since the previous snapshot.

Once a baseline is established, or by comparing to the previous days’ averages, queries using more CPU or time to complete can be easily spotted.

What we’ve found

Over time, execution times for procedures can change. Changes might occur because the volume of data increases over time causing SQL to behave differently. For example; an index-lookup for rows might become a table-scan once the rowcount crosses a certain threshold.

Detecting these kinds of changes quickly can mean problems are resolved even before the phone starts to ring.

Although detailed logging of long-running procedures is useful, we’ve found that keeping tabs on the Execution Statistics’ running totals has allowed us to not just say “Yes - the servers’ are up” but also “Yes - all the systems’ are running well, as they have been all week!”