How to be Proactive with your Database Monitoring
Adopting a proactive stance towards database monitoring is paramount for ensuring optimal performance and reliability. Effective monitoring not only safeguards production systems but also plays a critical role during stress testing and real-user monitoring. Let’s explore the key areas of interest and the specific database metrics needed to keep a database healthy.
Input/Output Metrics
I/O (Input/Output) refers to the interaction between a system and external devices, such as disks or networks. Closely related is throughput, which measures the rate of data transfer between the system and those devices. Metrics for I/O include the following:
- Disk read/write speed: Should be captured in MB/s or IOPS (input/output operations per second). This essentially captures the speed of throughput; if it starts to decrease, it could indicate a potential I/O bottleneck.
- I/O queue length: Quantified as the number of pending I/O requests. A concerning trend is this queue length increasing; teams should set an active alert on this metric so it fires when the queue gets too long.
- I/O time-out and error rate: Understand the ratio of calls that end in an error or a time-out. This is an important metric to track, as excessive errors or increases in this rate could indicate a larger issue that needs to be addressed. Also make sure to track time-out occurrences specifically, as these could indicate that the database is down or has larger performance problems.
- I/O retry rate: Related to the error-rate metric above, it is important to track the retry rate of the database. A growing share of I/O operations ending in a retry indicates that there may be a performance bottleneck. This is also good to compare against the error rate, because retry logic is essential for resiliency: you want almost all I/O errors to be covered by some kind of retry logic.
- Sequential vs. random I/O ratio: Tracks the proportion of I/O operations that are sequential versus those that are random. Sequential operations are predictable and related, such as reading a large file iteratively from start to finish, whereas random operations are more scattered and unrelated. Understanding this ratio is important to the design of the database, and unexplained changes in it should also raise alerts, as they can signal cascading issues.
It's essential to analyze how fluctuations in I/O and throughput metrics impact overall system performance. For example, prolonged periods of high disk queue length may lead to increased query response times and degraded user experience. By correlating I/O metrics with performance metrics, developers can proactively address potential performance bottlenecks and optimize system resources.
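To make these numbers concrete, here is a minimal sketch of an I/O sampler using Python and the psutil library; the device name, sampling interval, and throughput baseline are illustrative assumptions to adjust for your environment:

```python
# Minimal sketch: sample disk throughput and IOPS with psutil and flag a
# possible I/O bottleneck. Device name and baseline are assumptions.
import time
import psutil

DEVICE = "sda"            # hypothetical device name; adjust for your host
MIN_MB_PER_SEC = 50       # assumed throughput baseline for this workload
SAMPLE_SECONDS = 10

def sample_disk(device: str, interval: float):
    """Return (read MB/s, write MB/s, read IOPS, write IOPS) over the interval."""
    before = psutil.disk_io_counters(perdisk=True)[device]
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)[device]

    read_mbps = (after.read_bytes - before.read_bytes) / interval / 1_000_000
    write_mbps = (after.write_bytes - before.write_bytes) / interval / 1_000_000
    read_iops = (after.read_count - before.read_count) / interval
    write_iops = (after.write_count - before.write_count) / interval
    return read_mbps, write_mbps, read_iops, write_iops

if __name__ == "__main__":
    r_mbps, w_mbps, r_iops, w_iops = sample_disk(DEVICE, SAMPLE_SECONDS)
    print(f"read {r_mbps:.1f} MB/s ({r_iops:.0f} IOPS), "
          f"write {w_mbps:.1f} MB/s ({w_iops:.0f} IOPS)")
    # A drop below the baseline during a normally busy period is the signal
    # worth alerting on; low numbers during idle periods are expected.
    if r_mbps + w_mbps < MIN_MB_PER_SEC:
        print("WARNING: throughput below baseline - possible I/O bottleneck")
```

In practice you would ship these samples to your monitoring system rather than print them, so that alert rules can watch for sustained drops in throughput or growth in queue length.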
Query Time and Table Lock Metrics
Query times refer to the duration it takes for a database system to execute a query. Table locks occur when a transaction holds a lock on a table, preventing other transactions from accessing it. Query and lock behavior can be captured with the following metrics:
- Query execution times: Generally captured in milliseconds and associated with each query operation in a database. Deteriorating query times indicate inefficient queries or suboptimal database configuration, such as index, view, or even schema design.
- Query plan efficiency: Computed as the ratio of rows scanned to rows returned. If a query is well indexed, it should scan as close to the number of rows returned as possible. If the number scanned grows much larger, the query is not optimized properly.
- Query plan stability: Measures the consistency of query execution plans over time. The more stable and consistent the plans, the better. If the query plan is unpredictable or varies frequently, it highlights an opportunity for improvement.
- Lock wait times: Generally measured in milliseconds and associated with a specific table (or document) impacted by a query operation. Increasing table lock wait times can indicate an issue. Because a lock wait is linked to the query causing it, teams with proactive monitoring and alerting should be able to track down the offending query (see the sketch after this list).
- Lock escalation events: Quantified as the number of times fine-grained locks (such as row or page locks) are escalated to coarser table locks. As with lock wait times, an increase in this metric should send off alerts that there is an issue in the database.
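As a concrete starting point for the lock wait metric above, here is a minimal sketch, assuming PostgreSQL and the psycopg2 driver, that lists sessions currently waiting on a lock together with the query doing the waiting; the connection string is a placeholder:

```python
# Minimal sketch of lock-wait monitoring, assuming PostgreSQL + psycopg2.
import psycopg2

DSN = "dbname=appdb user=monitor"   # hypothetical connection string

WAITING_ON_LOCKS = """
SELECT pid,
       now() - query_start AS waiting_for,
       wait_event,
       query
FROM   pg_stat_activity
WHERE  wait_event_type = 'Lock'
ORDER  BY waiting_for DESC;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(WAITING_ON_LOCKS)
        for pid, waiting_for, wait_event, query in cur.fetchall():
            # Long waits here are the lock wait times worth alerting on.
            print(f"pid={pid} waiting {waiting_for} on {wait_event}: {query[:80]}")
```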
In addition to monitoring overall query times, it's crucial to identify specific queries that contribute disproportionately to system load. By pinpointing query hotspots, administrators can prioritize optimization efforts, such as query tuning or index creation, to alleviate performance bottlenecks and improve overall system efficiency.
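One way to surface those hotspots, assuming PostgreSQL with the pg_stat_statements extension enabled (the column names below follow PostgreSQL 13 and later; older versions expose total_time and mean_time instead), is to rank statements by total execution time; this is a sketch, not a definitive implementation:

```python
# Minimal sketch: rank query hotspots by total execution time, assuming
# PostgreSQL with pg_stat_statements enabled. The DSN is a placeholder.
import psycopg2

DSN = "dbname=appdb user=monitor"   # hypothetical connection string

TOP_QUERIES = """
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for query, calls, mean_ms, total_ms in cur.fetchall():
            print(f"{total_ms:10.1f} ms total | {mean_ms:8.2f} ms avg | "
                  f"{calls:8d} calls | {query[:60]}")
```

Queries near the top of this list are the usual candidates for tuning, new indexes, or caching.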
Buffer Pool Usage Metrics
Storage usage encompasses the allocation and utilization of storage resources within the database system. But we want to go beyond just disk usage, as this is usually well understood and monitored. Databases also make use of buffer pools, partitions, and tablespaces, all of which have storage limitations. For the sake of conciseness, we will focus on buffer pools, a critical and often overlooked resource to monitor. Buffer pools are areas of memory used to cache frequently accessed data pages, and their configuration is critical to the performance of a transactional system. You can best monitor your buffer pool configuration with the following metrics:
- Buffer pool hit ratio: Percentage of data requests satisfied by the buffer pool without disk access. This ratio should be kept as high as possible, and any decrease could indicate a change in workload behavior or a performance issue (see the sketch after this list).
- Buffer pool latency: Calculated as the average time taken to read/write data pages from/to the buffer pool. This is an essential benchmark for ensuring the buffer pool is being used effectively. Sudden increases in this time could mean there is insufficient memory, increased memory pressure, or another performance hindrance.
- Page replacement rate: Defined as the frequency with which pages are replaced in the buffer pool. A high page replacement rate suggests the buffer pool is frequently evicting data pages from memory to make room for new ones, which may indicate that the buffer pool is too small to cache data pages effectively.
- Buffer pool contention: Counts the occurrences of multiple processes contending for access to the buffer pool at a given point in time. This is an important indicator of whether your buffer pool is being overwhelmed; an increase in this metric should trigger investigation.
- Dirty pages: The number of modified pages in the buffer pool that have not yet been written to disk; a change that has not made it all the way to disk is what we call 'dirty'. A steadily rising count can point to significant issues with data integrity. Some databases do not allow dirty pages at all, so in those use cases any value greater than zero should set off alarms.
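To make the hit ratio measurable, here is a minimal sketch assuming PostgreSQL, where the shared buffer cache plays the buffer pool role; other engines expose equivalent counters (for example, InnoDB buffer pool status variables in MySQL). The connection string and target ratio are placeholder assumptions:

```python
# Minimal sketch of a buffer cache hit ratio check, assuming PostgreSQL.
import psycopg2

DSN = "dbname=appdb user=monitor"   # hypothetical connection string
MIN_HIT_RATIO = 0.99                # assumed target for an OLTP workload

HIT_RATIO = """
SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0)
FROM   pg_stat_database;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(HIT_RATIO)
        ratio = cur.fetchone()[0]
        if ratio is None:
            print("no buffer activity recorded yet")
        elif ratio < MIN_HIT_RATIO:
            print(f"WARNING: hit ratio {ratio:.4f} below target - check memory sizing")
        else:
            print(f"buffer cache hit ratio: {ratio:.4f} - OK")
```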
As data volumes grow, it's crucial to proactively manage storage growth to prevent capacity issues and performance degradation. Implementing strategies such as data archiving, partitioning, or compression can help optimize storage utilization and mitigate the impact of growing data volumes on system performance.
Index Performance Metrics
Indexes are data structures used to optimize query performance by facilitating efficient data retrieval. The index metrics are similar to those in the query time section, as indexes are aligned to queries. Given indexes' impact on query performance, I want to call out index-specific metrics to monitor for performance:
- Index seek vs. scan ratio: Similar to the query plan efficiency metric in the query time section, we can compare the rows returned by an index to the rows scanned to measure the effectiveness of the index. The closer these two numbers are, the more efficient the index. If they grow further and further apart, the index should be optimized or dropped.
- Index fragmentation: Can be measured as the degree of fragmentation within a specific index as well as the average level of fragmentation across all indexes. Fragmentation is the degree to which the logical order of data in an index does not match the physical order of the data stored on disk. High levels of fragmentation erode an index's performance, and this metric helps identify indexes that need maintenance.
- Index size overhead and growth rate: Measured as the size of the index as a whole, in MB or GB, and its growth rate. A very large index may need to be optimized or eliminated altogether. Spikes in growth rate are also important to monitor, as rapid growth will degrade the effectiveness of the index.
- Index usage: The number of times an index is used over a period of time. This common metric is usually used to see whether indexes are over-used, but it is just as important for spotting where they are underutilized. Especially when combined with size overhead, indexes that are not being used are a burden on the system and should be dropped.
Maintaining an efficient index strategy is essential for maximizing query performance and minimizing resource overhead. Regularly reviewing index usage patterns and identifying unused or underutilized indexes can help administrators streamline index maintenance efforts and ensure that database resources are utilized effectively.
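A simple way to start that review, assuming PostgreSQL, is to list indexes that have never been scanned along with their size; treat the output as a candidate list rather than a drop list, since indexes backing primary key or unique constraints may legitimately show zero scans:

```python
# Minimal sketch: find unused indexes and their size overhead, assuming
# PostgreSQL. The DSN is a placeholder.
import psycopg2

DSN = "dbname=appdb user=monitor"   # hypothetical connection string

UNUSED_INDEXES = """
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY pg_relation_size(indexrelid) DESC;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(UNUSED_INDEXES)
        for schema, table, index, scans, size in cur.fetchall():
            # Candidates for removal: never scanned, yet still taking space
            # and slowing down writes.
            print(f"{schema}.{table}.{index}: {scans} scans, {size}")
```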
In conclusion, proactive database monitoring is a cornerstone of effective database management, enabling administrators to anticipate issues, optimize performance, and ensure the reliability of mission-critical systems. Many of these metrics are early indicators of performance issues and are more proactive than simply monitoring CPU, RAM, and disk consumption. By diligently monitoring key performance indicators and implementing proactive measures to address emerging challenges, organizations can maintain a competitive edge and deliver exceptional user experiences.
By Ellie Najewicz