A few months ago, after some reports of lag with one of the systems in a distributed architecture I built and maintain, I discovered how important it was to pay attention to how queries were constructed. One day, near the end of a shift, after some complaints from staff members, I observed a high server load in the database server that backs this architecture, and traced it back to its source.
One service (nanny) that runs in this architecture was used to monitor and maintain the health of the processes on each server. This service depended on some information from the database, which was fetched by another service that processes RPC (remote procedure calls). Digging into the RPC service, I discovered several problems that weren’t obvious when it was first built. These problems became more obvious as more systems were added to the pool, increasing the load on the RPC service in question.
At it’s peak, the database server backing this process was showing a peak load of 3.2 and an average of over 1. After digging into it and fixing the problems, the server now sits at a peak load of 0.18 and an average load of 0.04.
- Modified nanny service to only request some service information on every other run instead of every run.
This reduced the overall load placed on the RPC service.
- Added a LIMIT to the RPC service query in question.
Before applying the LIMIT, this query was generating around 100k results, but only being used to get the latest information.
- Added a composite index to improve the query time.
This was the single best gain in query speed, resulting in the query time going from a maximum of 1.7 seconds to a maximum of 0.02 seconds.
tload – A tool that generates a ASCII CPU load graph in a console.
- Always limit your queries if you only fetch one row from them.
- Revisit your queries to examine and update indexing to be more suitable for the data and access patterns.
- Visualization over time is helpful in diagnosing problems.