The following quotes (or something very similar) came from our interactions with customers:

“We paid a lot of money for this hardware, why isn’t your database making full use of it?”

“The machine is peaking at 100% CPU, the sky is falling, help, NOW!”

This is a problem, because I can empathize with both sides. On the one hand, having just put a five or six figure sum into new hardware, it can be depressing to see it “going to waste”. On the other hand, seeing the system under high load gives you that sinking feeling that the boat is going to overturn at any moment and production will go down.

Balancing resource consumption is a really hard problem, mostly because we don’t have any control over our work intake. We can’t control how many requests we accept, nor what kind of work is being asked of us. Actually, that isn’t quite true. We could control both, but in most cases that is a false choice.

At some point, RavenDB had a limit on the maximum number of concurrent requests, and users have hit that limit in the past. This resulted in angry calls from customers about RavenDB refusing requests. The fact that we did that to maintain the overall health of the system was immaterial. Refusing requests meant that the system (or some portion of it) was down. In those cases, it was actually better, from the customer’s perspective, for the whole thing to slow down a bit, as long as there were no errors.
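The trade-off can be sketched in a few lines. This is a hypothetical illustration, not RavenDB’s actual code; the `RequestGate` class and its `mode` parameter are my own invention. With a hard cap on concurrency, you must pick one of two behaviors when the cap is hit: refuse the request (the customer sees an error) or make it wait (the customer sees a slower response).

```python
import threading

class RequestGate:
    """Hypothetical sketch of a concurrent-request cap (not RavenDB's
    actual implementation). In 'refuse' mode, exceeding the cap is an
    error; in 'wait' mode, callers are slowed down instead."""

    def __init__(self, max_concurrent: int, mode: str = "wait"):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._mode = mode

    def handle(self, request_fn):
        if self._mode == "refuse":
            # The customer-visible failure described above.
            if not self._slots.acquire(blocking=False):
                raise RuntimeError("server too busy")
        else:
            # The customer just waits a little longer; no error.
            self._slots.acquire()
        try:
            return request_fn()
        finally:
            self._slots.release()
```

The point of the anecdote is that the second branch, even though it makes every request slower under load, is the one customers actually prefer.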

Inside RavenDB, we attempt to manage our CPU consumption using separation of concerns. First, we have the processing of requests. The assumption is that such requests end up being waited on by an actual human, directly or indirectly, so we process them first, prioritizing them above almost everything else. The only thing that has higher priority is the cluster health and monitoring system, which ensures that all nodes are up, running and in the same state.

As it turns out, RavenDB has a lot of additional processes running internally that can be given a lower priority under load. For example, indexing, which RavenDB runs in the background, is something whose latency we can increase in order to free up resources for request processing.
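The ordering described above can be sketched as a small priority scheduler. Again, this is a hypothetical illustration under my own assumptions, not RavenDB’s actual scheduler: cluster health first, then requests, then background work such as indexing.

```python
import heapq

# Priority levels matching the ordering described in the text
# (lower number = runs first). Names are illustrative only.
CLUSTER_HEALTH, REQUEST, INDEXING = 0, 1, 2

class PriorityScheduler:
    """Hypothetical sketch: dispatches pending work strictly by
    priority, so background indexing only runs when no request or
    cluster-health task is waiting."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # FIFO tie-break within a priority level

    def submit(self, priority, task_fn):
        heapq.heappush(self._heap, (priority, self._counter, task_fn))
        self._counter += 1

    def run_next(self):
        """Run the highest-priority pending task and return its result."""
        _, _, task_fn = heapq.heappop(self._heap)
        return task_fn()
```

Note that this sketch also shows the failure mode discussed later in the post: if requests keep arriving, indexing tasks sit at the bottom of the heap indefinitely and can be starved of CPU time.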

We have a lot of experience balancing these overall needs, and I’m still not sure that I have a good answer here. The reason for this post is that I just analyzed a dump file where it looked like requests were waiting for indexing to complete, but they were actually starving the indexes of the CPU time that they needed to run. The system made progress, just not fast enough for the user not to notice.

Actually, that is the primary criterion we use. If the system is slow, but no one notices, the system ain’t slow.

Ayende (real name Oren Eini) is the founder and CEO of Hibernating Rhinos, with over 15 years of experience in development. He is a frequent blogger under the pseudonym Ayende Rahien, where he focuses on the Microsoft .NET ecosystem, which has earned him recognition and awards as a Microsoft Most Valuable Professional since 2007. Ayende is an internationally acclaimed presenter, and you can catch him speaking at DevTech, JAOO, QCon, Oredev, NDC, Yow! and Progressive.NET conferences. Ayende shares his extensive knowledge when speaking at conferences and through his written works, such as "DSLs in Boo: Domain Specific Languages in .NET", published by Manning, and more recently through his book "Inside RavenDB". Professionally, Ayende remains dedicated to architecture and best practices that promote quality software and zero-friction development. In his personal life, Ayende is an avid reader who is now completely captivated by his personal “novel”, namely his daughter, who was born in the spring of 2015.