The devil finds work for idle transactions: A Birdie Engineering Update

A Birdie Engineering Team update from our Software Engineer, Leo.



Idle what?

At the core of our architecture at Birdie is a Node.js app which we scale across multiple nodes in a Kubernetes cluster. This app is responsible for authentication, exposing many of our endpoints as well as acting as the API gateway for our microservices.

To query our PostgreSQL database, we use TypeORM. Each application node has a pool of 10 connections with which to perform these queries. We'll need this bit of information later on to understand an issue on production which recently cost us some downtime.
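For the curious, the relevant knob looks something like this - a sketch with placeholder connection details, not our real config (TypeORM hands `extra` straight through to the underlying node-postgres driver, where `max` caps the pool):

```typescript
// Sketch of the relevant TypeORM options. Host, credentials and database
// name are placeholders, not real values.
export const dataSourceOptions = {
  type: 'postgres' as const,
  host: 'localhost',
  port: 5432,
  username: 'app',
  password: 'secret',
  database: 'birdie',
  // Passed through to node-postgres: at most 10 connections per node.
  extra: { max: 10 },
};
```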

As we've acquired more partners (and therefore API clients), we have been slowly trying to enforce the use of one database transaction for each request to our API. This means we can benefit from the ACID properties of transactions and guarantee a level of consistency for API clients.

To give an example, let's say that we have an endpoint whose job is to delete someone from our application.

In NestJS, the framework we're using, it could look something like this:

Example NestJS controller (shown as an image in the original post)

Obviously, it's the responsibility of the PeopleService to do the actual deleting of the database record, but we're likely going to do other things there too: dispatching events, queuing messages, updating other entities, calling other services etc.

The point is, when doing these "other things", for a long time we weren't using the same database transaction - which is obviously a problem when the "other things" include reading from or writing to your database: we can't reap the rewards of ACIDity.

Context is key

You might well ask:

"Why weren't you using transactions you plebs?"

Fair enough. A touch harsh, but fair enough.

The answer is that we simply didn't need to: we didn't have the volume of traffic to warrant the extra legwork of setting this up for our endpoints.

So each time our API queried the database, we'd take one connection out of our connection pool and use it to, in our example, delete a user. Then we'd use another connection for any subsequent tasks, like triggering event listeners, all before returning a response to the client.

Obviously this is not a great situation to be in if you're trying to provide some level of consistency from your API. So we developed a context decorator which can be used to inject a context manager object into controller methods.

How does it work? Well, under the hood it basically uses Nest's createParamDecorator to return an instance of what we call a ContextManager object.

Structure of our ContextManager (implementation left as an exercise to the reader 😉)
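Without spoiling the exercise, a rough idea of the shape - every name below is a guess, with TypeORM's QueryRunner reduced to a minimal interface rather than the real type:

```typescript
// Minimal stand-in for the parts of a TypeORM QueryRunner we care about.
interface QueryRunnerLike {
  connect(): Promise<void>;
  startTransaction(): Promise<void>;
  commitTransaction(): Promise<void>;
  rollbackTransaction(): Promise<void>;
  release(): Promise<void>;
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

class ContextManager {
  constructor(private readonly runner: QueryRunnerLike) {}

  // Every query in the request goes through this runner, and therefore
  // through one connection and one transaction.
  get queryRunner(): QueryRunnerLike {
    return this.runner;
  }

  async begin(): Promise<void> {
    await this.runner.connect();
    await this.runner.startTransaction();
  }

  // Called by the global interceptor, never by endpoint code. The
  // connection only returns to the pool here.
  async commit(): Promise<void> {
    await this.runner.commitTransaction();
    await this.runner.release();
  }

  async rollback(): Promise<void> {
    await this.runner.rollbackTransaction();
    await this.runner.release();
  }
}
```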

This object can then create a TypeORM QueryRunner which always uses the same connection until it is released. How, you ask? Well, abstracted away from the developer is a global interceptor (TL;DR it's a middleware) which rolls the transaction back when it's passed an error, and commits it otherwise.
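Setting aside the Nest plumbing (the real thing would implement NestInterceptor), the commit-or-rollback guarantee the interceptor provides can be sketched as a plain function:

```typescript
// The interceptor's core behaviour, reduced to a plain function: run the
// handler inside the request's transaction, commit on success, roll back
// on any error. (A sketch - names and types are illustrative.)
interface TxLike {
  commitTransaction(): Promise<void>;
  rollbackTransaction(): Promise<void>;
  release(): Promise<void>;
}

async function runInTransaction<T>(tx: TxLike, handler: () => Promise<T>): Promise<T> {
  try {
    const result = await handler();
    await tx.commitTransaction();
    return result;
  } catch (err) {
    await tx.rollbackTransaction();
    throw err;
  } finally {
    // Crucially, the connection goes back to the pool either way.
    await tx.release();
  }
}
```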

Refactored to use this approach, our example above could now look like this:

Refactored to use the new approach (shown as an image in the original post)

So, with this ContextManager object, we have access to a transaction which can then be used for all of a request's queries - problem solved, right?

Well, sort of.

You still have to pass this transactional context all the way down to any other services in your app that interact with your database.

And therein lies the danger.

As I mentioned above, the connection used by this transaction is not released back to the pool until the transaction is committed or rolled back. If the same query manager is used for all subsequent queries from your controller method, then that's no problem:

Even if there are no connections left in the pool, the next request will just have to wait for a commit or a rollback, and then it can start to be processed.

Unfortunately, in our case there was one service - called by a very important, heavily used endpoint - where this ContextManager object was only partially used. That is, one piece of code ignored the context object and queried the database through a regular TypeORM Repository, injected into that service using Nest's internal dependency injection.

As a result, an API node would use up all 10 of its available database connections (for 10 different requests) by beginning a transaction for each HTTP request. Each of those requests then tried to take yet another connection for the injected Repository - but no connection could ever be freed, because each transaction was waiting on a Repository query that was itself waiting for a connection.

The consequence: you end up with many, many idle transactions, your API grinds to a halt and you can't serve any more requests. Self-DoS, nice 🤦‍♂️
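To see why this self-destructs, here's a tiny self-contained simulation - not TypeORM's actual pool, just the waiting behaviour. Each "request" starts a transaction that holds one slot, then a stray repository call asks for a second slot before the first is freed:

```typescript
// Minimal connection-pool simulation (illustrative, not a real pool).
class Pool {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(size: number) {
    this.available = size;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No free slot: queue up until someone releases.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else this.available += 1;
  }
}

// A request holds one slot for its transaction, then a stray Repository
// query tries to take a SECOND slot before the first is released.
async function handleRequest(pool: Pool, log: string[]): Promise<void> {
  await pool.acquire();        // BEGIN - the transaction holds this slot
  log.push('tx started');
  await pool.acquire();        // stray Repository query -> waits forever
  log.push('repo query ran');  // never reached once the pool is drained
  pool.release();
  pool.release();
}
```

With a pool of two slots and three concurrent requests, both slots end up held by transactions that are themselves queued for a second slot, so nothing ever finishes.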

COMMIT; to taking some action

With a large API codebase, it can sometimes be difficult to spot this sort of bug in the code itself. You can end up missing some crucial function call somewhere, exhausting all the solutions that seem available to you, chasing your tail, starting again and then still not finding an answer.

It's not all doom and gloom though: there are three things we'd advise from this experience and we'd love to share them with you.

  • First and foremost, if you can afford the engineering effort to implement greater consistency sooner rather than later, you won't regret it. As your client-base grows, problems like this magnify in both likelihood and severity.
  • Secondly, make sure you've got monitoring in place for any transactions that sit idle for longer than a reasonable period. It sounds obvious, but an idle transaction indicates a bottleneck that could easily be avoided.
  • Lastly, if you can't find where you might be running into this issue, make a machine do the hard work for you. We've since changed our integration tests to ensure that these endpoints only use one connection at a time, meaning we can catch this kind of bug in our CI. You could do the same, or if that's not feasible, write some code that monitors your test database and spits out any idle queries afterwards. At least then you get a clue of where your problem might be and don't dizzy yourself like this doggo and me.
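As a starting point for that monitoring, something like the helper below (our name for it, not a standard API) builds a Postgres query listing sessions stuck idle in transaction, via the real pg_stat_activity view. Postgres also ships a built-in backstop: the idle_in_transaction_session_timeout setting kills such sessions after a limit you choose.

```typescript
// Builds a query against Postgres's pg_stat_activity view that lists
// sessions sitting "idle in transaction" for longer than a threshold.
// (Illustrative helper - point your monitoring, or a test teardown, at it.)
export function idleTransactionQuery(thresholdSeconds: number): string {
  return `
    SELECT pid, now() - xact_start AS transaction_age, query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
      AND now() - state_change > interval '${thresholdSeconds} seconds';
  `;
}
```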

Of course, there are other tools at your disposal too. A regular, thorough and realistic load test ought to catch this kind of issue. Failing that, you've got rolling deployments, canary deployments, red/black or blue/green or whatever other colour combination deployments. (rgb(143,6,11)/rgb(55,33,55) never really caught on, did it?)

Now, where was I? Ah yes, chasing my tail.

Want to read more from Leo? Check out his work on Burndown Charts, here.