Of late, we’ve started seeing a raft of errors of type “java.lang.ThreadDeath” on our ColdFusion-based servers. In every case, these errors were generated by the CFSchedule user agent, meaning that scheduled tasks were triggering them. Of more interest, they were, with one exception, thrown by a scheduled CFPOP call that handles email newsletter bounce-backs for our eNewsletter delivery system. The scary part is that, more often than not, our server would start to hang shortly after getting a bunch of these and we’d have to restart the service. Not ideal on servers that require very high uptime on account of some exceedingly busy eCommerce sites.
After casting about fruitlessly on Google & Bing for some answers to this, I buckled down to investigate the errors more closely myself. And I believe I have figured it out:
Each of our websites has a Cron handler: we add any number of scripts we wish to call, the interval between runs, and whether each one is allowed to run concurrently with other tasks or not. The central ColdFusion scheduler service then simply calls this handler for each site at regular intervals (generally, we call the handler about every minute; the tasks themselves range from every 2 minutes to monthly). This handler runs multi-threaded. That is to say, if a task is allowed to run concurrently with other tasks, the handler spawns a thread for it, runs it, then tidies up the threads afterwards.
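To make the handler's behaviour concrete, here is a minimal sketch in Java (the language the error itself comes from). It is not the actual ColdFusion handler: the class, the `Task` type and the `runDueTasks` method are all hypothetical names, and it assumes that tasks not allowed to run concurrently are simply run one at a time after the threaded ones have finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CronHandlerSketch {

    /** Hypothetical task record: a name, the work to run, and the concurrency flag. */
    static final class Task {
        final String name;
        final Runnable work;
        final boolean allowConcurrent;

        Task(String name, Runnable work, boolean allowConcurrent) {
            this.name = name;
            this.work = work;
            this.allowConcurrent = allowConcurrent;
        }
    }

    /** Runs every due task once and returns the task names in completion order. */
    static List<String> runDueTasks(List<Task> dueTasks) {
        List<String> completed = new CopyOnWriteArrayList<>();
        List<Thread> spawned = new ArrayList<>();

        // Tasks flagged as concurrent each get their own thread.
        for (Task task : dueTasks) {
            if (task.allowConcurrent) {
                Thread t = new Thread(() -> {
                    task.work.run();
                    completed.add(task.name);
                }, "cron-" + task.name);
                t.start();
                spawned.add(t);
            }
        }

        // Wait for every spawned thread to finish before moving on.
        for (Thread t : spawned) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + t.getName(), e);
            }
        }

        // Assumption: exclusive tasks run alone, one after another, on the handler's own thread.
        for (Task task : dueTasks) {
            if (!task.allowConcurrent) {
                task.work.run();
                completed.add(task.name);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Task> due = Arrays.asList(
            new Task("bounce-handler", () -> System.out.println("popping bounces"), true),
            new Task("cache-refresh", () -> System.out.println("refreshing cache"), true),
            new Task("monthly-report", () -> System.out.println("building report"), false));
        System.out.println("Completed: " + runDueTasks(due));
    }
}
```

The point to notice is that every concurrent task costs one thread for as long as it runs, so a task that blocks on a slow mail server holds its slot the whole time.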
So here’s where I think this issue arises from: according to the very scant documentation and blog posts I could find, this error type is thrown when a thread is killed off by the server, either after being queued for too long awaiting execution or as the loser in a race condition. We had our server set to allow 10 concurrent threads, which is pretty small but seemed OK. My guess is that the connection to the mail server when popping messages runs threaded itself, so we were spawning a massive number of threads. Further to that, our allowed timeout on mail server connections was longer than our allowed request time or thread-queue time. So threads were constantly timing out, or being killed off because they were queued for too long, which would then spit out that error.
Given that we have oodles of RAM, I’ve both upped the number of concurrent threads we allow and reduced the mail-server connection timeout to be less than both our request timeout and queue timeout. Testing throughout today suggests this has solved it, but I’ll have to watch over the next few days to see whether it is indeed solved.
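To illustrate why the ordering of those timeouts matters, here is a small Java model of the theory. It is only a sketch under loud assumptions: it does not reproduce ColdFusion's internals, `countKilled` and its parameters are hypothetical names, a hung mail connection is simulated with `Thread.sleep`, and the request and queue timeouts are collapsed into a single outer deadline enforced by `ExecutorService.invokeAll`, which cancels whatever has not finished in time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimeoutOrderingSketch {

    /**
     * Submits taskCount simulated mail calls to a pool of poolSize threads and
     * returns how many were cancelled because the outer deadline expired first.
     */
    static int countKilled(int poolSize, int taskCount, long mailTimeoutMs, long requestTimeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                tasks.add(() -> {
                    // Worst case: the mail server never answers, so the call
                    // blocks until its own connection timeout expires.
                    Thread.sleep(mailTimeoutMs);
                    return Boolean.TRUE;
                });
            }

            // invokeAll waits up to the deadline, then cancels every task that
            // has not completed, whether it is running or still queued.
            List<Future<Boolean>> results =
                pool.invokeAll(tasks, requestTimeoutMs, TimeUnit.MILLISECONDS);

            int killed = 0;
            for (Future<Boolean> result : results) {
                if (result.isCancelled()) {
                    killed++;
                }
            }
            return killed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Mail timeout longer than the outer deadline (the old ordering).
        System.out.println("killed, old ordering: " + countKilled(2, 4, 300, 100));
        // Mail timeout well under the outer deadline (the new ordering).
        System.out.println("killed, new ordering: " + countKilled(2, 4, 20, 2000));
    }
}
```

In this model the first call should see its tasks cancelled, queued ones included, while the second should see none, because each simulated mail call gives up on its own well before the outer deadline. Whether ColdFusion behaves the same way internally is exactly the part I am guessing at.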