Issue background
1) Spotfire Statistics Server does not limit the maximum lifetime of a job.
2) The number of engine slots for jobs is limited (usually according to the number of CPU cores).
3) Some user-submitted data functions may enter an endless loop, occupying slots forever.
4) As a result, we will run out of slots if users submit such data functions. This may happen almost instantly if even a single user is persistent enough in retrying a faulty data function, or it may take some time until the cumulative effect causes a total failure. Either way, the Statistics Server becomes unavailable for ALL users until someone intervenes manually.
Additional facts
1) It is easy to kill the server by invoking a one-line data function like the one below several times:
repeat{ print(".") }
2) An endless loop may happen even if the user does nothing really stupid, and may depend on the behavior of third-party libraries. As an example, we had a total Stats server outage because an invoked RSclient was trying to reach an outdated RServe. We cannot predict what code will cause an endless loop, and therefore such situations cannot be avoided, either by admins or by users.
3) Even a server restart does not guarantee service restoration, because a backlog of endless-loop jobs may still be in the queue and will block the Stats servers again once it is picked up after the restart.
4) Typical user behavior when a data function does not work correctly is to resubmit it a couple of times in a row, hoping that it was just a glitch and something will change. That's why running out of slots because of even a single hanging data function is quite probable.
5) Even if not all engine slots are occupied by hung jobs, they still reduce overall cluster capacity: they waste valuable computing resources, decrease the number of useful jobs that can run in parallel, and worsen the user experience by forcing users to wait in the queue.
Conclusion
In its current state, a Spotfire Statistics Server cluster can be brought down completely by literally any user who has access to it, and this actually happens.
Proposed solution
Statistics Server should have an internal watchdog that kills jobs running longer than N seconds, and this timeout should be configurable in spserver.properties.
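To make the request concrete, here is a minimal sketch of the kind of watchdog meant above. It is not how Statistics Server is implemented. It is plain Python in which each "job" is an OS process, and the names Watchdog, submit, sweep and timeout_seconds are all hypothetical. The real timeout would be read from spserver.properties, under a property name this request does not define.

```python
import subprocess
import sys
import time


class Watchdog:
    """Hypothetical watchdog: tracks running jobs and kills overdue ones."""

    def __init__(self, timeout_seconds):
        # Assumption: in the real server this value would come from
        # spserver.properties; the property name is not specified here.
        self.timeout_seconds = timeout_seconds
        self.jobs = {}  # job_id -> (process, start time)

    def submit(self, job_id, argv):
        """Start a job as a child process and remember when it started."""
        process = subprocess.Popen(argv)
        self.jobs[job_id] = (process, time.monotonic())

    def sweep(self):
        """Drop finished jobs, kill overdue ones, return the killed job ids."""
        killed = []
        now = time.monotonic()
        for job_id, (process, started) in list(self.jobs.items()):
            if process.poll() is not None:
                # Job finished on its own; its slot is already free.
                del self.jobs[job_id]
            elif now - started > self.timeout_seconds:
                # Job exceeded the limit: kill it and reap the process.
                process.kill()
                process.wait()
                del self.jobs[job_id]
                killed.append(job_id)
        return killed
```

In this sketch the server would call sweep() periodically, for example once per second. Every job id it returns corresponds to a slot that is free again without manual intervention.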
Result
The service will be able to heal itself, on its own, from failures caused by hanging jobs. If not all slots are occupied yet, the watchdog will free the affected slots after the timeout, preventing a potential outage by stopping hung jobs from piling up in an endless "RUNNING" state.
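The cumulative effect, and how a timeout removes it, can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every job counted is a hung one that never finishes by itself, queueing is ignored, and the function name busy_slots and all numbers are made up for illustration.

```python
def busy_slots(start_times, now, timeout=None):
    """Count slots held at time `now` by jobs that never finish on their own.

    start_times: seconds at which each hung job was picked up by an engine.
    timeout: watchdog limit in seconds, or None for the current behavior.
    """
    busy = 0
    for started in start_times:
        if started > now:
            continue  # job has not been submitted yet
        if timeout is not None and now - started >= timeout:
            continue  # watchdog has already killed this job
        busy += 1
    return busy
```

For example, if a user retries a hanging data function at t = 0, 10, 20 and 30 seconds, then busy_slots([0, 10, 20, 30], now=60) gives 4, so a four-slot server is fully blocked. With an illustrative 25-second limit, busy_slots([0, 10, 20, 30], now=60, timeout=25) gives 0, and all slots are available again.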