Statistics server should have a job life span limit

Issue background

1) Spotfire statistics server does not limit the maximum life span of a job.

2) The number of engine slots for jobs is limited (usually according to the number of CPU cores).

3) Some user-submitted data functions may enter an endless loop, occupying their slots forever.

4) As a result, the server runs out of slots if users submit such data functions. This may happen almost instantly if even a single user persistently retries a faulty data function, or it may take some time until the cumulative effect causes total failure. Either way, the Statistics server becomes unavailable for ALL users until someone intervenes manually.

Additional facts

1) It is easy to kill the server by invoking a one-line data function like the one below several times:

repeat{ print(".") }

2) An endless loop may occur even when the user does nothing really stupid, and may depend on the behavior of third-party libraries. For example, we had a total Stats server outage because an invoked RSclient was trying to reach an outdated RServe. We cannot predict which code will cause an endless loop, so such situations cannot be avoided from either the admin side or the user side.

3) Even a server restart does not guarantee service restoration: a stack of endless-loop jobs may already be waiting in the queue, and once they are picked up after the restart they block the Stats servers again.

4) The typical user reaction when a data function does not work correctly is to resubmit it a couple of times in a row, hoping that it was just a glitch and something will change. That is why running out of slots because of even a single hanging data function is quite probable.

5) Even if not all engine slots are occupied by hung jobs, those jobs still reduce overall cluster capacity: they waste valuable computing resources, decrease the number of useful jobs that can run in parallel, and worsen the user experience by forcing users to wait in the queue.

Conclusion

In its current state, a Spotfire statistics server cluster can be knocked down completely by literally any user who has access to it, and this actually happens.

Proposed solution

Statistics server should have an internal watchdog that kills jobs running longer than N seconds, and this timeout should be configurable in spserver.properties.
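
The watchdog idea can be sketched as follows. This is a minimal illustration in Python, not the Statistics server's actual implementation: the Job class, the reap_expired and run_watchdog functions, and the MAX_JOB_SECONDS constant are all hypothetical names, and the sketch assumes that each job runs as a separate operating-system process that can be killed independently of the others.

```python
import subprocess
import sys
import time

# Hypothetical limit; in the proposal this value would come from spserver.properties.
MAX_JOB_SECONDS = 1.0


class Job:
    """One engine slot running a user-submitted job as a child process."""

    def __init__(self, command):
        self.process = subprocess.Popen(command)
        self.started = time.monotonic()

    def age(self):
        """Seconds elapsed since the job was started."""
        return time.monotonic() - self.started

    def is_running(self):
        """True while the child process has not exited."""
        return self.process.poll() is None


def reap_expired(jobs, max_seconds=MAX_JOB_SECONDS):
    """Kill every job older than max_seconds; return the jobs still running."""
    survivors = []
    for job in jobs:
        if not job.is_running():
            continue  # finished on its own, the slot is already free
        if job.age() > max_seconds:
            job.process.kill()
            job.process.wait()  # reap the child so the slot is really released
            continue
        survivors.append(job)
    return survivors


def run_watchdog(jobs, max_seconds=MAX_JOB_SECONDS, poll_interval=0.1):
    """Poll the job list until every job has finished or has been killed."""
    while jobs:
        jobs = reap_expired(jobs, max_seconds)
        if jobs:
            time.sleep(poll_interval)
```

With this sketch, a job started as Job([sys.executable, "-c", "while True: pass"]) never finishes on its own, yet run_watchdog returns shortly after max_seconds has passed because the watchdog kills it. A real server would run the polling loop in the background for as long as the service is up, instead of returning once the list is empty.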

Result

The service will be able to self-heal on its own from failures caused by hanging jobs. If not all slots are occupied yet, it will free slots after the timeout, preventing potential failure by eliminating the cumulative effect of hung jobs piling up in an endless "RUNNING" state.


