
Jobs that timeout will never be able to run again #2

Open
lwc opened this issue Aug 27, 2014 · 6 comments


@lwc
Member

lwc commented Aug 27, 2014

When a job overruns its TTR, beanstalkd increments the job's timeouts stat and puts it back on the ready queue for another worker to reserve.

In an effort to prevent pathological jobs from dog-piling all available workers, cmdstalk buries any task it reserves whose timeouts stat is greater than 1. This means that once a task has been buried because of a timeout, it will instantly re-bury each time it is kicked: the job becomes un-runnable.
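The re-bury loop described above can be sketched in Go (cmdstalk's language); the `Job` struct and `shouldBury` name are hypothetical illustrations, not cmdstalk's actual types:

```go
package main

import "fmt"

// Job models the per-job beanstalkd stats relevant here; the struct and
// field names are hypothetical, not cmdstalk's actual types.
type Job struct {
	ID       uint64
	Timeouts int // beanstalkd's "timeouts" stat for this job
}

// shouldBury mirrors the behaviour described above: a reserved job whose
// timeouts counter exceeds 1 is buried immediately. Kicking a buried job
// does not reset the counter, so such a job re-buries on every kick.
func shouldBury(j Job) bool {
	return j.Timeouts > 1
}

func main() {
	kicked := Job{ID: 42, Timeouts: 2} // timed out, buried, then kicked
	fmt.Println(shouldBury(kicked))    // still true: the job never runs again
}
```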

Using just the buried, kicked and timeouts counters, there does not appear to be a way to tell that a kicked job was buried because of a timeout, which is what would allow cmdstalk to bury a job only the next time it is reserved after a timeout.

The beanstalkd protocol docs mention a one-second grace period at the end of a reserve; would it be possible to use this grace period to bury a timed-out job in the same run in which the timeout occurred?

@lwc
Member Author

lwc commented Aug 27, 2014

Upon further reading I'm less clear on how DEADLINE_SOON is meant to operate 😕

@lox

lox commented Sep 11, 2014

DEADLINE_SOON is sent to a client sitting in a blocking reserve when there are no other jobs to send it and a job it has reserved is nearing its TTR deadline.
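As a sketch of that protocol behaviour, a worker's reserve loop dispatches on three possible reply lines; the dispatch function and action strings below are hypothetical, and a real client would parse the full `RESERVED <id> <bytes>` reply:

```go
package main

import (
	"fmt"
	"strings"
)

// handleReserve sketches how a worker might react to the reply lines the
// beanstalkd protocol defines for reserve-with-timeout. DEADLINE_SOON is
// the server's hint to deal with the in-flight job (finish, release, or
// bury it) before its TTR expires; the action names here are invented.
func handleReserve(reply string) string {
	switch {
	case strings.HasPrefix(reply, "RESERVED "):
		return "run job"
	case reply == "DEADLINE_SOON":
		return "finish or bury the in-flight job"
	case reply == "TIMED_OUT":
		return "retry reserve"
	default:
		return "protocol error"
	}
}

func main() {
	fmt.Println(handleReserve("RESERVED 42 10"))
	fmt.Println(handleReserve("DEADLINE_SOON"))
}
```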

The issue with racing beanstalkd to bury a task is that you miss out on the timed-out metadata: if you beat the server to it, the job is simply buried.

Perhaps the problem is simply that the job is buried on timeout? What should actually happen to timed-out jobs? If we just kick them, as we do at the moment, then perhaps we should change the behaviour to release with a delay to prevent dog-piling. Perhaps timeouts could result in a more aggressive exponential backoff, or an earlier bury.

Either way, it seems like we haven't got it 100% right. Thoughts @pda @rbone?

@rbone

rbone commented Sep 11, 2014

A longer backoff sounds like a reasonable change for the moment. It is tricky, however, as some tasks may merit more aggressive burying strategies while others may be safe to retry very frequently. A longer backoff makes sense as a default, but it might be nice in the future to make this behaviour configurable, possibly even on a per-tube basis.

@lox

lox commented Sep 11, 2014

Should the backoff be proportional to the TTR?

@rbone

rbone commented Sep 14, 2014

Honestly, I can't make up my mind on what the default behaviour should be, so it probably doesn't matter too much which way you go. A delay proportional to the TTR should be fine. I do think making it configurable per tube will become pretty important, however.

@pda
Contributor

pda commented Sep 15, 2014

I think a simple function of the try count `c` should work fine for now.

PR #4 proposes 3 tries with `c*c * time.Hour`; the delays are 0 (first try), 1 hour, and 4 hours, for a total of 5 hours.
4 tries at `c*c * time.Hour` would also work; that would add an extra retry after an additional 9 hours.
