-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(inputs.nvidia-smi): Add test_on_startup
option
#15916
base: master
Are you sure you want to change the base?
feat(inputs.nvidia-smi): Add test_on_startup
option
#15916
Conversation
5ffab57
to
4f5b6bf
Compare
4f5b6bf
to
1eec743
Compare
BinPath: "/usr/bin/nvidia-smi", | ||
Timeout: config.Duration(5 * time.Second), | ||
TestOnStartup: false, | ||
nvidiaSMIArgs: []string{"-q", "-x"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: the reason I'm breaking nvidiaSMIArgs
out is because I want to modify the arguments used in testing to force a non-zero exit code. The tests themselves don't appear to actually integrate with a real nvidia-smi executable, so the solution I came up with is to just run /bin/bash -c "exit 9"
as an example of the command returning non-zero.
Why not returning a startup error without adding a new option? |
Backwards compatibility reasons, I'm not sure if someone's system might strangely rely on this old behavior. I could envision a situation where nvidia-smi fails when called but if someone did not set What do you think? |
The question is, would an error in executing the command be repairable at runtime? If that's not the case, I'd say go without the option as people still do have a way to fix the situation using another startup-error-behavior. Otherwise, we should add this option... |
I think it depends on the issue at hand. In some cases it could be reparable, like if hot-rebooting a GPU fixes device memory corruption. In other situations like mine, it's not. Personally, I think there is no risk at all to adding this option to at least give people the choice. Not adding the option has some risk, so why not do the zero risk option? |
Summary
There are some cases where the nvidia-smi plugin might be found in PATH and executable, but upon running it might always return a non-zero exit code. For various reasons, in the environment I work in, this might be expected. It's thus disruptive for system logs to be polluted with infinite error messages. It's preferable in this situation to check if nvidia-smi returns a good result on plugin startup, and if not, allow the error to be bubbled up and handled according to
startup_error_behavior
.Checklist
Related issues
resolves #15915