Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.nvidia-smi): Add test_on_startup option #15916

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

LandonTClipp
Copy link

Summary

There are some cases where the nvidia-smi plugin might be found in PATH and executable, but upon running it might always return a non-zero exit code. For various reasons, in the environment I work in, this might be expected. It's thus disruptive for system logs to be polluted with infinite error messages. It's preferable in this situation to check if nvidia-smi returns a good result on plugin startup, and if not, allow the error to be bubbled up and handled according to startup_error_behavior.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #15915

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Sep 19, 2024
@LandonTClipp LandonTClipp force-pushed the LandonTClipp/nvidia_smi_test_on_startup branch from 5ffab57 to 4f5b6bf Compare September 19, 2024 19:57
@LandonTClipp LandonTClipp force-pushed the LandonTClipp/nvidia_smi_test_on_startup branch from 4f5b6bf to 1eec743 Compare September 19, 2024 20:02
BinPath: "/usr/bin/nvidia-smi",
Timeout: config.Duration(5 * time.Second),
TestOnStartup: false,
nvidiaSMIArgs: []string{"-q", "-x"},
Copy link
Author

@LandonTClipp LandonTClipp Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the reason I'm breaking nvidiaSMIArgs out is because I want to modify the arguments used in testing to force a non-zero exit code. The tests themselves don't appear to actually integrate with a real nvidia-smi executable, so the solution I came up with is to just run /bin/bash -c "exit 9" as an example of the command returning non-zero.

@srebhan
Copy link
Member

srebhan commented Sep 20, 2024

Why not returning a startup error without adding a new option?

@LandonTClipp
Copy link
Author

LandonTClipp commented Sep 20, 2024

Why not returning a startup error without adding a new option?

Backwards compatibility reasons, I'm not sure if someone's system might strangely rely on this old behavior. I could envision a situation where nvidia-smi fails when called but if someone did not set startup_error_behavior = "ignore" then this would suddenly make their whole service shutdown (I think). There is also a potential issue with the plugin startup getting hung if nvidia-smi itself hangs. I'm just being conservative by retaining the old behavior.

What do you think?

@srebhan
Copy link
Member

srebhan commented Sep 20, 2024

The question is, would an error in executing the command be repairable at runtime? If that's not the case, I'd say go without the option as people still do have a way to fix the situation using another startup-error-behavior. Otherwise, we should add this option...

@srebhan srebhan self-assigned this Sep 20, 2024
@LandonTClipp
Copy link
Author

The question is, would an error in executing the command be repairable at runtime?

I think it depends on the issue at hand. In some cases it could be reparable, like if hot-rebooting a GPU fixes device memory corruption. In other situations like mine, it's not.

Personally, I think there is no risk at all to adding this option to at least give people the choice. Not adding the option has some risk, so why not do the zero risk option?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inputs.nvidia-smi: Add config option to test a single run of nvidia-smi on plugin startup
2 participants