-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Handle async delete in stateless cni #2967
base: master
Are you sure you want to change the base?
Conversation
c75da1b
to
ccaa383
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add any UTs for this?
cni/network/network.go
Outdated
// if stateless CNI fail to get the endpoint from CNS for any reason other than Endpoint Not found | ||
if err != nil && !strings.Contains(err.Error(), "endpoint state could not be found in the statefile") { | ||
if strings.Contains(err.Error(), "No connection could be made") { | ||
addErr := fsnotify.AddFile("stateless_"+args.ContainerID, args.ContainerID, watcherPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding stateless prefix? why can't we follow same pattern for release ip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no way for me to specify that this is a stateless case other than writing to the file. CNS only delete HNS in statelss case and there is no way for CNS to determine if the CNI is stateless or statefull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not always try to delete the HNS state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because for stateful CNI we don't have the HNS ID in the state so we have to find the HNS via IP address first and then issue a delete. That's what we decided to do. We have to also make sure that the manage Endpoint state is indeed set to true. If in the future CIlium for Windows introduced then we need to revisit this again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THe check has moved to a function named isStalessCNIWindows()
c456ffa
to
5aed0db
Compare
36ac33f
to
2d55822
Compare
14dc742
to
e2dfe7d
Compare
Co-authored-by: tamilmani1989 <[email protected]> Signed-off-by: Behzad Mirkhanzadeh <[email protected]>
Co-authored-by: tamilmani1989 <[email protected]> Signed-off-by: Behzad Mirkhanzadeh <[email protected]>
Co-authored-by: tamilmani1989 <[email protected]> Signed-off-by: Behzad Mirkhanzadeh <[email protected]>
2df2c99
to
11ea4e1
Compare
11ea4e1
to
0ab2e16
Compare
network/errors.go
Outdated
errSubnetV6NotFound = errors.New("Couldn't find ipv6 subnet in network info") // nolint | ||
errV6SnatRuleNotSet = errors.New("ipv6 snat rule not set. Might be VM ipv6 address missing") // nolint | ||
ErrEndpointStateNotFound = errors.New("endpoint state could not be found in the statefile") | ||
ErrConnectionFailure = errors.New("couldn't connect to CNS daemonset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not daemonset?
ErrConnectionFailure = errors.New("couldn't connect to CNS daemonset") | |
ErrConnectionFailure = errors.New("couldn't connect to CNS") |
cnsclient, err := cnsclient.New("", cnsReqTimeout) //nolint | ||
if err != nil { | ||
z.Error("failed to create cnsclient", zap.Error(err)) | ||
} | ||
go func() { | ||
_ = retry.Do(func() error { | ||
z.Info("starting fsnotify watcher to process missed Pod deletes") | ||
w, err := fsnotify.New(cnsclient, cnsconfig.AsyncPodDeletePath, z) | ||
logger.Printf("starting fsnotify watcher to process missed Pod deletes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we need to dupe logs like this
} | ||
|
||
type watcher struct { | ||
cli releaseIPsClient | ||
path string | ||
log *zap.Logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this from injecting a zap logger to the old bad global logger package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zap logger was not logging anything at all in CNS log file. I don't know if it works for Linuxor not but in case of Windows I was only able to see the logs in CNS log file after I used the old logger. Seems like we cannot use the two logger to write on the same file.
cns/fsnotify/fsnotify.go
Outdated
pendingDelete map[string]struct{} | ||
lock sync.Mutex | ||
cnsconfig *configuration.CNSConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't pass the cns config in here, this package does not need to know all of that 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I'll only pass the IntiFromCNI and ManagedEndpointState.
cns/fsnotify/fsnotify.go
Outdated
|
||
w.log.Info("releasing IP for missed delete", zap.String("podInterfaceID", podInterfaceID), zap.String("containerID", containerID)) | ||
// in case of stateless CNI for Windows, CNS needs to remove HNS endpoitns first | ||
if isStalessCNIWindows(w.cnsconfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The asyncdelete watcher should not need to know about the CNI mode. You need to compose it in such a way that it does the right thing with no additional knowledge instead of teaching it about all this extra CNS config and adding branches and complication.
One way you might accomplish this is by decorating the ReleaseIPs
implementation that gets passed in during initialization. In main
when we bulid this object, you already have all the context you need about if this is stateless CNI. There, wrap ReleaseIPs
with this extra functionality and the call the underlying impl after you deleteEndpoint, like middleware.
I think you could implement that in a way that there are actually no changes to the fsnotify package at all, and only a small anonymous wrapper func added to this initialization in main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my first thought implementing this then I got some issues with the pattern of going through CNS client and Rest APIs internally from CNS. The async delete is not a direct request and I don't think we should go through Rest APIs for its related operations. For the ReleaseIP and GetEndpointState, I think we don't have any other choice other than CNS client and APIs since they are written that way. However, for HNS we still can avoid this.
I had some concerns about having let's say 250 async delete ops and as a result it will create 250*3 rest APIs for (releaseIP + GetEndpointState + DeleteHNSEndpoint).
I'm more inclined to a solution to avoid going through cnsClient for this. Not sure if we can achieve it without tying up asynchdelte watched to extra packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamilmani1989 Let me know what you think.
cns/client/client.go
Outdated
@@ -1023,35 +1024,63 @@ func (c *Client) GetHomeAz(ctx context.Context) (*cns.GetHomeAzResponse, error) | |||
return &getHomeAzResponse, nil | |||
} | |||
|
|||
// GetEndpoint calls the EndpointHandlerAPI in CNS to retrieve the state of a given EndpointID | |||
func (c *Client) GetHNSEndpointID(ctx context.Context, endpointID string) ([]string, [][]net.IPNet, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a brief comment on what this function returns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
funciton name is GetHNSEndpointID and it returns IPs too. lets consider renaming it and should we return as [] struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or a map[string][]net.IPNet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up removing the function since we can cover the previous version (with GetEndpoint+deleteEndpoint) in EndpointManger package without import cycle
36e5db7
to
03434f9
Compare
03434f9
to
d66252e
Compare
Issue Fixed:
This is an attempt to fix the HNS endpoint leak by stateless CNI when the aynch delete is enabled.
CNS needs to delete the associate HNS endpoint ids upon releasing IPs in asynch delete mode.
Notes: