Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Npm Lite Feature #3005

Open
wants to merge 25 commits into
base: master
Choose a base branch
from
Open

Npm Lite Feature #3005

wants to merge 25 commits into from

Conversation

rejain456
Copy link

@rejain456 rejain456 commented Sep 10, 2024

Reason for Change:

This pr adds code changes for a new feature called NPM-Lite which is a subset of NPM.
NPM Lite provides Singularity team with a “light” NPM which addresses:

  • Heavy load on API Server: NPM Lite must support ~4K Nodes (NPM currently supports ~250 Nodes).
  • Windows NPM performance issues caused by HNS SetPolicies.
  • Issue Fixed:

npm lite consists of 2 main limitations:

  • Only accepts CIDR-based network policies
  • Enables the daemon set to watch for pods only under its node

This pr adds code to cover both the points above

Testing
image

Issue Fixed:

Requirements:

Notes:

@rejain456 rejain456 requested a review from a team as a code owner September 10, 2024 22:45
@rejain456
Copy link
Author

/azp run NPM Conformance Tests

@rejain456
Copy link
Author

/azp run NPM Scale Test

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rejain456
Copy link
Author

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@huntergregory huntergregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing an edge case, and we'll have to be careful with reading in the config.

Also for testing, could you add to the PR description steps or screenshots for how you verified that a non-CIDR policy throws an error, and how you verified that applying a CIDR policy correctly blocks / allows traffic?

@@ -387,6 +390,8 @@ func translateRule(npmNetPol *policies.NPMNetworkPolicy, netPolName string, dire
}
}

cfg := npmconfig.DefaultConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to use the config read in at runtime (read in from the k8s ConfigMap). That config is read in here:

func newStartNPMCmd() *cobra.Command {
// getTuplesCmd represents the getTuples command
startNPMCmd := &cobra.Command{
Use: "start",
Short: "Starts the Azure NPM process",
RunE: func(cmd *cobra.Command, args []string) error {
config := &npmconfig.Config{}
err := viper.Unmarshal(config)
if err != nil {
return fmt.Errorf("failed to load config with error: %w", err)
}
flags := npmconfig.Flags{
KubeConfigPath: viper.GetString(flagKubeConfigPath),
}
return start(*config, flags)
},
}
startNPMCmd.Flags().String(flagKubeConfigPath, flagDefaults[flagKubeConfigPath], "path to kubeconfig")
return startNPMCmd
}
func start(config npmconfig.Config, flags npmconfig.Flags) error {

It looks like we'll have to pass the NPMLite toggle down from start() -> NetPol controller -> translation

Copy link
Author

@rejain456 rejain456 Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to reflect this, thank you!

QQ: @huntergregory , What is the benefit of reading the config in at runtime only?

@@ -403,11 +408,24 @@ func translateRule(npmNetPol *policies.NPMNetworkPolicy, netPolName string, dire
return err
}
}

// if npm lite is configured, check network policy only consists of CIDR blocks
err := NpmLiteValidPolicy(peer, NpmLiteEnabled)
Copy link
Contributor

@huntergregory huntergregory Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we're letting NPM Lite create "allow-all-internal" policies (no peers). That uses the all-namespaces list ipset, which will be inaccurate for NPM Lite.

Realizing we should validate the following for NPM Lite: policy must have 1+ peers and each rule must have an IpBlock but no selectors.

Could we simplify that check to one location in this function instead of two?

Copy link
Contributor

@huntergregory huntergregory Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we're letting NPM Lite create "allow-all-internal" policies (no peers)

Correction: this code has the "allow-all-internal" part:

// #0. TODO(jungukcho): cannot come up when this condition is met.
// The code inside if condition is to handle allowing all internal traffic, but the case is handled in #2.4.
// So, this code may not execute. After confirming this, need to delete it.
if !portRuleExists && !peerRuleExists && !allowExternal {
acl := policies.NewACLPolicy(policies.Allowed, direction)
ruleIPSets, allowAllInternalSetInfo := allowAllInternal(matchType)
npmNetPol.RuleIPSets = append(npmNetPol.RuleIPSets, ruleIPSets)
acl.AddSetInfo([]policies.SetInfo{allowAllInternalSetInfo})
npmNetPol.ACLs = append(npmNetPol.ACLs, acl)
return nil
}

It seems like the if statement could be taken if it's possible to have no port rules, 1+ peers, and peers are empty (ipblock + selectors are nil)

func ruleExists(ports []networkingv1.NetworkPolicyPort, peer []networkingv1.NetworkPolicyPeer) (allowExternal, portRuleExists, peerRuleExists bool) {
// TODO(jungukcho): need to clarify and summarize below flags + more comments
portRuleExists = len(ports) > 0
if peer != nil {
if len(peer) == 0 {
peerRuleExists = true
allowExternal = true
}
for _, peerRule := range peer {
if peerRule.PodSelector != nil ||
peerRule.NamespaceSelector != nil ||
peerRule.IPBlock != nil {
peerRuleExists = true
break
}
}
} else if !portRuleExists {
allowExternal = true
}
return allowExternal, portRuleExists, peerRuleExists
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"No peers" actually corresponds to "default deny all" or "default allow all" I believe?
#3005 (comment)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-factored so we call the validate function only once. @huntergregory , plase let me know if that looks better

npm/pkg/controlplane/translation/translatePolicy.go Outdated Show resolved Hide resolved
wantErr bool
}{
{
name: "peer nameSpaceSelector and ipblock in ingress rules",
Copy link
Contributor

@huntergregory huntergregory Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing tests. Could we add these cases as well?

Copy link
Contributor

@huntergregory huntergregory Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 3 peers:

[]networkingv1.NetworkPolicyPeer{
				{
					NamespaceSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"peer-nsselector-kay": "peer-nsselector-value",
						},
					},
				},
				{
					IPBlock: &networkingv1.IPBlock{
						CIDR:   "172.17.0.0/16",
						Except: []string{"172.17.1.0/24", "172.17.2.0/24"},
					},
				},
				{
					IPBlock: &networkingv1.IPBlock{
						CIDR: "172.17.0.0/16",
					},
				},
			},

but you can have a pod selector and a namespace selector in the same Peer.

Each peer corresponds to an element of the YAML's list (more intuitively, a "section" denoted by a -) in a NetworkPolicy spec, where each peer/element is OR'd together. Within a peer, the components are AND'd. For more details, see "namespaceSelector and podSelector" under https://kubernetes.io/docs/concepts/services-networking/network-policies/#behavior-of-to-and-from-selectors

single peer with both ipblock and selector

It sounds like this is actually impossible from the docs though. It is not one of the 4 kinds of to/from selectors (peers). Do you mind verifying that it's impossible to create such a NetPol peer with ipblock and pod selector?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test cases for:

  • CIDR + label selector combo in same peer
  • no peers

for allow-all-internal case, added an exception check within the code for it

@@ -34,6 +35,8 @@ var (
)
// ErrUnsupportedIPAddress is returned when an unsupported IP address, such as IPV6, is used
ErrUnsupportedIPAddress = errors.New("unsupported IP address")
// ErrUnsupportedNonCIDR is returned when non-CIDR blocks (i.e pod selectors or namespace selectors) are passed with a NPM Lite configuration
ErrUnsupportedNonCIDR = errors.New("unsupported Non-CIDR block for NPM Lite")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're actually allowing default allow and default deny policies as well (in TranslatePolicy()). Could we modify this error description to express that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description thanks

Copy link
Contributor

@huntergregory huntergregory Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we modify the error description as well to help a user in case they try debugging this error in logs?

I believe this Lite mode will actually work for all Policies except those with an ingress/egress pod selector or namespace selector

EDIT: or named ports

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the description

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I mean inside the code 😅

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha got it! I updated it, @huntergregory let me know how that sounds

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also great point out for named ports, added logic to output an error in case of named ports as well

Copy link
Contributor

@huntergregory huntergregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor suggestions

@@ -2925,6 +2925,8 @@ func TestNpmLiteCidrPolicy(t *testing.T) {
// Test 1) Npm lite enabled, CIDR + Namespace label Peers, returns error
// Test 2) NPM lite disabled, CIDR + Namespace label Peers, returns no error
// Test 3) Npm Lite enabled, CIDR Peers , returns no error
// Test 4) NPM Lite enabled, Combination of CIDR + Label in same peer, returns an error
// test 5) NPM Lite enabled, no peer, returns no error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to miss this earlier. Could we also test that NPM lite supports a policy without CIDRs or selectors, with a port + protocol specified

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup added a test case with just port + protocol, it does not return an error @huntergregory

@@ -34,6 +35,8 @@ var (
)
// ErrUnsupportedIPAddress is returned when an unsupported IP address, such as IPV6, is used
ErrUnsupportedIPAddress = errors.New("unsupported IP address")
// ErrUnsupportedNonCIDR is returned when non-CIDR blocks (i.e pod selectors or namespace selectors) are passed with a NPM Lite configuration
ErrUnsupportedNonCIDR = errors.New("unsupported Non-CIDR block for NPM Lite")
Copy link
Contributor

@huntergregory huntergregory Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we modify the error description as well to help a user in case they try debugging this error in logs?

I believe this Lite mode will actually work for all Policies except those with an ingress/egress pod selector or namespace selector

EDIT: or named ports

Copy link
Contributor

@huntergregory huntergregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing that NPM Lite cannot support "named port" fields... We will need to check for that too

@@ -34,6 +35,8 @@ var (
)
// ErrUnsupportedIPAddress is returned when an unsupported IP address, such as IPV6, is used
ErrUnsupportedIPAddress = errors.New("unsupported IP address")
// ErrUnsupportedNonCIDR is returned when non-CIDR blocks (i.e pod selectors or namespace selectors) are passed with a NPM Lite configuration
ErrUnsupportedNonCIDR = errors.New("unsupported Non-CIDR block for NPM Lite")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I mean inside the code 😅

huntergregory
huntergregory previously approved these changes Sep 16, 2024
npm/azure-npm.yaml Outdated Show resolved Hide resolved
npm/azure-npm.yaml Outdated Show resolved Hide resolved
npm/cacheencoder.go Outdated Show resolved Hide resolved
npm/npm.go Outdated Show resolved Hide resolved
@@ -51,6 +51,7 @@ var DefaultConfig = Config{
ApplyInBackground: true,
// NetPolInBackground is currently used in Linux to apply NetPol controller Add events in the background
NetPolInBackground: true,
EnableNPMLite: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default config should always be false, or else we will break existing deployments where the configmap is not explicitly set to false.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup I had just enabled it for testing purposes with the intent of disabling it before merging the pr in. I can disable it in the pr and enable locally for testing purposes if that is the better way @vakalapa

@@ -425,7 +425,7 @@ func UpdatePolicy(policy *networkingv1.NetworkPolicy) *Action {

// Do models policy updates in the NetworkPolicyController
func (p *PolicyUpdateAction) Do(dp *DataPlane) error {
npmNetPol, err := translation.TranslatePolicy(p.Policy)
npmNetPol, err := translation.TranslatePolicy(p.Policy, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need any test cases for this set to true ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants