Merge pull request #2752 from GoogleCloudPlatform/release-candidate
Release v1.36.0
ankitkinra committed Jul 19, 2024
2 parents dbe05ee + 30bebab commit da56862
Showing 139 changed files with 2,435 additions and 348 deletions.
83 changes: 45 additions & 38 deletions cmd/README.md
@@ -7,30 +7,38 @@ clusters, also referred to as the gHPC Engine.

### Usage - ghpc

`ghpc [FLAGS]`

`ghpc [SUBCOMMAND]`
```bash
ghpc [FLAGS]
ghpc [SUBCOMMAND]
```

### Subcommands - ghpc

[create](#ghpc-create): Create a new deployment
* [`deploy`](#ghpc-deploy): Deploy an HPC cluster on Google Cloud
* [`create`](#ghpc-create): Create a new deployment
* [`expand`](#ghpc-expand): Expand the blueprint without creating a new deployment
* [`completion`](#ghpc-completion): Generate completion script
* [`help`](#ghpc-help): Display help information for any command

[expand](#ghpc-expand): Expand the blueprint without creating a new deployment
### Flags - ghpc

[completion](#ghpc-completion): Generate completion script
* `-h, --help`: displays detailed help for the ghpc command.
* `-v, --version`: displays the version of ghpc being used.

[help](#ghpc-help): Display help information for any command
### Example - ghpc

### Flags - ghpc
```bash
ghpc --version
```

+ -h, --help: displays detailed help for the ghpc command.
## ghpc deploy

+ -v, --version: displays the version of ghpc being used.
`ghpc deploy` deploys an HPC cluster on Google Cloud using the deployment directory created by `ghpc create` or creates one from supplied blueprint file.

### Example - ghpc
### Usage - deploy

```bash
ghpc --version
ghpc deploy (<DEPLOYMENT_DIRECTORY> | <BLUEPRINT_FILE>) [flags]
```

## ghpc create
@@ -39,38 +47,37 @@ ghpc --version

### Usage - create

`ghpc create BLUEPRINT_NAME [FLAGS]`
```sh
ghpc create BLUEPRINT_FILE [FLAGS]
```

### Positional arguments - create

`BLUEPRINT_NAME`: the name of the blueprint file that is used for the deployment.
`BLUEPRINT_FILE`: the name of the blueprint file that is used for the deployment.

### Flags - create

+ `--backend-config strings`: Comma-separated list of name=value variables to set Terraform backend configuration. Can be used multiple times.

+ `-h, --help`: display detailed help for the create command.

+ `-o, --out string`: sets the output directory where the HPC deployment directory will be created.

+ `-w, --overwrite-deployment`: If specified, an existing deployment directory is overwritten by the new deployment.

+ Terraform state IS preserved.
+ Terraform workspaces are NOT supported (behavior undefined).
+ Packer is NOT supported.

+ `-l, --validation-level string`: sets validation level to one of ("ERROR", "WARNING", "IGNORE") (default "WARNING").

+ `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times. Arrays or maps containing comma-separated values must be enclosed in double quotes. The double quotes may require escaping depending on the shell used. Examples below have been tested using a `bash` shell:
+ `--vars foo=bar,baz=2`
+ `--vars bar=2 --vars baz=3.14`
+ `--vars foo=true`
+ `--vars "foo={bar: baz}"`
+ `--vars "\"foo={bar: baz, qux: quux}\""`
+ `--vars "\"foo={bar: baz}\"",\"b=[foo,3,3.14]\"`
+ `--vars "\"a={foo: [bar, baz]}\"",\"b=[foo,3,3.14]\"`
+ `--vars \"b=[foo,3,3.14]\"`
+ `--vars \"b=[[foo,bar],3,3.14]\"`
* `--backend-config strings`: Comma-separated list of name=value variables to set Terraform backend configuration. Can be used multiple times.
* `-h, --help`: display detailed help for the create command.
* `-o, --out string`: sets the output directory where the HPC deployment directory will be created.
* `-w, --overwrite-deployment`: If specified, an existing deployment directory is overwritten by the new deployment.

* Terraform state IS preserved.
* Terraform workspaces are NOT supported (behavior undefined).
* Packer is NOT supported.

* `-l, --validation-level string`: sets validation level to one of ("ERROR", "WARNING", "IGNORE") (default "WARNING").
* `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times. Arrays or maps containing comma-separated values must be enclosed in double quotes. The double quotes may require escaping depending on the shell used. Examples below have been tested using a `bash` shell:

* `--vars foo=bar,baz=2`
* `--vars bar=2 --vars baz=3.14`
* `--vars foo=true`
* `--vars "foo={bar: baz}"`
* `--vars "\"foo={bar: baz, qux: quux}\""`
* `--vars "\"foo={bar: baz}\"",\"b=[foo,3,3.14]\"`
* `--vars "\"a={foo: [bar, baz]}\"",\"b=[foo,3,3.14]\"`
* `--vars \"b=[foo,3,3.14]\"`
* `--vars \"b=[[foo,bar],3,3.14]\"`

### Example - create

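The `--vars` quoting rules documented above (commas split items, double quotes protect embedded commas) match how a comma-separated string-slice flag is read as a single CSV record. The Go sketch below assumes `--vars` behaves like a cobra/pflag `StringSlice` flag (the `strings` type in the help text suggests this, but it is an assumption, not something this diff shows). `splitVarsFlag` and `parseVars` are hypothetical helpers, not ghpc functions, and values are kept as raw strings instead of being decoded as YAML.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// splitVarsFlag splits one --vars occurrence as a single CSV record:
// commas separate items unless they sit inside double quotes.
func splitVarsFlag(val string) ([]string, error) {
	if val == "" {
		return nil, nil
	}
	return csv.NewReader(strings.NewReader(val)).Read()
}

// parseVars collects name=value pairs from every --vars occurrence
// (the flag "can be used multiple times"). Later occurrences win.
// Values stay raw strings; YAML decoding is intentionally left out.
func parseVars(occurrences []string) (map[string]string, error) {
	vars := map[string]string{}
	for _, occ := range occurrences {
		items, err := splitVarsFlag(occ)
		if err != nil {
			return nil, err
		}
		for _, item := range items {
			name, value, ok := strings.Cut(item, "=")
			if !ok {
				return nil, fmt.Errorf("invalid --vars item %q: expected name=value", item)
			}
			vars[name] = value
		}
	}
	return vars, nil
}

func main() {
	// The argument bash passes for: --vars "\"foo={bar: baz}\"",\"b=[foo,3,3.14]\"
	arg := `"foo={bar: baz}","b=[foo,3,3.14]"`
	vars, err := parseVars([]string{arg, "bar=2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(vars["foo"], vars["b"], vars["bar"])
}
```

The string in `main` is what `bash` hands over after stripping its own layer of quoting; the CSV reader then removes the remaining double quotes, so the commas inside `[foo,3,3.14]` stay part of the value instead of starting a new item.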
2 changes: 1 addition & 1 deletion cmd/root.go
@@ -52,7 +52,7 @@ HPC deployments on the Google Cloud Platform.`,
logging.Fatal("cmd.Help function failed: %s", err)
}
},
Version: "v1.35.1",
Version: "v1.36.0",
Annotations: annotation,
}
)
4 changes: 2 additions & 2 deletions community/examples/hpc-build-slurm-image.yaml
@@ -70,6 +70,8 @@ deployment_groups:
}
- type: shell
destination: install_slurm.sh
# Note: changes to slurm-gcp `/scripts` folder in the built image will not reflect in the deployed cluster.
# Instead the scripts referenced in `schedmd-slurm-gcp-v6-controller/slurm_files` will be used.
content: |
#!/bin/bash
set -e -o pipefail
@@ -117,5 +119,3 @@ deployment_groups:
settings:
machine_type: n2d-standard-4
instance_image: $(vars.built_instance_image)
# Will cause Slurm auto-scaling scripts to be sourced from built image
enable_devel: false
13 changes: 3 additions & 10 deletions community/examples/ml-gke.yaml
@@ -50,23 +50,16 @@ deployment_groups:
cidr_block: $(vars.authorized_cidr)
outputs: [instructions]

- id: g2-pool
- id: g2_pool
source: community/modules/compute/gke-node-pool
use: [gke_cluster]
settings:
disk_type: pd-balanced
machine_type: g2-standard-4
guest_accelerator:
- type: nvidia-l4
count: 1
gpu_partition_size: null
gpu_sharing_config: null
gpu_driver_installation_config:
- gpu_driver_version: "DEFAULT"

- id: job-template
- id: job_template
source: community/modules/compute/gke-job-template
use: [g2-pool]
use: [g2_pool]
settings:
image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
command:
4 changes: 2 additions & 2 deletions community/front-end/ofe/requirements.txt
@@ -22,7 +22,7 @@ git+https://github.com/jazzband/django-revproxy.git@d2234005135dc0771b7c4e0bb046
Django==4.2.11
django-allauth==0.54.0
django-extensions==3.2.3
djangorestframework==3.14.0
djangorestframework==3.15.2
filelock==3.12.2
google-api-core==2.11.1
google-api-python-client==2.90.0
@@ -92,7 +92,7 @@ tomlkit==0.11.8
typing-inspect==0.9.0
typing_extensions==4.6.3
uritemplate==4.1.1
urllib3==1.26.18
urllib3==1.26.19
uvicorn==0.22.0
virtualenv==20.23.1
wrapt==1.15.0
56 changes: 56 additions & 0 deletions community/modules/compute/gke-node-pool/gpu_definition.tf
@@ -0,0 +1,56 @@
/**
* Copyright 2023 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

## Required variables:
# guest_accelerator
# machine_type

locals {
# example state; terraform will ignore diffs if last element of URL matches
# guest_accelerator = [
# {
# count = 1
# type = "https://www.googleapis.com/compute/beta/projects/PROJECT/zones/ZONE/acceleratorTypes/nvidia-tesla-a100"
# },
# ]
accelerator_machines = {
"a2-highgpu-1g" = { type = "nvidia-tesla-a100", count = 1 },
"a2-highgpu-2g" = { type = "nvidia-tesla-a100", count = 2 },
"a2-highgpu-4g" = { type = "nvidia-tesla-a100", count = 4 },
"a2-highgpu-8g" = { type = "nvidia-tesla-a100", count = 8 },
"a2-megagpu-16g" = { type = "nvidia-tesla-a100", count = 16 },
"a2-ultragpu-1g" = { type = "nvidia-a100-80gb", count = 1 },
"a2-ultragpu-2g" = { type = "nvidia-a100-80gb", count = 2 },
"a2-ultragpu-4g" = { type = "nvidia-a100-80gb", count = 4 },
"a2-ultragpu-8g" = { type = "nvidia-a100-80gb", count = 8 },
"a3-highgpu-8g" = { type = "nvidia-h100-80gb", count = 8 },
"g2-standard-4" = { type = "nvidia-l4", count = 1 },
"g2-standard-8" = { type = "nvidia-l4", count = 1 },
"g2-standard-12" = { type = "nvidia-l4", count = 1 },
"g2-standard-16" = { type = "nvidia-l4", count = 1 },
"g2-standard-24" = { type = "nvidia-l4", count = 2 },
"g2-standard-32" = { type = "nvidia-l4", count = 1 },
"g2-standard-48" = { type = "nvidia-l4", count = 4 },
"g2-standard-96" = { type = "nvidia-l4", count = 8 },
}
generated_guest_accelerator = try([local.accelerator_machines[var.machine_type]], [])

# Select in priority order:
# (1) var.guest_accelerator if not empty
# (2) local.generated_guest_accelerator if not empty
# (3) default to empty list if both are empty
guest_accelerator = try(coalescelist(var.guest_accelerator, local.generated_guest_accelerator), [])
}
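`gpu_definition.tf` resolves accelerators with a table lookup keyed on `machine_type` followed by a priority pick: an explicit `var.guest_accelerator` wins, otherwise the generated entry is used, otherwise the list is empty. A minimal Go sketch of the same order follows; only a few rows of `accelerator_machines` are copied, and the `accelerator` type and `resolveGuestAccelerator` function are illustrative names, not part of the module.

```go
package main

import "fmt"

// accelerator mirrors one guest_accelerator entry (type + count).
type accelerator struct {
	Type  string
	Count int
}

// acceleratorMachines is a subset of local.accelerator_machines.
var acceleratorMachines = map[string]accelerator{
	"a2-highgpu-1g":  {"nvidia-tesla-a100", 1},
	"a2-ultragpu-8g": {"nvidia-a100-80gb", 8},
	"a3-highgpu-8g":  {"nvidia-h100-80gb", 8},
	"g2-standard-4":  {"nvidia-l4", 1},
	"g2-standard-24": {"nvidia-l4", 2},
}

// resolveGuestAccelerator follows the priority order of local.guest_accelerator:
// (1) explicit list if non-empty, (2) entry generated from machineType,
// (3) empty list when neither applies (the try(..., []) fallback).
func resolveGuestAccelerator(explicit []accelerator, machineType string) []accelerator {
	if len(explicit) > 0 {
		return explicit
	}
	if acc, ok := acceleratorMachines[machineType]; ok {
		return []accelerator{acc}
	}
	return []accelerator{}
}

func main() {
	fmt.Println(resolveGuestAccelerator(nil, "g2-standard-24"))
	fmt.Println(resolveGuestAccelerator([]accelerator{{"nvidia-l4", 1}}, "g2-standard-24"))
	fmt.Println(resolveGuestAccelerator(nil, "n2-standard-8"))
}
```

With this fallback in place, the `ml-gke.yaml` example can drop its explicit `guest_accelerator` block for `g2-standard-4`, since the lookup resolves that machine type to one `nvidia-l4`.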
33 changes: 22 additions & 11 deletions community/modules/compute/gke-node-pool/main.tf
@@ -22,7 +22,8 @@ locals {
locals {
sa_email = var.service_account_email != null ? var.service_account_email : data.google_compute_default_service_account.default_sa.email

has_gpu = var.guest_accelerator != null || contains(["a2", "g2"], local.machine_family)
preattached_gpu_machine_family = contains(["a2", "a3", "g2"], local.machine_family)
has_gpu = (local.guest_accelerator != null && length(local.guest_accelerator) > 0) || local.preattached_gpu_machine_family
gpu_taint = local.has_gpu ? [{
key = "nvidia.com/gpu"
value = "present"
@@ -73,16 +74,26 @@ resource "google_container_node_pool" "node_pool" {
}

node_config {
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
resource_labels = local.labels
labels = var.kubernetes_labels
service_account = var.service_account_email
oauth_scopes = var.service_account_scopes
machine_type = var.machine_type
spot = var.spot
image_type = var.image_type
guest_accelerator = var.guest_accelerator
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
resource_labels = local.labels
labels = var.kubernetes_labels
service_account = var.service_account_email
oauth_scopes = var.service_account_scopes
machine_type = var.machine_type
spot = var.spot
image_type = var.image_type

dynamic "guest_accelerator" {
for_each = local.guest_accelerator
content {
type = guest_accelerator.value.type
count = guest_accelerator.value.count
gpu_driver_installation_config = try(guest_accelerator.value.gpu_driver_installation_config, [{ gpu_driver_version = "DEFAULT" }])
gpu_partition_size = try(guest_accelerator.value.gpu_partition_size, null)
gpu_sharing_config = try(guest_accelerator.value.gpu_sharing_config, null)
}
}

dynamic "taint" {
for_each = concat(var.taints, local.gpu_taint)
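In the updated `main.tf`, `has_gpu` (which decides whether `gpu_taint` is added) is true when the resolved `local.guest_accelerator` list is non-empty or the machine family ships with GPUs pre-attached (`a2`, `a3`, `g2`). Each accelerator entry also falls back to `gpu_driver_version = "DEFAULT"` when no driver config is given. The Go sketch below mirrors that logic under two assumptions: `local.machine_family`, defined outside the visible hunk, is taken to be the part of `machine_type` before the first `-`, and an empty `DriverVersion` stands in for an absent attribute. `gpuAccelerator`, `machineFamily`, `hasGPU`, and `withDriverDefault` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuAccelerator is a simplified guest_accelerator entry.
// An empty DriverVersion models an unset gpu_driver_installation_config.
type gpuAccelerator struct {
	Type          string
	Count         int
	DriverVersion string
}

// machineFamily assumes the family is the prefix before the first "-",
// e.g. "a3-highgpu-8g" -> "a3".
func machineFamily(machineType string) string {
	family, _, _ := strings.Cut(machineType, "-")
	return family
}

// hasGPU mirrors local.has_gpu: a non-empty accelerator list, or a
// machine family that comes with GPUs pre-attached.
func hasGPU(accels []gpuAccelerator, machineType string) bool {
	switch machineFamily(machineType) {
	case "a2", "a3", "g2":
		return true
	}
	return len(accels) > 0
}

// withDriverDefault mirrors the try(..., [{ gpu_driver_version = "DEFAULT" }])
// fallback inside the dynamic "guest_accelerator" block.
func withDriverDefault(accels []gpuAccelerator) []gpuAccelerator {
	out := make([]gpuAccelerator, 0, len(accels))
	for _, a := range accels {
		if a.DriverVersion == "" {
			a.DriverVersion = "DEFAULT"
		}
		out = append(out, a)
	}
	return out
}

func main() {
	fmt.Println(hasGPU(nil, "a3-highgpu-8g")) // pre-attached family
	fmt.Println(hasGPU(nil, "n2-standard-8")) // nothing requested
	fmt.Println(withDriverDefault([]gpuAccelerator{{Type: "nvidia-l4", Count: 1}}))
}
```

Compared with the previous expression, the new `has_gpu` adds the `a3` family and checks the length of the resolved list rather than only `var.guest_accelerator != null`, so an explicitly empty list no longer counts as a GPU request.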
2 changes: 1 addition & 1 deletion community/modules/compute/gke-node-pool/versions.tf
@@ -26,6 +26,6 @@ terraform {
}
}
provider_meta "google" {
module_name = "blueprints/terraform/hpc-toolkit:gke-node-pool/v1.35.1"
module_name = "blueprints/terraform/hpc-toolkit:gke-node-pool/v1.36.0"
}
}
3 changes: 2 additions & 1 deletion community/modules/compute/htcondor-execute-point/README.md
@@ -212,7 +212,7 @@ limitations under the License.
|------|--------|---------|
| <a name="module_execute_point_instance_template"></a> [execute\_point\_instance\_template](#module\_execute\_point\_instance\_template) | terraform-google-modules/vm/google//modules/instance_template | 10.1.1 |
| <a name="module_mig"></a> [mig](#module\_mig) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
| <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.34.0&depth=1 |
| <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.35.0&depth=1 |
## Resources
@@ -229,6 +229,7 @@ limitations under the License.
| <a name="input_central_manager_ips"></a> [central\_manager\_ips](#input\_central\_manager\_ips) | List of IP addresses of HTCondor Central Managers | `list(string)` | n/a | yes |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | `string` | n/a | yes |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `100` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Disk type for template | `string` | `"pd-balanced"` | no |
| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape across zones for instance group managing execute points | `string` | `"ANY"` | no |
| <a name="input_enable_oslogin"></a> [enable\_oslogin](#input\_enable\_oslogin) | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | `string` | `"ENABLE"` | no |
| <a name="input_enable_shielded_vm"></a> [enable\_shielded\_vm](#input\_enable\_shielded\_vm) | Enable the Shielded VM configuration (var.shielded\_instance\_config). | `bool` | `false` | no |
3 changes: 2 additions & 1 deletion community/modules/compute/htcondor-execute-point/main.tf
@@ -125,7 +125,7 @@ resource "google_storage_bucket_object" "execute_config" {
}

module "startup_script" {
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.34.0&depth=1"
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.35.0&depth=1"

project_id = var.project_id
region = var.region
@@ -151,6 +151,7 @@ module "execute_point_instance_template" {

machine_type = var.machine_type
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
gpu = one(local.guest_accelerator)
preemptible = var.spot
startup_script = local.is_windows_image ? null : module.startup_script.startup_script
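Several modules pin their internal dependencies with Terraform's Git module source syntax, for example `github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.35.0&depth=1`: the `//` separates the repository from a subdirectory inside it, `ref` selects the Git tag, and `depth=1` requests a shallow clone. This release moves those pins from `v1.34.0` to `v1.35.0`. The Go sketch below splits such a string into its parts. `moduleSource` and `parseModuleSource` are hypothetical, and the sketch only handles the scheme-less `github.com/...` shorthand used here; a `git::https://` source would need its scheme stripped first.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// moduleSource holds the pieces of a scheme-less Git module source.
type moduleSource struct {
	Repo   string // e.g. github.com/GoogleCloudPlatform/hpc-toolkit
	Subdir string // path after the "//" separator
	Ref    string // value of ?ref=
	Depth  string // value of &depth=
}

// parseModuleSource splits "repo//subdir?ref=X&depth=N" into its parts.
func parseModuleSource(src string) (moduleSource, error) {
	base, rawQuery, _ := strings.Cut(src, "?")
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return moduleSource{}, err
	}
	repo, subdir, _ := strings.Cut(base, "//")
	return moduleSource{Repo: repo, Subdir: subdir, Ref: q.Get("ref"), Depth: q.Get("depth")}, nil
}

func main() {
	src := "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.35.0&depth=1"
	ms, err := parseModuleSource(src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ms)
}
```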
6 changes: 6 additions & 0 deletions community/modules/compute/htcondor-execute-point/variables.tf
@@ -174,6 +174,12 @@ variable "disk_size_gb" {
default = 100
}

variable "disk_type" {
description = "Disk type for template"
type = string
default = "pd-balanced"
}

variable "windows_startup_ps1" {
description = "Startup script to run at boot-time for Windows-based HTCondor execute points"
type = list(string)
Original file line number Diff line number Diff line change
@@ -25,6 +25,6 @@ terraform {
}

provider_meta "google" {
module_name = "blueprints/terraform/hpc-toolkit:htcondor-execute-point/v1.35.1"
module_name = "blueprints/terraform/hpc-toolkit:htcondor-execute-point/v1.36.0"
}
}
2 changes: 1 addition & 1 deletion community/modules/compute/mig/versions.tf
@@ -22,6 +22,6 @@ terraform {
}
}
provider_meta "google" {
module_name = "blueprints/terraform/hpc-toolkit:mig/v1.35.1"
module_name = "blueprints/terraform/hpc-toolkit:mig/v1.36.0"
}
}
6 changes: 3 additions & 3 deletions community/modules/compute/pbspro-execution/README.md
@@ -74,9 +74,9 @@ No providers.
| Name | Source | Version |
|------|--------|---------|
| <a name="module_execution_startup_script"></a> [execution\_startup\_script](#module\_execution\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.34.0&depth=1 |
| <a name="module_pbs_execution"></a> [pbs\_execution](#module\_pbs\_execution) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/compute/vm-instance | v1.34.0&depth=1 |
| <a name="module_pbs_install"></a> [pbs\_install](#module\_pbs\_install) | github.com/GoogleCloudPlatform/hpc-toolkit//community/modules/scripts/pbspro-install | v1.34.0&depth=1 |
| <a name="module_execution_startup_script"></a> [execution\_startup\_script](#module\_execution\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.35.0&depth=1 |
| <a name="module_pbs_execution"></a> [pbs\_execution](#module\_pbs\_execution) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/compute/vm-instance | v1.35.0&depth=1 |
| <a name="module_pbs_install"></a> [pbs\_install](#module\_pbs\_install) | github.com/GoogleCloudPlatform/hpc-toolkit//community/modules/scripts/pbspro-install | v1.35.0&depth=1 |
## Resources
6 changes: 3 additions & 3 deletions community/modules/compute/pbspro-execution/main.tf
@@ -42,7 +42,7 @@ locals {
}
module "pbs_install" {
source = "github.com/GoogleCloudPlatform/hpc-toolkit//community/modules/scripts/pbspro-install?ref=v1.34.0&depth=1"
source = "github.com/GoogleCloudPlatform/hpc-toolkit//community/modules/scripts/pbspro-install?ref=v1.35.0&depth=1"

pbs_exec = var.pbs_exec
pbs_home = var.pbs_home
@@ -53,7 +53,7 @@ module "pbs_install" {
}

module "execution_startup_script" {
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.34.0&depth=1"
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script?ref=v1.35.0&depth=1"

deployment_name = var.deployment_name
project_id = var.project_id
@@ -68,7 +68,7 @@ module "execution_startup_script" {
}

module "pbs_execution" {
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/compute/vm-instance?ref=v1.34.0&depth=1"
source = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/compute/vm-instance?ref=v1.35.0&depth=1"

instance_count = var.instance_count
spot = var.spot