Releases: GoogleCloudPlatform/cluster-toolkit
V1.20.0 Filestore for GKE, Improved Windows support, & git-hosted Packer Modules
Key New Features
- Native GKE support for Filestore: storage-gke example.
- Improved support for Windows: Packer and `windows-startup-script`;
- Packer "packages": treat remote (git-hosted) Packer modules as packages when using Terraform's "//" notation;
- Automate DAOS server/client images
New Modules
- `gke-persistent-volume`: automatically creates persistent volumes and persistent volume claims for shared storage.
- `windows-startup-script`: a simple module that curates scripts for customizing Windows VMs.
Module Improvements
-
  - do not swap boot disk (and VM) each time a new disk image is available;
- `vpc`:
  - enabling TCP tunneling to the WinRM port used by PowerShell;
  - add firewall rule for SSH from arbitrary IP ranges;
-
  - add option for static node count;
  - add option to enable gcfs;
-
  - expose the option to not create a system node pool;
  - add option to create and update timeouts;
  - update service account variable to separate email and scopes;
- `gke-job-template`: add templating for persistent volume claims
-
  - add `disk_type` support;
  - add Powershell script support;
  - improved support for Windows;
  - treat remote (git-hosted) Packer modules as packages when using Terraform's "//" notation
  - add
-
  - add support for fixed version of HTCondor;
  - improve resilience;
- `schedmd-slurm-gcp-v5-controller`: allow providing short references for image project
- `batch-job-template`: use Batch HPC CentOS images as default image
Version updates
- Update to `slurm-gcp` 5.7.4
- Update `google-cloud-daos` from v0.4.0 to v0.4.1
What's Changed
- Merge v1.18.1 back to develop by @nick-stroud in #1421
- Improve vpc module by @tpdownes in #1422
- Expose the option on gke-cluster to not create a system node pool by @nick-stroud in #1425
- Add option to gke-node-pool for static node count by @nick-stroud in #1424
- Add options for create and update timeouts to gke modules by @nick-stroud in #1426
- Update gke service account variable to separate email and scopes by @nick-stroud in #1427
- Update HTCondor example to use Rocky Linux 8 by @tpdownes in #1432
- Update develop with release-candidate: Fix Ansible installation upon re-run by @rohitramu in #1439
- Improve Packer support for Windows by @tpdownes in #1431
- Update HTCondor execute point module by @tpdownes in #1434
- Add option to enable gcfs on gke-node-pool by @nick-stroud in #1428
- `resreader.go` code clean up by @mr0re1 in #1445
- Improve HTCondor example and integration tests by @tpdownes in #1440
- Do not use `log.Fatal` in `pkg/config` by @mr0re1 in #1444
- Remove `Module.RequiredApis` by @mr0re1 in #1446
- Conditionally exclude nodeSelector when not needed by @nick-stroud in #1441
- Adopt latest release of startup-script modules by @tpdownes in #1411
- Add `docs/module-guidelines.md` by @mr0re1 in #1423
- Improve Packer experience for Windows by @tpdownes in #1447
- Add inert `Module.RequiresApis` for backward compatibility by @mr0re1 in #1450
- Bump google.golang.org/api from 0.125.0 to 0.126.0 by @dependabot in #1430
- Fix regex for validating GroupName by @mr0re1 in #1449
- Reduce size of expanded blueprint by adding `omitempty` where applicable by @mr0re1 in #1452
- Remove `Module.DeploymentSource`, compute it on demand by @mr0re1 in #1453
- Merging v1.19.0 from main back into develop by @rohitramu in #1462
- Fix static_check warnings in `cmd/root*.go` by @mr0re1 in #1460
- Slurm gcp 5.7.4 by @SkylerMalinowski in #1459
- Fix panic while attempting to tokenize Null-value by @mr0re1 in #1468
- Update "google", "google-beta" providers to 4.69.1 by @rohitramu in #1464
- Adds gke-persistent-volume module by @nick-stroud in #1442
- Add documentation for gke-persistent-volume-module by @nick-stroud in #1478
- Merge v1.19.1 hotfix release into develop by @tpdownes in #1480
- Bump golang.org/x/sys from 0.8.0 to 0.9.0 by @dependabot in #1475
- Bump github.com/otiai10/copy from 1.11.0 to 1.12.0 by @dependabot in #1476
- Bump google.golang.org/api from 0.126.0 to 0.128.0 by @dependabot in #1477
- Address minor warnings and lint issues by @tpdownes in #1481
- Relax `TestNetworkStorage` to accommodate `gke-persistent-volume` by @mr0re1 in #1483
- Deprecated `WrapSettingsWith` by @mr0re1 in #1466
- Print advanced instructions after `ghpc deploy` by @mr0re1 in #1463
- Add `terraform_backend_defaults` section to some examples by @mr0re1 in #1469
- Use consistent order in "product of module use" mark. by @mr0re1 in #1484
- Add support for Packer packages by @tpdownes in #1467
- Add community example of how to use filestore with gke by @nick-stroud in #1443
- Remove excessive error messages by @mr0re1 in #1485
- Add rich error messages with position and snippet by @mr0re1 in #1448
- Update "google" provider in OFE from 3.x to 4.x by @rohitramu in #1470
- Use strict `Path` builder to reduce human error. by @mr0re1 in #1489
- Remove `settingsToIgnore` from `useModule` by @mr0re1 in #1486
- Allow providing short names for image project by @rohitramu in #1472
- Remove debug output from create command by @tpdownes in #1492
- Add custom unmarshaler for `Module.Use` for better error messaging by @mr0re1 in #1473
- Use `regexall` instead of `strcontains` to stay compatible with terraform 1.2 by @rohitramu in #1493
- Don't swap VM boot disk (and VM) each time a new disk image is available by @issacg in #1474
- Module documentation update and improved DAOS examples by @tpdownes in #1488
- Add `automatic_restart` to `vm-instance` by @mr0re1 in #1288
- Add `pipefail` to Makefile to prevent swallowing failed tests by @mr0re1 in #1498
- Vm instance boot disk lifecycle changes by @cboneti in #1494
- Remove "failed tests" check from `enforce_coverage` since it doesn't … by @mr0re1 in #1499
- Drop coverage requirement for pkg/shell by @tpdownes in #1507
- Update google-cloud-daos version from v0.4.0 to v0.4.1 b...
v1.19.1 Fix panic on null fields in terraform outputs
What's Changed
Full Changelog: v1.19.0...v1.19.1
v1.19.0: ghpc destroy command, automatic ssh configuration, and Ramble integration
Key New Features
- New `destroy` command that automates deletion of all infrastructure from a deployment
- New `ramble-execute` module. Example blueprint: `ramble.yaml`.
- Automated SSH configuration using startup-script module with `configure_ssh_host_patterns` setting.
Module Improvements
- `ramble-setup`: Made the module idempotent.
- Blueprint `labels` are now added to all resources in these modules:
- `packer/custom-image`: Remove temporary users from the final image.
- `project/service-account`: Simplified Service Account usage.
- `startup-script`: Enable custom service accounts with startup-script
- `gke-cluster`: Exposed Container Storage Interface drivers addons for several different GKE storage types.
- Eliminated the need to activate a Python virtual environment to run Ansible.
Improvements
- Add support for indexing to "simple blueprint expressions"
- Added community wrapper blueprint for LLNL flux-framework example
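
To illustrate what indexing in a "simple blueprint expression" means, here is a minimal Python sketch. It is not the Toolkit's implementation, and it assumes the expression takes the form `$(vars.name[0])`; the `resolve` helper, the regular expression, and the sample variable names are all hypothetical.

```python
import re

# Assumed expression shape: $(vars.<name>) optionally followed by [<int>] indexes.
_REF = re.compile(
    r"^\$\(vars\.(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?P<index>(?:\[\d+\])*)\)$"
)


def resolve(expr, variables):
    """Resolve a hypothetical `$(vars.name[i])` reference against a dict."""
    match = _REF.match(expr)
    if match is None:
        raise ValueError(f"not a supported expression: {expr!r}")
    value = variables[match.group("name")]
    # Apply each [i] in order, so `zones[1]` selects the second list element.
    for index in re.findall(r"\[(\d+)\]", match.group("index")):
        value = value[int(index)]
    return value


# Hypothetical deployment variables, for illustration only.
deployment_vars = {
    "region": "us-central1",
    "zones": ["us-central1-a", "us-central1-b"],
}
print(resolve("$(vars.region)", deployment_vars))
print(resolve("$(vars.zones[1])", deployment_vars))
```

Without indexing, a blueprint could only pass the whole `zones` list along; with it, a single element can be selected at the point of use.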
Version updates
- Intel DAOS from 0.3.0 to 0.4.0:
  - `hpc-slurm-daos.yaml`: Server update
  - `pfs-daos.yaml`: Client update
- Upgraded Terraform provider from 4.63.1 to 4.65.2
- Upgraded Spack default version from 0.19.0 to 0.20.0
- Update to slurm-gcp 5.7.3
  - Allow metadata key slurmd_feature to initiate dynamic node setup.
  - Disable TreeWidth when dynamic nodes are configured.
  - Fix NVIDIA driver install after kernel upgrade for rocky-linux-8.
What's Changed
- Add integration test runs for `release-candidate` branch by @mr0re1 in #1335
- Non-exclusive debug partition for hpc-slurm by @cboneti in #1345
- Reword module descriptions so that they fit on single line by @nick-stroud in #1343
- Improve instance ID printout in tests by @tpdownes in #1346
- Add `hpc-enterprise-slurm` integration test by @mr0re1 in #1331
- Add pre-commit to check for ghpc_module label by @nick-stroud in #1344
- Add example for the lustre file system. by @rohitramu in #1348
- Fix label value validation. by @rohitramu in #1349
- making hpc-slurm-ubuntu debug partition non-exclusive by @cboneti in #1350
- Merge main into develop by @cboneti in #1358
- Change GCP API packages `cloud.google.com/go` > `google.golang.org/api` by @mr0re1 in #1356
- Upgrading terraform provider to 4.65.2 by @cboneti in #1359
- Add a ramble execute module by @douglasjacobsen in #1310
- Disable release tests by @mr0re1 in #1361
- DAOSGCP-175 Updates for google-cloud-daos v0.4.0 by @mark-olson in #1351
- Bump google.golang.org/api from 0.122.0 to 0.123.0 by @dependabot in #1364
- Simplify adoption of Spack build caches in Google Cloud Storage by @tpdownes in #1352
- Remove `modulereader.ModuleFS`, use `sourcereader.ModuleFS` instead. by @mr0re1 in #1365
- Improve test coverage by @tpdownes in #1368
- Bump github.com/cloudflare/circl from 1.1.0 to 1.3.3 by @dependabot in #1372
- Update Django to 4.1.9 to address CVE-2023-31047 by @tpdownes in #1370
- Address CVE-2023-32681 by upgrading requests by @tpdownes in #1371
- Add test that all `file-system` mods output `network_storage` by @mr0re1 in #1373
- Expose csi driver addons in gke-cluster by @nick-stroud in #1374
- Eliminate need to activate virtual environment to run Ansible by @tpdownes in #1353
- Update spack default version to v0.20.0 by @saltysoup in #1367
- Add all Ansible binaries to default PATH by @tpdownes in #1379
- Fix batch mpi example by @tpdownes in #1380
- Update ramble-setup module to be idempotent by @douglasjacobsen in #1375
- Ensure that all modules take labels if they create resources by @rohitramu in #1362
- Add support for indexing to "simple blueprint expressions" by @mr0re1 in #1377
- Remove omnia dependencies from GHPC virtual environment by @tpdownes in #1381
- Add validation for deprecated input variables by @rohitramu in #1390
- Add destroy command for deployments by @tpdownes in #1382
- Bump google.golang.org/api from 0.123.0 to 0.124.0 by @dependabot in #1385
- Disable auto_activate_base in conda examples by @tpdownes in #1392
- Ensure local users are not present in final image by @tpdownes in #1393
- add config-ssh as a startup script option by @cboneti in #1378
- Bump github.com/zclconf/go-cty from 1.13.1 to 1.13.2 by @dependabot in #1384
- Bump tomlkit from 0.11.7 to 0.11.8 in /community/front-end/ofe by @dependabot in #1395
- Bump google-cloud-storage from 2.8.0 to 2.9.0 in /community/front-end/ofe by @dependabot in #1396
- Bump github.com/go-git/go-git/v5 from 5.6.1 to 5.7.0 by @dependabot in #1383
- Bump cachetools from 5.3.0 to 5.3.1 in /community/front-end/ofe by @dependabot in #1397
- Bump typing-inspect from 0.8.0 to 0.9.0 in /community/front-end/ofe by @dependabot in #1398
- Fix typos by @tpdownes in #1402
- Bump urllib3 from 1.26.15 to 2.0.2 in /community/front-end/ofe by @dependabot in #1399
- Add community wrapper blueprint for LLNL flux-framework by @wkharold in #1369
- Enable remote git Packer by @tpdownes in #1401
- Update READMEs about "literal variables" by @mr0re1 in #1391
- Bump google.golang.org/api from 0.124.0 to 0.125.0 by @dependabot in #1409
- Simplify service-account module by @tpdownes in #1400
- Bump github.com/hashicorp/hcl/v2 from 2.16.2 to 2.17.0 by @dependabot in #1408
- Enable custom service accounts with startup-script by @tpdownes in #1404
- Update to slurm-gcp 5.7.3 by @SkylerMalinowski in #1410
- Validate a blueprint's top-level "labels" variable by @rohitramu in #1394
- Update vm-instance instructions to handle the case that there are zero instances by @nick-stroud in #1413
- Identify G2 family as having accelerators by @tpdownes in #1415
- Fix CRD Slurm example by @nick-stroud in #1414
- updating examples and documentation to point to newer SchedMD images by @cboneti in #1412
- Move hpc-enterprise-slurm test to avoid stockouts by @nick-stroud in #1416
- Fix nfs-server attached disk mounting. by @mr0re1 in #1406
- Do not swallow error during expanded.yaml write by @mr0re1 in #1417
- Update Open Front...
v1.18.1: Update Package Requirements for Open Front End
What's Changed
- Bump cryptography from 40.0.2 to 41.0.0 in /community/front-end/ofe by @dependabot in #1418
Full Changelog: v1.18.0...v1.18.1
v1.18.0: ghpc deploy, new examples, better examples names, slurm-gcp 5.7.2
Key New Features
- `ghpc deploy` is now the recommended way of deploying your environments
- Multigroup blueprints may now use module outputs from one group to another
  - e.g., a network may be dynamically created in group 1 and its name will be available directly in group 2
- New hpc-enterprise blueprint with various high performance options
- New ML blueprints: ml-slurm.yaml and ml-gke.yaml
- Blueprints renamed for more clarity
- Ability to communicate variables across deployment groups with `ghpc deploy` or `ghpc export-outputs` and `ghpc import-inputs`
- Slurm on GCP V4.x is now deprecated; all core examples are moved to V5.7.2
Examples
- `htc-slurm.yaml`: shows how to provision a cluster with configuration tuned for many short-duration, loosely coupled jobs.
- `client-google-cloud-storage.yaml`: demonstrates different ways to use Google Cloud Storage (GCS) buckets in the HPC Toolkit.
New Modules
- `gke-job-template`: Creates a Kubernetes job template file that can be used to submit jobs.
- `kubernetes-operations`: Performs pre-defined operations on Kubernetes resources that would otherwise be executed using `kubectl`.
Module Improvements
- `gke-cluster`: Added GPU support and automated installation of Nvidia drivers.
Deprecations
- Slurm V4.x modules: partition, controller and login-node.
Version updates
- `schedmd-slurm-gcp-v5-controller`: update SchedMD modules to 5.7.2
- Min required Terraform version bumped 1.0 -> 1.2
- Min required Packer version bumped 1.6 -> 1.7.9
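
Minimum-version requirements like the ones above are easy to check incorrectly with plain string comparison, since "1.10.0" sorts before "1.7.9" as text. The sketch below shows one way to compare dotted versions numerically. It is an illustration only, not the check the Toolkit's Makefile performs, and the helper names `parse_version` and `meets_minimum` are hypothetical.

```python
def parse_version(text, width=3):
    """Turn '1.7.9' or 'v1.2' into a tuple of ints, padded with zeros."""
    parts = [int(part) for part in text.strip().lstrip("v").split(".")]
    # Pad so that '1.2' and '1.2.0' compare as equal.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Minimums taken from this release; the installed versions are made-up inputs.
print(meets_minimum("1.4.6", "1.2"))     # Terraform check
print(meets_minimum("1.10.0", "1.7.9"))  # Packer check; a string compare would get this wrong
print(meets_minimum("1.6.0", "1.7.9"))
```

The sketch handles purely numeric versions only; pre-release suffixes such as `-rc1` would need extra parsing.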
What's Changed
- Include group kind in deployment metadata by @tpdownes in #1213
- Bump google.golang.org/api from 0.118.0 to 0.119.0 by @dependabot in #1209
- Bump github.com/otiai10/copy from 1.10.0 to 1.11.0 by @dependabot in #1210
- Increase get URL timeout for CRD module by @tpdownes in #1211
- Use optimize utilization autoscaling profile by @nick-stroud in #1214
- Retry project cleanup up to 4 times each night by @tpdownes in #1217
- Use deadline instead of retries in wait-for-startup by @mr0re1 in #1216
- Silence daily cleanup notifications and enable retries for other builds by @tpdownes in #1218
- Bump minimal Terraform version 1.0 -> 1.2 by @mr0re1 in #1178
- Implement stub export-outputs command by @tpdownes in #1219
- Bump minimum Terraform in golden copy deployments by @tpdownes in #1222
- Use Dict for Module.Settings, derive connectivity from it by @mr0re1 in #1205
- Initial implementation of export-outputs command by @tpdownes in #1225
- Minor refactoring config.go by @mr0re1 in #1223
- Implement stub import-inputs command by @tpdownes in #1226
- Add better version comparator for Makefile by @mr0re1 in #1215
- Fix whitespace in deployment directories by @tpdownes in #1227
- Update git clone instruction to use HTTPS instead of SSH by @mr0re1 in #1233
- Add ghpc version to expanded blueprint by @mr0re1 in #1224
- Update GKE settings to match recommendations from GKE team by @nick-stroud in #1231
- Bump min packer version to 1.7.9 by @mr0re1 in #1232
- Fail wait-for-startup fast if log can not be fetched by @mr0re1 in #1220
- Remove typo in README heading by @nick-stroud in #1237
- Fix missing command to print out by @mr0re1 in #1238
- Handle "wrong-type-of-packer" in `make warn-packer-missing` by @mr0re1 in #1239
- Fix Chrome Remote Desktop NVIDIA Grid installation by @tpdownes in #1240
- Address `shellcheck -o all wait-for-startup-status.sh` by @mr0re1 in #1242
- Fix retry configuration for daily integration tests by @tpdownes in #1236
- Do not store ModuleInfo in DeploymentConfig by @mr0re1 in #1230
- Create a gke-job-template module, which creates a Kubernetes job file by @nick-stroud in #1234
- Ensure that terraform cleanup always runs by @tpdownes in #1235
- Remove unused method `HasKind` by @mr0re1 in #1246
- Add option to select zones for gke-node-pool by @nick-stroud in #1245
- Add the gke-job-template module to the list of modules by @nick-stroud in #1243
- Initial implementation of import-inputs command by @tpdownes in #1228
- Remove ansible-lint to unblock PRs by @mr0re1 in #1257
- Skip TestFindTerraform if no terraform is installed by @mr0re1 in #1255
- Unify shared code of create and expand commands by @mr0re1 in #1244
- Bump google.golang.org/api from 0.119.0 to 0.120.0 by @dependabot in #1253
- Add documentation warning about lustre license cost by @nick-stroud in #1254
- Remove modReference by @mr0re1 in #1247
- Bump cryptography from 40.0.1 to 40.0.2 in /community/front-end/ofe by @dependabot in #1252
- Bump protobuf from 4.22.1 to 4.22.3 in /community/front-end/ofe by @dependabot in #1250
- Bump pyasn1-modules from 0.2.8 to 0.3.0 in /community/front-end/ofe by @dependabot in #1251
- Bump pyasn1 from 0.4.8 to 0.5.0 in /community/front-end/ofe by @dependabot in #1248
- Make Expression into interface by @mr0re1 in #1260
- Refactor create_deployment.sh by @nick-stroud in #1258
- Address usability suggestions for multi-group deployments by @tpdownes in #1262
- Eliminate deployment metadata by @tpdownes in #1265
- Use dedicated dtype ModuleID and GroupName instead of string by @mr0re1 in #1264
- Adds a basic gke test which provisions and destroys a cluster by @nick-stroud in #1259
- Fix link in image builder example by @tpdownes in #1269
- Eliminate warnings by @tpdownes in #1277
- Add Terraform state download command to stdout of integration tests by @tpdownes in #1278
- Write Packer intergroup input values by @tpdownes in #1268
- Resolve conflicts before merging `main`...
v1.17.0: Initial Support for GKE, Slurm v5.6.3
Key New Features
- Initial Support for Kubernetes with GKE (example).
- Enable specification of all fields of module outputs
- Instructions to run the toolkit from Cloud Workstations
New Modules
- `gke-cluster`: module to create a Google Kubernetes Engine (GKE) cluster
- `gke-node-pool`: module to create a Google Kubernetes Engine (GKE) node pool
Module Improvements
- `startup-script`: replace example scripts with bool inputs
- `custom-image`: added `image_storage_locations` input
- `custom-image`: use a unique Packer SSH username to avoid clashes with previous Packer builds
- `htcondor-configure`: address need for SystemD override
- `htcondor-configure`: ensure that a central manager optimization is configured even when high availability is not enabled
- `chrome-remote-desktop`: updated for Slurm image support
Improvements
- Added support for OFE deployment from a configuration file
Version updates
- `schedmd-slurm-gcp-v5-controller`: update SchedMD modules to 5.6.3
What's Changed
- Replace startup-script examples with bool inputs by @mr0re1 in #1100
- Copy all embedded modules into deployment, use unique source for locals by @mr0re1 in #1086
- Close copy file descriptor in EmbeddedSourceReader by @mr0re1 in #1114
- Improve error match in embedded_test by @mr0re1 in #1115
- Adds a gke-cluster module to community by @nick-stroud in #1113
- DAOS docs update by @cboneti in #1116
- Simplify and relax type constraints for variables.tf by @mr0re1 in #1111
- Make every integration test into individual build config by @mr0re1 in #1112
- Fix validator test_deployment_variable_not_used by @mr0re1 in #1120
- Add basic documentation for gke-cluster module and example by @nick-stroud in #1117
- Updating packer documentation to make usage easier to find by @cboneti in #1118
- Add `image_storage_locations` input to `modules/packer/custom-image` by @mr0re1 in #1123
- Add TF definition for DAILY-test-X, PR-test-X, and PR-validation by @mr0re1 in #1119
- Add "babysit_tests" tool to automatically approve PR tests by @mr0re1 in #1106
- Solve state/world discrepancies in TF dev infra. by @mr0re1 in #1126
- Move SlurmV5 tests affected by stockouts to us-west4-c by @mr0re1 in #1124
- Improve variable references by @tpdownes in #1127
- Remove test groups, update documentation by @mr0re1 in #1128
- Fix bug in check for mixing module kinds within a group by @mr0re1 in #1130
- Update GitHub bug report template by @mr0re1 in #1131
- Remove deprecated pod_security_policy by @nick-stroud in #1133
- Add test selectors to babysit tool by @mr0re1 in #1136
- Add TF for legacy PR tests. To be removed after release by @mr0re1 in #1135
- Add SPACK_CACHE secret to spack-gromacs test by @mr0re1 in #1132
- Add instructions for connecting to the gke-cluster by @nick-stroud in #1138
- Address need for SystemD override in HTCondor module by @tpdownes in #1139
- Update TFLint and rules plugin for Google Cloud Platform by @tpdownes in #1146
- Add double quotes on variables: SC2086 – ShellCheck by @nick-stroud in #1148
- Add support for sensitive output values by @tpdownes in #1129
- Represent TerraformBackend.Config with cty.Value by @mr0re1 in #1141
- Bump github.com/otiai10/copy from 1.9.0 to 1.10.0 by @dependabot in #1143
- Bump github.com/spf13/cobra from 1.6.1 to 1.7.0 by @dependabot in #1145
- Truncate short sha length to 7 chars when filtering from cloud build by @nick-stroud in #1151
- Bump google.golang.org/api from 0.114.0 to 0.117.0 by @dependabot in #1150
- Bring develop up to date with release of v1.16.0 by @nick-stroud in #1153
- Pin google terraform provider to latest version by @nick-stroud in #1154
- Add selectors for batch and spack tests to babysit_tests tool by @nick-stroud in #1155
- Reduce the number of execution hosts in pbs test to reduce the change… by @nick-stroud in #1149
- Ensure that PBS test config explicitly uses network module by @tpdownes in #1159
- Align internal use of Toolkit GitHub refs by @tpdownes in #1160
- Move Ubuntu test and example to reduce chance of stockout by @nick-stroud in #1163
- Fix HTCondor central manager configuration by @tpdownes in #1162
- Add specialized tokenizer to handle `((HCL literals))` by @mr0re1 in #1167
- Move Slurm v5 high io test to reduce stockouts by @nick-stroud in #1168
- Gke node pool by @nick-stroud in #1140
- Make babysit_tests compatible with Python3.7 (VertexAI) by @mr0re1 in #1173
- Instructions to run the toolkit from Cloud Workstations by @cboneti in #1170
- Write group metadata to deployment folder by @tpdownes in #1169
- Update quantum example with new build instructions by @tpdownes in #1176
- Add TransformSimpleToHcl for cty.Value by @mr0re1 in #1165
- Developer setup on login is causing workstation to crash on startup by @nick-stroud in #1177
- Add conditions on Slurm partition enable_placement, exclusive, Oversu… by @mr0re1 in #1174
- Move tests to avoid stockouts by @nick-stroud in #1179
- Use a unique Packer SSH username to avoid clashes with previous Packer builds by @nick-stroud in #1184
- Bump google.golang.org/api from 0.117.0 to 0.118.0 by @dependabot in #1183
- Bump cloud.google.com/go/compute from 1.19.0 to 1.19.1 by @dependabot in #1182
- Update SchedMD modules to 5.6.3 (from 5.6.2) by @SkylerMalinowski in #1171
- Updated chrome rem...
v1.16.0: New Lustre Example, Slurm v5.6.2, & HTCondor Improvements
Improvements
- New simple Lustre example. (blueprint, documentation)
- `htcondor-execute-point`: Added option to HTCondor autoscaler for minimum number of idle VMs to decrease job startup time.
- `htcondor-execute-point`: Added option to set boot disk size.
- New validator reports unused deployment variables. (documentation)
- Expanded options for skipping individual validators. (documentation)
- `terraform.tfvars` file generated in the deployment folder is written in stable order, making it easier to track in version control.
- Test and documentation updates.
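
The benefit of writing `terraform.tfvars` in stable order can be seen in a small Python sketch. This is an illustration under assumptions, not the Toolkit's writer: the `render_tfvars` helper is hypothetical, and it only handles simple values (strings, numbers, booleans, lists) for which JSON notation is also valid in a tfvars file.

```python
import json


def render_tfvars(variables):
    """Render `name = value` lines sorted by variable name.

    Sorting removes any dependence on dict insertion order, so the same
    variables always produce byte-identical output.
    """
    lines = []
    for name in sorted(variables):
        # json.dumps quotes strings and renders lists/booleans in a form
        # that is also accepted for these simple types in tfvars files.
        lines.append(f"{name} = {json.dumps(variables[name])}")
    return "\n".join(lines) + "\n"


# Same (made-up) variables, supplied in two different orders.
first = render_tfvars({"zone": "us-central1-a", "deployment_name": "demo"})
second = render_tfvars({"deployment_name": "demo", "zone": "us-central1-a"})
print(first, end="")
print(first == second)
```

Because both calls yield the same text, regenerating a deployment folder produces no spurious diff in version control unless a value actually changed.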
Version updates
- Slurm V5 modules minor update v5.6.0 > v5.6.2 (full changelog)
- Now compatible with Terraform 1.4.0.
- Resume failures now notify `srun` of the error.
- `setup.log` now discoverable in GCP Cloud Logger.
- Fix slurm and slurm-gcp logs not showing up in Cloud Logging.
What's Changed
- Bump googleapis-common-protos from 1.54.0 to 1.58.0 in /community/front-end/ofe by @dependabot in #1014
- Bump httplib2 from 0.20.4 to 0.21.0 in /community/front-end/ofe by @dependabot in #1015
- Bump wrapt from 1.13.3 to 1.15.0 in /community/front-end/ofe by @dependabot in #1016
- Bump django-extensions from 3.1.5 to 3.2.1 in /community/front-end/ofe by @dependabot in #1018
- Remove ModuleToGroup from DeploymentConfig by @mr0re1 in #1022
- Add enforcement of minimum # of idle VMs to HTCondor autoscaler by @tpdownes in #983
- Bump grpcio-status from 1.43.0 to 1.51.3 in /community/front-end/ofe by @dependabot in #1017
- Fix broken markdown link by @mr0re1 in #1024
- Update year of Terraform deployment license by @tpdownes in #1026
- Fix startup-options test by @mr0re1 in #1029
- Fix broken test config use-resources.yaml by @mr0re1 in #1030
- Add an integration test for chrome-remote-desktop module by @nick-stroud in #1027
- Fix: dependabot proposed incompatible requirements by @nick-stroud in #1028
- Merge v1.14.1 back to develop by @nick-stroud in #1033
- Add retries to apt tasks in chrome-remote-desktop to account for lock contention with unattended-upgrades by @nick-stroud in #1025
- Bring all provider_meta versions up to current version by @nick-stroud in #1034
- Update pre-commit hook repos by @tpdownes in #1035
- Fix dtype of slurm node-group.preemtible to bool by @mr0re1 in #1036
- Set label dtype = map(string) for community/modules by @mr0re1 in #1038
- Fix labels dtype in DDN-EXAScaler by @mr0re1 in #1039
- Make readers usable outside of package modulereader by @mr0re1 in #1040
- Fix slurm gcp v5 validation message in partition by @cboneti in #1043
- Add a test to enforce contracts on modules interfaces by @mr0re1 in #1041
- Bump google.golang.org/api from 0.112.0 to 0.114.0 by @dependabot in #1044
- Bump github.com/googleapis/gax-go/v2 from 2.7.1 to 2.8.0 by @dependabot in #1045
- Bump cloud.google.com/go/serviceusage from 1.5.0 to 1.6.0 by @dependabot in #1046
- Bump github.com/zclconf/go-cty from 1.13.0 to 1.13.1 by @dependabot in #1047
- Bump github.com/go-git/go-git/v5 from 5.6.0 to 5.6.1 by @dependabot in #1048
- Add integration test coverage for add_deployment_name_before_prefix by @nick-stroud in #1037
- Minor refactoring of code around expressions by @mr0re1 in #1042
- Upgrade pip in builder image by @tpdownes in #1056
- Add simple sanity test installing the OFE virtual environment by @tpdownes in #1057
- Order settings alphabetically and format main.tf by @mr0re1 in #1059
- Add links to related material on YouTube by @nick-stroud in #1061
- Resolve anticipated merge conflict between develop and main by @tpdownes in #1066
- Merge v1.15.0 release into develop by @tpdownes in #1070
- Bump github.com/hashicorp/go-getter from 1.7.0 to 1.7.1 by @dependabot in #1067
- Use stable order while writing variables and backend configs by @mr0re1 in #1071
- Increase version of Terraform google providers by @tpdownes in #1072
- Implement exponential backoff in startup-script by @tpdownes in #1073
- Update to slurm-gcp 5.6.2 (from 5.6.0) by @SkylerMalinowski in #1074
- Update slurm images by @tpdownes in #1075
- Add deployment_name to HTCondor example VMs by @tpdownes in #1076
- Remove labels-specific logic from tfwriter. by @mr0re1 in #1064
- Add additional documentation on enable_reconfigure troubleshooting by @nick-stroud in #1077
- Use stable order while writing terraform.tfvars by @mr0re1 in #1080
- Refactoring to support edge-tracking in graph by @tpdownes in #1078
- Refactor moduleConnections as a map tracking source module of connection by @tpdownes in #1079
- Remove field Module.ModuleName add DeploymentSource instead by @mr0re1 in #1081
- Improve version constraint on batch-job-template module by @tpdownes in #1084
- Lustre example by @cboneti in #1082
- Resolve accidental destruction of startup-scripts by @tpdownes in #1085
- Updating image in the slurm-gcp-v5 Ubuntu example by @cboneti in #1087
- Show core examples first by @cboneti in #1088
- Use latest update to startup-script module by @tpdownes in #1090
- Bump cloud.google.com/go/compute from 1.18.0 to 1.19.0 by @dependabot in #1089
- Check validations during `make tests`; Add flag to skip validators by @mr0re1 in #1032
- Changing CRD test to us-central1-f by @cboneti in #1093
- Changing the GPU type of quantum-circuit-simulator to T4. by @cboneti in #1091
- Unify network_storage variables type by @mr0re1 in #1092
- Several minor fixes by @tpdownes in #1095
- Bump google-api-python-client from 2.37.0 to 2.82.0 in /community/front-end/ofe by @dependabot in #1068
- Bump platformdirs from 2.5.0 to 3.1.1 in /community/front-end/ofe by @dependabot in #1053
- Bump pyparsing fro...
v1.15.0: Improvements to Slurm and HTCondor solutions
Key New Features
- Support for HTCondor pools with both On-demand and Spot VMs
- Slurm solution updated to 5.6.0
- Support for custom machine types
- Label exclusive nodes with job ID for cost-tracking
- New zone_target_shape parameter corresponding to bulkInsert targetShape parameter
- FIX: lustre mounting regression introduced in 5.5.0
Improvements
- [`filestore`] module added support for Shared VPCs via `use` keyword and `pre-existing-vpc` module
- HTCondor modules now use minimally-scoped authentication for each daemon
- HTCondor execute points disable benchmarks to decrease time to join pool
- Improved type alignment across modules, e.g. `var.labels` aligned to `map(string)`
What's Changed
- Rename filestore network_name to network_id to enable shared VPC via use by @nick-stroud in #962
- Improve attribute tracking in HTCondor scheduler by @tpdownes in #965
- Update fluent tutorial to use pre-existing-vpc module and other minor syntax updates by @nick-stroud in #963
- Revert "Rename filestore network_name to network_id to enable shared VPC via use" by @nick-stroud in #967
- Mask sleep/suspend targets on chrome-remote-desktop to prevent shutdown by @nick-stroud in #968
- Update image building example to use Slurm V5 by @mr0re1 in #964
- Improve HTCondor job matchmaking speed by @tpdownes in #971
- Roll-forward:"Rename filestore network_name to network_id to enable shared VPC via use" by @nick-stroud in #969
- Increase reliability of blueprints using DDN Exascaler by @tpdownes in #972
- Further increase speed at which HTCondor daemons update their ClassAds by @tpdownes in #974
- Initial support for Spot VMs within HTCondor pools by @tpdownes in #973
- Convert HTCondor autoscaler to SystemD timer by @tpdownes in #975
- Add validation to prevent usage of variables in backend block. by @mr0re1 in #970
- Making OFE deploy.sh MacOS compatible. Fixes #978 by @ek-nag in #979
- Improve Slurm log capturing by @tpdownes in #980
- Support Spot VMs in HTCondor pools by @tpdownes in #981
- Add utils for parsing and normalizing HCL dtype by @mr0re1 in #977
- Enable depth-first filling of HTCondor pools by @tpdownes in #982
- Escalate to root privileges to fetch Slurm logs by @mr0re1 in #987
- Bump google.golang.org/api from 0.110.0 to 0.111.0 by @dependabot in #984
- Bump github.com/spf13/afero from 1.9.4 to 1.9.5 by @dependabot in #985
- Bump github.com/go-git/go-git/v5 from 5.4.2 to 5.6.0 by @dependabot in #986
- Bump dill from 0.3.4 to 0.3.6 in /community/front-end/ofe by @dependabot in #990
- Bump google-cloud-core from 2.2.2 to 2.3.2 in /community/front-end/ofe by @dependabot in #991
- Bump astroid from 2.9.3 to 2.15.0 in /community/front-end/ofe by @dependabot in #992
- Bump proto-plus from 1.20.1 to 1.22.2 in /community/front-end/ofe by @dependabot in #993
- Bump isort from 5.10.1 to 5.12.0 in /community/front-end/ofe by @dependabot in #994
- Merge main into develop after release v1.14.0 by @mr0re1 in #997
- Bump terraform providers version 4.53.1 -> 4.56.0 by @mr0re1 in #998
- Clean up Filestore regardless of instances presence by @mr0re1 in #999
- Upgrade to slurm-gcp 5.6.0 by @SkylerMalinowski in #995
- Fix nfs-server example to use local_mounts instead of local_mount by @nick-stroud in #1001
- Add missing description for gcs_bucket_path by @nick-stroud in #1002
- Doc fix by @issacg in #1010
- Add mounting of cloud-storage-bucket to Slurm v5 test by @nick-stroud in #1007
- Use DeploymentName getter instead of looking up Vars by @mr0re1 in #1005
- Specify strict type for labels = map(string) by @mr0re1 in #1000
- Pass empty string instead of null to avoid mounting failure in Slurm by @nick-stroud in #1003
- Remove ghpc_role setting from nfs-server example by @nick-stroud in #1008
- Actually check mount instead of just checking dir exists by @nick-stroud in #1004
- Remove hostname test as it is not providing incremental value by @nick-stroud in #1006
- Double length of time for HTCondor integration test to detect job queue by @tpdownes in #1020
- Bump github.com/googleapis/gax-go/v2 from 2.7.0 to 2.7.1 by @dependabot in #1011
- Bump github.com/hashicorp/hcl/v2 from 2.16.1 to 2.16.2 by @dependabot in #1012
- Update slurm v5 readme about local-exec dependencies by @mr0re1 in #1023
- Bump google.golang.org/api from 0.111.0 to 0.112.0 by @dependabot in #1013
- Update OFE Dependabot configuration by @tpdownes in #1055
- Release v1.15.0 by @tpdownes in #1065
New Contributors
Full Changelog: v1.14.1...v1.15.0
v1.14.1: Fix vm-instance naming
What's Changed
- Hotfix: incorrect syntax for terraform interpolation in string by @nick-stroud in #1031
Full Changelog: v1.14.0...v1.14.1
v1.14.0: HTCondor highly available, HCLS blueprint
Key New Features
- HCLS blueprint supports running GROMACS on GPUs and has added several tutorials.
- Support for highly available HTCondor pools
- Job queue (SchedD) high availability remains experimental; see README
Module Improvements
- vpc: new option to enable firewall rule that allows tunneling of Windows Remote Desktop connections
- schedmd-slurm-gcp-v5-partition: all deprecated variables have been removed; these have migrated to schedmd-slurm-gcp-v5-node-group
- htcondor-configure:
  - job history will now include VM instance ID, zone and machine type
  - VMs are now provisioned with minimally-permissioned IDTOKENs for their respective daemons (e.g. ADVERTISE_STARTD)
- startup-script: installation script for Cloud Ops Agent on Debian platforms will retry when other processes are blocking apt operations
- htcondor-execute-point: add a simple health check of port 9618 on any machine within the execute point
- vm-instance: vm-instance can be named using both a prefix and the deployment name
Improvements
- Improved error message when YAML blueprint has syntax errors preventing it from being loaded
- Regular updates to Go and Python dependencies to address potential security vulnerabilities
- Fixed Open Front End (OFE) issue with static content (icons) not displaying properly
What's Changed
- Add Windows Remote Desktop IAP firewall rule by @tpdownes in #885
- Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #886
- Fix: having the same share name and local mount caused slurm failure by @nick-stroud in #887
- Implicitly add outputs to modules when they are being used across deployment groups by @tpdownes in #878
- Reorder validator list to test blueprint correctness first by @heyealex in #889
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #890
- Adding example sbatch and config for Factor Xa protein by @nick-stroud in #888
- Update HCLS blueprint examples to run Gromacs w/ GPUs by @nick-stroud in #891
- Add support for highly available HTCondor Central Managers by @tpdownes in #892
- Remove deprecations from slurm-gcp v5 partition by @heyealex in #893
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @tpdownes in #896
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #894
- Ensure Open Front End dependabot updates target develop by @tpdownes in #897
- Fix: Always generate Batch instance template to avoid known at apply time error by @nick-stroud in #898
- Update sbatch to copy results to output bucket & minor tweaks by @nick-stroud in #895
- Add troubleshooting documentation for filestore share name exportfs bug by @nick-stroud in #899
- Bump github.com/hashicorp/hcl/v2 from 2.16.0 to 2.16.1 by @dependabot in #900
- Fix addlicense check in weekly image building by @tpdownes in #901
- Update HTCondor modules by @tpdownes in #902
- Bump django from 3.2.16 to 3.2.17 in /community/front-end/ofe by @dependabot in #905
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #906
- Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #907
- Bump TFLint version in image to latest by @tpdownes in #903
- Add instructions to ssh to VM created by vm-instance by @nick-stroud in #880
- Address terraform_unused_required_providers errors by @tpdownes in #908
- Manage HTCondor yum repo configuration directly by @tpdownes in #904
- Bump github.com/aws/aws-sdk-go from 1.33.0 to 1.34.0 by @dependabot in #911
- Fail integration tests on validation warnings. by @mr0re1 in #910
- Set minimal scopes for HTCondor IDTOKENs by @tpdownes in #919
- Fix cmd/root_test.go test runs from linked Git worktrees. by @mr0re1 in #918
- OFE update 14/02/2023. by @ek-nag in #913
- Improve error message for yaml parsing failures by @heyealex in #923
- HTCondor job track machine information by @tpdownes in #924
- Update develop with release v1.13.0 by @nick-stroud in #928
- Add explicit output dependencies to HTCondor by @tpdownes in #925
- Pin terraform google provider to v4.53.1 by @nick-stroud in #929
- Fix root_test failure on MacOS by @mr0re1 in #932
- Update htcondor-configure README example snippet by @tpdownes in #935
- Bump github.com/hashicorp/go-getter from 1.6.2 to 1.7.0 by @dependabot in #930
- Add retries to cloud ops install by @heyealex in #933
- Bump django from 3.2.17 to 3.2.18 in /community/front-end/ofe by @dependabot in #922
- Remove unused error message by @tpdownes in #939
- Bump google.golang.org/api from 0.109.0 to 0.110.0 by @dependabot in #937
- Fix typo in hcls instructions command by @nick-stroud in #940
- Update hcls example to use lysozyme protein instead of factor xa by @nick-stroud in #942
- HTCondor Job Queue High Availability by @tpdownes in #934
- Refactor useModule by @tpdownes in #941
- Bugfix to HTCondor autoscaler script by @tpdownes in #945
- Update hcls spack builder to use c2 machine by @nick-stroud in #948
- Enable OS Login by default in HTCondor execute points by @tpdownes in #944
- Remove dependency on unused module by @mr0re1 in #947
- Add health check for HTCondor VMs by @tpdownes in #946
- HCLS tutorial update by @nick-stroud in #950
- Bump github.com/spf13/afero from 1.9.3 to 1.9.4 by @dependabot in #955
- Bump github.com/zclconf/go-cty from 1.12.1 to 1.13.0 by @dependabot in #954
- Allow VM instance name to include prefix and deployment name by @nick-stroud in #949
- Use HTCondor Python bindings in autoscaler by @tpdownes in #951
- Update hcls Lysozyme example to include visualization instructions by @nick-stroud in #958
- Unify validator...