
[DON'T MERGE] GCS Checkpointing Testing Workload modification #782

Draft · wants to merge 4 commits into main

Conversation

bernardhan33 (Collaborator)

This is a draft PR created for GCS internal members to comment on; it will not be merged to main.

Checkpointing a 64B model through MaxText

  • Read and write times are collected and sent to GCS buckets before a separate Python program aggregates them and uploads the results to BQ. I've created b/353631904 to track the improvement of letting each pod write directly to BQ, which is currently blocked by the required nodepool recreation.
  • A sample YAML file is provided for code review purposes.
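As a rough, hypothetical sketch of that aggregation step (the object layout, field names, and BigQuery table ID are assumptions, not the actual tooling; only the bucket name comes from the config below):

```python
"""Hypothetical sketch of the aggregation step described above: read per-pod
read/write timing objects from a GCS bucket and upload rows to BigQuery."""
import json

from google.cloud import bigquery, storage

GCS_BUCKET = "distributed-checkpointing-metrics"     # from gcs_metrics_bucket
BQ_TABLE = "my-project.checkpointing.step_metrics"   # hypothetical table ID


def aggregate_and_upload() -> None:
    storage_client = storage.Client()
    bq_client = bigquery.Client()

    rows = []
    # Assumes each pod wrote one JSON object per checkpoint step into the bucket.
    for blob in storage_client.list_blobs(GCS_BUCKET):
        record = json.loads(blob.download_as_text())
        rows.append(
            {
                "pod": record["pod"],
                "step": record["step"],
                "read_time_s": record["read_time_s"],
                "write_time_s": record["write_time_s"],
            }
        )

    # Streaming insert; returns a list of per-row errors (empty on success).
    errors = bq_client.insert_rows_json(BQ_TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")


if __name__ == "__main__":
    aggregate_and_upload()
```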

```diff
@@ -390,3 +390,6 @@ enable_single_controller: False
 
 # Split physical axes for https://jax.readthedocs.io/en/latest/_autosummary/jax.experimental.mesh_utils.create_device_mesh.html
 allow_split_physical_axes: False
+
+# Attributes needed for the MaxText checkpointing loop using standalone_checkpointer.
+gcs_metrics_bucket: distributed-checkpointing-metrics
```

Should there be / is there any ability to provide a folder or path instead of just the bucket?
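As an illustration only, here is a minimal, hypothetical sketch of how a workload might publish per-step metrics into the bucket configured above; the helper name, object layout, and fields are assumptions rather than the actual MaxText code, and supporting a full gs://bucket/prefix value (as the comment above asks) would only change how the blob path is built.

```python
# Hypothetical sketch (not the actual MaxText code) of publishing per-step
# read/write timings into the bucket named by gcs_metrics_bucket.
import json
import time

from google.cloud import storage


def upload_step_metrics(gcs_metrics_bucket: str, pod_name: str, step: int,
                        write_time_s: float, read_time_s: float) -> None:
    """Write one JSON metrics object per pod per checkpoint step."""
    client = storage.Client()
    bucket = client.bucket(gcs_metrics_bucket)
    # Object layout is illustrative; a gs://bucket/prefix option would simply
    # prepend the prefix here.
    blob = bucket.blob(f"{pod_name}/step_{step:06d}.json")
    blob.upload_from_string(
        json.dumps({
            "pod": pod_name,
            "step": step,
            "write_time_s": write_time_s,
            "read_time_s": read_time_s,
            "timestamp": time.time(),
        }),
        content_type="application/json",
    )
```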

MattIrv and others added 3 commits August 20, 2024 14:03
* Configurable number of parameters and save/restore loop.

This adds the following features to the gcs-checkpointing setup:
- Configurable number of parameters by setting the $PARAMETERS env
  variable
- Reloads all checkpoints immediately after saving them
- Adds a configurable minimum step time (when saving checkpoints)
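A minimal, hypothetical sketch of how these three behaviours might fit together (function names and the placeholder state are illustrative, not the actual gcs-checkpointing code):

```python
# Hypothetical sketch of the save/restore loop described above: parameter count
# from $PARAMETERS, an immediate reload after every save, and a configurable
# minimum step time while saving checkpoints.
import os
import time


def run_checkpoint_loop(save_fn, restore_fn, num_steps: int,
                        min_step_time_s: float = 0.0) -> None:
    # Number of model parameters comes from the $PARAMETERS env variable.
    num_parameters = int(os.environ.get("PARAMETERS", "1000000"))
    state = {"params": num_parameters, "step": 0}  # stand-in for real model state

    for step in range(num_steps):
        step_start = time.time()

        save_fn(step, state)        # write the checkpoint
        state = restore_fn(step)    # reload it immediately after saving

        # Enforce the configurable minimum step time.
        elapsed = time.time() - step_start
        if elapsed < min_step_time_s:
            time.sleep(min_step_time_s - elapsed)
```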

* Address code review comments

* Address round 2 of code review
…g workload (#850)

* Use quantized kv values in einsum.

* Add support for proxy backend when running maxengine_server

* Add tp support for MoE

* Incorporate GKE's stopgap solution in the absence of in-place pod restart

* Add enable_model_warmup flag for AOT compilation at model server start

* Install Transformer Engine from github for stable and nightly mode

* Add support for cloud logging in emergency checkpoint manager.

* Adding support for creating maxtext docker images with jax-ss

* Copybara import of the project:

--
a3ed121 by RissyRan <[email protected]>:

Update tp sharding annotation for MoE block

COPYBARA_INTEGRATE_REVIEW=#788 from google:tp_update a3ed121
PiperOrigin-RevId: 653803904

* Llama2 forward pass checker

* support pre-tokenized dataset and fix hf issues

* Fix a Goodput upload to TB bug

* Temp fix for c4_mlperf pipeline

* Adds new remat policy to save out_proj activations during training.

PiperOrigin-RevId: 655745941

* Add wrap argument when creating XAOT topology mesh

* Refactor code

* Fix maxtext profiler failures

* Changes to Copybara to include renamed files

PiperOrigin-RevId: 656444119

* Copybara import of the project:

--
adea1b1 by Param Bole <[email protected]>:

Refactoring Maxtext-JAX-Stable-Stack Build Process

COPYBARA_INTEGRATE_REVIEW=#796 from google:pb_jax_ss_maxtext adea1b1
PiperOrigin-RevId: 656474261

* Refactoring Maxtext-JAX-Stable-Stack Build Process

* Remove pre-commit install from setup.sh

* Internal change

PiperOrigin-RevId: 657990080

* Pin jax stable version > 0.4

* Revert internal change to maxengine

* Add unique_indices to jnp.take

* Removing unused arg PLATFORM

* Add Gemma2 9B Config, Window Attention and attention logic soft cap.

Fixing attention_type type annotation

Fixing trailing-whitespace

Updating 9B test_gemma.sh scripts

Update gemma 9B 2_test_gemma.sh

Updating gemma 9B checkpoint directory

Update default BASE_OUTPUT_PATH in gemma 9B

Remove finetuning script for now as it OOMs on TPU v4

Updating gemma-9B to gemma2-9B

Fixing comment in gemma2-9b

Updating comments on attention kernel and attention_type.

* Removing unused argument PLATFORM from preflight.sh

* Revert "Add unique_indices to jnp.take"

* Consolidate usages of `MultiprocessingOptions` and `AsyncOptions`. Formalize `StandardCheckpointer` as an `AsyncCheckpointer` that doesn't require a `CheckpointArgs` object, and instead allows directly passing the state and extra args.

PiperOrigin-RevId: 660588942

* Remove unused checkpoint import

* Add A3 AoT compilation support using mock GPU client

Support XAOT on GPU

Update mock GPU client API and avoid initializing backend in pyconfig

Allow NoneType in SystemCharacteristics

Fix spacing

Use Optional instead of Union

Make topology_name optional and improve readme

Use dot_product attention in README example

* generates padding batch in hf pipeline to improve data utilization

* Add Gemma2 Support to MaxText: Gemma2 Decoder Layers, Checkpoint Converter, Config Files, Flop Calculation and Run Scripts

* Change scripts for llama2 7b a3 GPU to use llama2_7b_gpu.yml

PiperOrigin-RevId: 662216695

* Copybara import of the project:

--
80a49ff by RissyRan <[email protected]>:

Add instruction for Mixtral

COPYBARA_INTEGRATE_REVIEW=#822 from google:mixtral_instruct 80a49ff
PiperOrigin-RevId: 662219295

* [MLPerf][GPT3] Bypass setting eval_interval in using synthetic dataset

* Removing the registration of the proxy backend used by Pathways.

If needed, this is handled by a separate, Pathways specific utilities package.

* Support AoT in 16-vm GPU Llama2 train script

* Initial grad accumulate with scan

Gradient accumulation

manually tested

globals are sus

Add grad accumulation test

Gradient accumulation config assert

pylint

fixed tests

global and micro batch size

Clean up with microbatches

grad acc

* Add support for local sliding window attention in TPU splash_attention

* add kl divergence and more logging for forward_pass_logit_checker

* Add dropping strategy

* add node attributes; fix GCS upload; add checkpointID

* add header to the checkpointing workload

* add rank to node attributes list

---------

Co-authored-by: Mitali Singh <[email protected]>
Co-authored-by: Shaurya Gupta <[email protected]>
Co-authored-by: maxtext authors <[email protected]>
Co-authored-by: RissyRan <[email protected]>
Co-authored-by: Xuefeng Gu <[email protected]>
Co-authored-by: Vivian Wu <[email protected]>
Co-authored-by: michelle-yooh <[email protected]>
Co-authored-by: Abhinav Singh <[email protected]>
Co-authored-by: Param Bole <[email protected]>
Co-authored-by: khatwanimohit <[email protected]>
Co-authored-by: aireenmei <[email protected]>
Co-authored-by: Dipannita Shaw <[email protected]>
Co-authored-by: ZhiyuLi-goog <[email protected]>
Co-authored-by: Raymond Zou <[email protected]>
Co-authored-by: hengtaoguo <[email protected]>
Co-authored-by: Gagik Amirkhanyan <[email protected]>
Co-authored-by: Colin Gaffney <[email protected]>
Co-authored-by: gobbleturk <[email protected]>
Co-authored-by: Jon Bolin <[email protected]>
Co-authored-by: Zhaoyue Cheng <[email protected]>
Co-authored-by: Luke Baumann <[email protected]>