Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpointing (quasi-Newton solver) #693

Draft
wants to merge 20 commits into
base: develop
Choose a base branch
from
Draft

Checkpointing (quasi-Newton solver) #693

wants to merge 20 commits into from

Conversation

cnpetra
Copy link
Collaborator

@cnpetra cnpetra commented Aug 28, 2024

Adds API for checkpointing and support for restart via user options.

closes #686

  • display iteration number from before restart
  • provide example for the load/save API
  • update user manual
  • metadata

@cnpetra cnpetra marked this pull request as draft August 28, 2024 18:28
Copy link
Member

@tepperly tepperly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check more later.

src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved
Copy link
Member

@tepperly tepperly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all my feedback.

using IndType = sidre::IndexType;
sidre::Group* nlp_group = ds->getRoot()->createGroup("hiop solver");

//create views for each member that needs to be saved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to include a little metadata too such as:

  • the version of HiOp that's writing the file
  • the iteration number?

src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved

constexpr char msgcf[] = "Path to checkpoint file. If present character '#' will be replaced "
"with the iteration number.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation only replaces the first occurrence of '#'.

src/Optimization/hiopAlgFilterIPM.cpp Show resolved Hide resolved
@tepperly
Copy link
Member

tepperly commented Sep 4, 2024

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.

You can get the DataStore from the Group in case there is a need for that.

IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

@cnpetra
Copy link
Collaborator Author

cnpetra commented Sep 5, 2024

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.

You can get the DataStore from the Group in case there is a need for that.

IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

{
using IndType = sidre::IndexType;
const IndType size = vec->get_local_size();
sidre::View* dest = nlp_group->createViewAndAllocate(name, sidre::DOUBLE_ID, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In LiDO, we often use the same DataStore for the whole run. If we were to call this multiple times, I think it will fail because the views would already exist.
Right now this method always makes a new view. It would be better if it checked first if the view already exists. If it already exists in the Group, use that one. If it doesn't already exist, create a new view.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last commit does that. note that an exception will be thrown if the view exists and has a different number of elements than HiOp's state variable.

Copy link
Collaborator

@nychiang nychiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add minor comments

src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.hpp Outdated Show resolved Hide resolved
src/Utils/hiopOptions.cpp Show resolved Hide resolved

}

void hiopAlgFilterIPMQuasiNewton::checkpointing_stuff()
Copy link
Collaborator

@nychiang nychiang Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style Guidelines.
There should be spaces before and after each operator, e.g. line 1675, 1655,...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style Guidelines. There should be spaces before and after each operator, e.g. line 1675, 1680,...

can you be more specific about the guideline?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean space before and after operator "==".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our CMake build that uses BLT auto formats everything using clang-format (and a configuration file) with a make style. Also make check verifies that the code matches the clang-format style configuration, so a PR can't be merged without complying with the style.

src/Optimization/hiopAlgFilterIPM.cpp Show resolved Hide resolved
src/Optimization/hiopAlgFilterIPM.cpp Outdated Show resolved Hide resolved
@tepperly
Copy link
Member

tepperly commented Sep 9, 2024

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.
You can get the DataStore from the Group in case there is a need for that.
IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

The load/save methods offer the most flexibility for LiDO and its clients. I haven't double checked that we have the right HiOp object handle to call them.

@cnpetra
Copy link
Collaborator Author

cnpetra commented Sep 11, 2024

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.
You can get the DataStore from the Group in case there is a need for that.
IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

The load/save methods offer the most flexibility for LiDO and its clients. I haven't double checked that we have the right HiOp object handle to call them.

LiDO for sure instantiates the algorithm class somewhere in there. That "algorithm" object should be used. For example, attach it to the interface. Will provide an example in this PR.

@cnpetra cnpetra changed the title Checkpoiting (quasi-Newton solver) Checkpointing (quasi-Newton solver) Sep 13, 2024
@cnpetra
Copy link
Collaborator Author

cnpetra commented Sep 14, 2024

@tepperly please take a look here and here at the use of the load/save API

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add advanced checkpoint/restart capabilities to Hiop
3 participants