[🐛 BUG] Fix bugs when collecting results from `mp.spawn` in multi-GPU training #1875

ChenglongMa · 2023-09-22T15:39:38Z

Bug description

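Results from the worker processes cannot be collected from the return value of mp.spawn: with the default join=True it returns None, so res below never holds the training results, e.g.,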

Lines 25 to 38 in 96eb311

    
    res = mp.spawn(
        run_recboles,
        args=(
            args.model,
            args.dataset,
            config_file_list,
            args.ip,
            args.port,
            args.world_size,
            args.nproc,
            args.group_offset,
        ),
        nprocs=args.nproc,
    )

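A torch.multiprocessing.Queue can be used for this purpose: each worker puts its result on a shared queue, and the parent process reads the queue after the workers finish.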

Changelog in this PR

Fixed the result collection bug when using mp.spawn;
Updated the documentation for distributed training;
Fixed bugs in significance_test.py.

ChenglongMa added 5 commits September 23, 2023 01:22

Fix bugs when collecting results from mp.spawn

d0ba4d4

Update docs of distributed training

f3d92df

Update run_recboles function

85634a6

Add data type check

d6c1cf2

Add clean-up function

d7fd793

Sherry-XLL requested a review from Ethan-TZ October 7, 2023 03:06

Sherry-XLL assigned Yilu114 Oct 8, 2023

Ethan-TZ self-assigned this Oct 8, 2023

Ethan-TZ merged commit 22d8ad8 into RUCAIBox:master Oct 14, 2023
1 of 2 checks passed
