Distributed Training of the Decoder (#121)

* Converted the decoder trainer to use HuggingFace Accelerate
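
  A minimal sketch of what that conversion looks like (toy model, optimizer, and data are stand-ins, not the repo's actual code):

  ```python
  import torch
  from torch import nn
  from torch.utils.data import DataLoader, TensorDataset
  from accelerate import Accelerator

  # Toy stand-ins for the decoder, its optimizer, and the dataloader
  model = nn.Linear(16, 1)
  opt = torch.optim.Adam(model.parameters(), lr=1e-4)
  data = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

  accelerator = Accelerator()
  # prepare() handles device placement and DDP wrapping across processes
  model, opt, data = accelerator.prepare(model, opt, data)

  for x, y in data:
      loss = nn.functional.mse_loss(model(x), y)
      accelerator.backward(loss)  # replaces loss.backward(); integrates grad scaling
      opt.step()
      opt.zero_grad()
  ```

  Launched with `accelerate launch`, the same script runs unchanged on one GPU or many.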

* Fixed issue where metric evaluation would hang in distributed mode
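
  The usual cause of such hangs is ranks disagreeing on collective ops during evaluation. A hedged sketch of the distributed-safe pattern, continuing the names from the sketch above:

  ```python
  @torch.no_grad()
  def evaluate(model, data):
      model.eval()
      losses = []
      for x, y in data:
          loss = nn.functional.mse_loss(model(x), y)
          # every rank must reach this collective, otherwise the others block forever
          losses.append(accelerator.gather(loss.reshape(1)))
      accelerator.wait_for_everyone()
      return torch.cat(losses).mean().item()
  ```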

* Implemented functional saving
Loading still fails due to an issue with restoring the optimizer state
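
  A sketch of the saving side under Accelerate (the path and dict keys are illustrative): unwrap the DDP wrapper and write from the main process only.

  ```python
  state = {
      "model": accelerator.unwrap_model(model).state_dict(),
      "optimizer": opt.state_dict(),
  }
  accelerator.wait_for_everyone()  # make sure all ranks are in sync first
  if accelerator.is_main_process:
      accelerator.save(state, "decoder_checkpoint.pt")
  ```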

* Fixed issue with loading decoders
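
  The loading counterpart, again illustrative rather than the repo's exact code:

  ```python
  state = torch.load("decoder_checkpoint.pt")
  accelerator.unwrap_model(model).load_state_dict(state["model"])
  opt.load_state_dict(state["optimizer"])  # restoring this was the earlier failure
  ```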

* Fixed issue with tracker config

* Fixed issue with AMP
Updated logging to be more logical
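
  Under Accelerate, AMP is requested at construction time rather than via a hand-rolled GradScaler. A sketch, assuming a recent Accelerate version where mixed precision is a constructor argument:

  ```python
  # AMP handled by Accelerate: autocast wraps the prepared model's forward,
  # and loss scaling happens inside accelerator.backward()
  accelerator = Accelerator(mixed_precision="fp16")
  model, opt, data = accelerator.prepare(model, opt, data)
  ```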

* Saving a checkpoint now saves the position in training as well
Fixed an issue where loading weights into the GPU twice caused it to run out of memory
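
  Illustratively, the checkpoint grows resume-point fields, and the load path stages weights on the CPU so they are never resident on the GPU twice (the live model plus the loaded copy). Key names and values are assumptions:

  ```python
  # Saving: record where training was, alongside the weights
  state = {
      "model": accelerator.unwrap_model(model).state_dict(),
      "optimizer": opt.state_dict(),
      "epoch": 3,    # placeholder resume point
      "step": 1200,
  }
  if accelerator.is_main_process:
      accelerator.save(state, "decoder_checkpoint.pt")

  # Loading: map to CPU first instead of deserializing straight onto the GPU
  state = torch.load("decoder_checkpoint.pt", map_location="cpu")
  accelerator.unwrap_model(model).load_state_dict(state["model"])
  start_epoch, start_step = state["epoch"], state["step"]
  ```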

* Fixed EMA for distributed training
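
  A hedged sketch of EMA that behaves under DDP: keep a shadow copy on the main process only and update it from the unwrapped model, so it tracks the real parameters rather than the wrapper. Names and the decay value are illustrative:

  ```python
  import copy

  ema_decay = 0.999
  ema_model = copy.deepcopy(accelerator.unwrap_model(model)) if accelerator.is_main_process else None

  @torch.no_grad()
  def ema_update():
      if ema_model is None:
          return  # only the main process holds the EMA copy
      src = accelerator.unwrap_model(model)
      for p_ema, p in zip(ema_model.parameters(), src.parameters()):
          p_ema.lerp_(p, 1.0 - ema_decay)  # p_ema <- decay*p_ema + (1-decay)*p
  ```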

* Fixed issue where get_pkg_version was reintroduced
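
  One plausible implementation of a `get_pkg_version` helper, consistent with the `import importlib` added in the diff below; the repo's actual version may differ:

  ```python
  import importlib.metadata

  def get_pkg_version(pkg: str = "dalle2-pytorch") -> str:
      # installed distribution version, without importing the package itself
      return importlib.metadata.version(pkg)
  ```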

* Changed decoder trainer to upload config as a file

Fixed issue where loading the best checkpoint would error
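
  Purely illustrative: with a Weights & Biases tracker, uploading the config as a file rather than as flat key/value pairs could look like this (the project name and config contents are made up):

  ```python
  import json
  import wandb

  run = wandb.init(project="decoder-training")
  with open("decoder_config.json", "w") as f:
      json.dump({"lr": 1e-4, "batch_size": 8}, f)  # placeholder config
  wandb.save("decoder_config.json")                # attach the file to the run
  ```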

Commit 58892135d9 (parent e37072a48c) by Aidan Dempster, committed via GitHub on 2022-06-19 12:25:54 -04:00.
7 changed files with 331 additions and 207 deletions.

@@ -1,4 +1,5 @@
 import time
+import importlib
 # time helpers