0x0. Introduction
This post explores the workflow of training a GPT-2 model based on the Megatron examples in the DeepSpeedExamples repository. It is split into two parts: the first covers training GPT-2 with plain Megatron, and the second covers training Megatron GPT-2 together with DeepSpeed's features. For length reasons this article only contains the first part, which is a very detailed record of the problems I hit while getting Megatron GPT-2 training to run and how I solved them. The writing is based on the codebase linked below.
0x1. Training GPT-2 with Megatron on a single GPU
Start by reading the README at https://github.com/microsoft/deepspeedexamples/tree/bdf8e59aede8c8e0577e8d4d557298ca8515268f/megatron-lm. I ignore the BERT part here; the goal is to get GPT-2 training and inference running.
The README first states that Megatron is a large, powerful transformer codebase used for ongoing research on training large transformer language models at scale. Megatron currently supports model-parallel, multi-node training of GPT-2 and BERT with mixed precision. The codebase can efficiently train a 72-layer, 8.3-billion-parameter GPT-2 language model on 512 GPUs using 8-way model parallelism and 64-way data parallelism. The authors found that this larger language model (the 8.3B GPT-2) surpasses the GPT-2 1.5B WikiText perplexities within only 5 training epochs.
Installing dependencies
First cd into the megatron-lm directory and install the dependencies with pip install -r requirements.txt. Note that requirements.txt lists tensorflow, which is only needed for BERT training; I don't care about that here, so I skip installing TensorFlow. The contents of requirements.txt are:
nltk>=3.4
numpy>=1.15.4
pandas>=0.24.0
sentencepiece>=0.1.8
# tensorflow>=1.12.0
boto3==1.11.11
regex==2020.1.8
The installation fails with:
ERROR: Could not find a version that satisfies the requirement boto3==1.11.11 (from versions: none)
ERROR: No matching distribution found for boto3==1.11.11
I simply installed the latest version with pip install boto3.
Next, following the tutorial, run bash scripts/pretrain_gpt2.sh. This hits a PyTorch error:
ModuleNotFoundError: No module named 'torch._six'
This error comes from a PyTorch API change: torch._six was removed in recent PyTorch releases. A quick search shows it is enough to change the line from torch._six import inf to from torch import inf. Running again, the next error is: AssertionError: make sure to set path for wikipedia data_utils/corpora.py. This is because scripts/pretrain_gpt2.sh sets the training dataset to wikipedia, so path = 'data/wikipedia/wikidump_lines.json' in deepspeedexamples/megatron-lm/data_utils/corpora.py has to point at a locally downloaded copy of the wikipedia data.
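If you want the patch to keep working across PyTorch versions, a minimal sketch is a guarded import in whichever module the traceback points at (which exact file needs it is an assumption here, since it depends on the codebase revision):

# version-tolerant replacement for the failing import
try:
    from torch._six import inf  # older PyTorch still ships torch._six
except ImportError:
    from torch import inf       # torch._six was removed in newer releases

print(inf > 1e308)  # sanity check: prints True either way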
Preparing the training data
While downloading I found the wikipedia dataset far too large, so I switched to the webtext dataset instead. Megatron's README describes this dataset as follows:
"We" use the publicly available OpenWebText (https://github.com/eukaryote31/openwebtext) library, developed by jcpeterson (https://github.com/jcpeterson/openwebtext) and eukaryote31 (https://github.com/eukaryote31/openwebtext), to download URLs. We then filtered, cleaned and deduplicated all the downloaded content following the procedure described in our openwebtext directory. For the content corresponding to Reddit URLs up to October 2018, this yields about 37GB of data. 37GB is still far too much for a quick training run, so I only downloaded the first of the few dozen URL files.
Then copy that file into the openwebtext directory of megatron-lm:
(Screenshot: the downloaded URL file placed under the openwebtext directory)
Next, follow the openwebtext README and run:
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract
git clone https://github.com/mattilyra/lsh
cd lsh
python setup.py install
Installing lsh runs into two problems caused by Python version incompatibilities:
lsh/cMinhash.cpp:21: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
19292 | *type = tstate->exc_type;
This one can be fixed by replacing exc_type with curexc_type.
lsh/cMinhash.cpp:26: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17704 | __pyx_type___pyx_array.tp_print = 0;
This one can be fixed by replacing tp_print with tp_vectorcall_offset.
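A small Python helper for applying both replacements before rerunning python setup.py install is sketched below (assumptions: lsh was cloned into ./lsh, and depending on your Cython/CPython combination the related members exc_value and exc_traceback may need the same curexc_* treatment):

# patch the generated Cython C++ file in place
path = "lsh/cMinhash.cpp"
src = open(path, encoding="utf-8").read()
src = src.replace("tstate->exc_type", "tstate->curexc_type")     # fix for the first error
src = src.replace("tp_print = 0;", "tp_vectorcall_offset = 0;")  # fix for the second error
open(path, "w", encoding="utf-8").write(src)
print("patched", path)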
Next, run the URL deduplication command:
python3 blacklist_urls.py RS_2011-01.bz2.deduped.txt clean_urls.txt
I found that clean_urls.txt was empty after running this. Reading the code shows that the script expects the URL files to be deduplicated to sit inside a directory, and the path of that directory to be passed to the script.
(Screenshot: the blacklist_urls.py code that iterates over the URL files in the given directory)
So create a urls directory in the current folder and put the URL file from before into it, like this:
(Screenshot: the URL file placed inside the newly created urls directory)
Then running python3 blacklist_urls.py urls clean_urls.txt completes the deduplication. Next, use https://github.com/eukaryote31/openwebtext/blob/master/download.py to download the text behind the deduplicated URLs.
(Screenshot: download.py and its command-line arguments)
Downloading everything would take far too long, so I only downloaded the data for 50 URLs as a demonstration. To have each URL's content saved as a json file, change the defaults of --sqlite_meta and --save_uncompressed in download.py to False and True respectively. After that, python3 openwebtext/download.py clean_urls.txt produces a scraped folder, with the text downloaded for each URL stored in its data subfolder.
We then use the following script (merge_jsons.py) to merge every txt file in that folder into a single json file, where each line becomes the content of a text field:
import glob
import sys
import json
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default=".",
                        help="path where all the json files are located")
    parser.add_argument("--output_file", type=str, default="merged_output.json",
                        help="filename where the merged json should go")
    args = parser.parse_args()

    data_path = args.data_path
    out_file = args.output_file

    text_files = glob.glob(data_path + '/*.txt')
    counter = 0

    with open(out_file, 'w') as outfile:
        for fname in text_files:
            counter += 1
            if counter % 1024 == 0:
                print("Merging at ", counter, flush=True)
            with open(fname, 'r') as infile:
                for row in infile:
                    tmp = {}
                    tmp['text'] = row
                    outfile.write(json.dumps(tmp))
                    outfile.write('\n')

    print("Merged file", out_file, flush=True)
Run the script to obtain merged_output.json: python3 merge_jsons.py --data_path deepspeedexamples/megatron-lm/openwebtext/scraped/data.
Next, still in the openwebtext folder, run cleanup_dataset.py to drop every text with fewer than 128 tokens: python3 cleanup_dataset.py merged_output.json merged_cleand.json.
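Conceptually the filtering step does something like the sketch below; the real logic lives in openwebtext/cleanup_dataset.py and uses a proper tokenizer, so the crude whitespace token count here is only an assumption for illustration:

import json

with open("merged_output.json") as fin, open("merged_cleand.json", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        # keep only documents with at least 128 (crudely counted) tokens
        if len(doc["text"].split()) >= 128:
            fout.write(json.dumps(doc) + "\n")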
Detailed training walkthrough and pitfalls
With the data ready, change --train-data in deepspeedexamples/megatron-lm/scripts/pretrain_gpt2.sh to webtext, and set path in the webtext class of deepspeedexamples/megatron-lm/data_utils/corpora.py to the location of the merged_cleand.json we just produced.
In addition, since I'm only using a few dozen documents to demonstrate the training process, the --split argument in deepspeedexamples/megatron-lm/scripts/pretrain_gpt2.sh has to be changed to 400,300,300, i.e. a 4:3:3 ratio for the train, test and validation sets; otherwise the test split would end up with 0 documents. A quick sanity check of this rounding is sketched after this paragraph.
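Roughly speaking, the split string is normalized into fractions and each split size is rounded down (the exact rounding behavior inside data_utils is an assumption here, and the "980,10,10" ratio below is just a hypothetical skewed split for illustration):

def split_counts(num_docs, split):
    # turn a "--split a,b,c" string into per-split document counts
    weights = [float(x) for x in split.split(",")]
    total = sum(weights)
    return [int(num_docs * w / total) for w in weights]

print(split_counts(50, "980,10,10"))    # [49, 0, 0] -> an empty split, training errors out
print(split_counts(50, "400,300,300"))  # [20, 15, 15] -> every split is non-empty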
Training can now be launched with bash scripts/pretrain_gpt2.sh. Here is part of the training log:
setting ds_accelerator to cuda (auto detect)using world size: 1 and model-parallel size: 1 > using dynamic loss scaling> initializing model parallel with size 1pretrain gpt2 modelarguments: pretrained_bert .............. false attention_dropout ............ 0.1 num_attention_heads .......... 16 hidden_size .................. 1024 intermediate_size ............ none num_layers ................... 24 layernorm_epsilon ............ 1e-05 hidden_dropout ............... 0.1 max_position_embeddings ...... 1024 vocab_size ................... 30522 deep_init .................... false make_vocab_size_divisible_by . 128 cpu_optimizer ................ false cpu_torch_adam ............... false fp16 ......................... true fp32_embedding ............... false fp32_layernorm ............... false fp32_tokentypes .............. false fp32_allreduce ............... false hysteresis ................... 2 loss_scale ................... none loss_scale_window ............ 1000 min_scale .................... 1 batch_size ................... 8 weight_decay ................. 0.01 checkpoint_activations ....... true checkpoint_num_layers ........ 1 deepspeed_activation_checkpointing false clip_grad .................... 1.0 train_iters .................. 320000 log_interval ................. 100 exit_interval ................ none seed ......................... 1234 reset_position_ids ........... false reset_attention_mask ......... false lr_decay_iters ............... none lr_decay_style ............... cosine lr ........................... 0.00015 warmup ....................... 0.01 save ......................... checkpoints/gpt2_345m save_interval ................ 5000 no_save_optim ................ false no_save_rng .................. false load ......................... checkpoints/gpt2_345m no_load_optim ................ false no_load_rng .................. false finetune ..................... false resume_dataloader ............ true distributed_backend .......... nccl local_rank ................... none eval_batch_size .............. none eval_iters ................... 100 eval_interval ................ 1000 eval_seq_length .............. none eval_max_preds_per_seq ....... none overlapping_eval ............. 32 cloze_eval ................... false eval_hf ...................... false load_openai .................. false temperature .................. 1.0 top_p ........................ 0.0 top_k ........................ 0 out_seq_length ............... 256 model_parallel_size .......... 1 shuffle ...................... false train_data ................... ['webtext'] use_npy_data_loader .......... false train_data_path .............. val_data_path ................ test_data_path ............... input_data_sizes_file ........ sizes.txt delim ........................ , text_key ..................... sentence eval_text_key ................ none valid_data ................... none split ........................ 400,300,300 test_data .................... none lazy_loader .................. true loose_json ................... false presplit_sentences ........... false num_workers .................. 2 tokenizer_model_type ......... bert-large-uncased tokenizer_path ............... tokenizer.model tokenizer_type ............... gpt2bpetokenizer cache_dir .................... cache use_tfrecords ................ false seq_length ................... 1024 max_preds_per_seq ............ none deepspeed .................... false deepspeed_config ............. none deepscale .................... 
false deepscale_config ............. none deepspeed_mpi ................ false cuda ......................... true rank ......................... 0 world_size ................... 1 dynamic_loss_scale ........... true> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234configuring data> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)> found end-of-document token: 50256building gpt2 model ... > number of parameters on model parallel rank 0: 354871296optimizer = fusedadamlearning rate decaying cosinewarning: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt will not load any checkpoints and will start from randompartition activations false and correctness check false iteration 100/ 320000 | elapsed time per iteration (ms): 963.3 | learning rate 3.937e-06 | lm loss 8.995377e+00 | loss scale 131072.0 |/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved warnings.warn(after 100 iterations memory (mb) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0time (ms) | forward: 276.11 | backward: 672.99 | allreduce: 13.96 | optimizer: 14.00 | batch generator: 5.22 | data loader: 4.53 iteration 200/ 320000 | elapsed time per iteration (ms): 950.6 | learning rate 8.625e-06 | lm loss 3.041360e+00 | loss scale 131072.0 |time (ms) | forward: 259.24 | backward: 674.56 | allreduce: 13.45 | optimizer: 16.63 | batch generator: 0.78 | data loader: 0.14
The nvidia-smi output also showed the Megatron training job running on GPU 0.
During training you may hit the StopIteration error below. It appears when the data iterator runs out of samples, which is easy to trigger here because our demo corpus is tiny compared with the default train_iters of 320000; reducing --train-iters drastically (the checkpoint used for inference further down was saved at iteration 600) avoids it:
time (ms) | forward: 259.07 | backward: 671.87 | allreduce: 13.03 | optimizer: 16.64 | batch generator: 0.76 | data loader: 0.13╭─────────────────────────────── traceback (most recent call last) ────────────────────────────────╮│ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/pretrain_gpt2.py:713 in ││ ││ 710 ││ 711 ││ 712 if __name__ == __main__: ││ ❱ 713 │ main() ││ 714 ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/pretrain_gpt2.py:686 in main ││ ││ 683 │ iteration = 0 ││ 684 │ if args.train_iters > 0: ││ 685 │ │ if args.do_train: ││ ❱ 686 │ │ │ iteration, skipped = train(model, optimizer, ││ 687 │ │ │ │ │ │ │ │ │ lr_scheduler, ││ 688 │ │ │ │ │ │ │ │ │ train_data_iterator, ││ 689 │ │ │ │ │ │ │ │ │ val_data_iterator, ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/pretrain_gpt2.py:415 in train ││ ││ 412 │ report_memory_flag = true ││ 413 │ while iteration using dynamic loss scaling> initializing model parallel with size 1> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234prepare tokenizer donebuilding gpt2 model ... > number of parameters on model parallel rank 0: 354823168global rank 0 is loading checkpoint /home/zhangxiaoyu/deepspeedexamples/megatron-lm/checkpoints/gpt2_345m/iter_0000600/mp_rank_00/model_optim_rng.pt╭─────────────────────────────── traceback (most recent call last) ────────────────────────────────╮│ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/generate_samples.py:277 in ││ ││ 274 ││ 275 ││ 276 if __name__ == __main__: ││ ❱ 277 │ main() ││ 278 ││ 279 ││ 280 ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/generate_samples.py:267 in main ││ ││ 264 │ tokenizer = prepare_tokenizer(args) ││ 265 │ ││ 266 │ # model, optimizer, and learning rate. ││ ❱ 267 │ model = setup_model(args) ││ 268 │ ││ 269 │ #setting default batch size to 1 ││ 270 │ args.batch_size = 1 ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/generate_samples.py:80 in setup_model ││ ││ 77 │ model = get_model(args) ││ 78 │ ││ 79 │ if args.load is not none: ││ ❱ 80 │ │ _ = load_checkpoint( ││ 81 │ │ │ model, none, none, args) ││ 82 │ ││ 83 │ return model ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/utils.py:305 in load_checkpoint ││ ││ 302 │ │ ││ 303 │ │ # model. ││ 304 │ │ try: ││ ❱ 305 │ │ │ model.load_state_dict(sd['model']) ││ 306 │ │ except keyerror: ││ 307 │ │ │ print_rank_0('a metadata file exists but unable to load model ' ││ 308 │ │ │ │ │ │ 'from checkpoint {}, exiting'.format(checkpoint_name)) ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/model/distributed.py:90 in load_state_dict ││ ││ 87 │ │ return sd ││ 88 │ ││ 89 │ def load_state_dict(self, state_dict, strict=true): ││ ❱ 90 │ │ self.module.load_state_dict(state_dict, strict=strict) ││ 91 │ ││ 92 │ ''' ││ 93 │ def _sync_buffers(self): ││ ││ /home/zhangxiaoyu/deepspeedexamples/megatron-lm/fp16/fp16.py:71 in load_state_dict ││ ││ 68 │ │ return self.module.state_dict(destination, prefix, keep_vars) ││ 69 │ ││ 70 │ def load_state_dict(self, state_dict, strict=true): ││ ❱ 71 │ │ self.module.load_state_dict(state_dict, strict=strict) ││ 72 ││ 73 # todo: update overflow check + downscale to use carl's fused kernel. 
││ 74 class fp16_optimizer(object): ││ ││ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:20 ││ 41 in load_state_dict ││ ││ 2038 │ │ │ │ │ │ ', '.join('{}'.format(k) for k in missing_keys))) ││ 2039 │ │ ││ 2040 │ │ if len(error_msgs) > 0: ││ ❱ 2041 │ │ │ raise runtimeerror('error(s) in loading state_dict for {}: {}'.format( ││ 2042 │ │ │ │ │ │ │ self.__class__.__name__, .join(error_msgs))) ││ 2043 │ │ return _incompatiblekeys(missing_keys, unexpected_keys) ││ 2044 │╰──────────────────────────────────────────────────────────────────────────────────────────────────╯runtimeerror: error(s) in loading state_dict for gpt2model: size mismatch for word_embeddings.weight: copying a param with shape torch.size([50304, 1024]) from checkpoint, the shape in current model is torch.size([50257, 1024]).
The second half of the log above comes from the inference step: running bash scripts/generate_text.sh after training loads the saved checkpoint, and load_state_dict reports a shape mismatch for word_embeddings.weight, which is torch.Size([50304, 1024]) in the checkpoint but torch.Size([50257, 1024]) in the freshly built model. word_embeddings is simply an embedding table of shape vocab_size x hidden_size.
So the problem is that vocab_size differs between training and inference. Digging further, the training path pads the token count num_tokens so that it is divisible by args.make_vocab_size_divisible_by=128 (times the model-parallel size), while the generation script applies no such padding, hence the mismatched embedding shape. The fix is to change how deepspeedexamples/megatron-lm/generate_samples.py computes num_tokens so that it matches training; the padding rule is sketched below.
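Roughly, the padding applied at training time looks like this (a sketch of the rule rather than the exact Megatron code); the printed values line up with the "padded vocab" lines in the training logs above:

def pad_vocab_size(num_tokens, make_vocab_size_divisible_by=128, model_parallel_size=1):
    # round num_tokens up to a multiple of (divisible_by * model_parallel_size)
    multiple = make_vocab_size_divisible_by * model_parallel_size
    while num_tokens % multiple != 0:
        num_tokens += 1
    return num_tokens

print(pad_vocab_size(50257))                         # 50304 (47 dummy tokens), single GPU
print(pad_vocab_size(50257, model_parallel_size=2))  # 50432 (175 dummy tokens), 2-way model parallel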
Run bash scripts/generate_text.sh again and we can chat with GPT-2: give it a prompt and the model produces different completions; type stop to end the conversation.
Since this model was trained on very little data purely as a demo, the completions are basically useless; with more training data you could train a much better conversational GPT-2.
0x3. Estimating parameter count and GPU memory
The article at https://zhuanlan.zhihu.com/p/624740065 derives the parameter count and training memory footprint of GPT-2-style transformers; here we plug our current GPT-2 configuration into the formulas it summarizes.
Estimating the parameter count
Using the formula params ≈ 12·l·h² with l=24 layers and hidden_size h=1024: 12 × 24 × 1024 × 1024 = 301,989,888 ≈ 0.3B. So the GPT-2 we are training has roughly 0.3B parameters in its transformer blocks, which matches the model's name (345M) once the embeddings are added; the training log's "number of parameters on model parallel rank 0: 354871296" is the exact count.
Estimating training memory usage
By the same article's accounting, the model parameters, gradients and optimizer states take about 20 bytes per parameter during fp16 mixed-precision training (fp16 and fp32 copies of the weights and gradients, plus the two fp32 Adam states), i.e. 301989888 × 20 bytes = 6039797760 bytes = 5898240 KB = 5760 MB ≈ 5.6 GB. The activations are estimated as follows:
During training we have batch_size b=8, s=1024, h=1024, a=num-attention-heads=16 and l=24, so the activations come to roughly l × (34·b·s·h + 5·a·b·s²) bytes = 24 × (34×8×1024×1024 + 5×16×8×1024²) bytes ≈ 21 GB. The arithmetic is spelled out below.
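Spelling out the numbers (the per-layer activation term 34·b·s·h + 5·a·b·s² is taken from the zhihu article referenced above):

b, s, h, a, l = 8, 1024, 1024, 16, 24
params = 12 * l * h * h

states_bytes = params * 20                            # weights + grads + Adam states
act_bytes = l * (34 * b * s * h + 5 * a * b * s * s)  # per-layer activations x l layers

print(states_bytes, round(states_bytes / 1024**3, 1))  # ~5.6 GB
print(act_bytes, round(act_bytes / 1024**3, 1))        # ~21.4 GB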
So training this 0.3B GPT-2 should need roughly 5.6 GB + 21 GB = 26.6 GB. But as seen in section 0x1, the card only has 24 GB of memory, and the training run consumed only 15107 MiB ≈ 14.75 GB, which means the activations actually took about 14.75 − 5.6 = 9.15 GB rather than the estimated 21 GB. Why?
The reason is that deepspeedexamples/megatron-lm/scripts/pretrain_gpt2.sh passes --checkpoint-activations, which turns on activation checkpointing. The relevant code is in deepspeedexamples/megatron-lm/mpu/transformer.py:406-413:
(Screenshot: the activation-checkpointing code in deepspeedexamples/megatron-lm/mpu/transformer.py:406-413)
With this enabled, every transformer layer can skip storing the intermediate activations inside self-attention and the MLP that backward would otherwise need; they are recomputed during the backward pass instead, which is how the memory saving is achieved.
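As a minimal sketch of the idea (Megatron's real implementation wraps chunks of layers with its own mpu checkpoint helper; the toy layer and shapes below are assumptions purely for illustration):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class ToyLayer(nn.Module):
    """Stand-in for a transformer layer: an attention-like projection plus an MLP."""
    def __init__(self, h=256):
        super().__init__()
        self.attn_like = nn.Linear(h, h)
        self.mlp = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):
        return x + self.mlp(self.attn_like(x))

layers = nn.ModuleList(ToyLayer() for _ in range(4))
x = torch.randn(2, 128, 256, requires_grad=True)
for layer in layers:
    # only the layer-boundary activation is kept; everything inside the layer
    # is recomputed during backward, trading extra compute for less memory
    x = checkpoint(layer, x, use_reentrant=False)
x.sum().backward()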
0x4. Training GPT-2 with Megatron on multiple GPUs
Data parallelism on 2 GPUs
With single-GPU GPT-2 training done, launching multi-GPU training is straightforward. In deepspeedexamples/megatron-lm/scripts/pretrain_gpt2_distributed.sh, change --train-data to webtext and --train-iters to 600/num_gpus. This script launches data-parallel training, so setting the iteration count to 600/num_gpus sweeps the same amount of data as the single-GPU run, since each iteration now processes num_gpus times as many samples. The train/validation/test split also has to be adjusted, because with this tiny demo dataset the original ratio would leave the test set with 0 documents and crash. Finally, set gpus_per_node to 2 to run data-parallel training on 2 GPUs. Then launch training with bash scripts/pretrain_gpt2_distributed.sh; the log follows:
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py futurewarning: the module torch.distributed.launch is deprecatedand will be removed in future. use torchrun.note that --use-env is set by default in torchrun.if your script expects `--local-rank` argument to be set, pleasechange it to read from `os.environ['local_rank']` instead. see https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions warnings.warn(warning*****************************************setting omp_num_threads environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. *****************************************setting ds_accelerator to cuda (auto detect)setting ds_accelerator to cuda (auto detect)using world size: 2 and model-parallel size: 1 > using dynamic loss scaling> initializing model parallel with size 1pretrain gpt2 modelarguments: pretrained_bert .............. false attention_dropout ............ 0.1 num_attention_heads .......... 16 hidden_size .................. 1024 intermediate_size ............ none num_layers ................... 24 layernorm_epsilon ............ 1e-05 hidden_dropout ............... 0.1 max_position_embeddings ...... 1024 vocab_size ................... 30522 deep_init .................... false make_vocab_size_divisible_by . 128 cpu_optimizer ................ false cpu_torch_adam ............... false fp16 ......................... true fp32_embedding ............... false fp32_layernorm ............... false fp32_tokentypes .............. false fp32_allreduce ............... false hysteresis ................... 2 loss_scale ................... none loss_scale_window ............ 1000 min_scale .................... 1 batch_size ................... 8 weight_decay ................. 0.01 checkpoint_activations ....... true checkpoint_num_layers ........ 1 deepspeed_activation_checkpointing false clip_grad .................... 1.0 train_iters .................. 300 log_interval ................. 100 exit_interval ................ none seed ......................... 1234 reset_position_ids ........... false reset_attention_mask ......... false lr_decay_iters ............... none lr_decay_style ............... cosine lr ........................... 0.00015 warmup ....................... 0.01 save ......................... checkpoints/gpt2_345m save_interval ................ 5000 no_save_optim ................ false no_save_rng .................. false load ......................... checkpoints/gpt2_345m no_load_optim ................ false no_load_rng .................. false finetune ..................... false resume_dataloader ............ true distributed_backend .......... nccl local_rank ................... 0 eval_batch_size .............. none eval_iters ................... 100 eval_interval ................ 1000 eval_seq_length .............. none eval_max_preds_per_seq ....... none overlapping_eval ............. 32 cloze_eval ................... false eval_hf ...................... false load_openai .................. false temperature .................. 1.0 top_p ........................ 0.0 top_k ........................ 0 out_seq_length ............... 256 model_parallel_size .......... 1 shuffle ...................... false train_data ................... ['webtext'] use_npy_data_loader .......... false train_data_path .............. val_data_path ................ 
test_data_path ............... input_data_sizes_file ........ sizes.txt delim ........................ , text_key ..................... sentence eval_text_key ................ none valid_data ................... none split ........................ 400,300,300 test_data .................... none lazy_loader .................. true loose_json ................... false presplit_sentences ........... false num_workers .................. 2 tokenizer_model_type ......... bert-large-uncased tokenizer_path ............... tokenizer.model tokenizer_type ............... gpt2bpetokenizer cache_dir .................... cache use_tfrecords ................ false seq_length ................... 1024 max_preds_per_seq ............ none deepspeed .................... false deepspeed_config ............. none deepscale .................... false deepscale_config ............. none deepspeed_mpi ................ false cuda ......................... true rank ......................... 0 world_size ................... 2 dynamic_loss_scale ........... true> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234configuring data> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)> found end-of-document token: 50256building gpt2 model ... > number of parameters on model parallel rank 0: 354871296optimizer = fusedadamoptimizer = fusedadamlearning rate decaying cosinewarning: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt will not load any checkpoints and will start from randompartition activations false and correctness check false iteration 100/ 300 | elapsed time per iteration (ms): 1048.5 | learning rate 1.258e-04 | lm loss 4.799004e+00 | loss scale 32768.0 |/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved warnings.warn(after 100 iterations memory (mb) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0time (ms) | forward: 284.78 | backward: 749.95 | allreduce: 93.32 | optimizer: 13.60 | batch generator: 14.88 | data loader: 14.19 iteration 200/ 300 | elapsed time per iteration (ms): 1020.9 | learning rate 5.257e-05 | lm loss 7.708308e-02 | loss scale 32768.0 |time (ms) | forward: 256.87 | backward: 747.37 | allreduce: 93.08 | optimizer: 16.52 | batch generator: 0.71 | data loader: 0.11 iteration 300/ 300 | elapsed time per iteration (ms): 1018.4 | learning rate 1.806e-06 | lm loss 4.669175e-03 | loss scale 32768.0 |time (ms) | forward: 256.74 | backward: 744.96 | allreduce: 93.51 | optimizer: 16.53 | batch generator: 0.73 | data loader: 
0.12-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- validation loss at the end of training for val data | lm loss: 1.170473e+01 | lm ppl: 1.211437e+05----------------------------------------------------------------------------------------------------global rank 0 is saving checkpoint at iteration 300 to checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.pt/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py userwarning: positional args are being deprecated, use kwargs instead. refer to https://pytorch.org/docs/master/generated/torch.nn.module.html#torch.nn.module.state_dict for details. warnings.warn( successfully saved checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.ptevaluating iter 100/100--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- validation loss at the end of training for test data | lm loss: 1.169765e+01 | lm ppl: 1.202885e+05-----------------------------------------------------------------------------------------------------
Looking at the GPU memory usage: since this is data parallelism, each card's memory consumption is about the same as in single-GPU training.
Inference with the checkpoint trained via data parallelism also runs fine:
(Screenshot: sample text generated from the data-parallel checkpoint)
Model parallelism on 2 GPUs
We use deepspeedexamples/megatron-lm/scripts/pretrain_gpt2_model_parallel.sh for 2-GPU model-parallel training. Besides the same changes as in the 2-GPU data-parallel case, we also need to remove the --deepspeed flag from this script, because actually using DeepSpeed additionally requires a DeepSpeed config file. The DeepSpeed-related training features are left for the next post.
Launch the 2-GPU model-parallel training with bash scripts/pretrain_gpt2_model_parallel.sh. The log:
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py futurewarning: the module torch.distributed.launch is deprecatedand will be removed in future. use torchrun.note that --use-env is set by default in torchrun.if your script expects `--local-rank` argument to be set, pleasechange it to read from `os.environ['local_rank']` instead. see https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions warnings.warn(warning*****************************************setting omp_num_threads environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. *****************************************setting ds_accelerator to cuda (auto detect)setting ds_accelerator to cuda (auto detect)using world size: 2 and model-parallel size: 2 > using dynamic loss scaling> initializing model parallel with size 2pretrain gpt2 modelarguments: pretrained_bert .............. false attention_dropout ............ 0.1 num_attention_heads .......... 16 hidden_size .................. 1024 intermediate_size ............ none num_layers ................... 24 layernorm_epsilon ............ 1e-05 hidden_dropout ............... 0.1 max_position_embeddings ...... 1024 vocab_size ................... 30522 deep_init .................... false make_vocab_size_divisible_by . 128 cpu_optimizer ................ false cpu_torch_adam ............... false fp16 ......................... true fp32_embedding ............... false fp32_layernorm ............... false fp32_tokentypes .............. false fp32_allreduce ............... false hysteresis ................... 2 loss_scale ................... none loss_scale_window ............ 1000 min_scale .................... 1 batch_size ................... 8 weight_decay ................. 0.01 checkpoint_activations ....... true checkpoint_num_layers ........ 1 deepspeed_activation_checkpointing false clip_grad .................... 1.0 train_iters .................. 600 log_interval ................. 100 exit_interval ................ none seed ......................... 1234 reset_position_ids ........... false reset_attention_mask ......... false lr_decay_iters ............... none lr_decay_style ............... cosine lr ........................... 0.00015 warmup ....................... 0.01 save ......................... checkpoints/gpt2_345m_mp2 save_interval ................ 5000 no_save_optim ................ false no_save_rng .................. false load ......................... checkpoints/gpt2_345m_mp2 no_load_optim ................ true no_load_rng .................. false finetune ..................... false resume_dataloader ............ true distributed_backend .......... nccl local_rank ................... 0 eval_batch_size .............. none eval_iters ................... 100 eval_interval ................ 1000 eval_seq_length .............. none eval_max_preds_per_seq ....... none overlapping_eval ............. 32 cloze_eval ................... false eval_hf ...................... false load_openai .................. false temperature .................. 1.0 top_p ........................ 0.0 top_k ........................ 0 out_seq_length ............... 256 model_parallel_size .......... 2 shuffle ...................... false train_data ................... ['webtext'] use_npy_data_loader .......... false train_data_path .............. val_data_path ................ 
test_data_path ............... input_data_sizes_file ........ sizes.txt delim ........................ , text_key ..................... sentence eval_text_key ................ none valid_data ................... none split ........................ 400,300,300 test_data .................... none lazy_loader .................. true loose_json ................... false presplit_sentences ........... false num_workers .................. 2 tokenizer_model_type ......... bert-large-uncased tokenizer_path ............... tokenizer.model tokenizer_type ............... gpt2bpetokenizer cache_dir .................... none use_tfrecords ................ false seq_length ................... 1024 max_preds_per_seq ............ none deepspeed .................... false deepspeed_config ............. none deepscale .................... false deepscale_config ............. none deepspeed_mpi ................ false cuda ......................... true rank ......................... 0 world_size ................... 2 dynamic_loss_scale ........... true> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234configuring data> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)> found end-of-document token: 50256building gpt2 model ... > number of parameters on model parallel rank 0: 178100224 > number of parameters on model parallel rank 1: 178100224optimizer = fusedadamlearning rate decaying cosinewarning: could not find the metadata file checkpoints/gpt2_345m_mp2/latest_checkpointed_iteration.txt will not load any checkpoints and will start from randomoptimizer = fusedadampartition activations false and correctness check falses iteration 100/ 600 | elapsed time per iteration (ms): 810.9 | learning rate 1.444e-04 | lm loss 5.023855e+00 | loss scale 8192.0 |/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py futurewarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved warnings.warn(after 100 iterations memory (mb) | allocated: 3447.24365234375 | max allocated: 6237.830078125 | cached: 7890.0 | max cached: 7890.0time (ms) | forward: 252.44 | backward: 550.96 | allreduce: 12.11 | optimizer: 7.26 | batch generator: 7.15 | data loader: 6.35 iteration 200/ 600 | elapsed time per iteration (ms): 844.2 | learning rate 1.210e-04 | lm loss 1.112287e-01 | loss scale 8192.0 |time (ms) | forward: 242.53 | backward: 589.63 | allreduce: 11.37 | optimizer: 10.92 | batch generator: 4.28 | data loader: 2.71 iteration 300/ 600 | elapsed time per iteration (ms): 824.7 | learning rate 8.518e-05 | lm loss 8.868908e-03 | loss scale 8192.0 |time (ms) | forward: 240.10 | backward: 572.66 | allreduce: 11.63 | optimizer: 11.32 | batch generator: 3.64 | data loader: 2.12 iteration 400/ 600 | elapsed time per iteration (ms): 790.5 | learning rate 4.666e-05 | lm loss 2.208042e-03 | 
loss scale 8192.0 |time (ms) | forward: 233.81 | backward: 547.29 | allreduce: 11.90 | optimizer: 9.11 | batch generator: 1.16 | data loader: 0.21 iteration 500/ 600 | elapsed time per iteration (ms): 792.8 | learning rate 1.574e-05 | lm loss 8.129998e-04 | loss scale 8192.0 |time (ms) | forward: 234.04 | backward: 549.56 | allreduce: 13.62 | optimizer: 9.02 | batch generator: 0.91 | data loader: 0.16 iteration 600/ 600 | elapsed time per iteration (ms): 787.7 | learning rate 6.939e-07 | lm loss 6.003926e-04 | loss scale 8192.0 |time (ms) | forward: 234.25 | backward: 544.30 | allreduce: 10.23 | optimizer: 9.00 | batch generator: 0.83 | data loader: 0.12-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- validation loss at the end of training for val data | lm loss: 1.231077e+01 | lm ppl: 2.220759e+05----------------------------------------------------------------------------------------------------global rank 1 is saving checkpoint at iteration 600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.ptglobal rank 0 is saving checkpoint at iteration 600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.pt/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py userwarning: positional args are being deprecated, use kwargs instead. refer to https://pytorch.org/docs/master/generated/torch.nn.module.html#torch.nn.module.state_dict for details. warnings.warn(/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py userwarning: positional args are being deprecated, use kwargs instead. refer to https://pytorch.org/docs/master/generated/torch.nn.module.html#torch.nn.module.state_dict for details. warnings.warn( successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.pt successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.ptevaluating iter 100/100--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- validation loss at the end of training for test data | lm loss: 1.215604e+01 | lm ppl: 1.902403e+05-----------------------------------------------------------------------------------------------------
GPU memory usage:
(Screenshot: nvidia-smi showing roughly 9 GB used on each of the two GPUs)
Because the model parameters are now split across the two GPUs, the peak per-GPU memory usage drops from around 15 GB with data parallelism to around 9 GB. A rough consistency check against the log is sketched below.
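Using the per-rank parameter count printed in the model-parallel log above, the parameter/gradient/optimizer term alone shrinks accordingly (the activation side is harder to pin down exactly, so this is only a ballpark check):

params_per_rank = 178100224                 # "number of parameters on model parallel rank 0/1"
states_gb = params_per_rank * 20 / 1024**3  # ~20 bytes per parameter, as before
print(round(states_gb, 2))                  # ~3.32 GB per GPU, vs ~5.6 GB on a single GPU
# with the attention/MLP activations also split between the two ranks,
# a peak of ~9 GB per GPU (vs ~15 GB before) is in the right ballpark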
If you try to run inference directly with this checkpoint, loading fails because the parameters don't match the model definition. This version of the Megatron code simply doesn't handle loading checkpoints produced by model-parallel training for generation, so the only way would be to merge the two model-parallel shards into one complete single-GPU model that Megatron can load for inference.
However, the Megatron-LM source used in this post doesn't ship a model-merging tool either, so I won't run inference on the model-parallel checkpoint here. If you want to do that, the easiest route is to train and infer with NVIDIA's latest Megatron-LM code: it supports not only model parallelism but also pipeline parallelism, and it can load models trained with any combination of the two for inference. In addition, official Megatron provides a tool to convert a checkpoint trained with arbitrary model-parallel and pipeline-parallel sizes into one with user-specified sizes (https://github.com/nvidia/megatron-lm/tree/main#evaluation-and-tasks), as shown below:
(Screenshot: the relevant checkpoint-conversion section of the NVIDIA Megatron-LM README)