A Summary of 8 PyTorch Speed-Up Tips

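Overview
This article summarizes 8 PyTorch speed-up tips: the hardware level, how to test for bottlenecks in the training process, image decoding, data augmentation acceleration, data prefetch, multi-GPU parallel processing, mixed precision training, and other details.
The speed of training on large datasets is affected by many factors, and because the datasets are large, the time saved by each optimization should not be underestimated. On the hardware side, the CPU, memory size, GPU, and HDD or SSD storage all have some impact. On the software side, PyTorch's own DataLoader is sometimes not enough and extra work is needed: strategies such as mixed precision, data prefetching, multi-threaded data loading, and multi-GPU parallelism can also speed up the whole pipeline considerably. So when should you adopt the strategies in this article? When GPU memory is clearly full but GPU utilization stays low. This article collects the resources I have found. Since GPU utilization in my own training is already high, I did not run experiments myself; see the experiments by other authors in the references.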
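1. Hardware level
For the CPU, look for a high clock frequency and a large cache; the core count is also an important parameter. Choose a graphics card with as much GPU memory as possible so that it can handle large-batch training; multiple cards are of course better. For RAM, 64 GB (four 16 GB sticks filling every slot) is definitely enough. The motherboard must keep up, otherwise even the best CPU can hardly deliver its full performance. The power supply must be adequate: GPUs place demands on power while running, and an insufficient power supply hurts performance noticeably under full load. For storage, keep the data on an SSD if possible; SSD and HDD read speeds during training are not of the same order of magnitude. In my own test, the same code ran 10 times faster after the data was moved from an HDD to an SSD. For the operating system, Ubuntu is fine (for lab use). How can you monitor resource usage on Ubuntu in real time?
GPU: use watch -n 1 nvidia-smi for live monitoring
IO: use the iostat command
CPU: use the htop command
My knowledge of hardware is limited; additions are welcome, and please go easy on any mistakes.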
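The symptom described in the overview (GPU memory full, GPU utilization low) can also be checked from a script. The following is a minimal sketch, assuming nvidia-smi is on the PATH and reports numeric values; the function names and the 80% / 50% thresholds are illustrative assumptions, not recommendations from the original article.

```python
import subprocess

# Fields requested from nvidia-smi; the output is one CSV line per GPU.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def looks_input_bound(stat, mem_ratio=0.8, util_pct=50.0):
    """Heuristic with assumed thresholds: memory mostly full, compute mostly idle."""
    memory_full = stat["mem_used_mib"] / stat["mem_total_mib"] >= mem_ratio
    return memory_full and stat["util_pct"] < util_pct


def query_gpus():
    """Run nvidia-smi once and return the parsed statistics (requires a GPU driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Demonstration on a hand-written sample, so it runs without a GPU.
sample = "12, 10240, 11264\n95, 8000, 11264\n"
for gpu in parse_gpu_stats(sample):
    print(gpu, looks_input_bound(gpu))
```

On a machine with a GPU, calling query_gpus() in a loop with a short sleep gives roughly the same information as watch -n 1 nvidia-smi, in a form that can be logged next to training metrics.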
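2. How to test for bottlenecks in the training process
If the program currently runs slowly, how do you determine where the bottleneck is? PyTorch provides a tool that makes it easy to see how much time each part of your code takes.
Bottleneck testing: https://pytorch.org/docs/stable/bottleneck.html
PyTorch's bottleneck tool is used as follows:
    python -m torch.utils.bottleneck /path/to/source/script.py [args]
See the link above for details. Alternatively, a tool such as cProfile can locate the bottleneck. First run the following command:
    python -m cProfile -o 100_percent_gpu_utilization.prof train.py
This produces the file 100_percent_gpu_utilization.prof. Visualize it (this uses the snakeviz package; pip install snakeviz is enough):
    snakeviz 100_percent_gpu_utilization.prof
The visualization is an analysis chart that opens in the browser.
Other methods: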
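One further option, when no browser is at hand, is to read the profile with the standard-library pstats module. The sketch below profiles a stand-in function (slow_step is a hypothetical placeholder for one training iteration) and prints the entries with the largest cumulative time.

```python
import cProfile
import io
import pstats


def slow_step(n=20000):
    """Stand-in for one training iteration (hypothetical workload)."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_step)
print(report)
```

A .prof file written by python -m cProfile -o can be loaded the same way, by passing its file name to pstats.Stats instead of the profiler object.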
    # Profile CPU bottlenecks
    python -m cProfile training_script.py --profiling
    # Profile GPU bottlenecks
    nvprof --print-gpu-trace python train_mnist.py
    # Profile system call bottlenecks
    strace -fcT -e trace=open,close,read python training_script.py
The following code can also be used for analysis:
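    import torch
    import torch.nn as nn


    def test_loss_profiling():
        loss = nn.BCEWithLogitsLoss()
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            input = torch.randn((8, 1, 128, 128)).cuda()
            input.requires_grad = True
            # high=2 so that the binary targets are 0 or 1
            target = torch.randint(2, (8, 1, 128, 128)).cuda().float()
            for i in range(10):
                l = loss(input, target)
                l.backward()
        print(prof.key_averages().table(sort_by="self_cpu_time_total"))

3. Image decoding
PyTorch (through torchvision) decodes images with Pillow by default, which is somewhat less efficient than OpenCV. If all the images are JPEG, consider decoding with the TurboJPEG library. The speed comparison is shown in the figure below:
Comparison of image decoding across libraries (image source: 德澎)
For reading JPEG you can also consider the jpeg4py library (pip install jpeg4py); rewriting a loader is all it takes. Storing images as BMP also reduces decoding time, and other options include the recordio, hdf5, pth, n5, and lmdb formats.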
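Since decoder speed depends on the machine and the images, it can be worth measuring on your own data. The sketch below is a small, decoder-agnostic timing harness; time_decoders and available_decoders are illustrative names, and the OpenCV and Pillow entries are only registered if those packages happen to be installed.

```python
import time


def time_decoders(decoders, payloads, repeats=3):
    """Return {name: best seconds per item} for each decode callable."""
    results = {}
    for name, decode in decoders.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for payload in payloads:
                decode(payload)
            best = min(best, time.perf_counter() - start)
        results[name] = best / max(len(payloads), 1)
    return results


def available_decoders():
    """Collect whichever decoders are installed; every import is optional."""
    decoders = {}
    try:
        import cv2
        import numpy as np

        decoders["opencv"] = lambda b: cv2.imdecode(
            np.frombuffer(b, dtype=np.uint8), cv2.IMREAD_COLOR
        )
    except ImportError:
        pass
    try:
        import io
        from PIL import Image

        # convert() forces the actual decode; Image.open alone is lazy.
        decoders["pillow"] = lambda b: Image.open(io.BytesIO(b)).convert("RGB")
    except ImportError:
        pass
    return decoders
```

To use it, read a handful of JPEG files into memory as bytes and call time_decoders(available_decoders(), payloads). Because the files are read beforehand, the numbers reflect decoding only, not disk speed.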
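4. Data augmentation acceleration
In PyTorch, data augmentation for image classification is usually done with torchvision transforms, which run operations such as crop, flip, and jitter on the CPU. If you observe that CPU utilization is very high while GPU utilization is low, the bottleneck is CPU preprocessing, and you can use NVIDIA's DALI library to do this part of the augmentation on the GPU.
DALI link: https://github.com/nvidia/dali
The documentation is also very detailed:
DALI documentation: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/index.html
Of course, the operations DALI provides are fairly limited: it implements only the common methods, and newer ones such as cutout you have to implement yourself. For a concrete implementation, see: https://zhuanlan.zhihu.com/p/77633542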
5. Data prefetch
The solution provided in NVIDIA Apex
Reference: https://zhuanlan.zhihu.com/p/66145913
The strategy Apex provides is to prefetch the data needed by the next iteration.
    import torch


    class data_prefetcher():
        def __init__(self, loader):
            self.loader = iter(loader)
            self.stream = torch.cuda.Stream()
            self.mean = torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255]).cuda().view(1, 3, 1, 1)
            self.std = torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255]).cuda().view(1, 3, 1, 1)
            # With Amp, it isn't necessary to manually convert data to half.
            # if args.fp16:
            #     self.mean = self.mean.half()
            #     self.std = self.std.half()
            self.preload()

        def preload(self):
            try:
                self.next_input, self.next_target = next(self.loader)
            except StopIteration:
                self.next_input = None
                self.next_target = None
                return
            # Copy and normalize the next batch on a side stream
            with torch.cuda.stream(self.stream):
                self.next_input = self.next_input.cuda(non_blocking=True)
                self.next_target = self.next_target.cuda(non_blocking=True)
                # With Amp, it isn't necessary to manually convert data to half.
                # if args.fp16:
                #     self.next_input = self.next_input.half()
                # else:
                self.next_input = self.next_input.float()
                self.next_input = self.next_input.sub_(self.mean).div_(self.std)

        def next(self):
            # Required by the call site below; follows the same pattern as
            # DataPrefetcher.next later in this section.
            torch.cuda.current_stream().wait_stream(self.stream)
            input = self.next_input
            target = self.next_target
            self.preload()
            return input, target

Modify the training function as follows. Before:
    training_data_loader = DataLoader(
        dataset=train_dataset,
        num_workers=opts.threads,
        batch_size=opts.batchsize,
        pin_memory=True,
        shuffle=True,
    )
    for iteration, batch in enumerate(training_data_loader, 1):
        # training code
After:
    prefetcher = data_prefetcher(training_data_loader)
    data, label = prefetcher.next()
    iteration = 0
    while data is not None:
        iteration += 1
        # training code
        data, label = prefetcher.next()

Implementation with the prefetch_generator library
https://zhuanlan.zhihu.com/p/97190313
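The idea behind this kind of background generator can be sketched with the standard library alone: a worker thread pulls items from the underlying iterator into a bounded queue while the consumer is busy. This is an illustration of the technique, not the source code of prefetch_generator; background_prefetch and slow_batches are hypothetical names, and error propagation from the worker thread is left out for brevity.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the underlying iterable


def background_prefetch(iterable, max_prefetch=2):
    """Yield items from `iterable` while a worker thread fetches ahead."""
    buffer = queue.Queue(maxsize=max_prefetch)

    def worker():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


def slow_batches(n, delay=0.01):
    """Hypothetical loader: each batch takes `delay` seconds to produce."""
    for i in range(n):
        time.sleep(delay)
        yield i


for batch in background_prefetch(slow_batches(5)):
    time.sleep(0.01)  # stand-in for the training step
    print("consumed", batch)
```

While the consumer sleeps (the stand-in for a training step), the worker is already producing the next batch, so loading and compute overlap instead of alternating.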
Install:
    pip install prefetch_generator
Usage:
    from torch.utils.data import DataLoader
    from prefetch_generator import BackgroundGenerator


    class DataLoaderX(DataLoader):
        def __iter__(self):
            return BackgroundGenerator(super().__iter__())

Then replace the original DataLoader with DataLoaderX.

Speeding up the copy with cuda.Stream
https://zhuanlan.zhihu.com/p/97190313
Implementation:
    import torch


    class DataPrefetcher():
        def __init__(self, loader, opt):
            self.loader = iter(loader)
            self.opt = opt
            self.stream = torch.cuda.Stream()
            # With Amp, it isn't necessary to manually convert data to half.
            # if args.fp16:
            #     self.mean = self.mean.half()
            #     self.std = self.std.half()
            self.preload()

        def preload(self):
            try:
                self.batch = next(self.loader)
            except StopIteration:
                self.batch = None
                return
            # Issue the host-to-device copies on a side stream
            with torch.cuda.stream(self.stream):
                for k in self.batch:
                    if k != 'meta':
                        self.batch[k] = self.batch[k].to(device=self.opt.device, non_blocking=True)
                # With Amp, it isn't necessary to manually convert data to half.
                # if args.fp16:
                #     self.next_input = self.next_input.half()
                # else:
                #     self.next_input = self.next_input.float()

        def next(self):
            # Make the compute stream wait until the copy has finished
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = self.batch
            self.preload()
            return batch

Usage:
    # ---- Before ----
    for iter_id, batch in enumerate(data_loader):
        if iter_id >= num_iters:
            break
        for k in batch:
            if k != 'meta':
                batch[k] = batch[k].to(device=opt.device, non_blocking=True)
        run_step()

    # ---- After ----
    prefetcher = DataPrefetcher(data_loader, opt)
    batch = prefetcher.next()
    iter_id = 0
    while batch is not None:
        # Check before incrementing so that num_iters steps run, as before
        if iter_id >= num_iters:
            break
        run_step()
        iter_id += 1
        batch = prefetcher.next()

An implementation by an overseas developer
Data loading part
    import random
    import threading

    import cv2
    import numpy as np


    class threadsafe_iter:
        """Takes an iterator/generator and makes it thread-safe by
        serializing calls to the `next` method of the given iterator/generator.
        """

        def __init__(self, it):
            self.it = it
            self.lock = threading.Lock()

        def __iter__(self):
            return self

        def __next__(self):  # Python 3 iterator protocol
            with self.lock:
                return next(self.it)


    def get_path_i(paths_count):
        """Cyclic generator of path indices."""
        current_path_id = 0
        while True:
            yield current_path_id
            current_path_id = (current_path_id + 1) % paths_count


    class InputGen:
        def __init__(self, paths, batch_size):
            self.paths = paths
            self.index = 0
            self.batch_size = batch_size
            self.init_count = 0
            self.lock = threading.Lock()  # mutex for input path
            self.yield_lock = threading.Lock()  # mutex for generator yielding of batch
            self.path_id_generator = threadsafe_iter(get_path_i(len(self.paths)))
            self.images = []
            self.labels = []

        def get_samples_count(self):
            """Returns the total number of images needed to train an epoch."""
            return len(self.paths)

        def get_batches_count(self):
            """Returns the total number of batches needed to train an epoch."""
            return int(self.get_samples_count() / self.batch_size)

        def pre_process_input(self, im, lb):
            """Do your pre-processing here.
            Needs to be a thread-safe function.
            """
            return im, lb

        def next(self):
            return self.__iter__()

        def __iter__(self):
            while True:
                # At the start of each epoch we shuffle the data paths
                with self.lock:
                    if self.init_count == 0:
                        random.shuffle(self.paths)
                        self.images, self.labels, self.batch_paths = [], [], []
                        self.init_count = 1
                # Iterates through the input paths in a thread-safe manner
                for path_id in self.path_id_generator:
                    img, label = self.paths[path_id]
                    img = cv2.imread(img, 1)
                    label_img = cv2.imread(label, 1)
                    img, label = self.pre_process_input(img, label_img)
                    # Concurrent access by multiple threads to the lists below
                    with self.yield_lock:
                        if len(self.images) < self.batch_size:
                            self.images.append(img)
                            self.labels.append(label)
                        if len(self.images) % self.batch_size == 0:
                            yield np.float32(self.images), np.float32(self.labels)
                            self.images, self.labels = [], []
                # At the end of an epoch we re-init data structures
                with self.lock:
                    self.init_count = 0

        def __call__(self):
            return self.__iter__()

Usage:
    import time
    from queue import Empty, Queue
    from threading import Thread

    import numpy as np
    import torch


    class thread_killer(object):
        """Boolean object for signaling a worker thread to terminate."""

        def __init__(self):
            self.to_kill = False

        def __call__(self):
            return self.to_kill

        def set_tokill(self, tokill):
            self.to_kill = tokill


    def threaded_batches_feeder(tokill, batches_queue, dataset_generator):
        """Threaded worker for pre-processing input data.
        tokill is a thread_killer object that indicates whether a thread should be terminated.
        dataset_generator is the training/validation dataset generator.
        batches_queue is a limited-size thread-safe Queue instance.
        """
        while not tokill():
            for batch, (batch_images, batch_labels) in enumerate(dataset_generator):
                # We fill the queue with newly fetched batches until we reach the max size.
                batches_queue.put((batch, (batch_images, batch_labels)), block=True)
                if tokill():
                    return


    def threaded_cuda_batches(tokill, cuda_batches_queue, batches_queue):
        """Thread worker for transferring PyTorch tensors to the GPU.
        batches_queue is the queue that fetches numpy CPU tensors.
        cuda_batches_queue receives numpy CPU tensors and transfers them to GPU space.
        """
        while not tokill():
            batch, (batch_images, batch_labels) = batches_queue.get(block=True)
            batch_images_np = np.transpose(batch_images, (0, 3, 1, 2))
            batch_images = torch.from_numpy(batch_images_np)
            batch_labels = torch.from_numpy(batch_labels)
            # Plain tensors suffice in current PyTorch; no Variable wrapper is needed
            batch_images = batch_images.cuda()
            batch_labels = batch_labels.cuda()
            cuda_batches_queue.put((batch, (batch_images, batch_labels)), block=True)
            if tokill():
                return


    if __name__ == '__main__':
        num_epoches = 1000
        # model is some PyTorch CNN model
        model.cuda()
        model.train()
        batches_per_epoch = 64
        # The training set list is supposed to be a list of full paths
        # for all the training images.
        training_set_list = None
        # Our train batches queue can hold at most 12 batches at any given time.
        # Once the queue is filled the queue is locked.
        train_batches_queue = Queue(maxsize=12)
        # Our numpy batches CUDA transfer queue.
        # Once the queue is filled the queue is locked.
        # We set maxsize to 3 due to GPU memory size limitations.
        cuda_batches_queue = Queue(maxsize=3)

        training_set_generator = InputGen(training_set_list, batches_per_epoch)
        train_thread_killer = thread_killer()
        train_thread_killer.set_tokill(False)
        preprocess_workers = 4

        # We launch 4 threads to load and pre-process the input images
        for _ in range(preprocess_workers):
            t = Thread(target=threaded_batches_feeder,
                       args=(train_thread_killer, train_batches_queue, training_set_generator))
            t.start()
        cuda_transfers_thread_killer = thread_killer()
        cuda_transfers_thread_killer.set_tokill(False)
        cudathread = Thread(target=threaded_cuda_batches,
                            args=(cuda_transfers_thread_killer, cuda_batches_queue, train_batches_queue))
        cudathread.start()

        # We let the queue fill up before we start the training
        time.sleep(8)
        for epoch in range(num_epoches):
            for batch in range(batches_per_epoch):
                # We fetch a GPU batch in 0's due to the queue mechanism
                _, (batch_images, batch_labels) = cuda_batches_queue.get(block=True)
                # train_batch is the method for your training step.
                # No need to pin_memory due to diminished CUDA transfers using queues.
                loss, accuracy = train_batch(batch_images, batch_labels)

        train_thread_killer.set_tokill(True)
        cuda_transfers_thread_killer.set_tokill(True)
        for _ in range(preprocess_workers):
            try:
                # Enforcing thread shutdown
                train_batches_queue.get(block=True, timeout=1)
                cuda_batches_queue.get(block=True, timeout=1)
            except Empty:
                pass
        print("Training done")

6. Multi-GPU parallel processing
PyTorch provides a distributed training API, nn.parallel.DistributedDataParallel. For inference you can also use nn.DataParallel or nn.parallel.DistributedDataParallel. Here is a recommended repository that implements demos of several distributed training approaches: https://github.com/tczhangzhi/pytorch-distributed
It includes:
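nn.DataParallel
torch.distributed
torch.multiprocessing
Further acceleration with Apex
Horovod implementation
Distributed training on a Slurm GPU cluster

7. Mixed precision training
Mixed precision is simply great. I previously shared reading notes on the Mixed Precision paper, and it is very easy to put into practice. In PyTorch you can use the Apex library. If you are on a recent version of PyTorch, mixed precision training is already supported natively, which is very nice. In short, mixed precision lets you roughly double the batch size without losing accuracy. The principle is to convert data that was originally 32-bit floating point (FP32) into 16-bit floating point (FP16), which greatly speeds up both data transfer and the training computation. It is a must-have for model training.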
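As a rough illustration of the native support mentioned above, here is a minimal training-step sketch using torch.autocast and torch.cuda.amp.GradScaler. The tiny linear model, random data, and learning rate are placeholder assumptions, and the exact import path for GradScaler varies between PyTorch releases (newer versions also expose it under torch.amp). On a machine without CUDA the sketch falls back to ordinary FP32 so that it still runs.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only when a GPU is present

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
# GradScaler scales the loss so that small FP16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs, targets):
    optimizer.zero_grad()
    # Inside autocast, eligible ops run in reduced precision when enabled.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()


inputs = torch.randn(8, 16, device=device)   # placeholder batch
targets = torch.randn(8, 4, device=device)
losses = [train_step(inputs, targets) for _ in range(5)]
print(losses)
```

The model weights stay in FP32; only the forward computation inside the autocast block is cast down, which is why an existing training loop usually needs just these few extra lines.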