How do NPU and CPU inference speeds compare? Tests on the MYD-JX8MPQ development board (i.MX 8M Plus processor)

References
https://www.toradex.cn/blog/nxp-imx8ji-yueiq-kuang-jia-ce-shi-machine-learning
imx-machine-learning-ug.pdf
Image classification on CPU and NPU
cd /usr/bin/tensorflow-lite-2.4.0/examples
Running on the CPU
./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt
info: loaded model mobilenet_v1_1.0_224_quant.tflite
info: resolved reporter
info: invoked
info: average time: 50.66 ms
info: 0.780392: 653 military uniform
info: 0.105882: 907 windsor tie
info: 0.0156863: 458 bow tie
info: 0.0117647: 466 bulletproof vest
info: 0.00784314: 835 suit
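The scores in this output are all multiples of 1/255 (0.780392 is 199/255, for example), which is what one would expect when a quantized model's uint8 outputs are divided by 255 before printing. The following is a minimal Python sketch of that top-5 step, assuming a flat list of raw uint8 scores. The function name `top_k`, the threshold, and the sample values are illustrative and are not taken from label_image itself.

```python
import heapq

def top_k(raw_scores, k=5, threshold=0.001):
    """Return (score, index) pairs for the k highest uint8 scores, scaled to [0, 1]."""
    scaled = [(raw / 255.0, idx) for idx, raw in enumerate(raw_scores)]
    # Drop near-zero entries, then keep the k largest by score.
    kept = [pair for pair in scaled if pair[0] >= threshold]
    return heapq.nlargest(k, kept)

# Hypothetical raw output: mostly zeros, with a few active classes.
raw = [0] * 1001
raw[653], raw[907], raw[458], raw[466], raw[835] = 199, 27, 4, 3, 2

for score, idx in top_k(raw):
    print(f"{score:.6g}: {idx}")
```

Because the output has only 256 possible levels, small differences between two runs (such as 0.780392 versus 0.768627) correspond to a change of only a few quantization steps.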
Running with GPU/NPU acceleration
./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -a 1
info: loaded model mobilenet_v1_1.0_224_quant.tflite
info: resolved reporter
info: created tensorflow lite delegate for nnapi.
info: applied nnapi delegate.
info: invoked
info: average time: 2.775 ms
info: 0.768627: 653 military uniform
info: 0.105882: 907 windsor tie
info: 0.0196078: 458 bow tie
info: 0.0117647: 466 bulletproof vest
info: 0.00784314: 835 suit
USE_GPU_INFERENCE=0 ./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libvx_delegate.so
Running from Python
python3 label_image.py
info: created tensorflow lite delegate for nnapi.
applied nnapi delegate.
warm-up time: 6628.5 ms
inference time: 2.9 ms
0.870588: military uniform
0.031373: windsor tie
0.011765: mortarboard
0.007843: bow tie
0.007843: bulletproof vest
Benchmark: CPU, single core
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
starting!
log parameter values verbosely: [0]
graph: [mobilenet_v1_1.0_224_quant.tflite]
loaded model mobilenet_v1_1.0_224_quant.tflite
the input model file size (mb): 4.27635
initialized session in 15.076ms.
running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=166743 curr=161124 min=161054 max=166743 avg=162728 std=2347
running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=161039 curr=161030 min=160877 max=161292 avg=161039 std=94
inference timings in us: init: 15076, first inference: 166743, warmup (avg): 162728, inference (avg): 161039
note: as the benchmark tool itself affects memory footprint, the following is only approximate to the actual memory footprint of the model at runtime. take the information at your discretion.
peak memory footprint (mb): init=2.65234 overall=9.00391
Benchmark: CPU, multiple cores
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_threads=4
The board has 4 cores; setting --num_threads to 4 gives the best performance.
starting!
log parameter values verbosely: [0]
num threads: [4]
graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for cpu inference: [4]
loaded model mobilenet_v1_1.0_224_quant.tflite
the input model file size (mb): 4.27635
initialized session in 2.536ms.
running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=11 first=48722 curr=44756 min=44597 max=49397 avg=45518.9 std=1679
running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=44678 curr=44591 min=44590 max=50798 avg=44965.2 std=1170
inference timings in us: init: 2536, first inference: 48722, warmup (avg): 45518.9, inference (avg): 44965.2
note: as the benchmark tool itself affects memory footprint, the following is only approximate to the actual memory footprint of the model at runtime. take the information at your discretion.
peak memory footprint (mb): init=1.38281 overall=8.69922
Benchmark: GPU/NPU acceleration
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_nnapi=true
starting!
log parameter values verbosely: [0]
num threads: [4]
graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for cpu inference: [4]
use nnapi: [1]
nnapi accelerators available: [vsi-npu]
loaded model mobilenet_v1_1.0_224_quant.tflite
info: created tensorflow lite delegate for nnapi.
explicitly applied nnapi delegate, and the model graph will be completely executed by the delegate.
the input model file size (mb): 4.27635
initialized session in 3.968ms.
running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=6611085
running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=369 first=2715 curr=2623 min=2572 max=2776 avg=2634.2 std=20
inference timings in us: init: 3968, first inference: 6611085, warmup (avg): 6.61108e+06, inference (avg): 2634.2
note: as the benchmark tool itself affects memory footprint, the following is only approximate to the actual memory footprint of the model at runtime. take the information at your discretion.
peak memory footprint (mb): init=2.42188 overall=28.4062
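To compare runs, it helps to pull the numbers out of the summary line that benchmark_model prints. The following is a minimal Python sketch, assuming the line has the same field names as the log above; the helper name `parse_timings` is illustrative.

```python
import re

# Matches the four fields of benchmark_model's summary line (values in microseconds).
_FIELD = re.compile(
    r"(init|first inference|warmup \(avg\)|inference \(avg\)): ([0-9.eE+-]+)",
    re.IGNORECASE,
)

def parse_timings(line):
    """Return a dict mapping field name (lower case) to its value in microseconds."""
    return {key.lower(): float(value) for key, value in _FIELD.findall(line)}

line = ("inference timings in us: init: 3968, first inference: 6611085, "
        "warmup (avg): 6.61108e+06, inference (avg): 2634.2")
timings = parse_timings(line)
print(f"steady state: {timings['inference (avg)'] / 1000.0:.3f} ms")
print(f"first inference: {timings['first inference'] / 1e6:.2f} s")
```

The values are in microseconds, so the first NPU inference here took about 6.6 seconds while later ones averaged about 2.6 ms.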
Results comparison
                        CPU          CPU, multi-core (4 threads)   NPU acceleration
Image classification    50.66 ms     -                             2.775 ms
Benchmark               161039 us    44965.2 us                    2634.2 us
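The speedups implied by these numbers can be worked out directly. The following short Python sketch uses only the averages reported in the logs above, converted to milliseconds; the variable and function names are illustrative.

```python
# Average latencies from the runs above, in milliseconds.
label_image_ms = {"cpu": 50.66, "npu": 2.775}
benchmark_ms = {"cpu_1_thread": 161.039, "cpu_4_threads": 44.9652, "npu": 2.6342}

def speedup(baseline_ms, accelerated_ms):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_ms / accelerated_ms

rows = [
    ("label_image: NPU vs CPU", label_image_ms["cpu"], label_image_ms["npu"]),
    ("benchmark: NPU vs 1 thread", benchmark_ms["cpu_1_thread"], benchmark_ms["npu"]),
    ("benchmark: NPU vs 4 threads", benchmark_ms["cpu_4_threads"], benchmark_ms["npu"]),
    ("benchmark: 4 threads vs 1 thread", benchmark_ms["cpu_1_thread"], benchmark_ms["cpu_4_threads"]),
]
for name, base, fast in rows:
    print(f"{name}: {speedup(base, fast):.1f}x")
```

For this quantized MobileNet model, the NPU is roughly 61 times faster than a single CPU core and roughly 17 times faster than four CPU threads, while going from one to four threads gives about 3.6 times.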
OpenCV DNN
cd /usr/share/opencv/samples/bin
./example_dnn_classification --input=dog416.png --zoo=models.yml squeezenet
Download the models
cd /usr/share/opencv4/testdata/dnn/
python3 download_models_basic.py
Image classification
cd /usr/share/opencv/samples/bin
./example_dnn_classification --input=dog416.png --zoo=models.yml squeezenet
Enter the following address in a file browser's address bar:
ftp://ftp.toradex.cn/linux/i.mx8/eiq/opencv/image_classification.zip
Download the file.
Unzip it to get models.yml and squeezenet_v1.1.caffemodel.
cd /usr/share/opencv/samples/bin
Copy the files into the /usr/share/opencv/samples/bin directory on the development board.
$ cp /usr/share/opencv4/testdata/dnn/dog416.png /usr/share/opencv/samples/bin/
$ cp /usr/share/opencv4/testdata/dnn/squeezenet_v1.1.prototxt /usr/share/opencv/samples/bin/
$ cp /usr/share/opencv/samples/data/dnn/classification_classes_ilsvrc2012.txt /usr/share/opencv/samples/bin/
$ cd /usr/share/opencv/samples/bin/
Image input
./example_dnn_classification --input=dog416.png --zoo=models.yml squeezenet
Error
root@myd-jx8mp:/usr/share/opencv/samples/bin# ./example_dnn_classification --input=dog416.png --zoo=model.yml squeezenet
errors:
missing parameter: 'mean'
missing parameter: 'rgb'
Add the parameters --rgb and --mean=1.
The error persists, so also add the --mode parameter.
root@myd-jx8mp:/usr/share/opencv/samples/bin# ./example_dnn_classification --rgb --mean=1 --input=dog416.png --zoo=models.yml squeezenet
[warn:0] global /usr/src/debug/opencv/4.4.0.imx-r0/git/modules/videoio/src/cap_gstreamer.cpp (898) open opencv | gstreamer warning: unable to query duration of stream
[warn:0] global /usr/src/debug/opencv/4.4.0.imx-r0/git/modules/videoio/src/cap_gstreamer.cpp (935) open opencv | gstreamer warning: cannot query video position: status=1, value=0, duration=-1
root@myd-jx8mp:/usr/share/opencv/samples/bin# ./example_dnn_classification --rgb --mean=1 --input=dog416.png --zoo=models.yml squeezenet --mode
[warn:0] global /usr/src/debug/opencv/4.4.0.imx-r0/git/modules/videoio/src/cap_gstreamer.cpp (898) open opencv | gstreamer warning: unable to query duration of stream
[warn:0] global /usr/src/debug/opencv/4.4.0.imx-r0/git/modules/videoio/src/cap_gstreamer.cpp (935) open opencv | gstreamer warning: cannot query video position: status=1, value=0, duration=-1
Video input
./example_dnn_classification --device=2 --zoo=models.yml squeezenet
Troubleshooting
If the testdata directory contains no files, search for them in the Yocto build tree on the host:
lhj@desktop-binn7f8:~/myd-jx8mp-yocto$ find . -name dog416.png
./build-xwayland/tmp/work/cortexa53-crypto-mx8mp-poky-linux/opencv/4.4.0.imx-r0/extra/testdata/dnn/dog416.png
Then copy the corresponding files to the development board:
cd ./build-xwayland/tmp/work/cortexa53-crypto-mx8mp-poky-linux/opencv/4.4.0.imx-r0/extra/testdata/
tar -cvf /mnt/e/dnn.tar ./dnn/
On the board, cd /usr/share/opencv4/testdata (create the directory first if it does not exist).
Import dnn.tar with rz.
Extract it: tar -xvf dnn.tar
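When no model path is resolved (for example, when the zoo file does not supply one), the sample aborts:
terminate called after throwing an instance of 'cv::exception'
what(): opencv(4.4.0) /usr/src/debug/opencv/4.4.0.imx-r0/git/samples/dnn/classification.cpp: error: (assertion failed) !model.empty() in function 'main'
aborted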
To inspect the sample's source, locate classification.cpp in the build tree and copy it out:
lhj@desktop-binn7f8:~/myd-jx8mp-yocto/build-xwayland$ find . -name classification.cpp
lhj@desktop-binn7f8:~/myd-jx8mp-yocto/build-xwayland$ cp ./tmp/work/cortexa53-crypto-mx8mp-poky-linux/opencv/4.4.0.imx-r0/packages-split/opencv-src/usr/src/debug/opencv/4.4.0.imx-r0/git/samples/dnn/classification.cpp /mnt/e
YOLO object detection
cd /usr/share/opencv/samples/bin
./example_dnn_object_detection --width=1024 --height=1024 --scale=0.00392 --input=dog416.png --rgb --zoo=models.yml yolo
Download the cfg and weights files from https://pjreddie.com/darknet/yolo/
cd /usr/share/opencv/samples/bin/
Import the files downloaded above.
cp /usr/share/opencv/samples/data/dnn/object_detection_classes_yolov3.txt /usr/share/opencv/samples/bin/
cp /usr/share/opencv4/testdata/dnn/yolov3.cfg /usr/share/opencv/samples/bin/
./example_dnn_object_detection --width=1024 --height=1024 --scale=0.00392 --input=dog416.png --rgb --zoo=models.yml yolo
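The --scale=0.00392 value is approximately 1/255, which maps 8-bit pixel values into the range [0, 1], and --rgb asks for the blue and red channels of OpenCV's BGR images to be swapped. The following is a minimal pure-Python sketch of that per-pixel preprocessing, assuming the mean is subtracted before scaling, as in OpenCV's blobFromImage. The helper name `preprocess_pixel` and the sample pixel are illustrative.

```python
def preprocess_pixel(bgr, scale=0.00392, mean=(0.0, 0.0, 0.0), swap_rb=True):
    """Apply channel swap, mean subtraction, then scaling to one BGR pixel."""
    b, g, r = bgr
    channels = (r, g, b) if swap_rb else (b, g, r)
    return tuple((value - m) * scale for value, m in zip(channels, mean))

print(preprocess_pixel((255, 128, 0)))   # a bright-blue pixel given in BGR order
print(round(1 / 255, 5))                 # the value passed to --scale
```

The same three settings (mean, scale, and channel order) are what the classification sample reports as missing when neither the zoo file nor the command line supplies them.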
OpenCV classical machine learning
cd /usr/share/opencv/samples/bin
Linear SVM
./example_tutorial_introduction_to_svm
Non-linear SVM
./example_tutorial_non_linear_svms
PCA analysis
./example_tutorial_introduction_to_pca ../data/pca_test1.jpg
Logistic regression
./example_cpp_logistic_regression
