RuntimeError: CUDA error: an illegal memory access was encountered
2022-07-20 23:35:00 【Ton ton don't fight wild】
1. Error description
Venting first: this was the first time I ran into this error, and I was at a loss. The same code had run without errors before. Compared with the earlier runs, the differences are:
- More data: 80 cases became 100 cases
- A new docker image, so the pytorch and cuda versions may not match
- I checked the code, and it has not changed
2. My own attempts
2.1 Reduce batch_size
My first guess was that it had something to do with GPU memory.
First attempt:
- In my code, the first epoch of training completed, and the error was raised in the validation stage (line 243)
- After the error I reduced batch_size (10→8), and the error persisted
- But execution got past the validation step that had failed before; the error was now at line 258
- So reducing batch_size has some effect
Following this idea, I reduced batch_size further (8→5), and the error moved to yet another position:
- line 305
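One possible reason the reported line keeps moving (243, 258, 305) is that CUDA kernels are launched asynchronously, so the Python line in the traceback may not be the line that actually caused the fault. A minimal sketch, assuming the training script is started from Python; `enable_sync_cuda_errors` is a hypothetical helper name, not something from my code. Setting the `CUDA_LAUNCH_BLOCKING` environment variable to `1` before CUDA is initialized makes kernel launches synchronous, which usually brings the traceback closer to the failing call, at the cost of speed.

```python
import os


def enable_sync_cuda_errors() -> None:
    """Hypothetical helper: force synchronous CUDA kernel launches.

    Must run before the first CUDA call (safest: before `import torch`).
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()
# import torch  # import torch and start training only after the variable is set
```

This only helps locate the error; it is not expected to fix it.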
2.2 Change GPU and code
A different approach:
- Switched GPU, from card 0 to card 1
- Deleted the non-essential code that moves data between CPU and GPU
- Still got the error
- Two cards, batch_size set to 10, no pretrained model: still an error
- Training from scratch, single card, batch_size=5
- Training from scratch, single card, batch_size=4
Things got a little better: it ran all the way to the 8th epoch. But it still crashed.
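Switching from card 0 to card 1 can be sketched as below. This is an illustration under assumptions, not my actual training code: `pick_device` is a hypothetical helper, and the import is wrapped so the sketch also runs on a machine without PyTorch.

```python
def pick_device(index: int) -> str:
    """Return "cuda:<index>" if that GPU exists, otherwise "cpu"."""
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return f"cuda:{index}"
    return "cpu"


device = pick_device(1)  # card 1 instead of card 0
print(device)
# model.to(device) and batch.to(device) would then take this same string
```

Keeping the device in one variable makes it easier to confirm that the model and every batch end up on the same card.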
2.3 cuda and pytorch versions
The cuda and pytorch versions in the current image are:
- CUDA Version: 11.6
- pytorch 1.8.0
According to the pytorch documentation page INSTALLING PREVIOUS VERSIONS OF PYTORCH, these two versions are indeed not a good match.
In addition, according to the article on which CUDA and cuDNN versions correspond to each TensorFlow and PyTorch release, the versions do not match well either.
I tried switching back to the old image used for earlier training:
- The cuda and pytorch versions in the old image turned out to be the same as above
- That is, someone upgraded the cuda version on the server; the pytorch version in the container then has to be upgraded by hand.
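The version check above can also be done from inside the container. A minimal sketch, assuming PyTorch may or may not be importable; `describe_environment` is a hypothetical helper name. Note that if the 11.6 figure came from `nvidia-smi` (an assumption, the format suggests it), it is the highest CUDA version the driver supports, not necessarily the version PyTorch was built against; `torch.version.cuda` reports the latter.

```python
def describe_environment() -> dict:
    """Collect the PyTorch version and the CUDA version it was built with."""
    info = {"torch": None, "built_with_cuda": None, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment
    info["torch"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Comparing `built_with_cuda` with the driver-side version is a quick way to see whether the image and the server have drifted apart.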
3. What I found online
The error is a runtime error thrown by CUDA: an illegal memory access occurred. There is a lot of discussion of this problem online, but nobody has identified the real cause.
Many answers are based on gut feeling.
References:
- pytorch GitHub issue: RuntimeError: CUDA error: an illegal memory access was encountered
- This answer seems to have worked for more people: "A painful debug experience - RuntimeError: CUDA error: an illegal memory access was encountered", which describes how that author solved it
- Others are empirical:
- yolo GitHub issue: Cuda illegal memory access when running inference on *.engine #6311
4. My solution
It is not hard to see that my errors were basically all raised where data is moved from gpu to cpu.
- So I wondered whether the cpu side was running out of memory, causing the memory access error
- Because I use containers, I changed the configuration item in docker-compose or the dockerfile to:
shm_size: 64G → shm_size: 128G
- shm_size is the size of the container's shared memory
- After that, the error basically stopped appearing.
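To confirm that the new `shm_size` actually took effect, the shared-memory mount can be inspected from inside the container. A minimal sketch, assuming a Linux container where shared memory is mounted at `/dev/shm`; `shm_usage_gib` is a hypothetical helper name. PyTorch's multi-process data loading passes tensors through shared memory, which is one plausible reason this setting matters here, although I did not verify that this was the actual cause.

```python
import os
import shutil


def shm_usage_gib(path: str = "/dev/shm"):
    """Return (total, used, free) in GiB for `path`, or None if it is missing."""
    if not os.path.exists(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return (usage.total / gib, usage.used / gib, usage.free / gib)


print(shm_usage_gib())  # run inside the container, before and during training
```

If the total stays far below the configured value, the setting was probably not picked up by the container that is running.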