Implementing multi-GPU distributed training with Horovod in Amazon SageMaker Pipe mode
2020-11-07 20:15:00 【InfoQ】
Today, a range of techniques lets us train deep learning models with small amounts of data: transfer learning for image classification, few-shot and even one-shot learning, and fine-tuning language models from pre-trained BERT or GPT-2. However, some applications still need a large amount of training data. For example, if your images are completely different from those in the ImageNet dataset, or your language corpus covers a specific domain rather than general text, transfer learning is unlikely to deliver the desired model performance. As a deep learning researcher, you may also need to try new ideas or approaches from scratch. In these cases, we have to train large deep learning models on large datasets; without an efficient training approach, the whole process can take days, weeks, or even months.
In this article, we look at how to run multi-GPU training on a single Amazon SageMaker instance, and discuss how to implement efficient multi-GPU and multi-node distributed training on Amazon SageMaker.
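As a rough illustration of the data-parallel idea behind multi-GPU training with Horovod, the sketch below simulates it in plain Python: each worker computes a gradient on its own shard of the data, the gradients are averaged (the role an allreduce plays), and every worker applies the same update. This example is not from the original article and does not call Horovod or SageMaker; the function names, the toy model y = w * x, and the data are all hypothetical.

```python
# Simulated data-parallel training step: each "worker" stands in for one GPU.
# Illustration only; it does not call Horovod or SageMaker.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce that averages one value per worker."""
    return sum(values) / len(values)

def train_step(w, shards, lr):
    """One synchronous step: local gradients, average, identical update."""
    grads = [local_gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * allreduce_mean(grads)

# Hypothetical toy data: y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = train_step(w, shards, lr=0.01)
```

With equally sized shards, the averaged per-worker gradient equals the gradient over the full batch, which is why synchronous data-parallel training follows the same optimization path as single-device training with the combined batch.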
Original article: https://www.infoq.cn/article/0867pYEmzviBfvZxW37k. Reproduction without the author's permission is prohibited.
Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.