Scheduling Deep Learning Training in GPU Cluster Using the Model-Similarity-Based Policy
Conference proceedings article
Authors/Editors
No relevant information found.
Strategic Research Area
Publication details
Author list: Thanapol P.; Lavangnananda K.; Leprévost F.; Schleich J.; Bouvry P.
Publisher: Springer Science and Business Media Deutschland GmbH
Publication year (CE): 2023
Volume number: 13996 LNAI
First page: 363
Last page: 374
Number of pages: 12
ISBN: 978-981-99-5836-8
ISSN: 0302-9743
Language: English-Great Britain (EN-GB)
Abstract
Training large neural networks with huge amounts of data using multiple Graphics Processing Units (GPUs) became widespread with the emergence of Deep Learning (DL) technology. It is usually carried out in datacenters featuring multiple GPU clusters, which are shared amongst users. However, different GPU architectures co-exist on the market and differ in training performance. To maximise the utilisation of a GPU cluster, the scheduler plays an important role in managing the resources by dispatching jobs to the GPUs. An efficient scheduling strategy should take into account that the training performance of each GPU architecture varies for different DL models. In this work, an original model-similarity-based scheduling policy is introduced that takes into account which GPU architectures match which DL models. The results show that using the model-similarity-based scheduling policy for distributed training of a DL model with a large batch size across multiple GPUs can reduce the makespan. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023.
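The abstract does not detail the policy's algorithm. The Python sketch below is a minimal illustration of the general idea only, assuming a scheduler that matches an incoming training job to the most similar profiled reference model (cosine similarity over simple features such as parameter count, per-sample GFLOPs, and batch size) and dispatches it to the free GPU architecture with the highest profiled throughput for that reference model. All model names, GPU types, feature choices, and throughput figures are hypothetical placeholders, not data from the paper.

    import math

    # Hypothetical profiling data: measured throughput (samples/s) of a few
    # reference DL models on each GPU architecture in the cluster.
    # All numbers are illustrative placeholders, not results from the paper.
    PROFILES = {
        "ResNet-50": {"V100": 410.0, "T4": 150.0, "P100": 250.0},
        "BERT-base": {"V100": 95.0,  "T4": 32.0,  "P100": 55.0},
        "LSTM":      {"V100": 120.0, "T4": 70.0,  "P100": 90.0},
    }

    # Simple feature vectors describing each reference model:
    # (parameters in millions, GFLOPs per sample, typical batch size).
    FEATURES = {
        "ResNet-50": (25.6, 4.1, 256),
        "BERT-base": (110.0, 22.5, 32),
        "LSTM":      (15.0, 1.8, 64),
    }

    def cosine_similarity(a, b):
        """Cosine similarity between two feature vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def schedule(job_features, free_gpus):
        """Pick the free GPU architecture expected to train the job fastest.

        The job is matched to its most similar profiled reference model,
        whose per-architecture throughput then serves as a proxy for the
        job's own training performance.
        """
        # 1. Find the profiled reference model most similar to the job.
        best_ref = max(FEATURES,
                       key=lambda m: cosine_similarity(job_features, FEATURES[m]))
        # 2. Among currently free GPUs, choose the architecture with the
        #    highest profiled throughput for that reference model.
        best_gpu = max(free_gpus, key=lambda g: PROFILES[best_ref].get(g, 0.0))
        return best_ref, best_gpu

    if __name__ == "__main__":
        # A hypothetical incoming job: 30M params, 5 GFLOPs/sample, batch 128.
        ref, gpu = schedule((30.0, 5.0, 128), free_gpus=["T4", "P100"])
        print(f"matched reference model: {ref}; dispatch to: {gpu}")

Matching to a single closest profile and greedily taking the highest-throughput free GPU is only one plausible reading; the paper's actual policy may combine similarity scores across several references or model multi-GPU distributed training explicitly, which this sketch omits.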
Keywords
Deep Learning, Distributed Training, GPU Cluster, Scheduling, Scheduling Policy, Similarity Measurement