Authors: Min-Chi Chiang and Jerry Chou

Affiliation: National Tsing Hua University, Hsinchu, Taiwan

Keyword(s): Deep Learning, GPU Resource Management, Job Scheduling, Performance Optimization.

Abstract: The recent success of deep learning applications is driven by the computing power of GPUs. However, as deep learning workflows become increasingly complicated and resource-intensive, managing expensive GPU resources for Machine Learning (ML) workloads has become a critical problem. Existing resource managers mostly focus on a single type of workload, such as batch processing or web services, and lack runtime optimization and application performance awareness. Therefore, this paper proposes a set of runtime dynamic management techniques (including auto-scaling, job preemption, workload-aware scheduling, and elastic GPU sharing) to handle a mixture of ML workloads consisting of modeling, training, and inference jobs. The proposed system is implemented as a set of extended operators on Kubernetes and is fully transparent to, and compatible with, the application code and the deep learning frameworks. Experiments conducted on AWS GPU clusters show that the approach outperforms native Kubernetes with a 60% improvement in system throughput and a 70% reduction in training time, without causing any SLA violations on inference services.

CC BY-NC-ND 4.0

Paper citation in several formats:
Chiang, M. and Chou, J. (2021). DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads. In Proceedings of the 11th International Conference on Cloud Computing and Services Science - CLOSER; ISBN 978-989-758-510-4; ISSN 2184-5042, SciTePress, pages 122-132. DOI: 10.5220/0010483401220132

@conference{closer21,
author={Min{-}Chi Chiang and Jerry Chou},
title={DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads},
booktitle={Proceedings of the 11th International Conference on Cloud Computing and Services Science - CLOSER},
year={2021},
pages={122-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010483401220132},
isbn={978-989-758-510-4},
issn={2184-5042},
}

TY - CONF
JO - Proceedings of the 11th International Conference on Cloud Computing and Services Science - CLOSER
TI - DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads
SN - 978-989-758-510-4
IS - 2184-5042
AU - Chiang, M.
AU - Chou, J.
PY - 2021
SP - 122
EP - 132
DO - 10.5220/0010483401220132
PB - SciTePress