DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads

Min-Chi Chiang, Jerry Chou

Abstract

The recent success of deep learning applications is driven by the computing power of GPUs. However, as deep learning workflows become increasingly complicated and resource-intensive, managing expensive GPU resources for Machine Learning (ML) workloads has become a critical problem. Existing resource managers mostly focus on a single type of workload, such as batch processing or web services, and lack runtime optimization and application performance awareness. Therefore, this paper proposes a set of dynamic runtime management techniques (including auto-scaling, job preemption, workload-aware scheduling, and elastic GPU sharing) to handle a mixture of ML workloads consisting of modeling, training, and inference jobs. The proposed system is implemented as a set of extended operators on Kubernetes and is fully transparent to, and compatible with, both application code and deep learning frameworks. Experiments conducted on AWS GPU clusters show that our approach outperforms native Kubernetes, improving system throughput by 60% and reducing training time by 70% without causing any SLA violations on inference services.


Paper Citation


in Harvard Style

Chiang M. and Chou J. (2021). DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads. In Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-510-4, pages 122-132. DOI: 10.5220/0010483401220132


in Bibtex Style

@conference{closer21,
author={Min-Chi Chiang and Jerry Chou},
title={DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads},
booktitle={Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2021},
pages={122-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010483401220132},
isbn={978-989-758-510-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 11th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads
SN - 978-989-758-510-4
AU - Chiang M.
AU - Chou J.
PY - 2021
SP - 122
EP - 132
DO - 10.5220/0010483401220132