Authors:
Min-Chi Chiang
and
Jerry Chou
Affiliation:
National Tsing Hua University, Hsinchu, Taiwan
Keyword(s):
Deep Learning, GPU Resource Management, Job Scheduling, Performance Optimization.
Abstract:
The recent success of deep learning applications is driven by the computing power of GPUs. However, as deep learning workflows become increasingly complicated and resource-intensive, managing expensive GPU resources for Machine Learning (ML) workloads becomes a critical problem. Existing resource managers mostly focus on a single specific type of workload, such as batch processing or web services, and lack runtime optimization and application performance awareness. Therefore, this paper proposes a set of runtime dynamic management techniques (including auto-scaling, job preemption, workload-aware scheduling, and elastic GPU sharing) to handle a mixture of ML workloads consisting of modeling, training, and inference jobs. Our proposed system is implemented as a set of extended operators on Kubernetes and is completely transparent to, and compatible with, both the application code and the deep learning frameworks. Our experiments conducted on AWS GPU clusters show that our approach can outperform native Kubernetes with a 60% improvement in system throughput and a 70% reduction in training time, without causing any SLA violations on inference services.