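Run a RayJob
Run a Kueue-scheduled RayJob.
This page shows how to use Kueue's scheduling and resource management capabilities when running KubeRay's RayJob.
This guide is for batch users who have a basic understanding of Kueue. For more information, see the Kueue overview.
Before you begin
Check Administer cluster quotas for details on the initial Kueue setup.
See KubeRay Installation for installation and configuration details of KubeRay.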
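RayJob definition
When running a RayJob on Kueue, take the following aspects into consideration:
a. Queue selection
The target local queue should be specified in the metadata.labels section of the RayJob configuration.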
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
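If manifests are generated programmatically, the label can be applied before submission. The following is a minimal sketch that works on a manifest held as a plain Python dictionary; the helper name `set_queue` is hypothetical and not part of Kueue or KubeRay.

```python
import copy

QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def set_queue(manifest: dict, queue: str) -> dict:
    """Return a copy of the manifest with the Kueue queue label set."""
    updated = copy.deepcopy(manifest)
    labels = updated.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue
    return updated


# A stripped-down RayJob manifest, used only for illustration.
rayjob = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "ray-job-sample"},
}

labeled = set_queue(rayjob, "user-queue")
print(labeled["metadata"]["labels"])
```

The input dictionary is left untouched, so the same template can be labeled for different queues.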
b. Configure the resource needs
The resource needs of the workload can be configured in spec.rayClusterSpec.
headGroupSpec:
  template:
    spec:
      containers:
        - resources:
            requests:
              cpu: "1"
workerGroupSpecs:
  - template:
      spec:
        containers:
          - resources:
              requests:
                cpu: "1"
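c. Limitations
- A Kueue-managed RayJob cannot use an existing RayCluster.
- The RayCluster should be deleted at the end of the job execution, so spec.shutdownAfterJobFinishes should be true.
- Because Kueue reserves resources for the RayCluster, spec.rayClusterSpec.enableInTreeAutoscaling should be false.
- Because a Kueue workload can have at most 8 PodSets, spec.rayClusterSpec.workerGroupSpecs can contain at most 7 entries (the head group takes the remaining PodSet).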
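These limitations can be checked before a manifest is submitted. The following is a minimal sketch over a manifest held as a Python dictionary; the function name is hypothetical, and it assumes that an existing RayCluster would be selected through the RayJob field `spec.clusterSelector`, which the text above does not name.

```python
MAX_WORKER_GROUPS = 7  # 8 PodSets per workload, minus one for the head group


def kueue_rayjob_problems(manifest: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    spec = manifest.get("spec", {})
    cluster = spec.get("rayClusterSpec") or {}
    problems = []
    # Assumption: an existing RayCluster is referenced via spec.clusterSelector.
    if spec.get("clusterSelector"):
        problems.append("spec.clusterSelector is set; an existing RayCluster cannot be used")
    if spec.get("shutdownAfterJobFinishes") is not True:
        problems.append("spec.shutdownAfterJobFinishes should be true")
    if cluster.get("enableInTreeAutoscaling"):
        problems.append("spec.rayClusterSpec.enableInTreeAutoscaling should be false")
    groups = cluster.get("workerGroupSpecs") or []
    if len(groups) > MAX_WORKER_GROUPS:
        problems.append(
            f"{len(groups)} worker groups found; at most {MAX_WORKER_GROUPS} are allowed"
        )
    return problems


good = {
    "spec": {
        "shutdownAfterJobFinishes": True,
        "rayClusterSpec": {"workerGroupSpecs": [{"groupName": "small-group"}]},
    }
}
bad = {
    "spec": {
        "shutdownAfterJobFinishes": False,
        "rayClusterSpec": {
            "enableInTreeAutoscaling": True,
            "workerGroupSpecs": [{"groupName": f"group-{i}"} for i in range(8)],
        },
    }
}

print(kueue_rayjob_problems(good))
for problem in kueue_rayjob_problems(bad):
    print("-", problem)
```

The check is advisory only; it does not replace whatever validation the cluster itself performs.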
Example RayJob
In this example, the code is provided to the Ray framework via a ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    print(requests.__version__)
The RayJob looks like the following:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-job-sample
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  shutdownAfterJobFinishes: true
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnv: ewogICAgInBpcCI6IFsKICAgICAgICAicmVxdWVzdHM9PTIuMjYuMCIsCiAgICAgICAgInBlbmR1bHVtPT0yLjEuMiIKICAgIF0sCiAgICAiZW52X3ZhcnMiOiB7ImNvdW50ZXJfbmFtZSI6ICJ0ZXN0X2NvdW50ZXIifQp9Cg==
  rayClusterSpec:
    rayVersion: '2.4.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: '1' # can be auto-completed from the limits
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.4.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              resources:
                limits:
                  cpu: "2"
                requests:
                  cpu: "1"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the number of pod replicas in this worker group
      - replicas: 3
        minReplicas: 1
        maxReplicas: 5
        # logical group name, here small-group; it can also describe the group's function
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: rayproject/ray:2.4.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "1"
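The runtimeEnv value in the manifest above is a base64-encoded JSON document. A short sketch for inspecting it with the Python standard library (the variable names are illustrative only):

```python
import base64
import json

# The runtimeEnv string copied from the RayJob manifest.
RUNTIME_ENV_B64 = "ewogICAgInBpcCI6IFsKICAgICAgICAicmVxdWVzdHM9PTIuMjYuMCIsCiAgICAgICAgInBlbmR1bHVtPT0yLjEuMiIKICAgIF0sCiAgICAiZW52X3ZhcnMiOiB7ImNvdW50ZXJfbmFtZSI6ICJ0ZXN0X2NvdW50ZXIifQp9Cg=="

# Decode the base64 payload and parse the resulting JSON text.
runtime_env = json.loads(base64.b64decode(RUNTIME_ENV_B64))
print(json.dumps(runtime_env, indent=2))

# Going the other way: encode a runtime environment for use in a manifest.
encoded = base64.b64encode(json.dumps(runtime_env).encode("utf-8")).decode("ascii")
print(encoded)
```

The decoded document lists the pip packages to install and sets the `counter_name` environment variable, which is the value that `sample_code.py` reads with `os.getenv`. The re-encoded string differs from the original in whitespace only and decodes to the same JSON.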
You can run this RayJob with the following commands:
# Create the code ConfigMap (once)
kubectl apply -f ray-job-code-sample.yaml
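# Create a RayJob. To observe the queueing and admission of jobs, submit
# several RayJobs. Each one needs a distinct metadata.name (or use
# metadata.generateName), because kubectl create fails if the name already exists.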
kubectl create -f ray-job-sample.yaml