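Run a RayJob

Run a RayJob scheduled by Kueue.

This page shows how to use Kueue's scheduling and resource management capabilities when running a KubeRay RayJob.

This guide is for batch users who have a basic understanding of Kueue. For more information, see the Kueue overview.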

Before you begin

  1. Check Administer cluster quotas for details on the initial Kueue setup.

  2. See KubeRay Installation for installation and configuration details of KubeRay.

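RayJob definition

When running a RayJob on Kueue, take the following aspects into consideration:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the RayJob configuration.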

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue

b. Configure the resource needs

The resource needs of the workload can be configured in spec.rayClusterSpec.

    headGroupSpec:
      template:
        spec:
          containers:
            - resources:
                requests:
                  cpu: "1"
    workerGroupSpecs:
      - template:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "1"
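
To get a feel for how much quota such a spec asks for, the following is a minimal Python sketch that sums the CPU requests of a rayClusterSpec given as a plain dictionary. It assumes that the head group counts as one pod, that each worker group counts `replicas` pods, and that quota is accounted from container requests. The helper names (`parse_cpu`, `pod_cpu_request`, `total_cpu_request`) and the `replicas: 3` value are illustrative and not part of Kueue or KubeRay.

```python
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Parse a Kubernetes CPU quantity such as "1" or "500m" into cores."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def pod_cpu_request(pod_template: dict) -> Fraction:
    """Sum the CPU requests of all containers in a pod template."""
    total = Fraction(0)
    for container in pod_template["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        total += parse_cpu(requests.get("cpu", "0"))
    return total


def total_cpu_request(ray_cluster_spec: dict) -> Fraction:
    """Head pod (counted once) plus replicas x pod request per worker group."""
    total = pod_cpu_request(ray_cluster_spec["headGroupSpec"]["template"])
    for group in ray_cluster_spec.get("workerGroupSpecs", []):
        # This sketch requires replicas to be set explicitly on every group.
        total += group["replicas"] * pod_cpu_request(group["template"])
    return total


# Same shape as the fragment above, with a hypothetical replica count added.
ray_cluster_spec = {
    "headGroupSpec": {
        "template": {"spec": {"containers": [{"resources": {"requests": {"cpu": "1"}}}]}}
    },
    "workerGroupSpecs": [
        {
            "replicas": 3,
            "template": {"spec": {"containers": [{"resources": {"requests": {"cpu": "1"}}}]}},
        }
    ],
}

print(f"Total CPU requested: {total_cpu_request(ray_cluster_spec)}")
```

Comparing this total against the quota of the target queue gives a rough idea of whether the job can be admitted at all.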

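c. Limitations

  • A RayJob managed by Kueue cannot use an existing RayCluster.
  • The RayCluster should be deleted at the end of the job execution, so spec.shutdownAfterJobFinishes should be true.
  • Because Kueue reserves resources for the RayCluster, spec.rayClusterSpec.enableInTreeAutoscaling should be false.
  • Because a Kueue workload can have at most 8 PodSets, and the head group takes one of them, the maximum number of spec.rayClusterSpec.workerGroupSpecs is 7.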
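
The limitations above can be checked before submitting a manifest. The following is a minimal Python sketch, assuming the RayJob has already been loaded into a plain dictionary (for example from YAML). The function name `check_rayjob_for_kueue` is hypothetical, and treating a missing spec.rayClusterSpec as "relies on an existing RayCluster" is a simplifying assumption of this sketch, not a rule taken from Kueue.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"
MAX_WORKER_GROUPS = 7  # 8 PodSets per Kueue workload, one taken by the head group


def check_rayjob_for_kueue(rayjob: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    labels = rayjob.get("metadata", {}).get("labels", {})
    spec = rayjob.get("spec", {})
    cluster = spec.get("rayClusterSpec") or {}

    if QUEUE_LABEL not in labels:
        problems.append(f"metadata.labels is missing {QUEUE_LABEL}")
    if not cluster:
        # Assumption of this sketch: no inline cluster spec means an existing
        # RayCluster is expected, which Kueue-managed RayJobs cannot use.
        problems.append("spec.rayClusterSpec is missing")
    if spec.get("shutdownAfterJobFinishes") is not True:
        problems.append("spec.shutdownAfterJobFinishes should be true")
    if cluster.get("enableInTreeAutoscaling", False):
        problems.append("spec.rayClusterSpec.enableInTreeAutoscaling should be false")
    if len(cluster.get("workerGroupSpecs", [])) > MAX_WORKER_GROUPS:
        problems.append(f"more than {MAX_WORKER_GROUPS} workerGroupSpecs")
    return problems


good_job = {
    "metadata": {"labels": {QUEUE_LABEL: "user-queue"}},
    "spec": {
        "shutdownAfterJobFinishes": True,
        "rayClusterSpec": {"headGroupSpec": {}, "workerGroupSpecs": [{}]},
    },
}

bad_job = {
    "metadata": {"labels": {QUEUE_LABEL: "user-queue"}},
    "spec": {
        "shutdownAfterJobFinishes": False,
        "rayClusterSpec": {
            "headGroupSpec": {},
            "enableInTreeAutoscaling": True,
            "workerGroupSpecs": [{} for _ in range(8)],
        },
    },
}

for name, job in (("good_job", good_job), ("bad_job", bad_job)):
    print(name, check_rayjob_for_kueue(job))
```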

Example RayJob

In this example, the code is provided to the Ray framework via a ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    print(requests.__version__)

The RayJob looks like the following:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-job-sample
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  shutdownAfterJobFinishes: true
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnv: ewogICAgInBpcCI6IFsKICAgICAgICAicmVxdWVzdHM9PTIuMjYuMCIsCiAgICAgICAgInBlbmR1bHVtPT0yLjEuMiIKICAgIF0sCiAgICAiZW52X3ZhcnMiOiB7ImNvdW50ZXJfbmFtZSI6ICJ0ZXN0X2NvdW50ZXIifQp9Cg==
  rayClusterSpec:
    rayVersion: '2.4.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: '1' # can be auto-completed from the limits
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.4.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              resources:
                limits:
                  cpu: "2"
                requests:
                  cpu: "1"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # number of worker pod replicas in this group
      - replicas: 3
        minReplicas: 1
        maxReplicas: 5
        # logical name of this worker group (small-group here); a functional name also works
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.4.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "1"
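
The runtimeEnv value in the manifest above is a base64-encoded JSON document, which makes it hard to read at a glance. The following Python sketch decodes it using only the standard library; the helper `encode_runtime_env` is a hypothetical name showing how a modified environment could be encoded back into the same form.

```python
import base64
import json

# The runtimeEnv string copied verbatim from the RayJob manifest.
RUNTIME_ENV_B64 = "ewogICAgInBpcCI6IFsKICAgICAgICAicmVxdWVzdHM9PTIuMjYuMCIsCiAgICAgICAgInBlbmR1bHVtPT0yLjEuMiIKICAgIF0sCiAgICAiZW52X3ZhcnMiOiB7ImNvdW50ZXJfbmFtZSI6ICJ0ZXN0X2NvdW50ZXIifQp9Cg=="

# Decode the base64 payload and parse the JSON it contains.
runtime_env = json.loads(base64.b64decode(RUNTIME_ENV_B64))
print(json.dumps(runtime_env, indent=2))


def encode_runtime_env(env: dict) -> str:
    """Encode a runtime environment dictionary as base64 JSON (hypothetical helper)."""
    return base64.b64encode(json.dumps(env).encode("utf-8")).decode("ascii")
```

The decoded document lists the pip packages `requests==2.26.0` and `pendulum==2.1.2` and sets the environment variable `counter_name` to `test_counter`. These are the values the sample code reads through `os.getenv("counter_name")` and `requests.__version__`.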

You can run this RayJob with the following commands:

# Create the code ConfigMap (once)
kubectl apply -f ray-job-code-sample.yaml
# Create a RayJob. You can run this command multiple times
# to observe the queueing and admission of the jobs.
kubectl create -f ray-job-sample.yaml