Create a Ray Cluster SDK upgrade scenarios during OLM upgrade #424

Closed
8 changes: 4 additions & 4 deletions .github/workflows/olm_tests.yaml
@@ -122,12 +122,12 @@ jobs:
           BUNDLE_PUSH_OPT: "--tls-verify=false"
           CATALOG_PUSH_OPT: "--tls-verify=false"

-      - name: Run OLM Upgrade e2e AppWrapper creation test
+      - name: Run OLM Pre Upgrade test scenarios
         run: |
           export CODEFLARE_TEST_OUTPUT_DIR=${{ env.TEMP_DIR }}
           echo "CODEFLARE_TEST_OUTPUT_DIR=${CODEFLARE_TEST_OUTPUT_DIR}" >> $GITHUB_ENV
           set -euo pipefail
-          go test -timeout 30m -v ./test/upgrade -run TestMNISTCreateAppWrapper -json 2>&1 | tee ${CODEFLARE_TEST_OUTPUT_DIR}/gotest.log | gotestfmt
+          go test -timeout 30m -v ./test/upgrade -run 'TestMNISTCreateAppWrapper|TestMNISTRayClusterUp' -json 2>&1 | tee ${CODEFLARE_TEST_OUTPUT_DIR}/gotest.log | gotestfmt

       - name: Update Operator to the built version
         run: |
@@ -158,12 +158,12 @@ jobs:
           SUBSCRIPTION_NAME: "codeflare-operator"
           SUBSCRIPTION_NAMESPACE: "openshift-operators"

-      - name: Run OLM Upgrade e2e Appwrapper Job status test to monitor training
+      - name: Run OLM Post Upgrade test scenarios
         run: |
           export CODEFLARE_TEST_OUTPUT_DIR=${{ env.TEMP_DIR }}
           echo "CODEFLARE_TEST_OUTPUT_DIR=${CODEFLARE_TEST_OUTPUT_DIR}" >> $GITHUB_ENV
           set -euo pipefail
-          go test -timeout 30m -v ./test/upgrade -run TestMNISTCheckAppWrapperStatus -json 2>&1 | tee ${CODEFLARE_TEST_OUTPUT_DIR}/gotest.log | gotestfmt
+          go test -timeout 30m -v ./test/upgrade -run 'TestMNISTCheckAppWrapperStatus|TestMnistJobSubmit' -json 2>&1 | tee ${CODEFLARE_TEST_OUTPUT_DIR}/gotest.log | gotestfmt

       - name: Run e2e tests against built operator
         run: |
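
The `-run` flag of `go test` takes a regular expression that is matched, unanchored, against test names, so the alternation `A|B` selects both tests in one invocation. The sketch below illustrates that selection using Python's `re` module as a stand-in for Go's regexp engine (an assumption that holds for simple alternations like these). The helper `select_tests` is hypothetical, and the name list contains only the four tests named in the workflow, not the full contents of `./test/upgrade`.

```python
import re
from typing import List


def select_tests(pattern: str, names: List[str]) -> List[str]:
    """Return the names that an unanchored regex search would match."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


# Only the test names that appear in the workflow above.
upgrade_tests = [
    "TestMNISTCreateAppWrapper",
    "TestMNISTRayClusterUp",
    "TestMNISTCheckAppWrapperStatus",
    "TestMnistJobSubmit",
]

pre = select_tests("TestMNISTCreateAppWrapper|TestMNISTRayClusterUp", upgrade_tests)
post = select_tests("TestMNISTCheckAppWrapperStatus|TestMnistJobSubmit", upgrade_tests)

print("pre-upgrade:", pre)
print("post-upgrade:", post)
```

Matching is case-sensitive, so `TestMnistJobSubmit` and the `TestMNIST...` names must be spelled in the pattern exactly as they are declared.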
46 changes: 46 additions & 0 deletions test/e2e/mnist_rayjob.py
@@ -0,0 +1,46 @@
import sys

from time import sleep

from torchx.specs.api import AppState, is_terminal

from codeflare_sdk.cluster.cluster import get_cluster
from codeflare_sdk.job.jobs import DDPJobDefinition

namespace = sys.argv[1]

cluster = get_cluster('mnist',namespace)

cluster.details()

jobdef = DDPJobDefinition(
    name="mnist",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
)
job = jobdef.submit(cluster)

done = False
time = 0
timeout = 900
while not done:
    status = job.status()
    if is_terminal(status.state):
        break
    if not done:
        print(status)
        if timeout and time >= timeout:
            raise TimeoutError(f"job has timed out after waiting {timeout}s")
        sleep(5)
        time += 5

print(f"Job has completed: {status.state}")

print(job.logs())

cluster.down()

if not status.state == AppState.SUCCEEDED:
    exit(1)
else:
    exit(0)
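
The polling loop in `mnist_rayjob.py` can be read as a general "wait until terminal or time out" pattern. The following is a minimal sketch of that pattern, not the code from this PR: `wait_until_terminal` and its parameters are hypothetical names, and plain strings stand in for the job states so that the example runs without a Ray cluster or the `torchx` and `codeflare_sdk` packages.

```python
import time
from typing import Callable


def wait_until_terminal(
    get_state: Callable[[], str],
    is_terminal: Callable[[str], bool],
    timeout: float = 900.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until is_terminal(state) holds or the timeout elapses."""
    waited = 0.0
    while True:
        state = get_state()
        if is_terminal(state):
            return state
        if waited >= timeout:
            raise TimeoutError(f"job has timed out after waiting {timeout}s")
        sleep(interval)
        waited += interval


# Stand-in states instead of a real job; sleeping is skipped for the demo.
states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final = wait_until_terminal(
    get_state=lambda: next(states),
    is_terminal=lambda s: s in {"SUCCEEDED", "FAILED"},
    sleep=lambda _seconds: None,
)
print(f"Job has completed: {final}")
```

Passing the sleep function as a parameter keeps the loop testable without waiting; a real caller would leave the default and supply a `get_state` that queries the submitted job.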
50 changes: 50 additions & 0 deletions test/e2e/start_ray_cluster.py
@@ -0,0 +1,50 @@
import sys
import os

from time import sleep

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

namespace = sys.argv[1]
ray_image = os.getenv('RAY_IMAGE')
host = os.getenv('CLUSTER_HOSTNAME')

ingress_options = {}
if host is not None:
    ingress_options = {
        "ingresses": [
            {
                "ingressName": "ray-dashboard",
                "port": 8265,
                "pathType": "Prefix",
                "path": "/",
                "host": host,
            },
        ]
    }

cluster = Cluster(ClusterConfiguration(
    name='mnist',
    namespace=namespace,
    num_workers=1,
    head_cpus='500m',
    head_memory=2,
    min_cpus='500m',
    max_cpus=1,
    min_memory=1,
    max_memory=2,
    num_gpus=0,
    instascale=False,
    image=ray_image,
    ingress_options=ingress_options,
))

cluster.up()

cluster.status()

cluster.wait_ready()

cluster.status()

cluster.details()
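
The conditional construction of `ingress_options` in `start_ray_cluster.py` can be isolated into a small function, which makes the "no hostname, no ingress" behavior easy to check on its own. This is a sketch under assumptions: `build_ingress_options` is a hypothetical helper that is not part of the PR or of `codeflare_sdk`, and it only reproduces the dictionary shape shown in the script.

```python
import os
from typing import Any, Dict, Optional


def build_ingress_options(host: Optional[str], port: int = 8265) -> Dict[str, Any]:
    """Return the ingress options dict, or an empty dict when no host is set."""
    if host is None:
        return {}
    return {
        "ingresses": [
            {
                "ingressName": "ray-dashboard",
                "port": port,
                "pathType": "Prefix",
                "path": "/",
                "host": host,
            },
        ]
    }


# Same environment variable as the script; unset yields an empty dict.
print(build_ingress_options(os.getenv("CLUSTER_HOSTNAME")))
```

The returned value could then be passed as `ingress_options=` exactly as the script does with its inline dictionary.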