Day 2 Operations

This guide covers common operational tasks for managing HyperShift hosted clusters after initial deployment.

Persona: HyperShift Administrator

Day 2 operations are performed on running clusters to manage capacity, upgrade versions, and maintain cluster health.

Prerequisites

  • At least one HostedCluster successfully deployed
  • oc/kubectl access to the management cluster
  • HyperShift CLI binary available

Managing NodePools

NodePools define the worker node configuration and count for your hosted cluster. You can add, modify, scale, and delete NodePools as your workload requirements change.

Understanding NodePools

What is a NodePool?

  • Defines a group of worker nodes with identical configuration
  • Specifies VM size, node count, and Azure-specific settings
  • A cluster can have multiple NodePools, each with a different configuration
  • Each NodePool manages its own set of Azure VMs

Use cases for multiple NodePools:

  • Different VM sizes for different workload types (CPU vs memory optimized)
  • Separate pools for different availability zones
  • Testing new VM configurations before migrating workloads
  • Gradual node upgrades with blue/green approaches
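
For reference, a NodePool is an ordinary Kubernetes resource on the management cluster. Below is a minimal sketch of what the CLI generates, based on the hypershift.openshift.io/v1beta1 API; treat exact field names (vmSize in particular) as assumptions to verify with oc explain nodepool.spec:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: myprefix-hc-large-workers
  namespace: clusters
spec:
  clusterName: myprefix-hc          # HostedCluster this pool belongs to
  replicas: 3                       # desired worker count
  management:
    upgradeType: Replace            # rolling node replacement on upgrade
  platform:
    type: Azure
    azure:
      vmSize: Standard_D8s_v3       # assumed field name for the VM size
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.21.0-x86_64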

Adding Additional NodePools

Create a new NodePool with different configuration:

# Set variables (from your Taskfile or environment)
CLUSTER_NAME="myprefix-hc"
CLUSTER_NAMESPACE="clusters"
AZURE_CREDS="azure-credentials.json"

# Create a new NodePool with larger VMs
hypershift create nodepool azure \
    --cluster-name "${CLUSTER_NAME}" \
    --namespace "${CLUSTER_NAMESPACE}" \
    --name "${CLUSTER_NAME}-large-workers" \
    --node-count 3 \
    --azure-instance-type Standard_D8s_v3 \
    --azure-creds ${AZURE_CREDS}

Common VM sizes:

  • Standard_D2s_v3: 2 vCPU, 8GB RAM (default, balanced)
  • Standard_D4s_v3: 4 vCPU, 16GB RAM (larger workloads)
  • Standard_D8s_v3: 8 vCPU, 32GB RAM (high-performance)
  • Standard_E4s_v3: 4 vCPU, 32GB RAM (memory-optimized)
  • Standard_F8s_v2: 8 vCPU, 16GB RAM (compute-optimized)

See Azure VM sizes for the complete list.
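
If the Azure CLI is installed, you can also list the sizes available in your region (eastus below is just an example):

# List VM sizes available in a region
az vm list-sizes --location eastus --output table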

Verify NodePool creation:

# Check NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE}

# Watch nodes join the cluster (switch to hosted cluster context)
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get nodes -w

Configuring NodePool Marketplace Images

For OpenShift 4.20+, NodePools automatically use the marketplace image defaults from the release payload. For custom configurations, use one of the following options:

Option 1: Specify VM Generation Only

hypershift create nodepool azure \
    --cluster-name "${CLUSTER_NAME}" \
    --namespace "${CLUSTER_NAMESPACE}" \
    --name "${CLUSTER_NAME}-gen1-workers" \
    --node-count 2 \
    --image-generation Gen1 \
    --azure-creds ${AZURE_CREDS}

Option 2: Use Custom Marketplace Image

hypershift create nodepool azure \
    --cluster-name "${CLUSTER_NAME}" \
    --namespace "${CLUSTER_NAMESPACE}" \
    --name "${CLUSTER_NAME}-custom-workers" \
    --node-count 2 \
    --marketplace-publisher azureopenshift \
    --marketplace-offer aro4 \
    --marketplace-sku aro_421 \
    --marketplace-version 421.0.20250101 \
    --azure-creds ${AZURE_CREDS}

Scaling NodePools

Adjust the number of worker nodes in a NodePool:

Scale using oc:

# Scale to 5 replicas
oc scale nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --replicas=5

# Verify scaling
oc get nodepool -n ${CLUSTER_NAMESPACE}

Scale using patch:

# Scale specific NodePool
oc patch nodepool/${CLUSTER_NAME}-large-workers \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch '{"spec":{"replicas":10}}'

Scale using edit:

# Edit NodePool interactively
oc edit nodepool/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# Modify spec.replicas field and save

Monitor scaling progress:

# Watch NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE} -w

# Watch machines being created
oc get machines -n clusters-${CLUSTER_NAME} -w

# In hosted cluster context, watch nodes joining
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get nodes -w

Scaling Best Practices

  • Scale gradually in production (add/remove 1-2 nodes at a time)
  • Monitor cluster resource usage before scaling down
  • Ensure workloads have PodDisruptionBudgets configured
  • Drain nodes before scaling down to avoid disruption (see the example below)
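
Before reducing replicas, you can cordon and drain the node you expect to be removed. These are standard oc commands; <node-name> is a placeholder:

# Run against the hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Stop new pods from scheduling on the node
oc adm cordon <node-name>

# Evict pods, respecting PodDisruptionBudgets
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data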

Deleting NodePools

Remove a NodePool when no longer needed:

# Delete a specific NodePool
oc delete nodepool/${CLUSTER_NAME}-large-workers \
    -n ${CLUSTER_NAMESPACE}

# Verify deletion
oc get nodepool -n ${CLUSTER_NAMESPACE}

NodePool Deletion

Deleting a NodePool:

  • Immediately deletes the NodePool resource
  • Terminates all Azure VMs in that NodePool
  • Evicts all pods running on those nodes
  • Cannot be undone

Before deleting:

  • Drain workloads to other nodes
  • Ensure sufficient capacity remains
  • Verify no critical workloads are pinned to specific nodes (see the check below)
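
To see what is actually running on a pool's nodes first, the standard spec.nodeName field selector works; <node-name> is a placeholder:

# Run against the hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get pods --all-namespaces --field-selector spec.nodeName=<node-name>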

Upgrading Clusters

Upgrade hosted clusters to new OpenShift versions by changing the release image.

Understanding HyperShift Upgrades

How upgrades work:

  1. Update the HostedCluster spec.release.image field
  2. Control plane components upgrade first (pods on the management cluster)
  3. NodePools upgrade next (worker nodes replaced or upgraded)
  4. Cluster operators coordinate the upgrade process

Upgrade characteristics:

  • Control plane upgrades are fast (pod restarts)
  • Worker node upgrades create new VMs and drain old ones
  • Minimal downtime with proper planning
  • Control plane and NodePools can be upgraded separately
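
Before starting, record what the cluster currently runs. spec.release.image is the same field this guide patches, so reading it back is safe:

# Current release image of the hosted cluster
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.spec.release.image}{"\n"}'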

Upgrading the Control Plane

Update the cluster release image:

# Variables
CLUSTER_NAME="myprefix-hc"
CLUSTER_NAMESPACE="clusters"
NEW_RELEASE_IMAGE="quay.io/openshift-release-dev/ocp-release:4.21.1-x86_64"

# Patch the HostedCluster
oc patch hostedcluster/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch "{\"spec\":{\"release\":{\"image\":\"${NEW_RELEASE_IMAGE}\"}}}"

Monitor control plane upgrade:

# Watch HostedCluster status
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -w

# Check control plane pod rollouts
oc get pods -n clusters-${CLUSTER_NAME} -w

# In hosted cluster context, check cluster version
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get clusterversion

Expected progression:

# HostedCluster shows progressing
NAME          VERSION   AVAILABLE   PROGRESSING   MESSAGE
myprefix-hc   4.21.0    True        True          Upgrading to 4.21.1

# After completion
NAME          VERSION   AVAILABLE   PROGRESSING   MESSAGE
myprefix-hc   4.21.1    True        False         The hosted cluster is available

Upgrading NodePools

Upgrade worker nodes to match the control plane version:

# Option 1: Patch the NodePool release
oc patch nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch "{\"spec\":{\"release\":{\"image\":\"${NEW_RELEASE_IMAGE}\"}}}"

# Option 2: Edit NodePool interactively
oc edit nodepool/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}
# Update spec.release.image and save

Monitor NodePool upgrade:

# Watch NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE} -w

# Watch machines being replaced
oc get machines -n clusters-${CLUSTER_NAME} -w

# In hosted cluster context, watch node versions
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get nodes -w

Upgrade strategies:

Replace strategy (default):

  • Creates new VMs with the new version
  • Drains and deletes old VMs
  • Rolling replacement maintains capacity
  • Safer but slower

InPlace strategy:

  • Upgrades existing VMs in place
  • Faster but requires node reboots
  • Less common in cloud environments
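
The strategy is set on the NodePool itself. Here is a sketch of switching it, assuming the spec.management.upgradeType field from the hypershift.openshift.io/v1beta1 API; verify the path for your version with oc explain nodepool.spec.management:

# Field path is an assumption; check it before applying
oc patch nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch '{"spec":{"management":{"upgradeType":"InPlace"}}}'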

Upgrade Best Practices

Pre-Upgrade Checklist

Before upgrading:

  • [ ] Review release notes for breaking changes
  • [ ] Test upgrade in non-production cluster first
  • [ ] Ensure cluster operators are healthy: oc get co
  • [ ] Verify sufficient capacity in management cluster
  • [ ] Back up critical data and configurations
  • [ ] Notify users of potential disruption
  • [ ] Plan maintenance window if required

Upgrade sequence:

  1. Upgrade the control plane first
  2. Wait for the control plane to become Available (see the wait example below)
  3. Verify all cluster operators are healthy
  4. Upgrade NodePools (one at a time in production)
  5. Verify cluster functionality
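
oc wait can block on the Available condition that HostedCluster reports (see Key Health Indicators below):

# Wait for the control plane to report Available, with a generous timeout
oc wait hostedcluster/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --for=condition=Available \
    --timeout=30m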

Rollback: Downgrades are not supported. If issues occur:

  • Fix issues in the upgraded version
  • Or restore from backup / recreate the cluster

Monitoring Cluster Health

Checking Cluster Status

From management cluster:

# HostedCluster health
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# NodePool health
oc get nodepool -n ${CLUSTER_NAMESPACE}

# Control plane pods
oc get pods -n clusters-${CLUSTER_NAME}

# Control plane resource usage
oc top pods -n clusters-${CLUSTER_NAME}

From hosted cluster:

# Switch to hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Cluster version and available updates
oc get clusterversion

# Cluster operators
oc get co

# Node health
oc get nodes

# Node resource usage
oc top nodes

Key Health Indicators

Healthy HostedCluster:

status:
  conditions:
  - type: Available
    status: "True"
  - type: Progressing
    status: "False"
  - type: Degraded
    status: "False"

Healthy NodePool:

status:
  replicas: 3        # Matches desired count
  readyReplicas: 3   # All nodes ready
  conditions:
  - type: Ready
    status: "True"

Healthy Cluster Operators:

# All operators should show AVAILABLE=True, DEGRADED=False
oc get co

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication                             4.21.0    True        False         False
cloud-credential                           4.21.0    True        False         False
cluster-autoscaler                         4.21.0    True        False         False
...

Cluster Metrics and Logging

View control plane metrics (from management cluster):

# CPU and memory usage
oc top pods -n clusters-${CLUSTER_NAME}

# Node resource allocation on management cluster
oc top nodes

View hosted cluster metrics:

# Switch to hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Worker node metrics
oc top nodes

# Pod metrics across cluster
oc top pods --all-namespaces

Access control plane logs:

# From management cluster
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-apiserver
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-controller-manager
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd

Modifying Cluster Configuration

Changing Cluster Networking

Update cluster network settings:

# Edit HostedCluster
oc edit hostedcluster/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# Modify spec.networking fields
# Note: Some changes require cluster recreation

Network Configuration Changes

Many networking changes cannot be applied to running clusters and require recreation:

  • Service network CIDR
  • Pod network CIDR
  • Network type (OVNKubernetes vs others)

Changes that can be applied:

  • DNS configuration
  • Proxy settings (see the sketch below)
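
As an illustration, proxy settings usually flow through the HostedCluster's configuration block. The spec.configuration.proxy path below is an assumption based on the hypershift.openshift.io/v1beta1 API, so confirm it with oc explain hostedcluster.spec.configuration before applying; the proxy address is a placeholder:

# Sketch only: field path assumed, values are placeholders
oc patch hostedcluster/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch '{"spec":{"configuration":{"proxy":{"httpProxy":"http://proxy.example.com:3128","noProxy":"10.0.0.0/8"}}}}'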

Managing Cluster Credentials

Rotate kubeadmin password:

# From management cluster
oc delete secret kubeadmin -n clusters-${CLUSTER_NAME}

# Controller will regenerate with new password
# Retrieve new password
oc get secret kubeadmin \
    -n clusters-${CLUSTER_NAME} \
    -o jsonpath='{.data.password}' | base64 -d

Add cluster administrators:

# From hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Grant cluster-admin to user
oc create clusterrolebinding admin-user \
    --clusterrole=cluster-admin \
    --user=admin@example.com

NodePool Advanced Configuration

Configuring Node Labels and Taints

Add labels to NodePool:

oc patch nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch '{"spec":{"nodeLabels":{"workload-type":"database"}}}'

Add taints to NodePool:

oc edit nodepool/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# Add to spec:
spec:
  taints:
  - key: dedicated
    value: database
    effect: NoSchedule

Verify labels and taints:

# From hosted cluster
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get nodes --show-labels
oc describe node <node-name> | grep Taints
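
Workloads that should land on the labeled, tainted pool need both a matching nodeSelector and a toleration. A minimal pod sketch using the label and taint from the examples above (the image is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: database-workload
spec:
  nodeSelector:
    workload-type: database        # label applied to the NodePool above
  tolerations:
  - key: dedicated                 # matches the taint from the example
    operator: Equal
    value: database
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/db:latest   # placeholder image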

Configuring Root Volume Size

Increase root volume size for NodePool VMs:

oc patch nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch '{"spec":{"platform":{"azure":{"diskSizeGB":256}}}}'

Volume Size Changes

  • Only affects new nodes created after the change
  • Existing nodes keep their current disk size
  • Scale down and up to replace nodes with the new disk size (see below)
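
One way to roll the pool so every node picks up the new size (disruptive: capacity in the pool drops to zero in between, so in production prefer creating a second NodePool and migrating workloads):

# Replace all nodes in the pool with freshly created ones
oc scale nodepool/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} --replicas=0
oc scale nodepool/${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} --replicas=3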

Quick Reference

Common Operations

# Add NodePool
hypershift create nodepool azure \
    --cluster-name ${CLUSTER_NAME} \
    --namespace ${CLUSTER_NAMESPACE} \
    --name ${CLUSTER_NAME}-new-pool \
    --node-count 3 \
    --azure-instance-type Standard_D4s_v3 \
    --azure-creds ${AZURE_CREDS}

# Scale NodePool
oc scale nodepool/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --replicas=5

# Upgrade cluster
NEW_VERSION="quay.io/openshift-release-dev/ocp-release:4.21.1-x86_64"
oc patch hostedcluster/${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    --patch "{\"spec\":{\"release\":{\"image\":\"${NEW_VERSION}\"}}}"

# Delete NodePool
oc delete nodepool/${CLUSTER_NAME}-new-pool \
    -n ${CLUSTER_NAMESPACE}

# Check cluster health
oc get hostedcluster,nodepool -n ${CLUSTER_NAMESPACE}
oc get co  # From hosted cluster context