Troubleshooting

This guide provides diagnostic procedures and solutions for common issues encountered with HyperShift hosted clusters on Azure.

Common Issues

Cluster Creation Failures

Authentication Errors with Workload Identities

Symptom: Cluster creation fails with "failed to authenticate" or Azure authentication errors in control plane pods.

Common error messages:

Failed to get managed identity credential: AADSTS70021: No matching federated identity record found

Root causes:

- Federated credentials not created or misconfigured
- OIDC issuer URL mismatch
- Service account namespace/name mismatch
- Managed identity doesn't exist

Diagnostic steps:

# 1. Verify managed identities exist
az identity list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "[?contains(name, '${CLUSTER_NAME}')].{Name:name, ClientId:clientId}" \
    --output table

# 2. Check federated credentials for a specific identity
task azure:list-federated-creds

# Or manually check:
az identity federated-credential list \
    --identity-name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME} \
    --output table

# 3. Verify OIDC issuer is accessible
curl -s "${OIDC_ISSUER_URL}/.well-known/openid-configuration" | jq .

# 4. Verify OIDC issuer URL matches
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.spec.issuerURL}'
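
The issuer recorded on each federated credential must match this value exactly. A quick comparison sketch (the credential name is a placeholder; take the real one from the federated-credential list output above):

az identity federated-credential show \
    --name <federated-credential-name> \
    --identity-name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query issuer -o tsv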

Solutions:

Solution 1: Recreate federated credentials

# Delete and recreate federated credentials
task azure:delete-federated-creds
task azure:federated-creds

Solution 2: Verify configuration matches

# Ensure PREFIX and CLUSTER_NAME are identical between:
# - Identity creation (task azure:identities)
# - Federated credential creation (task azure:federated-creds)
# - Cluster creation (task cluster:create)

# Check Taskfile.yml values
grep -E 'PREFIX|CLUSTER_NAME' hack/dev-preview/Taskfile.yml

Solution 3: Verify OIDC issuer accessibility

# Test OIDC endpoint
curl -v "${OIDC_ISSUER_URL}/.well-known/openid-configuration"

# Ensure storage account allows public access
az storage account show \
    --name ${OIDC_STORAGE_ACCOUNT_NAME} \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "allowBlobPublicAccess"

Precondition Failures

Symptom: task cluster:create fails with missing precondition errors.

Error example:

task: precondition not met: Managed resource group myprefix-managed-rg not found

Solutions:

# Missing managed resource group
task azure:infra

# Missing workload identities
task azure:identities

# Missing federated credentials
task azure:federated-creds

# Missing network IDs file
# This is created by task azure:infra
ls -l .azure-net-ids

# Missing credential files
ls -l ${AZURE_CREDS} ${PULL_SECRET}

Azure Quota or Capacity Issues

Symptom: Cluster creation fails with Azure quota exceeded errors.

Error messages:

Operation could not be completed as it results in exceeding approved Total Regional vCPUs quota

Solutions:

# Check current quota usage
az vm list-usage \
    --location ${LOCATION} \
    --output table

# Request quota increase through Azure portal:
# Portal → Subscriptions → Usage + quotas → Request increase

# Workaround: Use smaller VM size
# In cluster creation, reduce node count or VM size
NODE_POOL_REPLICAS=1  # Reduce from 2+
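
If the cluster already exists, the same relief can come from scaling an existing NodePool down rather than recreating anything. A minimal sketch, assuming the NodePool shares the cluster's name (confirm with oc get nodepool -n ${CLUSTER_NAMESPACE}):

oc patch nodepool ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    -p '{"spec":{"replicas":1}}'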

Control Plane Issues

Control Plane Pods Not Starting

Symptom: Control plane pods stuck in Pending, ImagePullBackOff, or CrashLoopBackOff.

Diagnostic steps:

# Check pod status
oc get pods -n clusters-${CLUSTER_NAME}

# Describe problematic pod
oc describe pod <pod-name> -n clusters-${CLUSTER_NAME}

# Check pod logs
oc logs <pod-name> -n clusters-${CLUSTER_NAME}

# Check events
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp'

Common causes and solutions:

Insufficient Resources on Management Cluster:

# Check node resources
oc top nodes

# Check pod resource requests
oc describe pod <pod-name> -n clusters-${CLUSTER_NAME} | grep -A 5 Requests

# Each hosted cluster needs approximately:
# - 4 vCPU
# - 8 GB memory

# Solutions:
# - Add more nodes to management cluster
# - Delete unused hosted clusters
# - Reduce resource requests (not recommended for production)
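
To see how much headroom each management cluster node actually has, compare its allocatable capacity with the requests already scheduled on it:

# Per-node summary of scheduled requests vs. allocatable capacity
oc describe nodes | grep -A 8 "Allocated resources"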

Image Pull Errors:

# Check pull secret
oc get secret ${CLUSTER_NAME}-pull-secret \
    -n clusters-${CLUSTER_NAME} \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .

# Verify pull secret is valid at cloud.redhat.com
# Recreate cluster with valid pull secret if needed
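
Recreating the cluster is not always necessary: the HostedCluster references a pull-secret Secret in ${CLUSTER_NAMESPACE}, and replacing that secret's contents may be enough. A sketch, assuming a refreshed pull secret has been saved to ${PULL_SECRET}:

# Find the secret the HostedCluster references
PULL_SECRET_NAME=$(oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.spec.pullSecret.name}')

# Replace its contents with the refreshed pull secret
oc create secret generic ${PULL_SECRET_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type=kubernetes.io/dockerconfigjson \
    --from-file=.dockerconfigjson=${PULL_SECRET} \
    --dry-run=client -o yaml | oc replace -f -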

etcd Issues:

# Check etcd pods
oc get pods -n clusters-${CLUSTER_NAME} -l app=etcd

# Check etcd logs
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd

# Check etcd persistent volumes
oc get pvc -n clusters-${CLUSTER_NAME}

Control Plane Performance Issues

Symptom: Slow API response times or timeouts.

Diagnostic steps:

# Check control plane pod CPU/memory usage
oc top pods -n clusters-${CLUSTER_NAME}

# Check API server logs for slow requests
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-apiserver | grep -i "slow\|timeout"

# Check etcd performance
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd | grep -i "slow"

Solutions:

- Scale up management cluster nodes
- Reduce load on hosted cluster
- Check management cluster storage performance
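
Storage performance problems often trace back to the disk class backing the etcd PersistentVolumeClaims. A quick check of the class and size in use:

oc get pvc -n clusters-${CLUSTER_NAME} \
    -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage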

Worker Node Issues

Worker Nodes Not Joining Cluster

Symptom: Azure VMs are created but don't appear in oc get nodes (from hosted cluster context).

Diagnostic steps:

# 1. Check NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE} -o yaml

# 2. Check Machine resources
oc get machines -n clusters-${CLUSTER_NAME}

# 3. Check Azure VMs
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# 4. Check ignition server logs
oc logs -n clusters-${CLUSTER_NAME} deployment/ignition-server

# 5. Get VM boot diagnostics (if enabled)
VM_NAME=$(az vm list --resource-group ${MANAGED_RG_NAME} --query "[0].name" -o tsv)
az vm boot-diagnostics get-boot-log \
    --resource-group ${MANAGED_RG_NAME} \
    --name ${VM_NAME}

Common causes and solutions:

NSG Blocking Required Ports:

# Check NSG rules
az network nsg show \
    --name ${NSG} \
    --resource-group ${NSG_RG_NAME}

# Required inbound ports:
# - 6443 (API server)
# - 22623 (machine config server / ignition)
# - 443 (ingress)
# - 10250-10259 (kubelet, kube-proxy)

# Fix: Update NSG rules
az network nsg rule create \
    --resource-group ${NSG_RG_NAME} \
    --nsg-name ${NSG} \
    --name allow-ignition \
    --priority 1000 \
    --source-address-prefixes '*' \
    --destination-port-ranges 22623 \
    --access Allow \
    --protocol Tcp
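
After adding or adjusting rules, confirm the expected ports are actually allowed:

az network nsg rule list \
    --nsg-name ${NSG} \
    --resource-group ${NSG_RG_NAME} \
    --query "[?access=='Allow'].{Name:name, Ports:destinationPortRange, Priority:priority}" \
    --output table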

Ignition Server Not Reachable:

# Check ignition server service
oc get svc -n clusters-${CLUSTER_NAME} ignition-server

# From a node, test connectivity (requires VM access)
# curl -k https://ignition-server.clusters-${CLUSTER_NAME}.svc:443/config/worker
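
Depending on the service publishing strategy, the ignition endpoint may also be exposed through a Route on the management cluster; if so, it can be probed from a workstation. The route name varies by version, so list the routes first:

# Find the externally published ignition endpoint, if any
oc get routes -n clusters-${CLUSTER_NAME}

# Probe the host shown above (illustrative)
# curl -kI https://<ignition-route-host>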

Workload Identity Issues:

# Check nodepool-mgmt identity has proper roles
az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-nodepool-mgmt" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query clientId -o tsv) \
    --output table

Node Status Shows NotReady

Symptom: Nodes appear in oc get nodes but show status NotReady.

Diagnostic steps:

# From hosted cluster context
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Check node status
oc get nodes -o wide

# Describe the NotReady node
oc describe node <node-name>

# Check node conditions
oc get node <node-name> -o jsonpath='{.status.conditions}' | jq .

Common causes:

- Container runtime issues
- Network plugin not ready
- Insufficient resources on node
- Disk pressure

Solutions:

# Check kubelet logs (requires SSH access to node)
# ssh -i ${CLUSTER_NAME}-ssh-key core@<node-ip>
# sudo journalctl -u kubelet -f

# Delete and recreate the node
oc delete node <node-name>
# NodePool controller will create replacement VM
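
From the management cluster context (not the hosted cluster kubeconfig), the replacement Machine can be watched as it is provisioned:

oc get machines -n clusters-${CLUSTER_NAME} -w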

DNS and Networking Issues

DNS Records Not Created (External DNS)

Symptom: Cannot resolve cluster API or app hostnames via DNS.

Diagnostic steps:

# Check External DNS pod
oc get pods -n hypershift -l app=external-dns

# Check External DNS logs
oc logs -n hypershift deployment/external-dns

# Check Route resources
oc get routes -n clusters-${CLUSTER_NAME}

# Verify DNS zone exists
az network dns zone show \
    --name ${PARENT_DNS_ZONE} \
    --resource-group ${PERSISTENT_RG_NAME}

# Check DNS records
az network dns record-set list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --zone-name ${PARENT_DNS_ZONE} \
    --output table

Common issues:

External DNS Authentication Failures:

# Check External DNS secret
oc get secret azure-config-file -n default

# Verify service principal permissions
az role assignment list \
    --assignee $(jq -r '.aadClientId' < azure_mgmt.json) \
    --output table

# Should have "DNS Zone Contributor" role on DNS zone resource group

External DNS Not Watching Correct Namespace:

# Check External DNS configuration
oc get deployment external-dns -n hypershift -o yaml | grep -A 10 args

# Ensure it's watching routes in hosted cluster namespaces
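
The flags to look for are, roughly, the provider, the domain filter, and the source types; exact flag names vary by External DNS version, so treat this as a sketch:

# Dump the container args one per line
oc get deployment external-dns -n hypershift \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n'
# Expect something like --provider=azure and a --domain-filter matching your zone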

Solution: Recreate External DNS configuration

# Reinstall HyperShift operator with correct External DNS config
hypershift install \
    --external-dns-provider=azure \
    --external-dns-credentials azure_mgmt.json \
    --pull-secret ${PULL_SECRET} \
    --external-dns-domain-filter ${EXTRN_DNS_ZONE_NAME} \
    --limit-crd-install Azure \
    --render | oc apply -f -

LoadBalancer Service Stuck Pending (Without External DNS)

Symptom: API server LoadBalancer service doesn't get external IP.

Diagnostic steps:

# Check service
oc get svc -n clusters-${CLUSTER_NAME} kube-apiserver

# Check cloud-controller-manager logs
oc logs -n clusters-${CLUSTER_NAME} deployment/cloud-controller-manager

# Check Azure load balancer
az network lb list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

Solutions:

- Verify cloud-provider workload identity has correct permissions
- Check Azure subscription quotas for load balancers
- Verify network connectivity from management cluster to Azure APIs
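
To inspect the cloud-provider identity's permissions, the same pattern from the worker-node section applies. The identity name suffix below is a guess; confirm the real name with task azure:list-identities first:

az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-cloud-provider" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query clientId -o tsv) \
    --output table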

Upgrade Issues

Upgrade Stuck or Fails

Symptom: Cluster upgrade doesn't complete or control plane pods crash during upgrade.

Diagnostic steps:

# Check HostedCluster upgrade status
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.status.version}'

# Check cluster version operator
oc logs -n clusters-${CLUSTER_NAME} deployment/cluster-version-operator

# Check control plane pod status
oc get pods -n clusters-${CLUSTER_NAME}

# From hosted cluster, check cluster operators
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get co

Solutions:

# If upgrade is stuck, check for:
# - Degraded cluster operators before upgrade
# - Insufficient resources during upgrade
# - Breaking changes in release notes

# Cannot roll back - must fix forward or recreate the cluster
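
From the hosted cluster context, the ClusterVersion conditions usually explain what the upgrade is waiting on:

export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc adm upgrade
oc get clusterversion version -o jsonpath='{.status.conditions}' | jq .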

Diagnostic Commands Reference

HostedCluster Health

# Overall cluster status
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# Detailed status with conditions
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml

# Check HostedControlPlane
oc get hostedcontrolplane -n clusters-${CLUSTER_NAME}

# Describe for events
oc describe hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

NodePool Health

# NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE}

# Detailed NodePool information
oc get nodepool ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml

# Check Machine resources
oc get machines -n clusters-${CLUSTER_NAME}

# Machine details
oc describe machine <machine-name> -n clusters-${CLUSTER_NAME}

Control Plane Diagnostics

# All control plane pods
oc get pods -n clusters-${CLUSTER_NAME}

# Pod resource usage
oc top pods -n clusters-${CLUSTER_NAME}

# Specific component logs
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-apiserver -f
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-controller-manager -f
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-scheduler -f
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd -f

# Check all resources in control plane namespace
oc get all -n clusters-${CLUSTER_NAME}

# Events in control plane namespace
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp'

HyperShift Operator Diagnostics

# Operator pod status
oc get pods -n hypershift

# Operator logs
oc logs -n hypershift deployment/operator -f

# Check operator reconciliation
oc logs -n hypershift deployment/operator | grep ${CLUSTER_NAME}

Azure Resource Diagnostics

# List all resource groups for cluster
az group list \
    --query "[?contains(name, '${PREFIX}')].{Name:name, Location:location, State:properties.provisioningState}" \
    --output table

# Check VMs
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# Check VM power state
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --show-details \
    --query "[].{Name:name, PowerState:powerState}" \
    --output table

# Check load balancers
az network lb list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# Check NSG rules
az network nsg rule list \
    --nsg-name ${NSG} \
    --resource-group ${NSG_RG_NAME} \
    --output table

Workload Identity Diagnostics

# List managed identities
task azure:list-identities

# Or manually:
az identity list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "[?contains(name, '${CLUSTER_NAME}')]" \
    --output table

# List federated credentials
task azure:list-federated-creds

# Check specific component identity
az identity show \
    --name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME}

# Check role assignments
az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-disk-csi" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query principalId -o tsv) \
    --output table

Collecting Debug Information

Comprehensive Cluster Dump

#!/bin/bash
# Save to a file: cluster-debug.sh

CLUSTER_NAME="myprefix-hc"
CLUSTER_NAMESPACE="clusters"
OUTPUT_DIR="debug-$(date +%Y%m%d-%H%M%S)"

mkdir -p ${OUTPUT_DIR}

echo "Collecting debug information for ${CLUSTER_NAME}..."

# HostedCluster
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml > ${OUTPUT_DIR}/hostedcluster.yaml

# NodePools
oc get nodepool -n ${CLUSTER_NAMESPACE} -o yaml > ${OUTPUT_DIR}/nodepools.yaml

# Control plane pods
oc get pods -n clusters-${CLUSTER_NAME} -o wide > ${OUTPUT_DIR}/control-plane-pods.txt
oc get pods -n clusters-${CLUSTER_NAME} -o yaml > ${OUTPUT_DIR}/control-plane-pods.yaml

# Control plane events
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp' > ${OUTPUT_DIR}/control-plane-events.txt

# Operator logs
oc logs -n hypershift deployment/operator --tail=1000 > ${OUTPUT_DIR}/operator-logs.txt

# Control plane logs
for deployment in kube-apiserver kube-controller-manager kube-scheduler; do
    oc logs -n clusters-${CLUSTER_NAME} deployment/${deployment} --tail=500 > ${OUTPUT_DIR}/${deployment}-logs.txt
done

# Machines
oc get machines -n clusters-${CLUSTER_NAME} -o yaml > ${OUTPUT_DIR}/machines.yaml

echo "Debug information collected in ${OUTPUT_DIR}/"
tar -czf ${OUTPUT_DIR}.tar.gz ${OUTPUT_DIR}/
echo "Archive created: ${OUTPUT_DIR}.tar.gz"

Getting Help

Before Requesting Support

Collect the following information:

  1. Cluster details:
     - OpenShift version
     - HyperShift operator version
     - Azure region

  2. Error symptoms:
     - Exact error messages
     - When the issue started
     - Steps to reproduce

  3. Debug information (use the script above):
     - HostedCluster YAML
     - Control plane pod status and logs
     - NodePool status
     - Operator logs

Support Channels

For Developer Preview:

- HyperShift GitHub Issues
- HyperShift Slack (#hypershift channel)

For Production (when GA):

- Red Hat Support Portal
- OpenShift support cases

Next Steps