Troubleshooting

This guide provides diagnostic procedures and solutions for common issues encountered with HyperShift hosted clusters on Azure.

Common Issues

Cluster Creation Failures

Authentication Errors with Workload Identities

Symptom: Cluster creation fails with "failed to authenticate" or Azure authentication errors in control plane pods.

Common error messages:

Failed to get managed identity credential: AADSTS70021: No matching federated identity record found

Root causes:

- Federated credentials not created or misconfigured
- OIDC issuer URL mismatch
- Service account namespace/name mismatch
- Managed identity doesn't exist

Diagnostic steps:

# 1. Verify managed identities exist
az identity list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "[?contains(name, '${CLUSTER_NAME}')].{Name:name, ClientId:clientId}" \
    --output table

# 2. Check federated credentials for a specific identity
task azure:list-federated-creds

# Or manually check:
az identity federated-credential list \
    --identity-name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME} \
    --output table

# 3. Verify OIDC issuer is accessible
curl -s "${OIDC_ISSUER_URL}/.well-known/openid-configuration" | jq .

# 4. Verify OIDC issuer URL matches
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.spec.issuerURL}'
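
The issuer recorded on each federated credential must match this value exactly. A quick comparison sketch (the credential name is a placeholder; take the real one from the federated-credential list output above):

az identity federated-credential show \
    --name <federated-credential-name> \
    --identity-name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query issuer -o tsv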

Solutions:

Solution 1: Recreate federated credentials

# Delete and recreate federated credentials
task azure:delete-federated-creds
task azure:federated-creds

Solution 2: Verify configuration matches

# Ensure PREFIX and CLUSTER_NAME are identical between:
# - Identity creation (task azure:identities)
# - Federated credential creation (task azure:federated-creds)
# - Cluster creation (task cluster:create)

# Check Taskfile.yml values
grep -E 'PREFIX|CLUSTER_NAME' hack/dev-preview/Taskfile.yml

Solution 3: Verify OIDC issuer accessibility

# Test OIDC endpoint
curl -v "${OIDC_ISSUER_URL}/.well-known/openid-configuration"

# Ensure storage account allows public access
az storage account show \
    --name ${OIDC_STORAGE_ACCOUNT_NAME} \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "allowBlobPublicAccess"

Precondition Failures

Symptom: task cluster:create fails with missing precondition errors.

Error example:

task: precondition not met: Managed resource group myprefix-managed-rg not found

Solutions:

# Missing managed resource group
task azure:infra

# Missing workload identities
task azure:identities

# Missing federated credentials
task azure:federated-creds

# Missing network IDs file
# This is created by task azure:infra
ls -l .azure-net-ids

# Missing credential files
ls -l ${AZURE_CREDS} ${PULL_SECRET}

Azure Quota or Capacity Issues

Symptom: Cluster creation fails with Azure quota exceeded errors.

Error messages:

Operation could not be completed as it results in exceeding approved Total Regional vCPUs quota

Solutions:

# Check current quota usage
az vm list-usage \
    --location ${LOCATION} \
    --output table

# Request quota increase through Azure portal:
# Portal → Subscriptions → Usage + quotas → Request increase

# Workaround: Use smaller VM size
# In cluster creation, reduce node count or VM size
NODE_POOL_REPLICAS=1  # Reduce from 2+
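
If the cluster already exists, the same relief can come from scaling an existing NodePool down rather than recreating anything. A minimal sketch, assuming the NodePool shares the cluster's name (confirm with oc get nodepool -n ${CLUSTER_NAMESPACE}):

oc patch nodepool ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type merge \
    -p '{"spec":{"replicas":1}}'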

Control Plane Issues

Control Plane Pods Not Starting

Symptom: Control plane pods stuck in Pending, ImagePullBackOff, or CrashLoopBackOff.

Diagnostic steps:

# Check pod status
oc get pods -n clusters-${CLUSTER_NAME}

# Describe problematic pod
oc describe pod <pod-name> -n clusters-${CLUSTER_NAME}

# Check pod logs
oc logs <pod-name> -n clusters-${CLUSTER_NAME}

# Check events
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp'

Common causes and solutions:

Insufficient Resources on Management Cluster:

# Check node resources
oc top nodes

# Check pod resource requests
oc describe pod <pod-name> -n clusters-${CLUSTER_NAME} | grep -A 5 Requests

# Each hosted cluster needs approximately:
# - 4 vCPU
# - 8 GB memory

# Solutions:
# - Add more nodes to management cluster
# - Delete unused hosted clusters
# - Reduce resource requests (not recommended for production)
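
To see how much headroom each management cluster node actually has, compare its allocatable capacity with the requests already scheduled on it:

# Per-node summary of scheduled requests vs. allocatable capacity
oc describe nodes | grep -A 8 "Allocated resources"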

Image Pull Errors:

# Check pull secret
oc get secret ${CLUSTER_NAME}-pull-secret \
    -n clusters-${CLUSTER_NAME} \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .

# Verify pull secret is valid at cloud.redhat.com
# Recreate cluster with valid pull secret if needed
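
Recreating the cluster is not always necessary: the HostedCluster references a pull-secret Secret in ${CLUSTER_NAMESPACE}, and replacing that secret's contents may be enough. A sketch, assuming a refreshed pull secret has been saved to ${PULL_SECRET}:

# Find the secret the HostedCluster references
PULL_SECRET_NAME=$(oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.spec.pullSecret.name}')

# Replace its contents with the refreshed pull secret
oc create secret generic ${PULL_SECRET_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    --type=kubernetes.io/dockerconfigjson \
    --from-file=.dockerconfigjson=${PULL_SECRET} \
    --dry-run=client -o yaml | oc replace -f -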

etcd Issues:

# Check etcd pods
oc get pods -n clusters-${CLUSTER_NAME} -l app=etcd

# Check etcd logs
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd

# Check etcd persistent volumes
oc get pvc -n clusters-${CLUSTER_NAME}

Control Plane Performance Issues

Symptom: Slow API response times or timeouts.

Diagnostic steps:

# Check control plane pod CPU/memory usage
oc top pods -n clusters-${CLUSTER_NAME}

# Check API server logs for slow requests
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-apiserver | grep -i "slow\|timeout"

# Check etcd performance
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd | grep -i "slow"

Solutions:

- Scale up management cluster nodes
- Reduce load on hosted cluster
- Check management cluster storage performance
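
Storage performance problems often trace back to the disk class backing the etcd PersistentVolumeClaims. A quick check of the class and size in use:

oc get pvc -n clusters-${CLUSTER_NAME} \
    -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage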

Worker Node Issues

Worker Nodes Not Joining Cluster

Symptom: Azure VMs are created but don't appear in oc get nodes (from hosted cluster context).

Diagnostic steps:

# 1. Check NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE} -o yaml

# 2. Check Machine resources
oc get machines -n clusters-${CLUSTER_NAME}

# 3. Check Azure VMs
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# 4. Check ignition server logs
oc logs -n clusters-${CLUSTER_NAME} deployment/ignition-server

# 5. Get VM boot diagnostics (if enabled)
VM_NAME=$(az vm list --resource-group ${MANAGED_RG_NAME} --query "[0].name" -o tsv)
az vm boot-diagnostics get-boot-log \
    --resource-group ${MANAGED_RG_NAME} \
    --name ${VM_NAME}

Common causes and solutions:

NSG Blocking Required Ports:

# Check NSG rules
az network nsg show \
    --name ${NSG} \
    --resource-group ${NSG_RG_NAME}

# Required inbound ports:
# - 6443 (API server)
# - 22623 (machine config server / ignition)
# - 443 (ingress)
# - 10250-10259 (kubelet, kube-proxy)

# Fix: Update NSG rules
az network nsg rule create \
    --resource-group ${NSG_RG_NAME} \
    --nsg-name ${NSG} \
    --name allow-ignition \
    --priority 1000 \
    --source-address-prefixes '*' \
    --destination-port-ranges 22623 \
    --access Allow \
    --protocol Tcp
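
After adding or adjusting rules, confirm the expected ports are actually allowed:

az network nsg rule list \
    --nsg-name ${NSG} \
    --resource-group ${NSG_RG_NAME} \
    --query "[?access=='Allow'].{Name:name, Ports:destinationPortRange, Priority:priority}" \
    --output table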

Ignition Server Not Reachable:

# Check ignition server service
oc get svc -n clusters-${CLUSTER_NAME} ignition-server

# From a node, test connectivity (requires VM access)
# curl -k https://ignition-server.clusters-${CLUSTER_NAME}.svc:443/config/worker
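
Depending on the service publishing strategy, the ignition endpoint may also be exposed through a Route on the management cluster; if so, it can be probed from a workstation. The route name varies by version, so list the routes first:

# Find the externally published ignition endpoint, if any
oc get routes -n clusters-${CLUSTER_NAME}

# Probe the host shown above (illustrative)
# curl -kI https://<ignition-route-host>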

Workload Identity Issues:

# Check nodepool-mgmt identity has proper roles
az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-nodepool-mgmt" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query clientId -o tsv) \
    --output table

Node Status Shows NotReady

Symptom: Nodes appear in oc get nodes but show status NotReady.

Diagnostic steps:

# From hosted cluster context
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig

# Check node status
oc get nodes -o wide

# Describe the NotReady node
oc describe node <node-name>

# Check node conditions
oc get node <node-name> -o jsonpath='{.status.conditions}' | jq .

Common causes:

- Container runtime issues
- Network plugin not ready
- Insufficient resources on node
- Disk pressure

Solutions:

# Check kubelet logs (requires SSH access to node)
# ssh -i ${CLUSTER_NAME}-ssh-key core@<node-ip>
# sudo journalctl -u kubelet -f

# Delete and recreate the node
oc delete node <node-name>
# NodePool controller will create replacement VM
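
From the management cluster context (not the hosted cluster kubeconfig), the replacement Machine can be watched as it is provisioned:

oc get machines -n clusters-${CLUSTER_NAME} -w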

DNS and Networking Issues

DNS Records Not Created (External DNS)

Symptom: Cannot resolve cluster API or app hostnames via DNS.

Diagnostic steps:

# Check External DNS pod
oc get pods -n hypershift -l app=external-dns

# Check External DNS logs
oc logs -n hypershift deployment/external-dns

# Check Route resources
oc get routes -n clusters-${CLUSTER_NAME}

# Verify DNS zone exists
az network dns zone show \
    --name ${PARENT_DNS_ZONE} \
    --resource-group ${PERSISTENT_RG_NAME}

# Check DNS records
az network dns record-set list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --zone-name ${PARENT_DNS_ZONE} \
    --output table

Common issues:

External DNS Authentication Failures:

# Check External DNS secret
oc get secret azure-config-file -n default

# Verify service principal permissions
az role assignment list \
    --assignee $(jq -r '.aadClientId' < azure_mgmt.json) \
    --output table

# Should have "DNS Zone Contributor" role on DNS zone resource group

External DNS Not Watching Correct Namespace:

# Check External DNS configuration
oc get deployment external-dns -n hypershift -o yaml | grep -A 10 args

# Ensure it's watching routes in hosted cluster namespaces
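
The flags to look for are, roughly, the provider, the domain filter, and the source types; exact flag names vary by External DNS version, so treat this as a sketch:

# Dump the container args one per line
oc get deployment external-dns -n hypershift \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n'
# Expect something like --provider=azure and a --domain-filter matching your zone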

Solution: Recreate External DNS configuration

# Reinstall HyperShift operator with correct External DNS config
hypershift install \
    --external-dns-provider=azure \
    --external-dns-credentials azure_mgmt.json \
    --pull-secret ${PULL_SECRET} \
    --external-dns-domain-filter ${EXTRN_DNS_ZONE_NAME} \
    --limit-crd-install Azure \
    --render | oc apply -f -

LoadBalancer Service Stuck Pending (Without External DNS)

Symptom: API server LoadBalancer service doesn't get external IP.

Diagnostic steps:

# Check service
oc get svc -n clusters-${CLUSTER_NAME} kube-apiserver

# Check cloud-controller-manager logs
oc logs -n clusters-${CLUSTER_NAME} deployment/cloud-controller-manager

# Check Azure load balancer
az network lb list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

Solutions:

- Verify cloud-provider workload identity has correct permissions
- Check Azure subscription quotas for load balancers
- Verify network connectivity from management cluster to Azure APIs
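
To inspect the cloud-provider identity's permissions, the same pattern from the worker-node section applies. The identity name suffix below is a guess; confirm the real name with task azure:list-identities first:

az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-cloud-provider" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query clientId -o tsv) \
    --output table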

Upgrade Issues

Upgrade Stuck or Fails

Symptom: Cluster upgrade doesn't complete or control plane pods crash during upgrade.

Diagnostic steps:

# Check HostedCluster upgrade status
oc get hostedcluster ${CLUSTER_NAME} \
    -n ${CLUSTER_NAMESPACE} \
    -o jsonpath='{.status.version}'

# Check cluster version operator
oc logs -n clusters-${CLUSTER_NAME} deployment/cluster-version-operator

# Check control plane pod status
oc get pods -n clusters-${CLUSTER_NAME}

# From hosted cluster, check cluster operators
export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc get co

Solutions:

# If upgrade is stuck, check for:
# - Degraded cluster operators before upgrade
# - Insufficient resources during upgrade
# - Breaking changes in release notes

# Cannot roll back - must fix forward or recreate the cluster
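
From the hosted cluster context, the ClusterVersion conditions usually explain what the upgrade is waiting on:

export KUBECONFIG=${CLUSTER_NAME}-kubeconfig
oc adm upgrade
oc get clusterversion version -o jsonpath='{.status.conditions}' | jq .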

Diagnostic Commands Reference

HostedCluster Health

# Overall cluster status
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

# Detailed status with conditions
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml

# Check HostedControlPlane
oc get hostedcontrolplane -n clusters-${CLUSTER_NAME}

# Describe for events
oc describe hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE}

NodePool Health

# NodePool status
oc get nodepool -n ${CLUSTER_NAMESPACE}

# Detailed NodePool information
oc get nodepool ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml

# Check Machine resources
oc get machines -n clusters-${CLUSTER_NAME}

# Machine details
oc describe machine <machine-name> -n clusters-${CLUSTER_NAME}

Control Plane Diagnostics

# All control plane pods
oc get pods -n clusters-${CLUSTER_NAME}

# Pod resource usage
oc top pods -n clusters-${CLUSTER_NAME}

# Specific component logs
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-apiserver -f
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-controller-manager -f
oc logs -n clusters-${CLUSTER_NAME} deployment/kube-scheduler -f
oc logs -n clusters-${CLUSTER_NAME} statefulset/etcd -f

# Check all resources in control plane namespace
oc get all -n clusters-${CLUSTER_NAME}

# Events in control plane namespace
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp'

HyperShift Operator Diagnostics

# Operator pod status
oc get pods -n hypershift

# Operator logs
oc logs -n hypershift deployment/operator -f

# Check operator reconciliation
oc logs -n hypershift deployment/operator | grep ${CLUSTER_NAME}

Azure Resource Diagnostics

# List all resource groups for cluster
az group list \
    --query "[?contains(name, '${PREFIX}')].{Name:name, Location:location, State:properties.provisioningState}" \
    --output table

# Check VMs
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# Check VM power state
az vm list \
    --resource-group ${MANAGED_RG_NAME} \
    --show-details \
    --query "[].{Name:name, PowerState:powerState}" \
    --output table

# Check load balancers
az network lb list \
    --resource-group ${MANAGED_RG_NAME} \
    --output table

# Check NSG rules
az network nsg rule list \
    --nsg-name ${NSG} \
    --resource-group ${NSG_RG_NAME} \
    --output table

Workload Identity Diagnostics

# List managed identities
task azure:list-identities

# Or manually:
az identity list \
    --resource-group ${PERSISTENT_RG_NAME} \
    --query "[?contains(name, '${CLUSTER_NAME}')]" \
    --output table

# List federated credentials
task azure:list-federated-creds

# Check specific component identity
az identity show \
    --name "${CLUSTER_NAME}-disk-csi" \
    --resource-group ${PERSISTENT_RG_NAME}

# Check role assignments
az role assignment list \
    --assignee $(az identity show \
        --name "${CLUSTER_NAME}-disk-csi" \
        --resource-group ${PERSISTENT_RG_NAME} \
        --query principalId -o tsv) \
    --output table

Collecting Debug Information

Comprehensive Cluster Dump

#!/bin/bash
# Save to a file: cluster-debug.sh

CLUSTER_NAME="myprefix-hc"
CLUSTER_NAMESPACE="clusters"
OUTPUT_DIR="debug-$(date +%Y%m%d-%H%M%S)"

mkdir -p ${OUTPUT_DIR}

echo "Collecting debug information for ${CLUSTER_NAME}..."

# HostedCluster
oc get hostedcluster ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o yaml > ${OUTPUT_DIR}/hostedcluster.yaml

# NodePools
oc get nodepool -n ${CLUSTER_NAMESPACE} -o yaml > ${OUTPUT_DIR}/nodepools.yaml

# Control plane pods
oc get pods -n clusters-${CLUSTER_NAME} -o wide > ${OUTPUT_DIR}/control-plane-pods.txt
oc get pods -n clusters-${CLUSTER_NAME} -o yaml > ${OUTPUT_DIR}/control-plane-pods.yaml

# Control plane events
oc get events -n clusters-${CLUSTER_NAME} --sort-by='.lastTimestamp' > ${OUTPUT_DIR}/control-plane-events.txt

# Operator logs
oc logs -n hypershift deployment/operator --tail=1000 > ${OUTPUT_DIR}/operator-logs.txt

# Control plane logs
for deployment in kube-apiserver kube-controller-manager kube-scheduler; do
    oc logs -n clusters-${CLUSTER_NAME} deployment/${deployment} --tail=500 > ${OUTPUT_DIR}/${deployment}-logs.txt
done

# Machines
oc get machines -n clusters-${CLUSTER_NAME} -o yaml > ${OUTPUT_DIR}/machines.yaml

echo "Debug information collected in ${OUTPUT_DIR}/"
tar -czf ${OUTPUT_DIR}.tar.gz ${OUTPUT_DIR}/
echo "Archive created: ${OUTPUT_DIR}.tar.gz"

Getting Help

Before Requesting Support

Collect the following information:

  1. Cluster details:
     - OpenShift version
     - HyperShift operator version
     - Azure region

  2. Error symptoms:
     - Exact error messages
     - When the issue started
     - Steps to reproduce

  3. Debug information (use the script above):
     - HostedCluster YAML
     - Control plane pod status and logs
     - NodePool status
     - Operator logs

Support Channels

For Developer Preview:

- HyperShift GitHub Issues
- HyperShift Slack (#hypershift channel)

For Production (when GA):

- Red Hat Support Portal
- OpenShift support cases

Next Steps