Part 4: High Availability Setup - Production-Grade Cluster

TL;DR

Transform your single control plane cluster into a highly available, production-grade setup with multiple control plane nodes, load balancing, and robust disaster recovery capabilities. Learn how to deploy a 3-control plane cluster, configure load balancers, implement etcd best practices, and prepare for node failures.

Key Takeaways:

  • Multiple control plane nodes provide redundancy and fault tolerance
  • Load balancing ensures API server availability during node failures
  • etcd requires careful configuration for high availability
  • Regular backups are essential for disaster recovery
  • Testing failure scenarios validates your HA setup

Introduction

Why This Matters

In Part 2: Talos Installation, you built a single control plane cluster suitable for learning and development. However, production environments require high availability to ensure continuous operation even when nodes fail.

This article teaches you how to:

  • Deploy multiple control plane nodes for redundancy
  • Configure load balancing for the Kubernetes API server
  • Implement etcd best practices for high availability
  • Test and handle node failure scenarios
  • Backup and restore etcd data
  • Plan and execute disaster recovery procedures

What You’ll Learn

  • Multi-control plane architecture and design
  • Load balancer configuration (HAProxy, MetalLB)
  • etcd cluster configuration and best practices
  • Node failure detection and recovery
  • Backup and restore procedures
  • Disaster recovery planning and execution
  • Cluster health monitoring

Prerequisites

Before starting, you should have:

  • Completed Part 2: Talos Installation
  • Completed Part 3: Configuration Management
  • A running Talos Linux cluster (can start with single control plane)
  • Additional hardware/VMs for the extra control plane nodes (minimum 2 more)
  • talosctl installed and configured
  • kubectl configured with cluster access
  • Understanding of load balancing concepts
  • Basic knowledge of etcd

High Availability Architecture

Single vs Multi-Control Plane

Single Control Plane (Current Setup):

  • One control plane node
  • Single point of failure
  • Suitable for development/homelab learning
  • No redundancy

Multi-Control Plane (Target Setup):

  • Three or more control plane nodes
  • Fault tolerant
  • Production-ready
  • API server redundancy

Recommended Architecture

┌─────────────────────────────────────────┐
│         Load Balancer                   │
│      (HAProxy / MetalLB)                │
│     192.168.178.201:6443                │
└──────────────┬──────────────────────────┘
               │
       ┌───────┼───────┐
       │       │       │
┌──────▼──┐ ┌──▼───┐ ┌─▼──────┐
│ Control │ │Control│ │Control │
│ Plane 1 │ │Plane 2│ │Plane 3 │
│  .55    │ │ .58   │ │ .59    │
└─────────┘ └───────┘ └────────┘
       │       │       │
       └───────┼───────┘
               │
       ┌───────┼───────┐
       │       │       │
┌──────▼──┐ ┌──▼───┐ ┌─▼──────┐
│ Worker 1│ │Worker│ │Worker 3│
│  .56    │ │ 2 .57│ │  .60   │
└─────────┘ └───────┘ └────────┘

Note: The architecture diagram above shows a multi-control plane HA setup. For homelab setups, a single control plane (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57) is often sufficient. This article demonstrates how HA could be implemented, but the actual homelab setup may remain a single control plane for simplicity and due to resource constraints.

Multi-Control Plane Setup

Planning Your HA Cluster

Requirements:

  • Minimum 3 control plane nodes (an odd number, to preserve etcd quorum)
  • Each control plane node needs:
    • 2+ CPU cores (4+ recommended)
    • 4GB+ RAM (8GB+ recommended)
    • 50GB+ storage
    • Network connectivity to all nodes

Node Allocation (Example for HA Setup):

  • Control Plane 1: 192.168.178.55 (existing in homelab)
  • Control Plane 2: 192.168.178.58 (hypothetical - for demonstration only)
  • Control Plane 3: 192.168.178.59 (hypothetical - for demonstration only)
  • Worker nodes: 192.168.178.56, .57, .60 (existing workers + hypothetical additional worker)

Note: The above node allocation is an example showing how HA could be configured. The actual homelab setup uses only a single control plane node (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57).

Adding Additional Control Plane Nodes

Step 1: Install Talos on New Nodes

# Install Talos on new control plane nodes
# (Follow Part 2 installation process)
# Nodes should boot and be accessible via network

Step 2: Generate Control Plane Configuration

# Generate configuration for additional control plane nodes
# Use the same cluster endpoint
talosctl gen config discworld-homelab \
  https://192.168.178.201:6443 \
  --output-dir ./ha-configs

# This generates:
# - controlplane.yaml (for all control plane nodes)
# - worker.yaml (for worker nodes)
# - talosconfig
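
Note: when adding control plane nodes to an existing cluster, the configuration must be generated from the same cluster secrets as the original nodes; freshly generated PKI would prevent the new control planes from joining etcd. A minimal sketch, assuming you kept a secrets bundle from the original install (otherwise simply reuse the controlplane.yaml from Part 2):

# Reuse the original cluster secrets when regenerating configuration
talosctl gen config discworld-homelab \
  https://192.168.178.201:6443 \
  --with-secrets secrets.yaml \
  --output-dir ./ha-configs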

Step 3: Apply Configuration to New Control Plane Nodes

# Apply to control plane node 2
talosctl apply-config \
  --insecure \
  --nodes 192.168.178.58 \
  --file ./ha-configs/controlplane.yaml

# Apply to control plane node 3
talosctl apply-config \
  --insecure \
  --nodes 192.168.178.59 \
  --file ./ha-configs/controlplane.yaml

Step 4: Update Endpoints

# Update talosconfig with all control plane endpoints
talosctl config endpoint 192.168.178.55 192.168.178.58 192.168.178.59
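
# Optionally confirm the configured endpoints (talosctl config info
# should list all three control plane IPs)
talosctl config info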

Verifying Multi-Control Plane Setup

# Check cluster members
talosctl get members

# Verify all control plane nodes
kubectl get nodes -l node-role.kubernetes.io/control-plane

# Check etcd cluster members
talosctl --nodes 192.168.178.55 etcd members

Expected Output:

# kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
talos-cp-1      Ready    control-plane   1h    v1.34.3
talos-cp-2      Ready    control-plane   30m   v1.34.3
talos-cp-3      Ready    control-plane   15m   v1.34.3
talos-worker-1  Ready    <none>         1h    v1.34.3
talos-worker-2  Ready    <none>         1h    v1.34.3
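
Beyond kubectl, Talos ships a built-in end-to-end health check; a quick sketch, run against one control plane node:

# Run Talos cluster health checks (etcd, control plane components, nodes)
talosctl --nodes 192.168.178.55 health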

Load Balancing for API Server

Why Load Balancing?

The Kubernetes API server must be accessible even if individual control plane nodes fail. A load balancer distributes traffic across all healthy control plane nodes.

Option 1: HAProxy Load Balancer

Installing HAProxy:

# On a separate machine or one of your nodes
# Install HAProxy (example for Ubuntu/Debian)
sudo apt update
sudo apt install -y haproxy

# Or use a containerized HAProxy

HAProxy Configuration:

# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 4096
    daemon

defaults
    log global
    mode tcp
    option tcplog
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kubernetes-api
    bind 192.168.178.201:6443
    default_backend k8s-api-servers

backend k8s-api-servers
    balance roundrobin
    option tcp-check
    server k8s-api-1 192.168.178.55:6443 check
    server k8s-api-2 192.168.178.58:6443 check
    server k8s-api-3 192.168.178.59:6443 check

Starting HAProxy:

# Start HAProxy
sudo systemctl start haproxy
sudo systemctl enable haproxy

# Verify it's running
sudo systemctl status haproxy
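
Whenever you edit haproxy.cfg, you can validate it before restarting:

# Validate the configuration (exits non-zero on syntax errors)
sudo haproxy -c -f /etc/haproxy/haproxy.cfg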

Option 2: MetalLB (Kubernetes-Native)

Installing MetalLB:

# Apply MetalLB manifest
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml

# Wait for MetalLB to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

Configuring MetalLB:

# metallb-config.yaml
# IP range: 192.168.178.201-220 (outside DHCP reservation range 20-200)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: api-server-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.178.201-192.168.178.210  # Range for API server and other LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: api-server-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - api-server-pool

Applying MetalLB Configuration:

kubectl apply -f metallb-config.yaml

# Verify the configuration
kubectl get ipaddresspool -n metallb-system
kubectl get l2advertisement -n metallb-system
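
To confirm MetalLB actually hands out addresses from the pool, you can expose a throwaway workload as a LoadBalancer service (the deployment name here is illustrative):

# Quick smoke test - EXTERNAL-IP should come from 192.168.178.201-210
kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --port=80 --type=LoadBalancer
kubectl get svc lb-test

# Clean up afterwards
kubectl delete service/lb-test deployment/lb-test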

Creating LoadBalancer Service:

Note: This example is for demonstration purposes only. With a single control plane setup, direct access to the control plane endpoint (192.168.178.55:6443) is used instead of a LoadBalancer service. The Kubernetes API server runs as a static pod managed by Talos, not as a regular Kubernetes service, so creating a LoadBalancer service for it requires additional configuration (e.g., using a NodePort service or external load balancer). For HA setups with multiple control planes, a load balancer (HAProxy, MetalLB, or external) would route traffic to all control plane nodes.

# k8s-api-loadbalancer.yaml
# NOTE: This is a conceptual example for HA setups
# For single control plane, use direct endpoint: 192.168.178.55:6443
apiVersion: v1
kind: Service
metadata:
  name: k8s-api-loadbalancer
  namespace: default
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.178.201  # MetalLB will assign this IP
  ports:
  - port: 6443
    targetPort: 6443
    protocol: TCP
  # Note: API server runs as static pod, so this selector won't work directly
  # For actual implementation, you'd need to use NodePort or external load balancer
  # that routes to control plane node IPs

Updating Cluster Endpoint

After setting up load balancing, update your cluster configuration:

# Update kubeconfig to use load balancer endpoint
kubectl config set-cluster discworld-homelab \
  --server=https://192.168.178.201:6443

# Update talosconfig endpoint
# Note: talosctl speaks to the Talos API on port 50000, so only point it at the
# load balancer if the LB also forwards port 50000; otherwise keep the control
# plane node IPs as talosctl endpoints
talosctl config endpoint 192.168.178.201

Testing Load Balancer:

# Test API server access through load balancer
kubectl cluster-info

# Test with direct API call
curl -k https://192.168.178.201:6443/version

etcd Best Practices

etcd in High Availability

etcd is the distributed key-value store that stores all Kubernetes cluster data. In a multi-control plane setup, etcd runs on each control plane node.

etcd Cluster Health

Checking etcd Status:

# Check etcd members
talosctl --nodes 192.168.178.55 etcd members

# Check etcd health
talosctl --nodes 192.168.178.55 etcd status

# View etcd logs
talosctl --nodes 192.168.178.55 logs etcd

etcd Configuration Best Practices

  1. Odd Number of Nodes: Always use an odd number (3, 5, 7) so etcd can maintain quorum (see the quorum math below)
  2. Network Latency: Keep etcd nodes on low-latency network (< 10ms)
  3. Disk Performance: Use fast SSDs for etcd data directory
  4. Resource Limits: Ensure adequate CPU and memory
  5. Backup Regularly: Automated backups are essential
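
For reference, the quorum math behind point 1: etcd needs floor(n/2) + 1 members available, so:

  • 3 nodes → quorum of 2 → tolerates 1 node failure
  • 5 nodes → quorum of 3 → tolerates 2 node failures
  • 4 nodes → quorum of 3 → still tolerates only 1 node failure

An even member count adds no fault tolerance over the next-smaller odd count, which is why odd cluster sizes are recommended.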

etcd Performance Tuning

# etcd configuration patch
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: "8589934592"  # 8GB
      max-request-bytes: "1572864"       # 1.5MB

Note: Default etcd configuration is usually sufficient for homelab setups. Performance tuning is typically needed for production environments with high write loads or large datasets.
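
If you do want to apply such tuning, the patch can be rolled out like any other machine config change. A sketch, assuming the snippet above is saved as etcd-tuning-patch.yaml (filename is illustrative):

# Apply the patch to a control plane node (repeat for each control plane)
talosctl --nodes 192.168.178.55 patch machineconfig \
  --patch @etcd-tuning-patch.yaml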

Hands-On Exercise: Node Failure Scenarios and Disaster Recovery

Learning Objective: This exercise helps you understand how Kubernetes clusters handle failures and how to implement disaster recovery procedures. While our homelab uses a single control plane (which means no HA failover), these exercises demonstrate important concepts for production environments.

Exercise: Testing Node Failures

Objective: Understand how Kubernetes handles node failures and practice recovery procedures.

Prerequisites:

  • Running Talos cluster
  • kubectl and talosctl configured
  • Understanding of cluster architecture

Scenario 1: Worker Node Failure

This is safe to test in a homelab environment:

# 1. Deploy a test workload
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=3

# 2. Check where pods are running
kubectl get pods -o wide

# 3. Simulate worker node failure
# Option A: If using VMs, power off one worker node
# Option B: If physical, unplug network cable temporarily
# Option C: Use kubectl to cordon and drain (safer)
kubectl cordon <worker-node-name>
kubectl drain <worker-node-name> --ignore-daemonsets --delete-emptydir-data

# 4. Observe pod rescheduling
watch kubectl get pods -o wide

# 5. Restore the node
kubectl uncordon <worker-node-name>
# Or power on/restore network connection

# 6. Verify pods can reschedule
kubectl get pods -o wide

What to Observe:

  • How long it takes for pods to be rescheduled (see the note on eviction timing below)
  • Which node the pods move to
  • Any service interruptions
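
Rescheduling is not instant: by default a failed node is marked NotReady after roughly 40 seconds, and pods are evicted only once the default 300-second not-ready/unreachable toleration expires. A sketch of how a Deployment's pod template could shorten this for a specific workload (the 60-second values are illustrative):

# Excerpt from a pod template spec
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60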

Scenario 2: Control Plane Node Failure (Single Control Plane)

Warning: With a single control plane, this will make the cluster unavailable. Only test this if you’re comfortable with cluster downtime.

# 1. Note current cluster state
kubectl get nodes
kubectl get pods --all-namespaces

# 2. Power off control plane node (192.168.178.55)
# Cluster will become unavailable

# 3. Try to access cluster
kubectl get nodes  # Will fail

# 4. Power on control plane node
# Wait for Talos to boot and cluster to recover

# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces

What to Learn:

  • Single control plane = single point of failure
  • Why HA requires multiple control plane nodes
  • Recovery time after control plane failure

Scenario 3: Understanding HA Failover (Conceptual)

For HA setups with multiple control planes, you would test:

# In an HA setup with 3 control planes:
# 1. Power off one control plane node
# 2. Cluster should continue operating (2 of 3 nodes maintain quorum)
# 3. API server remains accessible via load balancer
# 4. Workloads continue running
# 5. etcd maintains quorum

# This demonstrates why HA requires 3+ control plane nodes

Exercise: Backup and Restore Procedures

Objective: Practice creating and restoring etcd backups.

Creating etcd Backup:

# Backup etcd from control plane node
talosctl --nodes 192.168.178.55 etcd snapshot \
  /tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db

# Verify backup was created
ls -lh /tmp/etcd-backup-*.db

# Copy backup to safe location (if you have a backup server)
# scp /tmp/etcd-backup-*.db user@backup-server:/backups/

Automated Backup Script:

Create a backup script for regular backups:

#!/bin/bash
# etcd-backup.sh
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
CONTROL_PLANE_IP="192.168.178.55"

# Create backup directory if it doesn't exist
mkdir -p ${BACKUP_DIR}

talosctl --nodes ${CONTROL_PLANE_IP} etcd snapshot \
  ${BACKUP_DIR}/etcd-backup-${DATE}.db

# Keep only last 7 days of backups
find ${BACKUP_DIR} -name "etcd-backup-*.db" -mtime +7 -delete

echo "Backup completed: ${BACKUP_DIR}/etcd-backup-${DATE}.db"

Setting Up Automated Backups:

# Make script executable
chmod +x etcd-backup.sh

# Test the script
./etcd-backup.sh

# Add to crontab for daily backups at 2 AM
crontab -e
# Add: 0 2 * * * /path/to/etcd-backup.sh

Testing Restore (Advanced - Optional):

Warning: Restore testing will cause cluster downtime. Only attempt in a lab/test environment.

# 1. Create a test backup
talosctl --nodes 192.168.178.55 etcd snapshot /tmp/test-backup.db

# 2. Note: Full restore requires stopping the cluster
# This is complex and should be tested in a lab environment
# Refer to Talos documentation for complete restore procedures
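
For orientation, Talos can seed a new etcd cluster from a snapshot during bootstrap. A rough sketch, assuming the control plane node has been reset or reinstalled and its machine config reapplied:

# Recover etcd from the snapshot while bootstrapping the control plane
talosctl --nodes 192.168.178.55 bootstrap \
  --recover-from=/tmp/test-backup.db

# Then verify the cluster comes back
kubectl get nodes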

Exercise: Disaster Recovery Planning

Objective: Create a disaster recovery plan for your cluster.

Document Your Recovery Plan:

  1. Current Setup:

    • Number of control plane nodes: 1 (single control plane)
    • Number of worker nodes: 2
    • Backup location: Document where backups are stored
    • Configuration repository: Document Git repository location
  2. Recovery Scenarios:

    Scenario 1: Control Plane Node Failure

    • Impact: Cluster unavailable (single point of failure)
    • Recovery steps: 1. Power on node, 2. Wait for Talos boot, 3. Verify cluster health
    • Recovery time: Document observed recovery time

    Scenario 2: Worker Node Failure

    • Impact: Pods on failed node need rescheduling
    • Recovery steps: 1. Power on node, 2. Wait for node to join, 3. Verify pods reschedule
    • Recovery time: Document observed recovery time

    Scenario 3: Complete Cluster Failure

    • Impact: All nodes lost, need to rebuild
    • Recovery steps: 1. Reinstall Talos on nodes, 2. Restore etcd from backup, 3. Reapply configurations from Git
    • Recovery time: Estimate based on your setup
  3. Backup Strategy:

    • etcd backup frequency: Daily/Weekly/etc.
    • Configuration backup: Git repository
    • Backup retention: 7 days/30 days/etc.
    • Backup location: Local/Remote/Cloud

Best Practices

High Availability

  • Odd Number of Control Planes: Always use 3, 5, or 7 control plane nodes
  • Load Balancing: Always use load balancer for API server access
  • Network Redundancy: Use multiple network paths if possible
  • Regular Testing: Test failure scenarios regularly
  • Monitoring: Monitor cluster health continuously

etcd

  • Regular Backups: Automated daily backups minimum
  • Off-Site Storage: Store backups in separate location
  • Test Restores: Regularly test backup restoration
  • Performance Tuning: Optimize etcd for your workload
  • Resource Allocation: Ensure adequate resources

Disaster Recovery

  • Documentation: Keep detailed documentation of cluster setup
  • Version Control: Store all configurations in Git
  • Regular Drills: Practice disaster recovery procedures
  • Backup Testing: Test backups regularly
  • Recovery Procedures: Document step-by-step recovery

Troubleshooting

Common Issue 1: Control Plane Node Not Joining

Problem: New control plane node doesn’t join cluster

Solution:

# Check node status
talosctl --nodes <NODE_IP> get members

# Check etcd logs
talosctl --nodes <NODE_IP> logs etcd

# Verify network connectivity
ping <NODE_IP>

# Check configuration
talosctl --nodes <NODE_IP> get machineconfig

Common Issue 2: etcd Cluster Unhealthy

Problem: etcd cluster shows unhealthy status

Solution:

# Check etcd status on all nodes
talosctl --nodes <CP_IP_1> etcd status
talosctl --nodes <CP_IP_2> etcd status
talosctl --nodes <CP_IP_3> etcd status

# Check etcd logs
talosctl --nodes <CP_IP> logs etcd

# Verify network connectivity between nodes

Common Issue 3: Load Balancer Not Working

Problem: Cannot access API server through load balancer

Solution:

# Test load balancer directly
curl -k https://<LB_IP>:6443/version

# Check load balancer status
# (HAProxy: systemctl status haproxy)
# (MetalLB: kubectl get pods -n metallb-system)

# Verify backend servers
# (HAProxy: check haproxy stats)
# (MetalLB: check service endpoints)

Common Issue 4: Backup/Restore Fails

Problem: etcd backup or restore operation fails

Solution:

# Verify backup file exists and is valid
file /backups/etcd-backup-*.db

# Check disk space
df -h

# Check the etcd service state (etcd must not be running before a restore)
talosctl --nodes <CP_IP> service etcd

# Check etcd logs during restore
talosctl --nodes <CP_IP> logs etcd

Summary

Key takeaways from high availability setup:

  • Multiple control plane nodes provide fault tolerance
  • Load balancing ensures API server availability
  • etcd requires careful configuration and regular backups
  • Testing failure scenarios validates HA setup
  • Disaster recovery planning is essential

What We Accomplished:

  • Understood multi-control plane HA architecture
  • Learned how to configure load balancers (HAProxy and MetalLB)
  • Explored etcd best practices for high availability
  • Practiced node failure scenarios and recovery procedures
  • Implemented etcd backup and restore procedures
  • Created disaster recovery plans

Next Steps

Now that you understand high availability concepts:

  • Part 5: Storage Configuration (Coming Soon) - Configure persistent storage for workloads
  • Part 6: Networking (Coming Soon) - Advanced networking with CNI and ingress
  • Monitor cluster health and performance
  • Plan for additional worker nodes if needed

Series Navigation

Previous: Part 3 - Talos Configuration Management - GitOps for Infrastructure

Current: Part 4 - High Availability Setup - Production-Grade Cluster

Next: Part 5 - Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)

Full Series:

  1. Talos Linux Introduction
  2. Talos Installation - Building Your First Cluster
  3. Talos Configuration Management - GitOps for Infrastructure
  4. High Availability Setup - Production-Grade Cluster (You are here)
  5. Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)
  6. Networking - CNI, Load Balancing, and Ingress (Coming Soon)
  7. Security Hardening - Securing Your Homelab Cluster (Coming Soon)
  8. Monitoring and Maintenance - Keeping Your Cluster Healthy (Coming Soon)

This article is part of the “Talos Linux Homelab” series. Follow along as we build a production-grade Kubernetes homelab from the ground up.

Questions or feedback? Reach out via email or connect on LinkedIn.