Part 4: High Availability Setup - Production-Grade Cluster

TL;DR

Transform your single control plane cluster into a highly available, production-grade setup with multiple control plane nodes, load balancing, and robust disaster recovery capabilities. Learn how to deploy a 3-control plane cluster, configure load balancers, implement etcd best practices, and prepare for node failures.

Key Takeaways:

  • Multiple control plane nodes provide redundancy and fault tolerance
  • Load balancing ensures API server availability during node failures
  • etcd requires careful configuration for high availability
  • Regular backups are essential for disaster recovery
  • Testing failure scenarios validates your HA setup

Introduction

Why This Matters

In Part 2: Talos Installation, you built a single control plane cluster suitable for learning and development. However, production environments require high availability to ensure continuous operation even when nodes fail.

This article teaches you how to:

  • Deploy multiple control plane nodes for redundancy
  • Configure load balancing for the Kubernetes API server
  • Implement etcd best practices for high availability
  • Test and handle node failure scenarios
  • Backup and restore etcd data
  • Plan and execute disaster recovery procedures

What You’ll Learn

  • Multi-control plane architecture and design
  • Load balancer configuration (HAProxy, MetalLB)
  • etcd cluster configuration and best practices
  • Node failure detection and recovery
  • Backup and restore procedures
  • Disaster recovery planning and execution
  • Cluster health monitoring

Prerequisites

Before starting, you should have:

  • Completed Part 2: Talos Installation
  • Completed Part 3: Configuration Management
  • A running Talos Linux cluster (can start with single control plane)
  • Additional hardware/VMs for the extra control plane nodes (minimum 2 more)
  • talosctl installed and configured
  • kubectl configured with cluster access
  • Understanding of load balancing concepts
  • Basic knowledge of etcd

High Availability Architecture

Single vs Multi-Control Plane

Single Control Plane (Current Setup):

  • One control plane node
  • Single point of failure
  • Suitable for development/homelab learning
  • No redundancy

Multi-Control Plane (Target Setup):

  • Three or more control plane nodes
  • Fault tolerant
  • Production-ready
  • API server redundancy

Recommended Architecture

┌─────────────────────────────────────────┐
│         Load Balancer                   │
│      (HAProxy / MetalLB)                │
│     192.168.178.201:6443                │
└──────────────┬──────────────────────────┘
               │
       ┌───────┼───────┐
       │       │       │
┌──────▼──┐ ┌──▼───┐ ┌─▼──────┐
│ Control │ │Control│ │Control │
│ Plane 1 │ │Plane 2│ │Plane 3 │
│  .55    │ │ .58   │ │ .59    │
└─────────┘ └───────┘ └────────┘
       │       │       │
       └───────┼───────┘
               │
       ┌───────┼───────┐
       │       │       │
┌──────▼──┐ ┌──▼───┐ ┌─▼──────┐
│ Worker 1│ │Worker│ │Worker 3│
│  .56    │ │ 2 .57│ │  .60   │
└─────────┘ └───────┘ └────────┘

Note: The architecture diagram above shows a multi-control plane HA setup. For homelab setups, a single control plane (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57) is often sufficient. This article demonstrates how HA could be implemented, but the actual homelab setup may remain a single control plane for simplicity and due to resource constraints.

Multi-Control Plane Setup

Planning Your HA Cluster

Requirements:

  • Minimum 3 control plane nodes (an odd number, to preserve etcd quorum)
  • Each control plane node needs:
    • 2+ CPU cores (4+ recommended)
    • 4GB+ RAM (8GB+ recommended)
    • 50GB+ storage
    • Network connectivity to all nodes

Node Allocation (Example for HA Setup):

  • Control Plane 1: 192.168.178.55 (existing in homelab)
  • Control Plane 2: 192.168.178.58 (hypothetical - for demonstration only)
  • Control Plane 3: 192.168.178.59 (hypothetical - for demonstration only)
  • Worker nodes: 192.168.178.56, .57, .60 (existing workers + hypothetical additional worker)

Note: The above node allocation is an example showing how HA could be configured. The actual homelab setup uses only a single control plane node (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57).

Adding Additional Control Plane Nodes

Step 1: Install Talos on New Nodes

# Install Talos on new control plane nodes
# (Follow Part 2 installation process)
# Nodes should boot and be accessible via network

Step 2: Generate Control Plane Configuration

# Generate configuration for additional control plane nodes
# Use the same cluster endpoint
talosctl gen config discworld-homelab \
  https://192.168.178.201:6443 \
  --output-dir ./ha-configs

# This generates:
# - controlplane.yaml (for all control plane nodes)
# - worker.yaml (for worker nodes)
# - talosconfig
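
Note: when adding control plane nodes to an existing cluster, the configuration must be generated from the same cluster secrets as the original nodes; freshly generated PKI would prevent the new control planes from joining etcd. A minimal sketch, assuming you kept a secrets bundle from the original install (otherwise simply reuse the controlplane.yaml from Part 2):

# Reuse the original cluster secrets when regenerating configuration
talosctl gen config discworld-homelab \
  https://192.168.178.201:6443 \
  --with-secrets secrets.yaml \
  --output-dir ./ha-configs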

Step 3: Apply Configuration to New Control Plane Nodes

# Apply to control plane node 2
talosctl apply-config \
  --insecure \
  --nodes 192.168.178.58 \
  --file ./ha-configs/controlplane.yaml

# Apply to control plane node 3
talosctl apply-config \
  --insecure \
  --nodes 192.168.178.59 \
  --file ./ha-configs/controlplane.yaml

Step 4: Update Endpoints

# Update talosconfig with all control plane endpoints
talosctl config endpoint 192.168.178.55 192.168.178.58 192.168.178.59
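
# Optionally confirm the configured endpoints (talosctl config info
# should list all three control plane IPs)
talosctl config info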

Verifying Multi-Control Plane Setup

# Check cluster members
talosctl get members

# Verify all control plane nodes
kubectl get nodes -l node-role.kubernetes.io/control-plane

# Check etcd cluster members
talosctl --nodes 192.168.178.55 etcd members

Expected Output:

# kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
talos-cp-1      Ready    control-plane   1h    v1.34.3
talos-cp-2      Ready    control-plane   30m   v1.34.3
talos-cp-3      Ready    control-plane   15m   v1.34.3
talos-worker-1  Ready    <none>         1h    v1.34.3
talos-worker-2  Ready    <none>         1h    v1.34.3
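
Beyond kubectl, Talos ships a built-in end-to-end health check; a quick sketch, run against one control plane node:

# Run Talos cluster health checks (etcd, control plane components, nodes)
talosctl --nodes 192.168.178.55 health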

Load Balancing for API Server

Why Load Balancing?

The Kubernetes API server must be accessible even if individual control plane nodes fail. A load balancer distributes traffic across all healthy control plane nodes.

Option 1: HAProxy Load Balancer

Installing HAProxy:

# On a separate machine or one of your nodes
# Install HAProxy (example for Ubuntu/Debian)
sudo apt update
sudo apt install -y haproxy

# Or use a containerized HAProxy

HAProxy Configuration:

# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 4096
    daemon

defaults
    log global
    mode tcp
    option tcplog
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kubernetes-api
    bind 192.168.178.201:6443
    default_backend k8s-api-servers

backend k8s-api-servers
    balance roundrobin
    option tcp-check
    server k8s-api-1 192.168.178.55:6443 check
    server k8s-api-2 192.168.178.58:6443 check
    server k8s-api-3 192.168.178.59:6443 check

Starting HAProxy:

# Start HAProxy
sudo systemctl start haproxy
sudo systemctl enable haproxy

# Verify it's running
sudo systemctl status haproxy
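
Whenever you edit haproxy.cfg, you can validate it before restarting:

# Validate the configuration (exits non-zero on syntax errors)
sudo haproxy -c -f /etc/haproxy/haproxy.cfg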

Option 2: MetalLB (Kubernetes-Native)

Installing MetalLB:

# Apply MetalLB manifest
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml

# Wait for MetalLB to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

Configuring MetalLB:

# metallb-config.yaml
# IP range: 192.168.178.201-220 (outside DHCP reservation range 20-200)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: api-server-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.178.201-192.168.178.210  # Range for API server and other LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: api-server-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - api-server-pool

Applying MetalLB Configuration:

kubectl apply -f metallb-config.yaml

# Verify the configuration
kubectl get ipaddresspool -n metallb-system
kubectl get l2advertisement -n metallb-system
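
To confirm MetalLB actually hands out addresses from the pool, you can expose a throwaway workload as a LoadBalancer service (the deployment name here is illustrative):

# Quick smoke test - EXTERNAL-IP should come from 192.168.178.201-210
kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --port=80 --type=LoadBalancer
kubectl get svc lb-test

# Clean up afterwards
kubectl delete service/lb-test deployment/lb-test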

Creating LoadBalancer Service:

Note: This example is for demonstration purposes only. With a single control plane setup, direct access to the control plane endpoint (192.168.178.55:6443) is used instead of a LoadBalancer service. The Kubernetes API server runs as a static pod managed by Talos, not as a regular Kubernetes service, so creating a LoadBalancer service for it requires additional configuration (e.g., using a NodePort service or external load balancer). For HA setups with multiple control planes, a load balancer (HAProxy, MetalLB, or external) would route traffic to all control plane nodes.

# k8s-api-loadbalancer.yaml
# NOTE: This is a conceptual example for HA setups
# For single control plane, use direct endpoint: 192.168.178.55:6443
apiVersion: v1
kind: Service
metadata:
  name: k8s-api-loadbalancer
  namespace: default
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.178.201  # MetalLB will assign this IP
  ports:
  - port: 6443
    targetPort: 6443
    protocol: TCP
  # Note: API server runs as static pod, so this selector won't work directly
  # For actual implementation, you'd need to use NodePort or external load balancer
  # that routes to control plane node IPs

Updating Cluster Endpoint

After setting up load balancing, update your cluster configuration:

# Update kubeconfig to use load balancer endpoint
kubectl config set-cluster discworld-homelab \
  --server=https://192.168.178.201:6443

# Update talosconfig endpoint
# Note: talosctl speaks to the Talos API on port 50000, so only point it at the
# load balancer if the LB also forwards port 50000; otherwise keep the control
# plane node IPs as talosctl endpoints
talosctl config endpoint 192.168.178.201

Testing Load Balancer:

# Test API server access through load balancer
kubectl cluster-info

# Test with direct API call
curl -k https://192.168.178.201:6443/version

etcd Best Practices

etcd in High Availability

etcd is the distributed key-value store that stores all Kubernetes cluster data. In a multi-control plane setup, etcd runs on each control plane node.

etcd Cluster Health

Checking etcd Status:

# Check etcd members
talosctl --nodes 192.168.178.55 etcd members

# Check etcd health
talosctl --nodes 192.168.178.55 etcd status

# View etcd logs
talosctl --nodes 192.168.178.55 logs etcd

etcd Configuration Best Practices

  1. Odd Number of Nodes: Always use an odd number (3, 5, 7) so etcd can maintain quorum (see the quorum math below)
  2. Network Latency: Keep etcd nodes on low-latency network (< 10ms)
  3. Disk Performance: Use fast SSDs for etcd data directory
  4. Resource Limits: Ensure adequate CPU and memory
  5. Backup Regularly: Automated backups are essential
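
For reference, the quorum math behind point 1: etcd needs floor(n/2) + 1 members available, so:

  • 3 nodes → quorum of 2 → tolerates 1 node failure
  • 5 nodes → quorum of 3 → tolerates 2 node failures
  • 4 nodes → quorum of 3 → still tolerates only 1 node failure

An even member count adds no fault tolerance over the next-smaller odd count, which is why odd cluster sizes are recommended.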

etcd Performance Tuning

# etcd configuration patch
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: "8589934592"  # 8GB
      max-request-bytes: "1572864"       # 1.5MB

Note: Default etcd configuration is usually sufficient for homelab setups. Performance tuning is typically needed for production environments with high write loads or large datasets.
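
If you do want to apply such tuning, the patch can be rolled out like any other machine config change. A sketch, assuming the snippet above is saved as etcd-tuning-patch.yaml (filename is illustrative):

# Apply the patch to a control plane node (repeat for each control plane)
talosctl --nodes 192.168.178.55 patch machineconfig \
  --patch @etcd-tuning-patch.yaml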

Hands-On Exercise: Node Failure Scenarios and Disaster Recovery

Learning Objective: This exercise helps you understand how Kubernetes clusters handle failures and how to implement disaster recovery procedures. While our homelab uses a single control plane (which means no HA failover), these exercises demonstrate important concepts for production environments.

Exercise: Testing Node Failures

Objective: Understand how Kubernetes handles node failures and practice recovery procedures.

Prerequisites:

  • Running Talos cluster
  • kubectl and talosctl configured
  • Understanding of cluster architecture

Scenario 1: Worker Node Failure

This is safe to test in a homelab environment:

# 1. Deploy a test workload
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=3

# 2. Check where pods are running
kubectl get pods -o wide

# 3. Simulate worker node failure
# Option A: If using VMs, power off one worker node
# Option B: If physical, unplug network cable temporarily
# Option C: Use kubectl to cordon and drain (safer)
kubectl cordon <worker-node-name>
kubectl drain <worker-node-name> --ignore-daemonsets --delete-emptydir-data

# 4. Observe pod rescheduling
watch kubectl get pods -o wide

# 5. Restore the node
kubectl uncordon <worker-node-name>
# Or power on/restore network connection

# 6. Verify pods can reschedule
kubectl get pods -o wide

What to Observe:

  • How long it takes for pods to be rescheduled (see the note on eviction timing below)
  • Which node the pods move to
  • Any service interruptions
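
Rescheduling is not instant: by default a failed node is marked NotReady after roughly 40 seconds, and pods are evicted only once the default 300-second not-ready/unreachable toleration expires. A sketch of how a Deployment's pod template could shorten this for a specific workload (the 60-second values are illustrative):

# Excerpt from a pod template spec
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60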

Scenario 2: Control Plane Node Failure (Single Control Plane)

Warning: With a single control plane, this will make the cluster unavailable. Only test this if you’re comfortable with cluster downtime.

# 1. Note current cluster state
kubectl get nodes
kubectl get pods --all-namespaces

# 2. Power off control plane node (192.168.178.55)
# Cluster will become unavailable

# 3. Try to access cluster
kubectl get nodes  # Will fail

# 4. Power on control plane node
# Wait for Talos to boot and cluster to recover

# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces

What to Learn:

  • Single control plane = single point of failure
  • Why HA requires multiple control plane nodes
  • Recovery time after control plane failure

Scenario 3: Understanding HA Failover (Conceptual)

For HA setups with multiple control planes, you would test:

# In an HA setup with 3 control planes:
# 1. Power off one control plane node
# 2. Cluster should continue operating (2 of 3 nodes maintain quorum)
# 3. API server remains accessible via load balancer
# 4. Workloads continue running
# 5. etcd maintains quorum

# This demonstrates why HA requires 3+ control plane nodes

Exercise: Backup and Restore Procedures

Objective: Practice creating and restoring etcd backups.

Creating etcd Backup:

# Backup etcd from control plane node
talosctl --nodes 192.168.178.55 etcd snapshot \
  /tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db

# Verify backup was created
ls -lh /tmp/etcd-backup-*.db

# Copy backup to safe location (if you have a backup server)
# scp /tmp/etcd-backup-*.db user@backup-server:/backups/

Automated Backup Script:

Create a backup script for regular backups:

#!/bin/bash
# etcd-backup.sh
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
CONTROL_PLANE_IP="192.168.178.55"

# Create backup directory if it doesn't exist
mkdir -p ${BACKUP_DIR}

talosctl --nodes ${CONTROL_PLANE_IP} etcd snapshot \
  ${BACKUP_DIR}/etcd-backup-${DATE}.db

# Keep only last 7 days of backups
find ${BACKUP_DIR} -name "etcd-backup-*.db" -mtime +7 -delete

echo "Backup completed: ${BACKUP_DIR}/etcd-backup-${DATE}.db"

Setting Up Automated Backups:

# Make script executable
chmod +x etcd-backup.sh

# Test the script
./etcd-backup.sh

# Add to crontab for daily backups at 2 AM
crontab -e
# Add: 0 2 * * * /path/to/etcd-backup.sh

Testing Restore (Advanced - Optional):

Warning: Restore testing will cause cluster downtime. Only attempt in a lab/test environment.

# 1. Create a test backup
talosctl --nodes 192.168.178.55 etcd snapshot /tmp/test-backup.db

# 2. Note: Full restore requires stopping the cluster
# This is complex and should be tested in a lab environment
# Refer to Talos documentation for complete restore procedures
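
For orientation, Talos can seed a new etcd cluster from a snapshot during bootstrap. A rough sketch, assuming the control plane node has been reset or reinstalled and its machine config reapplied:

# Recover etcd from the snapshot while bootstrapping the control plane
talosctl --nodes 192.168.178.55 bootstrap \
  --recover-from=/tmp/test-backup.db

# Then verify the cluster comes back
kubectl get nodes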

Exercise: Disaster Recovery Planning

Objective: Create a disaster recovery plan for your cluster.

Document Your Recovery Plan:

  1. Current Setup:

    • Number of control plane nodes: 1 (single control plane)
    • Number of worker nodes: 2
    • Backup location: Document where backups are stored
    • Configuration repository: Document Git repository location
  2. Recovery Scenarios:

    Scenario 1: Control Plane Node Failure

    • Impact: Cluster unavailable (single point of failure)
    • Recovery steps: 1. Power on node, 2. Wait for Talos boot, 3. Verify cluster health
    • Recovery time: Document observed recovery time

    Scenario 2: Worker Node Failure

    • Impact: Pods on failed node need rescheduling
    • Recovery steps: 1. Power on node, 2. Wait for node to join, 3. Verify pods reschedule
    • Recovery time: Document observed recovery time

    Scenario 3: Complete Cluster Failure

    • Impact: All nodes lost, need to rebuild
    • Recovery steps: 1. Reinstall Talos on nodes, 2. Restore etcd from backup, 3. Reapply configurations from Git
    • Recovery time: Estimate based on your setup
  3. Backup Strategy:

    • etcd backup frequency: Daily/Weekly/etc.
    • Configuration backup: Git repository
    • Backup retention: 7 days/30 days/etc.
    • Backup location: Local/Remote/Cloud

Best Practices

High Availability

  • Odd Number of Control Planes: Always use 3, 5, or 7 control plane nodes
  • Load Balancing: Always use load balancer for API server access
  • Network Redundancy: Use multiple network paths if possible
  • Regular Testing: Test failure scenarios regularly
  • Monitoring: Monitor cluster health continuously

etcd

  • Regular Backups: Automated daily backups minimum
  • Off-Site Storage: Store backups in separate location
  • Test Restores: Regularly test backup restoration
  • Performance Tuning: Optimize etcd for your workload
  • Resource Allocation: Ensure adequate resources

Disaster Recovery

  • Documentation: Keep detailed documentation of cluster setup
  • Version Control: Store all configurations in Git
  • Regular Drills: Practice disaster recovery procedures
  • Backup Testing: Test backups regularly
  • Recovery Procedures: Document step-by-step recovery

Troubleshooting

Common Issue 1: Control Plane Node Not Joining

Problem: New control plane node doesn’t join cluster

Solution:

# Check node status
talosctl --nodes <NODE_IP> get members

# Check etcd logs
talosctl --nodes <NODE_IP> logs etcd

# Verify network connectivity
ping <NODE_IP>

# Check configuration
talosctl --nodes <NODE_IP> get machineconfig

Common Issue 2: etcd Cluster Unhealthy

Problem: etcd cluster shows unhealthy status

Solution:

# Check etcd status on all nodes
talosctl --nodes <CP_IP_1> etcd status
talosctl --nodes <CP_IP_2> etcd status
talosctl --nodes <CP_IP_3> etcd status

# Check etcd logs
talosctl --nodes <CP_IP> logs etcd

# Verify network connectivity between nodes

Common Issue 3: Load Balancer Not Working

Problem: Cannot access API server through load balancer

Solution:

# Test load balancer directly
curl -k https://<LB_IP>:6443/version

# Check load balancer status
# (HAProxy: systemctl status haproxy)
# (MetalLB: kubectl get pods -n metallb-system)

# Verify backend servers
# (HAProxy: check haproxy stats)
# (MetalLB: check service endpoints)

Common Issue 4: Backup/Restore Fails

Problem: etcd backup or restore operation fails

Solution:

# Verify backup file exists and is valid
file /backups/etcd-backup-*.db

# Check disk space
df -h

# Check the etcd service state (etcd must not be running before a restore)
talosctl --nodes <CP_IP> service etcd

# Check etcd logs during restore
talosctl --nodes <CP_IP> logs etcd

Summary

Key takeaways from high availability setup:

  • Multiple control plane nodes provide fault tolerance
  • Load balancing ensures API server availability
  • etcd requires careful configuration and regular backups
  • Testing failure scenarios validates HA setup
  • Disaster recovery planning is essential

What We Accomplished:

  • Understood multi-control plane HA architecture
  • Learned how to configure load balancers (HAProxy and MetalLB)
  • Explored etcd best practices for high availability
  • Practiced node failure scenarios and recovery procedures
  • Implemented etcd backup and restore procedures
  • Created disaster recovery plans

Next Steps

Now that you understand high availability concepts:

  • Part 5: Storage Configuration (Coming Soon) - Configure persistent storage for workloads
  • Part 6: Networking (Coming Soon) - Advanced networking with CNI and ingress
  • Monitor cluster health and performance
  • Plan for additional worker nodes if needed

Series Navigation

Previous: Part 3 - Talos Configuration Management - GitOps for Infrastructure

Current: Part 4 - High Availability Setup - Production-Grade Cluster

Next: Part 5 - Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)

Full Series:

  1. Talos Linux Introduction
  2. Talos Installation - Building Your First Cluster
  3. Talos Configuration Management - GitOps for Infrastructure
  4. High Availability Setup - Production-Grade Cluster (You are here)
  5. Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)
  6. Networking - CNI, Load Balancing, and Ingress (Coming Soon)
  7. Security Hardening - Securing Your Homelab Cluster (Coming Soon)
  8. Monitoring and Maintenance - Keeping Your Cluster Healthy (Coming Soon)

This article is part of the “Talos Linux Homelab” series. Follow along as we build a production-grade Kubernetes homelab from the ground up.

Questions or feedback? Reach out via email or connect on LinkedIn.