Part 4: High Availability Setup - Production-Grade Cluster
TL;DR
Transform your single control plane cluster into a highly available, production-grade setup with multiple control plane nodes, load balancing, and robust disaster recovery capabilities. Learn how to deploy a 3-control plane cluster, configure load balancers, implement etcd best practices, and prepare for node failures.
Key Takeaways:
- Multiple control plane nodes provide redundancy and fault tolerance
- Load balancing ensures API server availability during node failures
- etcd requires careful configuration for high availability
- Regular backups are essential for disaster recovery
- Testing failure scenarios validates your HA setup
Introduction
Why This Matters
In Part 2: Talos Installation, you built a single control plane cluster suitable for learning and development. However, production environments require high availability to ensure continuous operation even when nodes fail.
This article teaches you how to:
- Deploy multiple control plane nodes for redundancy
- Configure load balancing for the Kubernetes API server
- Implement etcd best practices for high availability
- Test and handle node failure scenarios
- Backup and restore etcd data
- Plan and execute disaster recovery procedures
What You’ll Learn
- Multi-control plane architecture and design
- Load balancer configuration (HAProxy, MetalLB)
- etcd cluster configuration and best practices
- Node failure detection and recovery
- Backup and restore procedures
- Disaster recovery planning and execution
- Cluster health monitoring
Prerequisites
Before starting, you should have:
- Completed Part 2: Talos Installation
- Completed Part 3: Configuration Management
- A running Talos Linux cluster (can start with single control plane)
- Additional hardware/VMs for the extra control plane nodes (minimum 2 more nodes)
- talosctl installed and configured
- kubectl configured with cluster access
- Understanding of load balancing concepts
- Basic knowledge of etcd
High Availability Architecture
Single vs Multi-Control Plane
Single Control Plane (Current Setup):
- One control plane node
- Single point of failure
- Suitable for development/homelab learning
- No redundancy
Multi-Control Plane (Target Setup):
- Three or more control plane nodes
- Fault tolerant
- Production-ready
- API server redundancy
Recommended Architecture
┌─────────────────────────────────────────┐
│              Load Balancer              │
│           (HAProxy / MetalLB)           │
│          192.168.178.201:6443           │
└──────────────┬──────────────────────────┘
               │
       ┌───────┼───────┐
       │       │       │
┌──────▼──┐ ┌──▼────┐ ┌▼───────┐
│ Control │ │Control│ │Control │
│ Plane 1 │ │Plane 2│ │Plane 3 │
│   .55   │ │  .58  │ │  .59   │
└─────────┘ └───────┘ └────────┘
     │          │          │
     └──────────┼──────────┘
                │
        ┌───────┼───────┐
        │       │       │
┌───────▼─┐ ┌───▼───┐ ┌─▼──────┐
│ Worker 1│ │Worker │ │Worker 3│
│   .56   │ │ 2 .57 │ │  .60   │
└─────────┘ └───────┘ └────────┘
Note: The architecture diagram above shows a multi-control plane HA setup. For homelab setups, a single control plane (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57) is often sufficient. This article demonstrates how HA could be implemented, but the actual homelab setup may remain single control plane for simplicity and resource constraints.
Multi-Control Plane Setup
Planning Your HA Cluster
Requirements:
- Minimum 3 control plane nodes (odd number for etcd quorum; see the quorum sketch after this list)
- Each control plane node needs:
- 2+ CPU cores (4+ recommended)
- 4GB+ RAM (8GB+ recommended)
- 50GB+ storage
- Network connectivity to all nodes
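The "odd number" rule comes straight from etcd's quorum arithmetic: a cluster of n members needs floor(n/2) + 1 votes to commit writes, so it tolerates n minus quorum failures. A minimal illustration (plain shell arithmetic, nothing Talos-specific):
# Quorum = floor(n/2) + 1; tolerated failures = n - quorum
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members: $n  quorum: $quorum  tolerated failures: $(( n - quorum ))"
done
Note that 4 members tolerate no more failures than 3 (quorum rises to 3), which is why even-sized etcd clusters add risk without adding resilience.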
Node Allocation (Example for HA Setup):
- Control Plane 1: 192.168.178.55 (existing in homelab)
- Control Plane 2: 192.168.178.58 (hypothetical - for demonstration only)
- Control Plane 3: 192.168.178.59 (hypothetical - for demonstration only)
- Worker nodes: 192.168.178.56, .57, .60 (existing workers + hypothetical additional worker)
Note: The above node allocation is an example showing how HA could be configured. The actual homelab setup uses only a single control plane node (192.168.178.55) with two worker nodes (192.168.178.56, 192.168.178.57).
Adding Additional Control Plane Nodes
Step 1: Install Talos on New Nodes
# Install Talos on new control plane nodes
# (Follow Part 2 installation process)
# Nodes should boot and be accessible via network
Step 2: Generate Control Plane Configuration
# Generate configuration for additional control plane nodes
# Use the same cluster endpoint
talosctl gen config discworld-homelab \
https://192.168.178.201:6443 \
--output-dir ./ha-configs
# This generates:
# - controlplane.yaml (for all control plane nodes)
# - worker.yaml (for worker nodes)
# - talosconfig
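If you are extending an existing cluster rather than bootstrapping a fresh one, the new control plane nodes must share the original cluster's secrets (CA, tokens), or they will not be able to join etcd. A sketch of one way to handle this, assuming you keep a secrets bundle alongside your configs:
# Generate a secrets bundle once (or reuse the one from your original cluster),
# then render the machine configs from it so all nodes share the same PKI
talosctl gen secrets -o secrets.yaml
talosctl gen config discworld-homelab \
  https://192.168.178.201:6443 \
  --with-secrets secrets.yaml \
  --output-dir ./ha-configs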
Step 3: Apply Configuration to New Control Plane Nodes
# Apply to control plane node 2
talosctl apply-config \
--insecure \
--nodes 192.168.178.58 \
--file ./ha-configs/controlplane.yaml
# Apply to control plane node 3
talosctl apply-config \
--insecure \
--nodes 192.168.178.59 \
--file ./ha-configs/controlplane.yaml
Step 4: Update Endpoints
# Update talosconfig with all control plane endpoints
talosctl config endpoint 192.168.178.55 192.168.178.58 192.168.178.59
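To confirm what your local client configuration now points at, a quick check (assuming a reasonably recent talosctl that provides config info):
# Show the active context, endpoints, and nodes from the local talosconfig
talosctl config info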
Verifying Multi-Control Plane Setup
# Check cluster members
talosctl get members
# Verify all control plane nodes
kubectl get nodes -l node-role.kubernetes.io/control-plane
# Check etcd cluster status
talosctl --nodes 192.168.178.55 etcd status
Expected Output:
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
talos-cp-1 Ready control-plane 1h v1.34.3
talos-cp-2 Ready control-plane 30m v1.34.3
talos-cp-3 Ready control-plane 15m v1.34.3
talos-worker-1 Ready <none> 1h v1.34.3
talos-worker-2 Ready <none> 1h v1.34.3
Load Balancing for API Server
Why Load Balancing?
The Kubernetes API server must be accessible even if individual control plane nodes fail. A load balancer distributes traffic across all healthy control plane nodes.
Option 1: HAProxy Load Balancer
Installing HAProxy:
# On a separate machine or one of your nodes
# Install HAProxy (example for Ubuntu/Debian)
sudo apt update
sudo apt install -y haproxy
# Or use a containerized HAProxy
HAProxy Configuration:
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 4096
    daemon

defaults
    log     global
    mode    tcp
    option  tcplog
    timeout connect 5000ms
    timeout client  50000ms
    timeout server  50000ms

frontend kubernetes-api
    bind 192.168.178.201:6443
    default_backend k8s-api-servers

backend k8s-api-servers
    balance roundrobin
    option tcp-check
    server k8s-api-1 192.168.178.55:6443 check
    server k8s-api-2 192.168.178.58:6443 check
    server k8s-api-3 192.168.178.59:6443 check
Starting HAProxy:
# Start HAProxy
sudo systemctl start haproxy
sudo systemctl enable haproxy
# Verify it's running
sudo systemctl status haproxy
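Before restarting HAProxy after any configuration change, it is worth validating the file and confirming the listener is up; a small sketch:
# Validate the configuration file without starting the service
sudo haproxy -c -f /etc/haproxy/haproxy.cfg

# Confirm HAProxy is listening on the API endpoint
ss -tlnp | grep 6443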
Option 2: MetalLB (Kubernetes-Native)
Installing MetalLB:
# Apply MetalLB manifest
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
# Wait for MetalLB to be ready
kubectl wait --namespace metallb-system \
--for=condition=ready pod \
--selector=app=metallb \
--timeout=90s
Configuring MetalLB:
# metallb-config.yaml
# IP range: 192.168.178.201-220 (outside DHCP reservation range 20-200)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: api-server-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.178.201-192.168.178.210  # Range for API server and other LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: api-server-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - api-server-pool
Applying MetalLB Configuration:
kubectl apply -f metallb-config.yaml
# Verify the configuration
kubectl get ipaddresspool -n metallb-system
kubectl get l2advertisement -n metallb-system
Creating LoadBalancer Service:
Note: This example is for demonstration purposes only. With a single control plane setup, direct access to the control plane endpoint (192.168.178.55:6443) is used instead of a LoadBalancer service. The Kubernetes API server runs as a static pod managed by Talos, not as a regular Kubernetes service, so creating a LoadBalancer service for it requires additional configuration (e.g., using a NodePort service or an external load balancer). For HA setups with multiple control planes, a load balancer (HAProxy, MetalLB, or external) would route traffic to all control plane nodes.
# k8s-api-loadbalancer.yaml
# NOTE: This is a conceptual example for HA setups
# For single control plane, use direct endpoint: 192.168.178.55:6443
apiVersion: v1
kind: Service
metadata:
  name: k8s-api-loadbalancer
  namespace: default
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.178.201  # MetalLB will assign this IP
  ports:
    - port: 6443
      targetPort: 6443
      protocol: TCP
# Note: API server runs as a static pod, so a pod selector won't work directly
# For actual implementation, you'd need to use NodePort or an external load balancer
# that routes to control plane node IPs
Updating Cluster Endpoint
After setting up load balancing, update your cluster configuration:
# Update kubeconfig to use load balancer endpoint
kubectl config set-cluster discworld-homelab \
--server=https://192.168.178.201:6443
# Update talosconfig endpoint
talosctl config endpoint 192.168.178.201
Testing Load Balancer:
# Test API server access through load balancer
kubectl cluster-info
# Test with direct API call
curl -k https://192.168.178.201:6443/version
etcd Best Practices
etcd in High Availability
etcd is the distributed key-value store that stores all Kubernetes cluster data. In a multi-control plane setup, etcd runs on each control plane node.
etcd Cluster Health
Checking etcd Status:
# Check etcd members
talosctl --nodes 192.168.178.55 etcd members
# Check etcd health
talosctl --nodes 192.168.178.55 etcd status
# View etcd logs
talosctl --nodes 192.168.178.55 logs etcd
etcd Configuration Best Practices
- Odd Number of Nodes: Always use odd number (3, 5, 7) for quorum
- Network Latency: Keep etcd nodes on low-latency network (< 10ms)
- Disk Performance: Use fast SSDs for etcd data directory
- Resource Limits: Ensure adequate CPU and memory
- Backup Regularly: Automated backups are essential
etcd Performance Tuning
# etcd configuration patch (etcd settings live under the cluster section of the machine config)
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: "8589934592" # 8GB
      max-request-bytes: "1572864"      # 1.5MB
Note: Default etcd configuration is usually sufficient for homelab setups. Performance tuning is typically needed for production environments with high write loads or large datasets.
Hands-On Exercise: Node Failure Scenarios and Disaster Recovery
Learning Objective: This exercise helps you understand how Kubernetes clusters handle failures and how to implement disaster recovery procedures. While our homelab uses a single control plane (which means no HA failover), these exercises demonstrate important concepts for production environments.
Exercise: Testing Node Failures
Objective: Understand how Kubernetes handles node failures and practice recovery procedures.
Prerequisites:
- Running Talos cluster
- kubectl and talosctl configured
- Understanding of cluster architecture
Scenario 1: Worker Node Failure
This is safe to test in a homelab environment:
# 1. Deploy a test workload
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=3
# 2. Check where pods are running
kubectl get pods -o wide
# 3. Simulate worker node failure
# Option A: If using VMs, power off one worker node
# Option B: If physical, unplug network cable temporarily
# Option C: Use kubectl to cordon and drain (safer)
kubectl cordon <worker-node-name>
kubectl drain <worker-node-name> --ignore-daemonsets --delete-emptydir-data
# 4. Observe pod rescheduling
watch kubectl get pods -o wide
# 5. Restore the node
kubectl uncordon <worker-node-name>
# Or power on/restore network connection
# 6. Verify pods can reschedule
kubectl get pods -o wide
What to Observe:
- How long it takes for pods to be rescheduled
- Which node the pods move to
- Any service interruptions
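A few commands you might keep running in separate terminals while the node is down, to see the mechanics (the event filter is only an illustration):
# Watch the node flip to NotReady and pods get rescheduled
kubectl get nodes -w
kubectl get pods -o wide -w

# Recent events related to the failure
kubectl get events --sort-by=.lastTimestamp | grep -Ei 'notready|taint|evict'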
Scenario 2: Control Plane Node Failure (Single Control Plane)
Warning: With a single control plane, this will make the cluster unavailable. Only test this if you’re comfortable with cluster downtime.
# 1. Note current cluster state
kubectl get nodes
kubectl get pods --all-namespaces
# 2. Power off control plane node (192.168.178.55)
# Cluster will become unavailable
# 3. Try to access cluster
kubectl get nodes # Will fail
# 4. Power on control plane node
# Wait for Talos to boot and cluster to recover
# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces
What to Learn:
- Single control plane = single point of failure
- Why HA requires multiple control plane nodes
- Recovery time after control plane failure
Scenario 3: Understanding HA Failover (Conceptual)
For HA setups with multiple control planes, you would test:
# In an HA setup with 3 control planes:
# 1. Power off one control plane node
# 2. Cluster should continue operating (2 of 3 nodes maintain quorum)
# 3. API server remains accessible via load balancer
# 4. Workloads continue running
# 5. etcd maintains quorum
# This demonstrates why HA requires 3+ control plane nodes
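If you were running the hypothetical three-control-plane layout from earlier, you could confirm that quorum survives the loss of one node by querying etcd from a surviving control plane:
# With control plane 1 (.55) powered off, etcd should still report a healthy cluster
talosctl --nodes 192.168.178.58 etcd status
talosctl --nodes 192.168.178.58 etcd members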
Exercise: Backup and Restore Procedures
Objective: Practice creating and restoring etcd backups.
Creating etcd Backup:
# Backup etcd from control plane node
talosctl --nodes 192.168.178.55 etcd snapshot \
  /tmp/etcd-backup-$(date +%Y%m%d).db
# Verify backup was created
ls -lh /tmp/etcd-backup-*.db
# Copy backup to safe location (if you have a backup server)
# scp /tmp/etcd-backup-*.db user@backup-server:/backups/
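If you have the standalone etcd tools installed on your workstation (they are not part of Talos), you can optionally sanity-check a snapshot before trusting it; the filename below is just an example:
# Inspect the snapshot's hash, revision, and size (requires etcdutl locally)
etcdutl snapshot status /tmp/etcd-backup-20260104.db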
Automated Backup Script:
Create a backup script for regular backups:
#!/bin/bash
# etcd-backup.sh
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
CONTROL_PLANE_IP="192.168.178.55"

# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"

# Take an etcd snapshot from the control plane node
talosctl --nodes "$CONTROL_PLANE_IP" etcd snapshot \
  "$BACKUP_DIR/etcd-backup-$DATE.db"

# Keep only last 7 days of backups
find "$BACKUP_DIR" -name "etcd-backup-*.db" -mtime +7 -delete

echo "Backup completed: $BACKUP_DIR/etcd-backup-$DATE.db"
Setting Up Automated Backups:
# Make script executable
chmod +x etcd-backup.sh
# Test the script
./etcd-backup.sh
# Add to crontab for daily backups at 2 AM
crontab -e
# Add: 0 2 * * * /path/to/etcd-backup.sh
Testing Restore (Advanced - Optional):
Warning: Restore testing will cause cluster downtime. Only attempt in a lab/test environment.
# 1. Create a test backup
talosctl --nodes 192.168.178.55 etcd snapshot /tmp/test-backup.db
# 2. Note: Full restore requires stopping the cluster
# This is complex and should be tested in a lab environment
# Refer to Talos documentation for complete restore procedures
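For orientation, Talos' documented disaster-recovery flow centers on re-bootstrapping etcd from a snapshot on a rebuilt control plane node; a rough sketch, to be verified against the backup-and-restore documentation for your Talos version:
# Rough sketch only - verify against the Talos backup & restore docs for your version
# (snapshot path is an example)
talosctl --nodes 192.168.178.55 bootstrap --recover-from=/tmp/etcd-backup-20260104.db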
Exercise: Disaster Recovery Planning
Objective: Create a disaster recovery plan for your cluster.
Document Your Recovery Plan:
Current Setup:
- Number of control plane nodes: 1 (single control plane)
- Number of worker nodes: 2
- Backup location: Document where backups are stored
- Configuration repository: Document Git repository location
Recovery Scenarios:
Scenario 1: Control Plane Node Failure
- Impact: Cluster unavailable (single point of failure)
- Recovery steps: 1. Power on node, 2. Wait for Talos boot, 3. Verify cluster health
- Recovery time: Document observed recovery time
Scenario 2: Worker Node Failure
- Impact: Pods on failed node need rescheduling
- Recovery steps: 1. Power on node, 2. Wait for node to join, 3. Verify pods reschedule
- Recovery time: Document observed recovery time
Scenario 3: Complete Cluster Failure
- Impact: All nodes lost, need to rebuild
- Recovery steps: 1. Reinstall Talos on nodes, 2. Restore etcd from backup, 3. Reapply configurations from Git
- Recovery time: Estimate based on your setup
Backup Strategy:
- etcd backup frequency: Daily/Weekly/etc.
- Configuration backup: Git repository
- Backup retention: 7 days/30 days/etc.
- Backup location: Local/Remote/Cloud
Best Practices
High Availability
- Odd Number of Control Planes: Always use 3, 5, or 7 control plane nodes
- Load Balancing: Always use load balancer for API server access
- Network Redundancy: Use multiple network paths if possible
- Regular Testing: Test failure scenarios regularly
- Monitoring: Monitor cluster health continuously
etcd
- Regular Backups: Automated daily backups minimum
- Off-Site Storage: Store backups in separate location
- Test Restores: Regularly test backup restoration
- Performance Tuning: Optimize etcd for your workload
- Resource Allocation: Ensure adequate resources
Disaster Recovery
- Documentation: Keep detailed documentation of cluster setup
- Version Control: Store all configurations in Git
- Regular Drills: Practice disaster recovery procedures
- Backup Testing: Test backups regularly
- Recovery Procedures: Document step-by-step recovery
Troubleshooting
Common Issue 1: Control Plane Node Not Joining
Problem: New control plane node doesn’t join cluster
Solution:
# Check node status
talosctl --nodes <NODE_IP> get members
# Check etcd logs
talosctl --nodes <NODE_IP> logs etcd
# Verify network connectivity
ping <NODE_IP>
# Check configuration
talosctl --nodes <NODE_IP> get machineconfig
Common Issue 2: etcd Cluster Unhealthy
Problem: etcd cluster shows unhealthy status
Solution:
# Check etcd status on all nodes
talosctl --nodes <CP_IP_1> etcd status
talosctl --nodes <CP_IP_2> etcd status
talosctl --nodes <CP_IP_3> etcd status
# Check etcd logs
talosctl --nodes <CP_IP> logs etcd
# Verify network connectivity between nodes
Common Issue 3: Load Balancer Not Working
Problem: Cannot access API server through load balancer
Solution:
# Test load balancer directly
curl -k https://<LB_IP>:6443/version
# Check load balancer status
# (HAProxy: systemctl status haproxy)
# (MetalLB: kubectl get pods -n metallb-system)
# Verify backend servers
# (HAProxy: check haproxy stats)
# (MetalLB: check service endpoints)
Common Issue 4: Backup/Restore Fails
Problem: etcd backup or restore operation fails
Solution:
# Verify backup file exists and is valid
file /backups/etcd-backup-*.db
# Check disk space
df -h
# Verify etcd is stopped before restore
talosctl --nodes <CP_IP> services | grep etcd
# Check etcd logs during restore
talosctl --nodes <CP_IP> logs etcd
Summary
Key takeaways from high availability setup:
- Multiple control plane nodes provide fault tolerance
- Load balancing ensures API server availability
- etcd requires careful configuration and regular backups
- Testing failure scenarios validates HA setup
- Disaster recovery planning is essential
What We Accomplished:
- Understood multi-control plane HA architecture
- Learned how to configure load balancers (HAProxy and MetalLB)
- Explored etcd best practices for high availability
- Practiced node failure scenarios and recovery procedures
- Implemented etcd backup and restore procedures
- Created disaster recovery plans
Next Steps
Now that you understand high availability concepts:
- Part 5: Storage Configuration (Coming Soon) - Configure persistent storage for workloads
- Part 6: Networking (Coming Soon) - Advanced networking with CNI and ingress
- Monitor cluster health and performance
- Plan for additional worker nodes if needed
Recommended Reading
If you want to dive deeper into Talos Linux and Kubernetes, here are some excellent books that complement this series:
Note: The Amazon links below are affiliate links for Amazon Influencers and Associates. If you make a purchase through these links, I may earn a small commission at no additional cost to you.
Talos Linux Books
- Talos Linux for DevOps: Modern Infrastructure Engineering with an Immutable Kubernetes OS - Comprehensive guide to Talos Linux for DevOps professionals
- TALOS LINUX IN DEPTH: A Complete Guide to Deploy and Manage Production-Ready Kubernetes Clusters with Zero Trust Security, Immutable Infrastructure, and GitOps - In-depth guide covering production deployments with security and GitOps
Kubernetes Books
- Kubernetes-Grundlagen: Ein praktischer Leitfaden zur Container-Orchestrierung - Practical guide to container orchestration (German)
- Kubernetes: Das Praxisbuch für Entwickler und DevOps-Teams. Modernes Deployment für Container-Infrastrukturen - Practical book for developers and DevOps teams on modern container infrastructure deployment (German)
- The Kubernetes Book - Comprehensive guide to Kubernetes concepts and practices
Resources
Official Documentation
- High Availability (2025) Sidero Documentation. Available at: https://docs.siderolabs.com/talos/v1.11/kubernetes-guides/configuration/high-availability/ (Accessed: 4 January 2026).
- etcd (2025) Sidero Documentation. Available at: https://docs.siderolabs.com/talos/v1.11/kubernetes-guides/configuration/etcd/ (Accessed: 4 January 2026).
- Backup and Restore (2025) Sidero Documentation. Available at: https://docs.siderolabs.com/talos/v1.11/kubernetes-guides/backup-restore/ (Accessed: 4 January 2026).
Related Articles
- Part 3: Configuration Management
- Part 5: Storage Configuration (Coming Soon)
Tools and Utilities
- HAProxy (2025). Available at: http://www.haproxy.org/ (Accessed: 4 January 2026).
- MetalLB (2025). Available at: https://metallb.universe.tf/ (Accessed: 4 January 2026).
Community Resources
- r/homelab - Homelab community on Reddit
- r/kubernetes - Kubernetes community on Reddit
Series Navigation
Previous: Part 3 - Talos Configuration Management - GitOps for Infrastructure
Current: Part 4 - High Availability Setup - Production-Grade Cluster ✓
Next: Part 5 - Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)
Full Series:
- Talos Linux Introduction
- Talos Installation - Building Your First Cluster
- Talos Configuration Management - GitOps for Infrastructure
- High Availability Setup - Production-Grade Cluster (You are here)
- Storage Configuration - Persistent Storage for Kubernetes (Coming Soon)
- Networking - CNI, Load Balancing, and Ingress (Coming Soon)
- Security Hardening - Securing Your Homelab Cluster (Coming Soon)
- Monitoring and Maintenance - Keeping Your Cluster Healthy (Coming Soon)
This article is part of the “Talos Linux Homelab” series. Follow along as we build a production-grade Kubernetes homelab from the ground up.
Questions or feedback? Reach out via email or connect on LinkedIn.