hyper2kvm

Monitoring & Observability Guide

Comprehensive guide for monitoring migrated VMs and ensuring operational excellence.



Overview

Effective monitoring ensures:


Monitoring Strategy

Three-Phase Approach

Phase 1: Migration (Real-time)
├── Track migration progress
├── Monitor resource usage
└── Detect failures immediately

Phase 2: Post-Migration (Intensive - 7 days)
├── Validate performance vs baseline
├── Monitor for regressions
├── Track stability metrics
└── Identify optimization opportunities

Phase 3: Steady State (Ongoing)
├── Standard monitoring
├── Trend analysis
└── Capacity planning

Pre-Production Monitoring

Migration Progress Tracking

Monitor active migrations in real-time.

Script: monitor-active-migrations.sh

#!/bin/bash
# Monitor active migrations

echo "=== Active Migration Monitor ==="
date

# Check running processes
echo ""
echo "Running Migrations:"
ps aux | grep -E "h2kvmctl|hyper2kvm" | grep -v grep | awk '{print $11, $12, $13}'

# System resources
echo ""
echo "System Resources:"
# top's second field is user-space CPU time only, not total utilization
echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')% user"
echo "Memory: $(free -h | grep Mem | awk '{print $3 " / " $2}')"
echo "Load: $(uptime | awk -F'load average:' '{print $2}')"

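# Disk I/O
# The first iostat report is an average since boot, so show the second one
echo ""
echo "Disk I/O (first 3 devices):"
iostat -dx 1 2 | awk '/^Device/ {n++} n == 2' | head -n 4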

# Recent logs
echo ""
echo "Recent Activity:"
tail -n 5 /var/log/hyper2kvm/*.log 2>/dev/null | grep -E "INFO|ERROR|WARNING" | tail -3

Dashboard: Set up continuous monitoring

watch -n 5 './monitor-active-migrations.sh'
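
For a quick summary of how noisy the migration logs are, the level keywords that the monitor script greps for can be tallied. This is a minimal sketch: the function name `count_levels` is hypothetical, and it assumes each log line carries its level as a plain word, which may not match the real hyper2kvm log format.

```bash
#!/bin/bash
# count_levels: tally INFO/WARNING/ERROR lines read from stdin.
# Assumption: the level appears as a plain word somewhere on the line.
count_levels() {
    awk '
        /ERROR/   { e++ }
        /WARNING/ { w++ }
        /INFO/    { i++ }
        END { printf "INFO=%d WARNING=%d ERROR=%d\n", i, w, e }
    '
}

# Example, using the log path from the monitor script:
# cat /var/log/hyper2kvm/*.log 2>/dev/null | count_levels
```

A rising ERROR count between two runs is usually a better trigger for a closer look than the last few log lines alone.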

Disk Space Monitoring

Critical during migrations to prevent failures.

Script: check-disk-space.sh

#!/bin/bash
# Disk space monitoring

THRESHOLD=80       # Critical threshold (percent)
WARN_THRESHOLD=70  # Warning threshold (percent)

echo "=== Disk Space Monitor ==="

# -P keeps every filesystem on a single line, even with long device names
df -hP | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{print $5 " " $6}' | while read -r usage partition; do
    usage=${usage%\%}

    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "❌ CRITICAL: $partition is ${usage}% full"
    elif [ "$usage" -ge "$WARN_THRESHOLD" ]; then
        echo "⚠️  WARNING: $partition is ${usage}% full"
    else
        echo "✅ OK: $partition is ${usage}% full"
    fi
done
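
Percentage thresholds do not say whether a particular disk image will fit. Before starting a migration it can help to compare the free space on the target filesystem with the size that is about to be written. A minimal sketch, with a hypothetical function name `has_room`:

```bash
#!/bin/bash
# has_room PATH REQUIRED_KIB
# Succeeds when the filesystem holding PATH has at least REQUIRED_KIB free.
has_room() {
    local path="$1" required_kib="$2" avail_kib
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    avail_kib=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    [ -n "$avail_kib" ] && [ "$avail_kib" -ge "$required_kib" ]
}

# Example: require 50 GiB on the image directory.
# The path is an assumption; substitute the storage pool actually in use.
# has_room /var/lib/libvirt/images $((50 * 1024 * 1024)) || echo "Not enough space"
```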

Production Monitoring

VM Health Checks

Continuous validation of migrated VMs.

Script: vm-health-check.sh

#!/bin/bash
# VM health check script

VM_NAME=$1

check_vm_health() {
    local vm=$1
    local status="OK"
    local issues=()

    # Check 1: VM State
    state=$(virsh domstate "$vm" 2>/dev/null)
    if [ "$state" != "running" ]; then
        status="CRITICAL"
        issues+=("VM not running (state: $state)")
    fi

    # Check 2: CPU Usage
    cpu_time=$(virsh cpu-stats "$vm" --total 2>/dev/null | grep "cpu_time" | awk '{print $2}')

    # Check 3: Memory
    # Warnings below only apply while status is OK, so they never
    # downgrade an earlier CRITICAL result.
    mem_stats=$(virsh dommemstat "$vm" 2>/dev/null)
    if [ $? -ne 0 ]; then
        [ "$status" = "OK" ] && status="WARNING"
        issues+=("Cannot retrieve memory stats")
    fi

    # Check 4: Network
    ip=$(virsh domifaddr "$vm" 2>/dev/null | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1)
    if [ -n "$ip" ]; then
        if ! ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
            [ "$status" = "OK" ] && status="WARNING"
            issues+=("Network unreachable")
        fi
    else
        [ "$status" = "OK" ] && status="WARNING"
        issues+=("No IP address")
    fi

    # Check 5: Disk I/O (assumes the first disk is named vda)
    disk_stats=$(virsh domblkstat "$vm" vda 2>/dev/null)
    if [ $? -ne 0 ]; then
        [ "$status" = "OK" ] && status="WARNING"
        issues+=("Cannot retrieve disk stats")
    fi

    # Report
    echo "VM: $vm - Status: $status"
    if [ ${#issues[@]} -gt 0 ]; then
        for issue in "${issues[@]}"; do
            echo "  - $issue"
        done
    fi

    # Return status code
    if [ "$status" = "CRITICAL" ]; then
        return 2
    elif [ "$status" = "WARNING" ]; then
        return 1
    else
        return 0
    fi
}

if [ -z "$VM_NAME" ]; then
    # Check all VMs
    for vm in $(virsh list --name); do
        [ -z "$vm" ] && continue
        check_vm_health "$vm"
    done
else
    # Check specific VM
    check_vm_health "$VM_NAME"
fi

Usage:

# Check all VMs
./vm-health-check.sh

# Check specific VM
./vm-health-check.sh web-server-01

# Continuous monitoring
watch -n 30 './vm-health-check.sh'
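
When several checks feed one overall status, a later WARNING should never mask an earlier CRITICAL. One way to make that rule explicit is a small helper that only ever raises severity. This is a sketch: the function names `severity_rank` and `escalate` are hypothetical, and the ranks mirror the health check's exit codes (0, 1, 2).

```bash
#!/bin/bash
# severity_rank STATUS: print a numeric rank (OK=0, WARNING=1, CRITICAL=2).
severity_rank() {
    case "$1" in
        CRITICAL) echo 2 ;;
        WARNING)  echo 1 ;;
        *)        echo 0 ;;
    esac
}

# escalate CURRENT NEW: print whichever of the two is more severe.
escalate() {
    if [ "$(severity_rank "$2")" -gt "$(severity_rank "$1")" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# Example inside a check:
# status=$(escalate "$status" WARNING)
```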

Performance Baseline Comparison

Compare post-migration performance to baseline.

Collect Baseline (before migration):

#!/bin/bash
# collect-baseline.sh - Run on source VM

VM_NAME=$1
BASELINE_FILE="baseline-${VM_NAME}.txt"

echo "=== Performance Baseline for $VM_NAME ===" > "$BASELINE_FILE"
echo "Collected: $(date)" >> "$BASELINE_FILE"
echo "" >> "$BASELINE_FILE"

# CPU
echo "CPU:" >> "$BASELINE_FILE"
mpstat 1 10 | tail -1 >> "$BASELINE_FILE"

# Memory
echo "" >> "$BASELINE_FILE"
echo "Memory:" >> "$BASELINE_FILE"
free -h >> "$BASELINE_FILE"

# Disk I/O
echo "" >> "$BASELINE_FILE"
echo "Disk I/O:" >> "$BASELINE_FILE"
iostat -x 1 10 | tail -n +4 >> "$BASELINE_FILE"

# Network
echo "" >> "$BASELINE_FILE"
echo "Network:" >> "$BASELINE_FILE"
sar -n DEV 1 10 | grep -v "^$" | tail -5 >> "$BASELINE_FILE"

echo "Baseline collected: $BASELINE_FILE"

Compare Performance (after migration):

#!/bin/bash
# compare-performance.sh

VM_NAME=$1
BASELINE_FILE="baseline-${VM_NAME}.txt"
CURRENT_FILE="current-${VM_NAME}.txt"

# Collect current metrics (same as baseline)
# ... (same collection commands)

# Compare
echo "=== Performance Comparison ==="
echo "Baseline vs Current for $VM_NAME"
echo ""

# Show side-by-side comparison
echo "CPU Idle %:"
echo "  Baseline: $(grep "all" "$BASELINE_FILE" | awk '{print $NF}')"
echo "  Current:  $(grep "all" "$CURRENT_FILE" | awk '{print $NF}')"

# Memory comparison
# Disk I/O comparison
# Network comparison
# ...
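
A side-by-side listing is easier to act on when it also shows the relative change. The helper below is a minimal sketch for that; the name `pct_change` is hypothetical, and it works on any pair of numeric values extracted from the baseline and current files.

```bash
#!/bin/bash
# pct_change BASELINE CURRENT: print the signed percentage change
# from BASELINE to CURRENT with one decimal place.
pct_change() {
    awk -v b="$1" -v c="$2" 'BEGIN {
        if (b == 0) { print "n/a"; exit }   # avoid division by zero
        printf "%.1f\n", (c - b) / b * 100
    }'
}

# Example with the CPU idle values from the comparison script:
# base=$(grep "all" "$BASELINE_FILE" | awk '{print $NF}')
# cur=$(grep "all" "$CURRENT_FILE" | awk '{print $NF}')
# echo "CPU idle change: $(pct_change "$base" "$cur")%"
```

For example, `pct_change 80 60` prints `-25.0`, meaning idle time dropped by a quarter relative to the baseline.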

Performance Metrics

Key Metrics to Track

1. CPU Metrics

# CPU utilization per VM
virsh cpu-stats VM_NAME --total

# Host CPU usage
mpstat 1 5

# Per-vCPU stats
virsh vcpuinfo VM_NAME
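
`virsh cpu-stats` reports cumulative CPU time, so utilization has to be derived from two samples taken a known interval apart. A minimal sketch, assuming `cpu_time` is read in seconds; the function name `cpu_pct` is hypothetical:

```bash
#!/bin/bash
# cpu_pct T0 T1 INTERVAL [VCPUS]
# T0 and T1 are cumulative cpu_time values in seconds, INTERVAL is the
# wall-clock gap between them. With VCPUS given, the result is scaled so
# that 100 means every vCPU was fully busy.
cpu_pct() {
    awk -v t0="$1" -v t1="$2" -v dt="$3" -v n="${4:-1}" 'BEGIN {
        if (dt <= 0 || n <= 0) { print "n/a"; exit }
        printf "%.1f\n", (t1 - t0) / dt / n * 100
    }'
}

# Example sampling loop:
# t0=$(virsh cpu-stats VM_NAME --total | awk '/cpu_time/ {print $2}')
# sleep 10
# t1=$(virsh cpu-stats VM_NAME --total | awk '/cpu_time/ {print $2}')
# cpu_pct "$t0" "$t1" 10
```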

Thresholds:


2. Memory Metrics

# VM memory stats
virsh dommemstat VM_NAME

# Maximum and current (balloon) memory allocation
virsh dominfo VM_NAME

# Host memory
free -h
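
The `dommemstat` output can be turned into a guest-side usage percentage when the guest reports the `available` and `unused` fields. A minimal sketch under that assumption (the fields normally require a working balloon driver in the guest); the function name `mem_used_pct` is hypothetical:

```bash
#!/bin/bash
# mem_used_pct: read `virsh dommemstat` output on stdin and print the
# share of guest-visible memory that is in use.
mem_used_pct() {
    awk '
        $1 == "available" { avail = $2 }
        $1 == "unused"    { unused = $2 }
        END {
            if (avail > 0) printf "%.1f\n", (avail - unused) / avail * 100
            else print "n/a"   # guest did not report the fields
        }
    '
}

# Example:
# virsh dommemstat VM_NAME | mem_used_pct
```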

Key Metrics:

Thresholds:


3. Disk I/O Metrics

# VM disk stats
virsh domblkstat VM_NAME vda

# Detailed disk I/O
virsh domblklist VM_NAME
# Skip the header and separator lines, plus any blank line
for disk in $(virsh domblklist VM_NAME | awk 'NR>2 && NF {print $1}'); do
    echo "Disk: $disk"
    virsh domblkstat VM_NAME "$disk"
done

# Host disk I/O
iostat -x 1 5
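
The request counters from `virsh domblkstat` are cumulative, so IOPS comes from the difference between two snapshots. A minimal sketch with hypothetical function names `blk_requests` and `iops`; it reads the value from the last field of each line so it does not depend on whether the device name is printed first.

```bash
#!/bin/bash
# blk_requests: sum rd_req and wr_req from `virsh domblkstat` output (stdin).
blk_requests() {
    awk 'NF >= 2 && ($(NF-1) == "rd_req" || $(NF-1) == "wr_req") { total += $NF }
         END { printf "%.0f\n", total }'
}

# iops BEFORE_TOTAL AFTER_TOTAL INTERVAL_SECONDS
iops() {
    awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN {
        if (dt <= 0) { print "n/a"; exit }
        printf "%.1f\n", (b - a) / dt
    }'
}

# Example:
# before=$(virsh domblkstat VM_NAME vda | blk_requests)
# sleep 10
# after=$(virsh domblkstat VM_NAME vda | blk_requests)
# iops "$before" "$after" 10
```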

Key Metrics:

Thresholds (IOPS):


4. Network Metrics

# VM network stats
virsh domifstat VM_NAME vnet0

# All interfaces
# Skip the header and separator lines, plus any blank line
for iface in $(virsh domiflist VM_NAME | awk 'NR>2 && NF {print $1}'); do
    echo "Interface: $iface"
    virsh domifstat VM_NAME "$iface"
done

# Detailed network info
virsh domifaddr VM_NAME

# Host network
sar -n DEV 1 5
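
Besides throughput, `virsh domifstat` exposes error and drop counters (`rx_errs`, `rx_drop`, `tx_errs`, `tx_drop`), which are a cheap signal for a misbehaving virtual NIC after migration. A minimal sketch that sums them; the function name `net_problem_count` is hypothetical:

```bash
#!/bin/bash
# net_problem_count: sum the error and drop counters from
# `virsh domifstat` output read on stdin.
net_problem_count() {
    awk 'NF >= 2 && $(NF-1) ~ /^(rx|tx)_(errs|drop)$/ { total += $NF }
         END { printf "%.0f\n", total }'
}

# Example:
# virsh domifstat VM_NAME vnet0 | net_problem_count
```

The counters are cumulative, so it is the growth between two readings that matters rather than the absolute value.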

Key Metrics:


Performance Monitoring Script

Complete performance monitoring solution:

#!/bin/bash
# performance-monitor.sh - Comprehensive performance tracking

VM_NAME=$1
INTERVAL=${2:-60}
LOG_DIR="./perf-logs"

mkdir -p "$LOG_DIR"
LOG_FILE="$LOG_DIR/${VM_NAME}-$(date +%Y%m%d).csv"

# Create CSV header only once, so a restart on the same day keeps earlier samples
[ -s "$LOG_FILE" ] || echo "timestamp,cpu_time,mem_actual,mem_available,disk_rd,disk_wr,net_rx,net_tx" > "$LOG_FILE"

echo "Monitoring $VM_NAME (interval: ${INTERVAL}s)"
echo "Logging to: $LOG_FILE"
echo "Press Ctrl+C to stop"

while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    # Collect metrics
    cpu_time=$(virsh cpu-stats "$VM_NAME" --total 2>/dev/null | grep "cpu_time" | awk '{print $2}')
    mem_actual=$(virsh dommemstat "$VM_NAME" 2>/dev/null | grep "actual" | awk '{print $2}')
    mem_available=$(virsh dommemstat "$VM_NAME" 2>/dev/null | grep "available" | awk '{print $2}')

    # domblkstat and domifstat prefix each line with the device name,
    # so the value is the last field rather than the second
    disk_stats=$(virsh domblkstat "$VM_NAME" vda 2>/dev/null)
    disk_rd=$(echo "$disk_stats" | grep "rd_bytes" | awk '{print $NF}')
    disk_wr=$(echo "$disk_stats" | grep "wr_bytes" | awk '{print $NF}')

    net_stats=$(virsh domifstat "$VM_NAME" vnet0 2>/dev/null)
    net_rx=$(echo "$net_stats" | grep "rx_bytes" | awk '{print $NF}')
    net_tx=$(echo "$net_stats" | grep "tx_bytes" | awk '{print $NF}')

    # Log to CSV
    echo "$timestamp,$cpu_time,$mem_actual,$mem_available,$disk_rd,$disk_wr,$net_rx,$net_tx" >> "$LOG_FILE"

    # Display current values
    echo "$timestamp | CPU: $cpu_time | Mem: $mem_actual/$mem_available | Disk: R:$disk_rd W:$disk_wr | Net: RX:$net_rx TX:$net_tx"

    sleep "$INTERVAL"
done

Usage:

# Monitor single VM
./performance-monitor.sh web-server-01 60

# Monitor multiple VMs in background
for vm in web-01 web-02 db-01; do
    ./performance-monitor.sh $vm 60 &
done
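
The disk and network columns in the CSV are cumulative byte counters, so trends only become visible after taking the difference between consecutive rows. A minimal post-processing sketch, assuming the column order from the CSV header in the script; the function name `csv_deltas` is hypothetical:

```bash
#!/bin/bash
# csv_deltas FILE: print per-interval deltas of the cumulative byte
# counters (columns 5-8: disk_rd, disk_wr, net_rx, net_tx).
csv_deltas() {
    awk -F, '
        NR == 1 { print "timestamp,disk_rd_delta,disk_wr_delta,net_rx_delta,net_tx_delta"; next }
        NR > 2  { printf "%s,%.0f,%.0f,%.0f,%.0f\n", $1, $5 - rd, $6 - wr, $7 - rx, $8 - tx }
                { rd = $5; wr = $6; rx = $7; tx = $8 }   # remember the previous row
    ' "$1"
}

# Example:
# csv_deltas ./perf-logs/web-server-01-20260201.csv
```

A negative delta usually means the counters were reset, for example by a VM restart, and that row should be ignored.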

Alerting Best Practices

Alert Levels

CRITICAL (immediate action):

WARNING (investigate soon):

INFO (track trends):


Alerting Script

#!/bin/bash
# alert-handler.sh - Send alerts based on thresholds

VM_NAME=$1
ALERT_EMAIL="ops@example.com"

check_and_alert() {
    local vm=$1
    local metric=$2
    local value=$3
    local threshold=$4
    local level=$5

    if [ "$value" -gt "$threshold" ]; then
        message="[$level] $vm: $metric is $value (threshold: $threshold)"

        echo "$message"

        # Send email
        echo "$message" | mail -s "VM Alert: $vm" "$ALERT_EMAIL"

        # Send to syslog
        logger -t hyper2kvm-alert "$message"

        # Send to monitoring system (example: Prometheus Pushgateway)
        # curl -X POST http://pushgateway:9091/metrics/job/hyper2kvm ...
    fi
}

# Check CPU
cpu_usage=$(virsh cpu-stats "$VM_NAME" --total | grep "cpu_time" | awk '{print $2}')
# Convert to percentage and check
# ...

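# Check memory
# "actual" is the current balloon size in KiB; "Max memory" is the configured ceiling
mem_used=$(virsh dommemstat "$VM_NAME" 2>/dev/null | grep "actual" | awk '{print $2}')
mem_total=$(virsh dominfo "$VM_NAME" 2>/dev/null | grep "Max memory" | awk '{print $3}')

# Skip the check when either value is missing, to avoid a division error
if [ -n "$mem_used" ] && [ -n "$mem_total" ] && [ "$mem_total" -gt 0 ]; then
    mem_pct=$((mem_used * 100 / mem_total))
    check_and_alert "$VM_NAME" "Memory" "$mem_pct" 90 "WARNING"
fi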

# Check disk space (from inside VM)
# Check network connectivity
# ...
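
The `-gt` test in `check_and_alert` only accepts integers, which becomes a problem once a metric such as a CPU percentage carries decimals. A minimal sketch of a decimal-safe comparison that could stand in for that test; the function name `exceeds` is hypothetical:

```bash
#!/bin/bash
# exceeds VALUE THRESHOLD: succeed when VALUE is strictly greater than
# THRESHOLD. Both may contain decimals; an empty VALUE counts as 0.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example:
# if exceeds "$cpu_pct" 80; then echo "CPU above threshold"; fi
```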

Integration with Monitoring Systems

Prometheus Integration

Libvirt Exporter:

# Install libvirt_exporter
wget https://github.com/prometheus-community/libvirt_exporter/releases/download/v0.1.0/libvirt_exporter
chmod +x libvirt_exporter

# Run exporter
./libvirt_exporter &

# Metrics available at http://localhost:9177/metrics

Prometheus Config (prometheus.yml):

scrape_configs:
  - job_name: 'libvirt'
    static_configs:
      - targets: ['localhost:9177']

Example Queries:

# CPU usage per VM (the metric is a counter, so take its rate)
rate(libvirt_domain_info_cpu_time_seconds_total[5m])

# Memory usage
libvirt_domain_info_memory_usage_bytes

# Disk read/write
rate(libvirt_domain_block_stats_read_bytes_total[5m])
rate(libvirt_domain_block_stats_write_bytes_total[5m])
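
The same expressions can be run from a shell through the Prometheus HTTP API, which is handy for spot checks and for scripts. A minimal sketch: the function name `prom_query` is hypothetical, and the example assumes Prometheus listens on `localhost:9090`.

```bash
#!/bin/bash
# prom_query BASE_URL EXPR: run an instant query against /api/v1/query
# and print the JSON response.
prom_query() {
    local base_url="$1" expr="$2"
    # -G sends the URL-encoded data as a query string on a GET request.
    curl -sS -G --data-urlencode "query=${expr}" "${base_url}/api/v1/query"
}

# Example:
# prom_query http://localhost:9090 'rate(libvirt_domain_block_stats_read_bytes_total[5m])'
```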

Grafana Dashboards

Create Dashboard for migrated VMs:

{
  "dashboard": {
    "title": "Hyper2KVM Migrated VMs",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(libvirt_domain_info_cpu_time_seconds_total[5m])"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "libvirt_domain_info_memory_usage_bytes / libvirt_domain_info_maximum_memory_bytes * 100"
          }
        ]
      },
      {
        "title": "Disk I/O",
        "targets": [
          {
            "expr": "rate(libvirt_domain_block_stats_read_bytes_total[5m])"
          },
          {
            "expr": "rate(libvirt_domain_block_stats_write_bytes_total[5m])"
          }
        ]
      },
      {
        "title": "Network Traffic",
        "targets": [
          {
            "expr": "rate(libvirt_domain_interface_stats_receive_bytes_total[5m])"
          },
          {
            "expr": "rate(libvirt_domain_interface_stats_transmit_bytes_total[5m])"
          }
        ]
      }
    ]
  }
}

Troubleshooting with Monitoring

Common Scenarios

High CPU After Migration

Symptoms: CPU usage consistently > 80%

Investigation:

# Check per-vCPU stats
virsh vcpuinfo VM_NAME

# Check CPU pinning
virsh vcpupin VM_NAME

# Check host CPU allocation
lscpu

# Inside VM: attach to the console, log in, then check processes
virsh console VM_NAME
top

Possible Causes:


Memory Pressure

Symptoms: High swap usage, OOM errors

Investigation:

# Check memory stats
virsh dommemstat VM_NAME

# Check maximum and current (balloon) allocation
virsh dominfo VM_NAME

# Check host memory
free -h

# Inside VM: attach to the console, log in, then check memory
virsh console VM_NAME
free -h
vmstat 1 10

Possible Causes:


Poor Disk Performance

Symptoms: High I/O wait, slow disk operations

Investigation:

# Check disk stats
virsh domblkinfo VM_NAME vda
virsh domblkstat VM_NAME vda

# Check disk cache mode
virsh dumpxml VM_NAME | grep -A 5 "disk type"

# Check host disk performance
iostat -x 1 10

# Inside VM: attach to the console, log in, then check I/O
virsh console VM_NAME
iostat -x 1 10
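
Scanning the XML by eye gets tedious with several disks. The cache mode of each disk can be pulled out of the `<driver>` elements directly. A minimal sketch with a hypothetical function name `disk_cache_modes`; it prints nothing for disks that have no explicit `cache` attribute, in which case the hypervisor default applies.

```bash
#!/bin/bash
# disk_cache_modes: read domain XML on stdin and print each cache= value
# found on a <driver> element, one per line.
disk_cache_modes() {
    grep -o "<driver[^>]*cache=['\"][^'\"]*['\"]" |
        sed -E "s/.*cache=['\"]([^'\"]*)['\"]/\1/"
}

# Example:
# virsh dumpxml VM_NAME | disk_cache_modes
```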

Possible Causes:


Monitoring Tools & Integration

1. virt-top (Real-time VM Monitoring)

# Install
sudo apt-get install virt-top

# Run
virt-top

Features:


2. virt-manager (GUI Monitoring)

# Install
sudo apt-get install virt-manager

# Run
virt-manager

Features:


3. Prometheus + Grafana (Production Monitoring)

Why: Industry-standard, scalable, flexible

Setup:

# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*

# Configure (prometheus.yml)
# Run
./prometheus --config.file=prometheus.yml

# Install Grafana
sudo apt-get install -y grafana

# Start
sudo systemctl start grafana-server

# Access at http://localhost:3000

4. Nagios/Icinga (Traditional Monitoring)

Libvirt Plugin:

# Install check_libvirt plugin
wget https://github.com/vpenso/libvirt-shell-functions/raw/master/check_libvirt
chmod +x check_libvirt

# Check VM state
./check_libvirt -H localhost -v VM_NAME -w state

# Check CPU
./check_libvirt -H localhost -v VM_NAME -w cpu:80 -c cpu:90

Log Aggregation

Centralize logs for easier troubleshooting:

# Configure rsyslog to forward logs (@@ means TCP)
# tee runs under sudo, so the append works without a root shell
echo "*.* @@logserver.example.com:514" | sudo tee -a /etc/rsyslog.conf
sudo systemctl restart rsyslog

# Or use Elastic Stack (ELK)
# Filebeat -> Logstash -> Elasticsearch -> Kibana

Monitoring Checklist

Post-Migration (First 24 Hours)

Week 1 (Intensive Monitoring)

Ongoing (Steady State)


Additional Resources


Last Updated: February 2026
Documentation Version: 2.1.0