hypersdk

Export Resumption & Checkpoints

Status: ✅ Completed
Date: 2026-01-23

Overview

Export resumption with checkpoint-based recovery is now implemented for all providers. This feature allows exports to resume from where they left off after network failures, interruptions, or cancellations.

How It Works

Export state is tracked in JSON-based checkpoint files, described under Checkpoint File Format.

Usage

vSphere Example

opts := vsphere.ExportOptions{
    Format:               "ova",
    OutputPath:           "/backups",
    EnableCheckpoints:    true,   // Enable checkpoint support
    ResumeFromCheckpoint: true,   // Resume if checkpoint exists
    CheckpointInterval:   30 * time.Second, // Save every 30s (0 = after each file)
    // CheckpointPath:    "",      // Auto-generate path (optional override)
}

result, err := client.ExportOVF(ctx, "vm-path", opts)

AWS Example

opts := aws.ExportOptions{
    Format:               "vmdk",
    OutputPath:           "/exports",
    S3Bucket:             "my-backups",
    EnableCheckpoints:    true,
    ResumeFromCheckpoint: true,
    CheckpointInterval:   60 * time.Second, // Save every minute
}

result, err := client.ExportInstanceWithOptions(ctx, instanceID, opts)

Azure Example

opts := azure.ExportOptions{
    Format:               "vhd",
    OutputPath:           "/exports",
    EnableCheckpoints:    true,
    ResumeFromCheckpoint: true,
    // CheckpointInterval: 0 = save after each file (default)
}

result, err := client.ExportDiskWithOptions(ctx, diskName, opts)

GCP Example

opts := gcp.ExportOptions{
    Format:               "vmdk",
    OutputPath:           "/exports",
    GCSBucket:            "my-exports",
    EnableCheckpoints:    true,
    ResumeFromCheckpoint: true,
}

result, err := client.ExportDiskWithOptions(ctx, diskName, opts)

Hyper-V Example

opts := hyperv.ExportOptions{
    Format:               "vhdx",
    OutputPath:           "/exports",
    ExportType:           "vhd-only",
    EnableCheckpoints:    true,
    ResumeFromCheckpoint: true,
}

result, err := client.ExportVMWithOptions(ctx, vmName, opts)

Benefits

1. Resilience to Failures

Network Interruption:

Export started: 10 files, 100 GB total
Files 1-5 completed: 50 GB downloaded
Network failure ❌
Resume export: Files 6-10 remaining (50 GB)

2. Graceful Cancellation

User Cancels Export:

Export started: 20 files
Files 1-12 completed
User presses Ctrl+C
Later: Resume completes files 13-20
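A minimal sketch of how cancellation between files could be handled, assuming the export loop checks a `context.Context` before each download. The names `exportFiles` and `simulateCancel` are hypothetical and not part of the hypersdk API; `signal.NotifyContext` is the standard library call that turns Ctrl+C into a cancelled context.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
)

// exportFiles downloads files in order and stops as soon as ctx is cancelled.
// It returns the files that finished, which is what a checkpoint would record.
func exportFiles(ctx context.Context, files []string, download func(context.Context, string) error) ([]string, error) {
	var completed []string
	for _, f := range files {
		if err := ctx.Err(); err != nil {
			return completed, err // cancelled between files
		}
		if err := download(ctx, f); err != nil {
			return completed, err
		}
		completed = append(completed, f)
	}
	return completed, nil
}

// simulateCancel runs exportFiles with a fake downloader and cancels the
// context after cancelAfter files, standing in for a real Ctrl+C.
func simulateCancel(files []string, cancelAfter int) ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	n := 0
	return exportFiles(ctx, files, func(context.Context, string) error {
		n++
		if n == cancelAfter {
			cancel()
		}
		return nil
	})
}

func main() {
	// In a real CLI, pressing Ctrl+C cancels this context.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	files := []string{"disk-0.vmdk", "disk-1.vmdk", "disk-2.vmdk"}
	done, err := exportFiles(ctx, files, func(context.Context, string) error { return nil })
	fmt.Println(len(done), err)

	done, err = simulateCancel(files, 2)
	fmt.Println(len(done), err)
}
```

The list returned on cancellation is exactly the set a later resume can skip.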

3. Cost Savings

Cloud Provider Egress:

4. Multi-Day Exports

Large VMs:

Day 1: Download 100 GB (limit hit, cancel)
Day 2: Resume, download next 100 GB
Day 3: Complete remaining 50 GB
Total: 250 GB in manageable chunks

Checkpoint File Format

Location

Default Path:

{outputDir}/.{vmName}.checkpoint

Example:

/backups/.web-server-01.checkpoint

Custom Path:

opts.CheckpointPath = "/custom/path/my-checkpoint.json"
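The path rules above can be sketched as a single function. This is an illustration only: `resolveCheckpointPath` is a hypothetical helper, not the library's `common.GetCheckpointPath`, and it assumes the VM name contains no path separators.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveCheckpointPath returns the custom path when one is set, otherwise
// the default {outputDir}/.{vmName}.checkpoint location.
func resolveCheckpointPath(custom, outputDir, vmName string) string {
	if custom != "" {
		return custom
	}
	return filepath.Join(outputDir, "."+vmName+".checkpoint")
}

func main() {
	fmt.Println(resolveCheckpointPath("", "/backups", "web-server-01"))
	fmt.Println(resolveCheckpointPath("/custom/path/my-checkpoint.json", "/backups", "web-server-01"))
}
```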

Structure

{
  "version": "1.0",
  "vm_name": "web-server-01",
  "provider": "vsphere",
  "export_format": "ova",
  "output_path": "/backups",
  "created_at": "2026-01-23T10:00:00Z",
  "updated_at": "2026-01-23T10:15:30Z",
  "files": [
    {
      "path": "disk-0.vmdk",
      "url": "https://vcenter/nfc/...",
      "total_size": 53687091200,
      "downloaded_size": 53687091200,
      "checksum": "a1b2c3d4...",
      "status": "completed",
      "last_modified": "2026-01-23T10:10:00Z",
      "retry_count": 0
    },
    {
      "path": "disk-1.vmdk",
      "url": "https://vcenter/nfc/...",
      "total_size": 21474836480,
      "downloaded_size": 0,
      "checksum": "",
      "status": "pending",
      "last_modified": "2026-01-23T10:10:00Z",
      "retry_count": 0
    }
  ],
  "metadata": {}
}
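For tooling that reads checkpoint files directly, the JSON above maps onto Go structs roughly as follows. Only the JSON keys come from the format shown; the Go field names not used elsewhere in this document, the `parseCheckpoint` helper, and the byte-weighted `Progress` calculation are assumptions made for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// FileCheckpoint describes one file entry in the "files" array.
type FileCheckpoint struct {
	Path           string    `json:"path"`
	URL            string    `json:"url"`
	TotalSize      int64     `json:"total_size"`
	DownloadedSize int64     `json:"downloaded_size"`
	Checksum       string    `json:"checksum"`
	Status         string    `json:"status"`
	LastModified   time.Time `json:"last_modified"`
	RetryCount     int       `json:"retry_count"`
}

// Checkpoint mirrors the top-level JSON object.
type Checkpoint struct {
	Version      string                 `json:"version"`
	VMName       string                 `json:"vm_name"`
	Provider     string                 `json:"provider"`
	ExportFormat string                 `json:"export_format"`
	OutputPath   string                 `json:"output_path"`
	CreatedAt    time.Time              `json:"created_at"`
	UpdatedAt    time.Time              `json:"updated_at"`
	Files        []FileCheckpoint       `json:"files"`
	Metadata     map[string]interface{} `json:"metadata"`
}

// parseCheckpoint decodes checkpoint JSON into a Checkpoint.
func parseCheckpoint(data []byte) (*Checkpoint, error) {
	var cp Checkpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, fmt.Errorf("decode checkpoint: %w", err)
	}
	return &cp, nil
}

// Progress returns downloaded bytes divided by total bytes, in [0, 1].
func (c *Checkpoint) Progress() float64 {
	var total, done int64
	for _, f := range c.Files {
		total += f.TotalSize
		done += f.DownloadedSize
	}
	if total == 0 {
		return 0
	}
	return float64(done) / float64(total)
}

// sampleCheckpoint is a trimmed copy of the example above.
const sampleCheckpoint = `{
  "version": "1.0",
  "vm_name": "web-server-01",
  "provider": "vsphere",
  "created_at": "2026-01-23T10:00:00Z",
  "updated_at": "2026-01-23T10:15:30Z",
  "files": [
    {"path": "disk-0.vmdk", "total_size": 53687091200, "downloaded_size": 53687091200, "status": "completed"},
    {"path": "disk-1.vmdk", "total_size": 21474836480, "downloaded_size": 0, "status": "pending"}
  ],
  "metadata": {}
}`

func main() {
	cp, err := parseCheckpoint([]byte(sampleCheckpoint))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d files, %.1f%% of bytes downloaded\n", cp.VMName, len(cp.Files), cp.Progress()*100)
}
```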

File Status Values

| Status | Description |
|--------|-------------|
| pending | File queued for download |
| downloading | Download in progress |
| completed | Download finished successfully |
| failed | Download failed (will retry on resume) |

Configuration Options

EnableCheckpoints

Type: bool
Default: false

Enable checkpoint-based resumption.

opts.EnableCheckpoints = true

ResumeFromCheckpoint

Type: bool
Default: false

Resume from an existing checkpoint if one is found.

Important: EnableCheckpoints must also be true.

opts.EnableCheckpoints = true
opts.ResumeFromCheckpoint = true

CheckpointInterval

Type: time.Duration
Default: 0 (save after each file)

How often to save the checkpoint during an export.

Options:

// Save after each file completes (most resilient, small overhead)
opts.CheckpointInterval = 0

// Save every 30 seconds (balanced)
opts.CheckpointInterval = 30 * time.Second

// Save every 5 minutes (minimal overhead, less frequent saves)
opts.CheckpointInterval = 5 * time.Minute

Trade-offs:

| Interval | Overhead | Recovery Granularity |
|----------|----------|----------------------|
| 0 (per file) | Low | Best - resume at exact file |
| 30 seconds | Very low | Good - max 30s of re-download |
| 5 minutes | Minimal | Fair - may re-download one file |
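The interval semantics can be illustrated with a small decision function. This is a sketch under stated assumptions: `saveDue` and the three named constants are hypothetical, and the actual exporter may combine the two triggers differently.

```go
package main

import (
	"fmt"
	"time"
)

// Example intervals matching the options above.
const perFile time.Duration = 0
const balanced time.Duration = 30 * time.Second
const relaxed time.Duration = 5 * time.Minute

// saveDue reports whether a checkpoint should be written now.
// An interval of 0 means "save whenever a file has just completed";
// a positive interval means "save once that much time has passed".
func saveDue(interval, sinceLastSave time.Duration, fileJustCompleted bool) bool {
	if interval <= 0 {
		return fileJustCompleted
	}
	return sinceLastSave >= interval
}

func main() {
	fmt.Println(saveDue(perFile, 2*time.Second, true))    // file finished, per-file mode
	fmt.Println(saveDue(balanced, 10*time.Second, true))  // interval not yet elapsed
	fmt.Println(saveDue(relaxed, 6*time.Minute, false))   // interval elapsed
}
```

With a positive interval, a file that completes shortly before a crash may not be recorded yet, which is why longer intervals can cost a re-download.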

CheckpointPath

Type: string Default: "" (auto-generate)

Custom checkpoint file path.

Auto-generated:

opts.CheckpointPath = "" // Uses: {outputDir}/.{vmName}.checkpoint

Custom:

opts.CheckpointPath = "/custom/checkpoints/export-2026-01-23.json"

Resume Behavior

Resume Decision Tree

Start export
    |
    v
EnableCheckpoints? ──NO──> Normal export (no checkpoints)
    |
    YES
    v
ResumeFromCheckpoint? ──NO──> Create new checkpoint
    |
    YES
    v
Checkpoint exists? ──NO──> Create new checkpoint
    |
    YES
    v
Load checkpoint ──FAIL──> Warn + create new checkpoint
    |
    OK
    v
Resume export (skip completed files)
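The decision tree can be written as one function, which makes the precedence explicit: `EnableCheckpoints` is checked first, so `ResumeFromCheckpoint` alone has no effect. The function and constant names below are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// startAction names the outcome chosen at export start.
type startAction string

const (
	normalExport  startAction = "normal export (no checkpoints)"
	newCheckpoint startAction = "create new checkpoint"
	warnAndNew    startAction = "warn + create new checkpoint"
	resumeExport  startAction = "resume export (skip completed files)"
)

// decideStart follows the tree top to bottom. loadOK is only consulted
// when a checkpoint file exists and resume was requested.
func decideStart(enable, resume, exists, loadOK bool) startAction {
	switch {
	case !enable:
		return normalExport
	case !resume, !exists:
		return newCheckpoint
	case !loadOK:
		return warnAndNew
	}
	return resumeExport
}

func main() {
	fmt.Println(decideStart(false, true, true, true))
	fmt.Println(decideStart(true, false, true, true))
	fmt.Println(decideStart(true, true, true, false))
	fmt.Println(decideStart(true, true, true, true))
}
```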

File Validation on Resume

For each file marked “completed” in checkpoint:

1. Check if file exists on disk
2. Compare file size with expected size
3. If match: Skip download
4. If mismatch: Re-download file

Example:

disk-0.vmdk: checkpoint says 50 GB completed
    -> File exists: 50 GB ✓
    -> Skip download

disk-1.vmdk: checkpoint says 20 GB completed
    -> File exists: 15 GB ✗ (incomplete)
    -> Re-download file

disk-2.vmdk: checkpoint says pending
    -> Download file
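The validation steps can be sketched with `os.Stat`. The helper names are hypothetical; the check covers only existence and size, as described above, and does not verify content.

```go
package main

import (
	"fmt"
	"os"
)

// needsDownload reports whether a file must be fetched (again).
// Anything not marked "completed" is downloaded; a "completed" file is
// skipped only if it exists as a regular file with the expected size.
func needsDownload(path, status string, expectedSize int64) bool {
	if status != "completed" {
		return true
	}
	info, err := os.Stat(path)
	if err != nil || !info.Mode().IsRegular() {
		return true // missing, unreadable, or not a regular file
	}
	return info.Size() != expectedSize
}

// makeFile creates a temporary file of the given size for the demo.
func makeFile(size int64) (string, error) {
	f, err := os.CreateTemp("", "disk-*.vmdk")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if err := f.Truncate(size); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	path, err := makeFile(15)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)

	fmt.Println("size matches, re-download:", needsDownload(path, "completed", 15))
	fmt.Println("size differs, re-download:", needsDownload(path, "completed", 20))
	fmt.Println("pending, download:", needsDownload(path, "pending", 15))
}
```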

Examples

Basic Resume After Failure

First Run:

$ hyperexport vsphere export \
    --vm /DC/vm/web-server \
    --output /backups \
    --format ova \
    --enable-checkpoints \
    --resume

# Downloads 5/10 files, then network fails
# Checkpoint saved: 5 files completed

Second Run (Resume):

$ hyperexport vsphere export \
    --vm /DC/vm/web-server \
    --output /backups \
    --format ova \
    --enable-checkpoints \
    --resume

# Loads checkpoint
# Skips 5 completed files
# Downloads remaining 5 files
# Deletes checkpoint on success

Manual Checkpoint Management

Check Checkpoint Status:

checkpointPath := common.GetCheckpointPath("/backups", "web-server-01")

if common.CheckpointExists(checkpointPath) {
    checkpoint, err := common.LoadCheckpoint(checkpointPath)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Progress: %.1f%%\n", checkpoint.GetProgress() * 100)
    fmt.Printf("Files: %d\n", len(checkpoint.Files))

    for _, file := range checkpoint.Files {
        fmt.Printf("  %s: %s (%d/%d bytes)\n",
            file.Path, file.Status, file.DownloadedSize, file.TotalSize)
    }
}

Delete Stale Checkpoint:

checkpointPath := common.GetCheckpointPath("/backups", "web-server-01")
if err := common.DeleteCheckpoint(checkpointPath); err != nil {
    log.Printf("failed to delete checkpoint: %v", err)
}

Time-Based Checkpoint Intervals

Frequent Saves (Small Files):

// For many small files, save after each
opts.CheckpointInterval = 0

Balanced (Mixed Sizes):

// For mixed file sizes, save every 30s
opts.CheckpointInterval = 30 * time.Second

Infrequent Saves (Large Files):

// For few large files, save every 5 minutes
opts.CheckpointInterval = 5 * time.Minute

Monitoring

TUI Display

The TUI shows resume status:

╭─────────────────────────────────────────╮
│ 🚀 Export Progress                      │
├─────────────────────────────────────────┤
│ Total: 1  ✓ 0  ⏳ 1                     │
│                                          │
│ ⬇ web-server-01                         │
│   ████████████░░░░░░░░░░░░░░ 50%       │
│   5.0 GB / 10.0 GB  •  50 MB/s  •  1m   │
│   File 6/10: disk-5.vmdk                │
│   [Resumed from checkpoint]             │
╰─────────────────────────────────────────╯

Logs

Checkpoint Created:

INFO  checkpoint created vm=web-server-01 path=/backups/.web-server-01.checkpoint

Resuming:

INFO  resuming from checkpoint progress=0.5 vm=web-server-01
INFO  skipping already completed file file=disk-0.vmdk
INFO  skipping already completed file file=disk-1.vmdk
...

Checkpoint Saved:

DEBUG checkpoint saved progress=0.75

Checkpoint Deleted:

INFO  checkpoint deleted after successful export

Troubleshooting

Checkpoint Load Fails

Symptom: “Failed to load checkpoint, starting fresh”

Causes:

  1. Corrupted checkpoint file
  2. Incompatible checkpoint version
  3. Permission issues

Solution:
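To tell the listed causes apart, a small diagnostic along the following lines may help. It is a sketch using only the standard library: `diagnose` and `diagnoseContent` are hypothetical names, and treating any version other than "1.0" (the value in the sample checkpoint) as incompatible is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// supportedVersion is taken from the sample checkpoint; an assumption.
const supportedVersion = "1.0"

// diagnose classifies why a checkpoint at path might fail to load.
func diagnose(path string) string {
	data, err := os.ReadFile(path)
	switch {
	case errors.Is(err, os.ErrNotExist):
		return "missing"
	case errors.Is(err, os.ErrPermission):
		return "permission"
	case err != nil:
		return "unreadable"
	}
	var header struct {
		Version string `json:"version"`
	}
	if err := json.Unmarshal(data, &header); err != nil {
		return "corrupt"
	}
	if header.Version != supportedVersion {
		return "version"
	}
	return "ok"
}

// diagnoseContent writes content to a temporary file and diagnoses it.
func diagnoseContent(content string) string {
	f, err := os.CreateTemp("", "*.checkpoint")
	if err != nil {
		return "setup failed"
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "setup failed"
	}
	return diagnose(f.Name())
}

func main() {
	fmt.Println(diagnoseContent(`{"version": "1.0"}`))
	fmt.Println(diagnoseContent(`{"version": "9.9"}`))
	fmt.Println(diagnoseContent(`{truncated`))
}
```

If the file turns out to be corrupt or incompatible, deleting it (see Delete Stale Checkpoint) lets the next export start with a fresh checkpoint.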

File Re-downloaded Despite Completion

Symptom: Completed file is re-downloaded

Causes:

  1. File size mismatch (partial download)
  2. File was deleted
  3. File was modified

Expected Behavior: The file failed validation on resume, so re-downloading it is correct.

Checkpoint Not Saving

Symptom: No checkpoint file created

Causes:

  1. EnableCheckpoints = false
  2. Permission issues in output directory
  3. Disk full

Solution:

// Verify checkpoint is enabled
opts.EnableCheckpoints = true

// Check output directory permissions
// Check disk space

Resume Not Working

Symptom: Export starts from beginning

Causes:

  1. ResumeFromCheckpoint = false
  2. Checkpoint file doesn’t exist
  3. Different output path or VM name

Solution:

// Enable resume
opts.EnableCheckpoints = true
opts.ResumeFromCheckpoint = true

// Verify checkpoint path matches
// Use same output directory and VM name

Implementation Details

Checkpoint Lifecycle

1. Export Start
   ├─> EnableCheckpoints? → Create checkpoint
   └─> ResumeFromCheckpoint? → Load checkpoint

2. Download Loop
   ├─> For each file:
   │   ├─> Check if completed in checkpoint
   │   ├─> Skip if completed and valid
   │   ├─> Download if pending/failed
   │   └─> Update checkpoint on completion
   └─> Save checkpoint (per interval)

3. Export Complete
   └─> Delete checkpoint file
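The lifecycle can be sketched in memory, without touching disk. Everything here is hypothetical (`fileState`, `runExport`, `errFetch`), on-disk validation of completed files is omitted, and continuing past a failed file rather than stopping is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errFetch is a stand-in download error for the demo.
var errFetch = errors.New("fetch failed")

// fileState is a reduced checkpoint entry: just a path and a status.
type fileState struct {
	Path   string
	Status string
}

// runExport skips completed files, fetches the rest, and updates each
// status in place. The boolean reports whether every file is now
// completed, which is the point at which the checkpoint can be deleted.
func runExport(files []fileState, fetch func(path string) error) ([]string, bool) {
	var fetched []string
	allDone := true
	for i := range files {
		f := &files[i]
		if f.Status == "completed" {
			continue
		}
		f.Status = "downloading"
		if err := fetch(f.Path); err != nil {
			f.Status = "failed" // retried on the next resume
			allDone = false
			continue
		}
		f.Status = "completed"
		fetched = append(fetched, f.Path)
	}
	return fetched, allDone
}

func main() {
	files := []fileState{
		{Path: "disk-0.vmdk", Status: "completed"},
		{Path: "disk-1.vmdk", Status: "pending"},
		{Path: "disk-2.vmdk", Status: "failed"},
	}
	fetched, done := runExport(files, func(string) error { return nil })
	fmt.Println(fetched, "delete checkpoint:", done)
}
```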

Concurrency Safety

All checkpoint operations are protected by a mutex:

var checkpointMux sync.Mutex

// Before checkpoint access
checkpointMux.Lock()
checkpoint.UpdateFileProgress(path, size, "completed")
checkpointMux.Unlock()
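A common alternative to a package-level mutex is to keep the lock next to the data it guards and release it with `defer`. The sketch below shows that pattern with several goroutines recording completions at once; `progressTracker` and its methods are hypothetical and only illustrate the locking discipline.

```go
package main

import (
	"fmt"
	"sync"
)

// progressTracker guards its map with its own mutex.
type progressTracker struct {
	mu        sync.Mutex
	completed map[string]int64 // path -> bytes downloaded
}

func newProgressTracker() *progressTracker {
	return &progressTracker{completed: make(map[string]int64)}
}

// markCompleted records a finished file; safe for concurrent use.
func (t *progressTracker) markCompleted(path string, size int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.completed[path] = size
}

// count returns the number of completed files; safe for concurrent use.
func (t *progressTracker) count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.completed)
}

// recordConcurrently has each worker record one distinct file and
// returns how many entries ended up in the tracker.
func recordConcurrently(workers int) int {
	t := newProgressTracker()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			t.markCompleted(fmt.Sprintf("disk-%d.vmdk", i), 1024)
		}(i)
	}
	wg.Wait()
	return t.count()
}

func main() {
	fmt.Println(recordConcurrently(8))
}
```

Without the lock, concurrent writes to the map would be a data race; with it, every update is recorded.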

Atomic Writes

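Checkpoint saves are atomic:

1. Write to temporary file: .checkpoint.tmp
2. Rename to final file: .checkpoint (atomic operation)
3. If the write or rename fails, delete the temporary file

Because the rename either fully succeeds or leaves the previous checkpoint untouched, a reader never sees a half-written checkpoint file.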
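The write-then-rename pattern can be sketched with the standard library. This is an illustration, not the hypersdk implementation: `saveAtomic` and `saveTwice` are hypothetical, the temporary file gets a random name from `os.CreateTemp` rather than a fixed `.checkpoint.tmp`, and the atomicity of `os.Rename` is assumed for a POSIX filesystem with both paths in the same directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// saveAtomic writes data to a temporary file next to path, flushes it,
// and renames it over path so the final file is never half-written.
func saveAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Cleans up after a failure; after a successful rename the temporary
	// name no longer exists and the error from Remove is ignored.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// saveTwice saves two versions in a scratch directory and returns the
// final content plus the number of files left in the directory.
func saveTwice(first, second string) (string, int, error) {
	dir, err := os.MkdirTemp("", "checkpoint-demo")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".web-server-01.checkpoint")
	for _, s := range []string{first, second} {
		if err := saveAtomic(path, []byte(s)); err != nil {
			return "", 0, err
		}
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", 0, err
	}
	entries, err := os.ReadDir(dir)
	return string(data), len(entries), err
}

func main() {
	content, files, err := saveTwice(`{"progress":0.5}`, `{"progress":0.75}`)
	fmt.Println(content, files, err)
}
```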

Performance Impact

Overhead:

With CheckpointInterval = 0:

With CheckpointInterval = 30s:

Security Considerations

Checkpoint File Permissions

Checkpoint files contain:

Default Permissions:

0644 (rw-r--r--)

For Sensitive Environments:

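// After creating checkpoint, restrict access to the owner
if err := os.Chmod(checkpointPath, 0600); err != nil { // rw-------
    log.Printf("failed to restrict checkpoint permissions: %v", err)
}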
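Tools that write their own checkpoint-like files can avoid the window between creation and `Chmod` by creating the file with restrictive permissions in the first place. The sketch below assumes a Unix-like system; `writePrivate` and `privateMode` are hypothetical helpers. Note that the mode passed to `os.WriteFile` applies only when the file is created, so an existing file keeps its current permissions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePrivate creates path readable and writable by the owner only.
func writePrivate(path string, data []byte) error {
	return os.WriteFile(path, data, 0600)
}

// privateMode writes a file in a scratch directory and returns the
// permission bits it ended up with.
func privateMode() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "checkpoint-perm")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".web-server-01.checkpoint")
	if err := writePrivate(path, []byte("{}")); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Mode().Perm(), nil
}

func main() {
	mode, err := privateMode()
	fmt.Println(mode, err)
}
```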

Checkpoint Cleanup

Successful Export:

Failed Export:

Cleanup Script:

# Remove checkpoints older than 7 days
find /backups -name ".*.checkpoint" -mtime +7 -delete

Future Enhancements

1. Partial File Resumption

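Currently, resume works at file granularity: completed files are skipped and incomplete files are downloaded again from the start. A future version could resume a partial download with an HTTP Range header: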

// Resume partial file download with HTTP Range headers
Range: bytes=50000000-
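A sketch of what building such a request could look like with `net/http`. Since this feature is not implemented, all names here are hypothetical, and the URL is a placeholder. A client should append to the partial file only when the server answers 206 Partial Content; a 200 response means the range was ignored and the body starts from byte 0.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// newRangeRequest builds a GET request that asks the server to start at
// offset. An offset of 0 requests the whole file without a Range header.
func newRangeRequest(ctx context.Context, url string, offset int64) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if offset > 0 {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	}
	return req, nil
}

// resumable reports whether the server honoured the Range header.
func resumable(statusCode int) bool {
	return statusCode == http.StatusPartialContent
}

// rangeHeaderFor returns the Range header produced for offset, using a
// placeholder URL; no network traffic is involved.
func rangeHeaderFor(offset int64) string {
	req, err := newRangeRequest(context.Background(), "https://example.invalid/disk-0.vmdk", offset)
	if err != nil {
		return ""
	}
	return req.Header.Get("Range")
}

func main() {
	fmt.Println(rangeHeaderFor(50000000))
	fmt.Println(resumable(http.StatusPartialContent), resumable(http.StatusOK))
}
```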

2. Checksum Validation

Add integrity checks on resume:

type FileCheckpoint struct {
    // ...
    PartialChecksum string // SHA-256 of downloaded bytes
}
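Computing such a checksum for the bytes already on disk could look like the sketch below. It streams the file through SHA-256 so large disks are not loaded into memory. `fileSHA256` and `sha256OfContent` are hypothetical helpers, not existing hypersdk functions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA256 returns the hex-encoded SHA-256 of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sha256OfContent writes content to a temporary file and hashes it.
func sha256OfContent(content string) (string, error) {
	f, err := os.CreateTemp("", "partial-*.vmdk")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(content); err != nil {
		f.Close()
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}
	return fileSHA256(f.Name())
}

func main() {
	sum, err := sha256OfContent("abc")
	fmt.Println(sum, err)
}
```

On resume, the stored value would be compared with a freshly computed one before any further bytes are appended.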

3. Compression

Compress large checkpoint files:

opts.CompressCheckpoint = true // Use gzip

4. Cloud Storage

Store checkpoints in cloud storage:

opts.CheckpointBackend = "s3://bucket/checkpoints/"

5. Multi-Export Coordination

Share checkpoint across multiple export workers:

// Distributed checkpoint for parallel exports
opts.SharedCheckpoint = true

Conclusion

Export resumption is production-ready for all providers.

Key Features:

Use Cases:

Best Practices:


Next: Advanced Features (Manifest Generation, Auto-Conversion, etc.)