Distributed Block Storage for Kubernetes

Replicated and erasure-coded persistent volumes with topology-aware placement, self-healing, and persistent brick-local storage. Built in Rust for performance and reliability.

Features

Flexible Protection Policies

Choose replicated or erasure-coded volumes. The policy is stored with the volume and exposed through the admin API, CLI, and operator-facing docs.

Topology-Aware Placement

The TopoHash algorithm spreads data across failure domains (datacenter, rack, and host) so that, when enough distinct domains are available, a single rack failure stays within the volume's fault tolerance.

Self-Healing

Automatic brick failure detection via heartbeats, shard rebuild planning, and volume recovery. Degraded volumes are restored without operator intervention.

Operational Controls

Alert on degraded bricks and volumes, inspect placement dependencies, drain bricks deterministically with migration tracking, plan and apply rebalancing after topology changes, and guard removals with impact checks.

Kubernetes Native

Ships a CSI driver, a Helm chart, and CRDs for clusters, bricks, and volumes, with active operator reconciliation and node-local NVMe/TCP export reconciliation. Provision volumes declaratively or with StorageClasses.

Persistent Brick Storage

Bricks keep a stable local identity and store shard data in a persistent device-backed log, so restarts preserve both on-disk data and brick UUIDs.

Architecture

A layered system from Kubernetes workloads down to brick-local storage, with replicated or erasure-coded protection, topology-aware placement, and operator-visible recovery workflows.

[Figure: Hyperblock system architecture, showing the flow from Kubernetes workloads through the CSI driver and metadata service to brick servers.]

Metadata Service

Cluster brain — manages volumes, placement maps, health monitoring, and rebuild orchestration. Port 9200.

Brick Servers

Chunk storage with heartbeat and auto-registration. Each brick manages local NVMe/SSD storage. Port 9100.

CSI Driver

Kubernetes CSI integration for PV/PVC provisioning. Creates volumes via metadata, stages and publishes on nodes. Port 9300.

Export Runtime

`hyperblock-nbd` provides the current compatibility bridge, while `hyperblock-nvmf` and the reference SPDK target-manager path materialize node-local NVMe/TCP exports for future native serving flows.

Operator

Watches HyperblockCluster, HyperblockBrick, and HyperblockVolume CRDs. Reconciles metadata StatefulSets, brick/CSI DaemonSets, brick registration, and volume provisioning.

CLI

Admin command-line interface for volume, brick, and cluster operations.

TopoHash

Hyperblock's topology-aware placement engine turns a volume ID, a stripe index, and a simple placement rule into a deterministic set of brick targets and a persisted placement map.

Input 1

Topology Tree

ClusterTopology is a real hierarchy: root -> datacenter -> rack -> host -> brick. Every placement decision starts from that tree.

Input 2

Placement Rule

PlacementRule is a list of level/count/mode steps such as "pick 6 distinct racks first, then fall back to any remaining unique bricks."

Input 3

Placement Key

compute_pg_id() hashes volume_id || stripe_index with xxh3, producing the deterministic key that drives domain and brick ranking.

Output

Placement Map

The metadata service stores an ordered set of brick targets in a PlacementMap, bumps the placement version, and copies that version onto the volume as placement_epoch.
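The placement key can be sketched as follows. This is a minimal illustration, not the project's code: the signature of compute_pg_id() is assumed, and FNV-1a is swapped in for xxh3 only so the example needs no external crates.

```rust
/// Stand-in 64-bit hash (FNV-1a). The text above names xxh3; FNV-1a is used
/// here only to keep the sketch dependency-free.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Assumed signature: hash volume_id || stripe_index into a 64-bit placement key.
fn compute_pg_id(volume_id: &str, stripe_index: u64) -> u64 {
    let mut buf = Vec::with_capacity(volume_id.len() + 8);
    buf.extend_from_slice(volume_id.as_bytes());
    // A fixed-width little-endian suffix keeps the concatenation unambiguous.
    buf.extend_from_slice(&stripe_index.to_le_bytes());
    fnv1a64(&buf)
}
```

The same volume_id and stripe_index always yield the same key, which is what lets every later ranking step be deterministic.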

Selection Flow

Placement key: pg_id = xxh3(volume_id || stripe_index)
Rule: rack distinct, count = 6
Goal: unique bricks, domain-first, deterministic order

Candidate domains at rack level:
  rack-a: b1, b2
  rack-b: b3, b4
  rack-c: b5, b6
  rack-d: b7, b8

TopoHash ranks all racks once, then ranks all leaf bricks inside each chosen rack. The first unique brick in each ranked domain wins that slot.

Resulting brick order
b3 b6 b7 b2 b5 b8

The first four picks satisfy rack spread. The last two come from the cluster-wide unique-brick fallback because only four racks exist but six targets are required.
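The selection flow above can be approximated in a few lines. This is a sketch under stated assumptions: the topology is flattened to racks and bricks, the names mix, score, rank, and select are hypothetical, and a splitmix64-style mixer stands in for xxh3. It will not reproduce the exact brick order shown above, only the same shape of result.

```rust
use std::collections::HashSet;

/// Stand-in deterministic mixer (splitmix64 finalizer), not the real xxh3.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9e37_79b9_7f4a_7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// Score a named node (rack or brick) against the placement key.
fn score(pg_id: u64, name: &str) -> u64 {
    name.bytes().fold(mix(pg_id), |acc, b| mix(acc ^ u64::from(b)))
}

/// Sort names by descending score; ties break on the name itself.
fn rank<'a>(pg_id: u64, names: &[&'a str]) -> Vec<&'a str> {
    let mut ranked = names.to_vec();
    ranked.sort_by(|a, b| score(pg_id, b).cmp(&score(pg_id, a)).then(a.cmp(b)));
    ranked
}

/// Pick `count` unique bricks: one per ranked rack first, then a cluster-wide fallback.
fn select<'a>(pg_id: u64, racks: &[(&'a str, Vec<&'a str>)], count: usize) -> Vec<&'a str> {
    let rack_names: Vec<&str> = racks.iter().map(|(name, _)| *name).collect();
    let mut used = HashSet::new();
    let mut picks = Vec::new();

    // Domain-first pass: the first unused brick in each ranked rack wins a slot.
    for rack in rank(pg_id, &rack_names) {
        if picks.len() == count {
            break;
        }
        let bricks = &racks.iter().find(|(name, _)| *name == rack).unwrap().1;
        if let Some(brick) = rank(pg_id, bricks).into_iter().find(|b| !used.contains(b)) {
            used.insert(brick);
            picks.push(brick);
        }
    }

    // Fallback pass: rank every brick and fill the remaining slots with unused ones.
    let all: Vec<&str> = racks.iter().flat_map(|(_, b)| b.iter().copied()).collect();
    for brick in rank(pg_id, &all) {
        if picks.len() == count {
            break;
        }
        if used.insert(brick) {
            picks.push(brick);
        }
    }
    picks
}
```

With four racks of two bricks each and count = 6, the first pass yields four bricks from four different racks and the fallback supplies the last two, mirroring the example above.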

Block To Placement Mapping

volume offset -> chunk index -> stripe index -> pg_id -> PlacementMap
Example chunk ranges: 0-127 MiB, 128-255 MiB, 256-383 MiB, 384-511 MiB
Logical block I/O

split_io() converts a byte range into one or more ChunkId { volume_id, index } operations using the volume's logical chunk_size.
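A rough sketch of that splitting step follows. Only ChunkId { volume_id, index } and chunk_size come from the description above; the ChunkOp struct and the split_io() signature are assumptions made for illustration.

```rust
/// Identifies one logical chunk of a volume (fields as named in the text above).
#[derive(Debug, Clone, PartialEq)]
struct ChunkId {
    volume_id: String,
    index: u64,
}

/// One chunk-sized piece of a larger I/O. Hypothetical type.
#[derive(Debug, PartialEq)]
struct ChunkOp {
    chunk: ChunkId,
    offset_in_chunk: u64,
    len: u64,
}

/// Assumed signature: split the byte range [offset, offset + len) on chunk boundaries.
fn split_io(volume_id: &str, offset: u64, len: u64, chunk_size: u64) -> Vec<ChunkOp> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut ops = Vec::new();
    let mut pos = offset;
    let end = offset + len;
    while pos < end {
        let index = pos / chunk_size;
        let offset_in_chunk = pos % chunk_size;
        // Stop at whichever comes first: the chunk boundary or the end of the I/O.
        let piece = (chunk_size - offset_in_chunk).min(end - pos);
        ops.push(ChunkOp {
            chunk: ChunkId { volume_id: volume_id.to_string(), index },
            offset_in_chunk,
            len: piece,
        });
        pos += piece;
    }
    ops
}
```

For example, with 128 MiB chunks a 2 MiB write at offset 127 MiB becomes two operations: the last 1 MiB of chunk 0 and the first 1 MiB of chunk 1.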

Placement key

The metadata layer hashes volume_id plus the chunk or stripe index to get the placement-group key that TopoHash evaluates.

Replicated volume

One logical chunk location stores multiple brick IDs.

ChunkLocation {
  chunk_id: 3,
  brick_ids: [b3, b6, b7]
}

Erasure-coded volume

One stripe expands into many shard locations, each with one brick target.

stripe 3
  shard0 -> b3
  shard1 -> b6
  shard2 -> b7
  shard3 -> b2
  parity0 -> b5
  parity1 -> b8

1

Build or update topology

Bricks are inserted under datacenter, rack, and host nodes. The tree stores weights and the physical path to every brick.

2

Hash the placement key

compute_pg_id() creates a 64-bit placement key from volume_id and stripe_index. Same inputs always produce the same key.

3

Rank candidate domains

For each rule step, TopoHash collects every node at the requested level and deterministically ranks them from the placement key.

4

Rank bricks within each domain

Leaf bricks inside the chosen domain are ranked in deterministic order. The first brick not already used wins that slot.

5

Fallback if the ideal spread is impossible

If the rule cannot produce enough distinct domains, TopoHash ranks the entire brick set and fills the remaining slots with any unique bricks.

6

Persist the result

The metadata service writes a PlacementMap and increments the placement version. Clients then consume that map for later reads, writes, drains, and rebalance plans.

Client write semantics

  1. The client resolves the volume, reads or caches its PlacementMap, and splits a block I/O into one or more chunk operations.
  2. For replicated volumes, the client looks up the matching ChunkGroup, takes the first logical ChunkLocation, and fans the same chunk out to every brick in that location's brick_ids.
  3. The replicated write succeeds when at least write_quorum replicas acknowledge the chunk. Failed bricks are marked failed locally so later operations prefer healthier targets.
  4. For erasure-coded volumes, the client encodes the chunk into data and parity shards, then writes one shard to each placement-map location for the stripe.
  5. The current EC path expects all shard writes for the stripe to complete; otherwise the write fails and the operator-visible recovery path takes over.
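The replicated quorum rule in step 3 can be sketched as below. The WriteOutcome type, the replicated_write() name, and the send callback (a stand-in for the per-brick RPC) are all hypothetical, and the loop is sequential for brevity where the real client would fan out in parallel.

```rust
/// Outcome of fanning one chunk out to every replica. Hypothetical type.
#[derive(Debug, PartialEq)]
struct WriteOutcome {
    acked: Vec<String>,
    failed: Vec<String>,
    quorum_met: bool,
}

/// Send `data` to every brick and compare the acknowledgement count with write_quorum.
fn replicated_write<F>(
    brick_ids: &[&str],
    data: &[u8],
    write_quorum: usize,
    mut send: F,
) -> WriteOutcome
where
    F: FnMut(&str, &[u8]) -> Result<(), String>,
{
    let mut acked = Vec::new();
    let mut failed = Vec::new();
    for &brick in brick_ids {
        match send(brick, data) {
            Ok(()) => acked.push(brick.to_string()),
            // Remember the failure so later operations can prefer healthier bricks.
            Err(_) => failed.push(brick.to_string()),
        }
    }
    let quorum_met = acked.len() >= write_quorum;
    WriteOutcome { acked, failed, quorum_met }
}
```

With three replicas and write_quorum = 2, one failed brick still lets the write succeed; with write_quorum = 3 the same failure makes it fail.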

Client read semantics

  1. The client uses the same chunk or stripe index to find the correct ChunkGroup in the cached placement map.
  2. For replicated volumes with read_quorum = 1, the client tries replicas in order, skips locally failed bricks, and returns the first healthy response.
  3. For replicated volumes with read_quorum > 1, the client reads from all healthy replicas in parallel and requires at least read_quorum byte-identical responses.
  4. If quorum is reached but some replicas disagree, the client uses the majority response and triggers a background rewrite to repair divergent replicas.
  5. For erasure-coded volumes, the client reads shard locations, reconstructs the stripe locally if enough shards survive, then truncates the decoded bytes back to the requested length.
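Steps 3 and 4 for replicated reads can be illustrated with a small vote count. This is a sketch only: QuorumRead and quorum_read() are hypothetical names, the responses are assumed to have been collected already, and the most common payload is treated as the majority response.

```rust
/// Result of a quorum read. Hypothetical type.
#[derive(Debug, PartialEq)]
struct QuorumRead {
    data: Vec<u8>,
    /// Indices of replicas that answered with different bytes and need repair.
    divergent: Vec<usize>,
}

/// `responses[i]` is the reply from replica i, or None if that brick was skipped or failed.
fn quorum_read(responses: &[Option<Vec<u8>>], read_quorum: usize) -> Option<QuorumRead> {
    // Find the payload returned by the largest number of replicas.
    let mut best: Option<(&Vec<u8>, usize)> = None;
    for candidate in responses.iter().flatten() {
        let votes = responses.iter().flatten().filter(|r| *r == candidate).count();
        if best.map_or(true, |(_, n)| votes > n) {
            best = Some((candidate, votes));
        }
    }
    let (winner, votes) = best?;
    if votes < read_quorum {
        return None; // Not enough byte-identical responses.
    }
    // Replicas that answered but disagree are candidates for a background rewrite.
    let divergent = responses
        .iter()
        .enumerate()
        .filter(|(_, r)| matches!(r, Some(bytes) if bytes != winner))
        .map(|(i, _)| i)
        .collect();
    Some(QuorumRead { data: winner.clone(), divergent })
}
```

The caller would return data to the application and schedule a repair for every index in divergent.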

What the metadata service stores

PlacementMap {
  volume_id,
  version,
  groups: [
    ChunkGroup {
      stripe_index,
      locations: [
        ChunkLocation { chunk_id, brick_ids[...] }
      ]
    }
  ]
}
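Expressed as Rust types, the structure above might look like the following. The field types (u64, String) and both helper methods are assumptions; only the field names and nesting come from the outline.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ChunkLocation {
    chunk_id: u64,
    brick_ids: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct ChunkGroup {
    stripe_index: u64,
    locations: Vec<ChunkLocation>,
}

#[derive(Debug, Clone, PartialEq)]
struct PlacementMap {
    volume_id: String,
    version: u64,
    groups: Vec<ChunkGroup>,
}

impl PlacementMap {
    /// Find the group for a chunk or stripe index, as a client would on each I/O.
    fn group_for(&self, stripe_index: u64) -> Option<&ChunkGroup> {
        self.groups.iter().find(|g| g.stripe_index == stripe_index)
    }

    /// Hypothetical helper: insert or replace one group and bump the version.
    /// The returned version is what would be copied to the volume's placement_epoch.
    fn update_group(&mut self, group: ChunkGroup) -> u64 {
        let existing = self
            .groups
            .iter()
            .position(|g| g.stripe_index == group.stripe_index);
        if let Some(pos) = existing {
            self.groups[pos] = group;
        } else {
            self.groups.push(group);
        }
        self.version += 1;
        self.version
    }
}
```

A replicated group carries one ChunkLocation with several brick_ids, while an erasure-coded group carries one ChunkLocation per shard, each with a single brick.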

Current implementation boundaries

  • Selection is topology-aware and deterministic, but the current code uses simple ranked selection rather than a more elaborate weighted bucket algorithm.
  • Topology weights are stored and rolled up through the tree, but current selection does not yet bias picks by weight.
  • The metadata service computes and persists placement maps today; clients and gateways consume those maps rather than independently evaluating placement from raw topology state.
  • Placement operates on block-storage chunks and stripes, not on a separate object namespace.

How It Works

Write Path

1

Client Write

Application writes data to a volume via the CSI-mounted block device.

2

Protect The Data

The client either fans the chunk out to replicas or encodes it into data and parity shards, depending on the stored protection policy.

3

Placement Lookup

The client looks up the chunk's target bricks in the placement map that TopoHash computed across failure domains.

4

Parallel Fanout

All replicas or shards are written to their assigned brick servers in parallel via gRPC.

Read Path

1

Placement Lookup

The client resolves which bricks hold the replicas or shards for the requested chunk.

2

Parallel Fetch

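Erasure-coded shards, and replicas when read_quorum > 1, are fetched from brick servers in parallel, skipping any failed bricks. With read_quorum = 1, replicas are tried in order instead.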

3

Replica Failover Or EC Reconstruct

Replicated volumes fail over to another healthy replica. Erasure-coded volumes reconstruct the stripe locally when enough shards survive.

4

Return Data

Data shards are concatenated and truncated to the original length.
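The final "concatenate and truncate" step implies that chunks are padded to a whole number of data shards on the write side. A minimal sketch of that round trip, with hypothetical function names and with parity generation and Reed-Solomon decoding deliberately left out:

```rust
/// Split `data` into `k` equal-length data shards, zero-padding the tail.
fn split_into_shards(data: &[u8], k: usize) -> Vec<Vec<u8>> {
    assert!(k > 0, "need at least one data shard");
    let shard_len = (data.len() + k - 1) / k; // ceiling division
    if shard_len == 0 {
        return vec![Vec::new(); k]; // empty input: k empty shards
    }
    let mut padded = data.to_vec();
    padded.resize(shard_len * k, 0);
    padded.chunks(shard_len).map(|c| c.to_vec()).collect()
}

/// Concatenate data shards in order and drop the padding.
fn join_shards(shards: &[Vec<u8>], original_len: usize) -> Vec<u8> {
    let mut out = shards.concat();
    out.truncate(original_len);
    out
}
```

An 11-byte payload split four ways becomes four 3-byte shards with one byte of padding, and joining with original_len = 11 returns the payload unchanged.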

Self-Healing

1

Heartbeat Monitor

Bricks send heartbeats every 10s. The health monitor scans every 15s for stale bricks (30s timeout).

2

Failure Detection

A brick whose last heartbeat is older than the timeout is marked Down. Affected stripes and volumes are then identified.

3

Rebuild Planning

Replacement bricks are selected from the healthy pool. The RebuildPlanner creates migration tasks.

4

Recovery

Shards are rebuilt or migrated to new bricks, and operators can inspect placement movement through the CLI before removing infrastructure.
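The detection half of this loop (steps 1 and 2) reduces to a periodic staleness scan. The sketch below assumes a Brick struct, a two-state BrickState enum, and a strict "older than 30s" comparison; none of these details are confirmed by the description above beyond the 30s timeout itself.

```rust
use std::time::Duration;

/// Timeout from the text above: a brick is stale after 30s without a heartbeat.
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(30);

#[derive(Debug, Clone, Copy, PartialEq)]
enum BrickState {
    Up,
    Down,
}

/// Hypothetical in-memory view of a brick kept by the health monitor.
struct Brick {
    id: String,
    /// Time of the last heartbeat, measured from an arbitrary monitor start.
    last_heartbeat: Duration,
    state: BrickState,
}

/// One monitor pass: mark bricks whose heartbeat is older than the timeout as Down
/// and return the IDs that changed state in this pass.
fn scan(bricks: &mut [Brick], now: Duration) -> Vec<String> {
    let mut newly_down = Vec::new();
    for brick in bricks.iter_mut() {
        let age = now.saturating_sub(brick.last_heartbeat);
        if brick.state == BrickState::Up && age > HEARTBEAT_TIMEOUT {
            brick.state = BrickState::Down;
            newly_down.push(brick.id.clone());
        }
    }
    newly_down
}
```

Running scan every 15s against heartbeats sent every 10s means a brick is reported within roughly one scan interval after its heartbeat age crosses the timeout; the returned IDs are what a rebuild planner would consume.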

Erasure Coding Profiles

Reed-Solomon erasure coding provides durability without full replication. Choose a profile that matches your fault tolerance and storage efficiency requirements.

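Profile | Data Chunks | Parity Chunks | Total Shards | Fault Tolerance  | Storage Overhead
--------|-------------|---------------|--------------|------------------|-----------------
EC_4_2  | 4           | 2             | 6            | 2 brick failures | 1.5x
EC_8_3  | 8           | 3             | 11           | 3 brick failures | 1.375x
EC_8_4  | 8           | 4             | 12           | 4 brick failures | 1.5x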

Default chunk size: 128 MiB. For comparison, 3-way replication tolerates two brick failures at 3x storage overhead, the same fault tolerance EC_4_2 provides at 1.5x.
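The profile figures follow directly from the shard counts: overhead is (data + parity) / data, and a Reed-Solomon stripe survives the loss of up to parity shards. A small sketch, with a hypothetical EcProfile type:

```rust
/// An erasure-coding profile: `data` data shards plus `parity` parity shards per stripe.
#[derive(Debug, Clone, Copy)]
struct EcProfile {
    data: u32,
    parity: u32,
}

impl EcProfile {
    fn total_shards(&self) -> u32 {
        self.data + self.parity
    }

    /// Reed-Solomon tolerates the loss of up to `parity` shards per stripe.
    fn fault_tolerance(&self) -> u32 {
        self.parity
    }

    /// Raw bytes stored per logical byte: (data + parity) / data.
    fn storage_overhead(&self) -> f64 {
        f64::from(self.total_shards()) / f64::from(self.data)
    }
}
```

For EC_8_3 this gives 11 total shards and 11 / 8 = 1.375x overhead.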

Tech Stack

Rust: 1.75+ MSRV
Tokio: Async runtime
Tonic + Prost: gRPC framework
OpenRaft: Consensus
Sled: Embedded KV store
Reed-Solomon: Erasure coding
xxHash (xxh3): Fast hashing
kube-rs: Kubernetes client
Prometheus: Metrics
Clap: CLI framework
Docker: Multi-stage builds
Helm: K8s deployment

Quick Start

1. Build from Source

# Prerequisites: Rust 1.75+, protobuf-compiler
cargo build --workspace
cargo test --workspace       # 180+ tests
cargo clippy --workspace --all-targets   # Must be warning-free

2. Deploy to Kubernetes

# Install CRDs
kubectl apply -f deploy/helm/hyperblock/crds/

# Install with Helm
helm install hyperblock deploy/helm/hyperblock \
  --namespace hyperblock-system \
  --create-namespace \
  --set image.repository=myregistry/hyperblock \
  --set image.tag=0.1.0

3. Create a Volume

# CLI examples
hyperblock-cli volume create --name analytics-ec --size 100GiB --data-chunks 4 --parity-chunks 2
hyperblock-cli volume create --name postgres-r3 --size 50GiB --replicas 3

# Or use a PVC through CSI
kubectl apply -f pvc.yaml