We rebuilt the Kubernetes control loop on Postgres

Tom Härter

We like Kubernetes. People here have run it in production at real scale, and a few have built and run managed Kubernetes that other companies deployed onto. So this is not the usual story of someone who bounced off kubectl and quit. It is the opposite. Everything we learned running it is why Atlasflow's control plane works the way it does.

Here is what we learned. Kubernetes got one thing exactly right, and then wrapped it in a second full-time job. The thing it got right is the reconcile loop. The second job is running a cluster. We kept the loop, threw the cluster away, and run the whole control plane on Postgres. Here is how.

The loop

Most deploy systems are edge-triggered. A webhook fires, you run a handler, the handler changes the world. That works until a webhook is dropped, or arrives twice, or the handler dies after step three of five. Now what you wanted and what you have disagree, and nothing is in charge of fixing it.

A reconciler works the other way. It is level-triggered, which means it does not trust events at all. It reads two things, the state you asked for and the state that exists, does one thing to move them closer, and asks to run again.

type ReconcileFunc func(ctx context.Context, key uuid.UUID) (Result, error)

We run one of these for each kind of thing: deployments, projects, servers, VMs, domains. The body is boring on purpose. The deployment reconciler counts the healthy instances that exist, compares that to the number you asked for, and if it is short, starts one more and asks to run again.

There is no state machine to get stuck in, and no half-finished deploy to clean up. It compares two numbers and takes one step. So running it once and running it a hundred times end in the same place. That property is called idempotence, but the property is what matters, not the word. When the process can die between any two lines, and ours will, it is the only property worth having. A step that fails just gets retried until it sticks.

Postgres

Kubernetes keeps its model of the world in etcd, a separate database you have to run next to everything else. We keep ours in plain Postgres, the database you already have, and we did not add a second system beside it.

Picture one set of tables where every row is a fact. This app should have three instances. This instance is healthy. This one is shutting down. The rows you write are what you want. The rows we write are what is actually running. The whole job of the control plane is to make the second set match the first. You can open it with psql and read it, copy it with pg_dump, and understand it without learning a new theory of consistency.

Now the part that makes it move. Postgres can stream its own changes, so every time a row changes, we hear about it. The system does not wait for a poll to notice. It reacts. The moment an instance reports healthy, the deployment that owns it wakes up and starts retiring the old version. The moment an instance starts shutting down, that deployment wakes up and starts a replacement. The right loop runs within milliseconds of the change that triggers it.

Here is the part that lets us sleep. That stream exists only to make things fast. It is not what makes them correct. Even if every notification were lost, the control plane re-reads the whole table on a timer, finds any row that does not match what it should be, and fixes it. The truth gets enforced twice: the instant a change lands, and again on the next sweep. We never have to make the fast path perfect, because the slow path cannot miss.
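The two paths can be sketched as two ways of filling one work queue. This is an illustration, not our production code: Row, Queue, OnChange, Sweep, and Process are hypothetical names, the table is an in-memory map, and the timer that would call Sweep is left out so the example stays deterministic.

```go
package main

import "sort"

// Row is a hypothetical desired/actual pair for one deployment.
type Row struct {
	Desired int
	Actual  int
}

// Queue is a deduplicating work queue of row keys.
type Queue struct {
	pending map[string]bool
}

// NewQueue returns an empty queue.
func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

// Add enqueues a key; adding the same key twice is harmless.
func (q *Queue) Add(key string) { q.pending[key] = true }

// Drain returns the pending keys in sorted order and empties the queue.
func (q *Queue) Drain() []string {
	keys := make([]string, 0, len(q.pending))
	for k := range q.pending {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	q.pending = map[string]bool{}
	return keys
}

// OnChange is the fast path: a change notification names the row to revisit.
// Notifications may be lost, so correctness never depends on this call.
func OnChange(q *Queue, key string) { q.Add(key) }

// Sweep is the slow path: scan every row and enqueue any that does not match.
func Sweep(q *Queue, table map[string]*Row) {
	for key, row := range table {
		if row.Actual != row.Desired {
			q.Add(key)
		}
	}
}

// Process takes one reconcile step for every queued row and reports
// how many rows it changed.
func Process(q *Queue, table map[string]*Row) int {
	changed := 0
	for _, key := range q.Drain() {
		row, ok := table[key]
		if !ok {
			continue // row was deleted after it was queued
		}
		switch {
		case row.Actual < row.Desired:
			row.Actual++
			changed++
		case row.Actual > row.Desired:
			row.Actual--
			changed++
		}
	}
	return changed
}
```

If OnChange is never called, repeated Sweep and Process calls still bring every row to its desired value. The notification only shortens the wait.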

One leader

Two control planes reconciling the same fleet would fight, so only one leads at a time. We did not add ZooKeeper or etcd or a lease controller to arrange that. Leadership is held, not granted. The leader keeps it only while it stays alive and connected. The instant it crashes or drops off, a standby takes over, and nobody has to do anything.

There is no lease timeout to tune, no fencing token to thread through every write, and no window where two processes both think they are in charge. Failover is immediate, because for a standby, becoming leader is just claiming something the dead leader can no longer hold.

A deploy

You push to your connected branch. None of what follows is yours to run.

The deployment controller wakes up. It sees a new version, and fewer healthy instances than you asked for. It builds the image and schedules a microVM onto the least loaded server with room. The VM boots under Cloud Hypervisor with its own kernel. It is in the fleet, but it carries no traffic, because the edge only routes to instances that have passed a health check. The health checker probes it. The first success is recorded, and that change wakes the deployment again. Now it marks the new version live, and the old instances start draining. Each one removes itself only after a healthy replacement exists, so the old version serves every request until the new one is ready.

A broken release never goes healthy, so it never takes traffic. A failed one never displaces the version that works. Nobody wrote a rollback routine. Rollback is just the loop never advancing past an instance that will not go healthy.
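The drain rule above can be sketched as a single step function. This is a simplified model under assumptions: Instance, Routable, and AdvanceRollout are hypothetical names, and the rule is reduced to "never drain more old instances than there are healthy new ones".

```go
package main

// Instance is a hypothetical row describing one running instance.
type Instance struct {
	ID       string
	Version  string
	Healthy  bool
	Draining bool
}

// Routable returns the instances the edge may send traffic to:
// those that have passed a health check and are not draining.
func Routable(fleet []Instance) []Instance {
	var out []Instance
	for _, in := range fleet {
		if in.Healthy && !in.Draining {
			out = append(out, in)
		}
	}
	return out
}

// AdvanceRollout returns a copy of the fleet in which old instances are
// marked draining, at most one per healthy instance of the target version.
// If no new instance ever goes healthy, nothing is ever drained.
func AdvanceRollout(fleet []Instance, target string) []Instance {
	next := append([]Instance(nil), fleet...)

	// budget = healthy replacements minus old instances already draining.
	budget := 0
	for _, in := range next {
		switch {
		case in.Version == target && in.Healthy:
			budget++
		case in.Version != target && in.Draining:
			budget--
		}
	}

	for i := range next {
		if budget <= 0 {
			break
		}
		if next[i].Version != target && !next[i].Draining {
			next[i].Draining = true
			budget--
		}
	}
	return next
}
```

Calling AdvanceRollout on a fleet whose new version never passes a health check returns the fleet unchanged, however many times it runs. That is the rollback behaviour described above, with no rollback code.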

Why not Kubernetes

Because the loop was never the hard part to adopt. The cluster was. To get reconciliation from Kubernetes, you also sign up to run an API server, etcd, a scheduler, a kubelet on every node, a CNI plugin, an ingress controller, and more YAML than there is code in most of the apps people deploy on it. That is a fine trade if you are Google. It is a bad one if you have eight engineers and a product to ship.

We wanted our customers to get the guarantees of a control loop and none of the work of running one. So we run the loop, on infrastructure they never see, and we kept the part they touch small: this repo, this many instances, this domain, this database.

Many regions

Kubernetes makes this part painful, and our model makes it dull. A cluster lives in one region. Going multi-region means a cluster per region, plus federation on top, which is a second distributed system with its own consensus and its own ways to fail. Region is a property of the cluster, so crossing regions means crossing clusters.

Our control plane is not a cluster. It is loops over rows. A server is a row, and that row carries its capacity along with where the metal actually is: its region, its city, its latitude and longitude. A VM is a row too, and it inherits the location of the server it lands on. So at any moment the control plane knows not just how much is running but where every instance of your app physically sits. Geography is not a federation layer bolted on the side. It is a few more columns on rows the scheduler already reads.

Because placement can see those coordinates, it becomes a choice instead of a guess. When you ask for three instances, the loop does not just drop them on the three least loaded boxes. It spreads them across regions, so losing a building, a power feed, or a whole region still leaves you serving. And it weighs where your users are against where each server is, so the instance that answers a request is the one closest to the person making it. The same loop that keeps you healthy on one box keeps you healthy across regions. Now it chooses where, not just how many.
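A rough sketch of placement that reads coordinates, under simplifying assumptions. Server, Place, and haversineKm are hypothetical names. The sketch uses one user location instead of a traffic distribution, and a fixed rule: pick the region with the fewest instances so far, and break ties by great-circle distance to the user. The real weighing is not described in this post.

```go
package main

import (
	"math"
	"sort"
)

// Server is a hypothetical row: capacity plus where the metal is.
type Server struct {
	Name     string
	Region   string
	Lat, Lon float64 // degrees
	Free     int     // free instance slots
}

// haversineKm returns the great-circle distance between two points in km.
func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	rad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := rad(lat2 - lat1)
	dLon := rad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(rad(lat1))*math.Cos(rad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(a))
}

// Place picks up to n servers. Each pick goes to the region holding the
// fewest instances so far; ties go to the server nearest the user.
func Place(servers []Server, n int, userLat, userLon float64) []string {
	// Work on a copy sorted nearest-first, so the first candidate found
	// among equally loaded regions is also the closest one.
	candidates := append([]Server(nil), servers...)
	sort.SliceStable(candidates, func(i, j int) bool {
		di := haversineKm(userLat, userLon, candidates[i].Lat, candidates[i].Lon)
		dj := haversineKm(userLat, userLon, candidates[j].Lat, candidates[j].Lon)
		return di < dj
	})

	perRegion := map[string]int{}
	var picked []string
	for len(picked) < n {
		best := -1
		for i, s := range candidates {
			if s.Free == 0 {
				continue
			}
			if best == -1 || perRegion[s.Region] < perRegion[candidates[best].Region] {
				best = i
			}
		}
		if best == -1 {
			break // no capacity left anywhere
		}
		candidates[best].Free--
		perRegion[candidates[best].Region]++
		picked = append(picked, candidates[best].Name)
	}
	return picked
}
```

With one free slot each on two Frankfurt servers, one in New York, and one in Singapore, a request for three instances from a user in Paris lands one in each region, nearest first, instead of filling Frankfurt.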

None of this is a separate geo feature with its own moving parts. It is the same level-triggered loop, reading a few more facts off the same rows. High availability is the loop refusing to pile your instances into one failure domain. Low latency is the loop preferring servers near your traffic. Both fall out of placement having real coordinates to reason about, and both get enforced on every sweep. A region going dark is just another difference for the loop to close: every deployment that lost instances with it is now short, so the loop starts replacements in the regions still standing.

The point

You declare a few facts: this repository, this many instances, this domain, this database. After that, a level-triggered loop on Postgres holds reality against what you declared, and treats every failure as a difference to close. Zero-downtime deploys, automatic rollback, self-healing, and multi-region are not four features. They are one loop, doing the same boring thing, in more places.

Reading about a control plane proves nothing. Push something and watch it stay up.
