Portainer News and Blog

Why are stateful containers so confusing in Kubernetes?

Written by Neil Cresswell, CEO | September 23, 2021

The Portainer team spend a lot of time in online communities related to Kubernetes, and one of the most frequent questions we see relates to data persistence for apps.

There seems to be general confusion as to when you should use a Deployment with a PVC and when you should use a StatefulSet with a PVC. There is also a real lack of understanding when it comes to disk access policies (Kubernetes calls them access modes), what RWO and RWX mean, and what they allow you to do.

Let me try and explain what all this means.

First things first, both a Deployment with a PVC and a StatefulSet with a PVC will allow you to persist data for your container-based application. However, the decision as to which is best for your app comes down to a couple more things...

  • Does your application expect to have more than its disk state retained?
  • Does it need its hostname retained too?
  • Does the start-up and shutdown order of the replicas matter?

If the answer to any of these questions is 'yes', then your choice becomes a bit more limited.
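The three questions above boil down to a simple rule of thumb. Here is a minimal sketch of that reasoning; the function and parameter names are made up for this example and are not part of Kubernetes or Portainer.

```python
def recommend_workload(needs_more_than_disk_state: bool,
                       needs_stable_hostname: bool,
                       needs_ordered_start_stop: bool) -> str:
    """Return the workload type suggested by the three questions.

    Hypothetical helper for illustration only; not a Kubernetes or
    Portainer API.
    """
    if (needs_more_than_disk_state
            or needs_stable_hostname
            or needs_ordered_start_stop):
        return "StatefulSet"
    return "Deployment"


if __name__ == "__main__":
    # An app that only needs its files kept between restarts.
    print(recommend_workload(False, False, False))
    # A clustered app whose members rely on fixed hostnames and start order.
    print(recommend_workload(True, True, True))
```

A single 'yes' is enough to tip the answer towards a StatefulSet, which is exactly the "your choice becomes a bit more limited" point.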

Perhaps an easier way of thinking about this is to ask yourself, "was my app originally designed to run in its own VM with its own disk storage?" If the answer is 'yes', then it's likely that a StatefulSet is the better choice for you.

With a Deployment, the Pods that get created have a hostname made up of the Deployment name plus system-generated suffixes (a hash from the underlying ReplicaSet and a random string), e.g. myapp-576cd55d5f-dwp8q.

Should the Pod be rescheduled on another node (for example, due to node maintenance), or should you update or redeploy the application, the hostname will change. Can your application accommodate a change of hostname?

If you have a Deployment with multiple replicas, those replicas start in no guaranteed order. Can your application handle that, or does one pod need to start before another? Does the order in which they are stopped matter too?

If you have a Deployment with multiple replicas, do all of these replicas need to access the same underlying persistent volume, or do they each expect their own unique volume?

While we are here, let’s quickly talk about disk access policy (Kubernetes calls these access modes): RWO (ReadWriteOnce, read/write from a single node), RWX (ReadWriteMany, read/write from multiple nodes) and the lesser used ROX (ReadOnlyMany, read only from multiple nodes). Note that these modes are defined per node rather than per pod, so several pods scheduled on the same node can still mount an RWO volume. The access mode determines whether you can create Deployments with multiple replicas using the SAME underlying persistent volume. Generally, this is a BAD idea: unless your application supports it, you can end up in a world of pain (think, last change wins).
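To make the abbreviations concrete, here is a small sketch that builds a PersistentVolumeClaim manifest with the access mode spelled out the way Kubernetes expects it. The claim name, the size and the helper function are placeholders invented for this example.

```python
import json

# Short names used in this article, mapped to the values Kubernetes
# expects in a PersistentVolumeClaim's spec.accessModes list.
ACCESS_MODES = {
    "RWO": "ReadWriteOnce",
    "RWX": "ReadWriteMany",
    "ROX": "ReadOnlyMany",
}


def make_pvc(name: str, size: str, mode: str) -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dict.

    `name` and `size` are placeholder values chosen for this example.
    """
    if mode not in ACCESS_MODES:
        raise ValueError(f"unknown access mode: {mode}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [ACCESS_MODES[mode]],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be
    # saved to a file and applied with `kubectl apply -f`.
    print(json.dumps(make_pvc("myapp-data", "1Gi", "RWO"), indent=2))
```

No storage class is set here, so the cluster's default storage class (if it has one) would be used. Whether a given mode is actually available depends on the storage driver behind that class.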

We regularly see questions from Kubernetes engineers asking, "My storage is NFS, so I set my access policy to RWX, but why can’t I get the MySQL deployment to scale up to 3 replicas?" Well, it’s simple. Sure, your storage driver allows you to present the same volume to multiple pods, but the application running within the pod (MySQL in this example) requires an exclusive lock on the files contained within the volume. The first pod starts, MySQL starts and locks the DB file; the second pod starts, MySQL starts and then crashes because it is unable to open the DB file.
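If the locking behaviour sounds abstract, here is a rough analogy you can run on any Linux or macOS machine. It is NOT how MySQL implements its locking; it simply uses an advisory `flock` on a temporary file to show what happens when two instances both want exclusive access to the same file. The helper name is made up for this example.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path: str):
    """Open `path` and try to take an exclusive, non-blocking lock on it.

    Returns the open file object on success (keep it open to hold the
    lock), or None if something else already holds the lock.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Stand-in for a data file on a volume shared by two pods.
    fd, db_file = tempfile.mkstemp(suffix=".db")
    os.close(fd)

    first = try_exclusive_lock(db_file)   # "pod 1" starts and locks the file
    second = try_exclusive_lock(db_file)  # "pod 2" tries the same file

    print("first instance got the lock:", first is not None)
    print("second instance got the lock:", second is not None)

    if first is not None:
        first.close()
    os.remove(db_file)
```

On a local filesystem the second attempt should be refused while the first handle is still open. That is the same situation the second MySQL pod finds itself in, no matter what the access mode on the volume says.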

The real answer here is that there is no way you can "share" a volume unless your app is designed to handle it, and MySQL (like most other databases out there) is NOT. You need to give each instance its own persistent volume and then configure DB replication (either multi-master or master/worker).

When you have a simpler application, like NGINX or WordPress, you potentially can share a volume, as neither of these wants to exclusively lock backend files. These could be run as a Deployment with multiple replicas and with persistence but, again, you need to be careful.

Generally, we would advise against using multi-replica deployments with persistence as you really need to know what you are doing in these types of setups.

OK, so back to Deployment with PVC or StatefulSet with PVC...

If you need Pods to have their own persistent volume, then use StatefulSets. It's just not worth the risk of finding out the hard way whether your app supports concurrent writes, so don't guess.
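For the curious, a StatefulSet gives each pod its own volume through a volumeClaimTemplates section: Kubernetes creates one PVC per replica from that template. The sketch below builds such a manifest as a Python dict. The application name, image, mount path and size are all placeholders invented for this example, and the headless Service named in serviceName would need to be created separately.

```python
import json


def make_statefulset(name: str, replicas: int, image: str, size: str) -> dict:
    """Build a StatefulSet manifest in which every replica gets its own PVC."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Headless Service that governs this StatefulSet (not created here).
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/data"},
                        ],
                    }],
                },
            },
            # One PVC is created from this template for each replica.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


def pvc_names(manifest: dict) -> list:
    """PVC names Kubernetes derives: <template>-<statefulset>-<ordinal>."""
    sts = manifest["metadata"]["name"]
    return [
        f'{tpl["metadata"]["name"]}-{sts}-{i}'
        for tpl in manifest["spec"]["volumeClaimTemplates"]
        for i in range(manifest["spec"]["replicas"])
    ]


if __name__ == "__main__":
    sts = make_statefulset("myapp", 3, "example/myapp:1.0", "1Gi")
    print(json.dumps(sts, indent=2))
    print(pvc_names(sts))
```

Because every replica claims its own RWO volume, no two pods ever write to the same files, which sidesteps the concurrent-write question entirely.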

If you need the pods that make up your application to have consistent and predictable hostnames, then use StatefulSets. Hostnames for pods are ALWAYS <application name>-<numerical increment> (for example, myapp-0, myapp-1). This does not change regardless of where they are rescheduled or when they are redeployed.

If you need the pods that make up your application to start in a specific order, then use StatefulSets. By default, pods are started in their increment order (0,1,2,3,4,5...). Pod 1 will only start once pod 0 is online and ready.

If you need the pods that make up your application to stop in a specific order, then use StatefulSets. By default, pods are stopped in the reverse increment order (5,4,3,2,1,0).
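The naming and ordering rules are simple enough to model in a few lines. This sketch does not talk to a cluster; it just reproduces the pattern described above, assuming the StatefulSet uses its default pod management policy. The function names and the "myapp" name are made up for this example.

```python
def pod_names(statefulset: str, replicas: int) -> list:
    """Stable pod names: <statefulset name>-<ordinal>, starting at 0."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def start_order(statefulset: str, replicas: int) -> list:
    """Pods are started lowest ordinal first."""
    return pod_names(statefulset, replicas)


def stop_order(statefulset: str, replicas: int) -> list:
    """Pods are stopped highest ordinal first."""
    return list(reversed(pod_names(statefulset, replicas)))


if __name__ == "__main__":
    print("start:", start_order("myapp", 3))
    print("stop: ", stop_order("myapp", 3))
```

Compare that with a Deployment, where the pod names carry a generated suffix and there is no ordering to rely on at all.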

I hope this helps clarify when to use StatefulSets vs Deployments.

If you elect to deploy an application with persistence through Portainer, we will DEFAULT to a StatefulSet, as the alternative can lead you to some very dark places unless you know exactly what you are doing.

Neil