There’s a familiar pattern we see across many enterprise IT teams. It begins with the best intentions: a container platform is built in-house to give developers speed and flexibility. Over time, though, complexity creeps in. What started as an enabler becomes a growing source of operational drag.
One large US-based Fortune 500 enterprise, operating in a demanding and regulated sector, was living that reality.
They were juggling two container platforms. Dozens of Docker Swarm clusters (managed by Portainer) powered a large legacy application, chosen for Swarm’s networking simplicity and support for persistent workloads. Meanwhile, their modern workloads ran on almost 50 Kubernetes clusters, managed through Rancher and running RKE across almost 300 VMware-based virtual machines.
It worked, technically, but only just. A small internal platform team was drowning in operational load, acting as a helpdesk and gatekeeper for the development teams, caught between keeping the lights on and trying to push the platform forward. Upgrades were overdue and kept failing when attempted. Risks were mounting. Outages kept recurring. The platform, and the engineering staff who supported it, had become a bottleneck.
In addition, their Rancher subscription was due for renewal, and the renewal quote had almost doubled. This cost increase, coupled with a near-simultaneous price increase for their VMware platform, forced a decision point. The enterprise considered switching their Rancher deployments to VMware Tanzu, but was reluctant to dive deeper into the VMware stack after the Broadcom-VMware licensing saga.
So, the IT Director made the tough decision to stop patching the problem, stop throwing good money after bad, and break the cycle completely.
Instead of trying to salvage the old setup, the organization went for a clean rebuild. They chose to outsource the operation entirely under a “Platform Engineering as a Service” model, complete with a contractual platform availability SLA. The brief was simple: rebuild everything in alignment with the provider’s best practices, using tools the provider was comfortable operating. The goal was to ensure long-term stability and relieve internal teams from firefighting duties.
This is where Portainer was brought in.
We became their virtual Platform Engineering team, taking on responsibility not just for day-to-day operations but also for the architecture and tooling that powers their container platform. We committed to a 99.9 percent platform availability SLA and 24x7 support with a one-hour response window for critical issues.
When we set out to rebuild the platform, it wasn’t just about making things more stable. We wanted to rethink how infrastructure was delivered. The goal was to give teams the power to move faster, to modernize how they worked, and to do it all at scale.
We started by creating an automated Kubernetes build process using ClusterAPI. Instead of relying on manual steps or fragile scripts, we built a custom Helm chart to serve as the trigger. From there, everything runs through a GitHub Actions pipeline. As part of that pipeline, Ansible playbooks handle additional configuration and customization, making sure that every cluster meets our standards out of the box.
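To give a flavour of how that trigger flows end to end, here is a minimal sketch of the kind of script that could kick off a cluster build by dispatching a GitHub Actions workflow. The repository, workflow file, and input names are placeholders rather than the customer’s actual pipeline; only the standard GitHub workflow-dispatch endpoint is real.

```python
"""Minimal sketch of kicking off a cluster build via GitHub Actions.

The repo, workflow file, and input names below are illustrative, not the
customer's actual pipeline; only the workflow-dispatch endpoint is real.
"""
import os

import requests

GITHUB_API = "https://api.github.com"
REPO = "example-org/cluster-factory"       # hypothetical repository
WORKFLOW = "provision-cluster.yaml"        # hypothetical workflow file


def dispatch_cluster_build(cluster_name: str, node_count: int, k8s_version: str) -> None:
    """Trigger the pipeline that renders the ClusterAPI Helm chart and runs
    the Ansible post-configuration plays."""
    resp = requests.post(
        f"{GITHUB_API}/repos/{REPO}/actions/workflows/{WORKFLOW}/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "ref": "main",
            "inputs": {                    # hypothetical workflow inputs (strings only)
                "cluster_name": cluster_name,
                "node_count": str(node_count),
                "kubernetes_version": k8s_version,
            },
        },
        timeout=30,
    )
    resp.raise_for_status()                # GitHub returns 204 No Content on success


if __name__ == "__main__":
    dispatch_cluster_build("team-a-prod", node_count=3, k8s_version="v1.29.4")
```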
At the front of this process is a lightweight developer portal that was built in-house. It’s not flashy, but it’s exactly what the customer’s internal teams needed. Developers use it to request new clusters. Once a request is approved, the automation takes over. A new cluster is created, monitoring and management tools are deployed, and the whole thing is wired into one central Portainer instance. This same instance now manages both their Docker Swarm and Kubernetes environments.
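Conceptually, the request lifecycle is simple: request, approve, provision. The sketch below is illustrative only; the helper functions are hypothetical stand-ins for the real pipeline trigger and for the call that registers the new cluster with the central Portainer instance.

```python
"""Illustrative request lifecycle for a lightweight developer portal.
All names here are hypothetical stand-ins for the real automation."""
from dataclasses import dataclass
from enum import Enum


class RequestState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    PROVISIONED = "provisioned"


@dataclass
class ClusterRequest:
    team: str
    cluster_name: str
    node_count: int
    state: RequestState = RequestState.PENDING


def dispatch_cluster_build(cluster_name: str, node_count: int) -> None:
    """Stub for the pipeline trigger shown in the previous sketch."""
    print(f"dispatching build for {cluster_name} ({node_count} nodes)")


def register_with_portainer(cluster_name: str) -> None:
    """Stub for wiring the new cluster into the central Portainer instance."""
    print(f"registering {cluster_name} with Portainer")


def approve(req: ClusterRequest) -> None:
    """Approval is the hand-off point: from here, the automation does the rest."""
    req.state = RequestState.APPROVED
    dispatch_cluster_build(req.cluster_name, req.node_count)
    register_with_portainer(req.cluster_name)
    req.state = RequestState.PROVISIONED


if __name__ == "__main__":
    approve(ClusterRequest(team="team-a", cluster_name="team-a-prod", node_count=3))
```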
Access controls are applied automatically. RBAC roles are set up based on predefined templates. Ingress is configured using the customer’s existing Citrix NetScalers. All of this happens in the background, so by the time developers get access, the cluster is production-ready.
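As an example of what a predefined RBAC template boils down to, the sketch below binds a team’s directory group to Kubernetes’ built-in edit ClusterRole within that team’s namespace, using the official kubernetes Python client. The context, namespace, and group names are placeholders, not the customer’s real templates.

```python
"""Sketch of applying a predefined RBAC template to a new cluster, assuming the
official `kubernetes` Python client and a kubeconfig context for that cluster.
Context, namespace, and group names are placeholders."""
from kubernetes import client, config


def apply_team_rbac(context: str, namespace: str, team_group: str) -> None:
    config.load_kube_config(context=context)
    rbac = client.RbacAuthorizationV1Api()
    # Bind the team's directory group to the built-in "edit" ClusterRole,
    # scoped to their namespace -- one example of a predefined template.
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{namespace}-edit", "namespace": namespace},
        "subjects": [{
            "kind": "Group",
            "name": team_group,
            "apiGroup": "rbac.authorization.k8s.io",
        }],
        "roleRef": {
            "kind": "ClusterRole",
            "name": "edit",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    rbac.create_namespaced_role_binding(namespace=namespace, body=binding)


if __name__ == "__main__":
    apply_team_rbac("team-a-prod", "team-a", "AD-TEAM-A-DEVELOPERS")
```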
We also made scaling effortless. Whether a team needs more CPU and memory for a single node or wants to add more worker nodes to handle load, both vertical and horizontal scaling can be triggered through a simple request in the portal. The automation handles the rest, ensuring capacity is added without disruption.
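Under the hood, horizontal scaling with ClusterAPI is little more than patching the replica count on a MachineDeployment in the management cluster. Here is a hedged sketch, again using the kubernetes Python client with placeholder context and resource names.

```python
"""Sketch of horizontal scaling behind a portal request: bump the replica count
on a Cluster API MachineDeployment. Assumes the `kubernetes` Python client and
a kubeconfig context for the management cluster; names are placeholders."""
from kubernetes import client, config


def scale_workers(mgmt_context: str, namespace: str, machine_deployment: str, replicas: int) -> None:
    config.load_kube_config(context=mgmt_context)
    api = client.CustomObjectsApi()
    # MachineDeployments live on the management cluster and drive worker count.
    api.patch_namespaced_custom_object(
        group="cluster.x-k8s.io",
        version="v1beta1",
        namespace=namespace,
        plural="machinedeployments",
        name=machine_deployment,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    scale_workers("capi-mgmt", "team-a", "team-a-prod-md-0", replicas=5)
```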
For observability, we rolled out OpenTelemetry to standardize how telemetry is collected. Prometheus, backed by Mimir, handles time-series metrics. Grafana gives us the dashboards we need. To avoid the overhead of traditional agents, we leaned into eBPF for low-level performance insights. This gives us deep visibility without the cost of running extra containers.
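Because Mimir exposes the standard Prometheus query API, pulling platform metrics programmatically is straightforward. A small illustration, with a placeholder endpoint URL:

```python
"""Minimal sketch of pulling a per-cluster metric from the Prometheus-compatible
query API (Mimir exposes the same /api/v1/query endpoint). The URL below is a
placeholder for illustration."""
import requests

PROM_URL = "https://metrics.example.internal/prometheus"   # hypothetical endpoint


def cpu_usage_by_cluster() -> dict[str, float]:
    """Return average CPU usage per cluster over the last 5 minutes."""
    query = 'avg by (cluster) (rate(container_cpu_usage_seconds_total[5m]))'
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("cluster", "unknown"): float(r["value"][1]) for r in results}


if __name__ == "__main__":
    for cluster, usage in cpu_usage_by_cluster().items():
        print(f"{cluster}: {usage:.2f} cores")
```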
To top it off, we built in automation rules that watch for early warning signs. If a threshold is crossed or a pattern is detected, the system responds automatically. These "if-this-then-that" workflows help us stay ahead of issues and protect our service-level commitments.
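As a simplified illustration of one such rule: poll a metric, compare it against a threshold, and hand off to the same automation the portal uses. The query, threshold, and action below are illustrative, not our production rules.

```python
"""Sketch of an "if-this-then-that" rule: when a cluster's memory pressure stays
above a threshold, request a scale-out. Query, threshold, and action are
illustrative placeholders."""
import time

import requests

PROM_URL = "https://metrics.example.internal/prometheus"   # hypothetical endpoint
MEMORY_THRESHOLD = 0.85                                     # 85% of allocatable memory


def memory_pressure(cluster: str) -> float:
    """Ratio of working-set memory to allocatable memory across the cluster."""
    query = (
        f'sum(container_memory_working_set_bytes{{cluster="{cluster}"}}) / '
        f'sum(kube_node_status_allocatable{{resource="memory",cluster="{cluster}"}})'
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def watch(cluster: str, interval_s: int = 60) -> None:
    """The "if-this": poll the metric; the "then-that": trigger the scale-out."""
    while True:
        usage = memory_pressure(cluster)
        if usage > MEMORY_THRESHOLD:
            # Hand off to the same automation the portal uses, e.g. the
            # scale_workers() sketch shown earlier.
            print(f"{cluster}: memory at {usage:.0%}, requesting an extra worker node")
        time.sleep(interval_s)


if __name__ == "__main__":
    watch("team-a-prod")
```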
This rebuild has completely changed how infrastructure is delivered. It’s faster, more reliable, and ready to grow with us.
With this new setup, developers are no longer stuck behind tickets or waiting on the platform team. Portainer gives them direct visibility into their workloads. They can view deployment logs, track their own GitOps pipeline status, monitor resource usage, and even restart or redeploy services without needing assistance.
It still runs on Kubernetes, but the developers don’t need to think in Kubernetes terms. For them, it just works.
If your container platform is starting to feel like a liability rather than a launchpad, you are not alone. Many teams find themselves boxed in by complexity, technical debt, and limited resources.
Portainer’s Managed Services team can help. We specialize in taking over existing environments or building new ones from scratch, always with a focus on simplicity, reliability, and measurable outcomes.
If you are ready to offload the burden and focus on building great software instead, reach out. We would be happy to share how we helped this customer break the cycle and how we can help you do the same.