Geoff Olliff | November 25, 2024 | 7 min read

The Platform Engineering Team - Skills and Experience

This article is part of our ongoing series on the importance of Platform Engineering. Earlier articles discuss the nature of a Software Delivery Platform and thoughts on successful Platform Engineering.

In this article we look at the skills and experience that need to come together in the Platform Engineering Team. Given the wide range of technologies assembled into a platform, the team responsible will need to maintain (or have access to) an extensive range of skills at a deep level. Combining the required skills with enough real-world experience of using them in a small team is generally hard.

But it's possible to get this right, and a great platform team (and a great platform) will revolutionize IT's ability to make a difference.

The best platforms start small and grow with user demand and feedback. Not all organizations are heavily focused on development, either. Many have both development requirements and the need to source and deploy commercial or off-the-shelf software. The composition and skill level of the critical Platform Engineering team will therefore vary from company to company. However, every team will need some or all of the following capabilities, listed in no particular order:

1. Infrastructure as Code (IaC)

Functions: Writing and maintaining scripts to automate the provisioning and management of infrastructure, whether in the cloud or on-premise.
Skills: Proficiency in IaC tools like Terraform, Ansible, Chef, and Puppet; knowledge of both cloud platforms (AWS, Azure, GCP) and on-premise systems (VMware, physical servers, storage, and networking).
Importance: Automating infrastructure through IaC ensures consistency, reduces human error, and allows for rapid scaling and updating of environments, whether in the cloud or on-premise. This automation is critical for maintaining parity between development, testing, and production environments and facilitating disaster recovery. IaC also makes managing complex, hybrid environments that span both cloud and on-premise resources easier.
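The consistency benefit described above comes from the declarative model most IaC tools share: you state the desired environment and the tool works out the changes needed to reach it. The following is a minimal sketch of that idea in plain Python, not how Terraform or any other tool is actually implemented. The `plan` function and the resource names are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# A resource is modelled as a plain dictionary of settings (an assumption
# made for this sketch; real IaC tools use much richer schemas).
Resource = Dict[str, object]


def plan(desired: Dict[str, Resource],
         actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to make `actual` match `desired`."""
    steps: List[Tuple[str, str]] = []
    # Anything declared but missing is created; anything that differs is updated.
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    # Anything that exists but is no longer declared is removed.
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical declared state and observed state of an environment.
desired = {
    "web-server": {"size": "medium", "count": 2},
    "database": {"size": "large", "count": 1},
}
actual = {
    "web-server": {"size": "small", "count": 2},
    "old-cache": {"size": "small", "count": 1},
}

for action, name in plan(desired, actual):
    print(f"{action}: {name}")
```

Running the same plan against an environment that already matches the declaration yields no steps, which is what makes repeated runs safe and keeps development, testing, and production in line with one another.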

2. CI/CD Pipeline Management

Functions: Designing, implementing, and managing continuous integration and continuous deployment pipelines.
Skills: Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI), scripting (Python, Bash), and containerization (Docker, Kubernetes).
Importance: CI/CD pipelines are crucial for automating the build, test, and deployment processes, enabling faster and more reliable software delivery. Proper pipeline management minimizes bottlenecks and ensures smooth releases.

3. Hybrid Cloud Architecture and Management (Including on-prem Infrastructure)

Functions: Designing container platform infrastructure that spans both cloud and on-premise environments, managing resources across these environments, optimizing costs, and ensuring seamless integration.
Skills: Deep understanding of cloud services (AWS, Azure, GCP), expertise in on-premise systems (VMware, physical servers), hybrid cloud management tools (such as VMware Cloud on AWS, Azure Arc), storage, networking, and security best practices. Deep understanding of the requirements and principles of a container platform.
Importance: As organizations adopt hybrid cloud strategies, designing and managing infrastructures that integrate both on-premise and cloud resources efficiently and enable the deployment and management of containerized workloads becomes essential. This hybrid approach allows for flexibility, enabling companies to keep specific workloads on-premise (for reasons such as compliance or performance) while leveraging the scalability and flexibility of the cloud. Proper management of hybrid environments is crucial for optimizing performance, ensuring security, and controlling costs across different infrastructure layers.

4. Monitoring and Observability

Functions: Implementing and maintaining monitoring tools, setting alerts, and ensuring system observability.
Skills: Knowledge of monitoring tools (Prometheus, Grafana, Datadog), logging solutions (ELK stack, Fluentd), and alerting systems (PagerDuty).
Importance: Monitoring and observability are critical for detecting and resolving issues before they impact users. They provide insights into system performance and help identify bottlenecks or failures in the infrastructure.
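To make the idea of setting alerts concrete, here is a small sketch of one common alerting pattern: fire only when a metric stays above a threshold for several consecutive samples, so that a single spike does not page anyone. The `should_alert` function, the sample values, and the threshold are all assumptions for illustration and do not reflect the rule syntax of Prometheus, Datadog, or any other tool.

```python
from typing import Iterable


def should_alert(samples: Iterable[float],
                 threshold: float,
                 consecutive: int) -> bool:
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


# Hypothetical CPU utilisation readings (fraction of capacity).
brief_spike = [0.20, 0.95, 0.30, 0.96]
sustained_load = [0.50, 0.91, 0.97, 0.40]

print(should_alert(brief_spike, threshold=0.9, consecutive=2))
print(should_alert(sustained_load, threshold=0.9, consecutive=2))
```

Requiring a sustained breach is a design trade-off: it reduces noisy alerts at the cost of a slightly slower reaction to genuine problems.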

5. Security and Compliance

Functions: Implementing security measures, ensuring compliance with industry standards, and managing security incidents.
Skills: Understanding of security frameworks (ISO 27001, SOC 2), familiarity with security tools (Vault, Nessus, AWS IAM), and knowledge of regulatory requirements (GDPR, HIPAA).
Importance: With the rise of cyber threats, securing the platform is non-negotiable. A platform engineering team must safeguard data, maintain compliance, and integrate security practices into all processes.

6. Automation and Scripting

Functions: Automating repetitive tasks, developing scripts to improve efficiency, and integrating systems.
Skills: Proficiency in scripting languages (Python, Bash, PowerShell), automation tools (Ansible, Jenkins), and API integration.
Importance: Automation reduces the time spent on manual tasks, minimizes errors, and allows the team to focus on more complex issues. It's essential for maintaining a scalable and efficient platform.

7. Networking and Load Balancing

Functions: Managing network infrastructure and load balancers, and ensuring high availability.
Skills: Knowledge of networking protocols (TCP/IP, DNS, HTTP/HTTPS), experience with load balancing tools (NGINX, HAProxy), and cloud networking (VPCs, Subnets).
Importance: Proper network management ensures that applications are accessible, performant, and resilient. Load balancing prevents downtime by distributing traffic efficiently across servers.
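As an illustration of how distributing traffic across servers helps avoid downtime, the sketch below shows round-robin selection that skips backends marked unhealthy. It is a simplified model, not how NGINX or HAProxy work internally. The `RoundRobinBalancer` class and the backend names are hypothetical.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any marked unhealthy."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        self._unhealthy: Set[str] = set()
        self._index = 0  # position of the next backend to try

    def mark_unhealthy(self, backend: str) -> None:
        self._unhealthy.add(backend)

    def mark_healthy(self, backend: str) -> None:
        self._unhealthy.discard(backend)

    def next_backend(self) -> str:
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self._backends)):
            backend = self._backends[self._index]
            self._index = (self._index + 1) % len(self._backends)
            if backend not in self._unhealthy:
                return backend
        raise RuntimeError("no healthy backends available")


# Hypothetical pool of three application servers.
balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([balancer.next_backend() for _ in range(4)])

# Simulate a failed health check: traffic now flows only to the other two.
balancer.mark_unhealthy("app-2")
print([balancer.next_backend() for _ in range(4)])
```

In a real deployment the health state would come from periodic checks rather than manual calls, but the principle is the same: requests keep flowing as long as at least one backend is healthy.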

8. Collaboration and DevOps Culture

Functions: Promoting collaboration between development and operations teams, fostering a DevOps culture.
Skills: Strong communication skills, understanding of DevOps principles (continuous integration, continuous delivery, automation), and experience with collaboration tools (JIRA, Confluence, Slack).
Importance: A collaborative culture between development and operations teams is key to platform engineering's success. It ensures that infrastructure and software are developed in tandem, leading to faster, more reliable releases.

9. Containerization and Orchestration

Functions: Managing containers and orchestration tools, optimizing containerized environments.
Skills: Expertise in Docker, Kubernetes, and container orchestration (Helm, OpenShift).
Importance: Containers allow for consistent development, testing, and production environments. Orchestration tools manage these containers at scale, ensuring they run smoothly and efficiently.

10. Configuration Management

Functions: Managing configuration files, ensuring consistency across environments, and tracking changes.
Skills: Familiarity with tools like Ansible, Puppet, or Chef, as well as version control systems (Git).
Importance: Configuration management ensures that environments are consistent, repeatable, and easy to recover. It is vital for maintaining stability across different environments (dev, test, production).

11. Incident Management and Response

Functions: Handling incidents, root cause analysis, and implementing preventive measures.
Skills: Incident management tools (PagerDuty, Opsgenie), root cause analysis, and communication during crises.
Importance: Quick and effective incident management is crucial for minimizing downtime and maintaining user trust. A well-prepared team can resolve issues faster and learn from incidents to prevent future occurrences.

12. Platform Evangelism, Documentation, and Agile Product Management

Functions: Promoting the use of the platform, creating comprehensive documentation, educating development teams, and continuously applying Agile product management principles to improve the platform based on user feedback.
Skills: Technical writing, training, experience with documentation tools (Markdown, Confluence), Agile methodologies (Scrum, Kanban), backlog management, user story creation, and stakeholder communication.
Importance: Effective platform evangelism and documentation are critical for ensuring that internal users (especially development teams) understand and fully leverage the platform's capabilities. By incorporating Agile product management concepts, such as iterative development, prioritization, and regular feedback loops, the platform engineering team can continuously improve the platform based on the evolving needs of its users. This approach helps identify the most impactful features or enhancements, ensuring that the platform aligns with the business’s strategic goals and provides real value to its users. Additionally, Agile practices foster a culture of continuous improvement, where user feedback drives platform evolution, ultimately leading to higher user satisfaction and better overall outcomes.

Whew! This is a description of a team of superhumans! Any organization that can create a team with these skills is exceptional. The list might be so daunting that it makes the whole idea of a platform seem impossible. However, there are two key points to bear in mind:

Gartner has identified that an ideal platform will likely combine several sub-platforms, simplifying the connections from CI/CD pipelines through container orchestration to infrastructure or cloud primitives. By leveraging managed services, whether managed Kubernetes from cloud providers, bare metal Kube/Linux combinations (like Talos Linux from Sidero Labs), or solutions like Portainer Business Edition, organizations can reduce the need for in-house expertise in areas like cluster management, security, and governance. This allows existing IT teams to deliver the functionality of a world-class platform without the overhead.
An increasing number of specialist partners can offer PEaaS (Platform Engineering as a Service). Available in various service levels and delivery configurations, PEaaS allows an organization to either jump-start its platform with an external capability or free its own teams to add value to the business, leaving the heavy lifting of managing and curating the platform to experts.

Above all, a platform engineering team needs to provide a robust foundation. This allows development teams to focus on delivering features and value to users without being bogged down by infrastructure or operational concerns.

Remember, the platform needs to start small and iterate (kayak, not cruise ship). A combination of internal leadership and outsourced expertise may be the ideal starting point for an effective platform.