The AI market is as competitive as any I have seen. When organizations look to implement the latest AI model or agent platform, many skip over the infrastructure-building required for successful deployment. This instinct is understandable – teams want to move quickly, deliver business impact and avoid falling behind in a fast-paced market. But models and frameworks only deliver value over time if they sit on a foundation built for production, not just initial deployment. As AI evolves from models to copilots to increasingly autonomous agents, the systems behind them must also evolve to support reliable, coordinated behavior at scale.
This foundation may not be as exciting as a new model release, but it becomes essential once you deploy AI broadly across an organization and allow access to enterprise tools and data. To responsibly build and scale this foundation, we need interoperable frameworks, shared protocols and secure, community-driven innovation. The agentic ecosystem will not scale or meet developer needs for reliability, security and consistency in isolated, proprietary silos.
The most important lessons from scaling cloud systems—shared standards and open-source community innovation—will be directly relevant to the AI era.
I’m seeing many of the same patterns emerge that we saw when we built Kubernetes: a community converging on shared interfaces and operational patterns that make it possible to run distributed systems reliably at scale. Over the last decade, workloads have shifted from traditional web applications to AI-native applications, but the underlying operational constraints have remained the same.
Lessons from scaling the cloud
To understand what this looks like in practice, we can look at how we learned to operate distributed systems in the cloud. While AI introduces unique complexity, the operational shape is familiar. In distributed environments, feedback is slower, failures are harder to diagnose and system-wide updates are difficult to roll out safely. Together, these make it more likely that unnoticed failures accumulate into systemic instability. Those constraints shaped the cloud, and they shape production AI systems as well.
Kubernetes didn’t just make it possible to run containers: it addressed the harder problem of how to change live systems without breaking them. The solution wasn’t a single tool, but a set of operational patterns such as health checks, controlled rollouts and a consistent way to describe, review and manage change. The definition of health also became more flexible, so users could evolve what a “healthy” application means over time without leaving familiar contexts.
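To make the pattern concrete, here is a minimal sketch of a controlled rollout gated by a health check. It is not how Kubernetes implements rollouts; the names `staged_rollout`, `RolloutResult` and the `is_healthy` callback are hypothetical, chosen only to show the shape of the idea: widen exposure in stages, check health at each stage and revert to zero exposure on the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: bool       # True only if every stage passed its health check
    final_percent: int    # Traffic share left on the new version when we stopped
    history: List[int]    # Stages attempted, in order


def staged_rollout(
    stages: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> RolloutResult:
    """Widen exposure stage by stage; roll back to 0% on the first unhealthy stage.

    `is_healthy` is a hypothetical callback that inspects the system at the
    given traffic percentage and reports whether it is safe to continue.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")

    history: List[int] = []
    for percent in stages:
        history.append(percent)
        if not is_healthy(percent):
            # Reversible by design: a failed check returns all traffic
            # to the previous version.
            return RolloutResult(completed=False, final_percent=0, history=history)
    return RolloutResult(completed=True, final_percent=stages[-1], history=history)
```

A caller might use stages such as `[5, 25, 100]`. The point of the design is that the health definition is passed in, so a team can change what “healthy” means without changing how the rollout itself works.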
Another important lesson is the value of good defaults. Letting every team define its own patterns forces every operator to become a system expert, which does not scale. When every team takes its own approach, the subtle differences between their choices make it impossible to build standardized tools that work for everyone. This is why modern AI systems need to provide best practices and good defaults while still allowing flexibility to adapt over time.
The role of the open-source community in shaping standards
Most organizations treat AI as a product launch: ship a model, spot-check outputs and iterate quickly. This works for many features and updates, but it doesn’t work for probabilistic systems, where behavior can drift quietly and without obvious failure modes. AI requires us to move past the mindset that something is either “working” or “broken” and shift to a continuous understanding of output quality.
Open-source communities solved this problem for cloud systems by converging on shared interfaces and patterns. That convergence enables ecosystems of tooling and operational practices that make distributed systems repeatable at scale. AI systems need the same kind of convergence and consistency.
As agents operate across frameworks, clouds and environments, interoperability becomes critical. This means developing standards for the surfaces every team interacts with:
- Interfaces for inference and routing
- Common representation of quality gates and system health
- Clear telemetry and tracing for understanding system behavior
- Auditable identity and permissions that follow across multiple systems
- Standard definitions to describe potential actions and their effects
When standards are in place, organizations can standardize platform defaults, roll out changes gradually and keep rollback paths simple. The good news is that this is an extension of patterns already established in the cloud-native ecosystem rather than a complete rethinking of what we need to build. AI stands on the shoulders of a decade of cloud-native technologies, but we must adapt those technologies to AI-native applications.
What Kubernetes can teach us about reliable AI systems and operating them at scale
Kubernetes worked because it assumed that change is constant within any application or service, and it made that change manageable by making it observable, staged and reversible.
AI systems need the same properties, but with an added dimension: “healthy” also now includes behavior. A model can return responses with low latency and still be wrong in ways that matter. Regressions show up as degraded results, not necessarily errors, which makes them harder to detect.
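As a rough illustration of a health check that includes behavior, the sketch below treats a service as healthy only when it is both fast enough and good enough. The `Sample` type, the thresholds and the `quality` score are all assumptions for the example; in practice the score would come from whatever evaluator a team trusts.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Sample:
    latency_ms: float  # How long the response took
    quality: float     # Hypothetical 0.0-1.0 score from an evaluator


def is_healthy(
    samples: Sequence[Sample],
    max_latency_ms: float = 500.0,  # Assumed threshold, tune per service
    min_quality: float = 0.8,       # Assumed threshold, tune per service
) -> bool:
    """Healthy means responsive AND producing acceptable output."""
    if not samples:
        # No evidence is not the same as healthy.
        return False
    fast_enough = mean(s.latency_ms for s in samples) <= max_latency_ms
    good_enough = mean(s.quality for s in samples) >= min_quality
    return fast_enough and good_enough
```

A model that answers in 50 milliseconds with a quality score of 0.3 fails this check, which is exactly the case a latency-only probe would miss.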
Because of this, “ship it and see” is a poor strategy, especially as agents begin to take on more autonomous roles. Testing a model on one or two prompts is no longer sufficient. Determining whether a change is better or worse requires evaluation across a wide variety of inputs. In practice, this often means both testing at scale with thousands of inputs and testing in production, where a percentage of traffic is sent through the new model and compared against the existing system.
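Both halves of that practice can be sketched in a few lines. The example below is illustrative only: `routes_to_canary` shows one common way to send a fixed share of traffic to a new model (hashing a request ID into 100 buckets), and `compare_models` scores an existing and a candidate model on the same inputs. The `baseline`, `candidate` and `score` callables are hypothetical stand-ins for real models and a real evaluator.

```python
import hashlib
from statistics import mean
from typing import Callable, Dict, Sequence


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically send roughly `canary_percent`% of requests to the new model.

    Hashing the ID means the same request always lands on the same side.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # Bucket in 0..99
    return bucket < canary_percent


def compare_models(
    inputs: Sequence[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score both models on the same inputs and summarize the difference."""
    if not inputs:
        raise ValueError("evaluation needs at least one input")

    baseline_scores = [score(x, baseline(x)) for x in inputs]
    candidate_scores = [score(x, candidate(x)) for x in inputs]
    # Count the inputs where the candidate strictly beats the baseline.
    wins = sum(c > b for b, c in zip(baseline_scores, candidate_scores))
    return {
        "baseline_mean": mean(baseline_scores),
        "candidate_mean": mean(candidate_scores),
        "win_rate": wins / len(inputs),
    }
```

Reporting a win rate alongside the means is a deliberate choice: an average can improve while a meaningful share of individual inputs gets worse, and that is the kind of quiet regression a single number hides.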
Better models alone won’t produce reliable systems. But a focus on intentional, disciplined operations will. The success of AI systems is tied to user inputs and to the outcomes of probabilistic systems. While probabilistic systems aren’t as straightforward to manage as deterministic software, we’ve learned reliability comes from controlled release processes, observability tied to outcome quality and the ability to roll back quickly.
The same lesson applies to operating AI systems at scale: the foundation has to be portable and durable enough for teams to build on for years to come. The fastest way to fail with AI is treating it as a feature you ship instead of a system you operate. As organizations move beyond pilots and into production, the bar shifts from “it works” to “it operates safely.”
That requires a small set of non-negotiable practices:
- Treat every model, prompt and data update as a full production release. If you can’t stage, observe and roll back, you’re not in control.
- Measure full system behavior, not just health. Uptime and latency won’t tell you when output quality is degrading.
- Design for safe failure. Build fallbacks, guardrails and clear escalation paths before the system is under load.
- Standardize shared surfaces. Common interfaces, telemetry and release patterns are how operators build muscle memory.
- Reuse proven patterns. Bad patterns create system failures. Reusable, open patterns reduce surprise.
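The “design for safe failure” practice in particular is easy to state and easy to skip, so here is a minimal sketch of what it can look like. Everything named here is hypothetical: `answer_safely` tries a primary model, falls back to a second one if the first errors out or fails a guardrail check, and escalates rather than returning unchecked output if both fail.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    text: Optional[str]  # None when nothing safe could be produced
    source: str          # "primary", "fallback" or "escalated"


def answer_safely(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    passes_guardrail: Callable[[str], bool],
) -> Outcome:
    """Try each model in order; never return output that fails the guardrail."""
    for source, model in (("primary", primary), ("fallback", fallback)):
        try:
            text = model(prompt)
        except Exception:
            # A failing model should degrade the service, not take it down.
            continue
        if passes_guardrail(text):
            return Outcome(text=text, source=source)
    # Both paths failed: hand off to a human or a safe default response.
    return Outcome(text=None, source="escalated")
```

The escalation path is decided before the system is under load, which is the whole point: the caller always gets an `Outcome` and always knows where it came from.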
We don’t need to invent a new operational philosophy for AI. We need to apply what Kubernetes and the cloud-native ecosystem already established: standardize where it matters, make change controlled and make system behavior observable. If we apply those lessons early, we avoid relearning them later under production pressure. AI is moving at a fast pace, and we must ensure we’re ready for continued innovation.
This article is published as part of the Foundry Expert Contributor Network.