Revenue dips. Latency spikes. Alerts fire. The dashboards look fine – until they don’t.
Slack explodes. Ten engineers become twenty. Queries multiply. Everyone starts scanning raw event data at once. And then the system starts to buckle, right when you need it most.
Over the past decade, I’ve worked on large-scale, real-time analytics systems for massive, bursty workloads. First in ad tech and more recently in observability. Across very different domains, the same failure pattern tends to emerge. Platforms that perform well under normal, steady-state conditions degrade under investigative load.
In many cases, this isn’t simply a matter of tuning or operational discipline. It reflects architectural assumptions. Most observability platforms were designed for detection-oriented workloads and not the unpredictable, exploratory way humans investigate incidents in real time.
Where the architecture breaks
Many observability platforms are built around a core assumption that queries will follow normal, predictable patterns. Dashboards, alerts and saved searches reflect known questions about the system.
But incidents aren’t predictable.
During an investigation, workloads shift instantly. Queries become exploratory. Time ranges expand. Filters change constantly. Concurrency spikes as multiple teams dig into the same data.
Architectural assumptions that work well in steady state can begin to show strain. Index-centric systems perform well on known paths. Step outside them, and performance drops quickly. Sub-second queries turn into minutes, concurrency falls off and costs rise.
Over time, teams may begin to limit the scope of analysis or to export data to other systems simply to maintain responsiveness.
This dynamic isn’t primarily about features. It reflects a structural mismatch between how many systems are designed and how investigations actually unfold.
What “event-native” actually means
Over the past decade, several large-scale real-time analytics systems, including Apache Druid (a project I’ve worked on closely), were designed to handle highly bursty, event-driven workloads.
These environments required a different architectural model.
Rather than optimizing around predefined views or tightly coupled indexing structures, event-native systems treat raw, immutable events as the primary unit of storage and analysis. Every request, error and interaction is preserved as an event and remains available for exploration.
Data is stored in column-oriented formats designed for large-scale scanning and high-cardinality queries. Instead of shaping the data upfront for specific access patterns, the system is built to support evolving questions directly against the event stream.
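As a rough illustration of why that layout matters, here is a minimal sketch of column-oriented event storage. The field names and layout are invented for illustration, not any specific system’s format; the point is that a filter over one high-cardinality field only ever touches that one column.

```python
# Minimal sketch of column-oriented event storage (illustrative only).
# Each attribute is stored as its own list, so a scan over one field
# never has to read the others.
events = {
    "timestamp":  [1, 2, 3, 4],
    "service":    ["api", "api", "checkout", "api"],
    "user_id":    ["u1", "u2", "u3", "u1"],   # high-cardinality column
    "latency_ms": [12, 480, 35, 510],
}

def scan(columns, predicate_col, predicate):
    """Full scan: return row indexes where the predicate holds for one column."""
    return [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]

# "Which events were slow?" -- no precomputed index, just a column scan.
slow = scan(events, "latency_ms", lambda v: v > 100)
print([(events["timestamp"][i], events["user_id"][i]) for i in slow])
```

Real columnar formats add compression, encoding and vectorized execution on top, but the access pattern is the same: the question picks the columns, not the other way around.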
The difference becomes clear during an incident.
Imagine a latency spike affecting a subset of users. Engineers may need to pivot across user ID, region, service version or request path — combining dimensions that were not anticipated in advance.
In an event-native system, those pivots can occur directly against stored event data without rebuilding indexes or reshaping datasets for each new question. Multiple teams can run these queries concurrently, even across large time ranges, without the system degrading.
That’s the core shift: you’re no longer constrained by how the data was modeled upfront. You can investigate what actually happened, in real time, at scale.
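To make the pivot concrete, here is a hedged sketch: aggregating a latency metric over a dimension combination chosen at query time, directly over raw events, with no precomputed index or rollup. The event shape and field names are invented for illustration.

```python
from collections import defaultdict

# Raw, immutable events (fields invented for illustration).
events = [
    {"region": "eu", "version": "v2", "latency_ms": 520},
    {"region": "eu", "version": "v1", "latency_ms": 40},
    {"region": "us", "version": "v2", "latency_ms": 35},
    {"region": "eu", "version": "v2", "latency_ms": 610},
]

def pivot(events, dims, metric):
    """Group raw events by any combination of dimensions chosen at query time."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[d] for d in dims)].append(e[metric])
    return {k: max(vs) for k, vs in groups.items()}  # e.g. worst-case latency

# The dimension combination was not anticipated in advance:
print(pivot(events, ("region", "version"), "latency_ms"))
```

Swapping `("region", "version")` for `("user_id", "request_path")` requires no schema change and no reindexing; the next question is just another scan.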
Cloud economics changed the rules, but architectures stayed the same
Many observability architectures were designed for an era when storage was fixed (and expensive). That’s no longer the case. In the cloud, storage is abundant and cheap. Compute is elastic, which is often the real cost driver. You can store years of event data in object storage at a fraction of the cost of running always-on compute clusters. Yet many observability platforms still tightly couple storage, indexing and query compute as if nothing changed.
What does this mean in practice? You pay peak compute prices just to keep data available and accessible. Observability becomes a series of bad trade-offs between cost, retention and performance.
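A back-of-the-envelope comparison makes the gap visible. Every number below is an invented, illustrative placeholder, not any vendor’s actual rate; the point is the order-of-magnitude difference between keeping telemetry in object storage and keeping it resident in an always-on indexed cluster.

```python
# Back-of-the-envelope retention cost comparison.
# All prices are assumed, illustrative placeholders -- not real vendor rates.
object_storage_per_tb_month = 23.0    # assumed object-storage rate ($/TB/month)
hot_cluster_per_tb_month = 400.0      # assumed always-on hot-cluster rate

retained_tb = 100  # e.g. a year of telemetry at this hypothetical volume

print("object storage:  $", retained_tb * object_storage_per_tb_month)
print("always-on cluster: $", retained_tb * hot_cluster_per_tb_month)
```

Under these assumed rates, the same 100 TB costs $2,300 a month to retain versus $40,000 a month to keep hot; decoupling lets you pay the second price only when an investigation actually demands it.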
All-in-one observability platforms can be powerful, but they’re also rigid. When storage and compute scale together, you lose control over economics.
Monolithic architectures shine in steady state, but when incidents are triggered, they quickly become painfully expensive, painfully slow or both.
Why observability needs a dedicated data layer, not another all-in-one platform
For years, consolidation has been a common response in observability – one more all-in-one platform promising simplicity.
That approach can reduce surface complexity in the short term. Over time, however, tightly coupled systems can limit flexibility. As scale increases, storage, compute and visualization begin to compete for resources inside the same architecture.
Business intelligence learned this lesson decades ago. What started as tightly coupled stacks separated into a modular architecture where storage, transformation and visualization became independent layers. That separation created leverage, and companies like Snowflake, Databricks, Fivetran and Tableau emerged by focusing on distinct parts of the stack.
Each layer could innovate independently. Storage could scale without changing dashboards or workflow, compute engines could evolve without changing ingestion and visualization tools could compete on experience rather than infrastructure.
Observability is next.
One architectural response is the introduction of a purpose-built data layer that sits beneath existing observability tools such as Splunk, Grafana or Kibana. By separating data storage from interaction and analysis, organizations can retain large volumes of telemetry while scaling compute based on investigative demand.
It means longer retention without constant peak compute costs. It means bursty, investigative workloads don’t collapse the system and multiple teams can dig into the same event stream without stepping on each other. It aligns the architecture with how observability admins and engineers actually work during incidents.
And critically, it treats observability as a data infrastructure problem, not just a tooling problem.
This shift breaks the lock between data and tools
In tightly integrated observability platforms, data is often bound to a specific query engine or user interface. That coupling can simplify adoption, but it also limits long-term flexibility. Storage decisions, retention policies and performance characteristics become tied to a single vendor’s architecture.
When the underlying event data layer is open, durable and scalable, organizations gain optionality. The same telemetry can be analyzed across multiple tools. Retention strategies can evolve independently of dashboards. New query engines or visualization systems can be adopted over time without migrating years of historical data.
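A toy sketch of that decoupling follows. Everything here is invented for illustration (the line-delimited JSON format, the two “engines”): the essential property is that events live in an open, durable format that no single query tool owns, so independent engines can read the same telemetry without a migration.

```python
import json
import os
import tempfile

# The "data layer": events persisted in an open, line-delimited format
# (format and fields chosen purely for illustration).
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(path, "w") as f:
    for e in [{"svc": "api", "err": True},
              {"svc": "api", "err": False},
              {"svc": "db", "err": True}]:
        f.write(json.dumps(e) + "\n")

def read_events(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Two independent "engines" analyze the same stored telemetry.
def error_count(path):       # hypothetical engine A
    return sum(e["err"] for e in read_events(path))

def services(path):          # hypothetical engine B
    return sorted({e["svc"] for e in read_events(path)})

print(error_count(path), services(path))
```

Retiring engine A or adopting an engine C changes nothing about the stored events; the data outlives any one tool that reads it.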
That’s why new architectural patterns are emerging in large-scale deployments – systems designed for unpredictable query shapes and deep exploratory analysis. Architectures that separate storage, compute and indexing, and that treat observability as a data problem first.
When data is stored in open, scalable systems rather than locked inside a single platform, organizations gain flexibility. They can analyze the same data across multiple tools, adopt new technologies over time and avoid being constrained by the limitations or cost structures of any one vendor.
What the next decade of observability will look like
Telemetry volumes will continue to grow. Distributed systems introduce more surface area. AI workloads generate additional signals and amplify data scale. Investigations are becoming more collaborative and more exploratory.
In that environment, the defining characteristic of observability systems will not be the number of features they expose, but the architecture beneath them.
When Slack explodes and dashboards slow down in answering the right questions (or stop answering them entirely), the architecture underneath will determine whether teams find the root cause in minutes or watch the system buckle all over again.
This article is published as part of the Foundry Expert Contributor Network.