Variation as Risk

Why variation is anathema to system stability.

Jan 27, 2026

A hospital emergency department designed for an average of 50 patients per day will seem perfectly sized until the day 73 patients arrive. A software system built to handle typical API response times of 100 milliseconds will work flawlessly until a single request takes 10 seconds and blocks everything behind it. A manufacturing line balanced for steady component supply will run smoothly until a shipment is delayed and suddenly every downstream operation must wait.

These aren’t rare edge cases. They are not black swans. They are infrequent but entirely expected conditions that all systems face, not exceptional circumstances. They’re the inevitable consequences of a fundamental truth about systems: variability kills performance, reliability, and safety.

Most systems are designed around averages. Average demand, average processing time, average resource availability. But life doesn’t work like that. We don’t just experience the average every time. The actual events fluctuate around the average, often in a nonsymmetrical, irregular and unpredictable manner. In that gap between the designed-for average and the experienced reality lies one of the most underestimated sources of systemic risk.

Two Faces of Variability

Not all variability is created equal. Understanding the distinction between types of variation is essential to managing it as a risk source.

W. Edwards Deming spent much of his career helping organizations understand the difference between common cause variation (random, inherent to the process) and special cause variation (assignable to specific factors). Common cause variation is the natural fluctuation in any process, the fact that no two manufactured parts are exactly identical, no two customers take exactly the same time to serve, no two network requests traverse exactly the same path.

This variation is inherent. It can be reduced through process improvement, but it can never be eliminated entirely. The human heart doesn’t beat at exactly the same interval every time. Temperature fluctuates throughout the day. Demand varies by day of week, weather, and a thousand other factors.

The key takeaway is to design systems that accommodate this inherent variability, including the extremes of these fluctuations. When we design for the average and treat variation as an anomaly, we guarantee that the system will regularly underperform.

Structural Variation: The Amplified Signal

More dangerous is structural variation: variation that is introduced or amplified by the system itself. The best-known example is the bullwhip effect in supply chains, where small variations in retail demand create increasingly wild swings in manufacturer orders. The famous MIT Beer Game, for which I have created a simulation, is a great demonstration of this effect in practice.

Structural variation emerges from feedback loops, interdependencies, and batching effects, all sources of variation themselves. A customer service system that batches requests for processing every hour doesn’t just delay responses but also creates spiky demand on downstream systems. A production process that responds to every fluctuation in demand creates instability in its own supply chain.
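To make the amplification concrete, here is a minimal sketch of a bullwhip-style chain. It is a toy, not the Beer Game itself: the ordering rule (order what you saw, plus a correction for the latest change) and the overreaction parameter are assumptions chosen only to show how each tier magnifies the signal it receives.

```python
def upstream_orders(demand, overreaction=1.0):
    """Orders a tier places: observed demand plus a correction for the latest change.

    The rule and the `overreaction` factor are illustrative assumptions.
    """
    orders = []
    previous = demand[0]
    for current in demand:
        # Chase the trend: order extra when demand rose, less when it fell.
        order = current + overreaction * (current - previous)
        orders.append(max(0.0, order))  # a tier cannot place a negative order
        previous = current
    return orders


# Customer demand steps from 4 to 8 units once, then stays flat.
customer = [4.0] * 5 + [8.0] * 10

retailer = upstream_orders(customer)
wholesaler = upstream_orders(retailer)
factory = upstream_orders(wholesaler)

peaks = [max(series) for series in (customer, retailer, wholesaler, factory)]
print(peaks)  # [8.0, 12.0, 20.0, 36.0]
```

A single step of 4 units at the customer becomes a peak order of 36 at the factory, even though nobody in the chain did anything unreasonable by their own local logic.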

Deming’s point about tampering (making adjustments to a stable process in response to common cause variation), best exemplified in the funnel experiment, is really a warning about creating structural variation. When managers adjust staffing, inventory, or production rates in response to normal fluctuation, they often inject more variability than they remove. The key is to manage the system like a surfer manages a wave, not by trying to dictate what the water does, but by navigating through it.
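The funnel experiment is easy to simulate. The sketch below compares leaving the funnel alone with the tampering rule of moving the funnel to compensate for the last miss. The Gaussian noise, the one-dimensional target, and the seed are assumptions made for illustration.

```python
import random
import statistics


def funnel(n_drops, tamper, sigma=1.0, seed=42):
    """Return landing positions relative to a target at 0 (one dimension)."""
    rng = random.Random(seed)
    aim = 0.0
    landings = []
    for _ in range(n_drops):
        # Each drop lands at the aim point plus common cause noise.
        landing = aim + rng.gauss(0.0, sigma)
        landings.append(landing)
        if tamper:
            # Tampering: shift the funnel to cancel out the last miss.
            aim -= landing
    return landings


hands_off = statistics.pvariance(funnel(20_000, tamper=False))
tampered = statistics.pvariance(funnel(20_000, tamper=True))
ratio = tampered / hands_off
print(f"variance ratio (tampered / hands off): {ratio:.2f}")
```

Under these assumptions the ratio settles near 2: each correction adds the previous drop's noise to the next one, so the well-meaning adjustment roughly doubles the variance.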


The Physics of Variability in Systems

Why do systems hate variability so much? The answer lies in some fundamental relationships that govern how flow systems behave.

Little’s Law and Queue Length

Little’s Law states that in a stable system, the average number of items in the system equals the arrival rate multiplied by the average time items spend in the system. This seems like an abstract formula until you understand its implications for variability.

When arrival rates vary but processing capacity stays constant, queue length doesn’t vary proportionally; it grows nonlinearly, and ever more steeply as the system approaches capacity. A 20% increase in arrival rate doesn’t create a 20% longer queue; near capacity it can create queues that are 200% or 500% longer.

This is why emergency departments experience such dramatic swings in wait times. A modest increase in patient arrivals pushes the system toward its capacity limit, where queuing behavior becomes nonlinear. The same patients, receiving the same treatments, experience wildly different wait times based solely on when they arrived.
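A small calculation shows the effect. Little’s Law on its own only relates the averages; to see the nonlinearity you also need a queueing model. The sketch below assumes the simplest one, an M/M/1 queue (one server, random arrivals, exponentially distributed service times), and the rates are made-up numbers.

```python
def mm1_in_system(arrival_rate, service_rate):
    """Average number in an M/M/1 system: L = rho / (1 - rho)."""
    rho = arrival_rate / service_rate
    if rho >= 1:
        raise ValueError("unstable: arrivals meet or exceed capacity")
    return rho / (1 - rho)


def time_in_system(arrival_rate, service_rate):
    """Little's Law rearranged: W = L / arrival_rate."""
    return mm1_in_system(arrival_rate, service_rate) / arrival_rate


service_rate = 10.0                       # hypothetical: patients per hour
base = mm1_in_system(7.5, service_rate)   # 75% utilization
busy = mm1_in_system(9.0, service_rate)   # 20% more arrivals, 90% utilization
growth = busy / base

print(f"in system at 75%: {base:.1f}, at 90%: {busy:.1f}, growth: {growth:.1f}x")
```

In this model, a 20% rise in arrivals (from 7.5 to 9 per hour) triples the average number of patients in the system, from 3 to 9.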

Utilization and Response Time

As system utilization increases, response time doesn’t increase linearly. Instead, it curves sharply upward, growing without bound as utilization approaches 100%. At 70% utilization, a system might respond quickly and predictably. At 85% utilization, response times become erratic. At 95% utilization, the system is effectively broken: even small bursts produce waits many times longer than the work itself.

This is counterintuitive because slack seems wasteful. Why not run systems at 95% utilization and maximize efficiency? Because at high utilization, any variation in arrival rate or processing time creates catastrophic delays, and there is no spare capacity with which to recover. This kind of “catch-up buffer” is the only way airports can get back on track after weather and mechanical delays.

This is why Theory of Constraints emphasizes capacity buffers at non-constraint resources. The goal isn’t to maximize utilization everywhere; it’s to maximize throughput through the constraint while protecting the system from variability everywhere else.
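One standard way to put numbers on this is Kingman’s approximation for a single-server queue, which estimates the average wait from utilization and from how variable arrivals and service times are. The sketch below is an approximation under that model, not a measurement of any real system; ca2 and cs2 are the squared coefficients of variation of interarrival and service times.

```python
def kingman_wait(utilization, service_time, ca2=1.0, cs2=1.0):
    """Approximate mean wait in queue for a single-server queue (Kingman).

    wait ~ (u / (1 - u)) * ((ca2 + cs2) / 2) * service_time
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return (utilization / (1 - utilization)) * ((ca2 + cs2) / 2) * service_time


for u in (0.70, 0.85, 0.95):
    wait = kingman_wait(u, service_time=1.0)
    print(f"{u:.0%} utilized: wait is roughly {wait:.1f} service times")

# Halving variability halves the wait at the same utilization.
calmer = kingman_wait(0.95, service_time=1.0, ca2=0.5, cs2=0.5)
```

The formula has two levers: the utilization term, which blows up near 100%, and the variability term, which scales the whole curve. With zero variability the wait disappears entirely, which is why a perfectly regular system can run close to full while a variable one cannot.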

The Convoy Effect

When fast processes must wait for slow ones, the system propagates variability. If nine API calls return in 100ms but one takes 10 seconds, everything waiting for the complete set must wait 10 seconds. The slow outlier creates a convoy of delayed requests.

This happens in manufacturing when one slow machine paces an entire line. It happens in software when one slow database query blocks a transaction. It happens in healthcare when one complex patient case occupies a shared resource that many simpler cases need. The slow event itself is not the key issue; the issue is that a single slow event can send shock waves up and down the system. Variability doesn’t stay local; it propagates.
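The API example can be written down in a few lines. The sketch assumes the calls run in parallel and are independent of each other, and the 1% slow-call rate is a made-up figure.

```python
def fan_out_latency(latencies_ms):
    """A request that waits on all parallel calls finishes with the slowest one."""
    return max(latencies_ms)


def chance_of_slow_call(n_calls, p_slow):
    """Probability that at least one of n independent calls is slow."""
    return 1 - (1 - p_slow) ** n_calls


calls = [100] * 9 + [10_000]       # nine fast calls, one slow outlier
total = fan_out_latency(calls)     # 10000 ms: the outlier sets the pace

for n in (1, 10, 100):
    print(f"{n} calls: {chance_of_slow_call(n, p_slow=0.01):.1%} chance of a slow one")
```

The second function is the uncomfortable part: if 1 call in 100 is slow, a request that fans out to 100 calls hits at least one slow call more often than not.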

Real-World Impacts: Where Variability Becomes Risk

Healthcare: The Emergency Department Boarding Crisis

Emergency departments (EDs) nationwide face a chronic problem: patients who need admission but must wait in the ED because no inpatient bed is available. This “boarding” problem looks like a capacity issue, which suggests an obvious fix: just add more beds.

But analysis reveals that variability, not average capacity, is the primary driver. Hospital occupancy varies significantly by day of week, season, and even time of day. Discharge times vary widely based on patient complexity, doctor availability, and administrative processing.

The result: even when average occupancy is 75%, actual occupancy regularly hits 95%+, leaving no surge capacity. The ED becomes a buffer for the entire hospital system’s variability, absorbing all the fluctuation in admission rates and discharge times.

The risk isn’t that the hospital is too small on average. It’s that variability in patient flow is inherent to services and creates periods where the system is overwhelmed, leading to degraded care quality, longer lengths of stay (which increase infection risk), and staff burnout.

Manufacturing: The Hidden Cost of Variation

Consider an automotive assembly line designed for a specific cycle time, say 60 seconds per station. If every operation took exactly 60 seconds, the line would flow smoothly. But operations vary: sometimes 55 seconds, sometimes 65 seconds.

That 10-second variation seems minor, but it creates ripple effects. When a station runs fast, the next station isn’t ready, creating waiting. When a station runs slow, the entire line must wait, and inventory accumulates. Over a shift, these small variations compound into significant throughput loss and quality problems.
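A rough simulation shows how this compounds. It assumes a paced line with no buffers between stations, so every cycle ends only when the slowest station finishes, and it assumes station times are independent and uniformly spread between 55 and 65 seconds. Real lines differ on all three counts.

```python
import random


def mean_line_cycle(n_stations, n_cycles=20_000, low=55.0, high=65.0, seed=1):
    """Average cycle time when the line advances at the pace of its slowest station."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_cycles):
        # The whole line waits for whichever station is slowest this cycle.
        total += max(rng.uniform(low, high) for _ in range(n_stations))
    return total / n_cycles


one = mean_line_cycle(1)
ten = mean_line_cycle(10)
print(f"1 station: {one:.1f}s per cycle, 10 stations: {ten:.1f}s per cycle")
```

Every station still averages 60 seconds, yet under these assumptions a ten-station line averages about 64 seconds per cycle, because on any given cycle some station is almost always running slow.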

Deming’s work with Japanese manufacturers in the 1950s focused heavily on reducing variation in manufacturing processes. Not because variation caused obvious defects, but because it created systemic instability that degraded overall performance in subtle, cumulative ways.

Logistics: The Cascading Delay

A delivery company designs routes based on average travel times. On most days, this works fine. But when variation occurs, say because of construction delays, or weather, or traffic, the entire schedule cascades.

Driver 1 is delayed 20 minutes. They miss their optimal window at Stop 5, arriving during rush hour, adding another 15 minutes. Now they’re 35 minutes behind. At Stop 8, the recipient isn’t available during the rescheduled time, forcing a second attempt. The delay compounds.

Meanwhile, Driver 2’s route depends on completing certain pickups after Driver 1’s deliveries. The delay propagates. By afternoon, the entire distribution network is out of sync, not because of any component failure, but because initial variation amplified through interdependencies.

Why Averages Lie

The fundamental problem with designing for averages is that the average rarely happens. It’s a bit unintuitive, isn’t it? If average demand is 50 units per day, you might experience 50 units only a few days per month. Most days you’ll experience something different—45 or 62 or 38.

This seems like a pedantic distinction until you realize that systems designed for 50 units will be oversupplied on 38-unit days (wasting resources) and undersupplied on 62-unit days (degrading service). The system is never quite right.

Worse, many averages hide important patterns. An average of 50 could mean consistent 48-52 every day, or it could mean 20-80 with wild swings. The system requirements for these two scenarios are completely different, even though the average is identical.

This is why variance measures are as important as mean measures. Knowing that average wait time is 15 minutes is less useful than knowing that wait time ranges from 5 to 45 minutes with high variance. The variance is where the risk lives.
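Two made-up demand series illustrate the point. Both average exactly 50 units per day, but they describe very different systems.

```python
import statistics

# Hypothetical daily demand; both series have a mean of exactly 50.
steady = [48, 52, 50, 49, 51, 50, 48, 52, 51, 49]
swingy = [20, 80, 35, 65, 50, 25, 75, 40, 60, 50]

for name, demand in (("steady", steady), ("swingy", swingy)):
    print(
        f"{name}: mean={statistics.mean(demand)}, "
        f"stdev={statistics.pstdev(demand):.1f}, "
        f"range={min(demand)}-{max(demand)}"
    )
```

A report that shows only the mean cannot tell these two apart. The standard deviation and the range can.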

Strategies for Managing Variability

Since variation is inevitable, the question becomes: how do we design systems that handle it gracefully?

1. Build in Buffers and Slack

Theory of Constraints teaches that buffers are strategic tools, not waste. Buffers absorb variation and prevent it from propagating.

There are three types of buffers:

Capacity buffers: Extra capacity beyond average demand, so spikes don’t overwhelm the system
Time buffers: Extra time built into schedules to absorb delays
Inventory buffers: Extra stock to absorb supply or demand variation

The key is placing buffers strategically, especially before constraints and at integration points. You don’t need buffers everywhere but you need them where variation could disrupt critical flows.
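Sizing a capacity buffer can start with a simple question: how much capacity covers demand on, say, 95% of days? The sketch below assumes daily demand follows a Poisson distribution, which is a convenient textbook assumption; real demand is often more variable than that, so treat the result as a lower bound.

```python
import math


def poisson_capacity(mean_demand, service_level):
    """Smallest capacity c with P(demand <= c) >= service_level, for Poisson demand."""
    pmf = math.exp(-mean_demand)   # P(demand == 0)
    cdf = pmf
    c = 0
    while cdf < service_level:
        c += 1
        pmf *= mean_demand / c     # P(demand == c) from P(demand == c - 1)
        cdf += pmf
    return c


needed = poisson_capacity(50, 0.95)
buffer_units = needed - 50
print(f"capacity for 95% of days: {needed} (buffer of {buffer_units} above the average)")
```

The buffer is the gap between the average and the chosen percentile. Raising the service level buys fewer overloaded days at the cost of more idle capacity, which is exactly the trade-off the buffer is meant to make explicit.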

2. Reduce Variability at the Source

Where possible, reduce inherent variation in processes. This is the core of Deming’s statistical process control: understand what creates variation and systematically eliminate sources.

In healthcare, standardizing admission and discharge times reduces variability in bed availability. In manufacturing, reducing setup times and improving process consistency reduces cycle time variation. In software, implementing rate limiting and circuit breakers prevents cascade failures from variability in downstream services.

The goal isn’t to eliminate all variation but rather to distinguish between necessary and unnecessary variation, then systematically reduce the unnecessary portion.
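As an illustration of the software case, here is a minimal token bucket, one common way to implement rate limiting. The class name, parameters, and the choice to pass the clock in explicitly are all assumptions made to keep the sketch small and testable; production limiters handle concurrency and real clocks.

```python
class TokenBucket:
    """Admit `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start full
        self.last = 0.0                 # time of the previous request, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow(now=0.0) for _ in range(15)]   # 15 requests arrive at once
admitted = sum(burst)                                # 10 pass, 5 are rejected
```

The bucket turns a spiky arrival pattern into a bounded one: the downstream service never sees more than the burst capacity at once, whatever the upstream does.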

3. Smooth Demand Where Possible

Sometimes you can’t control when demand arrives, but often you can influence it. Reservation systems smooth restaurant demand. Appointment scheduling smooths healthcare demand. Batch processing windows smooth computing demand.

The trade-off is between responsiveness (serving demand immediately) and stability (smoothing demand to reduce variability). Different contexts justify different choices, but the choice should be conscious, not accidental.

4. Design for Degradation, Not Failure

Systems that experience variation don’t need to fail; they can degrade gracefully. An e-commerce site experiencing traffic spikes can disable recommendation engines and fancy features while preserving core checkout functionality. An emergency department experiencing surge can activate fast-track protocols and defer non-urgent care.

This requires identifying priority levels within the system. What must continue? What can be deferred? What can be simplified? Then build these degradation modes into the system architecture, not as after-the-fact fixes but as designed behaviors.
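A designed degradation mode can be as simple as a priority table. The feature names and load thresholds below are hypothetical, loosely following the e-commerce example above.

```python
# Hypothetical features with the load (fraction of capacity in use) above which
# each one is switched off. A threshold of 1.00 or more means "never shed".
FEATURES = [
    ("checkout", 1.00),
    ("search", 0.95),
    ("reviews", 0.85),
    ("recommendations", 0.75),   # first to go
]


def enabled_features(load):
    """Return the features to keep running at the given load."""
    return [
        name
        for name, shed_above in FEATURES
        if shed_above >= 1.0 or load <= shed_above
    ]


print(enabled_features(0.50))   # everything on
print(enabled_features(0.80))   # recommendations shed
print(enabled_features(0.97))   # only checkout remains
```

The value of writing the table down is that the order of sacrifice is decided calmly in advance, not improvised in the middle of a traffic spike.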

5. Implement Adaptive Capacity

Rather than sizing for peak demand (expensive) or average demand (risky), build systems that can adapt. This is auto-scaling in cloud computing, floating staff pools in healthcare, flex manufacturing in production.

The challenge is that adaptation itself introduces variation and coordination overhead. You need to adapt smoothly enough to absorb variation without creating new variation through the adaptation process. This requires careful tuning of responsiveness thresholds and rate-of-change limits.
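The two tuning knobs mentioned above, responsiveness thresholds and rate-of-change limits, can be sketched as a single scaling rule. The thresholds, step size, and limits are illustrative assumptions, not recommended values.

```python
import math


def next_capacity(current, utilization, scale_up_at=0.80, scale_down_at=0.50,
                  max_step=2, minimum=1, maximum=20):
    """Choose the next capacity from current utilization.

    Uses a dead band between the two thresholds and a cap on how fast it grows.
    """
    if utilization > scale_up_at:
        # Aim for the upper threshold, but add at most max_step units at a time.
        wanted = math.ceil(current * utilization / scale_up_at)
        target = min(wanted, current + max_step)
    elif utilization < scale_down_at:
        target = current - 1      # shrink slowly to avoid oscillation
    else:
        target = current          # inside the dead band: leave it alone
    return max(minimum, min(maximum, target))


print(next_capacity(10, 0.95))   # scales up
print(next_capacity(10, 0.65))   # normal fluctuation, no change
print(next_capacity(10, 0.30))   # scales down by one
```

The dead band is the anti-tampering device: utilization that wanders between 50% and 80% is treated as normal fluctuation and triggers nothing.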

6. Monitor Variability, Not Just Averages

Dashboards that show only averages miss the story. You need to track:

Variance and standard deviation: How much does the metric vary?
Range: What are the extremes?
Percentiles: What does the 95th percentile look like (not just the median)?
Time-series trends: Is variability increasing?

Alert thresholds should trigger on variance changes, not just mean changes. A stable average with increasing variance often signals an emerging problem that mean-based alerts will miss.
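Here is a minimal sketch of what such a check might look like. The wait times are made up, the percentile uses the simple nearest-rank method, and the alert ratio of 2.0 is an arbitrary starting point.

```python
import math
import statistics


def summarize(samples):
    """Mean, spread, range, and 95th percentile for a list of measurements."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)   # nearest-rank method
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[p95_index],
    }


def variance_alert(baseline, recent, ratio=2.0):
    """Flag when recent variance exceeds baseline variance by more than `ratio`."""
    return statistics.pvariance(recent) > ratio * statistics.pvariance(baseline)


baseline = [14, 15, 16, 15, 14, 16, 15, 15]   # hypothetical wait times, minutes
recent = [5, 25, 8, 22, 15, 3, 27, 15]        # same mean, far wider spread

print(summarize(baseline))
print(summarize(recent))
print("variance alert:", variance_alert(baseline, recent))
```

Both windows average 15 minutes, so an alert on the mean stays silent. The variance check fires because the recent window has spread out dramatically.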

The Cultural Challenge

Managing variability requires a fundamental shift in organizational thinking. Most businesses celebrate efficiency—maximizing utilization, minimizing waste, running lean. But extreme efficiency leaves no room for variation.

A system that achieves 98% efficiency under ideal conditions but collapses to 60% efficiency under variation is less valuable than a system that maintains 85% efficiency regardless of conditions. But the second system will appear wasteful if you only measure it during steady-state periods.

This requires educating stakeholders about the cost of variation. The extra capacity that sits idle most of the time isn’t waste. It reflects your risk tolerance: it’s an insurance policy.

Conclusion: Living with Variation

Like the surfer’s wave, variability is not a problem to be solved; it’s a condition to be managed. No amount of process improvement or systems design will eliminate it entirely. The goal should be to design and construct systems that perform reliably despite variation, not systems that assume variation away.

This means:

Designing for realistic variation, not idealized averages
Building in strategic buffers and slack
Monitoring variance as carefully as means
Creating graceful degradation paths
Understanding that the cost of handling variation is often invisible until you fail to handle it

The risk isn’t in occasional extreme events like Nassim Nicholas Taleb would have you believe—black swans that we can’t predict. The risk is in normal, everyday variation that we know exists but design our systems to ignore. When we build on averages and hope for stability, we guarantee that our systems will regularly encounter conditions they weren’t designed to handle.

Variability is the friction in the machine, the noise in the signal, the turbulence in the flow. It’s always there. The question is whether you’ve built your system to absorb it or whether it will absorb your system.
