Experimentation · Metrics · Product judgment

Experimentation Is Not a Statistics Ritual

How stronger hypotheses, closer metrics, power, guardrails, and experiment design make product decisions clearer.

Jean John · May 2026 · 13 min read

In brief

Experiments matter when they improve product decisions under uncertainty, not when they simply output a p-value.

A useful experiment starts with a plausible causal hypothesis, uses metrics close to the intervention, is powered to detect an effect that actually matters, and defines the decision before results are visible.

In marketplace and operational systems, the design matters even more because users, orders, couriers, merchants, and supply pools can influence each other. When SUTVA does not hold, a simple user-level A/B test may produce a misleading readout.

The real standard is whether the evidence is strong enough to decide what should be launched, changed, scaled, or stopped.

Most weak experiments do not fail because teams lack tools.

They fail because the test is disconnected from the decision it is meant to support.

Teams can still follow the mechanics: define a variant, run an A/B test for two weeks, check whether the p-value is below 0.05, and then decide whether to launch.

The mechanics may look rigorous while still producing weak product decisions.

A product experiment should reduce enough uncertainty for the team to make a better call: launch, change, scale, or stop.

The quality of that call depends on much more than the final readout. It depends on the strength of the hypothesis, the relevance of the metrics, the design of the test, the minimum effect worth detecting, the duration of the experiment, and the judgment applied when interpreting the result.

Many experiments are already compromised before analysis begins.

They fail when the hypothesis is weak.
They fail when the metric is too far removed from the change.
They fail when the test is underpowered.
They fail when success is defined after seeing the result.
They fail when teams use a simple A/B test in a system where users, orders, couriers, merchants, or supply pools influence each other.

Statistics matter, but clarity about the decision matters first.

Start With a Plausible Hypothesis, Not a Possible Idea

One of the most common mistakes in experimentation is confusing something that is possible with something that is plausible.

It is possible that changing a button, message, ETA, ranking rule, incentive, or notification will improve a business metric. But possibility is not enough reason to run an experiment.

Strong experiments start with a plausible causal hypothesis.

That hypothesis should come from evidence: customer research, funnel data, operational observations, support contacts, behavioral patterns, marketplace diagnostics, or qualitative feedback from frontline teams.

For example, a weak hypothesis would be:

Showing more order status information will improve the customer experience.

That is too broad. It does not explain what customer problem exists, why the proposed change should solve it, or what metric should move.

A stronger hypothesis would be:

Customers contact support when they do not understand why a courier has not yet been assigned. If we explain the expected assignment window before the order is actually at risk, we can reduce anxiety-driven contacts without changing the underlying dispatch logic.

It connects customer behavior, system state, intervention, and expected outcome.

Teams do not need to test every idea; they need to test ideas with a credible reason to work.

Choose Metrics Close to the Intervention

Metric selection is where many experiments become noisy or inconclusive.

Teams often test a feature against a broad business outcome simply because the metric is important. But an important metric is not always the right primary metric.

If the intervention changes customer understanding, the most sensitive metric may be contact rate, cancellation rate, order-tracking engagement, or repeat visits to the status screen.

If the intervention changes dispatch logic, the right metric may be assignment time, courier wait time, pickup delay, batching quality, lateness, or delivery time reliability.

If the intervention changes pricing, the right metric may be conversion, contribution margin, subsidy burn, demand elasticity, or order mix.

The metric should be close enough to the intervention that a movement can reasonably be attributed to the change.

This does not mean outcome metrics are irrelevant. Revenue, retention, order frequency, marketplace efficiency, and customer lifetime value still matter. But they may be too far downstream to serve as the primary readout for every feature-level experiment.

Useful readouts usually have three metric layers:

Primary metric: the main behavior the experiment is expected to influence.
Guardrail metrics: measures that should not materially worsen.
Longer-term outcome metrics: broader business or customer outcomes to monitor, even if the experiment is not powered to detect movement in them.

For example, if a team is testing proactive order-status messaging, the primary metric could be support contact rate. Guardrails could include cancellation rate, refund rate, delivery lateness, and app reopen behavior. Broader outcomes could include repeat order rate or customer satisfaction.

This distinction matters because product teams often declare experiments “flat” when they were simply looking too far away from the intervention.

Do Not Pick Two Weeks Because It Sounds Reasonable

A surprisingly common experimentation habit is to run tests for a standard duration: one week, two weeks, or one business cycle.

Those durations are often inherited operating habits, not decisions tied to power or effect size.

Experiment duration should be shaped by statistical power, baseline rates, expected effect size, traffic, variance, seasonality, and the minimum detectable effect.

Set duration by the smallest effect that matters and the sample size required to detect it reliably.

This is where power and MDE matter.

Power is the probability that the test detects an effect of a given size when that effect truly exists. Many teams use 80% power as a practical default.

MDE, or minimum detectable effect, is the smallest effect size the experiment is designed to reliably detect.

If the experiment is underpowered, a meaningful effect may fail to reach statistical significance. If the MDE is set unrealistically low, the required duration may become impractically long. If the MDE is set too high, the team may miss smaller but commercially meaningful improvements.
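To make the trade-off between MDE and duration concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function names, the 8% baseline contact rate, the 0.8-point MDE, and the 5,000 eligible orders per day are illustrative assumptions, not figures from any real experiment. Real planning would also account for seasonality, variance reduction, and more than two arms.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate units per arm for a two-sided two-proportion z-test.

    baseline_rate: control conversion or contact rate, e.g. 0.08
    mde_abs: smallest absolute change worth detecting, e.g. -0.008
    """
    if mde_abs == 0:
        raise ValueError("MDE must be non-zero")
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must stay strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def days_needed(n_per_arm, eligible_units_per_day, arms=2):
    """Days of traffic required to fill every arm (ignores ramp-up)."""
    return math.ceil(n_per_arm * arms / eligible_units_per_day)


# Hypothetical inputs: 8% baseline contact rate, 0.8-point reduction as MDE.
n = sample_size_per_arm(baseline_rate=0.08, mde_abs=-0.008)
duration = days_needed(n, eligible_units_per_day=5000)
```

Halving the MDE roughly quadruples the required sample, which is why the choice of effect size drives duration far more than any calendar convention does.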

The product call is choosing an effect size that would actually change the roadmap decision.

A 0.1% improvement may be irrelevant for a small feature, but meaningful at massive scale. A 2% improvement may be material for conversion, but insufficient if the implementation introduces operational complexity. A small improvement in on-time delivery may be strategically important if it unlocks a stronger customer promise.

Experiment duration should reflect the decision the team needs to make.

Define the Decision Before Seeing the Result

A good experiment has a decision rule before it starts.

Without this, teams are vulnerable to interpretation drift. They look at the result, search for a metric that moved, rationalize segment cuts, and retrofit the conclusion to what they want to believe.

Before launch, teams should agree on the primary metric, guardrails, launch threshold, stopping conditions, priority segments, test duration, and the default action for positive, flat, or negative outcomes.

This does not remove judgment. It protects judgment from bias.

For example:

We will launch if contact rate falls by at least X without worsening cancellation, refunds, or delivery reliability beyond agreed guardrails.

That is much stronger than:

Let's run it and see what happens.

The best teams are not rigid. They can still learn from unexpected results. But they separate pre-defined decision criteria from exploratory analysis.

The primary analysis answers the launch decision. Exploratory analysis generates the next hypothesis.
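One way to keep the pre-defined criteria separate from exploratory analysis is to write the decision rule down as code before the test starts. The sketch below is an illustration under stated assumptions: the `DecisionRule` class, the `decide` function, the thresholds, and the choice to require the lower bound of the confidence interval to clear the launch threshold are all hypothetical conventions a team would have to agree on for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    # Smallest improvement in the primary metric that justifies launch,
    # expressed so that positive means better (e.g. contact rate reduction).
    min_primary_improvement: float
    # Guardrail metric name -> largest tolerated worsening (positive = worse).
    guardrail_limits: dict


def decide(rule, primary_improvement_ci, guardrail_worsening):
    """Map a readout to a pre-agreed action.

    primary_improvement_ci: (low, high) interval for the primary improvement.
    guardrail_worsening: observed worsening per guardrail metric. A missing
    guardrail raises KeyError on purpose: no readout, no decision.
    """
    breached = [
        name
        for name, limit in rule.guardrail_limits.items()
        if guardrail_worsening[name] > limit
    ]
    if breached:
        return "stop"
    low, high = primary_improvement_ci
    if low >= rule.min_primary_improvement:
        return "launch"
    if high < 0:
        return "stop"      # the whole interval points the wrong way
    return "iterate"       # flat or inconclusive: improve or retest


# Hypothetical rule agreed before launch.
rule = DecisionRule(
    min_primary_improvement=0.05,
    guardrail_limits={"cancellation_rate": 0.01, "refund_rate": 0.01},
)
action = decide(
    rule,
    primary_improvement_ci=(0.06, 0.12),
    guardrail_worsening={"cancellation_rate": 0.002, "refund_rate": 0.0},
)
```

Segment cuts and surprising secondary movements stay outside this function. They feed the next hypothesis, not the launch call.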

Choose the Right Experiment Design for the System

Not every product change should be tested with a simple user-level A/B test.

A/B tests work well when one user's treatment does not affect another user's outcome. But many product systems are not that clean.

Marketplaces, logistics systems, ride-hailing platforms, food delivery networks, pricing systems, ranking systems, and dispatch systems often have interference effects.

If one group receives a treatment, it may change the experience of the control group.

For example, if a dispatch algorithm changes courier assignment for treated orders, untreated orders may also be affected because both groups draw from the same supply pool.

If a pricing experiment changes demand in one area, it may affect supply availability, courier behavior, or customer conversion in nearby areas.

If a ranking experiment changes visibility for one set of merchants, it may change demand distribution across the marketplace.

This is where SUTVA becomes important.

SUTVA stands for the Stable Unit Treatment Value Assumption. In simple terms, it means the outcome for one unit should depend only on its own treatment, not on the treatment assigned to other units.

Many standard A/B tests quietly rely on this assumption. But in marketplace and operational systems, SUTVA is often violated.

A customer, order, courier, merchant, store, restaurant, city zone, or supply pool may not be independent. Treatment effects can spill over. The control group may no longer represent a clean counterfactual.

Experimentation is still possible, but the design has to match the system. Depending on the problem, that may mean switchback tests, geo experiments, cluster randomization, phased rollouts, factorial designs, or shadow evaluations.
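As a rough illustration of two of those designs, the sketch below assigns treatment by zone (cluster randomization) or by zone and time window (switchback) instead of by user. It assumes a hash-based assignment scheme. The function names, the salt, the zone identifier, and the 60-minute window are hypothetical, and a real switchback would also need window lengths chosen to let carryover effects die out.

```python
import hashlib
from datetime import datetime, timezone


def _bucket(key, salt):
    """Deterministically map a key to 0 or 1 using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 2


def assign_by_zone(zone_id, salt):
    """Cluster randomization: every order in a zone shares one arm."""
    return "treatment" if _bucket(zone_id, salt) else "control"


def assign_switchback(zone_id, ts, salt, window_minutes=60):
    """Switchback: a zone alternates arms across fixed time windows.

    ts must be a timezone-aware datetime so windows line up across servers.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    key = f"{zone_id}:{window_index}"
    return "treatment" if _bucket(key, salt) else "control"


# Hypothetical usage: orders in the same zone and window share supply,
# so they also share a treatment arm.
order_time = datetime(2026, 5, 1, 10, 5, tzinfo=timezone.utc)
arm = assign_switchback("zone-17", order_time, salt="dispatch-test-v1")
```

The point of both functions is the unit of randomization: the analysis then compares zones or zone-windows, not individual users, which usually means fewer effective units and wider intervals.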

For product leaders, the practical question is choosing the cleanest decision design for the system in front of you.

Statistical Significance Is Not the Same as Practical Significance

Another common failure mode is treating statistical significance as the only form of truth.

A statistically significant result is not automatically important. A non-significant result is not automatically useless.

The p-value tells you something specific: assuming there is no true effect, how surprising is the observed result or something more extreme? It does not tell you whether the feature is good, whether the effect is commercially meaningful, or whether the change should be launched.

A result can be statistically significant but too small to matter.
A result can be directionally positive but inconclusive because the experiment was underpowered.
A result can be neutral overall but highly meaningful for a specific segment.
A result can improve the primary metric while damaging an important guardrail.
A result can be negative because the hypothesis was wrong, or because the implementation was poor, or because the treatment interacted badly with a particular system state.

Readouts should include more than a binary “significant / not significant” label. A useful readout covers:

The observed delta.
The confidence interval.
The p-value.
The sample size.
The power assumptions.
Primary and guardrail metrics.
Segment-level patterns, clearly marked as exploratory where appropriate.
The practical significance of the movement.
The recommended decision.

The confidence interval is especially useful because it shows the range of plausible effects. If the interval includes both a meaningful positive and a meaningful negative effect, the experiment has not resolved the decision. If the interval is tightly centered around a trivial effect, the team may decide that further testing is not worth it.
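A minimal sketch of that reading, assuming a binary metric such as contact rate and a simple two-arm test with independent units. It uses the normal approximation for the difference in proportions. The function names, the counts, and the "meaningful" threshold are illustrative assumptions. Clustered or switchback designs would need standard errors computed at the cluster level instead.

```python
import math
from statistics import NormalDist


def two_proportion_readout(x_control, n_control, x_treat, n_treat, alpha=0.05):
    """Observed delta, confidence interval, and two-sided p-value."""
    p_c = x_control / n_control
    p_t = x_treat / n_treat
    delta = p_t - p_c
    # Unpooled standard error for the interval around the observed delta.
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta - z_crit * se, delta + z_crit * se)
    # Pooled standard error for the test, which assumes no true difference.
    p_pool = (x_control + x_treat) / (n_control + n_treat)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
    z = delta / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"delta": delta, "ci": ci, "p_value": p_value}


def interpret_interval(ci, meaningful):
    """Compare the interval with the smallest effect that matters."""
    low, high = ci
    if low <= -meaningful and high >= meaningful:
        return "unresolved"   # meaningful effects in both directions remain plausible
    if -meaningful < low and high < meaningful:
        return "trivial"      # every plausible effect is smaller than what matters
    return "directional"      # one direction is at least partly supported


# Hypothetical counts: contacts out of orders in each arm.
readout = two_proportion_readout(1600, 20000, 1440, 20000)
verdict = interpret_interval(readout["ci"], meaningful=0.002)
```

The three labels are a simplification. What matters is that the interval is compared with an effect size chosen in advance, not only with zero.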

The standard is not blind adherence to a 0.05 threshold; it is understanding what the evidence supports and which decision it enables.

Flat Results Still Teach You Something

Many teams treat flat experiments as failures.

A flat result can be extremely valuable if it rules out a hypothesis, prevents unnecessary engineering investment, or shows that a customer problem is not solved by the proposed intervention.

To learn from a flat result, teams usually need to check power, treatment strength, metric proximity, implementation quality, segment effects, and whether the original hypothesis was plausible.

A flat readout does not automatically mean “do nothing.” It may support launching a low-risk strategic change, stopping a weak hypothesis, improving the treatment and retesting, narrowing the target segment, running a more powerful test, or shifting to a better metric.

The worst response to a flat result is to shrug and move on without learning.

Judgment Comes Before and After the Readout

There is a misconception that experimentation replaces judgment.

In practice, experimentation disciplines judgment.

Before the experiment, judgment is needed to form a plausible hypothesis, choose the right metric, define the right unit of randomization, estimate the effect size worth detecting, and select the right design.

After the experiment, judgment is needed to interpret the evidence, understand the trade-offs, evaluate operational risk, and decide whether the result is strong enough to act on.

The experiment does not make the decision for the team. It improves the quality of the decision.

That distinction matters because product teams are not operating in laboratory conditions. They operate in messy systems with imperfect data, changing user behavior, operational constraints, competitive pressure, and strategic ambiguity.

A good product leader does not ask experiments to provide false certainty. They use experiments to reduce uncertainty enough to make a better decision.

The Useful Test Is the One That Clarifies the Next Decision

Most teams run experiments with good intent: to reduce uncertainty before making a product decision.

The problem is that the ritual can become disconnected from the decision. The team ships a variant, waits for the readout, debates significance, and then still struggles to answer the question that mattered in the first place: what should we do next?

A strong experiment should make the next decision clearer.

In practical terms: is the customer problem real, was the intervention strong enough, did behavior change in a meaningful way, are unintended consequences acceptable, and does the evidence support launch, iteration, or stopping?

When experimentation is tied to a clear decision, it forces teams to state what they believe, what evidence they need, what risks they will accept, and what action will follow.

Strong teams do not just run more experiments. They run experiments that make decisions clearer.