Strategy teardown

Statistical arbitrage done right: can a cointegrated pair clear the bar?

A triangular-arbitrage system was broken; EURUSD/GBPUSD weren't cointegrated. So does pairs trading ever work? We screened eight candidate pairs, deep-dived the one textbook pair that passes, and held it to a serious acceptance bar. Done right, the edge is real — but thin.

By ridingyo · 2026-06-17 · Original research · daily data, 2019–2026

The method. An Engle-Granger cointegration screen across eight candidate pairs; then, on the winner, an in-sample hedge ratio, a rolling z-score, Optuna over the lookback and thresholds, a 40% out-of-sample hold-out, a modeled per-leg cost, and a Deflated Sharpe Ratio for the 300 trials. Finally, the result is judged against a real acceptance bar — the kind a desk sets before funding a strategy.

First, the screen: most "obvious" pairs aren't cointegrated

A pairs trade needs cointegration — a spread that reliably returns to a mean — not mere correlation. So we tested eight pairs people reach for. Only three pass the gate.

Figure: Engle-Granger cointegration p-values for eight candidate pairs, 2019–2026. Cointegrated (blue), passing the 0.05 gate: WTI/Brent, SPY/DIA, EWA/EWC. Correlated but not cointegrated (grey): gold/silver, Nasdaq/S&P, Bitcoin/Ether (p=0.78), even Coca-Cola/Pepsi (p=0.93).

That is the previous teardown's lesson at scale: Coke and Pepsi move together, but their spread wanders off and never comes back, so there is nothing to trade. We take the cleanest survivor forward — EWA/EWC, the Australia and Canada country ETFs, two commodity-driven economies and a classic pairs candidate.

EWA/EWC: a spread that actually reverts

Figure: EWA and EWC prices (top) and their spread (bottom). EWA and EWC track each other, and unlike EURUSD/GBPUSD, their spread oscillates around its mean, which is a tradeable signal. One honest caveat: cointegration is time-varying. Over the in-sample window alone the p-value is 0.13, which misses the 0.05 gate; only the full sample reaches 0.02.

So the raw material is there. We fit the hedge ratio in-sample, z-score the spread, and let Optuna search the lookback and the entry/exit thresholds — then we look at the half we never touched.

The out-of-sample edge is real

Figure: in-sample and out-of-sample equity curves. For the first time in this series, out-of-sample (blue) goes the right way: +9.6% after costs, even through a −13% drawdown. The signal survives realistic costs out to a break-even of ~8.6 bps per leg.

This is genuinely different from the two failures before it. The strategy is cointegration-grounded, the out-of-sample return is positive after a real cost, and there is a comfortable break-even cushion. If the question is "can honest pairs trading make money," the answer here is a cautious yes.

But does it clear the bar?

Making money and being fundable are different tests. A serious acceptance bar asks for OOS Sharpe ≥ 1.2, max drawdown ≤ 12%, profit factor ≥ 1.4, and a profit factor of at least 1.15 at double the assumed cost. Held to that bar, the result reads:

Criterion	Result	Verdict
OOS Sharpe ≥ 1.2	0.42	fail
OOS max drawdown ≤ 12%	−13.3%	fail
OOS profit factor ≥ 1.4	1.11	fail
Profit factor ≥ 1.15 at 2× cost	<1.15	fail
Deflated Sharpe (300 trials)	0.16	weak

It misses every line. The in-sample Sharpe of 1.08 more than halves to 0.42 out-of-sample, the drawdown breaches the limit, and — the number that matters most — the Deflated Sharpe is 0.16. After charging the strategy for the 300 parameter sets we tried, the headline collapses: there is a positive edge, but it is not robust enough to bet the proposal's KPIs on.
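For readers who want to apply the same bar to their own backtest, here is one conventional way to compute the three headline metrics and check them. The definitions (252 trading days per year, drawdown measured on compounded equity, profit factor as gross gains over gross losses) are common conventions assumed here, not confirmed details of the article's calculation. The `reported` figures are copied from the table above; the 2× cost line is left out because the table gives it only as "<1.15".

```python
import math
from statistics import mean, stdev


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate 0)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Worst peak-to-trough loss on compounded equity, as a negative number."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1)
    return worst


def profit_factor(trade_pnls):
    """Gross gains divided by gross losses."""
    gains = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return math.inf if losses == 0 else gains / losses


# Acceptance bar from the text; drawdown is stored as a positive size.
BAR = {"sharpe": 1.2, "max_drawdown": 0.12, "profit_factor": 1.4}


def check_bar(metrics, bar=BAR):
    return {
        "sharpe": metrics["sharpe"] >= bar["sharpe"],
        "max_drawdown": metrics["max_drawdown"] <= bar["max_drawdown"],
        "profit_factor": metrics["profit_factor"] >= bar["profit_factor"],
    }


# Out-of-sample figures as reported in the table.
reported = {"sharpe": 0.42, "max_drawdown": 0.133, "profit_factor": 1.11}
print(check_bar(reported))
```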

Figure: out-of-sample Sharpe versus per-leg cost. The edge is real but shallow: out-of-sample Sharpe is positive only up to about 9 bps of per-leg cost. A wider-spread broker erases it.
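A break-even cost of this kind can be computed directly once a cost model is fixed. The sketch below assumes a deliberately simple one, which may differ from the article's: each unit change in position trades both legs, and each leg pays a flat cost in basis points of unit notional. Under that assumption the Sharpe ratio changes sign exactly where the summed net return hits zero, so the break-even is gross return divided by total leg turnover. All names and numbers are hypothetical.

```python
def net_returns(gross, positions, cost_bps_per_leg, legs=2):
    """Subtract trading costs from gross per-period returns."""
    cost = cost_bps_per_leg / 10_000
    net, prev = [], 0
    for g, p in zip(gross, positions):
        net.append(g - legs * abs(p - prev) * cost)  # pay on position changes
        prev = p
    return net


def break_even_cost_bps(gross, positions, legs=2):
    """Per-leg cost (bps) at which the summed net return is exactly zero."""
    prev, turnover = 0, 0
    for p in positions:
        turnover += legs * abs(p - prev)
        prev = p
    return 10_000 * sum(gross) / turnover


# Toy example with made-up numbers: four position changes, two legs each.
gross = [0.0, 0.004, 0.0, -0.001, 0.0]
pos = [1, 1, 0, -1, 0]
be = break_even_cost_bps(gross, pos)
print(be)  # per-leg break-even in bps for the toy series
```

Sweeping `cost_bps_per_leg` through `net_returns` and recomputing the Sharpe ratio at each step would reproduce the shape of the curve in the figure for any given strategy.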

Verdict

Does it survive validation?

A real edge — but a thin one.

This is the series' first genuine, cost-surviving out-of-sample edge, and it earns that by starting from a cointegrated pair instead of a merely correlated one. But it is modest: it fails a serious acceptance bar on every line, and the Deflated Sharpe (0.16) says a single pair tuned over 300 trials is not something to size up. Cointegration is necessary. It is not sufficient.
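The deflation behind that number follows the Deflated Sharpe Ratio of Bailey and López de Prado: estimate the best Sharpe you would expect from pure luck across the number of trials, then ask how likely it is that the observed Sharpe exceeds it. The sketch below is a standard-library rendering of that formula. The inputs in the usage lines (observation count, variance of Sharpe across trials, skew, kurtosis) are hypothetical placeholders, since the article does not report them, so the output is not an attempt to reproduce the 0.16.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials strategies with zero true edge."""
    a = STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    b = STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, n_trials, sharpe_variance, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe beats the luck-only benchmark.

    sr and sharpe_variance are per-period (not annualized), matching n_obs.
    """
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return STD_NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Hypothetical inputs: same observed Sharpe, different numbers of trials.
few = deflated_sharpe(sr=0.1, n_obs=500, n_trials=10, sharpe_variance=0.001)
many = deflated_sharpe(sr=0.1, n_obs=500, n_trials=300, sharpe_variance=0.001)
print(round(few, 3), round(many, 3))
```

The same observed Sharpe scores lower as the trial count rises, which is the mechanism the verdict relies on: 300 tuned parameter sets raise the luck-only benchmark that a single headline has to beat.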

How a thin edge becomes a tradeable one

The fix is not a better single pair; it is discipline and breadth:

A portfolio of pairs, not one. A 0.4-Sharpe edge on one pair is noise; twenty weakly-correlated 0.4-Sharpe edges can be a real book. Stat-arb is a breadth game.
Re-screen and re-fit on a schedule. Cointegration is time-varying (the EWA/EWC p-value was 0.13 on the in-sample window alone but 0.02 on the full sample). Walk-forward, and drop pairs whose spread stops reverting.
Guard the structure. A half-life-based time-stop, a robust z-score, and a correlation gate that blocks entries when the relationship is breaking down.
Deflate, always. Every pair and every parameter you try lowers the bar a real edge must clear. Put a number on it.
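The breadth claim in the first item above is simple arithmetic under idealized assumptions: equal weights, equal volatility, the same Sharpe on every pair, and one common pairwise correlation between pair returns. In that case the portfolio Sharpe is the single-pair Sharpe times sqrt(N / (1 + (N − 1)ρ)). The function name and the correlation values below are illustrative assumptions.

```python
import math


def portfolio_sharpe(single_sharpe, n_pairs, avg_corr):
    """Sharpe of an equal-weight book of identical, equally correlated edges."""
    return single_sharpe * math.sqrt(n_pairs / (1 + (n_pairs - 1) * avg_corr))


# Twenty 0.4-Sharpe pairs at different assumed correlations.
print(round(portfolio_sharpe(0.4, 20, 0.0), 2))  # 1.79
print(round(portfolio_sharpe(0.4, 20, 0.2), 2))  # 0.82
print(round(portfolio_sharpe(0.4, 20, 1.0), 2))  # 0.4
```

The gain depends on the "weakly-correlated" qualifier. At an assumed correlation of 0.2 the twenty-pair book roughly doubles the single-pair Sharpe; with perfectly correlated pairs, breadth adds nothing.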

Check the math yourself

Tool: Deflated Sharpe Ratio. What does a 300-trial backtest's headline actually survive?
Tool: Position Size & Risk of Ruin. Size a thin edge so a −13% drawdown doesn't end you.

Educational analysis, not investment advice. This is a methodology case study, not a recommendation to trade any strategy or instrument. Simulated and optimized results have severe limitations and do not predict future performance. See the full disclaimer.