Strategy teardown

Statistical arbitrage done right: can a cointegrated pair clear the bar?

The earlier teardowns in this series ended in failure: a triangular-arbitrage system was broken, and EURUSD/GBPUSD weren't cointegrated. So does pairs trading ever work? We screened eight candidate pairs, deep-dived the one textbook pair that passes, and held it to a serious acceptance bar. Done right, the edge is real, but thin.

The method. An Engle-Granger cointegration screen across eight candidate pairs; then, on the winner, an in-sample hedge ratio, a rolling z-score, Optuna over the lookback and thresholds, a 40% out-of-sample hold-out, a modeled per-leg cost, and a Deflated Sharpe Ratio for the 300 trials. Finally, the result is judged against a real acceptance bar — the kind a desk sets before funding a strategy.

First, the screen: most "obvious" pairs aren't cointegrated

A pairs trade needs cointegration — a spread that reliably returns to a mean — not mere correlation. So we tested eight pairs people reach for. Only three pass the gate.

[Figure: Cointegration p-values for eight candidate pairs; only WTI/Brent, SPY/DIA, and EWA/EWC pass the 0.05 gate.]
Engle-Granger p-values, 2019–2026. Cointegrated (blue): WTI/Brent, SPY/DIA, EWA/EWC. Correlated-but-not-cointegrated (grey): gold/silver, Nasdaq/S&P, Bitcoin/Ether (p=0.78), even Coca-Cola/Pepsi (p=0.93).

That is the previous teardown's lesson at scale: Coke and Pepsi move together, but their spread wanders off and never comes back, so there is nothing to trade. We take the cleanest survivor forward — EWA/EWC, the Australia and Canada country ETFs, two commodity-driven economies and a classic pairs candidate.
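The screen can be sketched as the two Engle-Granger steps: regress one price series on the other, then test whether the residual spread mean-reverts. The block below is a minimal stdlib-only illustration on simulated data, not the code behind the chart. It uses a Dickey-Fuller regression with no lag augmentation and compares the statistic to an approximate 5% critical value of about -3.34 (the commonly quoted asymptotic value for two series with a constant); a library routine such as `coint` in statsmodels would report a p-value instead. All function names and the simulated series are assumptions for illustration.

```python
import math
import random


def ols(x, y):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def engle_granger_stat(x, y):
    """Step 1: regress y on x. Step 2: Dickey-Fuller t-stat on the residuals."""
    a, b = ols(x, y)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    lagged = resid[:-1]
    delta = [r1 - r0 for r0, r1 in zip(resid[:-1], resid[1:])]
    # Regress delta_t = rho * resid_{t-1} + noise (no intercept, no extra lags).
    s_ll = sum(l * l for l in lagged)
    rho = sum(l * d for l, d in zip(lagged, delta)) / s_ll
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, delta))
    se = math.sqrt(sse / (len(delta) - 1) / s_ll)
    return rho / se, b


# Simulated cointegrated pair (hypothetical data, not market prices):
# x is a random walk, y follows 0.8 * x plus a mean-reverting spread.
rng = random.Random(0)
x = [100.0]
spread = [0.0]
for _ in range(999):
    x.append(x[-1] + rng.gauss(0, 1))
    spread.append(0.5 * spread[-1] + rng.gauss(0, 1))
y = [2.0 + 0.8 * xi + s for xi, s in zip(x, spread)]

stat, beta = engle_granger_stat(x, y)
CRIT_5PCT = -3.34  # approximate Engle-Granger 5% critical value (assumption)
print("cointegrated at 5%:", stat < CRIT_5PCT)
```

A statistic more negative than the critical value rejects "no cointegration". Two series that are merely correlated, like the grey bars in the chart, would typically leave the statistic above it.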

EWA/EWC: a spread that actually reverts

[Figure: EWA and EWC rise together; their spread oscillates around a mean.]
Top: EWA and EWC track each other. Bottom: unlike EURUSD/GBPUSD, this spread oscillates around its mean — a tradeable signal. One honest caveat: cointegration is time-varying. Over the in-sample window alone the p-value is a borderline 0.13; it is the full sample that reaches 0.02.

So the raw material is there. We fit the hedge ratio in-sample, z-score the spread, and let Optuna search the lookback and the entry/exit thresholds. Then we look at the 40% hold-out we never touched.
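The signal pipeline described here can be sketched in a few functions. This is a hedged illustration, assuming an OLS hedge ratio, a trailing-window z-score that excludes the current bar, and symmetric entry/exit thresholds; the article does not spell out these details, and every name, parameter value, and price below is hypothetical. In the study, Optuna searched the values passed here as `lookback`, `entry`, and `exit_`.

```python
import statistics


def fit_hedge_ratio(x, y):
    """OLS slope of y on x (with intercept), fitted on in-sample data only."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx


def rolling_zscore(spread, lookback):
    """z-score of each point against the trailing window (None until warm)."""
    z = [None] * len(spread)
    for t in range(lookback, len(spread)):
        window = spread[t - lookback:t]  # excludes spread[t]: no look-ahead
        sd = statistics.stdev(window)
        if sd > 0:
            z[t] = (spread[t] - statistics.fmean(window)) / sd
    return z


def positions(z, entry, exit_):
    """+1 = long spread, -1 = short spread, 0 = flat."""
    pos, out = 0, []
    for zt in z:
        if zt is None:
            pos = 0
        elif pos == 0:
            if zt > entry:
                pos = -1          # spread rich: short it
            elif zt < -entry:
                pos = 1           # spread cheap: buy it
        elif pos == 1 and zt >= -exit_:
            pos = 0               # long spread has reverted
        elif pos == -1 and zt <= exit_:
            pos = 0               # short spread has reverted
        out.append(pos)
    return out


# Toy prices (hypothetical numbers, not EWA/EWC data).
x = [10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.1, 11.8, 12.0, 12.3]
y = [20.1, 21.0, 22.2, 21.5, 22.3, 23.2, 22.0, 23.9, 24.1, 24.5]
split = int(0.6 * len(x))  # first 60% in-sample, last 40% held out
beta = fit_hedge_ratio(x[:split], y[:split])
spread = [yi - beta * xi for xi, yi in zip(x, y)]
z = rolling_zscore(spread, lookback=3)
pos = positions(z, entry=2.0, exit_=0.5)
```

The point of the split is that `beta` and the tuned thresholds never see the last 40% of the data, so the positions taken there are a genuine out-of-sample test.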

The out-of-sample edge is real

[Figure: In-sample equity rises; out-of-sample also rises to +9.6% before a drawdown.]
For the first time in this series, out-of-sample (blue) goes the right way: +9.6% after costs, even through a −13% drawdown. The signal survives realistic costs out to a break-even of ~8.6 bps per leg.

This is genuinely different from the two failures before it. The strategy is cointegration-grounded, the out-of-sample return is positive after a real cost, and costs would have to rise to roughly 8.6 bps per leg before the edge disappears. If the question is "can honest pairs trading make money," the answer here is a cautious yes.

But does it clear the bar?

Making money and being fundable are different tests. Hold the result to a serious acceptance bar of OOS Sharpe ≥ 1.2, max drawdown ≤ 12%, profit factor ≥ 1.4, and profit factor ≥ 1.15 at double the assumed cost:

Criterion | Result | Verdict
OOS Sharpe ≥ 1.2 | 0.42 | fail
OOS max drawdown ≤ 12% | −13.3% | fail
OOS profit factor ≥ 1.4 | 1.11 | fail
Profit factor ≥ 1.15 at 2× cost | <1.15 | fail
Deflated Sharpe (300 trials) | 0.16 | weak

It misses every line. The in-sample Sharpe of 1.08 more than halves to 0.42 out-of-sample, the drawdown breaches the limit, and the number that matters most, the Deflated Sharpe, is 0.16. After charging the strategy for the 300 parameter sets we tried, the headline collapses: there is a positive edge, but it is not robust enough to stake a funding proposal's KPIs on.
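For readers who want to run the same checks on their own backtest, the three directly measurable criteria can be sketched as follows. The definitions are assumptions, since the article does not state them: Sharpe is annualized from daily returns with a zero risk-free rate, drawdown is measured on the compounded equity curve, and profit factor is gross gains over gross losses per trade. The function names are hypothetical.

```python
import math
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Mean over standard deviation of per-period returns, annualized."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.fmean(returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of compounded equity, as a positive fraction."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def profit_factor(trade_pnls):
    """Sum of winning trades divided by the absolute sum of losing trades."""
    gains = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return gains / losses if losses > 0 else math.inf


def acceptance(sharpe, max_dd, pf):
    """Apply the article's thresholds; True means the criterion is met."""
    return {
        "OOS Sharpe >= 1.2": sharpe >= 1.2,
        "OOS max drawdown <= 12%": max_dd <= 0.12,
        "OOS profit factor >= 1.4": pf >= 1.4,
    }


# The out-of-sample figures reported in the table above.
verdicts = acceptance(sharpe=0.42, max_dd=0.133, pf=1.11)
print(verdicts)
```

With the reported figures, all three checks come back False. The fourth criterion needs the backtest rerun at double the cost, so it is not reproduced here.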

[Figure: Out-of-sample Sharpe versus per-leg cost, positive at low cost and fading by ~9 bps.]
The edge is real but shallow: out-of-sample Sharpe is positive only up to about 9 bps of per-leg cost. A wider-spread broker erases it.
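The break-even idea can be made concrete with a simple additive cost model. The sketch below assumes a proportional cost charged on each leg whenever the spread position changes, with two legs per spread trade; the article's exact cost model is not given, and the returns and positions are toy numbers. Because the Sharpe ratio changes sign exactly where the mean net return does, the break-even cost is the gross return divided by the number of leg trades.

```python
def net_returns(gross, pos, cost_bps_per_leg):
    """Subtract cost each time the position changes; a spread trade moves two legs."""
    net, prev = [], 0
    for g, p in zip(gross, pos):
        legs = 2 * abs(p - prev)
        net.append(g - legs * cost_bps_per_leg / 1e4)
        prev = p
    return net


def break_even_bps(gross, pos):
    """Per-leg cost (in bps) at which total net return is exactly zero."""
    legs, prev = 0, 0
    for p in pos:
        legs += 2 * abs(p - prev)
        prev = p
    return sum(gross) / legs * 1e4 if legs else float("inf")


# Toy example (hypothetical): one round trip in the spread.
gross = [0.0, 0.004, -0.001, 0.003, 0.0]  # per-period return before costs
pos = [1, 1, 1, 0, 0]                     # enter long spread, then exit
be = break_even_bps(gross, pos)
for cost in (0.0, 5.0, be, 20.0):
    print(f"{cost:5.1f} bps/leg -> net {sum(net_returns(gross, pos, cost)):+.4f}")
```

Sweeping `cost_bps_per_leg` over a grid and recomputing the Sharpe ratio at each step gives a curve like the one in the figure.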

Verdict

Does it survive validation?

A real edge — but a thin one.

This is the series' first genuine, cost-surviving out-of-sample edge, and it earns that by starting from a cointegrated pair instead of a merely correlated one. But it is modest: it fails a serious acceptance bar on every line, and the Deflated Sharpe (0.16) says a single pair tuned over 300 trials is not something to size up. Cointegration is necessary. It is not sufficient.
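To see why 300 trials weigh so heavily, here is a minimal sketch of the Deflated Sharpe Ratio in the form published by Bailey and López de Prado: the observed Sharpe is compared with the best Sharpe one would expect from that many skill-less trials, then adjusted for sample length, skew, and kurtosis. The inputs below are hypothetical and are not the values behind the 0.16 in the table; in particular the number of observations and the dispersion of Sharpe ratios across trials are assumptions.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected best Sharpe among n_trials (>= 2) skill-less trials."""
    n = NormalDist()
    return sr_std * (
        (1 - EULER_GAMMA) * n.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * n.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe beats the multiple-testing benchmark.

    sr and sr_std are per-period values (not annualized).
    """
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Hypothetical inputs: daily Sharpe 0.05, 500 observations,
# Sharpe dispersion across trials of 0.02.
dsr_10 = deflated_sharpe(sr=0.05, n_obs=500, n_trials=10, sr_std=0.02)
dsr_300 = deflated_sharpe(sr=0.05, n_obs=500, n_trials=300, sr_std=0.02)
print(f"DSR after 10 trials:  {dsr_10:.2f}")
print(f"DSR after 300 trials: {dsr_300:.2f}")
```

The same headline Sharpe earns a lower score as the trial count grows, which is the mechanism behind the weak reading on this strategy.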

How a thin edge becomes a tradeable one

The fix is not a better single pair; it is discipline and breadth.

Check the math yourself

Tool: Deflated Sharpe Ratio (what does a 300-trial backtest's headline actually survive?)
Tool: Position Size & Risk of Ruin (size a thin edge so a −13% drawdown doesn't end you)
Educational analysis, not investment advice. This is a methodology case study, not a recommendation to trade any strategy or instrument. Simulated and optimized results have severe limitations and do not predict future performance. See the full disclaimer.