Quant Lab Tools
Strategy teardown · part 2 of 3

We tortured the Qullamaggie strategy. Where does its edge actually come from?

In part 1 the raw systematic version barely made money — the right shape, but a regime roller-coaster. So we put it on the rack: four stress tests that locate the edge, the fragility, and the exact reasons it isn't tradeable as written.

Recap. Part 1 rebuilt Kristjan Kullamägi's published momentum rules mechanically on 1,500 US stocks. The result reproduced his trade signature — small losses, rare huge winners — but as a portfolio it was a coin flip with a −48% drawdown: CAGR 1.7%, Sharpe 0.19. This part asks why, and what would have to change.

Test 1 / The drawdown is a regime story

Split every day by a simple market filter — is the S&P 500 above or below its own 200-day moving average? The strategy spends 81% of its life in the "above" (healthy) regime and the rest in the "below" (unhealthy) one. Here is where the money is made and lost:

Market regime | Share of days | Cumulative strategy return
Above 200-day MA (healthy) | 81% | +30%
Below 200-day MA (unhealthy) | 19% | −7%
Figure: equity curve (log scale). Red shading = market below its 200-day MA. The worst declines (2018, the 2020 crash, 2022) sit almost entirely inside the shading.

Nearly all the profit is earned in the healthy regime; the unhealthy 19% of days does little but manufacture the drawdown. The strategy buys breakouts indiscriminately, including into the teeth of a falling market. A market filter is the single biggest lever available — and it's exactly the discretionary "is this a market to be aggressive in?" judgement the original trader applies by hand. We'll add it mechanically in part 3.
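The regime split can be sketched in a few lines. This is a minimal illustration, not the code behind the table: the function names, the list-based inputs, and the choice to tag each day's return with the previous day's regime (to avoid look-ahead) are all assumptions made here.

```python
from math import prod


def sma(values, window):
    """Trailing simple moving average; None until `window` points exist."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


def regime_attribution(index_close, strategy_returns, window=200):
    """Compound daily strategy returns separately by market regime.

    Assumption: day t's return is tagged with the regime observed at the
    close of day t-1, so the label never uses information from day t itself.
    """
    ma = sma(index_close, window)
    buckets = {"above": [], "below": []}
    for t in range(1, len(index_close)):
        if ma[t - 1] is None:
            continue  # moving average not yet defined
        regime = "above" if index_close[t - 1] > ma[t - 1] else "below"
        buckets[regime].append(strategy_returns[t])
    total_days = sum(len(r) for r in buckets.values())
    return {
        name: {
            "share_of_days": len(r) / total_days if total_days else 0.0,
            "cumulative_return": prod(1 + x for x in r) - 1,
        }
        for name, r in buckets.items()
    }
```

As a consistency check on the table itself, the two regime figures compound to 1.30 × 0.93 ≈ 1.21, which lines up with the +$20,933 on a $100k start reported in Test 2.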

Test 2 / The entire edge lives in about ten trades

Momentum is supposed to pay through a thin tail of enormous winners. It does, to a fault. Across 2,288 trades, the best 22 (the top 1%) account for 18% of all gross profit. Pull out just the ten largest winners and the eleven-year result flips from a profit to a loss:

Scenario | Net P&L (11 years, $100k start)
All 2,288 trades | +$20,933
Minus the 10 best trades | −$49,200
Minus the 20 best trades | −$88,982
Figure: cumulative trade P&L. Blue = all trades, ending positive. Red = the identical sequence with the 20 best winners zeroed out; it bleeds to about −$89k and never recovers.

This is the real edge and the real danger in one chart. The fat tail is genuine, but the system is hostage to it: miss a handful of the biggest moves — through hesitation, a tight stop, or taking profits too early — and you are underwater for a decade. This is precisely where a discretionary master and a naïve system diverge, and why "just trail the winners" is far harder to live than to backtest.
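A concentration check of this kind is easy to run on any trade log. The sketch below assumes a flat list of per-trade dollar P&L and treats "removing" a winner as zeroing it while leaving every other trade untouched, as in the chart; it ignores the knock-on effect a missing winner would have on later position sizes. The function names are illustrative.

```python
def pnl_without_top_winners(trade_pnls, n_removed):
    """Net P&L after zeroing out the n largest winning trades."""
    ranked = sorted(trade_pnls, reverse=True)
    removed = [p for p in ranked[:n_removed] if p > 0]
    return sum(trade_pnls) - sum(removed)


def top_share_of_gross_profit(trade_pnls, fraction=0.01):
    """Share of gross profit (winners only) earned by the top `fraction` of trades."""
    ranked = sorted(trade_pnls, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # e.g. 1% of 2,288 trades -> 22
    gross_profit = sum(p for p in ranked if p > 0)
    top = sum(p for p in ranked[:k] if p > 0)
    return top / gross_profit if gross_profit else 0.0
```

Read the same way, the table implies that the ten best trades contributed about $70,133 between them ($20,933 + $49,200) and the next ten about $39,782 ($88,982 − $49,200), assuming removal is a simple subtraction.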

Test 3 / The edge is thinner than the spread

Our base case already charges 5 basis points per side in fees and slippage. Watch what a little more does:

Cost per side (bps) | CAGR | Sharpe
0 (frictionless) | 6.1% | 0.38
5 (base case) | 1.7% | 0.19
10 | −2.4% | −0.00
20 | −10.3% | −0.38
Figure: Sharpe versus cost per side. The Sharpe ratio falls steadily as cost rises and crosses zero near 10 bps per side, so the whole edge is consumed at roughly that level, and small-cap momentum breakouts routinely slip more than that.

Break-even sits near 10 bps per side. That is a generous budget for blue-chips and a fantasy for the low-float, high-ADR (average daily range) names this strategy is built to chase: the ones that gap on the open and run. The frictionless 0.38 Sharpe is the number a careless backtest reports; the tradeable number is likely below even the 0.19 base case.
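For a rough sense of where break-even sits in your own results, a trade-level approximation is often enough. The sketch below assumes a list of frictionless per-trade returns (as fractions) and an additive cost model in which each trade pays the per-side cost twice; both the helper names and the additive simplification are assumptions here, and a portfolio-level Sharpe sweep like the one in the table can land somewhere slightly different.

```python
def net_trade_returns(gross_returns, cost_bps_per_side):
    """Subtract a round-trip cost (entry + exit) from each gross trade return."""
    round_trip = 2 * cost_bps_per_side / 10_000
    return [r - round_trip for r in gross_returns]


def breakeven_bps_per_side(gross_returns):
    """Per-side cost, in basis points, at which the average trade nets to zero."""
    if not gross_returns:
        raise ValueError("need at least one trade")
    mean_gross = sum(gross_returns) / len(gross_returns)
    return mean_gross / 2 * 10_000
```

Under this simplification, a break-even near 10 bps per side would correspond to an average frictionless trade return of roughly 0.20% per round trip, which shows how little room the average trade leaves for slippage.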

Test 4 / Optimizing it doesn't save it

Maybe the defaults are just unlucky. We handed six parameters to Optuna and let it search 80 configurations to maximize the in-sample (2015–2020) Sharpe, then asked two honesty questions of the winner. The first: the Deflated Sharpe Ratio, which discounts a result for how many configurations you tried. The second: does the in-sample champion survive on unseen 2021–2026 data?

Figure: three bars showing in-sample Sharpe 0.84, deflation bar 0.50, and out-of-sample Sharpe 0.35. Optimizing lifts the in-sample Sharpe to 0.84, but the deflation bar (the Sharpe you'd expect from the best of 80 random tries) is 0.50, and out-of-sample the edge falls to 0.35.

Optimization more than quadrupled the in-sample Sharpe, from 0.19 to 0.84. But after searching 80 configurations, you would expect a best-of-batch Sharpe of 0.50 from luck alone — so the Deflated Sharpe Ratio leaves only a 0.79 probability that the edge clears that selection bar, short of the ~0.95 you'd want before believing it. And out-of-sample the tuned Sharpe collapses to 0.35, still attached to a −40% drawdown. The optimizer bought in-sample fit, not a tradeable edge — the exact trap our overfitting guide is about.
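The two quantities in that paragraph can be sketched with the standard library alone. The code below follows the usual Bailey and López de Prado formulation: an approximation for the expected best Sharpe among N zero-edge trials, and the Deflated Sharpe Ratio as the probability that the observed Sharpe clears that bar. The article does not state its inputs (cross-trial Sharpe dispersion, number of return observations, skewness, kurtosis), so these are parameters here and the sketch is not expected to reproduce the 0.50 or 0.79 figures. Sharpe values must be per-period (not annualized) and matched to `n_obs`.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials (>= 2) strategies with no edge.

    sharpe_std is the standard deviation of Sharpe ratios across the trials.
    """
    return sharpe_std * (
        (1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe_ratio(sharpe, benchmark_sharpe, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds benchmark_sharpe.

    All Sharpe values are per-period; kurtosis is non-excess (3.0 for a normal).
    """
    denom = sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return _STD_NORMAL.cdf((sharpe - benchmark_sharpe) * sqrt(n_obs - 1) / denom)
```

With 80 trials the selection bar works out to roughly 2.45 standard deviations of the cross-trial Sharpe distribution, which is why a search of that size can manufacture a respectable-looking winner from nothing.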

And we are still flattering it

Every number above is an optimistic upper bound, because the universe is the current S&P 1500 — the survivors. Split it by liquidity and the strategy's return leans on the smaller, more delisting-prone names:

Sub-universe | CAGR | Sharpe | Max drawdown
Large-cap (top 500 by volume) | 5.4% | 0.38 | −35%
Small / mid-cap (the rest) | 7.5% | 0.44 | −49%

The smaller names carry more of the return — and they are exactly the population where survivorship bias bites hardest, since the small caps that went to zero after a failed breakout are simply absent from a current-constituents list. We can't measure the inflation precisely without a point-in-time, delisting-inclusive universe (a structural limit we flag rather than paper over); the academic range for momentum is roughly one to a few points of CAGR. Whatever the exact figure, it cuts the wrong way.

Part 2 verdict

A faint, real momentum edge, wrapped in four problems.

The signal isn't noise: there is a genuine tail and a weakly positive deflated result. But raw, it is regime-blind, hostage to about ten trades, priced for costs it won't get, and flattered by survivorship. None of those are the signal's fault; they reflect the absence of the discipline the original trader supplies by hand. That's a testable hypothesis: add the discipline, and see if a faint edge becomes a real one.

Run the same checks on your own backtest

Tool: Deflated Sharpe Ratio. Feed it your trial count and watch the headline Sharpe deflate.
Tool: Backtest Costs. Turn a gross curve into a net one and find your break-even bps.
Tool: Position Size & Risk of Ruin. Size a fat-tailed, low-win-rate system to survive.
Educational analysis, not investment advice. A methodology case study of a publicly published strategy — credited to Kristjan Kullamägi and reimplemented clean-room — not a recommendation to trade any strategy or instrument, and not a judgement of the trader, whose live results reflect discretion a daily backtest cannot capture. Simulated and optimized results have severe limitations, including the survivorship bias noted above, and do not predict future performance. See the full disclaimer.