Reference Guide

Understanding Fortuna — How the Whole Thing Works

A plain-language guide to what Fortuna Labs is, what it does, and how a trading strategy travels from "idea on a napkin" to "real money on the line" (or, more often, gets killed along the way). Written for someone who's never touched algorithmic trading or machine learning.

What Fortuna is

a research lab

Not a fund yet. Built to find genuine trading edges through pre-registered, kill-rule-bound experiments.

The single rule

write the plan first

Every test starts with a written plan saying "if X happens we ship, if Y happens we kill it." Filed before the test runs.

Stage pipeline

8 stages

From "_1-ideas" to "_8-live". A strategy can die at any stage. Most do.

Live capital rule

discipline-gated

A strategy reaches real money only after passing every stage of the pipeline — including a 30-day paper-trading window with no surprises.

1 · What is Fortuna Labs?

Fortuna Labs is an independent algorithmic-trading research lab. Structurally, it is a discipline machine: a system for proposing trading ideas, testing them honestly, and either deploying the survivors or killing them with a written record of why. Live capital is committed only after a strategy clears every gate in the pipeline.

The whole point of the system is to survive the most common failure mode in algorithmic trading: fooling yourself. Most people who try to build trading bots end up with a strategy that looks brilliant in historical data and loses money the moment it sees a real market. Fortuna's design is built around catching those self-deceptions before any real money is at risk.

The mental model: think of Fortuna less like a hedge fund and more like a small science lab. The product isn't trades — it's trustworthy answers to the question "does this trading idea actually have an edge?". Some answers are "yes, deploy it"; most are "no, here's the evidence, file the kill". Both are valuable.

2 · Why algorithmic trading at all?

Three reasons:

Markets have small, exploitable patterns: not predict-the-future patterns, but statistical edges in how prices behave around liquidity events, opening sessions, and gap conditions. They're small (often a fraction of a percent per trade) but real, and humans can't trade them at the speed and consistency required.
Discipline beats discretion — a coded rule trades the same way at 3am as it does after a great morning coffee. A human doesn't.
You can test it before you risk anything — unlike discretionary trading, an algorithmic strategy can be back-tested against decades of historical data to estimate whether the edge is real.

The catch: back-tests lie. A strategy that looks great in a back-test can fail live for many reasons — overfitting to noise, lookahead bias (using information that wouldn't have been available at the time), unrealistic execution assumptions, regime change. The job of Fortuna's discipline is to filter the genuine edges from the false-positive ones before live capital is committed.

3 · The complete journey of a strategy

Every strategy in Fortuna travels through eight named stages. It can be killed at any stage, and most are. The stages live in the codebase as folder names with leading digits so they sort in order: _1-ideas, _2-hypothesis, and so on. Here's the visual:

ideas: An idea worth investigating
hypothesis: A specific testable claim
data: Get clean data to test it on
backtest: Run the rule on history
validate: OOS + walk-forward checks
hyperopt: Tune (sparingly!) parameters
dryrun: Paper trade on live data
live: Real money, real risk

Green-top = a stage where the discipline catches most failures. Yellow-top = paper trading (no real money). Most strategies die in stages 4–6.

What happens at each stage

Stage 1 — Ideas

An idea is born. Examples: "what if BTC tends to revert after a leverage cascade?", "what if S&P futures break out reliably from the first 30 minutes of trading on gap days?". At this stage it's just a notion, often inspired by reading, observation, or a market quirk we noticed.

Gate to advance: the idea has a plausible mechanism (not just "this looks like it works on this chart").

Stage 2 — Hypothesis

The idea gets sharpened into a testable claim. Not "BTC mean-reverts" but "BTC reverts more than a fair coin would predict in the 4 hours following a 5%+ price spike on 10x+ leverage liquidations". Specific, measurable, falsifiable.

This is also where the three-question test happens: (1) Why should this work? (2) Why hasn't it been arbitraged away? (3) What would prove it doesn't work? An idea that can't answer all three goes back to stage 1 or dies.

Gate to advance: a written one-page edge document in knowledge/edges/ covering mechanism, regime affinity, and conditions where the edge should fail.

Stage 3 — Data

Get the historical data needed to test the hypothesis. For crypto: candle data from Binance going back years. For US futures: 1-minute OHLCV from Databento. Data quality is non-negotiable — bad data produces lying back-tests. We pin a SHA-256 hash of every data file used so we can reproduce results.

Gate to advance: data covers a long-enough window (typically 5+ years) and is verified clean.
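As an illustration of the hash-pinning step, here is a minimal sketch using Python's standard hashlib. The function names and the idea of storing the hex digest alongside the plan are assumptions for this example, not Fortuna's actual code.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, pinned_hex: str) -> bool:
    """True only if the file on disk still matches the pinned digest."""
    return sha256_of_file(path) == pinned_hex.lower()
```

If a single byte of the data file changes, the digest changes, so a back-test can refuse to run when `verify_pinned` returns False.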

Stage 4 — Backtest

Run the trading rule against the historical data. The output is a list of trades it would have made, each with a profit or loss. From this we compute summary statistics: total return, win rate, drawdown, Sharpe ratio.

This is where most strategies show their first lie. A back-test result is a single sample of how the strategy would have performed in one specific historical sequence. To know if the result generalises, we need walk-forward testing: train on years 1–4, test on year 5, then move forward and repeat.

Gate to advance: positive expected value, drawdown within tolerance, lookahead-analysis test passes (proves the strategy didn't accidentally peek at future data).
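The walk-forward idea from this stage ("train on years 1–4, test on year 5, then move forward and repeat") can be sketched as a small generator. The function name and default window sizes are illustrative assumptions.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    periods: List[int], train_size: int = 4, test_size: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) windows that roll forward one test block at a time."""
    start = 0
    while start + train_size + test_size <= len(periods):
        train = periods[start : start + train_size]
        test = periods[start + train_size : start + train_size + test_size]
        yield train, test
        start += test_size


# Seven years of history give three train/test windows.
for train, test in walk_forward_splits([2017, 2018, 2019, 2020, 2021, 2022, 2023]):
    print("train", train, "-> test", test)
```

Each test window lies strictly after its training window, so the rule is always judged on data it has not seen.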

Stage 5 — Validate

The honest check. Run combinatorial purged cross-validation (CPCV): split history into N chunks, hold out every possible combination of k chunks as the test set, and train on the remaining chunks, with a buffer around each test chunk to prevent information leakage. If the strategy looks good on average across all combinations, that's a strong signal. If it looks good on one specific split but bad on others, it was overfit.
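A simplified sketch of how such splits could be generated is shown below. It is not Fortuna's implementation. The function name, the defaults, and the fixed-width `buffer` are assumptions; a full CPCV purges training rows based on label overlap rather than a fixed number of bars.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


def cpcv_splits(
    n_obs: int, n_groups: int = 6, n_test_groups: int = 2, buffer: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) for every choice of test groups.

    Training rows within `buffer` bars of any test block are dropped.
    """
    # Contiguous, near-equal groups of observation indices.
    bounds = [g * n_obs // n_groups for g in range(n_groups + 1)]
    groups = [range(bounds[g], bounds[g + 1]) for g in range(n_groups)]

    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = [i for g in test_groups for i in groups[g]]
        blocked = set()
        for g in test_groups:
            lo = max(0, groups[g].start - buffer)
            hi = min(n_obs, groups[g].stop + buffer)
            blocked.update(range(lo, hi))  # test rows plus the buffer zone
        train_idx = [i for i in range(n_obs) if i not in blocked]
        yield train_idx, test_idx
```

With 6 groups and 2 test groups there are 15 distinct splits, and no training row ever sits inside or right next to a test block.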

Also compute the Probability of Backtest Overfitting (PBO): an estimate of how often the configuration that looks best in-sample lands in the bottom half out-of-sample, which is what happens when an apparent edge is just noise from trying many parameter combinations. PBO above ~50% means the result is likely fake.

Gate to advance: walk-forward Sharpe positive across most folds, PBO < 50%.

Stage 6 — Hyperopt

Optional and dangerous. If the strategy has tunable parameters (e.g. a stop-loss distance), carefully search for better values. The danger: with enough parameters and enough trials, you can always find a combination that looks great in-sample and fails out-of-sample. We cap the number of trials hard (e.g. 30 max) and apply Deflated Sharpe Ratio (DSR) — a correction that subtracts the bias from running many trials.

The cautionary tale here is K009 (the AutoLiqVac family kill): 7,500 hyperopt trials produced specs that looked spectacular but had DSR ≈ 0 — meaning the apparent edge was completely explained by selection bias. The whole strategy family was killed as a result.

Gate to advance: tuned parameters still pass walk-forward, trial count documented, DSR > 0.95.
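To make the DSR gate concrete, here is a hedged sketch of the calculation as it is commonly stated from Bailey & López de Prado: the Probabilistic Sharpe Ratio evaluated against the Sharpe you would expect from the luckiest of N skill-less trials. Function and parameter names are assumptions. `sharpe` is per observation (not annualised), `sharpe_variance` is the variance of Sharpe estimates across trials, and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def psr(sharpe: float, benchmark: float, n_obs: int,
        skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probabilistic Sharpe Ratio: P(true Sharpe > benchmark)."""
    denom = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    return _STD_NORMAL.cdf((sharpe - benchmark) * math.sqrt(n_obs - 1) / denom)


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Expected best Sharpe among n_trials trials that have no real edge."""
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def dsr(sharpe: float, n_obs: int, n_trials: int, sharpe_variance: float,
        skew: float = 0.0, kurt: float = 3.0) -> float:
    """Deflated Sharpe Ratio: PSR against the selection-bias benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    return psr(sharpe, benchmark, n_obs, skew, kurt)
```

Holding everything else fixed, raising `n_trials` from 30 to 7,500 raises the benchmark and pushes the DSR down, which is the mechanism behind the trial cap.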

Stage 7 — Dryrun

Paper trading. The strategy is connected to live market data and "trades" with imaginary money for at least 30 days. This catches things the back-test never could: real-time data feeds drop, exchange APIs misbehave, the strategy runs slower than expected, latency varies. Crucially, no real money is at risk yet.

Gate to advance: ≥30 days of clean dryrun behaviour, real-time stats match back-test expectations, no operational issues.

Stage 8 — Live

Real money. The strategy trades with actual capital. There is a hard requirement at this gate: typing the literal words GO LIVE at the command prompt. Nothing in Fortuna goes live by default. The default state of every strategy is "not deployed".

Gate to enter: typed confirmation of GO LIVE. After that, ongoing kill rules monitor live performance and pull the strategy back to dryrun (or kill it) if drawdown or other safety triggers fire.
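A minimal sketch of the two safety mechanisms in this stage follows. The function names and the drawdown thresholds are hypothetical; the real limits would come from the strategy's plan.

```python
def confirm_go_live(typed: str, gates_passed: bool) -> bool:
    """Deploy only if every gate passed AND the exact phrase was typed."""
    return gates_passed and typed == "GO LIVE"


def live_action(current_drawdown: float,
                pullback_at: float = 0.10,
                kill_at: float = 0.20) -> str:
    """Map the current drawdown (0.15 means 15%) to an action.

    Thresholds are illustrative assumptions, not Fortuna's actual limits.
    """
    if current_drawdown >= kill_at:
        return "kill"
    if current_drawdown >= pullback_at:
        return "dryrun"
    return "live"
```

The comparison is deliberately strict: "go live" or "GO LIVE " does not count, so nothing is deployed by accident.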

The shape of survivor selection: for every strategy that reaches _8-live, many die at earlier stages. The folder knowledge/killed/ contains a written record for each one — a kill ID, the failure mode, and the conditions under which a re-attempt would be allowed. This is not a bug; it's the system working. The lab's primary output is informed killing, not heroic shipping.

4 · The single most important rule — Pre-Analysis Plans

Before any back-test or experiment runs, a written document is filed: a Pre-Analysis Plan (PAP). It lives in knowledge/preregs/ with a unique ID like PAP-20260426-01. The PAP must contain:

The hypothesis — what specific claim we're testing.
The universe — which instruments, which time period.
Pre-registered metrics — what numbers will determine pass/fail.
Pass conditions — what would need to be true for the strategy to be deployed.
Kill conditions — what would need to be true for the strategy to be killed.
Forbidden moves — explicit list of "things I'm not allowed to do after seeing the result", e.g. "no shifting the goalposts", "no adding features post-hoc".
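The required sections above could be enforced with a small check like the one below. This is a sketch, not the lab's tooling: the class, the field names, and the ID pattern (inferred from the single example PAP-20260426-01) are assumptions.

```python
import re
from dataclasses import dataclass, fields
from typing import List

# Shape inferred from the example ID "PAP-20260426-01" (an assumption).
PAP_ID_PATTERN = re.compile(r"^PAP-\d{8}-\d{2}$")


@dataclass(frozen=True)
class PreAnalysisPlan:
    pap_id: str
    hypothesis: str
    universe: str
    metrics: List[str]
    pass_conditions: List[str]
    kill_conditions: List[str]
    forbidden_moves: List[str]


def missing_sections(plan: PreAnalysisPlan) -> List[str]:
    """Names of empty required sections, plus 'pap_id' if it is malformed."""
    problems = [f.name for f in fields(plan) if not getattr(plan, f.name)]
    if plan.pap_id and not PAP_ID_PATTERN.match(plan.pap_id):
        problems.append("pap_id")
    return problems
```

A plan is only fileable when `missing_sections` returns an empty list; the frozen dataclass mirrors the idea that the plan does not change after it is filed.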

Every test produces one of three pre-registered outcomes:

PASS

Hypothesis confirmed

The pre-registered pass conditions were met. The strategy advances to the next stage in the pipeline (often dryrun).

KILL

Hypothesis falsified

The pre-registered kill conditions were met. The strategy is filed in knowledge/killed/ with a kill ID and revive conditions (what would need to change to re-test).

INCONCLUSIVE

Not enough data

Sample size too small to draw an honest verdict. The PAP closes without a pass-or-fail, names what new data would unblock it, and waits.
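The three outcomes can be read as a small decision function. The sketch below is one plausible ordering under stated assumptions: sample size is checked first, a met kill condition outranks a met pass condition, and a result that meets neither is filed as inconclusive. All names are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    KILL = "kill"
    INCONCLUSIVE = "inconclusive"


def decide(n_trades: int, min_trades: int,
           pass_met: bool, kill_met: bool) -> Verdict:
    """Turn pre-registered conditions into exactly one verdict."""
    if n_trades < min_trades:
        return Verdict.INCONCLUSIVE  # too little data for an honest call
    if kill_met:
        return Verdict.KILL          # assumption: kill outranks pass
    if pass_met:
        return Verdict.PASS
    return Verdict.INCONCLUSIVE      # borderline results never become a pass
```

Because the inputs are fixed before the test runs, there is no branch in which a borderline result can be argued into a PASS afterwards.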

Why this matters: without pre-registration, the most natural human behaviour after a "borderline" back-test is to relax thresholds, try one more parameter, or quietly redefine "success". Pre-registration is a commitment device. The discipline isn't about being smart in the moment — it's about being honest after seeing the result.

This pattern is borrowed from clinical trials and academic statistics, where pre-registration is the standard tool against the same self-deception problems. In algorithmic trading it's much rarer than it should be, and the absence of it is why so many "great back-tests" fail live.

5 · A worked example — MeasuredMove from idea to paper trading

Here's the full journey of a representative strategy: MeasuredMove_v1, edge ID E002. Every stage of the pipeline, with the actual artefacts produced along the way.

Stage 1–2 — The idea + hypothesis

Idea: after a strong move (an "impulse"), price often pulls back a bit, then continues in the same direction by a similar magnitude — a "measured move". Classical chart pattern, possibly tradeable.

Hypothesis: on the BTC/USDT daily chart, when an upward impulse occurs (close above 20-day EMA, recent swing high) followed by a pullback, an entry on the next breakout produces positive expected value over a multi-month horizon. Filed as edge document E002.

Stage 3 — Data

BTC/USDT daily OHLCV from Binance, 2017 onwards. SHA-256 of the dataset is pinned in the PAP so the result is reproducible later.

Stage 4 — Back-test

Strategy coded in Python (MeasuredMove_v1.py, ~150 lines using the Freqtrade framework). Back-test run over a 5-year window produced 22 trades with a 59% win rate, +6.07% total return, and a 6.25% max drawdown. Modest but positive. Lookahead-analysis: clean (no information leakage from future bars).

Stage 5 — Validate

Combinatorial purged cross-validation with 15 paths: average OOS Sharpe 0.558, PBO 0%, all 15 paths positive. The signal is small but stable across time periods. That's the evidence basis for promoting the strategy to paper trading.

Stage 6 — Hyperopt

Skipped. The strategy parameters were locked at values from literature and intuition (EMA 20, swing 2 bars, ATR-1.0 stops). No tuning means no DSR penalty for selection bias.

Stage 7 — Dryrun (paper trading)

The strategy runs live against real BTC market data, paper-trades each entry, and logs every decision. The PAP requires at least 30 days of clean dryrun behaviour + a 20-trade checkpoint before any consideration of going live.

MeasuredMove is intentionally sparse — daily-bar entries fire on specific conditions, not every day. The expected pace is slow: the dryrun checkpoint can take months to reach. Patience is built into the system; impatience is exactly the failure mode the discipline exists to prevent.

Stage 8 — Live

If the dryrun gates pass, the strategy becomes a candidate for GO LIVE typed confirmation. Even then, allocation starts conservative (a tiny fraction of the account) and grows only with continued clean performance.

Total time from idea to dryrun: about a week of focused work. Total time before live capital, if it ever gets there: months. The whole rhythm is paced by deliberate skepticism, not by enthusiasm.

6 · How the fleet is structured

The "fleet" is the set of strategies the lab keeps in active rotation. Each one has an entry in state/fleet.yaml and is queryable via fortuna bot list. A strategy in the fleet sits at exactly one of three operating states:

Active · paper trading

stage: _7-dryrun

The strategy is connected to live market data and is paper-trading entries. Every decision is logged. The minimum gate to advance is 30 days of clean operation plus a trade-count checkpoint; below those, no strategy is even considered for live capital.

Active · live capital

stage: _8-live

Real money committed. Allocation starts conservative. Ongoing kill rules monitor live performance — drawdown limits, Sharpe deviation, error-rate triggers — and pull the strategy back to paper trading or kill it outright if any gate is breached.

Retired

stage: archived

A strategy that has been killed. Each retirement carries a kill ID (K plus a number, such as K009) and a written record of why it failed. Retirement is not silent: it produces a permanent finding so the same failure mode isn't repeated.

The current contents of these three states are a snapshot: strategies move between them as they pass or fail their gates. The structure is durable: every active strategy is in exactly one of the operating states, and every kill carries documentation. The membership changes; the discipline doesn't.
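One way to picture the movement between states is a table of allowed transitions. The stage labels come from the cards above; the transition table itself is an assumption for illustration (in particular, treating archived as terminal, with any re-attempt starting from a fresh plan).

```python
# Allowed moves between operating states (illustrative assumption).
ALLOWED_MOVES = {
    "_7-dryrun": {"_8-live", "archived"},   # promote, or kill
    "_8-live": {"_7-dryrun", "archived"},   # pull back, or kill
    "archived": set(),                      # terminal in this sketch
}


def move(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_MOVES.get(current, set()):
        raise ValueError(f"illegal move: {current} -> {target}")
    return target
```

Encoding the moves as data makes it easy to see that there is no path from archived straight back to live capital.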

7 · How we know if a strategy actually works — metrics in plain English

Trading-strategy metrics have intimidating names but boil down to a few simple ideas.

Win rate: Percentage of trades that closed at a profit. By itself this means little — you can have a 90% win rate and lose money if your losers are bigger than your winners.
Profit factor: Total profit on winning trades divided by total loss on losing trades. Above 1.0 means winners outweigh losers. Above 1.5 is good. Above 3.0 is suspiciously good.
Sharpe ratio: How much extra return you get per unit of volatility. Higher is better. A Sharpe of 1.0 is decent, 2.0 is great, above 3.0 in back-tests is usually a sign of overfitting.
Maximum drawdown: The biggest peak-to-trough loss the strategy ever experienced. This matters for psychological tolerance and for capital sizing — if you can't stomach a 30% drawdown, you can't run a strategy with one.
PSR (Probabilistic Sharpe Ratio): The probability that the strategy's true Sharpe is above zero, given the observed Sharpe and the sample size. PSR > 0.95 means we're 95% confident there's a real edge.
DSR (Deflated Sharpe Ratio): PSR but adjusted for the fact that we ran many trials and picked the best. The more trials, the more we have to discount. K009 was killed because its DSR was essentially 0 despite a beautiful raw Sharpe.
PBO (Probability of Backtest Overfitting): Statistical estimate of how likely the back-test result is just noise from running too many configurations. Below 50% = probably real; above 50% = probably fake.
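The first four metrics are simple enough to compute by hand. The sketch below uses plain Python with hypothetical function names; the Sharpe here is per period and not annualised.

```python
import math
from statistics import mean, stdev
from typing import List


def win_rate(pnls: List[float]) -> float:
    """Share of trades that closed at a profit."""
    return sum(1 for p in pnls if p > 0) / len(pnls)


def profit_factor(pnls: List[float]) -> float:
    """Total profit on winners divided by total loss on losers."""
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return math.inf if losses == 0 else gains / losses


def sharpe_ratio(returns: List[float], risk_free: float = 0.0) -> float:
    """Mean excess return per unit of volatility (needs 2+ returns)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough loss as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Nine small winners and one large loser: high win rate, losing strategy.
pnls = [1] * 9 + [-12]
print(win_rate(pnls))       # 0.9
print(profit_factor(pnls))  # 0.75
```

The example shows why win rate alone means little: 90% of the trades win, yet the profit factor is below 1.0 and the total is negative.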

Most of these come from the work of Marcos López de Prado in Advances in Financial Machine Learning (Wiley, 2018). Fortuna implements the canonical formulas with unit tests; full paper references are in section 14 below.

8 · What happens when a single trade fires

Here's the lifecycle of one trade, from signal to logged outcome:

Signal: Strategy code detects entry conditions on the latest bar
Order: Buy/sell order sent to the exchange (Binance or IBKR)
Hold: Position open; stop-loss + take-profit levels watched
Exit: Stop hit, target hit, or rule says exit
Log: Trade written to database with outcome, regime label, features at entry

Every trade — back-tested, paper, or live — ends up in the database with the same schema. This means we can ask questions across time: "in the last 6 months of LiquidityVacuum dryrun trades, how did the strategy perform in the volatile regime?" The answer comes from a single SQL query against the database, no special tooling needed.
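To show what "a single SQL query" might look like, here is a sketch against an in-memory SQLite database. The table name, column names, and sample rows are invented for the example and are not Fortuna's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE trades (
           strategy TEXT, mode TEXT, regime TEXT,
           closed_at TEXT, pnl_pct REAL)"""
)
# Made-up rows, purely to exercise the query.
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?, ?)",
    [
        ("LiquidityVacuum", "dryrun", "volatile", "2025-03-01", 1.0),
        ("LiquidityVacuum", "dryrun", "volatile", "2025-04-10", -0.5),
        ("LiquidityVacuum", "dryrun", "volatile", "2025-05-20", 2.0),
        ("LiquidityVacuum", "dryrun", "calm", "2025-04-02", 0.3),
        ("LiquidityVacuum", "dryrun", "volatile", "2024-01-15", -3.0),
    ],
)


def regime_summary(conn, strategy, mode, regime, since):
    """Return (trade count, average pnl %, win rate) for one slice."""
    return conn.execute(
        """SELECT COUNT(*), AVG(pnl_pct),
                  SUM(pnl_pct > 0) * 1.0 / COUNT(*)
           FROM trades
           WHERE strategy = ? AND mode = ? AND regime = ?
             AND closed_at >= ?""",
        (strategy, mode, regime, since),
    ).fetchone()


count, avg_pnl, wins = regime_summary(
    conn, "LiquidityVacuum", "dryrun", "volatile", "2025-01-01")
```

Because back-tested, paper, and live trades share one schema, changing the `mode` argument is all it takes to compare them.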

9 · The three biggest traps — and how Fortuna defends

Most retail algorithmic trading dies for the same handful of reasons. Fortuna's discipline is built specifically around catching these.

Trap 1 Lookahead bias

What it is: the strategy accidentally uses information from the future. For example, computing a 20-bar moving average that includes the current bar, then trading on that bar — at runtime, you wouldn't know the current bar's close yet.

How it shows up: the back-test looks brilliant; the live strategy loses money. The "edge" was reading tomorrow's newspaper.

Defense: every feature is shifted by one bar (so at time T you only know features computed on T-1 and earlier). Every back-test is required to pass freqtrade lookahead-analysis before its result is trusted. We also run a custom AST-based linter (ml/lint_lookahead.py) that scans Python feature code for missing shifts and flags any it finds.
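The one-bar shift is easy to see on a toy series. The sketch below uses plain lists and hypothetical helper names; with pandas the same idea is usually written as `series.rolling(window).mean().shift(1)`.

```python
from typing import List, Optional


def sma(values: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; the value at bar T includes the close of bar T."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def shift_one(series: List[Optional[float]]) -> List[Optional[float]]:
    """Delay a feature by one bar so bar T only sees data up to T-1."""
    return [None] + series[:-1]


closes = [10.0, 11.0, 12.0, 13.0, 14.0]
leaky = sma(closes, 3)     # [None, None, 11.0, 12.0, 13.0]
safe = shift_one(leaky)    # [None, None, None, 11.0, 12.0]
```

Trading bar 3 on `leaky[3]` would use bar 3's own close; `safe[3]` only uses closes from bars 0 to 2.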

Trap 2 Overfitting via too many trials

What it is: you run thousands of parameter combinations on the same back-test data and pick the best one. By chance alone, some combination will look great. Run live, it reverts to noise.

How it shows up: the chosen parameters are oddly specific (e.g. "stop loss at 2.847%", not 3%). The strategy works on the back-test window and fails everywhere else.

Defense: hard caps on trial budget (typically 30 max), Deflated Sharpe Ratio applied honestly (subtract the bias from the trial count), Probability of Backtest Overfitting computed via combinatorially symmetric cross-validation. K009 (the AutoLiqVac family kill) is the in-house example: 7,500 trials produced DSR ≈ 0; the whole family got killed for it.
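For readers who want to see the PBO mechanics, here is a compact sketch of combinatorially symmetric cross-validation in plain Python. It is an illustration under assumptions, not Fortuna's implementation: Sharpe is used as the performance measure, `n_blocks` must be even, and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, stdev
from typing import List


def _sharpe(returns: List[float]) -> float:
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_config: List[List[float]], n_blocks: int = 8) -> float:
    """Share of splits where the in-sample winner is at or below the
    out-of-sample median. One inner list of returns per configuration."""
    n_obs = len(returns_by_config[0])
    n_configs = len(returns_by_config)
    size = n_obs // n_blocks
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]

    logits = []
    for chosen in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [i for b in chosen for i in blocks[b]]
        oos_rows = [i for b in range(n_blocks) if b not in chosen
                    for i in blocks[b]]
        is_perf = [_sharpe([cfg[i] for i in is_rows])
                   for cfg in returns_by_config]
        oos_perf = [_sharpe([cfg[i] for i in oos_rows])
                    for cfg in returns_by_config]
        best = max(range(n_configs), key=is_perf.__getitem__)
        # Rank of the in-sample winner out-of-sample: 1 = worst, N = best.
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_configs + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    return sum(1 for x in logits if x <= 0) / len(logits)
```

If one configuration is genuinely better in every block, the in-sample winner also wins out-of-sample and the PBO is 0; if the winner is mostly luck, it lands in the bottom half about as often as not and the PBO drifts toward 50% or higher.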

Trap 3 Goalpost shifting after seeing the result

What it is: the back-test is borderline. You think "well if I just relax this threshold, or filter out this one bad regime, it works". Each individual move is small; the cumulative effect is that you've trained your eyes on the data.

How it shows up: nothing dramatic; the strategy just doesn't generalise out-of-sample because the "rules" were retroactively fitted to the in-sample data.

Defense: the PAP. Pass and kill conditions are written down before the back-test runs. After the result, you don't get to argue with the conditions you set. If a result is borderline, the PAP forces a kill or an INCONCLUSIVE filing — you can't massage it into a pass.

10 · How findings stack — the knowledge system

Every meaningful outcome gets a permanent record. The system is intentionally simple: 6 folders under knowledge/, each containing append-only Markdown files with YAML frontmatter. Indices auto-rebuild via fortuna sync.

edges/

E···

One file per validated edge mechanism. Mechanism, regime affinity, evidence.

preregs/

PAP-···

Pre-Analysis Plans. The contract for every test — written before, not after.

experiments/

EXP-···

The result of running each PAP — pass/kill/inconclusive verdict + evidence.

findings/

F···

Durable market facts discovered. Each one is a claim with evidence and a status (durable / provisional / superseded).

killed/

K···

Strategies that died. Each kill names the reason and lists the conditions under which a re-attempt would be allowed.

decisions/

D···

Architectural choices about how the lab itself operates (design conventions, discipline rules, deploy policies).

The whole knowledge base is plain text; everything is grep-able, git-able, and (where appropriate) syncable to an LLM-readable knowledge base for AI assistants to consult.

11 · The ML lab — a separate experiment

Sitting alongside the main strategy work is a more experimental layer: the ML lab. The question it tries to answer is: can machine-learning techniques improve any of our existing rule-based edges?

The lab runs ML PAPs the same way the main lab runs strategy PAPs: each one is a pre-registered hypothesis with explicit pass / kill / inconclusive thresholds, executed against historical trade data, and filed regardless of outcome. Typical ML PAP shapes look like:

Regime gating — would skipping trades in certain market regimes improve a strategy's portfolio Sharpe?
Regime sizing — would scaling stake by per-regime expected edge produce a better risk-adjusted return than uniform stake?
Meta-labelling — could a secondary ML model filter the marginal trades of a primary rule, raising precision without losing too many entries?

The lab's structural bias: ML is treated as an augmentation of validated rules, not a generator of new strategies. Per the López de Prado framework that informs the design, the highest-leverage use of ML for a small lab is improving a rule that already has positive expected value — not searching the space of all possible rules. Most ML PAPs that have been run produced a clean "no, this doesn't add edge here" answer; that is exactly the kind of evidence the discipline is designed to produce. The alternative would be thinking we have ML edge when we don't.

12 · Why this approach (vs most retail algo trading)

Most people who try to build trading bots follow some version of this loop:

Have an idea.
Code it up.
Run a back-test.
Tweak parameters until the back-test looks good.
Deploy with real money.
Lose money.
Tweak the strategy in response to live losses.
Lose more money.
Quit.

Fortuna's loop looks different:

Have an idea.
Pass the three-question test, or kill the idea.
Write down the hypothesis and what would falsify it (the PAP).
Get clean data.
Code it up.
Back-test with lookahead-analysis.
Walk-forward / CPCV validate; compute DSR + PBO.
If pre-registered conditions met → dryrun; else file the kill with reasons.
30 days of dryrun discipline.
Type GO LIVE if and only if the discipline says yes.
Monitor live; kill rules pull back automatically if drawdown limits are breached.

The first loop optimises for "ship something that works". The second optimises for "find out the truth, then ship only if the truth supports it". The second is much slower in the early days. The expectation is that it's much more durable in the long run, because it doesn't accumulate hidden self-deception.

The core asymmetry: a strategy that fails live can lose real money fast. A strategy that gets killed in _4-backtest only costs you the time spent coding and back-testing. The discipline trades a lot of cheap failures (killed back-tests) for fewer expensive failures (live drawdowns). That's the bet.

13 · The human and the AI — division of labour

Modern research labs increasingly run on a partnership between human strategists and AI coding assistants. Fortuna is built around a clean version of that division:

Human strategists decide what ideas to pursue, what to kill, when to deploy live capital, and what counts as evidence. Every irreversible action is reviewed before it happens.
AI coding assistants handle execution: writing code, running back-tests, filing PAPs from templates, computing statistics, drafting findings, editing research pages. They cannot deploy live capital without explicit human confirmation.

The AI's job is to preserve the discipline — every PAP it files has the kill conditions filled in; every script it writes runs the lookahead linter; every result it reports cites the registry entry. The human's job is to make the decisions only a human can make: which ideas matter, when to stop a line of work, when to risk real money. This is the operational pattern Fortuna uses; the discipline framework itself is what makes the partnership safe.

14 · Sources & further reading

The discipline framework Fortuna runs on is not original — it borrows heavily from the academic and industry literature on quantitative finance, with particular debt to Marcos López de Prado's body of work. The substrate (PSR, DSR, PBO, CPCV, meta-labelling) is implemented per his published formulas.

Marcos López de Prado

QuantResearch.org — his official site with most papers freely downloadable.
Wikipedia entry — background and notable contributions.
Advances in Financial Machine Learning (Wiley, 2018) — the canonical book. Source of the meta-labelling, CPCV, and most of the metrics framework Fortuna uses.
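Machine Learning for Asset Managers (Cambridge University Press, 2020): shorter follow-up. The "Seven Sins of Quantitative Investing" list (Luo et al., 2014, summarised in the backtesting chapter of Advances in Financial Machine Learning) is the in-house reference for what NOT to do.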

Specific papers used in Fortuna's metric implementations

Bailey & López de Prado (2012) — The Sharpe Ratio Efficient Frontier. Defines the Probabilistic Sharpe Ratio (PSR). SSRN.
Bailey & López de Prado (2014) — The Deflated Sharpe Ratio. The selection-bias-adjusted PSR; this is the one that exposed K009. SSRN.
Bailey, Borwein, López de Prado & Zhu (2015) — The Probability of Backtest Overfitting. The PBO via combinatorially symmetric cross-validation. SSRN.

Other foundations

Politis & Romano (1994) — The Stationary Bootstrap. The bootstrap method used for confidence intervals on serially-correlated trade returns.
Erb & Harvey (2013) — The Golden Dilemma. Background for the gold-silver mean-reversion hypothesis tested by Fortuna's metals research.
Pre-registration as a discipline tool — the methodology is borrowed from clinical trials and the open-science movement, where it is the standard defense against post-hoc rationalisation. The Center for Open Science (cos.io/initiatives/prereg) maintains the canonical reference.