A number is only as honest as how it was chosen
Careful research fails in a way that has little to do with intelligence. The idea can be sound, the code can be clean, and the result can still be worthless, because of how the reported number was arrived at.
Take the ordinary act of trying several candidate strategies and keeping the best one. The score attached to that winner tells you almost nothing about how it will trade. It is the largest of a set of noisy measurements, and the largest of many noisy draws is biased upward: it overstates the winner even when every candidate is equally good. Nobody cheated. The outcome being measured was allowed to influence the choice of what to report, and that alone is enough to spoil it.
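A small simulation makes the point concrete. It is a minimal sketch under stated assumptions, not a model of any real strategy: every candidate has a true edge of exactly zero, measured scores are independent Gaussian noise, and the function name `best_of_n_scores` and its parameters are hypothetical.

```python
import random
import statistics


def best_of_n_scores(n_candidates, n_trials, noise_sd=1.0, seed=0):
    """Return the winning score from each trial.

    Assumption: every candidate has a true edge of zero, so each
    measured score is pure Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    winners = []
    for _ in range(n_trials):
        scores = [rng.gauss(0.0, noise_sd) for _ in range(n_candidates)]
        winners.append(max(scores))  # keep only the best, as in the text
    return winners


# One candidate, reported as-is: the average score sits near the true edge (zero).
mean_single = statistics.mean(best_of_n_scores(n_candidates=1, n_trials=5000))

# Twenty candidates, best one reported: same zero edge, much larger score.
mean_best_of_20 = statistics.mean(best_of_n_scores(n_candidates=20, n_trials=5000))

print(f"mean score, single candidate: {mean_single:.2f}")
print(f"mean score, best of 20:       {mean_best_of_20:.2f}")
```

Under these assumptions the single-candidate average lands near zero, while the best-of-20 average lands well above it, somewhere approaching two noise standard deviations. The gap is produced entirely by the act of choosing; none of it is skill.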
The same leak opens wherever a decision depends on the outcome. Which candidate moves forward. Which configuration stays. Which window gets shown. Which of several runs is called representative. Any of these, made after a glance at the number that ends up in the report, turns that number from something observed into something chosen.
Selection hides in ordinary decisions
What makes this hard to police is that the decisions seldom look like selection. A threshold gets nudged once the outcome is visible. A check meant to catch mistakes starts filtering for the answer you were hoping for. You run it one more time. None of it feels like cooking the books, and all of it does the same damage.
A result is worth trusting only when nothing that shaped it was chosen by looking at it.
We hold to that as a rule, and honoring it is harder than stating it. The temptation is strongest at the worst moment: a promising result is when a researcher most wants to keep looking.
The fix is structural
Trying to be more disciplined in the moment does not work, because the moment is already compromised. What works is deciding in advance. Before an evaluation runs, we settle what will be measured, what would count as failure, and how many things are being tried, and we write it down. After that the evaluation returns whatever it returns, and the number in the report is one we watched happen.
When a result cannot show that separation, we do not argue about whether it is probably fine. It becomes exploratory, and exploratory work is free to suggest ideas while earning no confidence.
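One way to make that separation checkable is to freeze the plan and fingerprint it before any score exists. The following is a minimal sketch, not a description of our actual tooling: the names `EvaluationPlan`, `evaluate`, the `"confirmatory"` label, the metric name, and all the numbers are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvaluationPlan:
    """What is settled before the evaluation runs (hypothetical fields)."""
    metric: str               # what will be measured
    failure_threshold: float  # a score below this counts as failure
    n_candidates: int         # how many things are being tried

    def fingerprint(self) -> str:
        # Hash a canonical serialization so any later change is detectable.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def evaluate(plan, recorded_fingerprint, scores):
    """Apply the pre-set rule to every candidate, or downgrade the result."""
    separated = (
        plan.fingerprint() == recorded_fingerprint  # plan unchanged since written down
        and len(scores) == plan.n_candidates        # nothing added or dropped
    )
    if not separated:
        return {"status": "exploratory", "verdicts": None}
    verdicts = {
        name: "pass" if score >= plan.failure_threshold else "fail"
        for name, score in scores.items()
    }
    return {"status": "confirmatory", "verdicts": verdicts}


# Written down first, before any evaluation runs.
plan = EvaluationPlan(metric="out_of_sample_score", failure_threshold=0.5, n_candidates=3)
recorded = plan.fingerprint()

# Illustrative, made-up scores; every candidate is reported, not only the winner.
scores = {"a": 0.9, "b": 0.2, "c": 0.5}
result = evaluate(plan, recorded, scores)
print(result["status"], result["verdicts"])
```

The design choice worth noting is that `evaluate` never argues: if the plan no longer matches what was recorded, or the number of candidates differs from what was declared, the result is labelled exploratory and no verdicts are issued.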
Why the strictness earns its keep
None of this shows up in a demo. A strategy handed its score by selection looks identical to one that earned it, right until real capital stands behind it and the borrowed optimism fails to reappear. Separating selection from reporting moves that disappointment somewhere cheap: a rejected file on a research machine, months before it could have been a drawdown.
Given the choice, we would rather let a good idea die under an over-strict process than let a flattering number talk its way into a live account.