A trustworthy model should signal its own unreliability, so I evaluate every method with the same source-agnostic tools — calibration and selective-prediction AUROC — which test whether reported confidence tracks actual errors, not which kind of uncertainty caused them.
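A minimal sketch of the two metrics, assuming each prediction comes with a scalar confidence in [0, 1] and a binary correctness flag. The function names, the ten equal-width bins, and the pairwise AUROC formulation are illustrative choices, not an implementation taken from any particular paper.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1.
    bin_ids = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

def selective_auroc(conf, correct):
    """Probability that a correct prediction gets higher confidence than an error."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos = conf[correct]    # confidences on correct predictions
    neg = conf[~correct]   # confidences on errors
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one correct and one incorrect prediction")
    diff = pos[:, None] - neg[None, :]
    # Ties count as half, matching the usual rank-based AUROC definition.
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy data (hypothetical): four predictions, one of them wrong.
conf = [0.95, 0.85, 0.65, 0.55]
correct = [1, 1, 0, 1]
ece = expected_calibration_error(conf, correct)
auroc = selective_auroc(conf, correct)
```

Neither function asks where the uncertainty came from: both see only confidences and outcomes, which is what makes the evaluation source-agnostic.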
What changes is how tractable the problem is, and that turns on whether the uncertainty is objectively checkable. Where it is (verifying a math reasoning step, judging whether retrieved evidence is relevant), labels are cheap, models are strong, and uncertainty can be trained and rigorously tested. One of my own favorite papers, ReProbe, lives here: it reads a model's internal states mid-reasoning to tell when an answer has settled, scaling test-time compute only as far as needed, and it is evaluated against unambiguous gold labels with extensive out-of-distribution tests. It is exactly the kind of non-agentic, uncertainty-reporting verifier that Scientist AI calls for.
The same reporting goal extends to far less tractable terrain — subjective classification and conflicting human labels, long argued to be signal rather than noise (Plank, 2022; Uma et al., 2021; Xu et al., 2025) — where I build uncertainty-aware classifiers distilled into cheap, deployable models (AFaCTA, DIRAS), evidence-grounded specialists whose every claim is traceable to its source (Evidence-Based QA), and interfaces that surface the edge cases a classifier quietly gets wrong so people adjudicate the contested ones (Co-DETECT). Here the gold signal is a distribution of disagreeing human labels — scarce, costly, and far from mainstream benchmarks — so even evaluation is an open problem. That gap is itself a central finding: because this uncertainty cannot be cheaply scored, it never enters the reward, and models are never trained to handle it.
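To make "a distribution of disagreeing human labels" concrete, here is a minimal sketch of one common but admittedly partial way to score a model against such a target: cross-entropy between the annotators' vote distribution and the model's predicted distribution. The function names and the toy votes are hypothetical, and this illustrates the general soft-label idea rather than the evaluation used in any of the systems named above.

```python
import numpy as np

def soft_label_distribution(votes, n_classes):
    """Turn a list of annotator votes (class indices) into a probability vector."""
    counts = np.bincount(np.asarray(votes), minlength=n_classes).astype(float)
    return counts / counts.sum()

def soft_cross_entropy(human_dist, model_dist, eps=1e-12):
    """Cross-entropy of the model's distribution against the human label distribution."""
    human_dist = np.asarray(human_dist, dtype=float)
    # Clip to avoid log(0) when the model assigns zero probability to a class.
    model_dist = np.clip(np.asarray(model_dist, dtype=float), eps, 1.0)
    return float(-(human_dist * np.log(model_dist)).sum())

# Toy item (hypothetical): five annotators, three vote class 0 and two vote class 1.
human = soft_label_distribution([0, 0, 1, 0, 1], n_classes=2)

matched = soft_cross_entropy(human, [0.6, 0.4])          # mirrors the disagreement
overconfident = soft_cross_entropy(human, [0.99, 0.01])  # ignores the disagreement
```

On this item the overconfident model scores worse than the one that mirrors the annotators, but the score is only as reliable as the handful of votes behind it, which is one reason evaluation in this regime remains difficult.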