Research Statement | Jingwei Ni

Topic

One reporting ability, behind everything.

I build AI systems that know when to be trusted: I extract the uncertainty a model already carries and expose it for human supervision. One requirement sits behind everything I do — a trustworthy model should report low confidence whenever it is likely to be wrong, whatever the cause — and that single reporting ability is what makes human oversight possible. Uncertainty here comes in two flavors, epistemic (what the model doesn't know) and aleatoric (irreducible ambiguity in the world) (Hüllermeier & Waegeman, 2021; Shorinwa et al., 2024); what matters for trust is that the model reports either one.

Why this matters now

Uncertainty reporting is the bottleneck.

01

The safer path to advanced AI runs on uncertainty reporting.

Bengio et al.'s Scientist AI makes a simple case. Agentic systems tend to develop implicit, instrumental goals (self-preservation, power-seeking, reward hacking via overoptimization) that emerge regardless of their initial objective. So rather than trust an agent on its own, we should supervise it with a non-agentic, uncertainty-aware system that acts as a guardrail (Bengio et al., 2025). Building that kind of supervisor is the tractable core of my work.

02

Our evaluation tools can't see disagreement.

The field ranks, filters, and aligns models with LLM-as-a-judge (Zheng et al., 2023; Gu et al., 2024). But "helpful," "harmful," and "novel" are contested across legitimate viewpoints, yet a judge commits to a single reading and returns a confident score, leaving leaderboards most miscalibrated exactly where humans most disagree (Chochlakis et al., 2025; Baan et al., 2022; Li et al., 2025a).

03

Deployment is full of the hard kind of uncertainty.

Real instructions are underspecified and real judgments are contested, yet agents are rewarded only for producing answers whose correctness can be verified heuristically, as in math and code (Lambert et al., 2024; DeepSeek-AI, 2025), and never for being uncertainty-aware. As a result, they invent the missing detail and act instead of asking (Wang et al., 2024; Li et al., 2025b).

04

Scaling will not rescue it.

When uncertainty is irreducible, more compute on the same ambiguous input cannot resolve it — and my own work shows that making models reason longer actively harms their ability to represent human disagreement (Ni et al., 2026).

My research

One goal, measured one way.

A trustworthy model should signal its own unreliability, so I evaluate every method with the same source-agnostic tools — calibration and selective-prediction AUROC — which test whether reported confidence tracks actual errors, not which kind of uncertainty caused them.
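As a rough illustration of what these two source-agnostic measures compute, here is a minimal sketch in Python. The function names, the equal-width binning, and the toy inputs are assumptions made for illustration, not the exact evaluation code used in my papers. Both functions take only a reported confidence and a correctness flag per prediction, so neither needs to know which kind of uncertainty caused an error.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 falls in the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

def correctness_auroc(confidence, correct):
    """Probability that a correct prediction gets higher confidence than a wrong one."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = confidence[correct], confidence[~correct]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one correct and one incorrect prediction")
    # Compare every (correct, incorrect) pair; ties count as half a win.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical toy data: six predictions with reported confidence and correctness.
conf = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
hit = [1, 1, 0, 1, 0, 0]
ece = expected_calibration_error(conf, hit)
auroc = correctness_auroc(conf, hit)
```

A low calibration error means the stated confidence matches the observed accuracy. A high AUROC means that abstaining on the least confident predictions removes mostly errors, which is what selective prediction relies on.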

What changes is how tractable the problem is, and that turns on whether the uncertainty is objectively checkable. Where it is (verifying a math reasoning step, judging whether retrieved evidence is relevant), labels are cheap, models are strong, and uncertainty can be trained and rigorously tested. One of my own favorite papers, ReProbe, lives here: it reads a model's internal states mid-reasoning to tell when an answer has settled, scaling test-time compute only as far as needed, and it is evaluated against indisputable gold labels with extensive out-of-distribution tests. It is exactly the kind of non-agentic, uncertainty-reporting verifier that Scientist AI calls for.

The same reporting goal extends to far less tractable terrain: subjective classification and conflicting human labels, long argued to be signal rather than noise (Plank, 2022; Uma et al., 2021; Xu et al., 2025). There I build uncertainty-aware classifiers distilled into cheap, deployable models (AFaCTA, DIRAS), evidence-grounded specialists whose every claim is traceable to its source (Evidence-Based QA), and interfaces that surface the edge cases a classifier quietly gets wrong so people adjudicate the contested ones (Co-DETECT). In this terrain the gold signal is a distribution of disagreeing human labels, which is scarce, costly, and far from mainstream benchmarks, so even evaluation is an open problem. That gap is itself a central finding: because this uncertainty cannot be cheaply scored, it never enters the reward, and models are never trained to handle it.
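To make "a distribution of disagreeing human labels" concrete, the sketch below turns a handful of annotator votes into a soft label and compares it with a model's predicted distribution. The choice of total variation distance, the function names, and the votes themselves are illustrative assumptions. Since evaluation here is an open problem, this is one possible comparison and not a settled metric.

```python
from collections import Counter

def label_distribution(votes, labels):
    """Empirical distribution of annotator votes over a fixed label order."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[label] / total for label in labels]

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same labels."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical example: five annotators split 3 to 2 on a single item.
labels = ["harmful", "harmless"]
votes = ["harmful", "harmful", "harmless", "harmful", "harmless"]
human = label_distribution(votes, labels)  # soft gold label for this item

# A judge that commits to one reading is far from the human split,
# while a model that reports the disagreement is close to it.
overconfident = total_variation(human, [0.95, 0.05])
hedged = total_variation(human, [0.65, 0.35])
```

The example also shows why such labels are costly: every item needs several independent annotations before its distribution means anything.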

References

Works cited.

Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3), 457–506.
Shorinwa, O., Mei, Z., Lidard, J., Ren, A. Z., & Majumdar, A. (2024). A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions. arXiv:2412.05563.
Bengio, Y., et al. (2025). Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657.
Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks.
Gu, J., et al. (2024). A Survey on LLM-as-a-Judge. arXiv:2411.15594.
Chochlakis, G., et al. (2025). Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors. NAACL.
Baan, J., et al. (2022). Stop Measuring Calibration When Humans Disagree. EMNLP.
Li, Z., Schuurmans, D., Dai, B., & Palangi, H. (2025a). Judging with Confidence: Calibrating Autoraters to Preference Distributions. arXiv:2510.00263.
Lambert, N., et al. (2024). Tülu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124.
DeepSeek-AI (Guo, D., et al.). (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645, 633–638 (arXiv:2501.12948).
Wang, W., et al. (2024). Learning to Ask: When LLM Agents Meet Unclear Instruction. arXiv:2409.00557.
Li, B. Z., Kim, J., & Wang, R. (2025b). QuestBench: Can LLMs Ask the Right Question to Acquire Information in Reasoning Tasks? NeurIPS Datasets and Benchmarks (arXiv:2503.22674).
Ni, J., Fan, Y., et al. (2026). Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? EACL.
Plank, B. (2022). The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. EMNLP.
Uma, A., et al. (2021). Learning from Disagreement: A Survey. Journal of Artificial Intelligence Research, 72, 1385–1470.
Xu, M., Santosh, T. Y. S. S., & Plank, B. (2025). From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP. arXiv:2510.12817.
