Treating Human Uncertainty in Human-Machine Teaming

Katherine M. Collins1, Umang Bhatt1, Adrian Weller1

1 Cambridge University, UK

If a collaborator is unsure about a task that you are working on together, you expect them to communicate their uncertainty. We argue that this practice should carry over to the development and deployment of human-machine teams: if any team member is uncertain, efforts should be made to communicate that uncertainty and to resolve or compensate for it. The machine learning community has designed frameworks that encourage a model to better incorporate uncertainty in its outputs [1], for instance by generating calibrated predictions [2, 3] or by producing a set of plausible responses rather than a single estimate when unsure [4, 5]. However, while human probabilistic reasoning has been studied extensively within the cognitive science, psychology, and crowdsourcing communities [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], the representation of human uncertainty in machine learning specifically has received limited attention and, when considered, is often captured through a single scalar measure [17, 18, 19, 20, 21, 22]. We argue for further elicitation and incorporation of human uncertainty, not just model uncertainty.

Our research program aims to address this gap. In Collins et al. [23], we took a step in this direction by eliciting soft labels over multinomial label distributions representing annotators’ uncertainty in challenging image classification, finding that training models with these richer labels can improve model generalization, robustness, and calibration. This work has so far focused only on improving the performance of the machine learning system itself, yet human uncertainty also has the potential to support the design of more effective and reliable collaborative systems. Recent work has already begun to show that humans who are uncertain are more likely to side with a model, even if the model is wrong [24]. How, then, can we guard against propagating human biases in decision-making while still designing systems that complement humans [25] and account for their uncertainty, enabling safer decision-making under ambiguity? This question is crucial in high-stakes settings, such as forming a treatment plan for a patient with comorbidities or deciding which of a set of policies to enact in response to geopolitical or climate instability. We believe the next generation of human-machine collaborative systems will benefit from a careful treatment of both model and human uncertainty to adapt to an ever-uncertain world, and to do so in ways that engender trust through appropriate transparency [26, 27].
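
As a rough illustration of what learning from soft labels can look like, the sketch below averages per-annotator label distributions into a single soft label and scores model logits against it with cross-entropy. It is a minimal sketch under stated assumptions, not the procedure used in [23]: the function names, the simple averaging rule, and the example numbers are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(weights):
    """Rescale non-negative annotator weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def average_soft_labels(annotator_labels):
    """Average per-annotator distributions into one soft label.

    Assumption: every annotator is weighted equally.
    """
    n = len(annotator_labels)
    k = len(annotator_labels[0])
    return [sum(label[j] for label in annotator_labels) / n for j in range(k)]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits).

    A one-hot target recovers the usual hard-label loss.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical example: two annotators, three classes.
annotators = [normalize([1, 1, 0]), normalize([1, 0, 0])]
soft_label = average_soft_labels(annotators)
hard_label = [1.0, 0.0, 0.0]
logits = [2.0, 1.0, -1.0]

loss_soft = cross_entropy(soft_label, logits)
loss_hard = cross_entropy(hard_label, logits)
```

With a soft target, the loss is lowest when the model's predicted distribution matches the annotators' expressed uncertainty, rather than when it places all mass on one class.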


[1] Z. Ghahramani, “Probabilistic machine learning and artificial intelligence,” Nature, vol. 521, no. 7553,
pp. 452–459, 2015.
[2] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in
International conference on machine learning. PMLR, 2017, pp. 1321–1330.
[3] A. Fisch, T. Jaakkola, and R. Barzilay, “Calibrated selective classification,” arXiv preprint
arXiv:2208.12084, 2022.
[4] S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. Jordan, “Distribution-free, risk-controlling
prediction sets,” Journal of the ACM (JACM), vol. 68, no. 6, pp. 1–34, 2021.
[5] V. Babbar, U. Bhatt, and A. Weller, “On the utility of prediction sets in human-AI teams,” in IJCAI,
[6] S. Lichtenstein, B. Fischhoff, and L. D. Phillips, “Calibration of probabilities: The state of the art,”
Decision making and change in human affairs, pp. 275–324, 1977.
[7] A. Tversky and D. Kahneman, “On the reality of cognitive illusions,” Psychological Review, vol. 103,
no. 3, pp. 582–591, 1996.
[8] A. O’Hagan, C. E. Buck, A. Daneshkhah, J. R. Eiser, P. H. Garthwaite, D. J. Jenkinson, J. E. Oakley,
and T. Rakow, Uncertain Judgements: Eliciting Expert Probabilities. Chichester: John Wiley, 2006.
[9] T. Sharot, “The optimism bias,” Current biology, vol. 21, no. 23, pp. R941–R945, 2011.
[10] N. D. Goodman, J. B. Tenenbaum, and T. Gerstenberg, “Concepts in a probabilistic language of
thought,” Center for Brains, Minds and Machines (CBMM), Tech. Rep., 2014.
[11] D. G. Goldstein and D. Rothschild, “Lay understanding of probability distributions,” Judgment and
Decision Making, vol. 9, no. 1, p. 1, 2014.
[12] R. F. Murray, K. Patel, and A. Yee, “Posterior probability matching and human perceptual decision
making,” PLoS computational biology, vol. 11, no. 6, p. e1004342, 2015.
[13] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and
think like people,” Behavioral and brain sciences, vol. 40, 2017.
[14] J. J. Y. Chung, J. Y. Song, S. Kutty, S. R. Hong, J. Kim, and W. S. Lasecki, “Efficient elicitation
approaches to estimate collective crowd answers,” Proc. ACM Hum.-Comput. Interact., vol. 3, no.
CSCW, Nov. 2019.
[15] A. O’Hagan, “Expert knowledge elicitation: Subjective but scientific,” The American Statistician,
vol. 73, no. sup1, pp. 69–81, 2019.
[16] M. A. Peters, “Confidence in decision-making,” Mar. 2022.
[17] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie, “Visual
recognition with humans in the loop,” in European Conference on Computer Vision. Springer, 2010,
pp. 438–451.
[18] Q. Nguyen, H. Valizadegan, and M. Hauskrecht, “Learning classification models with soft-label
information,” Journal of the American Medical Informatics Association, vol. 21, no. 3, pp. 501–508,
[19] J. Song, H. Wang, Y. Gao, and B. An, “Active learning with confidence-based answers for crowdsourcing labeling tasks,” Knowledge-Based Systems, vol. 159, pp. 244–258, 2018.
[20] L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. van den Oord, “Are we done with ImageNet?”
CoRR, vol. abs/2006.07159, 2020.
[21] K. Vodrahalli, R. Daneshjou, T. Gerstenberg, and J. Zou, “Do humans trust advice more if it comes
from AI? An analysis of human-AI interactions,” in Proceedings of the 2022 AAAI/ACM Conference on
AI, Ethics, and Society, ser. AIES ’22. New York, NY, USA: Association for Computing Machinery,
2022, pp. 763–777.
[22] M. Steyvers, H. Tejeda, G. Kerrigan, and P. Smyth, “Bayesian modeling of human–AI complementarity,”
Proceedings of the National Academy of Sciences, vol. 119, no. 11, p. e2111547119, 2022.
[23] K. M. Collins, U. Bhatt, and A. Weller, “Eliciting and learning with soft labels from every annotator,”
in Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP), vol. 10,
[24] E. Bondi, R. Koster, H. Sheahan, M. Chadwick, Y. Bachrach, T. Cemgil, U. Paquet, and K. Dvijotham,
“Role of human-AI interaction in selective prediction,” in Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 36, no. 5, 2022, pp. 5286–5294.
[25] B. Wilder, E. Horvitz, and E. Kamar, “Learning to complement humans,” 2020.
[26] U. Bhatt, Y. Zhang, J. Antorán, Q. V. Liao, P. Sattigeri, R. Fogliato, G. G. Melançon, R. Krishnan,
J. Stanley, O. Tickoo, L. Nachman, R. Chunara, A. Weller, and A. Xiang, “Uncertainty as a form of
transparency: Measuring, communicating, and using uncertainty,” CoRR, vol. abs/2011.07586, 2020.
[27] C. Schneider, A. Freeman, D. Spiegelhalter, and S. van der Linden, “The effects of communicating
scientific uncertainty on trust and decision making in a public health context,” 2022.