Generative question-answering models (e.g., UnifiedQA) now achieve state-of-the-art performance on a variety of datasets. Despite this remarkable performance, these models still produce wrong answers with high confidence scores. The responsible use of such systems in high-risk applications, such as healthcare, requires some guarantee that the model’s confidence scores correlate with the correctness of its outputs. One potential approach toward such guarantees is calibration. Despite the vast body of research on calibration for K-ary classification in machine learning, calibrating text-based systems poses additional challenges, ranging from a combinatorial output space to the many possible definitions of correctness. In this talk, we will discuss the challenges in calibrating generative question-answering systems, as well as the current state-of-the-art approaches to address them.
On the Calibration of Generative Question-Answering Models: State-of-the-Art and Challenges
May 17, 2022
1:00 pm