Evaluating image captions is essential for ensuring both linguistic fluency and accurate semantic alignment with visual content. While reference-free metrics such as CLIPScore have advanced automated caption evaluation, most existing work on learned evaluation metrics remains limited to pointwise, English-centric assessments, leaving significant gaps in the reliability, interpretability, and multilingual inclusivity of vision-and-language evaluation.
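For context, CLIPScore rates a caption without references by scaling the cosine similarity between CLIP embeddings of the image and the caption, typically as 2.5 * max(cos, 0). The sketch below illustrates this computation; it assumes the Hugging Face `transformers` CLIP implementation and is meant only as a simplified illustration, not the exact setup used in the talk.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch of reference-free CLIPScore:
#   score = 2.5 * max(cos(image_embedding, caption_embedding), 0)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings and take their cosine similarity.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)
```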
In this seminar session, I will explore extensions of current English-centric benchmarks to a multilingual setting, promoting the development of more inclusive evaluation frameworks.
Additionally, I will present two extensions of the CLIPScore metric aimed at improving its interpretability and reliability in real-world applications. Leveraging a model-agnostic conformal risk control framework, I will explore the calibration of CLIPScore distributions with respect to task-specific control variables, addressing both granular assessment of individual word errors within captions and the calibration of raw score distributions into more reliable intervals for caption evaluation, improving the correlation between uncertainty estimates and prediction errors.
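As a simplified illustration of the calibration idea (plain split conformal prediction rather than the full conformal risk control framework presented in the talk), the sketch below turns a raw metric score into an interval whose width is set by absolute residuals on a held-out calibration set; all variable names and the choice of residual score are assumptions for illustration.

```python
import numpy as np

def conformal_interval(cal_scores, cal_targets, test_score, alpha=0.1):
    """Split-conformal interval around a raw metric score.

    cal_scores:  metric predictions on a held-out calibration set
    cal_targets: corresponding reference judgments (e.g. human ratings)
    test_score:  raw metric score for a new caption
    alpha:       miscoverage level (targets 1 - alpha coverage)
    """
    residuals = np.abs(np.asarray(cal_targets) - np.asarray(cal_scores))
    n = len(residuals)
    # Conformal quantile with the finite-sample correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(residuals, min(q_level, 1.0), method="higher")
    return test_score - q_hat, test_score + q_hat
```

In this toy setting, wider intervals signal less trustworthy scores, which is one way the correlation between uncertainty estimates and prediction errors can be made explicit.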