Bridging Human Perception and Vision-Language Models: A Perspective on Explainable Aesthetic Evaluation

Conference proceedings article


Publication Details

Author list: Supatta Viriyavisuthisakul, Toshihiko Yamasaki

Publication year: 2026

Languages: English-United States (EN-US)


Abstract

Aesthetic evaluation of images is subjective, culturally dependent, and shaped by complex perceptual factors. Several studies aim to automate this assessment using deep neural networks, but significant challenges remain, such as the lack of objective ground truth and the difficulty of providing explanations for aesthetics grounded in human perception. Recently, vision-language models (VLMs) and multimodal large language models (MLLMs) have been explored as a way to enable both aesthetic prediction and natural language reasoning; however, their explanations often diverge from human perception, revealing persistent gaps in interpretability and alignment. This perspective synthesizes recent approaches to explainable image aesthetics, including prompt-based CLIP evaluation, SHAP-based factor attribution, multimodal reasoning, and explanation-guided image retouching. We highlight limitations related to prompt sensitivity, cultural bias, and inconsistencies among explanation modalities, and we propose a unified framework that bridges quantitative attribute analysis with semantic reasoning. By grounding aesthetic evaluation in human-centered perception and aligning VLM explanations with interpretable visual evidence, this work aims to provide a roadmap for future research on transparent and user-centric aesthetic assessment systems.




Last updated on 2026-05-03 at 12:00