Bridging Human Perception and Vision-Language Models: A Perspective on Explainable Aesthetic Evaluation
Conference proceedings article
Publication Details
Author list: Supatta Viriyavisuthisakul, Toshihiko Yamasaki
Publication year: 2026
Languages: English-United States (EN-US)
Abstract
Aesthetic evaluation of images is subjective, culturally dependent, and shaped by complex perceptual factors. Many studies have sought to automate this assessment with deep neural networks, but significant challenges remain, including the lack of objective ground truth and the difficulty of grounding aesthetic explanations in human perception. Recently, vision-language models (VLMs) and multimodal large language models (MLLMs) have opened new opportunities for both aesthetic prediction and natural-language reasoning; however, their explanations often diverge from human perception, revealing persistent gaps in interpretability and alignment. This perspective synthesizes recent approaches to explainable image aesthetics, including prompt-based CLIP evaluation, SHAP-based factor attribution, multimodal reasoning, and explanation-guided image retouching. We highlight limitations related to prompt sensitivity, cultural bias, and inconsistencies among explanation modalities, and we propose a unified framework that bridges quantitative attribute analysis with semantic reasoning. By grounding aesthetic evaluation in human-centered perception and aligning VLM explanations with interpretable visual evidence, this work aims to provide a roadmap for future research on transparent, user-centric aesthetic assessment systems.