AL-ViT: Label-Efficient Robusta Coffee-Bean Defect Detection in Thailand Using Active Learning Vision Transformers
Journal article
Publication Details
Author list: Sirawich Vachmanus, Wimolsiri Pridasawas, Worapan Kusakunniran, Kitti Thamrongaphichartkul, Noppanan Phinklao
Publisher: Elsevier
Publication year: 2026
Volume number: 29
ISSN: 2667-3053
Languages: English-United States (EN-US)
Abstract
In major trading and export markets, the coffee bean grading process still relies heavily on manual
labor to sort individual beans from large harvest volumes. This labor-intensive task is time-consuming,
costly, and prone to human error, especially within Thailand’s rapidly expanding Robusta coffee sector.
This study introduces AL–ViT, an end-to-end Active-Learning Vision Transformer framework that
operationalizes active learning and transformer-based feature extraction within a single, production-oriented
pipeline. The framework integrates a ViT-Base/16 backbone with seven active learning (AL)
query strategies: random sampling, entropy-based selection, Bayesian Active Learning by
Disagreement (BALD), Batch Active Learning by Diverse Gradient Embeddings (BADGE), Core-Set
diversity sampling, ensemble disagreement, and a novel hybrid uncertainty–diversity strategy designed
to balance informativeness and representativeness during sample acquisition. A high-resolution dataset
of 2,098 Robusta coffee bean images was collected under controlled-lighting conditions aligned with
grading-machine setups, with only 5 % initially labeled and the remainder forming the AL pool. Across
five random seeds, the hybrid strategy without MixUp augmentation achieved 97.1 % accuracy and an
F1_bad of 0.956 using just 850 labels (41 % of the dataset), within 0.3 percentage points of full
supervision. Operational reliability, defined as 95 % accuracy, consistent with prior inspection
benchmarks, was reached with only 407 labels, reflecting a 75 % reduction in annotation. Entropy
sampling showed the fastest early-stage gains, whereas BADGE lagged by more than 1 percentage point; Core-Set and
ensemble disagreement provided moderate but stable results. Augmentation and calibration analyses indicated that
explicit methods (MixUp, CutMix, RandAugment) offered no further benefit, with the hybrid pipeline
already achieving well-calibrated probabilities. Statistical validation via paired t-tests, effect sizes, and
bootstrap CIs confirmed consistent improvements of uncertainty-driven strategies over random
sampling. Overall, the proposed AL–ViT framework establishes a label-efficient and practically
deployable approach for agricultural quality control, achieving near-supervised accuracy at a fraction
of the labeling cost.
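
The abstract names entropy-based selection and a hybrid uncertainty-diversity strategy but does not say how the hybrid combines the two signals. The sketch below shows one common pattern, offered purely as an illustrative assumption and not as the authors' method: shortlist the most uncertain pool samples by predictive entropy, then pick a diverse batch from that shortlist with a greedy k-center (Core-Set style) rule over feature embeddings. The function names, the `pool_factor` parameter, and the use of Euclidean distance are all hypothetical choices made for this example.

```python
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def hybrid_query(probs: np.ndarray, features: np.ndarray,
                 batch_size: int, pool_factor: int = 3) -> np.ndarray:
    """Return pool indices to label next (hypothetical hybrid rule).

    Step 1 (uncertainty): keep the batch_size * pool_factor samples
    with the highest predictive entropy.
    Step 2 (diversity): greedy k-center selection inside that shortlist,
    seeded with the single most uncertain sample.
    """
    entropy = predictive_entropy(probs)
    n_candidates = min(len(entropy), batch_size * pool_factor)
    # Stable sort so ties are broken by original index order.
    candidates = np.argsort(-entropy, kind="stable")[:n_candidates]
    cand_feats = features[candidates]

    chosen = [0]  # position 0 in the shortlist is the most uncertain sample
    min_dist = np.linalg.norm(cand_feats - cand_feats[0], axis=1)
    min_dist[0] = -1.0  # mark as already selected

    while len(chosen) < min(batch_size, n_candidates):
        nxt = int(np.argmax(min_dist))  # farthest from everything chosen so far
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(cand_feats - cand_feats[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -1.0

    return candidates[chosen]
```

In an active learning loop of the kind the abstract describes, `probs` would come from the current model's softmax outputs on the unlabeled pool and `features` from its embeddings; the returned indices are sent for annotation and the model is retrained. Setting `pool_factor=1` reduces this sketch to plain entropy sampling.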
Keywords
Active Learning, Coffee Grading, Vision Transformer






