Data Selection for Close-Domain Data in Medical Continual Pretraining: A Case Study on Data Selection via Importance Resampling (DSIR)
Conference proceedings article
Publication Details
Author list: Chatiyar Ardchon, Can Udomcharoenchaikit, Nonthakit Chaiwong, Paisit Khanarsa
Publication year: 2025
URL: https://link.springer.com/chapter/10.1007/978-981-96-6400-9_6
Abstract
Continual pretraining for domain-specific tasks faces significant computational challenges, particularly when the available data is unlabeled and drawn from closely related domains. Identifying and ranking the data that best supports learning domain-specific language patterns remains difficult. This study addresses that gap with a case study on Data Selection via Importance Resampling (DSIR), which leverages n-gram distributions for improved performance and time efficiency compared with existing data selection methods. The findings show that DSIR selects clinical data from PubMed covering close-domain topics such as biomolecular science, medical ethics, and medical procedures. This selection improves ICD-10 classification, a complex medical task, increasing micro F1 and precision at 8 (P@8) by approximately 5.17% and 3.71%, respectively, over random selection. The study also finds that the KL divergence of the selected data correlates strongly with end-task performance and may therefore serve as an early indicator of data selection quality for continual pretraining.
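To make the idea of n-gram-based importance resampling concrete, the following is a minimal sketch and not the authors' implementation. It assumes hashed unigram and bigram features, add-one smoothing, and Gumbel top-k sampling for resampling without replacement. The function names, the bucket count, and the direction of the KL divergence (target relative to selected data) are illustrative assumptions that the abstract does not specify.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 1024  # hypothetical hash-table size, kept small for illustration


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Map the unigrams and bigrams of a text to hash-bucket counts."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return Counter(zlib.crc32(g.encode("utf-8")) % num_buckets for g in ngrams)


def fit_distribution(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Estimate a smoothed bucket distribution from a list of texts."""
    totals = Counter()
    for text in texts:
        totals.update(hashed_ngram_counts(text, num_buckets))
    denom = sum(totals.values()) + smoothing * num_buckets
    return [(totals[b] + smoothing) / denom for b in range(num_buckets)]


def log_importance_weight(text, target_p, raw_q):
    """Sum of count * log(p_target / q_raw) over the text's hashed n-grams."""
    counts = hashed_ngram_counts(text, len(target_p))
    return sum(
        c * (math.log(target_p[b]) - math.log(raw_q[b]))
        for b, c in counts.items()
    )


def kl_divergence(p, q):
    """KL(p || q) for two smoothed (strictly positive) distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def select_indices(raw_texts, target_texts, k, seed=0):
    """Pick k raw texts without replacement, favoring high importance weights."""
    rng = random.Random(seed)
    p = fit_distribution(target_texts)
    q = fit_distribution(raw_texts)
    keyed = []
    for i, text in enumerate(raw_texts):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        # Gumbel top-k: adding Gumbel noise to log-weights and taking the
        # top k samples without replacement in proportion to the weights.
        keyed.append((log_importance_weight(text, p, q) + gumbel, i))
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]
```

Under these assumptions, one could call `select_indices(raw_texts, target_texts, k)` and then compare `kl_divergence(p_target, fit_distribution(selected_texts))` across candidate selections as a cheap proxy for selection quality before running any continual pretraining.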