Data Selection for Close-Domain Data in Medical Continual Pretraining: A Case Study on Data Selection via Importance Resampling (DSIR)

Conference proceedings article


Publication Details

Author list: Chatiyar Ardchon, Can Udomcharoenchaikit, Nonthakit Chaiwong, Paisit Khanarsa

Publication year: 2025

URL: https://link.springer.com/chapter/10.1007/978-981-96-6400-9_6



Abstract

Continual pretraining for domain-specific tasks faces significant computational challenges, particularly when the available data is unlabeled and comes from closely related domains. Identifying and ranking the data that most effectively supports learning domain-specific language patterns remains difficult. This study addresses this gap with a case study on Data Selection via Importance Resampling (DSIR), which leverages n-gram distributions for improved performance and time efficiency compared to existing data selection methods. The findings show that DSIR selects clinical data from PubMed covering close-domain topics such as biomolecular science, medical ethics, and medical procedures. This selection improves ICD-10 classification, a complex medical task, increasing micro F1 and precision at 8 by approximately 5.17% and 3.71%, respectively, over random selection. Additionally, the study finds that the KL divergence of the selected data correlates strongly with end-task performance and could serve as an early indicator of data selection quality for continual pretraining.
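
The abstract gives no implementation details, so the following is only a minimal sketch of how importance resampling over n-gram distributions and a KL-divergence check could look. It rests on several assumptions that are not taken from this record: hashed unigram and bigram features, add-one smoothing, Gumbel top-k sampling without replacement, and KL divergence measured from the target distribution to the selected-data distribution. All function names, the bucket count, and the toy texts are hypothetical.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 1 << 16  # hypothetical number of hash buckets


def hashed_ngrams(text, num_buckets=NUM_BUCKETS):
    """Map the unigrams and bigrams of a text to hash-bucket indices."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in grams]


def fit_bucket_distribution(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Estimate a smoothed categorical distribution over hash buckets."""
    counts = Counter()
    for text in texts:
        counts.update(hashed_ngrams(text, num_buckets))
    total = sum(counts.values()) + smoothing * num_buckets
    return [(counts[b] + smoothing) / total for b in range(num_buckets)]


def log_importance_weight(text, p_target, p_raw, num_buckets=NUM_BUCKETS):
    """Sum of log(p_target / p_raw) over the hashed n-grams of one text."""
    return sum(
        math.log(p_target[b]) - math.log(p_raw[b])
        for b in hashed_ngrams(text, num_buckets)
    )


def select(raw_texts, target_texts, k, seed=0, num_buckets=NUM_BUCKETS):
    """Resample k raw texts without replacement via the Gumbel top-k trick."""
    rng = random.Random(seed)
    p_target = fit_bucket_distribution(target_texts, num_buckets)
    p_raw = fit_bucket_distribution(raw_texts, num_buckets)

    def gumbel_noise():
        u = max(rng.random(), 1e-12)  # avoid log(0)
        return -math.log(-math.log(u))

    keyed = [
        (log_importance_weight(t, p_target, p_raw, num_buckets) + gumbel_noise(), i)
        for i, t in enumerate(raw_texts)
    ]
    top = sorted(keyed, reverse=True)[:k]
    return [raw_texts[i] for _, i in top]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Toy data standing in for target-domain notes and a raw candidate pool.
    target = ["patient diagnosed with pneumonia", "patient treated with antibiotics"]
    raw = [
        "patient diagnosed with diabetes",
        "football match ended in a draw",
        "stock prices fell sharply today",
        "patient treated with insulin",
    ]
    chosen = select(raw, target, k=2)
    p_target = fit_bucket_distribution(target)
    p_chosen = fit_bucket_distribution(chosen)
    print(chosen)
    print(kl_divergence(p_target, p_chosen))
```

Under these assumptions, a lower KL divergence between the target distribution and the selected subset would indicate that the subset's n-gram statistics sit closer to the target domain, which is the kind of quantity the abstract describes as a possible early indicator of selection quality.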




Last updated on 2025-08-28 at 00:00