Data Selection for Close-Domain Data in Medical Continual Pretraining: A Case Study on Data Selection via Importance Resampling (DSIR)

Conference proceedings article

Authors/Editors

PAISIT KHAN-AR-SA

Strategic Research Themes

Digital Transformation (Strategic Research Themes)

Publication Details

Author list: Chatiyar Ardchon, Can Udomcharoenchaikit, Nonthakit Chaiwong, Paisit Khanarsa

Publication year: 2025

URL: https://link.springer.com/chapter/10.1007/978-981-96-6400-9_6


Abstract

Continual pretraining for domain-specific tasks is computationally expensive, particularly when the available data are unlabeled and come from closely related domains. Identifying and ranking the data that best support the learning of domain-specific language patterns remains difficult. This study addresses the gap with a case study on Data Selection via Importance Resampling (DSIR), which leverages n-gram distributions to improve performance and time efficiency relative to existing data selection methods. The findings show that DSIR selects clinical data from PubMed covering close-domain topics such as biomolecular science, medical ethics, and medical procedures. This selection improves ICD-10 classification, a complex medical task, raising micro F1 and precision at 8 by approximately 5.17% and 3.71%, respectively, over random selection. The study also finds that the KL divergence of the selected data correlates strongly with end-task performance, so it could serve as an early indicator of data selection quality for continual pretraining.
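To make the idea of importance resampling over n-gram distributions more concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes word unigrams and bigrams hashed into a fixed number of buckets, add-one smoothing, Gumbel top-k sampling for selection without replacement, and a KL divergence computed in the same hashed feature space. The bucket count, the function names (`hashed_ngrams`, `fit_distribution`, `log_importance_weight`, `select`, `kl_divergence`), and the toy texts are all illustrative assumptions.

```python
import math
import random
import zlib

NUM_BUCKETS = 10_000  # assumed size of the hashed n-gram feature space


def hashed_ngrams(text, n_max=2):
    """Return hashed bucket ids for the word unigrams and bigrams of a text."""
    words = text.lower().split()
    ids = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            # crc32 is deterministic across runs, unlike the built-in hash()
            ids.append(zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS)
    return ids


def fit_distribution(texts, smoothing=1.0):
    """Estimate a smoothed bag-of-n-grams distribution over the buckets."""
    counts = [smoothing] * NUM_BUCKETS
    for text in texts:
        for bucket in hashed_ngrams(text):
            counts[bucket] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def log_importance_weight(text, p_target, p_raw):
    """Log of p_target(x) / p_raw(x) under the bag-of-n-grams model."""
    return sum(
        math.log(p_target[b]) - math.log(p_raw[b]) for b in hashed_ngrams(text)
    )


def select(raw_texts, target_texts, k, seed=0):
    """Sample k raw texts without replacement, favoring high importance weights."""
    rng = random.Random(seed)
    p_target = fit_distribution(target_texts)
    p_raw = fit_distribution(raw_texts)
    keyed = []
    for text in raw_texts:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        # Gumbel top-k: perturb each log weight with Gumbel noise, keep the k largest
        keyed.append((log_importance_weight(text, p_target, p_raw) + gumbel, text))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in keyed[:k]]


def kl_divergence(p, q):
    """KL(p || q) for two smoothed (strictly positive) distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Toy data, invented for illustration only
    target = [
        "patient diagnosed with acute myocardial infarction",
        "patient admitted with chronic kidney disease",
    ]
    raw = [
        "patient presented with acute renal failure",
        "the recipe needs two cups of flour",
        "football match ended in a draw",
        "clinical trial of kidney disease treatment",
    ]
    chosen = select(raw, target, k=2)
    print(chosen)
    print(kl_divergence(fit_distribution(target), fit_distribution(chosen)))
```

In this sketch, a lower KL divergence between the target distribution and the distribution of the selected texts would suggest that the selection sits closer to the target domain in n-gram space, which is the kind of signal the abstract describes as a possible early indicator of selection quality.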
