Neuropeptide Classification at Scale with Protein Language Models: Hard Negatives and Cluster-Aware Splits
Conference proceedings article
Publication Details
Author list: Wachirawit Intaphan, Kimhan Jongjaidee, Sathapana Tinop, Pawinnarut Pornpanarat, Supatcha Lertampaiporn, Warin Wattanapornprom
Publication year: 2025
Start page: 484
End page: 490
Number of pages: 7
URL: https://ieeexplore.ieee.org/abstract/document/11298054
Languages: English-United States (EN-US)
Abstract
Neuropeptides are short, secreted signaling molecules that regulate nervous, metabolic, and immune functions. Accurate neuropeptide vs. non-neuropeptide discrimination matters for discovery pipelines, off-target filtering, and triage for wet-lab validation. Much of the reported progress, especially with protein language models (PLMs), relies on NeuroPep-anchored positives and length-matched UniProt negatives, often deduplicating only the positives, which risks homolog leakage and overly easy negatives. We ask a different question: how do PLM baselines behave when leakage is minimized and negatives approximate deployment reality? To that end, we assemble a multi-source corpus (positives from NeuroPep 2.0, aSynPEP-DB, NeuroPedia, and Figshare; negatives from UniProtKB with secretion-aware controls), apply CD-HIT to both classes, and build cluster-aware cross-validation and test splits so that sequence families never straddle evaluation boundaries. On this foundation, we present a transparent baseline: ESM embeddings projected to 128 dimensions with PCA, paired with well-understood learners (RBF-SVC, logistic regression, tree ensembles). On the fixed, balanced 100k study set, the RBF-SVC attains CV accuracy = 0.712 and test accuracy = 0.709 with a stable precision–recall profile. A faithful re-training of iNP_ESM on the same data and splits yields ACC = 0.5657, Precision = 0.6089, Recall = 0.5313, and F1 = 0.5393, underscoring the impact of evaluation protocol on apparent gains. Beyond accuracy, we report macro-F1, balanced accuracy, PR-AUC, and probability calibration to reflect realistic prevalence and thresholding. We release accession lists and split scripts to enable exact regeneration without redistributing third-party content. The result is a baseline evaluated under hard negatives and cluster-aware splits, suitable for replication and comparison.
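The evaluation protocol described above (cluster-aware splits over deduplicated data, then PCA-128 features fed to an RBF-SVC) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the synthetic arrays stand in for per-sequence ESM embeddings, the cluster IDs stand in for CD-HIT cluster assignments, and scikit-learn's `GroupKFold` is one way to guarantee that no cluster straddles a train/test boundary.

```python
# Illustrative sketch (not the paper's code): cluster-aware CV with
# stand-in CD-HIT cluster IDs as groups, then PCA to 128 dims + RBF-SVC.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 600, 320                         # synthetic stand-in for ESM embeddings
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n)          # neuropeptide vs. non-neuropeptide
clusters = rng.integers(0, 40, size=n)  # hypothetical CD-HIT cluster per sequence

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=128),              # project embeddings to PCA-128
    SVC(kernel="rbf"),                  # RBF-SVC baseline learner
)

accs = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=clusters):
    # No cluster appears on both sides of the split, so homologous
    # sequences never straddle the evaluation boundary.
    assert set(clusters[tr]).isdisjoint(clusters[te])
    model.fit(X[tr], y[tr])
    accs.append(model.score(X[te], y[te]))

print(f"cluster-aware CV accuracy: {np.mean(accs):.3f}")
```

On random labels as here, accuracy hovers near chance; the point of the sketch is the split discipline, which removes the homolog leakage that deduplicating only positives can leave behind.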
Keywords
ESM embeddings, neuropeptide classification, Protein Language Model, protein sequence embeddings