From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives
Kazhan Misri, Leo Alexandre, Beatriz De La Iglesia
2025
Abstract
Clinical free-text reports from endoscopy and histology are a valuable yet underexploited source of information for supporting upper gastrointestinal (GI) cancer diagnosis. Our initial learning task was to classify procedures as cancer-positive or cancer-negative based on downstream registry-confirmed diagnoses. For this, we developed a patient-level dataset of 63,040 endoscopy reports linked with histology data and cancer registry outcomes, allowing supervised learning on real-world clinical data. We fine-tuned two transformer-based models, general-purpose BERT and domain-specific BioClinicalBERT, and evaluated methods to address severe class imbalance, including random minority upsampling and class weighting. BioClinicalBERT combined with upsampling achieved the best recall (sensitivity) of 85% and reduced false negatives compared to BERT’s recall of 78%. Calibration analysis indicated that predicted probabilities were broadly reliable. We also applied SHapley Additive exPlanations (SHAP) to interpret model decisions by highlighting influential clinical terms, fostering transparency and trust. Our findings demonstrate the potential of scalable, interpretable natural language processing models to extract clinically meaningful insights from unstructured narratives, providing a foundation for future retrospective review of cancer diagnosis and clinical decision support tools.
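The two imbalance-handling strategies named in the abstract, random minority upsampling and class weighting, along with the recall metric, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and toy labels are hypothetical, and the weighting formula is one common "balanced" heuristic (total samples divided by the number of classes times the class count), assumed here for illustration.

```python
import random
from collections import Counter


def upsample_minority(texts, labels, seed=0):
    """Randomly duplicate examples of smaller classes until every class
    matches the size of the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Sample with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    an assumed heuristic, so rarer classes contribute more to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def recall(y_true, y_pred, positive=1):
    """Sensitivity: true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Toy data only: four cancer-negative reports and one cancer-positive.
    texts = ["r1", "r2", "r3", "r4", "r5"]
    labels = [0, 0, 0, 0, 1]
    _, balanced = upsample_minority(texts, labels)
    print(Counter(balanced))
    print(balanced_class_weights(labels))
```

In a sketch like this, upsampling would normally be applied to the training split only, so that duplicated minority reports do not leak into validation or test data. The class weights would typically be passed to a weighted cross-entropy loss during fine-tuning rather than used to change the data.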
Paper Citation
in Harvard Style
Misri K., Alexandre L. and De La Iglesia B. (2025). From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives. In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN , SciTePress, pages 501-508. DOI: 10.5220/0013836200004000
in Bibtex Style
@conference{kdir25,
author={Kazhan Misri and Leo Alexandre and Beatriz De La Iglesia},
title={From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives},
booktitle={Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2025},
pages={501-508},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013836200004000},
isbn={},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives
SN -
AU - Misri K.
AU - Alexandre L.
AU - De La Iglesia B.
PY - 2025
SP - 501
EP - 508
DO - 10.5220/0013836200004000
PB - SciTePress