From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives

Kazhan Misri, Leo Alexandre, Beatriz De La Iglesia

2025

Abstract

Clinical free text reports from endoscopy and histology are a valuable yet underexploited source of information for supporting upper gastrointestinal (GI) cancer diagnosis. Our initial learning task was to classify procedures as cancer-positive or cancer-negative based on downstream registry-confirmed diagnoses. For this, we developed a patient-level dataset of 63,040 endoscopy reports linked with histology data and cancer registry outcomes, allowing supervised learning on real-world clinical data. We fine-tuned two transformer-based models: general-purpose BERT and domain-specific BioClinicalBERT and evaluated methods to address severe class imbalance, including random minority upsampling and class weighting. BioClinicalBERT combined with up-sampling achieved the best recall (sensitivity) of 85% and reduced false negatives compared to BERT’s recall of 78%. Calibration analysis indicated that predicted probabilities were broadly reliable. We also applied SHapley Additive exPlanations (SHAP) to interpret model decisions by highlighting influential clinical terms, fostering transparency and trust. Our findings demonstrate the potential of scalable, interpretable natural language processing models to extract clinically meaningful insights from unstructured narratives, providing a foundation for future retrospective review of cancer diagnosis and clinical decision support tools.

Download


Paper Citation


in Harvard Style

Misri K., Alexandre L. and De La Iglesia B. (2025). From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives. In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN , SciTePress, pages 501-508. DOI: 10.5220/0013836200004000


in Bibtex Style

@conference{kdir25,
author={Kazhan Misri and Leo Alexandre and Beatriz De La Iglesia},
title={From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives},
booktitle={Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2025},
pages={501-508},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013836200004000},
isbn={},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives
SN -
AU - Misri K.
AU - Alexandre L.
AU - De La Iglesia B.
PY - 2025
SP - 501
EP - 508
DO - 10.5220/0013836200004000
PB - SciTePress