In collaboration with the German Guideline Program in Oncology and the Jena University Language & Information Engineering (JULIE) Lab, we create and maintain the German Guideline Program in Oncology NLP Corpus (GGPONC), currently one of the largest (> 1.3M tokens) and one of very few publicly available corpora of German medical text. The corpus is based on German S3 oncology guidelines and covers a wide range of indications. In addition to the medical text itself, it provides a variety of metadata. In contrast to patient-level medical text (such as clinical notes), this data is not privacy-sensitive and can therefore be used by researchers without the usual data protection restrictions.
We work with medical experts to continuously add different layers of information (entity spans, IDs, relationships) by manual annotation. In addition, we investigate the utility of such publicly available datasets to fuel ML-based models for information extraction from German medical text, not only from clinical guidelines but also from clinical notes, through transfer learning and domain adaptation.
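To make the three annotation layers concrete, the following is a minimal sketch of how entity spans, IDs, and relationships could be represented as stand-off annotations over a sentence. It is an illustration only and not the GGPONC release format: the class names, the labels ("Treatment", "Diagnosis", "recommended_for"), the example sentence, and the placeholder concept IDs are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EntitySpan:
    """A labeled character span, optionally linked to a concept ID."""
    start: int                        # inclusive character offset
    end: int                          # exclusive character offset
    label: str                        # hypothetical entity type
    concept_id: Optional[str] = None  # placeholder, not a real terminology code


@dataclass(frozen=True)
class Relation:
    """A directed, labeled link between two spans, given by list index."""
    head: int
    tail: int
    label: str


@dataclass
class AnnotatedSentence:
    text: str
    spans: List[EntitySpan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_span(self, phrase: str, label: str,
                 concept_id: Optional[str] = None) -> int:
        """Annotate the first occurrence of `phrase`; return its span index."""
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found: {phrase!r}")
        self.spans.append(
            EntitySpan(start, start + len(phrase), label, concept_id))
        return len(self.spans) - 1

    def surface(self, index: int) -> str:
        """Return the text covered by the span at `index`."""
        span = self.spans[index]
        return self.text[span.start:span.end]


# Invented example sentence, not taken from the corpus.
sentence = AnnotatedSentence("Die Chemotherapie wird bei Mammakarzinom empfohlen.")
treatment = sentence.add_span("Chemotherapie", "Treatment", "PLACEHOLDER-ID-1")
diagnosis = sentence.add_span("Mammakarzinom", "Diagnosis", "PLACEHOLDER-ID-2")
sentence.relations.append(Relation(treatment, diagnosis, "recommended_for"))
```

Keeping the annotations as offsets separate from the text means each layer can be added or revised independently without altering the underlying guideline text.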
As the creation of the corpus has been completely automated with Apache Airflow pipelines, we will be able to provide timely updates to the community and also explore how clinical guidelines change over time. Moreover, we can link information from clinical guidelines to other sources of information to enable novel applications in the space of evidence-based medicine.
Florian Borchert*, Christina Lohr*, Luise Modersohn*, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, pp. 38–48. Online: Association for Computational Linguistics, 2020. (* equal contribution)