In collaboration with the German Guideline Program in Oncology and the Jena University Language & Information Engineering (JULIE) Lab, we create and maintain the German Guideline Program in Oncology NLP Corpus (GGPONC), currently one of the largest (> 1.87M tokens) and one of very few publicly available corpora of German medical text. The corpus is based on German S3 oncology guidelines, covering a wide range of indications. It features a variety of metadata in addition to the medical text content. In contrast to patient-level medical text (like clinical notes), this data is not privacy-sensitive and can therefore be used by researchers without the usual data protection restrictions.

We work with medical experts to continuously add different layers of information (entity spans, IDs, relationships) by manual annotation. In addition, we investigate the utility of such publicly available datasets to fuel ML-based models for information extraction from German medical text, covering not only clinical guidelines but also clinical notes through transfer learning and domain adaptation.
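To make the idea of annotation layers more concrete, the following is a minimal sketch of how entity spans, concept IDs, and relationships could be represented as stand-off annotations over a text. It is an illustration only and not the actual GGPONC data format: the class names, the labels (`Substance`, `Diagnosis`, `used_for`), the placeholder concept IDs, and the example sentence are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EntitySpan:
    """A character-offset span with a label and an optional concept ID."""
    start: int
    end: int
    label: str                        # hypothetical entity class
    concept_id: Optional[str] = None  # placeholder, not a real terminology code

    def text(self, document: str) -> str:
        return document[self.start:self.end]


@dataclass(frozen=True)
class Relation:
    """A directed, labeled link between two entity spans."""
    head: EntitySpan
    tail: EntitySpan
    label: str                        # hypothetical relation type


@dataclass
class AnnotatedDocument:
    text: str
    entities: List[EntitySpan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_entity(self, surface: str, label: str,
                   concept_id: Optional[str] = None) -> EntitySpan:
        # Locate the first occurrence of the surface form; raises ValueError
        # if the surface form does not occur in the text.
        start = self.text.index(surface)
        span = EntitySpan(start, start + len(surface), label, concept_id)
        self.entities.append(span)
        return span


# Invented example sentence, not taken from the corpus.
doc = AnnotatedDocument("Cisplatin wird beim Bronchialkarzinom eingesetzt.")
drug = doc.add_entity("Cisplatin", "Substance", "PLACEHOLDER-1")
disease = doc.add_entity("Bronchialkarzinom", "Diagnosis", "PLACEHOLDER-2")
doc.relations.append(Relation(drug, disease, "used_for"))
```

Keeping annotations as offsets into an unmodified text means that additional layers can be added later without touching the underlying guideline text.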

As the creation of the corpus has been completely automated with Apache Airflow pipelines, we will be able to provide timely updates to the community and also explore changes of clinical guidelines along temporal dimensions. Moreover, we can link information from clinical guidelines to other sources of information to enable novel applications in the space of evidence-based medicine.




Niklas Kämmer*, Florian Borchert*, Silvia Winkler, Gerard de Melo, and Matthieu-P. Schapranow. Resolving Elliptical Compounds in German Medical Text. In: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 292–305, Toronto, Canada. Association for Computational Linguistics.