In collaboration with the German Guideline Program in Oncology and the Jena University Language & Information Engineering (JULIE) Lab, we create and maintain the German Guideline Program in Oncology NLP Corpus (GGPONC), currently one of the largest (> 1.87M tokens) and one of the very few publicly available corpora of German medical text. The corpus is based on German S3 oncology guidelines, covering a wide range of indications. It features a variety of metadata in addition to the medical text content. In contrast to patient-level medical text (like clinical notes), this data is not privacy-sensitive and can therefore be used by researchers without the usual data protection restrictions.
We work with medical experts to continuously add layers of information (entity spans, IDs, relationships) through manual annotation. In addition, we investigate how such publicly available datasets can be used to train ML-based models for information extraction from German medical text, not only from clinical guidelines but also from clinical notes, through transfer learning and domain adaptation.
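To make the idea of annotation layers concrete, the following is a minimal sketch of how entity spans, IDs, and relationships could be stored as standoff annotations over a text. It is an illustration under stated assumptions, not the GGPONC annotation schema: the class names, the labels ("Substance", "Diagnosis", "treats"), the ID value, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EntitySpan:
    """A mention in the text, addressed by character offsets (end exclusive)."""
    start: int
    end: int
    label: str                        # hypothetical entity class
    concept_id: Optional[str] = None  # placeholder for a terminology ID


@dataclass(frozen=True)
class Relation:
    """A typed, directed link between two spans, given as indices into `spans`."""
    head: int
    tail: int
    label: str


@dataclass
class AnnotatedText:
    text: str
    spans: List[EntitySpan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_span(self, mention: str, label: str,
                 concept_id: Optional[str] = None) -> int:
        """Annotate the first occurrence of `mention`; return the span index."""
        start = self.text.find(mention)
        if start < 0:
            raise ValueError(f"mention not found: {mention!r}")
        self.spans.append(
            EntitySpan(start, start + len(mention), label, concept_id)
        )
        return len(self.spans) - 1

    def surface(self, span: EntitySpan) -> str:
        """Return the text covered by a span."""
        return self.text[span.start:span.end]


# Hypothetical example sentence and labels, for illustration only.
doc = AnnotatedText("Cisplatin wird beim Bronchialkarzinom eingesetzt.")
drug = doc.add_span("Cisplatin", "Substance", concept_id="ID-PLACEHOLDER")
disease = doc.add_span("Bronchialkarzinom", "Diagnosis")
doc.relations.append(Relation(head=drug, tail=disease, label="treats"))
```

Keeping each layer as offsets or indices into a shared text, rather than editing the text itself, means further layers can be added later without touching the existing ones.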
Because the creation of the corpus is fully automated with Apache Airflow pipelines, we can provide timely updates to the community and study how clinical guidelines change over time. Moreover, we can link information from clinical guidelines to other sources of information to enable novel applications in the space of evidence-based medicine.
Keno K. Bressem, Jens-Michalis Papaioannou, Paul Grundmann, Florian Borchert, Lisa C. Adams, Leonhard Liu, Felix Busch, Lina Xu, Jan P. Loyen, Stefan M. Niehues, Moritz Augustin, Lennart Grosser, Marcus R. Makowski, Hugo JWL. Aerts, Alexander Löser. medBERT.de: A Comprehensive German BERT Model for the Medical Domain. Expert Systems with Applications (2023): 121598 [Hugging Face Model] IF = 8.5
Ignacio Llorca, Florian Borchert, Matthieu-P. Schapranow. A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation. In: Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 171–181, Toronto, Canada. Association for Computational Linguistics.
Niklas Kämmer*, Florian Borchert*, Silvia Winkler, Gerard de Melo, and Matthieu-P. Schapranow. Resolving Elliptical Compounds in German Medical Text. In: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 292–305, Toronto, Canada. Association for Computational Linguistics.
Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab, Christina Kiriakou, Mingyang He, Michael M. Allers, Anna S. Tiefenbacher, Nicola Kunz, Anna Martynova, Noemie Spiller, Julian Mierisch, Florian Borchert, Charlotte Schwind, Norbert Frey, Christoph Dieterich & Nicolas A. Geis. A distributable German clinical corpus containing cardiovascular clinical routine doctor’s letters. Scientific Data 10, 207 (2023) [Data Access]
Sandro Steinwand*, Florian Borchert*, Silvia Winkler and Matthieu-P. Schapranow. GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham [Code]
Florian Borchert, Christina Lohr, Luise Modersohn, Jonas Witt, Thomas Langer, Markus Follmann, Matthias Gietzelt, Bert Arnrich, Udo Hahn and Matthieu-P. Schapranow. GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. LREC 2022 — Proceedings of the Language Resources and Evaluation Conference, pp. 3650‑3660. Marseille, France, European Language Resources Association, 2022 [Data Access] [Code]
Florian Borchert*, Christina Lohr*, Luise Modersohn*, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, pp. 38–48. Online: Association for Computational Linguistics, 2020. (* equal contribution) [Data Access] [Code]