Medical Knowledge Synthesis

Medical evidence is currently disseminated mostly through unstructured or semi-structured free-text documents, e.g., clinical guidelines, systematic reviews or primary research articles. Turning these text documents into structured, linked data through NLP would allow users to reason about the body of evidence as a whole and to answer questions such as:

  • Is there a disagreement between the latest research articles and clinical guideline recommendations?
  • How long does translation of clinical research into practice actually take?
  • Do multiple publications report the same data, and should they be considered duplicates in a systematic review?
  • Is a trial report consistent with the initial trial synopsis?
Integration of structured and unstructured clinical evidence items through NLP

Nowadays, the best-performing information extraction approaches are based on supervised ML algorithms and thus require annotated text corpora. Due to the tedious manual annotation process, such corpora are extremely scarce or of limited utility in the domain of clinical evidence reports (RCTs, case reports) and practically nonexistent for clinical guidelines. The scarcity of data for clinical guidelines is particularly pronounced, as they are usually published in the national language of their country of origin, whereas annotated medical text corpora have mostly been published in English. While vast amounts of unlabelled text exist in many languages and can be leveraged for unsupervised pre-training, training the end models still requires some supervision signal.

We therefore investigate whether structured information can be extracted from multilingual biomedical text with heterogeneous, cheaper forms of supervision, and whether this will allow future biomedical NLP projects to shift resources from the annotation process to more productive tasks. We want to determine which types of supervision can enhance the performance of specific biomedical information extraction (IE) solutions. In particular:

  • Inaccurate supervision through various kinds of noisy label sources
  • Distant supervision through curated knowledge bases and medical ontologies
  • High-level supervision gathered from domain experts, provided through adequate domain-specific primitives and symbolic representations
  • Incidental supervision, e.g., through structured metadata
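
To make these supervision types more concrete, the following is a minimal sketch of how several cheap label sources could be combined into training labels without manual annotation. It is an illustration under stated assumptions, not the method used in this project: the toy task (flagging drug mentions at token level), the tiny lexicon standing in for a curated knowledge base, the heuristic rules, and all function names are hypothetical, and simple majority voting stands in for more sophisticated label aggregation models.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a label source may decline to vote on a token

# Hypothetical mini-lexicon standing in for a curated knowledge base or ontology.
DRUG_LEXICON = {"cisplatin", "tamoxifen", "imatinib"}


def lf_lexicon(token: str) -> Optional[int]:
    """Distant supervision: vote 'drug' (1) if the token is in the lexicon."""
    return 1 if token.lower() in DRUG_LEXICON else ABSTAIN


def lf_suffix(token: str) -> Optional[int]:
    """Noisy heuristic: some suffixes hint at a drug name (assumed rule)."""
    return 1 if token.lower().endswith(("mab", "nib")) else ABSTAIN


def lf_short_token(token: str) -> Optional[int]:
    """Noisy heuristic: very short tokens are rarely drug names (assumed rule)."""
    return 0 if len(token) <= 3 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], Optional[int]]] = [
    lf_lexicon,
    lf_suffix,
    lf_short_token,
]


def majority_vote(token: str) -> Optional[int]:
    """Combine all non-abstaining votes; abstain if nobody votes or on a tie."""
    votes = [v for v in (lf(token) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_labels(tokens: List[str]) -> List[Optional[int]]:
    """Produce one weak label (1 = drug, 0 = not a drug, None = unknown) per token."""
    return [majority_vote(t) for t in tokens]


if __name__ == "__main__":
    sentence = "Patients received imatinib or trastuzumab for 12 weeks".split()
    for token, label in zip(sentence, weak_labels(sentence)):
        print(f"{token}\t{label}")
```

Tokens on which every source abstains stay unlabelled and would simply be excluded from training. In such a setup, the resulting weak labels serve as the supervision signal for an end model, which can generalize beyond the rules that produced them.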




Florian Borchert*, Andreas Mock*, Aurelie Tomczak*, Jonas Hügel, Samer Alkarkoukly, Alexander Knurr, Anna-Lena Volckmar, Albrecht Stenzinger, Peter Schirmacher, Jürgen Debus, Dirk Jäger, Thomas Longerich, Stefan Fröhling, Roland Eils, Nina Bougatf, Ulrich Sax, Matthieu-P Schapranow. Knowledge Bases and Software Support for Variant Interpretation in Precision Oncology, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab134 (* equal contribution) IF = 11.6