Elana J. Fertig (Department of Oncology, School of Medicine)
Raman Arora (Department of Computer Science, Whiting School of Engineering)
Scientists now have unprecedented access to a wide variety of high-quality datasets collected from independent studies. Meta-analyses, however, depend on standardized annotations, and such standards are often not used. As a result, accurately combining records from diverse studies requires tedious and error-prone human curation, which poses a significant time and cost barrier.
We propose a novel natural language processing (NLP) algorithm, Synthesize, that merges data annotations automatically. It is part of an open-source web application, Synthesizer, which lets users interact with the merged data visually. We used the Synthesize algorithm to merge a variety of cancer datasets as well as ecological datasets. Compared with manually merged data, the algorithm achieved high accuracy (on the order of 85-100%).
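To make the task concrete, the sketch below shows one simple way annotation labels from two studies could be paired automatically and then scored against a manual merge. It is not the Synthesize algorithm, whose internals are not described here: it substitutes plain string similarity from Python's standard `difflib` module, and the function names (`normalize`, `merge_annotations`, `accuracy`), the 0.8 threshold, and the example labels are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase a label and unify separators, so 'Tumor_Stage' matches 'tumor stage'."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def merge_annotations(left, right, threshold=0.8):
    """Greedily pair each label in `left` with its most similar unused label in `right`.

    Illustrative stand-in only: uses character-level similarity, not the
    Synthesize algorithm. `threshold` is an assumed cutoff.
    """
    pairs = {}
    unused = list(right)
    for a in left:
        best, best_score = None, 0.0
        for b in unused:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        # Keep the pairing only if it is similar enough; otherwise leave `a` unmatched.
        if best is not None and best_score >= threshold:
            pairs[a] = best
            unused.remove(best)
    return pairs


def accuracy(predicted, manual):
    """Fraction of manually curated pairs that the automatic merge reproduced."""
    if not manual:
        return 0.0
    hits = sum(1 for key, value in manual.items() if predicted.get(key) == value)
    return hits / len(manual)


# Hypothetical annotation labels from two studies.
study_a = ["Tumor_Stage", "patient-age", "Sex"]
study_b = ["tumor stage", "Patient Age", "histology"]

merged = merge_annotations(study_a, study_b)
manual = {"Tumor_Stage": "tumor stage", "patient-age": "Patient Age"}
score = accuracy(merged, manual)
```

In this toy case the two labels that differ only in case and separators are paired, while `Sex` and `histology` stay unmatched because their similarity falls below the assumed threshold. Real annotations that differ in vocabulary rather than spelling would need a semantic method, which is the kind of problem an NLP approach addresses.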