Why We Need Simulated Biomedical Data

Samantha Kleinberg
Assistant Professor
Department of Computer Science
Stevens Institute of Technology

The ongoing adoption of electronic medical records (EMRs) by hospitals and physicians alike has led to a huge amount of newly accessible health data. Some of the potential benefits of these observational data, as compared to traditional controlled studies, are that findings come from the population clinicians aim to treat, and that they cover a much larger group of people over a much longer time than would otherwise be possible. These data have already been used to discover dangerous interactions between drugs and to provide decision support to clinicians, and more generally may enable research on long-term causes of health and disease.

Yet, widespread use of health data as the basis for fundamental research is hampered by limits on data sharing due to privacy concerns, lack of ground truth against which inferences can be compared, and data quality.

Human subjects, whether they’re patients in a hospital or individuals collecting fitness data during daily life, must be protected from risks stemming from participation in research. In the case of medical record data, where studies often retrospectively analyze the data of many patients, the primary risk is loss of privacy. While obviously identifying information such as names and zip codes can be removed, this does not make the data anonymous. EMR data are contained in both structured records (databases with results of laboratory tests and medication orders) and free-text notes (including narrative medical histories), and identifying details can appear in either.

Completely removing identifying information from text is a challenging computational problem, but even if a single dataset is truly no longer identifiable on its own, it may be identifiable when combined with other data sources. For instance, Latanya Sweeney (2002) showed that when Massachusetts released allegedly anonymous medical records, these could be re-identified when combined with voter registration records. Similarly, in 2013 she combined data from the state of Washington with news items and, again, was able to re-identify a substantial fraction of individuals.
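The linkage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reconstruction of the method used in either study: the records are invented toy data, and the choice of birth date, sex, and ZIP code as linking fields, along with the names `link_records` and `QUASI_IDENTIFIERS`, is hypothetical.

```python
from collections import defaultdict

# Invented toy records; field names and values are illustrative assumptions.
medical = [
    {"birth_date": "1960-03-14", "sex": "F", "zip": "02138", "diagnosis": "hypertension"},
    {"birth_date": "1975-11-02", "sex": "M", "zip": "02139", "diagnosis": "asthma"},
    {"birth_date": "1982-07-21", "sex": "F", "zip": "02139", "diagnosis": "migraine"},
]
voters = [
    {"name": "A. Example", "birth_date": "1960-03-14", "sex": "F", "zip": "02138"},
    {"name": "B. Example", "birth_date": "1975-11-02", "sex": "M", "zip": "02139"},
    {"name": "C. Example", "birth_date": "1975-11-02", "sex": "M", "zip": "02139"},
]

# Fields that are not names but may still single out a person when combined.
QUASI_IDENTIFIERS = ("birth_date", "sex", "zip")


def link_records(deidentified, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records that match exactly one person."""
    # Index the public list by its combination of quasi-identifier values.
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match re-identifies the record with confidence.
        if len(candidates) == 1:
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link_records(medical, voters)
print(reidentified)  # [('A. Example', 'hypertension')]
```

In this toy example only the first record is re-identified: the second matches two voters and the third matches none. The point is that no name needs to be present in the medical data for a match to succeed.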

Since true anonymity cannot be guaranteed, there are strong restrictions on how these data can be shared – and with whom. What this means for computational researchers is that non-collaborating groups generally cannot test their methods on a single dataset and compare results, and each group only has access to data from the limited set of hospitals with which it is affiliated. This is a major challenge since the data are noisy (there can be errors when copying and pasting text in notes, and findings may be mis-recorded) and incomplete (patients may have gaps in medical coverage, may switch providers, or may move). Further, due to structural differences in records, it can be difficult to determine whether different findings in two studies are due to population differences, artifacts of the medical record, or failures of the methodology (Kleinberg and Elhadad, 2013).

To address this, my group is now working on an NIH-funded project, in collaboration with researchers at Stevens who specialize in simulation and neurologists and clinicians at Columbia University Medical Center, to simulate neurological intensive care unit data. Simulated data are not subject to privacy concerns and have the advantages of being shareable and fully controlled. Thus if an algorithm infers a relationship, such as between seizures and secondary brain injury, we can check whether that relationship is truly causal in the simulation. Right now, computational methods are normally validated on simulated data from other domains, but these may differ fundamentally in their structure from biomedical data and may not be representative of the unique challenges faced here. The result of our work will be a publicly available system that allows one to create data for a virtual patient, where the connections between variables are known, and challenges such as noise and missing data can be artificially added and varied to test how they affect inference accuracy. Most importantly, the data and the code for creating them can be used by biomedical researchers to compare their methods on shared datasets and by computational researchers who may not have access to realistic medical data.
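To make the idea of a virtual patient with known connections concrete, here is a minimal Python sketch. It is not the project's simulator: the two generic variables, the single lagged relationship, and every name and parameter (`simulate_patient`, `mean_difference`, `lag`, `effect`, `noise_sd`, `missing_rate`) are assumptions chosen for illustration.

```python
import random


def simulate_patient(n_steps=500, lag=2, effect=1.5, noise_sd=0.5,
                     missing_rate=0.2, seed=0):
    """Generate one virtual patient in which `cause` drives `outcome` after `lag` steps."""
    rng = random.Random(seed)
    # Hypothetical binary event that occurs at roughly 10% of time steps.
    cause = [1 if rng.random() < 0.1 else 0 for _ in range(n_steps)]
    outcome = []
    for t in range(n_steps):
        driven = effect * cause[t - lag] if t >= lag else 0.0
        value = driven + rng.gauss(0.0, noise_sd)  # measurement noise
        # Drop some measurements at random to mimic missing data.
        outcome.append(None if rng.random() < missing_rate else value)
    # The true structure is known because we built it ourselves.
    ground_truth = {"cause": "cause", "effect": "outcome",
                    "lag": lag, "strength": effect}
    return cause, outcome, ground_truth


def mean_difference(cause, outcome, lag):
    """Crude estimate: mean outcome `lag` steps after an event minus mean otherwise."""
    after_event, otherwise = [], []
    for t in range(lag, len(outcome)):
        if outcome[t] is None:
            continue  # skip missing measurements
        (after_event if cause[t - lag] else otherwise).append(outcome[t])
    if not after_event or not otherwise:
        return None
    return sum(after_event) / len(after_event) - sum(otherwise) / len(otherwise)


cause, outcome, truth = simulate_patient()
estimate = mean_difference(cause, outcome, truth["lag"])
print("true strength:", truth["strength"], "estimated:", estimate)
```

Because the lag and strength are set by the simulation, an inference method's output can be compared directly against them, and `noise_sd` and `missing_rate` can be varied to see how quickly the estimate degrades.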

Kleinberg, S., & Elhadad, N. (2013). Lessons Learned in Replicating Data-Driven Experiments in Multiple Medical Systems and Patient Populations. In AMIA Annual Symposium.

Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570.

Sweeney, L. (2013). Matching Known Patients to Health Records in Washington State Data. Harvard University. Data Privacy Lab. White Paper 1089-1.