Tuesday, February 21, 2017

Making sense of clinical trials through visualizations

After returning to development, I spent all of 2016 immersed in the world of Machine Learning and Natural Language Processing for the Watson for Genomics offering (formerly Watson Genomics Analytics).

In 2017, I am shifting towards the analytics and visualization space, which is a nice segue from working on the Machine Learning / NLP that generates the data. One of the key features in Watson for Genomics is the ability to recommend clinical trials to patients based on their genetic variations and clinical conditions. This is an area where visualization can be of enormous assistance to the internal test and development teams who verify the overall quality of the data in the system.

Eventually, better analytics and visualizations can also help researchers in both industry and government agencies better understand the ever-shifting landscape of clinical trials.

Disclaimer: none of the content of this blog or GitHub project represents actual data from a production system or individual patient data. The visualizations below do not represent current or upcoming features for the actual product; they solely represent my personal experiments with the various technologies and techniques that could be used in this space.

Leaving aside the clinician/patient use cases for a moment, which are largely centered around matching patients to trials, it is the general browsing and navigation aspects that caught my attention at first.

I started a small GitHub side project (https://github.com/nastacio/clinical-viz) over the weekend to explore these visualizations, with the intention of expanding them into more powerful browsing capabilities in the future.

To explore the visualization potential, I started with a small data sample, querying for any clinical trials that covered retinal cancers (under 200 results) and modeling the resulting data as a graph containing nodes for "clinical trials", "sponsors", "conditions", and "locations". I then exported the graph to GraphML format and imported it into the (excellent) Gephi visualization tool.
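The modeling and export step described above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the code from the side project; the record fields (`id`, `sponsors`, `conditions`, `locations`), the sample values, and the helper names are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical trial records; field names and values are made up for illustration.
trials = [
    {"id": "T1", "sponsors": ["Sponsor A"],
     "conditions": ["Retinoblastoma"], "locations": ["Boston"]},
    {"id": "T2", "sponsors": ["Sponsor B"],
     "conditions": ["Retinoblastoma", "Retinal Neoplasms"], "locations": ["Toronto"]},
]

def build_graph(trials):
    """Return (nodes, edges): nodes maps node id -> node type, edges are (trial, other) pairs."""
    nodes = {}
    edges = set()
    for trial in trials:
        trial_id = "trial:" + trial["id"]
        nodes[trial_id] = "trial"
        for field, node_type in (("sponsors", "sponsor"),
                                 ("conditions", "condition"),
                                 ("locations", "location")):
            for name in trial.get(field, []):
                node_id = node_type + ":" + name
                nodes[node_id] = node_type  # shared sponsors/conditions/locations become one node
                edges.add((trial_id, node_id))
    return nodes, sorted(edges)

def to_graphml(nodes, edges):
    """Serialize the graph as a GraphML string with a single 'type' node attribute."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "type", "for": "node",
                                "attr.name": "type", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node_id, node_type in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="type").text = node_type
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")

nodes, edges = build_graph(trials)
graphml_text = to_graphml(nodes, edges)
```

Writing `graphml_text` to a `.graphml` file should produce something Gephi can open, with the `type` attribute available for coloring or filtering nodes.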

In its full form, and after a few minutes of appearance customizations, the complete graph took on an interesting shape (for those familiar with Gephi, there are few things more entertaining than watching the Fruchterman-Reingold algorithm do its magic).

This is obviously not terribly useful in its unfiltered form, but there are already some interesting insights for development and verification teams looking into parameters for machine training, fringe test cases, scale, and UI decisions. Zooming into a couple of the clusters shows correlations between duration, type of study, number of conditions, and number of sponsors involved in a given trial.

Once you apply some light querying within Gephi, the potential for these types of visualizations becomes even clearer. As one example, I wanted to get an idea of the density of conditions covered in each clinical trial and the commonality between conditions covered across different trials. That is useful for getting a sense of how overwhelming the complete results may be for a given set of patients with a certain condition. To do that, I created a filter removing all location and sponsor nodes, then added the labels for gender, phase, id, status, enrollment target for number of patients, and starting year.
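Outside Gephi, the same kind of filter can be approximated in a few lines. The sketch below is an assumption-laden illustration: the miniature graph, the node names, and the `keep_types` helper are hypothetical, and each edge is assumed to be stored as a (trial, other node) pair.

```python
from collections import defaultdict

# Hypothetical miniature graph: node id -> node type, plus (trial, other) edges.
node_types = {
    "T1": "trial", "T2": "trial", "T3": "trial",
    "Retinoblastoma": "condition", "Retinal Neoplasms": "condition",
    "Sponsor A": "sponsor", "Boston": "location",
}
edges = [
    ("T1", "Retinoblastoma"), ("T2", "Retinoblastoma"),
    ("T2", "Retinal Neoplasms"), ("T3", "Retinal Neoplasms"),
    ("T1", "Sponsor A"), ("T1", "Boston"),
]

def keep_types(node_types, edges, kept):
    """Drop every node whose type is not in `kept`, along with its edges."""
    nodes = {n: t for n, t in node_types.items() if t in kept}
    return nodes, [(a, b) for a, b in edges if a in nodes and b in nodes]

# Equivalent of removing all location and sponsor nodes.
nodes, kept_edges = keep_types(node_types, edges, {"trial", "condition"})

conditions_per_trial = defaultdict(set)
trials_per_condition = defaultdict(set)
for trial, condition in kept_edges:
    conditions_per_trial[trial].add(condition)
    trials_per_condition[condition].add(trial)

# Conditions that appear in more than one trial.
shared = {c: sorted(ts) for c, ts in trials_per_condition.items() if len(ts) > 1}
```

Here `conditions_per_trial` gives the density of conditions per trial, and `shared` gives the commonality of conditions across trials.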

This visualization quickly surfaced clusters of conditions that are more researched than others, such as in the snapshot below:

Leveraging the "Degree Range" filter set to list only interconnected conditions, trials and sponsors, it was easier to arrive at a much more compact representation of the initial graph, now further filtered to contain only interventional clinical trials that were recruiting new patients:
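For readers without Gephi at hand, the idea behind a degree-based filter can be sketched as below. This is a simplified stand-in rather than Gephi's implementation: it assumes a single pass over an undirected edge list, and the function name and sample edges are hypothetical.

```python
from collections import Counter

def degree_range(edges, min_degree, max_degree=None):
    """Keep nodes whose degree lies in [min_degree, max_degree], plus edges among them."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    kept = {
        node for node, d in degree.items()
        if d >= min_degree and (max_degree is None or d <= max_degree)
    }
    return kept, [(a, b) for a, b in edges if a in kept and b in kept]

# Hypothetical trial/condition edges.
sample_edges = [("T1", "C1"), ("T2", "C1"), ("T2", "C2"), ("T3", "C3")]

# Keep only nodes with at least two connections.
kept_nodes, kept_edges = degree_range(sample_edges, min_degree=2)
```

Degrees are computed once on the unfiltered graph, so a kept node may end up with fewer connections after its low-degree neighbors are removed.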

This visualization shows the typically low number of conditions covered in each clinical trial and also surfaces the insight that none of the different clinical trials in this (small) set shared the same sponsoring organization.

Another visualization, which is more interesting to health industry wonks than it is to doctors and patients, is the degree and nature of collaboration amongst sponsors for a given set of trials, with the size of each node proportional to its degree of collaboration (the nodes are colored according to the type of organization).

Then it is possible to zoom into areas of interest and start to notice the relationship between hospitals, government agencies, pharmaceutical companies and others. On a larger data set, the thickness of an edge between two organizations would be proportional to the number of clinical trials cosponsored by these organizations.
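The edge weighting described above can be computed with a simple pair count. The following is a minimal sketch under stated assumptions: the trial-to-sponsor mapping, the organization names, and the `cosponsor_weights` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of trial id -> sponsoring organizations.
trial_sponsors = {
    "T1": ["Hospital A", "Pharma B"],
    "T2": ["Hospital A", "Pharma B", "Agency C"],
    "T3": ["Agency C"],
}

def cosponsor_weights(trial_sponsors):
    """Count, for each pair of organizations, how many trials they cosponsor."""
    weights = Counter()
    for sponsors in trial_sponsors.values():
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(set(sponsors)), 2):
            weights[pair] += 1
    return weights

weights = cosponsor_weights(trial_sponsors)

# Number of distinct collaborators per organization, usable as a node size.
collaborators = Counter()
for org_a, org_b in weights:
    collaborators[org_a] += 1
    collaborators[org_b] += 1
```

Each entry in `weights` would become one edge between two organizations, with the count driving its thickness.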

The next areas of focus for these visualizations involve dealing with much larger sets, at which point the examples above become really useful. I also need to work on better normalization of condition names, which often span multiple UMLS semantic types (neoplastic process, diseases or syndromes, findings, symptoms or signs, and many others), and on some form of dashboarding capability that would allow people to interact more directly with clinical trial data without having to generate a GraphML file and import it into an external visualization tool.