Tuesday, February 21, 2017

Making sense of clinical trials through visualizations

Since my return to development, I spent all of 2016 immersed in the world of Machine Learning and Natural Language Processing for the Watson for Genomics offering (formerly Watson Genomics Analytics).

In 2017, I am shifting towards the analytics and visualization space, which is a nice segue from working on the Machine Learning / NLP that generates the data. One of the key features in Watson for Genomics is the ability to recommend clinical trials to patients based on their genetic variations and clinical conditions. This is an area where visualization can be of enormous assistance to internal test and development teams verifying the overall quality of the data in the system.

Eventually, better analytics and visualizations can also help researchers in both industry and government agencies better understand the ever-shifting landscape of clinical trials.

Disclaimer: none of the content of this blog or github project represents actual data from a production system or from individual patient data. The visualizations below do not represent current or upcoming features for the actual product and solely represent my personal experiments with the various technologies and techniques that could be used in the space.

Leaving aside the clinician/patient use cases for a moment, which are largely centered around matching patients to trials, it is the general browsing and navigation aspects that caught my attention at first.

I started a small github side project (https://github.com/nastacio/clinical-viz) over the weekend to explore these visualizations, with the intention of expanding them into more powerful browsing capabilities in the future.

To explore the visualization potential, I started with a small data sample, querying for any clinical trials that covered retinal cancers (under 200 results) and modeling the resulting data as a graph containing nodes for "clinical trials", "sponsors", "conditions", and "locations". I then exported the graph to GraphML format and imported it into the (excellent) Gephi visualization tool.
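The graph model described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch and not the code from the clinical-viz project: the sample records, the field names (`sponsors`, `conditions`, `locations`), and the helper names `build_graph` and `to_graphml` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Hypothetical sample records; these are NOT real clinical trial data.
TRIALS = [
    {"id": "TRIAL-001", "sponsors": ["Sponsor A"],
     "conditions": ["Retinoblastoma"], "locations": ["Boston"]},
    {"id": "TRIAL-002", "sponsors": ["Sponsor B", "Sponsor A"],
     "conditions": ["Retinoblastoma", "Retinal Neoplasms"],
     "locations": ["Austin"]},
]


def build_graph(trials):
    """Return (nodes, edges); nodes maps (kind, name) to an id such as 'n0'."""
    nodes = {}
    edges = []

    def node_id(kind, name):
        # Reuse the id if the node was already seen, otherwise assign the next one.
        return nodes.setdefault((kind, name), f"n{len(nodes)}")

    for trial in trials:
        trial_node = node_id("trial", trial["id"])
        for kind, field in (("sponsor", "sponsors"),
                            ("condition", "conditions"),
                            ("location", "locations")):
            for name in trial[field]:
                edges.append((trial_node, node_id(kind, name)))
    return nodes, edges


def to_graphml(nodes, edges):
    """Serialize the graph as a GraphML string with 'type' and 'label' node attributes."""
    ET.register_namespace("", GRAPHML_NS)
    tag = lambda name: f"{{{GRAPHML_NS}}}{name}"
    root = ET.Element(tag("graphml"))
    for key_id, attr_name in (("d0", "type"), ("d1", "label")):
        ET.SubElement(root, tag("key"), {"id": key_id, "for": "node",
                                         "attr.name": attr_name,
                                         "attr.type": "string"})
    graph = ET.SubElement(root, tag("graph"),
                          {"id": "G", "edgedefault": "undirected"})
    for (kind, name), identifier in nodes.items():
        node = ET.SubElement(graph, tag("node"), {"id": identifier})
        ET.SubElement(node, tag("data"), {"key": "d0"}).text = kind
        ET.SubElement(node, tag("data"), {"key": "d1"}).text = name
    for index, (source, target) in enumerate(edges):
        ET.SubElement(graph, tag("edge"), {"id": f"e{index}",
                                           "source": source, "target": target})
    return ET.tostring(root, encoding="unicode")


nodes, edges = build_graph(TRIALS)
graphml_text = to_graphml(nodes, edges)
```

Writing `graphml_text` to a file with a `.graphml` extension should give something that can be opened in a graph tool; the `type` attribute is what later filtering by node kind would rely on.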

In its full form, and after a few minutes of appearance customizations, the complete graph took on an interesting shape (for those familiar with Gephi, there are few things more entertaining than watching the Fruchterman-Reingold algorithm do its magic).

This is obviously not terribly useful in its unfiltered form, but there are already some interesting insights for development and verification teams looking into parameters for machine training, fringe test cases, scale, and UI decisions. Zooming into a couple of the clusters shows correlations between duration, type of study, number of conditions, and number of sponsors involved in a given trial.

Once you apply some light querying within Gephi, the potential for these types of visualizations becomes even clearer. As one example, I wanted to get an idea of the density of conditions covered in each clinical trial and of the commonality between conditions covered across different trials. Both are useful to get a sense of how overwhelming the complete results may be for a given set of patients with a certain condition. To that end, I created a filter removing all location and sponsor nodes, then added the labels for gender, phase, id, status, enrollment target for number of patients, and starting year.
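Outside Gephi, the same question (how many conditions each trial covers, and which conditions are shared across trials) can be approximated with a small script. The sketch below is an assumption-laden illustration: the edge tuples, the sample values, and the function name `condition_view` are hypothetical and do not come from the project.

```python
from collections import defaultdict

# Hypothetical (trial id, node kind, node name) edges; NOT real trial data.
EDGES = [
    ("TRIAL-001", "condition", "Retinoblastoma"),
    ("TRIAL-001", "sponsor", "Sponsor A"),
    ("TRIAL-001", "location", "Boston"),
    ("TRIAL-002", "condition", "Retinoblastoma"),
    ("TRIAL-002", "condition", "Retinal Neoplasms"),
    ("TRIAL-002", "sponsor", "Sponsor B"),
]


def condition_view(edges, hidden_kinds=("location", "sponsor")):
    """Drop hidden node kinds, then summarize trial/condition relationships."""
    visible = [edge for edge in edges if edge[1] not in hidden_kinds]
    conditions_per_trial = defaultdict(set)
    trials_per_condition = defaultdict(set)
    for trial, kind, name in visible:
        if kind == "condition":
            conditions_per_trial[trial].add(name)
            trials_per_condition[name].add(trial)
    # Conditions covered by more than one trial form the visible clusters.
    shared = {name: trials for name, trials in trials_per_condition.items()
              if len(trials) > 1}
    return dict(conditions_per_trial), shared


conditions_per_trial, shared_conditions = condition_view(EDGES)
```

Here `conditions_per_trial` gives the per-trial density and `shared_conditions` lists the conditions that link two or more trials together.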

This visualization quickly surfaced clusters of conditions that are more researched than others, such as in the snapshot below:

Leveraging the "Degree Range" filter set to list only interconnected conditions, trials and sponsors, it was easier to arrive at a much more compact representation of the initial graph, now further filtered to contain only interventional clinical trials that were recruiting new patients:
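For readers without Gephi at hand, the idea behind a degree-based filter can be sketched in plain Python. This is only an approximation of the concept, not Gephi's implementation: it computes each node's degree once and keeps the nodes whose degree falls inside the requested range. The function name `degree_range_filter` and the sample edges are hypothetical.

```python
from collections import Counter

# Hypothetical undirected edges between trials (T), conditions (C), and sponsors (S).
SAMPLE_EDGES = [("T1", "C1"), ("T2", "C1"), ("T2", "C2"), ("T1", "S1")]


def degree_range_filter(edges, min_degree=2, max_degree=None):
    """Keep nodes whose degree is within [min_degree, max_degree] and the edges between them."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    kept_nodes = {
        node for node, value in degree.items()
        if value >= min_degree and (max_degree is None or value <= max_degree)
    }
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


kept_nodes, kept_edges = degree_range_filter(SAMPLE_EDGES, min_degree=2)
```

With a minimum degree of 2, nodes that hang off a single trial drop out, leaving only the interconnected core.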

This visualization shows the typically low number of conditions covered in each clinical trial. It also surfaces the insight that no two clinical trials in this (small) set share the same sponsoring organization.

Another visualization, which is more interesting to health industry wonks than it is to doctors and patients, shows the degree and nature of collaboration amongst sponsors for a given set of trials. Each node is sized in proportion to the sponsor's degree of collaboration and colored according to the type of organization.

Then it is possible to zoom into areas of interest and start to notice the relationship between hospitals, government agencies, pharmaceutical companies and others. On a larger data set, the thickness of an edge between two organizations would be proportional to the number of clinical trials cosponsored by these organizations.
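The edge weighting described above, where thickness follows the number of trials two organizations cosponsor, boils down to counting sponsor pairs per trial. A minimal sketch, assuming a hypothetical mapping from trial id to sponsor names (the organization names and the function name `cosponsor_weights` are made up for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical trial-to-sponsors mapping; NOT real trial data.
SPONSORS_BY_TRIAL = {
    "T1": ["Hospital A", "Pharma B"],
    "T2": ["Pharma B", "Hospital A", "Agency C"],
    "T3": ["Agency C"],
}


def cosponsor_weights(sponsors_by_trial):
    """Count, for every pair of sponsors, how many trials they cosponsor."""
    weights = Counter()
    for sponsors in sponsors_by_trial.values():
        # Sort so each pair is always counted under the same (a, b) key.
        for pair in combinations(sorted(set(sponsors)), 2):
            weights[pair] += 1
    return weights


edge_weights = cosponsor_weights(SPONSORS_BY_TRIAL)
```

Each key in `edge_weights` is a sponsor pair and each value could be used as the weight of the edge between the two organizations.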

The next areas of focus for these visualizations involve dealing with much larger sets, at which point the examples above become really useful. I also need to work on better normalization of condition names, which often span multiple UMLS semantic types (neoplastic process, diseases or syndromes, findings, symptoms or signs, and many others). Finally, I want some form of dashboarding capability that would allow people to interact more directly with clinical trial data without having to generate a GraphML file and import it into an external visualization tool.


  1. Very cool, Denilson! When you get to the interactivity/exploration part, take a look at d3.js. You can use it to create a web-based exploration tool for this data, something along the lines of http://bl.ocks.org/paulovn/9686202 for a network visualization like the ones above, or even alternative visualizations like the chord diagram used in http://thronesviz.github.io/.

  2. Hi Nascif, thanks for the comments. I am familiar with d3.js and its ability to render graphs (and other types of charts), but I am trying to avoid the hand-to-hand combat of Javascripting for now (and hopefully in the future).

    I am mostly after patterns that support usability and navigation decisions in a more traditional UI (e.g. would it be best to use radio buttons or a drop-down to select certain filtering options). For that purpose, Gephi is incredibly powerful in terms of rapidly prototyping and exploring visualizations for interconnected data. That said, once UI decisions are made about certain visualizations meeting an end-user scenario, the pressure for d3.js scripting will grow.

    I am still holding an inner philosophical debate about the pros and cons of building visualizations from the ground up versus leveraging an off-the-shelf dashboarding technology such as freeboard.io, dashbuilder.org, etc. I can see the allure of having absolute control over every screen pixel, but at the same time, charting can be a black hole in terms of resources and attention span.

    I know many people who can turn out beautiful visualizations when in full control of the pixels (I am not one of them), but they would rather stay away from everything else that goes around the charts (e.g. drag-and-drop placement of viewlets within a portal, layout editors, etc.).

    On chord diagrams, that one was quite entertaining and works really well in a dynamic setting where you can hover over an item to filter out everything else. I am sure I can find some "buyers" for that type of visualization in this space, nice tip!

    BTW: the freeboard.io + dweet.io combo is awesome, worth a look. Sadly, I don't have an excuse to dabble in IoT right now.