The Power of Data and Predictive Analytics in Pandemics

May 7, 2020

Digital Empowers’ “The Power of Data and Predictive Analytics in Pandemics” webinar was the first event of a three-part virtual series on COVID-19 response designed to bring the innovation and social impact communities together, and provide a platform for collective action and shared learning. View the full program here, and read on to learn more about what was covered.

To help us achieve this goal and address the breadth of complexities surrounding COVID-19 forecasting, I looked to three presenters: Institute for Health Metrics and Evaluation (IHME)’s Bill Heisel, Biofourmis’ Kuldeep Singh Rajput, and Excella’s data visualization expert Amanda Makulec, all esteemed in their fields and representing organizations at the top of their game.

Speakers brought data visualizations and forecasting into the context of current public health, policy decisions, and changes to clinical care and patient management. Presentations featured leading data visualization models that are enhancing our understanding of COVID-19, including Johns Hopkins University’s web-based COVID-19 dashboard and those of the Centers for Disease Control and Prevention, the Reich Lab, and, of course, the University of Washington’s IHME. In their own way, all three speakers addressed topics ranging from data stewardship and literacy to visualizations that are simple and actionable, yet maintain the rigor and methodology expected by the scientific community.

Amanda underscored the “unknown unknowns” in COVID-19 information and the difficulty of creating accurate models during a pandemic, leading to variations in the way data is interpreted. Bill spoke about the role of forecasts and how they can be an effective tool in helping people change their behavior to avoid unwanted outcomes. Kuldeep highlighted how a combination of software-based therapeutics is helping treat COVID-19 patients, while tracking important data to help inform the public health response and telemedicine.

This information-packed hour left many viewer questions on the table, so I caught up with Amanda and Bill offline to address the most frequently cited and thought-provoking questions submitted during their panel discussion.

Q1: Can you provide insight into the types, if any, of peer or independent review conducted on COVID-19 models?

Amanda: I haven't seen a lot of widespread independent reviews of the different models. Still, a lot has been written about them, and many comparisons have been published in The New York Times or by the Reich Lab. If you are looking for more information, I would look to those and similar institutions that have started combining and pulling the outputs from various models together, so that we can more easily compare and contrast them, see what the differences in their trajectories look like, and differentiate the limitations they may have, be it in methodology or other factors.

There has been some interesting work coming out of a team at Galois, a research organization based in Arlington, Virginia, that has created an interactive interface that enables users to test and look at different underlying assumptions, test out the math of different models, and even interact with models using an icon-based interface.

In the coming week, we [Excella] will be publishing an interview and article with the principal data scientists at Galois talking about that tool, which is, notably, also rooted in the peer review process.

Bill: IHME is part of an international network of collaborators, which has grown to around 5,000 people in more than 150 countries. That leads to a robust and ongoing scientific debate about all aspects of the disease modeling work. In addition, IHME is preparing and submitting papers related to COVID-19 for peer review, and it is formally incorporating COVID-19 estimates into the Global Burden of Disease study, which generates hundreds of peer-reviewed papers annually. As with all of IHME’s work, this new effort builds on decades of previously peer-reviewed and published scientific papers. You can see many of these cited in the pre-print publication IHME posted about its initial projections model here.

Q2: Are there any publicly available data sets you would recommend for coded research around the COVID-19 research database? 

Amanda: I think that the broader research about COVID-19 is going to have to be inclusive of looking at a lot of the social determinants of health, including economic indicators, among other data. I think that we will also look more to widely available open data sources published on data.gov [in the U.S.] and to intermediary groups, like the World Bank, if the project scope is international. I don't have specific data resources that I would recommend at this point, but some of the most widely used for tracking COVID cases and deaths include the Johns Hopkins University data resource, the New York Times data resource, and the COVID Tracking Project.

Bill: The list of data sources is ever changing as new data sets come online and old ones drop behind firewalls. A good start is this list put together by the NIH.

Q3: When will data and predictive analysis be available for all countries? How could governments run IHME’s model with their own collected data? 

Bill: IHME is working as quickly as it can to expand the projections to all countries that have experienced enough deaths to allow for a strong enough model. Governments are running their own models informed by IHME’s work, and if a government agency is interested in talking with IHME about its projections, it should write to: covid19@healthdata.org

Q4: Are there any key differentiators between models presented to policymakers, health care systems, and the general public? Are these audiences needing or looking for different types of data or visual presentations?

Amanda: I would hope that we [researchers] are not running different models for different people—in the sense that we think that the public can't handle what the actual outcomes are of some of these standardized models.

There is, however, a significant distinction between the different mathematical and epidemiologic approaches used in modeling. The big distinction is that a mathematical model (that's the IHME model) fits a curve, while epidemiologic models consider a number of other variables about the people who are currently susceptible, exposed, infectious, or recovered from a disease, as in the “SEIR model” (Susceptible - Exposed - Infectious - Recovered).
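To make the SEIR idea concrete, here is a minimal sketch of a compartmental model in Python. It is an illustration only, not the method used by IHME or any group mentioned here. The parameter names (`beta`, `sigma`, `gamma`), their values, the population size, and the simple fixed-step integration are all assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class SEIRState:
    s: float  # susceptible
    e: float  # exposed (infected, not yet infectious)
    i: float  # infectious
    r: float  # recovered


def step(state, beta, sigma, gamma, dt):
    """Advance the model by one small time step (forward Euler)."""
    n = state.s + state.e + state.i + state.r
    new_exposed = beta * state.s * state.i / n * dt   # S -> E
    new_infectious = sigma * state.e * dt             # E -> I
    new_recovered = gamma * state.i * dt              # I -> R
    return SEIRState(
        s=state.s - new_exposed,
        e=state.e + new_exposed - new_infectious,
        i=state.i + new_infectious - new_recovered,
        r=state.r + new_recovered,
    )


def simulate(initial, beta, sigma, gamma, days, dt=0.1):
    """Return one state per day, starting with the initial state."""
    steps_per_day = round(1 / dt)
    state = initial
    history = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            state = step(state, beta, sigma, gamma, dt)
        history.append(state)
    return history


# Hypothetical inputs: 1,000,000 people, 10 of them infectious on day 0.
history = simulate(
    SEIRState(s=999_990, e=0, i=10, r=0),
    beta=0.5,       # assumed transmission rate per day
    sigma=1 / 5,    # assumed 5-day average incubation period
    gamma=1 / 7,    # assumed 7-day average infectious period
    days=180,
)
peak_day = max(range(len(history)), key=lambda d: history[d].i)
print(f"Infectious count peaks on day {peak_day}")
```

Changing any of the assumed rates shifts the timing and height of the peak, which is one reason two models fed similar data can produce different trajectories.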

There are certainly different models being used, and how they're presented varies a bit. I wish there were more transparency in how models are presented to the public, specifically in providing what I find is necessary additional context about the model’s uncertainty and limitations.

When someone looks at a model and only sees a line plotted on a chart, they assume that there is a certain amount of certainty attached to the points on that line, when in actuality there are most likely wide confidence bands tied to those points.
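One hedged way to picture those bands: run a projection many times with an uncertain input and report a range for each day instead of a single line. The sketch below does this for a toy exponential-growth projection. The growth-rate range, starting case count, and helper names (`project`, `percentile`, `uncertainty_band`) are invented for illustration and do not come from any model discussed in the webinar.

```python
import random


def project(initial_cases, daily_growth, days):
    """Toy projection: cases grow by a fixed percentage each day."""
    return [initial_cases * (1 + daily_growth) ** d for d in range(days + 1)]


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def uncertainty_band(runs, lower=0.025, upper=0.975):
    """For each day, return (lower bound, median, upper bound) across all runs."""
    band = []
    for day_values in zip(*runs):
        ordered = sorted(day_values)
        band.append((
            percentile(ordered, lower),
            percentile(ordered, 0.5),
            percentile(ordered, upper),
        ))
    return band


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumption: the true daily growth rate lies somewhere between 2% and 8%.
runs = [project(100, rng.uniform(0.02, 0.08), 30) for _ in range(1000)]
band = uncertainty_band(runs)

low, median, high = band[30]
print(f"Day 30: median {median:.0f}, 95% interval {low:.0f} to {high:.0f}")
```

Plotting only the median would hide how quickly the interval widens the further out the projection goes.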

In early April, Matthew Kay and Jessica Hullman from Northwestern held a great webinar about visualizing uncertainty around COVID-19. Their presentation covered the “known unknowns,” the things that we know are unknown in our model, and the “unknown unknowns.” It’s those “unknown unknowns” that make it very difficult to actually plot uncertainty on these charts and graphs. Finally, they [Matthew Kay and Jessica Hullman] say all predictive models are wrong because underlying assumptions are hidden from the audience.

Bill: I would encourage you to read this recent piece in the New York Times, which included an interactive chart comparing various models.

Q5: How do you build confidence in your modeling? Do you think revealing the assumptions would be a good idea?

Amanda: I mean, all predictive models are wrong in some way, because they're based on uncertain data and information that's ever changing. We can't predict the future, but people [researchers] have been very transparent about this [predictive analytics] not being a crystal ball. On the flip side, I think it's a little bit incorrect to state that underlying assumptions are hidden from the audience. 

Good models will be very specific and explicit about disclosing their limitations, assumptions, and sources for data inputs and model parameters. If you don't see that information presented upfront (in white papers, on the website, in the graphic interface, or in a GitHub file), and it isn't clear why it's not available, there should be definite doubt, or hesitancy, about using those models for policymaking.

If you can't see inside the “black box” of how those data models were constructed, it would be difficult for anyone, even subject matter experts, to understand and determine the significance of their results, for example, why the shape of the model in question looks different from another. Modelers can only conduct good and valid comparisons through peer review, which first requires transparency.

Let’s all remember: these models are truly informing national- and state-level policies, so details and the peer review process could not be more important.

Bill: Statistician George Box made his famous comment about models in many published writings. Here is one: 

“All models are approximations. Assumptions, whether implied or clearly stated, are never exactly true. All models are wrong, but some models are useful. So the question you need to ask is not ‘Is the model true?’ (it never is) but ‘Is the model good enough for this particular application?’”

So far, IHME and other research institutions have seen the value in modeling play out as governments have been encouraged to put social distancing measures in place, lowering the number of severe cases of the disease, and thereby allowing health systems to handle the case load. In IHME’s case, every time it releases a new set of projections, it explains what has changed in its modeling. You can find those explanations here.

Q6: How often are the assumptions reviewed for the COVID-19 social distancing model, such as the assumption that a conservative 1 in 1 million goal should be reached in order to begin easing up? Have demonstrated hospital capacity metrics warranted a view of the model at, say, 5 in 1 million?

Bill: What makes IHME’s model a little different from some of the other models being used is its strong reliance on input data. Some models lock in assumptions and then carry those projections forward. What is happening with IHME’s model is more like a weather forecast, which means that it changes the projections as new data are fed in. So when there is a parameter that has been set in the model, for example the ratio between hospitalizations and deaths, that can change as data come in showing something new. The initial ratio of hospitalizations to deaths was higher when the world only had data from the provinces of China. As new data came in from around the U.S., IHME updated its parameters and, as a result, the projections for hospital resource utilization came down.
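The idea of a parameter that moves as data arrive can be sketched in a few lines of Python. This is not IHME's method. The pooled-ratio update, the class and function names, and every number below are hypothetical, chosen only to show how a lower observed ratio pulls a hospital projection down.

```python
from dataclasses import dataclass


@dataclass
class RatioEstimate:
    """Running estimate of hospitalizations per death, pooled over all data seen."""
    hospitalizations: int = 0
    deaths: int = 0

    def update(self, new_hospitalizations, new_deaths):
        """Fold a new batch of observations into the running totals."""
        self.hospitalizations += new_hospitalizations
        self.deaths += new_deaths

    @property
    def ratio(self):
        if self.deaths == 0:
            raise ValueError("no deaths observed yet; ratio is undefined")
        return self.hospitalizations / self.deaths


def projected_hospitalizations(projected_deaths, estimate):
    """Scale a death projection by the current ratio estimate."""
    return projected_deaths * estimate.ratio


estimate = RatioEstimate()

# Invented early data: 1,100 hospitalizations for 100 deaths.
estimate.update(1100, 100)
before = projected_hospitalizations(500, estimate)

# Invented later data with a lower ratio: 1,400 hospitalizations for 200 deaths.
estimate.update(1400, 200)
after = projected_hospitalizations(500, estimate)

print(f"Projection before update: {before:.0f}, after update: {after:.0f}")
```

The death projection is held fixed at 500 in both calls, so the drop in projected hospitalizations comes entirely from the updated ratio.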

Watch the full webinar to hear more on the power of data and predictive analytics during pandemics.