No More Plague, Inc.
Could the Covid-19 pandemic have been avoided in the UK?
Introduction
SARS-CoV-2 has spread around the entire world, causing a global pandemic on a scale never seen before in recent history. It changed and endangered the lives of millions. The unique nature of this event raised many questions and concerns among experts. How can the virus be contained? How deadly is it? How should I live to stay safe and be unlikely to die? The last question is of great importance to each individual.
In light of these events, there has been much debate about the effect of government responses to a pandemic and the behaviours people should adopt. But could this have been prevented even before we heard about the new virus?
Our analysis concerns the United Kingdom, the first country to have approved a vaccine to combat the virus. The country keeps track of Covid-19 deaths and collects a wide range of data, such as food consumption patterns and the Index of Multiple Deprivation (IMD), for the Greater London area.
Our work focuses on one question: are there any factors in an individual’s way of living that affect Covid-19 deaths?
As said above, we base our work on Covid-19 deaths at MSOA level in the Greater London area[1], the Tesco dataset[2] and the IMD[3], which combines many weighted indicators ranging from income to crime and health.
Health, Nutrition & Data
Our first look at the data is through the prism of health and nutrition. Health is deeply linked to nutrition; any imbalance can lead to deficiencies or to diseases like diabetes. We will try to find a trend in the population’s food consumption and link it to the risk of dying from the virus. This will lead us to hypotheses on how people could modify their diet to reduce that risk.
To build the models we need, we must obtain data on people’s food consumption habits that meets some requirements. The dataset must be large; otherwise our analysis will be too heavily affected by the small sample size. The data must not come from self-reported surveys, as people may misreport their consumption, intentionally or not. Such surveys could also lead to imbalances in the geographical distribution of the data.
As everybody must buy food somewhere, a retailer such as Tesco can be a good source of anonymized data. It is particularly interesting for our case study, as Tesco stores are spread across the whole London area. The company records the purchases of Tesco loyalty card holders. The data is aggregated by Middle Layer Super Output Area (MSOA) for the Greater London area and is thus easy to incorporate into our analysis.
Representativeness
There are 983 MSOAs in total; they are shown in blue in the figure on the left for our area of focus. We can even see the River Thames flowing through the Greater London area. The Tesco dataset summarizes food consumption as a “typical product” for each area. A typical product is defined as the mean of the characteristics (weight, quantity of fat, nutrients, etc.) of the food items sold in Tesco stores in that area.
For example, a typical product bought in the City of London has 5.25 g of protein and 9.27 g of sugar. We cannot use the dataset as it is: although it seems complete, it falls short in several ways. Tesco stores are not evenly distributed across the area. Compiling the shopping records of card holders is useful, but it does not account for customers who do not have a card. We therefore filtered the data to keep only relevant areas, defined as areas where the records represent the whole population’s food consumption correctly. After this filtering, we work only with areas where card holders make up at least 16% of the total population.
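The representativeness filter described above can be sketched in a few lines. This is only an illustration: the field names (`msoa`, `card_holders`, `population`) and the sample values are hypothetical and do not come from the Tesco dataset; only the 16% threshold is taken from the text.

```python
# Hypothetical per-area records; field names and values are illustrative only.
areas = [
    {"msoa": "A", "card_holders": 2000, "population": 8000},
    {"msoa": "B", "card_holders": 500, "population": 9000},
    {"msoa": "C", "card_holders": 1800, "population": 10000},
]

MIN_CARD_SHARE = 0.16  # threshold stated in the text


def representative_areas(records, min_share=MIN_CARD_SHARE):
    """Return ids of areas where card holders are at least `min_share` of residents."""
    kept = []
    for rec in records:
        if rec["population"] <= 0:
            continue  # skip areas without a usable population figure
        share = rec["card_holders"] / rec["population"]
        if share >= min_share:
            kept.append(rec["msoa"])
    return kept


selected = representative_areas(areas)
print(selected)
```

Area B is dropped here because only about 6% of its residents hold a card, so its typical product would say little about the area as a whole.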
This leaves us with the MSOAs in blue.
So you think people eat healthy?
With the filtered dataset, we want to label each area as “healthy” or “unhealthy” based on the nutritional values of its typical product. This will enable us to compare Covid-19 deaths between healthy and unhealthy areas. To define the healthiness of an area, we will use the World Health Organization’s base recommendations[4]: less than 30% of total energy consumed should come from fat, less than 10% from saturated fat and less than 10% from sugar.
These boundaries appear in the figure below with the actual distribution of the percentage of energy consumed for the three features.
We can clearly see that not a single area satisfies any of the criteria set by the WHO. We therefore define as “healthy” the 50% of areas that come closest to satisfying the WHO recommendations. In effect, our “healthy” label really means “less unhealthy”.
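One possible way to turn these rules into a label is sketched below. The text does not say how “closest to satisfying” was measured, so the distance used here (the summed excess of each energy share over its WHO limit) is an assumption, as are the field names and sample values. The energy conversion uses the standard factors of 9 kcal per gram of fat and 4 kcal per gram of sugar.

```python
from statistics import median

KCAL_PER_G_FAT = 9.0    # standard Atwater factor for fat
KCAL_PER_G_SUGAR = 4.0  # standard Atwater factor for carbohydrates

# WHO limits quoted in the text, as shares of total energy.
WHO_LIMITS = {"fat": 0.30, "saturate": 0.10, "sugar": 0.10}


def energy_shares(product):
    """Share of the typical product's energy coming from each nutrient."""
    total = product["energy_kcal"]
    return {
        "fat": product["fat_g"] * KCAL_PER_G_FAT / total,
        "saturate": product["saturate_g"] * KCAL_PER_G_FAT / total,
        "sugar": product["sugar_g"] * KCAL_PER_G_SUGAR / total,
    }


def excess_over_who(product):
    """Assumed distance: summed excess of each share over its WHO limit."""
    shares = energy_shares(product)
    return sum(max(0.0, shares[k] - WHO_LIMITS[k]) for k in WHO_LIMITS)


def label_healthy(products_by_area):
    """Label as healthy (True) the half of areas with the smallest excess."""
    scores = {a: excess_over_who(p) for a, p in products_by_area.items()}
    cutoff = median(scores.values())
    return {a: score <= cutoff for a, score in scores.items()}


# Hypothetical typical products, one per area.
typical = {
    "A": {"energy_kcal": 180, "fat_g": 9, "saturate_g": 3.5, "sugar_g": 10},
    "B": {"energy_kcal": 180, "fat_g": 8, "saturate_g": 3.0, "sugar_g": 9},
    "C": {"energy_kcal": 180, "fat_g": 10, "saturate_g": 4.0, "sugar_g": 11},
    "D": {"energy_kcal": 180, "fat_g": 7, "saturate_g": 2.5, "sugar_g": 8},
}
labels = label_healthy(typical)
print(labels)
```

In this toy example every area exceeds all three limits, as in the real data, and the label only separates the less excessive half from the more excessive one.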
Propensity Matching
One notices that many of the healthy areas are grouped together, especially in the north-west.
To exclude other influences on the analysis, for example that the population of healthy areas is simply younger and therefore less likely to die of Covid-19, we apply propensity score matching, a statistical method designed to exclude such influences. Before matching, the curves for the healthy (test) and unhealthy (control) groups are far apart. After matching they are almost identical, which is what we want: the distribution of the covariate, in this case the total population aged 0-17, is then the same in both groups.
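The matching step can be illustrated with a minimal sketch. It assumes that a propensity score (the estimated probability of an area being “healthy” given covariates such as the population aged 0-17) has already been computed for each area, and it uses greedy one-to-one nearest-neighbour matching with a caliper. The text does not state which matching algorithm was used, so this choice, the caliper value and the sample scores are all assumptions.

```python
def match_nearest(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on a precomputed propensity score.

    `treated` and `control` map an area id to its propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not matched yet
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control in terms of propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # matching without replacement
    return pairs


# Hypothetical scores for healthy (treated) and unhealthy (control) areas.
healthy_scores = {"H1": 0.62, "H2": 0.35, "H3": 0.90}
unhealthy_scores = {"U1": 0.60, "U2": 0.33, "U3": 0.10}

pairs = match_nearest(healthy_scores, unhealthy_scores)
print(pairs)
```

Area H3 stays unmatched because no unhealthy area has a comparable score; dropping such areas is what brings the two distributions close together after matching.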
The Influential Factors
After using propensity score matching to limit other influences, we can start to investigate which food consumption parameters are critical for the risk of dying of Covid-19. We plotted a comparison of the median number of deaths due to Covid-19.
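As a rough illustration, a median comparison over matched areas could be computed as follows. The pairs and the death rates are hypothetical values chosen for the example, not results from the study.

```python
from statistics import median


def median_deaths(matched_pairs, death_rate):
    """Median death rate in the healthy and unhealthy halves of the matched pairs."""
    healthy = [death_rate[h] for h, _ in matched_pairs]
    unhealthy = [death_rate[u] for _, u in matched_pairs]
    return median(healthy), median(unhealthy)


# Hypothetical matched pairs (healthy area, unhealthy area) and death rates.
matched = [("H1", "U1"), ("H2", "U2"), ("H3", "U3")]
deaths_per_100k = {
    "H1": 40.0, "H2": 55.0, "H3": 70.0,
    "U1": 50.0, "U2": 65.0, "U3": 90.0,
}

medians = median_deaths(matched, deaths_per_100k)
print(medians)
```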
Time for Machine Learning
We are starting to understand which features have an impact on the gravity of the virus in a region. Using machine learning, we will create a model of the pandemic based on all compatible features of the dataset.
Machine learning is a category of algorithms and procedures designed to improve with experience. There are many facets to this domain, and many algorithms which may or may not fit a given problem.
Inter-Predictions
We now have three key elements from three different datasets: general healthiness of an area, deprivation index and gravity of the pandemic. Would it be possible to predict one of those elements efficiently using the other two?
We tried different machine learning models and came to the conclusion that these features are not very predictive of each other. We managed to reach an R² score of 0.12 when predicting the IMD19 score, but that was not sufficient to create a good model of the situation and hope to solve the pandemic.
As such, our quest continues.
More is More
As we all know, when it comes to statistical models, more data means more learning. Instead of limiting our predictions to a few meaningful factors and seeing how they interact, we used all the features available in our datasets to try to predict the gravity of the pandemic (except the number of Covid-19-related deaths, since that would be cheating).
We created multiple models and found that linear regression gave decent results. Our R² score was approximately 0.3 during model creation, so the model explains only about 30% of the variance in the data. Our results should therefore not be taken as fact but rather as an interesting take on the situation.
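To make the R² score concrete, here is a minimal sketch that fits a one-feature least-squares line and computes R² as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data points are made up, and the actual models used many features, so this only illustrates the metric.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up feature values and target values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
r2 = r_squared(ys, predictions)
print(round(r2, 2))
```

A model that always predicts the mean of the target scores 0, and a perfect model scores 1, which is why scores of 0.12 and 0.3 call for caution.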
PlagueSolver™
We now have a model and ideas of relevant features.
By modifying our original dataset and applying some modifiers, we can see the potential gravity of the pandemic in an alternate universe. Below, you'll find some examples of the modifiers we applied and the resulting simulations.
Double Population Density
This comes as no surprise. Doubling the density of population per area indeed worsens the pandemic.
20% More Fruits & Veggies
With 20% more fruit and vegetable consumption, the gravity is reduced by around 50%!
Lowering Deprivation
Relative deprivation is essentially a measure of poverty. Lowering it by 10% once again reduces the gravity of the virus by about half.
Less Energetic Food
Interestingly, lowering the energy density of food by 30% seems to worsen the pandemic by about 40%. This may be caused by different consumption habits between at-risk and safer groups.
Solving the Pandemic
Using what we learned along the way, we tuned some of the factors to try to simulate a better outcome of the pandemic. To keep things somewhat realistic, we applied small modifiers to habits and to the situation. We were able to create a simulated version with about 40% fewer deaths relative to population, using the following modifiers:
- 10% decrease in IMD19 score
- 10% decrease in population density
- 10% increase in energy density
- 20% increase in fruits and veggies
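The idea of applying modifiers and re-running the model can be sketched as follows. The linear model here is a stand-in: its coefficients, intercept, feature names and the sample area are all hypothetical and are not the fitted values behind PlagueSolver™, so the printed change will not reproduce the 40% figure. Only the four modifier values come from the list above.

```python
# Hypothetical linear model; these numbers are NOT the fitted coefficients.
COEFFICIENTS = {
    "imd_score": 0.02,
    "population_density": 0.00005,
    "energy_density": -0.3,
    "fruit_veg_share": -2.0,
}
INTERCEPT = 1.0

# Modifiers from the list above, expressed as multiplicative factors.
MODIFIERS = {
    "imd_score": 0.9,           # 10% decrease in IMD19 score
    "population_density": 0.9,  # 10% decrease in population density
    "energy_density": 1.1,      # 10% increase in energy density
    "fruit_veg_share": 1.2,     # 20% increase in fruits and veggies
}


def predict(features):
    """Predicted gravity for one area under the stand-in linear model."""
    return INTERCEPT + sum(COEFFICIENTS[k] * features[k] for k in COEFFICIENTS)


def apply_modifiers(features, modifiers):
    """Scale each feature by its modifier; unlisted features stay unchanged."""
    return {k: v * modifiers.get(k, 1.0) for k, v in features.items()}


def relative_change(areas, modifiers):
    """Relative change in total predicted gravity after applying the modifiers."""
    baseline = sum(predict(a) for a in areas)
    simulated = sum(predict(apply_modifiers(a, modifiers)) for a in areas)
    return (simulated - baseline) / baseline


# One hypothetical area.
sample_areas = [
    {"imd_score": 20.0, "population_density": 6000.0,
     "energy_density": 1.5, "fruit_veg_share": 0.2},
]
change = relative_change(sample_areas, MODIFIERS)
print(f"{change:+.1%}")
```

Because the simulation simply re-evaluates a fitted model on altered inputs, it reflects correlations in the data and says nothing about cause and effect.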
Conclusion
Although our models could have been more precise with more data and perhaps with more precise algorithms, we were able to create a simulation which seems to follow popular knowledge of what makes a healthy lifestyle and of the factors which worsen a pandemic. It is important to note that although the trends seem relevant, the values should be taken with a large grain of salt, and PlagueSolver™ is just a fun tool to play around with.
Although the modifiers we applied might seem rather small, the changes a society would need to make to implement them are country-wide, decade-long challenges with many implications. It probably wouldn't suffice to eat 50% more fruit once the pandemic has begun...
Or would it?
Bibliography
[1] Office for National Statistics, “Deaths involving COVID-19 by local area and deprivation”, https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/deathsinvolvingcovid19bylocalareaanddeprivation
[2] Luca Maria Aiello, Daniele Quercia, Rossano Schifanella & Lucia Del Prete, “Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London”
[3] IMD2019 Maps, https://research.mysociety.org/sites/imd2019/about/
[4] WHO, base recommendations, https://www.who.int/publications/m/item/healthy-diet-factsheet394.