In this tutorial, we’re going to use one of the COVID datasets to practice some of the things that we’ve learnt during the course. This will hopefully give you some ideas of how what we’ve been going through can be applied to your own data. The tutorial is divided into data exploration and data modelling sections. There’s a series of exercises running through the tutorial. Each section has some exercises which ask specific questions. The code used to answer these specific questions is provided and these are in a similar style to previous tutorials. Each section also contains several extended questions. The idea with these is that you can explore whatever parts of the data you want to - the question gives you a framework to do it but you can then choose which data to use and how to explore/model it. These questions are therefore a bit more involved. So have a go at the specific questions and if you’d like to get more practice, run through the extended questions.
We’re going to use several COVID datasets that contain different information. They can all be found in the datasets folder if you’ve downloaded the course from GitHub. Not a problem if not, you can download the datasets from here: . Click to open each dataset in turn and then right click, save as and save them somewhere so you know where they are.
Whenever you start a new analysis, its always a good idea to first load the libraries that you want to use. Not having the library loaded is one of the most common sources of errors in R. We only need 1 library for this tutorial (unless you want any others for the extensions) which is ggplot2. This tutorial has been set up so all of the code to answer the exercises is in the document but is hidden by default. This is so you can try to answer the exercises without help if you’d like to. But if you’re not sure what to do, you can click on the “code” square on the right hand side just below each exercise and it will show you the code that I’d use to carry out the exercise. This is just what I’d do and for at least several of the exercises there’s probably more efficient ways of doing it so don’t worry if what you’ve done doesn’t look like what I’d do!
Exercises
We’re next going to import the two COVID datasets that we’ll use for the rest of the tutorial.
Exercises
covid is now a data frame containing a series of measurements for a series of countries. Each row is a country and each column is some measurement from the country. Towards the end of the tutorial, we’re going to construct a model to explain the number of COVID cases within a country and explore the factors that influence this number. So let’s start by exploring some of the variables to get an idea of the range and distribution of their values. We’ll initially look at the number of confirmed cases and several of the variables that might influence the number of cases. One good way to get an idea of the range and distribution of a variable of interest is using a histogram. The values for some of the variables we’re interested in span several orders of magnitude. Its difficult to visualise these variables on a normal axis. However, we can transform the plot axis into logarithmic values which will make these variables easier to visualise. Let’s take a look at some of variables.
Exercises
#The structure of the ggplot line will be the same in each case: ggplot(covid, aes(x = Column_you_want_to_plot)) + scale_x_log10() + geom_histogram()
ggplot(covid, aes(x = confirm)) + scale_x_log10() + geom_histogram() #Number of confirmed cases
ggplot(covid, aes(x = death)) + scale_x_log10() + geom_histogram() #Number of COVID-related deaths - some countries have 0 here so you'll get a message saying some infinite values have been introduced as you can't take the logarithm of 0
ggplot(covid, aes(x = tourist_arrival_2018)) + scale_x_log10() + geom_histogram() #Number of tourists arriving in the country
ggplot(covid, aes(x = tourist_departure_2018)) + scale_x_log10() + geom_histogram() #Number of tourists departing from the country
ggplot(covid, aes(x = fempop_2018)) + scale_x_log10() + geom_histogram() #Female population
ggplot(covid, aes(x = male_pop_2018)) + scale_x_log10() + geom_histogram() #Male population
ggplot(covid, aes(x = pop_over65_2018)) + scale_x_log10() + geom_histogram() #Population over the age of 65
These exercises have enabled us to start to get a picture of what some of our variables look like. Exercise 4 shows why its often very useful to transform one or both of our axes into logarithmic values - using a normal axis made it much harder to see the full distribution of values for the number of confirmed cases. This is because when we have one or more outlier measurements, most of the axis is taken up by the space between the outlier value(s) and the rest of the values. The rest of the values therefore end up being crunched up close together so we can’t see the differences between them. The logarithmic axis reduces the difference between the outlier(s) and the rest of the values so enables us to see the differences between the rest of the values more easily. I’d usually plot something initially using normal axes and see what it looks like. If it looks like the plot in exercise 4 with lots of values crunched together and a big space before getting to other values, I’d then see if it looks better with the logarithmic axes.
The histograms have shown us that most of the values fall into a middle range for most of the variables. Having an understanding of the distribution of these variables will come in handy when we do our modelling later in the tutorial. Its also often useful to know if there are variables that are tightly correlated with one another. This is because when it comes to modelling, we’ll likely find it difficult to disentangle the effects of two or more variables if they are very tightly correlated. To look at whether variables of interest are correlated, we can plot them against one another using ggplot.
Extended exercises
E1) Plot some of these variables for exercises 3-5 against one another using geom_point(). Does it look like any of them are strongly correlated?
We’ve now seen how some of the variables in our dataset are distributed across countries. Its clear that some countries have lower values for some variables than other countries. It might be useful to understand which countries correspond to which values as this might influence our interpretation. Let’s look at one of the variables in detail - the Oxford stringency level. This is the degree to which a country has imposed restrictions in response to the COVID outbreak. A score of 0 means no restrictions while a score of 100 means maximum restrictions. We plotted the distribution of the Oxford stringency level in exercise 4. This showed that we have the full range of values in our dataset - some countries are at level 0 while others are at level 100. But this plot didn’t show us where these restrictions have been put in place. In our data frame, we have a column called region which is the continent on which the country is located. Let’s use ggplot to determine if the Oxford stringency level is determined by continent. We can do this just by colouring our histogram by the region column.
Exercises
R plots this as a stacked bar plot by default. So all of the countries with score 100 are plotted on top of one another and we therefore have multiple continents coloured in the bar representing a score of 100. If the Oxford stringency level was determined by continent, we’d expect to see the bars representing high values being all one colour and the bars representing low values all being another colour. What we actually see is the colours intermingled throughout the plot. So the Oxford stringency score isn’t determined completely by continent. Rather different countries within the same continent have imposed different levels of restrictions.
Let’s look at the countries with the highest and lowest restriction levels.
Exercises
When we plotted the number of confirmed COVID cases and deaths in exercise 3, it was clear that some countries have more cases and deaths than others. In this next part of the tutorial, we’re going to use linear models to try to pick out factors that might influence the number of cases and deaths.
One simple scenario is that the number of cases in a country is influenced by the population size of the country. This makes some sense from an epidemiological perspective as a larger population means more people to infect. Let’s first see if there’s a correlation between cases and population size. The covid data frame has population divided into females and males, so to simplify things let’s first combine these columns into a column showing the total population. We can do this using:
covid$total_population <- covid$fempop_2018 + covid$male_pop_2018
or using dplyr
covid <- mutate(total_population = fempop_2018 + male_pop_2018)
This creates a new column total_population by adding the values in columns fempop_2018 and male_pop_2018 within the respective row. We can then estimate the influence of population size on the number of covid cases using a linear model:
Exercises
covid$total_population <- covid$fempop_2018 + covid$male_pop_2018
summary(lm(confirm ~ total_population, data = covid))
summary(lm(death ~ total_population, data = covid))
This shows us that the population size influences the number of confirmed cases and the number of deaths.
Exercises
model1 <- lm(confirm ~ 1, data = covid)
model2 <- lm(confirm ~ total_population, data = covid)
model3 <- lm(confirm ~ Oxford_stringency_level, data = covid)
model4 <- lm(confirm ~ tourist_arrival_2018, data = covid)
model5 <- lm(confirm ~ tourist_departure_2018, data = covid)
#R^2 is part of the model summary
summary(model1)
summary(model2) # We can extract the minimum and maximum number
#of the residuals from the summary of model 2. min/max residual: -20341.1/ 20380.9
summary(model3)
summary(model4)
summary(model5)
#a plot of the regression line and data
ggplot(covid, aes(confirm, total_population)) +
geom_point() +
geom_smooth(method = "lm")
From this, it looks like tourist_departure gives the best fit to the number of cases. Note that for all models we find extreme outliers that influence the regression model this can lead to a poor fit. In practise we may need to identify and remove outliers or use other more robust regression methods. It’s absolutely crucial to check if a model is well behaved by inspecting the distribution of residuals.
Extended exercises
E1) Investigate whether the residuals of the models are normal distributed using at least two different ways (hint: see R wrap-up lecture)
E2) Investigate the factors in the dataset that have the greatest influence on the number of COVID deaths
Some times we want to use more than one independent variable to predict our dependent variable.
Exercises
model6 <- lm(confirm ~ total_population + Oxford_stringency_level, data = covid)
summary(model6)
AIC(model2, model6)
The R^2 is 0.3106 for the combined model vs 0.3049 for the univariate model but model 6 has got a higher AIC than model 2 (higher is worse). It’s important to note that the R^2 is always higher/better for more complex models. The AIC penalizes the number of variables/parameter and is the appropriate measure when comparing models that use a different number of variables.
Extended exercises
E1: Find a combination of variables that predicts confirmed cases or deaths better than a univariate predictor and also has got a better AIC.