Getting started with data analysis and visualization in R

This is a tutorial on basic data analysis and visualization using the R packages dplyr and ggplot2. It assumes that you know some basics of R, so please check out our "Getting started with R" guide first.

Loading packages and dataset

We will be loading the following packages: dplyr and ggplot2.

# load packages
library(dplyr)
library(ggplot2)

# if you don't have the packages, install using the command below:
# install.packages('tidyverse') # this package contains both dplyr and ggplot2
# OR
# install.packages('dplyr')
# install.packages('ggplot2')

Set directory and load the data

Let’s set our working directory and load our data. The dataset contains COVID-19 case counts from 2020 and was obtained from Kaggle.

# set directory
setwd('/Users/rikac/Downloads')

# load data - this is COVID-19 cases updates in 2020
data <- read.csv('country_wise_latest.csv')

# type View(data) to see a pop-up tab of your dataset
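
Before cleaning, it can help to take a quick look at what was loaded. Here is a minimal sketch using a small made-up data frame (`toy`, with invented values) in place of the real file, so that it runs on its own:

```r
# toy stands in for the real dataset; the values are invented for illustration
toy <- data.frame(
  Country.Region = c("A", "B", "C"),
  Confirmed = c(100, 250, 40),
  Deaths = c(5, 12, 1)
)

dim(toy)      # number of rows and columns
names(toy)    # column names
str(toy)      # type of each column plus the first few values
head(toy, 2)  # first two rows
```

The same four commands should work on `data` once the CSV has been read in.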

Clean the dataset

The dataset seems pretty clean, but let’s say we want to change the column names to make them easier to work with.

# change column names of data
data <- data %>% # this is the pipe operator; it passes the result on its left to the function on its right ("and then")
    rename('Country_Region' = 'Country.Region',
           'Deaths_per_100_cases' = 'Deaths...100.Cases',
           'Recovered_per_100_cases' = 'Recovered...100.Cases',
           'Deaths_per_100_Recovered' = 'Deaths...100.Recovered',
           'One_week_change' = 'X1.week.change',
           'One_week_pct_increase' = 'X1.week...increase') # avoid '%' in a name, otherwise it needs backticks

# tip = things are better handled if column names are separated by underscores
# tip = make sure you save your edits back to data by writing "data <-"!
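
If you are wondering where names like 'Deaths...100.Cases' come from: read.csv() by default converts headers into syntactically valid R names. The sketch below reproduces that step with base R's make.names(). The raw headers listed here are an assumption about what the CSV file contains, inferred from the converted names.

```r
# assumed original CSV headers, containing slashes, spaces, and a percent sign
raw_headers <- c("Country/Region", "Deaths / 100 Cases", "1 week % increase")

# read.csv() adjusts names with make.names() when check.names = TRUE (the default):
# invalid characters become dots, and a name starting with a digit gets an "X" prefix
fixed <- make.names(raw_headers)
print(fixed)
```

Passing check.names = FALSE to read.csv() would keep the original headers, but then every column with spaces or symbols has to be wrapped in backticks, so renaming is usually the more comfortable route.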

Data analysis with dplyr

dplyr is a package that allows you to filter, select, group, and summarize data in a clean and readable way using the pipe operator (%>%). The pipe basically means 'and then': it passes the result of one command on to the next, which lets us connect multiple commands in a sequence. Below is a series of commands that can be done using dplyr.
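
To see what the pipe does, here is a small sketch on a made-up vector `x`. The same calculation is written twice, once nested and once piped:

```r
library(dplyr)

x <- c(4, 9, 16)

# without the pipe: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# with the pipe: read from left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped) # TRUE
```

Both versions take the square roots, average them, and round the result. The piped version simply lists the steps in the order they happen.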

First, let’s start by inspecting our dataset. Let’s say we are interested in looking at these four columns: Country_Region, Confirmed, Deaths, and Recovered. We can inspect them by using the select() command below:

# Select only the following columns: Country_Region, Confirmed, Deaths, Recovered
data %>%
    select(Country_Region, Confirmed, Deaths, Recovered)

This command simply shows the columns that you have specified. In the top right corner of the output, you can see “df[187 x 4]”, which means that the selected data has 187 rows and 4 columns.
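
select() can do a little more than pick columns by name. A short sketch on a made-up data frame (`toy`, invented values):

```r
library(dplyr)

# toy data with invented values, standing in for the real dataset
toy <- data.frame(
  Country_Region = c("A", "B"),
  Confirmed = c(100, 250),
  Deaths = c(5, 12),
  Recovered = c(80, 200)
)

dropped <- toy %>% select(-Recovered)                       # keep everything except Recovered
renamed <- toy %>% select(Country_Region, Cases = Confirmed) # select and rename in one step

names(dropped)
names(renamed)
```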

Now, let’s filter the data to look at deaths and recovered cases in Italy only. We will also use the summarise() command to get the total number of recovered cases and deaths. summarise() collapses the rows it receives into a single row of summary values, such as sums or means. Since this dataset appears to have one row per country, the sums here simply return Italy's own values.

# filter to just Italy and view the total number of Deaths and Recovered
data %>%
    filter(Country_Region == "Italy") %>%
    summarise(Total_Recovered = sum(Recovered),
              Total_Deaths = sum(Deaths))

Tip: Total_Recovered and Total_Deaths are names you assigned and are not names that exist in the dataset!
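
summarise() becomes more interesting when it receives several rows. A minimal sketch on a made-up data frame (`toy`, invented values), computing three summaries at once:

```r
library(dplyr)

# toy data with invented values
toy <- data.frame(
  Country_Region = c("A", "B", "C"),
  Recovered = c(80, 200, 30),
  Deaths = c(5, 12, 1)
)

totals <- toy %>%
  summarise(Total_Recovered = sum(Recovered), # add up all rows
            Mean_Deaths = mean(Deaths),       # average across rows
            Countries = n())                  # number of rows summarised

print(totals)
```

The result is a single row, with one column for each summary you named.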

Now, let’s filter by rows to see the countries where the number of deaths is less than 10.

# filter by rows to show countries where number of deaths is less than 10
data %>%
    filter(Deaths < 10) %>%
    select(Country_Region)

We used “select(Country_Region)” here to show only the names of the countries/regions where deaths are less than 10. Otherwise, the output would include all the other columns for those countries as well!
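
filter() also accepts more than one condition. A small sketch on a made-up data frame (`toy`, invented values): conditions separated by a comma must all be true, while `|` means "or".

```r
library(dplyr)

# toy data with invented values
toy <- data.frame(
  Country_Region = c("A", "B", "C"),
  Confirmed = c(100, 250, 40),
  Deaths = c(5, 12, 1)
)

# both conditions must hold: fewer than 10 deaths AND more than 50 confirmed cases
both <- toy %>% filter(Deaths < 10, Confirmed > 50)

# either condition may hold: fewer than 10 deaths OR more than 300 confirmed cases
either <- toy %>% filter(Deaths < 10 | Confirmed > 300)

both$Country_Region
either$Country_Region
```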

Let’s now look at the top WHO regions by total confirmed cases. First, we use the group_by() function to group countries by WHO region. Next, we use summarise() to add up the 'Confirmed' column within each region. Finally, we use arrange() with 'desc(Total_Cases)' to show the result in descending order.

# Show top WHO region by total confirmed cases
data %>%
    group_by(WHO.Region) %>%
    summarise(Total_Cases = sum(Confirmed)) %>%
    arrange(desc(Total_Cases))
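
To make the grouping step easier to follow, here is the same pattern on a tiny made-up data frame (`toy`, with an invented Region column), small enough to check by hand:

```r
library(dplyr)

# toy data with invented values: two countries in North, one in South
toy <- data.frame(
  Region = c("North", "North", "South"),
  Confirmed = c(100, 250, 40)
)

by_region <- toy %>%
  group_by(Region) %>%                        # one group per region
  summarise(Total_Cases = sum(Confirmed),     # sum within each group
            Countries = n()) %>%              # rows within each group
  arrange(desc(Total_Cases))                  # largest total first

print(by_region)
```

The output has one row per group: North with 350 cases from 2 countries, then South with 40 cases from 1 country.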

Bonus: The mutate() command creates a new column from the values in existing columns. This is useful for calculating percentages, among other operations. For example, to get a column with the percentage of females in the total population, we can divide the column with the number of females by the column with the total population and multiply by 100.

# let's create a new column called Death_Rate, where we take the number of deaths divided by the
# number of confirmed cases

data <- data %>%
    mutate(Death_Rate = Deaths / Confirmed)

# again, we save it back to the dataset by using the 'data <-', otherwise the new column will not be saved!
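
The percentage-of-females example mentioned above could look like the sketch below. The data frame `pop` and its columns are hypothetical and the numbers are invented.

```r
library(dplyr)

# hypothetical population data with invented values
pop <- data.frame(
  Town = c("X", "Y"),
  Females = c(520, 300),
  Total_Population = c(1000, 750)
)

# new column: females as a percentage of the total population
pop <- pop %>%
  mutate(Pct_Female = 100 * Females / Total_Population)

print(pop)
```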

Data visualization with ggplot2

ggplot2 is a package for data visualization, where you can do tons of customizations for your plots.

Let’s say we want to create a barplot of the number of confirmed cases in countries within the Southeast Asia WHO region.

# first, we have to subset the data to only countries within South-East Asia

sea_countries <- data %>%
                    filter(WHO.Region == 'South-East Asia')


g1 <- ggplot(sea_countries, aes(x=Country_Region, y=Confirmed)) +
        geom_bar(stat = 'identity', fill='steelblue') +
        labs(title = 'Number of confirmed cases in countries in Southeast Asia WHO region',
             x = 'Countries',
             y = 'Number of confirmed cases')

print(g1) # since we save the plot to g1, type this command to view the plot

By default, geom_bar() counts the number of rows in each category (in our case, each country). Since our data is already aggregated, with one row per country and the count stored in the Confirmed column, we specify stat = 'identity' to override that default and use the values as the bar heights.
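
The difference between counting rows and using the values directly can be seen in a small sketch with made-up data (`raw` and `agg` are invented for illustration). It also uses geom_col(), which is ggplot2's shorthand for a bar chart whose heights come straight from the data.

```r
library(ggplot2)

# one row per observation: geom_bar() counts the rows in each category
raw <- data.frame(Fruit = c("apple", "apple", "pear"))
p_count <- ggplot(raw, aes(x = Fruit)) +
  geom_bar()

# already aggregated: the heights are stored in a column, so use them as-is
agg <- data.frame(Fruit = c("apple", "pear"), Count = c(2, 1))
p_identity <- ggplot(agg, aes(x = Fruit, y = Count)) +
  geom_col()

# layer_data() shows the values ggplot2 actually draws; both plots get the same bar heights
layer_data(p_count)$y
layer_data(p_identity)$y
```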

Now, let’s look at the comparison of death rates between WHO regions using boxplots.

# now, let's look at the comparison of death rates within WHO regions

g2 <-  ggplot(data, aes(x = WHO.Region, y = Death_Rate)) +
            geom_boxplot(fill='pink') +
            labs(title = 'Death Rates across WHO regions',
                 x = 'WHO regions',
                 y = 'Death rates')

print(g2)

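Boxplots summarize the distribution of a numeric variable: the median (the line inside each box), the quartiles (the box edges), the overall spread (the whiskers), and outliers (the individual points). Note that a boxplot shows the median, not the mean. From the plot above, Europe appears to have the highest median death rate. Overall, however, the regions have broadly similar distributions.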
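
To connect the picture to numbers, here is a base R sketch on a made-up vector of rates (`rates`, invented values). boxplot.stats() returns the values behind a base R boxplot. ggplot2 may compute the quartiles slightly differently, but the idea is the same.

```r
# invented death rates, with one unusually high value
rates <- c(0.01, 0.02, 0.02, 0.03, 0.04, 0.30)

median(rates) # the line inside the box
mean(rates)   # pulled upward by the extreme value, and not drawn on a boxplot

stats <- boxplot.stats(rates)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond 1.5 times the box height, drawn as outliers
```

Here the single extreme value raises the mean well above the median, which is why the two should not be mixed up when reading a boxplot.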

Now, let’s make it a bit more complicated and look at the different types of cases (confirmed, deaths, recovered) across WHO regions, and plot them side by side.

# first, we need to summarize each case type

region_totals <- data %>%
                    group_by(WHO.Region) %>%
                    summarise(Confirmed = sum(Confirmed),
                              Deaths = sum(Deaths),
                              Recovered = sum(Recovered))



# view region_totals to make sure you understand the purpose of this step
# now, we need to make a new dataframe with one row per region and case type (a "long" format)

region_cases <- data.frame(
                    WHO_region = rep(region_totals$WHO.Region, each = 3),
                    Case_Type = rep(c("Confirmed", "Deaths", "Recovered"), times = nrow(region_totals)),
                    # rbind() interleaves the counts so each value lines up with its region and case type
                    Count = c(rbind(region_totals$Confirmed, region_totals$Deaths, region_totals$Recovered)))



# view region_cases to see the purpose of this step
# now, let's plot!

g3 <- ggplot(region_cases, aes(x = Case_Type, y = Count)) +
        geom_bar(stat = "identity", fill = 'lightgreen') +
        labs(title = "Total COVID-19 case types across WHO regions (2020)",
             x = "Case types",
             y = "Total Count") +
        facet_wrap(~WHO_region) +
        theme(legend.position = "none")

print(g3)
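
The reshaping step before the plot, turning one column per case type into one row per region and case type, can also be done with pivot_longer() from the tidyr package. This is an alternative sketch, not the approach used above. It assumes tidyr is installed (it comes with the tidyverse) and uses a made-up table `toy_totals` with invented values.

```r
library(dplyr)
library(tidyr)

# toy regional totals with invented values
toy_totals <- data.frame(
  WHO.Region = c("North", "South"),
  Confirmed = c(350, 40),
  Deaths = c(17, 1),
  Recovered = c(280, 30)
)

# stack the three count columns into two: the case type and its count
toy_long <- toy_totals %>%
  pivot_longer(cols = c(Confirmed, Deaths, Recovered),
               names_to = "Case_Type",
               values_to = "Count")

print(toy_long)
```

Because pivot_longer() keeps each count attached to its own region and column name, there is no need to line up the vectors by hand.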

That’s it for some basic data analysis and visualization in R! Check out the dplyr and ggplot2 documentation for more functions and usage examples!

By: Rika Chan