Unleash the Power of Data using R with Covid Trial Dataset

Anu Ganesan 31 Mar, 2021 • 3 min read
This article was published as a part of the Data Science Blogathon.

Introduction

Data is the universal truth guiding today’s world of Data Science. At the same time, extracting actionable insights is not as simple as it looks. Turning data into meaningful insights requires a deep understanding of the domain and of the environment from which the data originates.

The year 2020 turned into a painful year. Covid has changed our lives forever. These difficult times have taught us how to be resilient and empathetic. Hope is setting foot with the arrival of Covid vaccines. Let’s learn the power of data using the Covid Trial Datasets, which have led to the emergence of vaccines in record time.

We will be using R as the programming language to explore and visualize Covid Trial Datasets.

Below is a visualization of Covid Vaccine Trial Datasets in R that we will be analyzing:

Step 1: Load the Covid Trial Dataset

CovidDF <- read.csv("CovidTrials.csv")

Step 2: Analyze Datasets and their variables

> str(CovidDF)
'data.frame': 5061 obs. of 27 variables:

There are 5061 observations (rows) with 27 variables (columns).
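As a minimal sketch of the same inspection step, the snippet below builds a tiny stand-in data frame and looks at its structure. The data frame `toyDF` and its four rows are illustrative assumptions, not part of the trial dataset; only the column names Gender, Study.Results and Age come from the article.

```r
# Illustrative stand-in for the trial data (values chosen for the example)
toyDF <- data.frame(
  Gender = c("All", "Female", "Male", ""),
  Study.Results = c("No Results Available", "Has Results",
                    "No Results Available", "No Results Available"),
  Age = c("18 Years and older (Adult, Older Adult)",
          "Child, Adult, Older Adult",
          "18 Years to 48 Years (Adult)",
          "18 Years and older (Adult, Older Adult)"),
  stringsAsFactors = FALSE
)

str(toyDF)                  # compact overview: type and first values per column
print(dim(toyDF))           # number of rows, then number of columns
print(sapply(toyDF, class)) # class of every column
```

Running the same three calls on CovidDF gives a quick picture of which columns are numeric and which are free-form text before any modeling starts.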

Analyze Gender Variable:

> unique(CovidDF$Gender)
[1] "All" "Female" "Male" ""

> length(which(CovidDF$Gender == "All"))
[1] 4881

> length(which(CovidDF$Gender == "Male"))
[1] 40

> length(which(CovidDF$Gender == "Female"))
[1] 131

> length(which(CovidDF$Gender == ""))
[1] 9

> nrow(CovidDF)
[1] 5061

There are 5061 observations (rows) in total: 4881 with “All”, 131 with “Female”, 40 with “Male”, and 9 with no gender specified.

The value “All” for the gender variable indicates that it can be of any gender, so for most observations the dataset does not pin gender down. This makes it hard to achieve precise results when modeling the data by gender. Here comes the nightmare of every data engineer: dealing with incomplete data.
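The per-value counts above can also be obtained in a single call. The sketch below rebuilds a gender vector from the four counts reported in this article (the vector itself is reconstructed for illustration, not read from the CSV), tabulates it with table(), and then marks the empty strings as NA so that the missing values are explicit.

```r
# Rebuild a gender vector from the counts reported above (illustrative only)
gender <- rep(c("All", "Female", "Male", ""),
              times = c(4881, 131, 40, 9))

# One call replaces the four length(which(...)) expressions
genderCounts <- table(gender)
print(genderCounts)

# Treat empty strings as missing so they are not mistaken for a category
gender[gender == ""] <- NA
print(table(gender, useNA = "ifany"))
```

Marking blanks as NA early is one possible design choice; it lets functions such as is.na() and the useNA argument of table() account for the incomplete rows.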

Analyze Study Results Variable:

> unique(CovidDF$Study.Results)
[1] "No Results Available" "Has Results"

> length(which(CovidDF$Study.Results == "No Results Available"))
[1] 5033

> length(which(CovidDF$Study.Results == "Has Results"))
[1] 28

The above result indicates that the Covid Trial dataset was captured while most of the studies were still in progress: 5033 of the 5061 observations are marked “No Results Available”.
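To put these counts in proportion, the short sketch below converts them to percentages. It uses only the two counts reported above; the variable names are illustrative.

```r
# Counts of the Study.Results values as reported in the article
resultCounts <- c("No Results Available" = 5033, "Has Results" = 28)

# Share of each value, in percent, rounded to two decimals
resultShares <- round(100 * resultCounts / sum(resultCounts), 2)
print(resultShares)
```

With the full data frame loaded, prop.table(table(CovidDF$Study.Results)) should give the same shares as fractions.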

Step 3: Analyze Age Variable

The Age variable consists of free-form values, with different formats used to capture the age of volunteers in different locations.

Below are sample age values captured in the trial dataset:

18 Years and older (Adult, Older Adult)
18 Years and older (Adult, Older Adult)
Child, Adult, Older Adult
18 Years to 48 Years (Adult)

It would have been much better if age had been captured as an integer or in a range format. Since the values come in different formats, we will use a convertAge function, built on the stringr package, to approximate each age with a single numeric value.

library(stringr)

# Approximate a free-form age description with a single number
convertAge <- function(age) {
  # Strip descriptive words and punctuation so that only the numbers remain
  newage <- str_replace_all(age, "Child", "")
  newage <- str_replace_all(newage, "Adult", "")
  newage <- str_replace_all(newage, "Older", "")
  newage <- str_replace_all(newage, "older", "")
  newage <- str_replace_all(newage, "and", "")
  newage <- str_replace_all(newage, "Years", "")
  newage <- str_replace_all(newage, "to", "")
  newage <- str_replace_all(newage, "up", "")
  newage <- str_replace_all(newage, "\\(", "")
  newage <- str_replace_all(newage, "\\)", "")
  newage <- str_replace_all(newage, ",", "")
  newage <- str_trim(newage, side = "both")
  # Ages given in months or days are approximated as 2 and 1
  newage <- ifelse(grepl("Month", newage, fixed = TRUE), "2", newage)
  newage <- ifelse(grepl("Days", newage, fixed = TRUE), "1", newage)
  # Keep the last two characters, i.e. the upper bound when a range is given
  newage <- substr(newage, nchar(newage) - 1, nchar(newage))
  # Anything that is not a number becomes NA
  suppressWarnings(as.numeric(newage))
}

# convertAge works on whole vectors, so the column can be converted in one call
CovidDF$NewAge <- convertAge(as.character(CovidDF$Age))
# Observations without a usable age are set to 0
CovidDF$NewAge[is.na(CovidDF$NewAge)] <- 0
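As an alternative to stripping words one by one, the numbers can be pulled out of each age description directly. The sketch below is not the approach used in this article; it is a minimal illustration in base R, and the helper name `extractAgeRange` and the column names MinAge and MaxAge are hypothetical. It assumes every number in the text is an age in years, so values given in months or days would need extra handling.

```r
# Hypothetical helper: extract every number from an age description and
# return the lowest and highest as the bounds of the age range.
extractAgeRange <- function(age) {
  numbers <- regmatches(age, gregexpr("[0-9]+", age))
  bounds <- lapply(numbers, function(n) {
    n <- as.numeric(n)
    # No number at all (e.g. "Child, Adult, Older Adult") gives NA bounds
    if (length(n) == 0) c(NA_real_, NA_real_) else c(min(n), max(n))
  })
  out <- do.call(rbind, bounds)
  colnames(out) <- c("MinAge", "MaxAge")
  as.data.frame(out)
}

# Sample values taken from the article
sampleAges <- c(
  "18 Years and older (Adult, Older Adult)",
  "Child, Adult, Older Adult",
  "18 Years to 48 Years (Adult)"
)
ageRanges <- extractAgeRange(sampleAges)
print(ageRanges)
```

Note that an open-ended value such as "18 Years and older" yields 18 for both bounds here, because the text contains only one number. Keeping NA instead of 0 for rows without a number is a deliberate choice in this sketch, so that missing ages do not show up as a real age group in later plots.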

Step 4: Visualize using plots in R

hist(CovidDF$NewAge, xlab = "Age group", main = "Age of Volunteers in Covid Vaccine Trials")

The above histogram shows the number of observations in each approximated age group.

A box plot of the same data shows the median and the first and third quartiles.
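The article does not list the box plot code, so the following is only a sketch of how such a plot could be drawn with base R. The `ages` vector holds made-up values for illustration; with the real data, CovidDF$NewAge would take its place.

```r
# Made-up ages for illustration only, not taken from the trial dataset
ages <- c(0, 18, 18, 18, 21, 48, 65, 80)

# Draw the box plot; the box spans the 1st to 3rd quartile,
# and the line inside the box marks the median
boxplot(ages, horizontal = TRUE, xlab = "Age",
        main = "Age of Volunteers (illustrative data)")

# The same summary as numbers
ageQuartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75))
print(ageQuartiles)
```

Printing the quartiles next to the plot makes it easier to read exact values that are hard to judge from the figure alone.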

Convert the Conditions variable to a factor, which stores a vector of integer codes together with a corresponding set of character labels. The integer codes of the conditions factor can then be plotted in a scatter plot against the new age variable, which is also numeric.

ConditionFactor <- as.factor(CovidDF$Conditions)
CovidDF$NewConditions <- as.numeric(ConditionFactor)
plot(x = CovidDF$NewAge, y = CovidDF$NewConditions, xlab = "Age")

The above scatter plot shows that more conditions are registered for the age groups 18 and 0. The age group 0 indicates that the trial data did not have a usable age value.
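To make the factor conversion used for the scatter plot concrete, here is a small sketch with made-up condition names (the `conditions` vector is an assumption for illustration). It shows the labels a factor stores and the integer codes that as.numeric() returns.

```r
# Made-up condition names for illustration
conditions <- c("Covid19", "Pneumonia", "Covid19", "SARS-CoV-2 Infection")

conditionFactor <- as.factor(conditions)

# The distinct labels, stored once each in sorted order
print(levels(conditionFactor))

# One integer code per observation, pointing at its label
conditionCodes <- as.numeric(conditionFactor)
print(conditionCodes)
```

The codes simply follow the sorted order of the labels, so the vertical position of a point in the scatter plot identifies a condition but carries no numeric meaning of its own.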

Data is the oil for all new-age technologies, but it is just as important to engineer that data properly; without this, highly accurate machine learning models are close to impossible.

