4. The tidier the better: Basics of coding with the Tidyverse

Mason A. Wirtz https://masonwirtz.github.io

Exercise 1

Load in the data Vampires (preferably as a tibble using the read_csv() function).
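
A minimal sketch of the loading step, assuming the data are stored in a CSV file. The file name and location of the real data are not given here, so the sketch writes three invented rows to a temporary file (a hypothetical stand-in) and reads that back in; adjust the path to wherever your copy of the data lives.

```r
library(readr)   # provides read_csv(); also attached by library(tidyverse)

# ASSUMPTION: a hypothetical stand-in file with three invented rows,
# written to a temporary location so that the sketch runs on its own
path = tempfile(fileext = ".csv")
writeLines(c("idVampire,gender,ageOfVampire",
             "1,Female,89",
             "2,Male,192",
             "3,Male,67"),
           path)

# read_csv() returns a tibble rather than a base R data.frame
Vampires_demo = read_csv(path, show_col_types = FALSE)

# Check what kind of object we got back
class(Vampires_demo)
```

With the real data, the same call would simply point to your own file instead of the temporary one.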

We are only interested in the columns idVampire, gender, ageOfVampire and numberOfChildren. Create a new data frame called Vamps including only these columns.

SOLUTION

Since our goal here is simply to subset the data frame and keep only certain columns, we can use the function select() from the dplyr package, which is included in the tidyverse. This allows us to pick and choose which columns we want to keep. We can then assign the subsetted data to a new object, which the directions ask us to call Vamps. Let’s go ahead and do this:

Vamps =                                       # define object
  Vampires %>%                                # data
  select(idVampire, gender,                   # select variables (i.e. subset data)
         ageOfVampire, numberOfChildren)

Since three of the columns we want (idVampire, gender and ageOfVampire) sit next to each other and happen to be the first three columns of the data frame, we can also select them by position with 1:3 (i.e. columns 1 through 3), plus the variable numberOfChildren, like so:

Vamps =                                 # define object
  Vampires %>%                          # data
  select(1:3, numberOfChildren)         # select variables (i.e. subset data)
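
Selecting by position stops working as intended as soon as the column order changes. A small sketch of a name-based alternative, using the range operator on column names: the toy tibble below is an invented stand-in for Vampires, so the values (and the exact column order) are assumptions for illustration only.

```r
library(dplyr)

# ASSUMPTION: a toy stand-in for Vampires with invented values
toy = tibble(idVampire = 1:3,
             gender = c("Female", "Male", "Male"),
             ageOfVampire = c(89, 192, 67),
             hasFangs = c("No", "Yes", "Yes"),
             numberOfChildren = c(3, 1, 1))

# Select by position ...
by_position = toy %>% select(1:3, numberOfChildren)

# ... or by a range of column NAMES, which does not depend
# on the columns sitting in positions 1 to 3
by_name = toy %>% select(idVampire:ageOfVampire, numberOfChildren)

# Both approaches keep the same four columns
names(by_name)
```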

Exercise 2

Calculate the mean number of cities the vampires have visited (variable: visitedCities), as well as the corresponding standard deviation, using the summarize() function.

SOLUTION

As always, our first step is to reach into the data frame we are interested in, namely Vampires, and then simply use the summarize function, like so:

Vampires %>%                                       # data
  summarize(mean = mean(visitedCities),            # calculate mean
            sd = sd(visitedCities))                # calculate SD
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  32.5  39.7

Exercise 3

Calculate the mean age of male and female vampires.

HINT You will need the pipe ( %>% ) to group/stack functions, the group_by function and the summarize function (in that order).

SOLUTION

Alright, so our first step, as always, is to specify which data frame we are interested in. Then we want to group our data frame into two groups, namely male and female, which can be done by using the group_by() function on the variable gender. After that, we need the summarize() function to calculate the mean age (variable: ageOfVampire), like so:

Vampires %>% 
  
  # Group the data frame by the gender variable, 
  # which is binary, i.e. only male or female.
  group_by(gender) %>% 
  
  # Summarize the grouped data 
  summarize(mean = mean(ageOfVampire))
# A tibble: 2 × 2
  gender  mean
  <chr>  <dbl>
1 Female  82.1
2 Male    87.6
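
The same group-then-summarize pattern can be checked by hand on a tiny data set. This is only a sketch: the four rows are invented, and the extra n() column (the number of rows per group) is an addition that is not part of the exercise.

```r
library(dplyr)

# ASSUMPTION: four invented rows, small enough to verify by hand
toy = tibble(gender = c("Female", "Male", "Female", "Male"),
             ageOfVampire = c(80, 90, 100, 70))

toy_summary = toy %>%
  group_by(gender) %>%                   # one group per gender
  summarize(mean = mean(ageOfVampire),   # mean age within each group
            n = n())                     # number of rows in each group

toy_summary
```

Here the Female group has a mean of (80 + 100) / 2 = 90 and the Male group a mean of (90 + 70) / 2 = 80, each based on two rows.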

Exercise 4

Alright, now we can get to the fun stuff: Manipulating data.

To start off, let’s clean up our environment a bit. We won’t really be needing the Vamps data frame we made earlier, so remove it from your environment using the rm(Vamps) function.

Let’s say we are interested in whether alive vampires have changed more people into vampires depending on whether they have fangs or not. The variables here that will interest us are deadOrAlive, hasFangs and numberChangedToVamp. There is no need to subset the respective variables, but if you want to for the sake of practice, go ahead.

What is the mean number of people the alive vampires WITH and WITHOUT fangs have turned into vampires?

HINT You will need the pipe ( %>% ) to group/stack functions, the filter function, the group_by function and the summarize function (in that order).

SOLUTION

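This is pretty tricky, as there are three steps we need to go through to get to the answer: filter the data so that only the alive vampires remain, group the remaining vampires by whether they have fangs, and then calculate the mean of numberChangedToVamp within each group. Let’s go through this together.

We have an interesting outcome: judging by the output below, vampires with fangs have changed more people into vampires on average (8.73) than vampires without fangs (7.73). Maybe vampires without fangs have an inferiority complex (Minderwertigkeitskomplex)…?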

Vampires %>% 
  
  # Filter the data frame so that it 
  # ONLY includes vampires that are ALIVE
  filter(deadOrAlive == "Alive") %>% 

  # Group the data frame by the hasFangs variable, 
  # which is binary, i.e. only yes or no. 
  group_by(hasFangs) %>% 
  
  # Summarize the grouped data 
  summarize(mean = mean(numberChangedToVamp))
# A tibble: 2 × 2
  hasFangs  mean
  <chr>    <dbl>
1 No        7.73
2 Yes       8.73

Exercise 5 (intermediate)

Alright, we are going to go through some REALLY useful functions that I tend to use relatively often. They are super practical when you need to pick out certain columns or rows of your data.

We are going to focus on the following:

  1. starts_with()

  2. ends_with()

  3. where()

  4. across()

  5. contains()

  6. grepl()

Some of these are fairly self-explanatory (e.g., the first two), and a lot of them are used really often in combination with the select function.

Run the following code chunk: What happens? When do you think these functions might be useful?

Vampires %>% 
  select(starts_with("g"))
# A tibble: 100 × 1
   gender
   <chr> 
 1 Female
 2 Male  
 3 Male  
 4 Male  
 5 Female
 6 Male  
 7 Female
 8 Female
 9 Male  
10 Male  
# … with 90 more rows

Vampires %>% 
  select(ends_with("Vampire"))
# A tibble: 100 × 2
   idVampire ageOfVampire
       <dbl>        <dbl>
 1         1           89
 2         2          192
 3         3           67
 4         4           23
 5         5           40
 6         6          105
 7         7           58
 8         8           67
 9         9           88
10        10          122
# … with 90 more rows

I tend to use ends_with() quite a bit when I work with z-scored variables: I name all of them variable_z, so I can simply run select(ends_with("_z")), which is really handy if I just want a data frame with my z-scored predictor variables.
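
A rough sketch of that workflow, assuming the variable_z naming scheme described above. The toy data and its column names are invented, and the .names argument of across() is used here to attach the _z suffix automatically.

```r
library(dplyr)

# ASSUMPTION: invented toy data; only the "_z" suffix comes from the text
toy = tibble(id = c("a", "b", "c"),
             age = c(20, 30, 40),
             income = c(100, 200, 300))

toy_z = toy %>%
  # z-score every numeric column and store the result in a NEW
  # column called <original name>_z, leaving the originals untouched
  mutate(across(where(is.numeric),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "{.col}_z")) %>%
  # keep only the z-scored columns
  select(ends_with("_z"))

names(toy_z)
```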

The where() function is really handy when you want to select columns based on the class of a vector in a data frame. For example, if we want to select all columns that are numeric in class, we can run the following code:

Vampires %>% 
  select(where(is.numeric))
# A tibble: 100 × 7
   idVampire ageOfVampire wellbeing  income visitedC…¹ numbe…² numbe…³
       <dbl>        <dbl>     <dbl>   <dbl>      <dbl>   <dbl>   <dbl>
 1         1           89      61.6 131670.         10       3      23
 2         2          192      71.4 153860.          4       1       7
 3         3           67      64.2 154087.          3       1       9
 4         4           23      24.7 113842.         97       2      11
 5         5           40      47.2 144047.         33       3       3
 6         6          105      23.9 138654.         41       1       5
 7         7           58      38.8 183873.         43       5      13
 8         8           67      73.1 136873.        119       1      21
 9         9           88      51.4 109140.         16       4       4
10        10          122      70.6 149289.         11       1       8
# … with 90 more rows, and abbreviated variable names ¹​visitedCities,
#   ²​numberOfChildren, ³​numberChangedToVamp

The across() function basically tells R to do something across a certain set of columns. So, if I want R to summarize all of my numeric variables really quickly, I could run something like this:

Vampires %>% 
  summarise(across(where(is.numeric), mean))
# A tibble: 1 × 7
  idVampire ageOfVampire wellbeing  income visitedCi…¹ numbe…² numbe…³
      <dbl>        <dbl>     <dbl>   <dbl>       <dbl>   <dbl>   <dbl>
1      50.5         84.5      51.0 142011.        32.5    3.07     8.2
# … with abbreviated variable names ¹​visitedCities,
#   ²​numberOfChildren, ³​numberChangedToVamp
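
If a function inside across() needs additional arguments, one option is to pass it as a small anonymous function with ~ and .x. The sketch below uses this to ignore missing values; the toy data are invented, and nothing here implies that the Vampires data actually contain missing values.

```r
library(dplyr)

# ASSUMPTION: invented toy data with one missing value
toy = tibble(wellbeing = c(60, NA, 40),
             income = c(100, 200, 300),
             gender = c("Female", "Male", "Male"))

toy_means = toy %>%
  # .x stands for "the current column"; na.rm = TRUE drops the NA
  # before the mean is calculated
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

toy_means
```

Without na.rm = TRUE, the mean of wellbeing would come back as NA.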

And, if I want multiple statistics, I can do this easily, too:

Vampires %>% 
  summarise(across(where(is.numeric), 
                   list(mean = mean, sd = sd, min = min, max = max)))
# A tibble: 1 × 28
  idVampire_…¹ idVam…² idVam…³ idVam…⁴ ageOf…⁵ ageOf…⁶ ageOf…⁷ ageOf…⁸
         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1         50.5    29.0       1     100    84.5    32.8      14     198
# … with 20 more variables: wellbeing_mean <dbl>, wellbeing_sd <dbl>,
#   wellbeing_min <dbl>, wellbeing_max <dbl>, income_mean <dbl>,
#   income_sd <dbl>, income_min <dbl>, income_max <dbl>,
#   visitedCities_mean <dbl>, visitedCities_sd <dbl>,
#   visitedCities_min <dbl>, visitedCities_max <dbl>,
#   numberOfChildren_mean <dbl>, numberOfChildren_sd <dbl>,
#   numberOfChildren_min <dbl>, numberOfChildren_max <dbl>, …

Exercise 5.1

The contains() function works much like starts_with() and ends_with(), really, but it selects any COLUMNS whose names contain a certain string, no matter where in the name it appears. Use this function to select any COLUMNS that contain the word Vampire.

SOLUTION

This works the same way as the starts_with() etc. functions, so we just specify "Vampire" in our contains() function.

Vampires %>% 
  select(contains("Vampire"))
# A tibble: 100 × 2
   idVampire ageOfVampire
       <dbl>        <dbl>
 1         1           89
 2         2          192
 3         3           67
 4         4           23
 5         5           40
 6         6          105
 7         7           58
 8         8           67
 9         9           88
10        10          122
# … with 90 more rows
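
One detail worth knowing: by default, contains() (like starts_with() and ends_with()) ignores upper and lower case. A small sketch with an invented toy tibble, using three column names that also appear in Vampires:

```r
library(dplyr)

# ASSUMPTION: invented values; only the column names mirror Vampires
toy = tibble(idVampire = 1:2,
             ageOfVampire = c(89, 192),
             vampType = c("hybrid", "psychic"))

# Default: case is ignored, so "vamp" also matches "Vamp"
any_case = toy %>% select(contains("vamp")) %>% names()

# Case-sensitive matching: only names with a lowercase "vamp" remain
exact_case = toy %>% select(contains("vamp", ignore.case = FALSE)) %>% names()

any_case
exact_case
```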

Exercise 5.2

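The grepl() function is similar in that it searches for a certain string, but it searches the VALUES of a column rather than the column names, and returns TRUE or FALSE for each value. For this reason, we also have to specify WHICH COLUMN it should search.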
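
To see what grepl() returns on its own, here is a minimal base R sketch. The vector below is a made-up stand-in for a column of place names; its values are assumptions for illustration only.

```r
# ASSUMPTION: an invented character vector standing in for a column
bornIn = c("North America", "Europa", "South America", "Australia")

# grepl(pattern, x) checks every element of x for the pattern
# and returns one TRUE/FALSE per element
is_american = grepl("America", bornIn)
is_american

# The logical vector can then be used to pick out the matching elements
bornIn[is_american]
```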

Which function do we use to subset ROWS (instead of columns)?

How would we subset our data set so that it only includes vampires who were born in any of the Americas?

HINT: You will have to use grepl("America", bornIn), but this has to be wrapped in another function (but which one? Hmmm…)

SOLUTION

This one is pretty tricky! We first need to use the filter() function since we are filtering ROWS. Then, we use the grepl() function to define a certain string of what we want the code to search for, and then finally define in which COLUMN the code should search.

Vampires %>% 
  filter(grepl("America", bornIn))
# A tibble: 46 × 14
   idVampire gender ageOfVamp…¹ deadO…² hasFa…³ bornIn vampT…⁴ wellb…⁵
       <dbl> <chr>        <dbl> <chr>   <chr>   <chr>  <chr>     <dbl>
 1         1 Female          89 Dead    No      North… hybrid     61.6
 2         2 Male           192 Dead    Yes     North… hybrid     71.4
 3         3 Male            67 Dead    Yes     South… sangui…    64.2
 4         8 Female          67 Alive   Yes     North… hybrid     73.1
 5        11 Male            61 Alive   Yes     South… psychic    51.5
 6        13 Female          87 Alive   No      North… hybrid     19.5
 7        17 Female          89 Alive   Yes     South… psychic    97.0
 8        20 Male            99 Dead    Yes     South… sangui…    53.8
 9        23 Male            99 Dead    No      North… hybrid     23.6
10        24 Male            61 Alive   No      North… psychic    68.0
# … with 36 more rows, 6 more variables: maritalStatus <chr>,
#   employment <chr>, income <dbl>, visitedCities <dbl>,
#   numberOfChildren <dbl>, numberChangedToVamp <dbl>, and
#   abbreviated variable names ¹​ageOfVampire, ²​deadOrAlive,
#   ³​hasFangs, ⁴​vampType, ⁵​wellbeing

Exercise 5.3

Let’s say that we know we do not want any character vectors in our data frame, but we rather want these to be factor vectors. Using the mutate(), across() and where() functions, change ALL character vectors to factor vectors. This one is a bit tricky, but, once you know this trick, you will use it SO OFTEN!!!

SOLUTION

This one is pretty tricky! We first need the mutate() function because we want to change the variables. Then, we need across() because we are mutating across several variables, and we need to tell the function across which variables we are mutating, so we use the where() function. Since we want to mutate all CHARACTER vectors, we need to search for where these vectors are, which we can do with the is.character function. Then, we tell the function what to do once it has located the character vectors, which is then to change them to factors, which we do by using the as.factor() function.

Vampires %>% 
  mutate(across(where(is.character), as.factor))
# A tibble: 100 × 14
   idVampire gender ageOfVamp…¹ deadO…² hasFa…³ bornIn vampT…⁴ wellb…⁵
       <dbl> <fct>        <dbl> <fct>   <fct>   <fct>  <fct>     <dbl>
 1         1 Female          89 Dead    No      North… hybrid     61.6
 2         2 Male           192 Dead    Yes     North… hybrid     71.4
 3         3 Male            67 Dead    Yes     South… sangui…    64.2
 4         4 Male            23 Dead    No      Europa psychic    24.7
 5         5 Female          40 Alive   Yes     Austr… hybrid     47.2
 6         6 Male           105 Alive   No      Antar… psychic    23.9
 7         7 Female          58 Alive   Yes     Austr… hybrid     38.8
 8         8 Female          67 Alive   Yes     North… hybrid     73.1
 9         9 Male            88 Alive   No      Antar… hybrid     51.4
10        10 Male           122 Dead    Yes     Antar… psychic    70.6
# … with 90 more rows, 6 more variables: maritalStatus <fct>,
#   employment <fct>, income <dbl>, visitedCities <dbl>,
#   numberOfChildren <dbl>, numberChangedToVamp <dbl>, and
#   abbreviated variable names ¹​ageOfVampire, ²​deadOrAlive,
#   ³​hasFangs, ⁴​vampType, ⁵​wellbeing
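
Note that the pipeline above only prints the converted data; to keep the factors, the result has to be assigned to an object. A small sketch on invented toy data (the object names are hypothetical), with a quick check that the conversion did what we wanted:

```r
library(dplyr)

# ASSUMPTION: invented toy data with two character columns
toy = tibble(gender = c("Female", "Male", "Male"),
             hasFangs = c("No", "Yes", "Yes"),
             ageOfVampire = c(89, 192, 67))

# Assign the result so that the converted columns are kept
toy_fct = toy %>% 
  mutate(across(where(is.character), as.factor))

# Check the class of every column: the character columns are
# now factors, the numeric column is left untouched
sapply(toy_fct, class)

# Factor levels are sorted alphabetically by default
levels(toy_fct$gender)
```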

Exercise 6 (for experts)

Alright, you’ve done well until now? Great! Let’s take on a harder task. Before you start, look up the arguments for the function ungroup().

Let’s say we had participants complete several versions of a C Test (a language assessment test in which participants have to complete words; this has been shown to strongly correlate with participants’ general language proficiency, cf. Raatz & Klein-Braley [2002]). Since no two tests are typically exactly the same, even after pilot testing (that is, without conducting large-scale psychometric validity screenings), we tend to correct for the possible differences in the tasks statistically. To help ensure better comparability between the versions of the C Test, we can shift each individual score by the deviation of its version’s mean score from the overall mean score: scores from a version with a below-average mean are adjusted upwards, and scores from a version with an above-average mean are adjusted downwards.

Let me generate some data for you that replicates this situation:

# RUN THE FOLLOWING CODE CHUNK
# (the scores are drawn at random, so your numbers will differ from those shown below)

CTest_df =                                                          # define object
  tibble(Version = gl(n = 3, k = 20),                               # 3 versions, 20 scores each
         CTest = c(round(abs(rnorm(n = 20, mean = 17, sd = 4))),    # generate CTest 1
                   round(abs(rnorm(n = 20, mean = 19, sd = 4))),    # generate CTest 2
                   round(abs(rnorm(n = 20, mean = 20, sd = 4))))    # generate CTest 3
  ) %>% 
  mutate(CTest = ifelse(CTest > 30, 30, CTest),                     # cap scores at 30
         CTest = ifelse(CTest < 0, 0, CTest))                       # floor scores at 0

# Let's have a look
CTest_df
# A tibble: 60 × 2
   Version CTest
   <fct>   <dbl>
 1 1          20
 2 1          20
 3 1          13
 4 1          17
 5 1          22
 6 1          19
 7 1          19
 8 1          15
 9 1          14
10 1          20
# … with 50 more rows

CTest_df %>% 
  group_by(Version) %>% 
  summarize_at(.vars = "CTest", 
               .funs = c("max", "min", "mean", "sd"))
# A tibble: 3 × 5
  Version   max   min  mean    sd
  <fct>   <dbl> <dbl> <dbl> <dbl>
1 1          22     9  16.8  3.50
2 2          26    10  17.6  4.15
3 3          26    12  19.2  3.70

So, now we have a data frame with the version of the C Test (Version) and each participant’s score on the C Tests (CTest).

Adjust the variable CTest by adding to each individual score the difference between the overall mean score of the C Tests and the mean score of the respective C Test version (if this difference is negative, this amounts to subtracting).

HINT You will need to (or can, there are other ways to solve this) use the following functions in the given order: group_by(), mutate(), ungroup() and mutate()

SOLUTION

See the commented code below for the solution:

CTest_df %>% 
  
  # First we need to group by version, 
  # since we need the mean of each 
  # VERSION of the CTest
  group_by(Version) %>% 
  
  # Now we create a NEW variable called 
  # CTestMean, which gives us the mean of 
  # each version
  mutate(CTestMean = mean(CTest)) %>% 
  
  # We haven't seen this function yet: 
  # ungroup() is the counterpart of 
  # group_by(). It removes the grouping, 
  # so that all following calculations 
  # are run on the whole data frame again, 
  # which now also contains the new variable 
  # (i.e. CTestMean) we just created
  ungroup() %>% 
  
  # Now we need to create a few different variables: 
  # We need the OVERALL mean, i.e. the mean of ALL the 
  # C Test scores; We need the difference between the 
  # OVERALL mean score and the mean scores of each 
  # version of the C test. We then change the original 
  # CTest variable by adding the difference (positive or 
  # negative) to each individual CTest score
  mutate(CTestMean_overall = mean(CTestMean), 
         CTestMean_difference = CTestMean_overall - CTestMean, 
         CTest = CTest + CTestMean_difference) %>% 
  
  # Now we just get rid of the extra noise variables 
  # that we needed to calculate the adjusted C Test scores
  select(-c(CTestMean, CTestMean_overall, CTestMean_difference))
# A tibble: 60 × 2
   Version CTest
   <fct>   <dbl>
 1 1        21.1
 2 1        21.1
 3 1        14.1
 4 1        18.1
 5 1        23.1
 6 1        20.1
 7 1        20.1
 8 1        16.1
 9 1        15.1
10 1        21.1
# … with 50 more rows
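
A quick way to check an adjustment like this: afterwards, every version should have the same mean, namely the overall mean. The sketch below runs the same steps on four invented scores that can be verified by hand; the column name CTest_adjusted is a hypothetical addition so that the original scores stay visible.

```r
library(dplyr)

# ASSUMPTION: four invented scores; version means are 16 and 20,
# the overall mean is 18
toy = tibble(Version = factor(c(1, 1, 2, 2)),
             CTest = c(14, 18, 19, 21))

toy_adjusted = toy %>% 
  group_by(Version) %>% 
  mutate(CTestMean = mean(CTest)) %>%        # mean per version
  ungroup() %>% 
  # After ungroup(), mean(CTest) is the overall mean of all scores
  mutate(CTest_adjusted = CTest + (mean(CTest) - CTestMean))

# Version 1 is shifted up by 2, version 2 is shifted down by 2,
# so both versions now have a mean of 18
check = toy_adjusted %>% 
  group_by(Version) %>% 
  summarize(mean_adjusted = mean(CTest_adjusted))

check
```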