Load in the data Vampires (preferably as a tibble using the read_csv() function). We are only interested in the columns idVampire, gender, ageOfVampire and numberOfChildren. Create a new data frame called Vamps including only these columns.
SOLUTION
Since our goal here is simply to subset the data frame and keep only certain columns, we can use the select() function from the dplyr package, which is included in the tidyverse. This allows us to pick and choose which columns we want to keep. We can then assign the result to a new object, which the directions ask us to call Vamps. Let’s go ahead and do this:
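Here is a minimal sketch of that step. The small Vampires tibble is an invented stand-in for the real data (only the column names come from the exercise, the values are assumptions), and the file name in the commented read_csv() line is hypothetical:

```r
library(dplyr)
library(tibble)

# In the exercise the data would be read from disk, e.g.
# Vampires <- readr::read_csv("Vampires.csv")  # hypothetical file name

# Invented stand-in for the Vampires data so that the block runs on its own
Vampires <- tibble(
  idVampire        = 1:4,
  gender           = c("Female", "Male", "Male", "Female"),
  ageOfVampire     = c(89, 192, 67, 23),
  deadOrAlive      = c("Dead", "Dead", "Alive", "Alive"),
  numberOfChildren = c(3, 1, 1, 2)
)

# Keep only the four columns of interest and store the result as Vamps
Vamps <- Vampires %>%
  select(idVampire, gender, ageOfVampire, numberOfChildren)

names(Vamps)
```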
Since three of the columns we want (idVampire, gender and ageOfVampire) sit next to each other as the first three columns of the data frame, we can also select them by position with 1:3 (i.e. columns 1 through 3), plus the variable numberOfChildren, like so:
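A minimal sketch of the positional version, again on an invented stand-in for Vampires (the values are assumptions; what matters is that the three columns come first):

```r
library(dplyr)
library(tibble)

# Invented stand-in; idVampire, gender and ageOfVampire are columns 1 to 3
Vampires <- tibble(
  idVampire        = 1:4,
  gender           = c("Female", "Male", "Male", "Female"),
  ageOfVampire     = c(89, 192, 67, 23),
  deadOrAlive      = c("Dead", "Dead", "Alive", "Alive"),
  numberOfChildren = c(3, 1, 1, 2)
)

# Columns 1 through 3 by position, numberOfChildren by name
Vamps <- Vampires %>%
  select(1:3, numberOfChildren)

names(Vamps)
```

Selecting by position is shorter to type, but it silently picks different columns if the column order ever changes, so selecting by name is usually the safer habit.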
Calculate the mean number of cities the vampires have visited (variable: visitedCities), as well as the standard deviation thereof, using the summarize() function.
SOLUTION
As always, our first step is to reach into the data frame we are interested in, namely Vampires, and then simply use the summarize() function, like so:
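A minimal sketch, assuming an invented stand-in for Vampires; the output column names mean_cities and sd_cities are my own choice, not prescribed by the exercise:

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  idVampire     = 1:5,
  visitedCities = c(10, 4, 3, 97, 33)
)

# One row of output: the mean and the standard deviation of visitedCities
city_summary <- Vampires %>%
  summarize(mean_cities = mean(visitedCities),
            sd_cities   = sd(visitedCities))

city_summary
```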
Calculate the mean age of male and female vampires.
HINT You will need the pipe ( %>% ) to chain functions, the group_by() function and the summarize() function (in that order).
SOLUTION
Alright, so our first step, as always, is to specify which data frame we are interested in. Then we want to group our data frame into two groups, namely male and female, which can be done by using the group_by() function on the variable gender. After that, we need the summarize() function to calculate the mean age (variable: ageOfVampire), like so:
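A minimal sketch of the grouped summary, on an invented stand-in for Vampires (the ages are assumptions; meanAge is a name I picked for the result column):

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  gender       = c("Female", "Male", "Male", "Female"),
  ageOfVampire = c(89, 192, 67, 23)
)

# Group by gender, then compute one mean age per group
age_by_gender <- Vampires %>%
  group_by(gender) %>%
  summarize(meanAge = mean(ageOfVampire))

age_by_gender
```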
Alright, now we can get to the fun stuff: Manipulating data.
To start off, let’s clean up our environment a bit. We won’t really be needing the Vamps data frame we made earlier, so remove it from your environment by running rm(Vamps).
Let’s say we are interested in whether alive vampires have changed more people into vampires depending on whether they have fangs or not. The variables here that will interest us are deadOrAlive, hasFangs and numberChangedToVamp. There is no need to subset the respective variables, but if you want to for the sake of practice, go ahead.
What is the mean number of people the alive vampires WITH and WITHOUT fangs have turned into vampires?
HINT You will need the pipe ( %>% ) to chain functions, the filter() function, the group_by() function and the summarize() function (in that order).
SOLUTION
This is pretty tricky: there are three steps we need to go through to get to the answer. Let’s go through them together.
To start with, we know that we are only interested in ALIVE vampires, so this means that we need to subset the data frame to home in on only the vampires that are alive. Our variable deadOrAlive tells us whether the vampires are dead or alive, so this is the variable we first need to subset on. We do this by using the filter() function (remember, filter is for ROWS and select is for COLUMNS). And we specify we only want ALIVE vampires by using the == operator (REMEMBER, the operator = is for assigning values to an object, == means “is equal to”).
Now, since we are interested in whether vampires with and without fangs have changed more people to vampires, we need to group our data frame into two groups: vampires WITH fangs and vampires WITHOUT fangs. We can do this by using the group_by() function.
Lastly, we then need the summarize() function and specify it to give us the mean number of people the vampires WITH and WITHOUT fangs have changed into vampires.
We have an interesting outcome: as the output below shows, vampires with fangs have changed more people into vampires on average (8.73) than vampires without fangs (7.73)…interesting. Maybe vampires without fangs have an inferiority complex (Minderwertigkeitskomplex)…?
Vampires %>%
  # Filter the data frame so that it
  # ONLY includes vampires that are ALIVE
  filter(deadOrAlive == "Alive") %>%
  # Group the data frame by the hasFangs variable,
  # which is binary, i.e. only yes or no.
  group_by(hasFangs) %>%
  # Summarize the grouped data
  summarize(mean = mean(numberChangedToVamp))
# A tibble: 2 × 2
  hasFangs  mean
  <chr>    <dbl>
1 No        7.73
2 Yes       8.73
Alright, we are going to go through some REALLY useful functions that I tend to use relatively often. They are super practical when you need to pick out specific columns or rows of your data.
We are going to focus on the following:
starts_with()
ends_with()
where()
across()
contains()
grepl()
Some of these are fairly self-explanatory (e.g., the first two), and a lot of them are used really often in combination with the select function.
Run the following code chunk: What happens? When do you think these functions might be useful?
Vampires %>%
  select(starts_with("g"))
# A tibble: 100 × 1
gender
<chr>
1 Female
2 Male
3 Male
4 Male
5 Female
6 Male
7 Female
8 Female
9 Male
10 Male
# … with 90 more rows
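The ends_with() function works the same way, but matches the END of the column name. Selecting with ends_with("Vampire"), for example, gives:
# A tibble: 100 × 2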
idVampire ageOfVampire
<dbl> <dbl>
1 1 89
2 2 192
3 3 67
4 4 23
5 5 40
6 6 105
7 7 58
8 8 67
9 9 88
10 10 122
# … with 90 more rows
I tend to use this function quite a bit when I work with z-scored variables, because I name all of them variable_z, so I can easily run select(ends_with("_z")), which is of course really handy if I just want a data frame with my z-scored predictor variables.
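To make the z-score use case concrete, here is a minimal sketch. The data are an invented stand-in, and z-scoring via as.numeric(scale(x)) is just one common way to do it, not necessarily how the variables in your own project were created:

```r
library(dplyr)
library(tibble)

# Invented stand-in with two predictors and their z-scored versions
Vampires <- tibble(
  idVampire    = 1:4,
  ageOfVampire = c(89, 192, 67, 23),
  wellbeing    = c(61.6, 71.4, 64.2, 24.7)
) %>%
  mutate(
    # scale() returns a one-column matrix, so flatten it to a plain vector
    ageOfVampire_z = as.numeric(scale(ageOfVampire)),
    wellbeing_z    = as.numeric(scale(wellbeing))
  )

# Keep only the columns whose names end in "_z"
z_only <- Vampires %>%
  select(ends_with("_z"))

names(z_only)
```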
The where() function is really handy when you want to select columns based on the class of the vectors in a data frame. For example, if we want to select all columns that are numeric in class, we can run the following function:
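Here is a minimal sketch of that call. The tiny Vampires tibble is an invented stand-in for the real data (the values are assumptions), so the block runs on its own:

```r
library(dplyr)
library(tibble)

# Invented stand-in with a mix of numeric and character columns
Vampires <- tibble(
  idVampire    = 1:4,
  gender       = c("Female", "Male", "Male", "Female"),
  ageOfVampire = c(89, 192, 67, 23),
  bornIn       = c("North America", "Europa", "South America", "Australia")
)

# where() takes a function that returns TRUE/FALSE for each column;
# note that is.numeric is passed WITHOUT parentheses
numeric_only <- Vampires %>%
  select(where(is.numeric))

names(numeric_only)
```

On the full Vampires data, the same select(where(is.numeric)) call returns the seven numeric columns: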
# A tibble: 100 × 7
idVampire ageOfVampire wellbeing income visitedC…¹ numbe…² numbe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 89 61.6 131670. 10 3 23
2 2 192 71.4 153860. 4 1 7
3 3 67 64.2 154087. 3 1 9
4 4 23 24.7 113842. 97 2 11
5 5 40 47.2 144047. 33 3 3
6 6 105 23.9 138654. 41 1 5
7 7 58 38.8 183873. 43 5 13
8 8 67 73.1 136873. 119 1 21
9 9 88 51.4 109140. 16 4 4
10 10 122 70.6 149289. 11 1 8
# … with 90 more rows, and abbreviated variable names ¹visitedCities,
# ²numberOfChildren, ³numberChangedToVamp
The across() function basically tells R to do something across a certain number of columns. So, if I want R to summarize all of my numeric variables really quickly, I could run something like this:
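A minimal sketch of what such a call can look like, on an invented stand-in for Vampires (the values are assumptions):

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  idVampire    = 1:4,
  gender       = c("Female", "Male", "Male", "Female"),
  ageOfVampire = c(89, 192, 67, 23)
)

# across() applies mean() to every column picked out by where(is.numeric)
numeric_means <- Vampires %>%
  summarize(across(where(is.numeric), mean))

numeric_means
```

On the full Vampires data, the same call gives one mean per numeric column: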
# A tibble: 1 × 7
idVampire ageOfVampire wellbeing income visitedCi…¹ numbe…² numbe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 50.5 84.5 51.0 142011. 32.5 3.07 8.2
# … with abbreviated variable names ¹visitedCities,
# ²numberOfChildren, ³numberChangedToVamp
And, if I want multiple statistics, I can do this easily, too:
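A minimal sketch, assuming the statistics are passed to across() as a named list; the names in the list become the suffixes of the new columns (e.g. ageOfVampire_mean). The data are again an invented stand-in:

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  gender       = c("Female", "Male", "Male", "Female"),
  ageOfVampire = c(89, 192, 67, 23),
  income       = c(131670, 153860, 154087, 113842)
)

# A named list of functions gives one output column per
# (input column, function) pair, named <column>_<function name>
numeric_stats <- Vampires %>%
  summarize(across(where(is.numeric),
                   list(mean = mean, sd = sd, min = min, max = max)))

names(numeric_stats)
```

On the full Vampires data, seven numeric columns times four statistics gives 28 columns: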
# A tibble: 1 × 28
idVampire_…¹ idVam…² idVam…³ idVam…⁴ ageOf…⁵ ageOf…⁶ ageOf…⁷ ageOf…⁸
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 50.5 29.0 1 100 84.5 32.8 14 198
# … with 20 more variables: wellbeing_mean <dbl>, wellbeing_sd <dbl>,
# wellbeing_min <dbl>, wellbeing_max <dbl>, income_mean <dbl>,
# income_sd <dbl>, income_min <dbl>, income_max <dbl>,
# visitedCities_mean <dbl>, visitedCities_sd <dbl>,
# visitedCities_min <dbl>, visitedCities_max <dbl>,
# numberOfChildren_mean <dbl>, numberOfChildren_sd <dbl>,
# numberOfChildren_min <dbl>, numberOfChildren_max <dbl>, …
The contains() function does the same as starts_with() and ends_with(), really, but it selects any COLUMNS whose names contain a certain string anywhere. Use this function to select any COLUMNS that contain the word Vampire.
SOLUTION
This works the same way as starts_with() and ends_with(), so we just pass "Vampire" to the contains() function.
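A minimal sketch on an invented stand-in for Vampires (the values are assumptions). Note that numberChangedToVamp is not selected, because its name only contains "Vamp", not the full string "Vampire":

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  idVampire           = 1:4,
  gender              = c("Female", "Male", "Male", "Female"),
  ageOfVampire        = c(89, 192, 67, 23),
  numberChangedToVamp = c(23, 7, 9, 11)
)

# Keep every column whose name contains the string "Vampire"
vampire_cols <- Vampires %>%
  select(contains("Vampire"))

names(vampire_cols)
```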
The grepl() function is similar, but it is used to filter ROWS whose values contain a certain string. HOWEVER, we also have to specify WHICH COLUMN it should search in.
Which function do we use to subset ROWS (instead of columns)?
How would we subset our data set depending on which vampires were born in any of the Americas?
HINT: You will have to use grepl("America", bornIn), but this has to be wrapped in another function (but which one? Hmmm…)
SOLUTION
This one is pretty tricky! We first need to use the filter() function since we are filtering ROWS. Then, we use the grepl() function to define the string we want the code to search for, and then finally define in which COLUMN the code should search.
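A minimal sketch of this solution, on an invented stand-in for Vampires (the rows are assumptions, the region labels follow the ones that appear in the worksheet's outputs):

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  idVampire = 1:5,
  bornIn    = c("North America", "South America", "Europa",
                "Antarctica", "Australia")
)

# grepl() returns TRUE for every row whose bornIn value contains "America";
# filter() keeps exactly those rows
americans <- Vampires %>%
  filter(grepl("America", bornIn))

americans
```

On the full Vampires data, this leaves only the vampires born in North or South America: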
# A tibble: 46 × 14
idVampire gender ageOfVamp…¹ deadO…² hasFa…³ bornIn vampT…⁴ wellb…⁵
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1 Female 89 Dead No North… hybrid 61.6
2 2 Male 192 Dead Yes North… hybrid 71.4
3 3 Male 67 Dead Yes South… sangui… 64.2
4 8 Female 67 Alive Yes North… hybrid 73.1
5 11 Male 61 Alive Yes South… psychic 51.5
6 13 Female 87 Alive No North… hybrid 19.5
7 17 Female 89 Alive Yes South… psychic 97.0
8 20 Male 99 Dead Yes South… sangui… 53.8
9 23 Male 99 Dead No North… hybrid 23.6
10 24 Male 61 Alive No North… psychic 68.0
# … with 36 more rows, 6 more variables: maritalStatus <chr>,
# employment <chr>, income <dbl>, visitedCities <dbl>,
# numberOfChildren <dbl>, numberChangedToVamp <dbl>, and
# abbreviated variable names ¹ageOfVampire, ²deadOrAlive,
# ³hasFangs, ⁴vampType, ⁵wellbeing
Let’s say that we know we do not want any character vectors in our data frame, but we rather want these to be factor vectors. Using the mutate(), across() and where() functions, change ALL character vectors to factor vectors. This one is a bit tricky, but, once you know this trick, you will use it SO OFTEN!!!
SOLUTION
This one is pretty tricky! We first need the mutate() function because we want to change the variables. Then, we need across() because we are mutating across several variables, and we need to tell across() which variables to mutate, so we use the where() function. Since we want to mutate all CHARACTER vectors, we need to find where these vectors are, which we can do with the is.character function. Then, we tell across() what to do with the character vectors it has located, namely change them to factors, which we do with the as.factor() function.
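Put together, a minimal sketch looks like this. The data are an invented stand-in, and Vampires_fct is a name I made up for the result:

```r
library(dplyr)
library(tibble)

# Invented stand-in for the Vampires data
Vampires <- tibble(
  idVampire    = 1:4,
  gender       = c("Female", "Male", "Male", "Female"),
  ageOfVampire = c(89, 192, 67, 23),
  bornIn       = c("North America", "Europa", "South America", "Australia")
)

# mutate() changes columns, across() says "several at once",
# where(is.character) picks the character columns,
# and as.factor is the function applied to each of them
Vampires_fct <- Vampires %>%
  mutate(across(where(is.character), as.factor))

sapply(Vampires_fct, class)
```

On the full Vampires data, every former <chr> column now shows up as <fct>: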
# A tibble: 100 × 14
idVampire gender ageOfVamp…¹ deadO…² hasFa…³ bornIn vampT…⁴ wellb…⁵
<dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <dbl>
1 1 Female 89 Dead No North… hybrid 61.6
2 2 Male 192 Dead Yes North… hybrid 71.4
3 3 Male 67 Dead Yes South… sangui… 64.2
4 4 Male 23 Dead No Europa psychic 24.7
5 5 Female 40 Alive Yes Austr… hybrid 47.2
6 6 Male 105 Alive No Antar… psychic 23.9
7 7 Female 58 Alive Yes Austr… hybrid 38.8
8 8 Female 67 Alive Yes North… hybrid 73.1
9 9 Male 88 Alive No Antar… hybrid 51.4
10 10 Male 122 Dead Yes Antar… psychic 70.6
# … with 90 more rows, 6 more variables: maritalStatus <fct>,
# employment <fct>, income <dbl>, visitedCities <dbl>,
# numberOfChildren <dbl>, numberChangedToVamp <dbl>, and
# abbreviated variable names ¹ageOfVampire, ²deadOrAlive,
# ³hasFangs, ⁴vampType, ⁵wellbeing
Alright, you’ve done well until now? Great! Let’s take on a harder task. Before you start, look up the arguments for the function ungroup().
Let’s say we had participants complete several versions of a C Test (a language assessment test in which participants have to complete words; performance on it has been shown to correlate strongly with participants’ general language proficiency, cf. Raatz & Klein-Braley [2002]). Since no two tests are typically exactly the same, even after pilot testing (that is, without conducting large-scale psychometric validity screenings), we tend to correct for possible differences between the versions statistically. To make the versions more comparable, we add to each individual score the deviation of its version’s mean score from the overall mean score (overall mean minus version mean, which may be positive or negative).
Let me generate some data for you that replicates this situation:
# RUN THE FOLLOWING CODE CHUNK
CTest_df = # define object
  tibble(Version = gl(n = 3, k = 20), # factor with 3 levels, 20 rows each
         CTest = c(round(abs(rnorm(n = 20, mean = 17, sd = 4))), # generate CTest 1
                   round(abs(rnorm(n = 20, mean = 19, sd = 4))), # generate CTest 2
                   round(abs(rnorm(n = 20, mean = 20, sd = 4)))) # generate CTest 3
  ) %>%
  mutate(CTest = ifelse(CTest > 30, 30, CTest), # cap scores at the maximum of 30
         CTest = ifelse(CTest < 0, 0, CTest))   # no scores below 0
# Let's have a look
CTest_df
# A tibble: 60 × 2
Version CTest
<fct> <dbl>
1 1 20
2 1 20
3 1 13
4 1 17
5 1 22
6 1 19
7 1 19
8 1 15
9 1 14
10 1 20
# … with 50 more rows
CTest_df %>%
  group_by(Version) %>%
  summarize_at(.vars = "CTest",
               .funs = c("max", "min", "mean", "sd"))
# A tibble: 3 × 5
Version max min mean sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 1 22 9 16.8 3.50
2 2 26 10 17.6 4.15
3 3 26 12 19.2 3.70
So, now we have a data frame with the version of the C Test (Version) and each participant’s score on the C Tests (CTest).
Adjust the variable CTest by adding to each individual score the difference between the overall mean score of the C Tests and the mean score of that score’s C Test version (this difference can be positive or negative).
HINT You will need to (or can, there are other ways to solve this) use the following functions in the given order: group_by(), mutate(), ungroup() and mutate()
SOLUTION
See the commented code below for the solution.
CTest_df %>%
  # First we need to group by version,
  # since we need the mean of each
  # VERSION of the CTest
  group_by(Version) %>%
  # Now we create a NEW variable called
  # CTestMean, which gives us the mean of
  # each version
  mutate(CTestMean = mean(CTest)) %>%
  # We haven't seen this function yet:
  # ungroup() is the exact opposite of group_by().
  # It removes the grouping, so that the
  # following steps work on the whole data frame
  # again, which now also contains the new
  # variable (i.e. CTestMean) we just created
  ungroup() %>%
  # Now we need to create a few different variables:
  # We need the OVERALL mean, i.e. the mean of ALL the
  # C Test scores; we need the difference between the
  # OVERALL mean score and the mean score of each
  # version of the C Test. We then change the original
  # CTest variable by adding the difference (positive or
  # negative) to each individual CTest score
  mutate(CTestMean_overall = mean(CTestMean),
         CTestMean_difference = CTestMean_overall - CTestMean,
         CTest = CTest + CTestMean_difference) %>%
  # Now we just get rid of the helper variables
  # that we needed to calculate the adjusted C Test scores
  select(-c(CTestMean, CTestMean_overall, CTestMean_difference))
# A tibble: 60 × 2
Version CTest
<fct> <dbl>
1 1 21.1
2 1 21.1
3 1 14.1
4 1 18.1
5 1 23.1
6 1 20.1
7 1 20.1
8 1 16.1
9 1 15.1
10 1 21.1
# … with 50 more rows
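If you want to convince yourself that the adjustment did what it should, you can check that every version ends up with the same mean, namely the overall mean. The sketch below does this on a small, fixed toy data set (the scores are invented so that the result is reproducible, unlike the random data above) and uses a more compact variant of the adjustment; it is an alternative illustration, not the solution from the exercise:

```r
library(dplyr)
library(tibble)

# Invented, fixed toy scores: 3 versions with 4 participants each
CTest_df <- tibble(
  Version = factor(rep(1:3, each = 4)),
  CTest   = c(15, 17, 16, 20,
              18, 19, 21, 18,
              20, 22, 19, 23)
)

# Overall mean across ALL scores, computed before grouping
overall_mean <- mean(CTest_df$CTest)

# Inside the grouped mutate(), mean(CTest) is the mean of the current
# version, so each score is shifted by (overall mean - version mean)
adjusted <- CTest_df %>%
  group_by(Version) %>%
  mutate(CTest = CTest + (overall_mean - mean(CTest))) %>%
  ungroup()

# After the adjustment, every version mean equals the overall mean
version_means <- adjusted %>%
  group_by(Version) %>%
  summarize(mean = mean(CTest))

version_means
```

The spread of the scores within each version stays untouched; only the location of each version is shifted.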