Alright, so we’ve gotten to know a few functions. Let’s go ahead and review some of the important ones, specifically the ones that we use really often, like the summary statistics.
Compute the mean, sd, min and max of the ageOfVampire variable in the Vampires data frame.
SOLUTION
mean(Vampires$ageOfVampire)
[1] 84.5
sd(Vampires$ageOfVampire)
[1] 32.79366
min(Vampires$ageOfVampire)
[1] 14
max(Vampires$ageOfVampire)
[1] 198
Remember, when we want to reach into a data frame, we need to use the $ operator. Writing Vampires$ageOfVampire tells R to reach into the Vampires data frame and pull out the variable that comes after the $ operator. The function is then carried out on that variable.
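If you want to see what the $ operator does on its own, here is a minimal sketch. It uses a tiny made-up data frame called toy (the name and the values are invented for illustration, they are not part of the Vampires data).

```r
# A tiny, made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  name = c("Vlad", "Carmilla", "Lestat"),
  age  = c(120, 85, 240)
)

# toy$age pulls the age column out of the data frame as a plain vector
ages <- toy$age
ages

# A function such as mean() is then applied to that vector
age_mean <- mean(toy$age)
age_mean
```

The same pattern works for sd(), min() and max(): the part before the $ names the data frame, the part after it names the variable.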
Nice job, you’re doing fantastic! So, since you were able to complete the last activity, let’s kick it up a notch and get to some fun stuff.
You’ve had a hard day, because, let’s be honest, that’s academia. You need some encouragement, but no one is home to tell you how amazing you are. Let’s fix this!
Load in the package called praise and then call the praise() function. Run this as many times as you need until you feel like the awesomest person on Earth, because you are!!
SOLUTION
Remember, if we haven’t installed a package yet, we ALWAYS need to run the install.packages() function first, with the package name in quotation marks inside the parentheses. After that, we need to LOAD the package with library(), cause otherwise we have ‘downloaded’ it, but we haven’t actually ‘opened’ it yet.
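Put together, the steps could look roughly like this. This is only a sketch: the requireNamespace() check is my own addition so that the code runs without errors whether or not praise is already installed, and the one-time install itself needs an internet connection.

```r
# Check whether praise is already installed (this guard is an addition
# for illustration, not part of the original exercise)
praise_installed <- requireNamespace("praise", quietly = TRUE)

if (praise_installed) {
  library(praise)   # LOAD ('open') the package for this session
  print(praise())   # ask for a compliment; rerun as often as you like
} else {
  # The package has to be downloaded once before it can be loaded
  message('praise is not installed yet. Run install.packages("praise") first.')
}
```

Installing is needed only once per computer, while library() has to be run again in every new R session.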
Alright, so now that we know how to load in packages and call some useful functions, what happens if we forget functions, or if we have something we want to do, but don’t remember or know a helpful function for this? Well, GOOGLE is our friend!
In our Vampires data frame, we want to know how many male and female vampires there are. There are a few important steps we need to take to do this.
FIRST, we NEED to make sure that all of our variables that should be treated as factor vectors are, indeed, factors. If you have read in the Vampires data set, chances are the gender variable was saved as a character vector, which we don’t want.
Go on Google and try to find out which function we can use to change
a CHARACTER vector in a data frame to a FACTOR vector (googling
something like “change character to factor in r” should do the trick).
Your GOAL is to change the gender variable in the Vampires data frame from a CHARACTER vector to a FACTOR vector (you can check whether it worked by running class(Vampires$gender)).
SOLUTION
So, there are actually a lot of ways to do this, depending on whether you are using tidyverse (we will learn about this in the next section) or base R. I’ll show you a base R example.
So, in base R, we work with our $ operator. Since we want to change a vector inside a data frame, we assign the converted vector to Vampires$gender. Because gender already exists in the Vampires data frame, assigning to it in this way overwrites the character vector with a factor vector.
Vampires$gender = as.factor(Vampires$gender)
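To convince yourself that the conversion worked, you can check the class before and after. Here is a small sketch on a made-up data frame called toy (the name and values are invented for illustration).

```r
# Made-up data frame with gender stored as text (hypothetical values)
toy <- data.frame(
  gender = c("Female", "Male", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

class(toy$gender)    # "character" before the conversion

# Overwrite the character vector with a factor version of itself
toy$gender <- as.factor(toy$gender)

class(toy$gender)    # "factor" after the conversion
levels(toy$gender)   # the distinct categories: "Female" "Male"
```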
Awesome, you’re doing so well! So, since we have now changed our gender variable to a factor, we can now count how many vampires fall into each factor level (in our case, I only entered female/male for the sake of simplicity). Use the table() function to get these counts for the factor gender (what do we need to feed into the table() function to make it count them?)
SOLUTION
You guessed it! All we need to do is enter the classic Vampires$gender into the table() function to have it count the number of vampires in each factor level.
table(Vampires$gender)
Female Male
56 44
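It is easy to mix up "how many observations are in each level" with "how many levels are there". The sketch below shows both on a made-up factor (the values are invented for illustration).

```r
# Made-up factor with two levels (hypothetical values)
gender <- factor(c("Female", "Male", "Male", "Female", "Female"))

# table() counts how many observations fall into each level
counts <- table(gender)
counts

# nlevels() counts how many distinct levels the factor has
nlevels(gender)

# A single count can be pulled out by its level name
counts[["Female"]]

# prop.table() turns the counts into proportions
prop.table(counts)
```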
Let’s use what we just learned to answer the following questions:
How many vampires in the data frame are dead, and how many alive?
How many vampires were born on each continent?
How many vampires are married and how many divorced?
SOLUTION
Well, for each variable we first have to convert it to a factor, then use the table() function.
       Africa    Antarctica          Asia     Australia        Europa
            2            30             2            15             5
North America South America
           19            27
Divorced Married Single
47 20 33
Oh my God, why are so many divorced?!
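The code behind these tables follows the same two steps as before. Below is a sketch on a made-up data frame that borrows the column names deadOrAlive, bornIn and maritalStatus from the Vampires data. The values are invented, and using lapply() to convert several columns at once is just one possible shortcut, you can equally convert them one by one with as.factor().

```r
# Made-up data frame (hypothetical values; only the column names mirror
# the Vampires data)
toy <- data.frame(
  deadOrAlive   = c("Dead", "Alive", "Alive", "Dead"),
  bornIn        = c("Asia", "Europa", "Asia", "Africa"),
  maritalStatus = c("Single", "Divorced", "Married", "Divorced"),
  stringsAsFactors = FALSE
)

# Step 1: convert the three character columns to factors in one go
cols <- c("deadOrAlive", "bornIn", "maritalStatus")
toy[cols] <- lapply(toy[cols], as.factor)

# Step 2: count the observations in each level
table(toy$deadOrAlive)
table(toy$bornIn)
table(toy$maritalStatus)
```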
Alright, we’re going to do some really fun statistics, cause why not?
Try to do the following (feel free to group up for these!)
Which variables in the Vampires data frame are NUMERIC?
Choose two NUMERIC variables and run a correlation using the cor.test() function. This is a fantastic chance to use the help environment (try running ?cor.test) to find out what you should enter into the cor.test() function!
Install and load the package report. Go back and save your correlation test as the variable cor. Then run report(cor) and thank me later.
INTERMEDIATE: Change the type of the correlation to a Spearman’s correlation.
SOLUTION
Which variables in the Vampires data frame are NUMERIC?
So, this one is pretty easy, but we need to know the right function to do this! We can easily use the str() function, cause this gives us the classes of all variables in a data frame. Every variable listed as num is numeric.
str(Vampires)
spec_tbl_df [100 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ idVampire : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 1 1 2 2 ...
$ ageOfVampire : num [1:100] 89 192 67 23 40 105 58 67 88 122 ...
$ deadOrAlive : Factor w/ 2 levels "Alive","Dead": 2 2 2 2 1 1 1 1 1 2 ...
$ hasFangs : chr [1:100] "No" "Yes" "Yes" "No" ...
$ bornIn : Factor w/ 7 levels "Africa","Antarctica",..: 6 6 7 5 4 2 4 6 2 2 ...
$ vampType : chr [1:100] "hybrid" "hybrid" "sanguinarian" "psychic" ...
$ wellbeing : num [1:100] 61.6 71.4 64.2 24.7 47.2 ...
$ maritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 1 1 2 1 2 3 1 3 3 ...
$ employment : chr [1:100] "Employed" "Employed" "Not Employed" "Employed" ...
$ income : num [1:100] 131670 153860 154087 113842 144047 ...
$ visitedCities : num [1:100] 10 4 3 97 33 41 43 119 16 11 ...
$ numberOfChildren : num [1:100] 3 1 1 2 3 1 5 1 4 1 ...
$ numberChangedToVamp: num [1:100] 23 7 9 11 3 5 13 21 4 8 ...
- attr(*, "spec")=
.. cols(
.. idVampire = col_double(),
.. gender = col_character(),
.. ageOfVampire = col_double(),
.. deadOrAlive = col_character(),
.. hasFangs = col_character(),
.. bornIn = col_character(),
.. vampType = col_character(),
.. wellbeing = col_double(),
.. maritalStatus = col_character(),
.. employment = col_character(),
.. income = col_double(),
.. visitedCities = col_double(),
.. numberOfChildren = col_double(),
.. numberChangedToVamp = col_double()
.. )
- attr(*, "problems")=<externalptr>
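If reading through the whole str() output feels like a lot, you can also ask R directly which columns are numeric. This is a sketch on a made-up data frame (the values are invented; only some column names are borrowed from the Vampires data), using sapply() to run is.numeric() on every column.

```r
# Made-up data frame mixing factor, character and numeric columns
# (hypothetical values)
toy <- data.frame(
  gender       = factor(c("Female", "Male", "Male")),
  ageOfVampire = c(89, 192, 67),
  hasFangs     = c("No", "Yes", "Yes"),
  income       = c(131670, 153860, 154087),
  stringsAsFactors = FALSE
)

# Run is.numeric() on every column; the result is one TRUE/FALSE per column
is_num <- sapply(toy, is.numeric)
is_num

# Keep only the names of the numeric columns
numeric_names <- names(toy)[is_num]
numeric_names
```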
Choose two NUMERIC variables and run a correlation using the cor.test() function. This is a fantastic chance to use the help environment to find out what you should enter into the cor.test() function!
So, if you know a little about statistics, you know that a correlation is just the strength (and direction) of the association between two variables. This logically means that we need to enter two variables into the correlation analysis. Let’s say we want to know whether income correlates with wellbeing in vampires. REMEMBER, we need to tell R which data frame our variables come from using the $ operator!!
cor.test(Vampires$wellbeing, Vampires$income)
Pearson's product-moment correlation
data: Vampires$wellbeing and Vampires$income
t = 2.5368, df = 98, p-value = 0.01276
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05447568 0.42398305
sample estimates:
cor
0.2482377
Nice, looks like there is something going on there! (Remember, since the data here are generated anew every time I remake the website, the results you get could be entirely different, so don’t let this scare you!! If you got a result, then you did it just fine, go you!)
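The result of cor.test() is more than printed text: it is an object you can save and pick apart with the $ operator. Here is a sketch on simulated data (the variables, the numbers used to generate them, and the seed are all made up for illustration; set.seed() only makes the random numbers repeat on every run).

```r
# Simulate two related variables (made-up numbers, arbitrary seed)
set.seed(42)
wellbeing <- rnorm(100, mean = 50, sd = 15)
income    <- 100000 + 500 * wellbeing + rnorm(100, sd = 20000)

# Save the test result instead of only printing it
result <- cor.test(wellbeing, income)

result$estimate   # the correlation coefficient r
result$p.value    # the p-value of the test
result$conf.int   # the 95 percent confidence interval for r
```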
Install and load the package report. Go back and save your correlation test as the variable cor. Then run report(cor) and thank me later.
# install.packages("report")
library(report)
cor = cor.test(Vampires$income, Vampires$wellbeing)
report(cor)
Effect sizes were labelled following Funder's (2019) recommendations.
The Pearson's product-moment correlation between Vampires$income and
Vampires$wellbeing is positive, statistically significant, and medium
(r = 0.25, 95% CI [0.05, 0.42], t(98) = 2.54, p = 0.013)
INTERMEDIATE: Change the type of the correlation to a Spearman’s correlation.
cor.test(Vampires$income, Vampires$wellbeing, method = "spearman")
Spearman's rank correlation rho
data: Vampires$income and Vampires$wellbeing
S = 119390, p-value = 0.004373
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.2835884
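In case you are wondering what Spearman’s correlation actually does differently: it is a Pearson correlation computed on the ranks of the values instead of the values themselves. The sketch below checks this on simulated data (the variables x and y and the seed are made up for illustration).

```r
# Simulate two related variables (made-up numbers, arbitrary seed)
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)

# Spearman's rho computed directly
rho_spearman <- cor(x, y, method = "spearman")

# Pearson's r computed on the ranks of x and y
r_on_ranks <- cor(rank(x), rank(y))

# The two numbers agree
all.equal(rho_spearman, r_on_ranks)
```

Because only the ranks matter, Spearman’s correlation is less affected by extreme values than Pearson’s.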