6. Code chunks in R Markdown

Mason A. Wirtz https://masonwirtz.github.io

Exercise 1

Create a new R Markdown document and name it “Exercise 6”. Delete the example content in the markdown document (from line 12 and down). Insert a new R chunk by clicking on the Insert button in the editor toolbar (this is the button with a white C in a green square) and choosing R.

Import the Vampires.csv data set in this chunk using the read_csv() function and save this under the object Vampires. Don’t forget that this data frame should now be in our Data folder!!! Write the relative path in the read_csv() function accordingly!

Click for Answer

SOLUTION

By inserting a code chunk, we are effectively structuring our code in a readable manner and in a way that makes following what our process of analysis is easier. When you look back at your code after a time away from it, it is also much easier to understand what you yourself were thinking when coding. To read in the data frame we are interested in, we do this in our code chunks exactly the same way as we did when working with regular .R documents.

Vampires = read_csv("./Data/Vampires.csv")         # read in data with a relative path

Exercise 2

Create a new code chunk using the Insert button in RStudio. In your new code chunk, complete the following exercise:

Using the help bar (or console), look up the arguments of the ifelse() function. What arguments does it take?

Use the ifelse() function to create a new categorical variable called VampOld: All vampires older than 100 should be categorized as Old, all vampires younger than 100 should be categorized as Young.

HINT You will need the mutate() function and the ifelse() function.

Click for Answer

SOLUTION

Since we are interested in creating a new variable, this should tell us immediately that we need the mutate() function. In the mutate() function, we need to define our new variable VampOld.

Once we have defined the new variable, we need to define what should be included in the new variable. Since we want to categorize the age variable (which we really wouldn’t ever want to do in real life, but for practice, it’s an easy example), we can use the ifelse() statement, which first takes an argument to evaluate. The argument to evaluate we need to supply the function with is “if the vampire is older than 100…”, which we can do using the variable ageOfVampire and the > operator. Once we have supplied the ‘if’ argument, we need the ‘then do’ argument, i.e. what needs to happen when this argument is true? In this case, all vampires older than 100 should be categorized as ‘old’, so we supply the function with ‘old’. The final argument in the function takes the ‘else’ part, i.e. what happens when the first argument is false? Since we just want to categorize the variable into old and young, if the vampire is not old, then they must be young, so we supply the final argument with the character value ‘young’.

Vampires =                                  # define object              
  Vampires %>%                              # data
  mutate(VampOld =                          # define NEW VARIABLE
           ifelse(ageOfVampire > 100,       # TEST (the "if" argument)
                  "Old",                    # if the test is TRUE, change to OLD
                  "Young"))                 # if the test is FALSE, change to YOUNG

Exercise 3 (intermediate level)

You can name R chunks in R Markdown, which is practical when structuring your analysis process. This can be done by writing a name beside the ` r at the top of the chunk, e.g. as follows

# This is an example code chunk NAMED "example"

We are going to run a simple linear regression using the base-R lm() function. For those not familiar with regression modeling, it is the most basic form of predictive analysis. The overall idea of regression analysis is to examine the predictive power of a variable, i.e. how well of a job does a particular independent variable (predictor) do in predicting the outcome of a dependent variable?

If you’ve never conducted a regression analysis before, I’m going to walk you through this process here. Alright, let’s get started.

Exercise 3.1

Create an R code chunk and name it modAge

Click for Answer

SOLUTION

This activity should be relatively self explanatory, but here it is:

{r modAge}

Exercise 3.2

Above, I explained the general goals of a regression analysis. The goal is to determine how well a predictor variable explains the outcome of a dependent variable. In this case, we are interested in whether the age of the vampire (variable: ageOfVampire) has explanatory value in predicting how many people a vampire changes into vampires (variable: numberChangedToVamp).

The lm() (which stands for linear model) takes the arguments FORMULA and DATA. FORMULA refers to the formula y ~ x, i.e. the dependent variable y (numberChangedToVamp) as a function of the independent variable x (ageOfVampire). DATA refers to the data set in which your variables are saved.

Using the lm() function, replace the y and x with the respective variables from above, fill in the data argument in the code chunk below, and save this as an object with the name modAge.

... = lm(y ~ x, data = ...)
Click for Answer

SOLUTION

With this model, we are assessing the predictive power of the age of the vampires on how many people the vampires in this data set change to vampires.

modAge =                                     # define object
  lm(numberChangedToVamp ~ ageOfVampire,     # number changed to vamp as a function of vampire age
     data = Vampires)                        # supply the data frame in which the variables are saved

Exercise 3.3

Alright, now we have run a simple linear regression model. The hard part is always interpreting the output that we have here.

To have a look at the output of this regression model, you can use the summary() function, with the argument being a linear model. Insert the modAge linear regression into the summary() function.

Click for Answer

SOLUTION

We can get the results of our linear regression model by running the following code chunk.

summary(modAge)          # call summary of the model

Call:
lm(formula = numberChangedToVamp ~ ageOfVampire, data = Vampires)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5447 -3.3155 -0.6979  1.7053 14.8689 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.49448    1.29026   7.359 5.77e-11 ***
ageOfVampire -0.01532    0.01424  -1.075    0.285    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.648 on 98 degrees of freedom
Multiple R-squared:  0.01166,   Adjusted R-squared:  0.00158 
F-statistic: 1.157 on 1 and 98 DF,  p-value: 0.2848

Exercise 3.4

The output of the linear regression model gives us important information, namely the intercept and the slope of the respective predictor variable. But what do these actually mean? Well, good question!

The INTERCEPT is the y-value when x = 0, so the amount of people vampires at age 0 have changed into vampires.

The SLOPE tells us: for every increase of 1 unit from 0, the outcome variable changes (starting from the value of the intercept) by this value. In our case, this means: For every year increase (from 0), i.e. for every year older a vampire becomes, a vampire changes .008 more people into vampires (again, from the intercept).

Not very intuitive, right? It doesn’t make a lot of sense. A vampire 1 year old probably wouldn’t be changing anyone into anything. How do we fix this?

Well, to make regression modeling more intuitive, we tend to CENTER variables, which just means we subtract the mean from each data point. Thus, the variable is expressed in terms of how much each data point is ABOVE the mean (positive score) or BELOW the mean (negative score). If a variable is centered, then zero represents the MEAN of the respective variable. We can center variables using the scale() function and adding the argument scale = FALSE.

Your next task: Using the mutate() function, center the variable ageOfVampire, and save the new data frame as vampDat. Save this code chunk as manipuate data frame.

HINT You will need to form the scale() function as scale(ageOfVampire, scale = FALSE)

Click for Answer

SOLUTION

This one is a bit tricky. To start off, we know we need the save the data set as vampDat, which we can do relatively easily using the = operator. After that, we need to define the data frame that we want to ‘reach into’, which in this case is the Vampires data frame. We then use the pipe %>% to signal that we want to string functions together. After this, we need the mutate() function. The mutate() function takes the name of the new variable as its first argument. We want to overwrite the ageOfVampire variable in this new data frame, so we just add this name. Then we add the = operator, after which we apply the scale() function given above.

vampDat =                            # define object
  Vampires %>%                       # data
  mutate(ageOfVampire =              # define NEW VARIABLE (this overrides current age variable)
           scale(ageOfVampire,       # apply the scale() function
                 scale = FALSE))     # set scale argument to false, which centers the variable

Exercise 3.5

Now that we have centered our predictor variable, we can more easily interpret the slopes and intercept. Run another linear model using the same variables, but changing the data frame to vampDat, since this has the newly centered predictor variable ageOfVampire. Save this model as modAge_c.

How can we interpret this output?

Click for Answer

SOLUTION

Let’s first run the linear model.

modAge_c =                                  # define object
  lm(numberChangedToVamp ~ ageOfVampire,    # number changed to vamp as a function of vampire age
     data = vampDat)                        # supply data frame in which the variables are contained

summary(modAge_c)                           # call model summary

Call:
lm(formula = numberChangedToVamp ~ ageOfVampire, data = vampDat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5447 -3.3155 -0.6979  1.7053 14.8689 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.20000    0.46478  17.643   <2e-16 ***
ageOfVampire -0.01532    0.01424  -1.075    0.285    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.648 on 98 degrees of freedom
Multiple R-squared:  0.01166,   Adjusted R-squared:  0.00158 
F-statistic: 1.157 on 1 and 98 DF,  p-value: 0.2848

The INTERCEPT is the predicted mean of people changed to vampires when we consider the average age of the vampires in the data set. In other words, vampires in an average age (of our sample)—which is 81.49—are predicted to have changed 8.4 people into vampires. We can check this in a broad way by running the following code as well, which filters the data frame to vampires around the approximate mean age of the sample, and then checking the mean amount of people they have changed to vampires. We should get a value relatively close to the intercept of the new model.

Vampires %>%                                             # data
  filter(ageOfVampire > 80 | ageOfVampire < 82) %>%      # select vamps between ages of 80 and 82
  summarize(mean = mean(numberChangedToVamp),            # receive mean of vamps between ages of 80 and 82
            sd = sd(numberChangedToVamp))                # receive standard deviation
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1   8.2  4.65

The SLOPE now tells us that for every increase from 0 (which is NOW the mean of the group) by 1 unit (i.e. by 1 year in this case), the number of vampires changed increases (from the intercept) by this amount (i.e. .008).