1 Introducing the data

Our primary research question is this: why do graduates of some schools earn more than graduates of others? We consider graduates of the top liberal arts colleges (LACs) and universities in the US, according to US News & World Report rankings. The sample consists of 48 schools–25 universities and 23 LACs.

We will focus on the universities in our exercise. For homework, you’ll work with the LACs.

We’ll use graphs to visually test three hypotheses:

  1. Graduates of smaller schools earn more.
  2. Graduates of higher-ranked schools earn more.
  3. Graduates of STEM-focused schools earn more.

We will create bubble plots, with the explanatory variable on the x axis and median salaries on the y axis. The size of the bubbles will correspond to the size of each school, and the colors will correspond to the type of school. Here is the aesthetic we will be aiming for:

An example of a bubble plot from the Economist magazine

To begin, download the “Uni Salaries” Excel file from tinyurl.com/collegesal. Open it in Excel and explore it.

The first nine columns are:

  • Number: Meaningless. An artifact from data processing.
  • School Name Long: Full name of school
  • Location: Location of each school (from Google)
  • Student Population: Total student enrollment (from Google)
  • University or LAC: Type of school (redundant)
  • Uni Ranking: US News Ranking
  • Tags: Description tags
  • Labels: Description tags (duplicate)
  • School Name Short: Shortened name

The last three columns columns are all sourced from the website PayScale.com, which surveys US graduates about their jobs. These columns are:

  • % High Meaning: Proportion of alumni who say they derive high meaning from their jobs
  • % STEM Degrees: Proportion of alumni whose degree was in a STEM field
  • Pay: This is a compound variable, composed of two figures, separated by a semicolon:
  1. Median salary of alumni 0 - 5 years after graduation
  2. Median salary of alumni 10 or more years after graduation

(I used the ‘merge’ function in R to combine data from several different sources to create this dataset.)

Next, let’s import the dataset into Rstudio.

# we use the openxlsx package to read Excel files into RStudio
library(openxlsx) 
# you may need to install this package first!

# insert your file path here
unis <-  read.xlsx("insert_file_path_here", sheet = 1) 

How do you get your file path? On a Macbook, two-finger click on the downloaded file in Finder. When a menu pops up, press the Option key. See the option to Copy “your_file_name” as pathname? Clicking on it copies the file path to the clipboard. You can now paste the path into the function above.

2 Cleaning the data

Before we clean the data, let’s first view it in RStudio’s native viewer. You should recall the command for this. If you have forgotten, Google it.

Questions:

  1. What did the spaces in the column names get replaced by after the import? What does this tell you about naming columns in R?
  2. Is the ‘Number’ column useful?
  3. Would it be helpful to change the position of the ‘School.Name.Short’ column?

2.1 Summary functions

Later on in your R career, you’ll work with datasets too large to view in RStudio’s viewer without stressing out your computer. In such cases, using summary functions will allow you to peek at tiny bits of your data. Try these now.

# What information does each of these functions provide?
str(unis)
summary(unis)
head(unis)
tail(unis)
names(unis)
colnames(unis)
nrow(unis)
ncol(unis)
dim(unis)

# My personal favorite summary function is in the 'dplyr' package
library(dplyr)
glimpse(unis)

The output of that last command shows the name of each column, the data types in each column (e.g. <int>) and the list of the first few elements in the column.

The data types in our dataset are:

  1. <chr>, Character: Strings of text.
  2. <dbl>, Double: A numeric data type. Does not have to be a whole number.

Questions:

  1. What is the mean student population for these schools?

  2. What data type are the “%.High.Meaning” and “%.STEM.Degrees” columns? Is this appropriate?

2.2 Deleting columns

Let’s get rid of the “Number” column, as we do not need it. The dollar sign, $, allow you to specify a column in a data frame.

# run this line of code to see what gets returned
unis$Number

# now delete that column
unis$Number <- NULL
# (we tell R, please replace the column called "Number" with NOTHING)

# we can check to make sure it is gone
unis$Number

# or use:
colnames(unis)
# to show that it is no longer one of the columns

Your turn:

  1. Delete the “Labels” column, since it is a duplicate of the “Tags” column.

  2. Also delete the “University.or.LAC” column, as it is superfluous.

  3. Use one of the summary functions discussed above to count the number of columns that remain.

2.3 Reordering columns (Intro to subsetting)

We notice that the ‘School.Name.Short’ column is in a weird position. Let’s move that data to the first column, since it is a nice identifier.

To do this, you need to understand the principles of subsetting in R. To subset of a data frame, we use square brackets. Let’s look at some examples. Run the code lines below.

# retrieve row 1 and column 1 from data frame 
unis[1,1]

# retrieve row 1, column 2
unis[1,2]

# row 1, and all columns
unis[1,]

# all rows, column 7
unis[,7]

# all rows, all columns
unis[ , ]

# all rows and all columns, except column 1
unis[ ,-1]

# rows 3 to 5, and columns 1 to 6
unis[3:5,1:6]

# rows 1,3 and 5, and column 2
unis[c(1,3,5), 2] 

# rows 1, 3, 5 and 7 to 9, all columns
unis[c(1,3,5, 7:9), ]

What we are doing above is selecting rows and columns based on their positions or ‘indices’. But how do we figure out their positions in the first place? For column indices, here’s one option:

# create a data frame with names of columns
data.frame(colnames(unis))
# the rownumbers match the column indices. There you go!

And for row indices:

data.frame(unis$School.Name.Long)
# and now you have each school's row number!

Now we can change the position of the ‘School.Name.Short’ column, our original mission.

unis <- unis[ , c(6,1:5,7:9) ]
# replace the old 'uni' with the new uni, 
# which is arranged as follows: 
# all rows, column 6, then columns 1 to 5, then columns 7 to 9

Phew! Done. I’ll admit–that was quite cumbersome. It would have been much simpler to have done that using copy and paste commands in Excel. The benefits of R only become evident when analyzing large datasets or performing repetitive tasks.

Your turn: Move the ‘Tags’ from its current position to position 3, beside the ‘School.Name.Long’ column.

2.4 Separating columns

Next, we would like to separate out the ‘Pay’ column into two separate columns, each containing its own variable. We employ the ‘separate’ function in the tidyr package. The function takes five key arguments:

library(tidyr)
unis <- separate(data = unis, #take the data frame 'unis'
                col = Pay, # the column called 'Pay'
                sep = "[;]", # separate based on semicolons
                into = c('Early.Career.Pay','Mid.Career.Pay'),  #into these two columns
                remove = TRUE) # and remove the original column

Done!

There is one problem though. Because there was a space after the semicolon, the values in the ‘Mid.Career.Pay’ now have a space before the dollar sign. Run the code below to see this.

unis[ , "Mid.Career.Pay"]

This could be problematic. To remove this, we use the ‘trim white space’ function in base R.

unis$Mid.Career.Pay <- trimws(unis$Mid.Career.Pay)

Your turn:

  1. Several pieces of information are contained in the ‘Tags’ column. Separate them out into four different columns, labelled “Tag1”, “Tag2” and so on. (Ignore the warning message.) Trim the white space created.
  2. Separate the ‘Location’ column into three columns, called ‘County’, ‘State’ and ‘State.Initials’. Trim the white space generated.

2.5 Processing Text

Now let’s do some text processing. The dollar sign and the the commas in the ‘Pay’ columns prevent R from understanding that those columns are actually numeric variables. We will need to get rid of those. We use the ‘gsub’ command, which we learned about in the last lesson.

unis$Early.Career.Pay <- gsub(x = unis$Early.Career.Pay , # within this column
                              pattern = "[,]", # find all commas
                              replacement = "" ) #replace with nothing 

Your turn:

  1. Remove the commas from the “Mid.Career.Pay” column.
  2. Remove the dollar signs from both the “Early.Career.Pay” and the “Mid.Career.Pay” columns.

2.6 Converting Variable Types

If we call the following command on our dataset,

glimpse(unis)

we see that most of our numeric variables are still coded as characters (‘chr’). This will be a problem when trying to plot the data. We need to change the ‘classes’ of these variables. There is a family of functions for this. Type as. into Rstudio then pause. A list of variable-converting functions should show up. Let’s use some of these them on our variables now.

# '% High Meaning' should be numeric
unis$`%.High.Meaning` <- as.numeric(unis$`%.High.Meaning`)
# NOTE: we need the backticks here because of the % sign in the column name

# 'Early Career Pay' should also be numeric
unis$Early.Career.Pay <- as.numeric(unis$Early.Career.Pay)

Your turn:

  1. Convert the ‘Mid.Career.Pay’ column into a numeric variable.
  2. Use the glimpse function again on the data frame. Are all data types appropriate?

2.7 Logical subsetting

Further above, we looked at subsetting based on column and row indices.

Another form of subsetting—subsetting based on a condition—is often needed. This is called logical subsetting. For logical subsetting, we use square bracket notation (which you already learnt) along with the following logical operators:

  1. < ‘less than’
  2. <= ‘less than or equal to’
  3. > ‘greater than’
  4. >= ‘greater than or equal to’
  5. == ‘exactly equal to’
  6. %in% : ‘found in’. Works similar to ’==’but sometimes preferred.
  7. != ‘not equal to’
  8. | : ‘or’
  9. & : ‘and’

Let’s use these in some examples. First, the “==”, “<” ,“<=”, “>”, and “=>” operators.

# retrieve rows with schools in California, all columns
unis[ unis$State == "California" , ]
# another way of writing this is: 
unis[ unis$State %in% "California" , ]

# Retrieve the top 10 schools, and show only first column
unis[ unis$Uni.Ranking <= 10 , 1]

# Top 10 schools, and show only the School Name Short column
unis[ unis$Uni.Ranking <= 10 , "School.Name.Short"]

# Non top 10 schools 
unis[ unis$Uni.Ranking > 10, "School.Name.Short"]

Your turn:

  1. Retrieve all school in Cambridge, first column.
  2. Retrieve the bottom 15 schools, first column.
  3. Retrieve schools where more than half of students graduate with STEM degrees, all columns.

Now, let’s use the “|” and “&” operators to combine conditions.

# Schools ranked 5 to 15, all columns
unis[ unis$Uni.Ranking >=5 & unis$Uni.Ranking <=15 , ]

# The mid-career pay of schools ranked 5 to 15
unis[ unis$Uni.Ranking >=5 & unis$Uni.Ranking <=15 , "Mid.Career.Pay" ]

# Schools in California or Texas
unis[ unis$State == 'California'  | unis$State == 'Texas' , ]
# Or:
unis[ unis$State %in% 'California'  | unis$State %in% 'Texas' , ]

# Schools in California with more than 10000 students
unis[ unis$State == 'California'  & unis$Student.Population > 10000, ]

Your turn:

  1. Retrieve the names of schools in California where the Mid-career pay is above 150000
  2. Retrieve the names AND the populations of these schools

Next, let’s use the negation operator, !=.

# Schools outside California
unis[ unis$State != 'California' , ]

Your turn:

  1. Retrieve the names of schools outside California and New York.

Now, what if we want to test the condition across multiple columns?

# Schools which are tagged as 'Private School' in any of the four tags
unis[  unis$Tag1 == 'Private School' | 
       unis$Tag2 == 'Private School' |
       unis$Tag3 == 'Private School' |
       unis$Tag4 == 'Private School' , ]
# OR
unis[  unis$Tag1 %in% 'Private School' | 
       unis$Tag2 %in% 'Private School' |
       unis$Tag3 %in% 'Private School' |
       unis$Tag4 %in% 'Private School' , ]

Your turn:

  1. Retrieve the names of all engineering schools.
  2. Retrieve the names AND student populations of all engineering schools.
  3. Retrieve the names of all engineering schools located in California.

Okay, let’s finally put our newly-gained power to good use!

We would like to create a new column, called ‘Sports’ that tells us whether or not a school is sports-focused. We will use the ‘%in%’ operator, instead of the usual ‘==’ operator as this latter operator is problematic when trying to make a new variable.

# Create an empty column called 'Sports'
unis$Sports <- NA

# Take the subset of schools tagged as 'For Sports Fans'
unis[  unis$Tag1 %in% 'For Sports Fans' | 
       unis$Tag2 %in% 'For Sports Fans' |
       unis$Tag3 %in% 'For Sports Fans' |
       unis$Tag4 %in% 'For Sports Fans', 'Sports'] <- "Sports School" 
# and assign the value 'Sports School' to their Sports column

Next, let’s indicate that all the other schools are “Non-Sports schools”. We will require a negation of the ‘%in%’ operator, which can be found in the ‘Hmisc’ package.

library(Hmisc)
# subset of all unis for which the 'Sports' does not say 'Sports School.
unis[unis$Sports %nin% 'Sports School', 'Sports'] <- 'Non-Sports School'

Your turn:

  1. Create a new variable called ‘Ivy’ that contains the values “Ivy League” and “Non-Ivy League”.
  2. Create another variable called ‘Private’ that contains the values “Private” and “Public”.

Most of the necessary data processing is now done! We can move on to plotting the data.

3 Plotting

The ‘ggplot2’ (Grammar of Graphics Plot 2) package is the unrivaled choice for visualization in R. Let’s take a look at its syntax.

We first pass our data frame into the ‘ggplot()’ function. Then we add in specific geometric elements. For each element, we must provide the ‘aesthetics’ to be mapped.

Let’s plot a simple scatter plot of university rankings and early career pay.

library(ggplot2)

ggplot(data = unis) + # using the data in the "unis" data frame
  geom_point(mapping = aes( # create a scatterplot with the following aesthetic mappings
    x = Uni.Ranking,       # on the x axis, place the university's ranking 
    y = Early.Career.Pay ) ) # on the y axis, place the early career pay
# NOTE: The syntax above is separated over multiple lines, but it does not need to be.


# We can also assign the plot to an object, then call the object
plot1<- ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, y = Early.Career.Pay ) )
plot1

Instead of just seeing dots, what if we’d like to see the name of each university? We use the geom_text object for this. To the geom_text object, we add our usual x and y mappings, but we also include an extra aesthetic mapping: ‘labels’.

ggplot(data = unis) +
  geom_text(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) )

Nice!

But what if we want to show both the dot and the text labels? We can simply add another ‘+’ sign and hook on a second geometric element.

ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, 
                            y = Early.Career.Pay ) ) +
  geom_text(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) )

Getting there!

The overlap of the geom_text elements is quite annoying though. Can we fix it? Yes! There’s a package for that. The ‘ggrepel’ package allows us to replace the usual ‘geom_text’ element with a ‘geom_text_repel’ element which tries avoid overlap.

library(ggrepel)
ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, 
                            y = Early.Career.Pay ) ) +
  geom_text_repel(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) )

Much better!

Next, the graph would be easier to read if the rankings on the x axis were flipped. How do we do this? A quick Google search reveals that “scale_x_reverse” is the function we’re looking for. Google will be a close friend when working with ggplot. This ggplot cheatsheet at tinyurl.com/rcheat1 might come in handy as well. The official reference for functions in ggplot is at tinyurl.com/rrefe1.

ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, 
                            y = Early.Career.Pay ) ) +
  geom_text_repel(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) ) +
  scale_x_reverse()

Our plot looks good so far, but we could add some more information to it. Let’s add in two more geometric elements. We map the size of the geom_point dots onto the populations of each school. And we map the color of these dots onto the “Ivy” column.

ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, 
                            y = Early.Career.Pay,
                            size = Student.Population,  # map size to Population
                            color = Ivy ) ) +    # map color to 'Ivy'
  geom_text_repel(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) ) +
  scale_x_reverse()

Sweet!

Now let’s make the graph prettier. We can improve the visual appeal of the plot by using predesigned themes. The ggplot2 package has some default themes, but the package ‘ggthemes’ has nicer ones. Let’s use a theme remniscient of plots from The Economist magazine, since this was our goal aesthetic.

We’ll also assign the plot to an object, p1, so we don’t need to keep pasting the entire command to tweak the plot.

library(ggthemes)

p1 <- ggplot(data = unis) +
  geom_point(mapping = aes( x = Uni.Ranking, 
                            y = Early.Career.Pay,
                            size = Student.Population,  # map size to Population
                            color = Ivy ) ) +    # map color to 'Ivy'
  geom_text_repel(mapping = aes(x = Uni.Ranking, 
                          y = Early.Career.Pay, 
                          label = School.Name.Short ) ) +
  scale_x_reverse() +
  theme_economist()

Finally, let’s change all the font elements. To do this we use the ‘theme()’ function in ggplot, which can take many arguments. A full list can be found at tinyurl.com/ggthemeref1. Within the theme function, we pass the element_text() argument to the piece of text we want to change (e.g ‘axis.text’ or ‘axis.title.y’). Then, within element_text(), we can specify font family, size, face and justification.

We’ll also add a title and caption. And we’ll modify all text labels.

library(ggthemes)

p1 <- p1 + labs(title = "Rankings and Median Salary for Graduates of Top 25 US Universities",
       caption = "Source: PayScale.com and US News and World Report",
    # change the x and y axis labels
       x = "2019 Ranking",
       y = "Median Salary 0-5 years after graduation, $",
    # change the legend labels
       color = "Ivy Status",
       size = "Student Population") +
    # change fonts
  theme(axis.text = element_text( size = 10, face = "bold"),
        axis.title.x = element_text( size = 14, face = 'bold'),
        axis.title.y = element_text( size = 14, face = 'bold'),
        legend.title = element_text(size = 11, face = "bold"),
        legend.text = element_text( size= 11),
        plot.title = element_text( size=16, face = "bold", hjust = 0.5),
        plot.caption = element_text( size=8, colour = "dark gray"))

And that’s as good as we’ll get for now Save your plot using the ‘Export’ button above the plotting pane, then proceed to the next assignment.

Your turn:

  1. Create a new plot in which your x axis is the percentage of STEM degrees
  2. Which of the two plots so far seems to better explain the differences in earnings?

4 Coefficient of determination

To better understand which of our independent variables better explains graduate earnings, we can calculate the Pearson’s correlation coefficient, r, for the relevant variables.

cor(unis$Early.Career.Pay, unis$Uni.Ranking)
# we notice a negative correlation. 
# This is because a bigger number for the ranking denotes a worse school.

The square of the Pearson’s correlation coefficient, R, gives us ‘R-squared’, also known as the coefficient of determination. This value represents the proportion of variance in the dependent variable (Salary) that is explained by the independent variable (Ranking in our case).

Let’s compute that value in the present case.

cor(unis$Early.Career.Pay, unis$Uni.Ranking)^2

Let’s now add this value as a subtitle in our plot. We use the ‘expression’ function to type superscript.And we add further styling elements for the subtitle.

p1 + labs(subtitle = expression(R^2 ~ '=' ~ '0.18')) +
  theme(plot.subtitle = element_text( size = 14, face = 'bold', hjust = 0.5))

Your turn:

  1. Repeat the above steps for the second plot, in which the independent variable is the percentage of graduates with STEM degrees.
  2. Which of the two plots better explains the differences in earnings?

5 Rinse and Repeat

We have come to the end of this lesson. Your job now is to replicate our analysis on the liberal arts colleges dataset (LAC salaries). You should import the data from the same folder(tinyurl.com/collegesal, clean it, then produce plots similar to those we have created in this exercise.

The datasets are more or less identical, so as long as you follow the steps we have followed here, you should emerge victorious.

Best of luck.