1 What is R?

R is an open-source statistical language available for free download. It is maintained by the R Core Team, with support from the R Consortium, Inc. and the worldwide community of users, maintainers, and developers of R software. This workshop is for demonstration purposes only and is not meant to provide an exhaustive education in the R Project for Statistical Computing. The examples were developed primarily for educational purposes and are presented here as a demonstration of some of the capabilities of this open-source language and environment for statistical computing and graphics.

Throughout this workshop you will see screen captures and images from RStudio, an integrated development environment (IDE) for R that is maintained by RStudio, Inc. and is available in free open-source and commercial editions.

All of the code and data sourced for this workshop were developed or modified by the authors and have been made available for educational use through the links provided in this workbook. Additional materials in this workbook have been made freely available under Creative Commons licenses and distributed on open-source platforms.

2 In The Beginning

You must install R and RStudio on your PC, Mac, or Linux computer. RStudio is not a stand-alone program so please install R first:

R Download: https://mirrors.nics.utk.edu/cran/

RStudio Download: https://www.rstudio.com/products/rstudio/download/

RStudio operates as a quasi-graphical user interface (GUI) for R. RStudio typically opens in a three- or four-window layout:

RStudio Windows

Upper Left            Upper Right            Lower Left      Lower Right
Script/Data Window    History/Environment    Code/Console    Packages, Help, etc.

Script/Data Window: This area lets you develop script to be run in the console window and also view data sets.

History/Environment Window: In this section of RStudio you can see objects you have imported or created and a history of the code you have run.

Code/Console Window: All of the code you run from scripts will appear in the console window (along with any errors highlighted in red). You can also input code directly to this area, initiate help windows on various packages, view options for certain commands, etc.

Files, Plots, Packages, Help, Viewer Window: In this area you can load and view options within various packages (a collection of formatted functions, data, and compiled code that improve the functionality of the base R functions), view and export plots/graphics that have been created, view help options for packages/functions, and view files in your current working directory (explained later).

2.1 The Basics

When beginning any project in R, it can be beneficial to start a new R Script. This creates a record of the code you write that can be saved for future reference and sharing. To start a script press Ctrl+Shift+N on PC, Shift+⌘+N on Mac, or select New File and then R Script from the File drop-down menu in the menu bar.

Scripting Window

In the scripting window there are some limited debugging features that provide suggestions regarding errors in your code. You can see an example of this feature in the image below.

Debugging

If you get stuck, there are a lot of resources available through the Help tab in the lower-right window or by typing a ? before the name of any package or function you’re trying to run, such as ?setwd. There are also many resources available through Google searches and YouTube tutorials that walk you step-by-step through various packages or analyses.

Throughout this workshop we will use a number of different packages. These are collections of R functions, data, and compiled code that help perform specific tasks. While R provides some basic functions, packages expand the power of R, allowing you to perform more advanced functions and analyses. For each new package you will need to use the function install.packages(...) with the quoted name of the package between the parentheses to download it and library(...) to load the package into your working environment. You only need to download the package once, but you will need to load it in each new R session.
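The install-once, load-every-session pattern can be sketched as a small helper. This is only an illustration: use_package() is a hypothetical name, not a function that ships with R, and the example call uses the tools package because it is distributed with R and therefore needs no download.

```r
# Hypothetical helper (not part of R): install a package only when it is
# missing, then attach it for the current session.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # one-time download; needs internet access
  }
  library(pkg, character.only = TRUE)  # must be repeated in every new R session
  invisible(paste0("package:", pkg) %in% search())
}

# "tools" ships with R, so nothing is downloaded in this demonstration.
loaded <- use_package("tools")
print(loaded)
```

Calling use_package("readxl") would follow the same path, downloading the package first if it is not yet installed.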

3 Getting Started

To start, you want to either begin with a new project or establish your working directory. If the analysis will be part of a large project, or one that you plan to share with other researchers, it would be ideal to begin with a new project directory using version control. However, if you are looking to complete a quick analysis or are working on a class exercise, it may be sufficient to just set a working directory. This directory is where all of the datasets used for the project should be stored and where all of the output data, plots, etc. will be stored by default. Set this directory by using the following script:

setwd("...")

Between the parentheses and quotes you should input the folder path from your computer. This can be obtained by using Windows Explorer on a PC, Finder on a Mac, or find in Linux. One thing to note with R: path names use forward slashes (/) and not the single backslashes (\) that are common in Windows. Here is an example of a correct setwd("...") script:

Set wd

Once you have the script written in the scripting window, select the line and click Run to execute it. Notice that when you set the working directory, the path appears both at the top of the Console/Code window and in the console output:

Set wd and console
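As a rough sketch of the forward-slash rule, the snippet below converts a Windows-style path into the form R expects and then sets and restores the working directory. The path C:\Users\student\Documents\workshop is a made-up example, and tempdir() is used only because it exists on every machine; substitute your own project folder.

```r
# Hypothetical Windows path as copied from Explorer.
# Inside an R string each backslash has to be doubled.
win_path <- "C:\\Users\\student\\Documents\\workshop"

# Convert it to the forward-slash form that R prefers.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)

# Show setwd()/getwd() using a folder that exists on any computer.
old_wd <- getwd()      # remember where we started
setwd(tempdir())       # change the working directory
print(getwd())         # confirm the change
setwd(old_wd)          # restore the original working directory
```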

With the working directory set, it is time to import your data for analysis.

3.1 Importing Data

To begin, you will need to import one or more datasets into your working environment. As with all things in R and RStudio, there are a number of different ways to complete this task, including using a read function in R, using the import feature in RStudio, or using scripts to create the data. For the purpose of this workshop we begin by using a multi-tab Excel file.

Excel Dataset

So we will need to add a specific package called readxl and to do that we will use the following script to obtain, install, and load the package:

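install.packages("readxl")  # download and install; only needed once
library(readxl)             # load the package; needed in every new session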

If successful, the package should now be listed in the Packages tab with its box checked. To get information about the package you can type ?readxl into the console or click on the package name in the Packages tab. We are going to use the read_excel() function to load each tab individually as objects in our working environment. To learn more about the function, type ?read_excel in the console.

basic_stats <- read_excel("ChemData.xlsx", sheet="central tendency")
ttest <- read_excel("ChemData.xlsx", sheet="t-test")
anova <- read_excel("ChemData.xlsx", sheet="anova")
correlation <- read_excel("ChemData.xlsx", sheet="correlations")
regression <- read_excel("ChemData.xlsx",sheet = "regression")

When complete you should have a list of five (5) datasets in your working environment. By clicking on the arrow in the blue circle on the left of the object name you can expand the dataset to see what sort of information it contains. By clicking the sheet icon on the far right you can open the data to view its contents.

Data Environment
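The same inspection can be done from the console instead of by clicking. The sketch below uses a small made-up data frame (toy, with invented values) in place of the workshop data, so the numbers are placeholders only.

```r
# Made-up stand-in for one of the imported sheets.
toy <- data.frame(
  No        = 1:3,
  Student_2 = c(0.99, 1.00, 0.98),
  Student_3 = c(0.95, 0.96, 0.94)
)

dim(toy)        # number of rows and columns
names(toy)      # column names
str(toy)        # compact overview: type and first values of each column
head(toy, 2)    # first two rows
# View(toy)     # opens the spreadsheet-style viewer in RStudio
```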

3.2 Data Formatting and Analysis

If you examine the basic_stats dataset you will see a column labeled No, which refers to the trial number, and then a column for each student in a class. Because we want our dataset to contain only numerical measurements, we can use R to quickly format the dataset and remove the first column of data, using the row numbers instead to identify the trials.

Format Rows

To perform this action we will use the following command:

basic_stats[,1] <- NULL

basic_stats refers to the dataset and [,1] refers to the first column. Rows are addressed the other way around, with the index before the comma: [1,] refers to the first row, and basic_stats[-1,] would return the data without it. Using the head() function we can now view the first few rows of our newly formatted data.

head(basic_stats)
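To make the row and column indexing concrete, here is a minimal sketch on a made-up data frame (trials, with invented values). It shows the column removal used above, row removal with a negative index, and an alternative that drops a column by name.

```r
# Made-up data: a trial number plus two measurement columns.
trials <- data.frame(
  No        = 1:4,
  Student_2 = c(0.99, 1.00, 0.98, 1.01),
  Student_3 = c(0.95, 0.96, 0.94, 0.97)
)

# Drop the first column, as done for basic_stats above.
no_id <- trials
no_id[, 1] <- NULL

# Drop the first row: the index goes before the comma, and a negative
# index means "everything except".
no_first_row <- no_id[-1, ]

# Alternative: drop a column by name rather than by position.
by_name <- trials[, setdiff(names(trials), "No")]

head(no_id)
```

Dropping by name is a little safer than dropping by position if the column order of the source file might change.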

3.2.1 Summary Statistics

Now we can use the dataset to examine some basic statistics such as mean, median, standard deviation (sd), variance (var), min, max, range, and quantile, or we can see a summary of the data.

summary(basic_stats)
   Student_2        Student_3        Student_4        Student_5     
 Min.   :0.9933   Min.   :0.9343   Min.   :0.8611   Min.   :0.9741  
 1st Qu.:0.9954   1st Qu.:0.9559   1st Qu.:0.8739   1st Qu.:1.0226  
 Median :0.9969   Median :0.9581   Median :0.8792   Median :1.0247  
 Mean   :1.0457   Mean   :0.9572   Mean   :0.8803   Mean   :1.0428  
 3rd Qu.:0.9981   3rd Qu.:0.9607   3rd Qu.:0.8884   3rd Qu.:1.0370  
 Max.   :1.9768   Max.   :0.9653   Max.   :0.8942   Max.   :1.1705  
   Student_6        Student_7        Student_8       Student_9     
 Min.   :0.9018   Min.   :0.9992   Min.   :1.007   Min.   :0.5432  
 1st Qu.:0.9127   1st Qu.:1.0011   1st Qu.:1.008   1st Qu.:0.9961  
 Median :0.9200   Median :1.0018   Median :1.009   Median :1.0006  
 Mean   :0.9184   Mean   :1.0025   Mean   :1.009   Mean   :0.9774  
 3rd Qu.:0.9248   3rd Qu.:1.0026   3rd Qu.:1.010   3rd Qu.:1.0044  
 Max.   :0.9310   Max.   :1.0155   Max.   :1.012   Max.   :1.3930  
   Student_10       Student_11       Student_12      Student_13    
 Min.   :0.9730   Min.   :0.5261   Min.   :1.005   Min.   :0.9559  
 1st Qu.:0.9896   1st Qu.:0.9723   1st Qu.:1.006   1st Qu.:0.9830  
 Median :0.9913   Median :0.9826   Median :1.007   Median :0.9870  
 Mean   :0.9953   Mean   :0.9567   Mean   :1.008   Mean   :0.9847  
 3rd Qu.:1.0051   3rd Qu.:0.9894   3rd Qu.:1.009   3rd Qu.:0.9893  
 Max.   :1.0316   Max.   :0.9921   Max.   :1.013   Max.   :0.9914  
   Student_14      Student_15       Student_16       Student_17    
 Min.   :1.159   Min.   :0.9591   Min.   :0.9486   Min.   :0.9912  
 1st Qu.:1.179   1st Qu.:0.9962   1st Qu.:0.9580   1st Qu.:0.9939  
 Median :1.183   Median :0.9972   Median :0.9636   Median :0.9952  
 Mean   :1.181   Mean   :0.9971   Mean   :0.9669   Mean   :0.9947  
 3rd Qu.:1.184   3rd Qu.:0.9982   3rd Qu.:0.9727   3rd Qu.:0.9952  
 Max.   :1.187   Max.   :1.0333   Max.   :0.9876   Max.   :0.9972  
   Student_18      Student_19       Student_20   
 Min.   :1.004   Min.   :0.8837   Min.   :1.001  
 1st Qu.:1.006   1st Qu.:0.9815   1st Qu.:1.002  
 Median :1.007   Median :0.9830   Median :1.003  
 Mean   :1.007   Mean   :0.9833   Mean   :1.003  
 3rd Qu.:1.009   3rd Qu.:0.9842   3rd Qu.:1.003  
 Max.   :1.011   Max.   :1.0856   Max.   :1.004  

If we want to examine statistics not included in the summary above, we can use the script sapply(basic_stats, sd) and substitute any of the functions listed above for sd. This command applies the same function, standard deviation in this example, to every column of a dataset, in this case basic_stats.

sapply(basic_stats, sd)
   Student_2    Student_3    Student_4    Student_5    Student_6 
0.2191556659 0.0063999859 0.0098162159 0.0480434591 0.0087473247 
   Student_7    Student_8    Student_9   Student_10   Student_11 
0.0032817992 0.0014800854 0.1624048055 0.0126479638 0.1021407781 
  Student_12   Student_13   Student_14   Student_15   Student_16 
0.0020438150 0.0079614194 0.0061589307 0.0121377380 0.0123836928 
  Student_17   Student_18   Student_19   Student_20 
0.0014721942 0.0017635611 0.0327986545 0.0008405765 
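Swapping in other functions works the same way. The sketch below uses a made-up two-column data frame (toy) rather than the workshop data; note that functions returning more than one value per column, such as range or quantile, give a matrix instead of a named vector.

```r
# Made-up measurements for two students.
toy <- data.frame(
  Student_A = c(0.98, 1.00, 1.02, 0.99),
  Student_B = c(0.90, 0.95, 1.00, 1.05)
)

medians <- sapply(toy, median)   # one value per column -> named vector
ranges  <- sapply(toy, range)    # two values per column -> 2 x 2 matrix

# Extra arguments after the function name are passed on to it.
quartiles <- sapply(toy, quantile, probs = c(0.25, 0.75))

# An anonymous function handles statistics without a built-in name,
# here the coefficient of variation (sd divided by mean).
cv <- sapply(toy, function(x) sd(x) / mean(x))

print(medians)
print(ranges)
```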

3.2.2 Paired t-test

For this next analysis we are going to use the ttest dataset to run a paired t-test, which tests whether the mean difference between paired observations differs from zero. To do this we are going to use the $ operator to point to specific columns in our dataset. This operator can be used in a number of different data formats to narrow down which information will be used in a calculation. To perform the t-test we will use the following script:

t.test(ttest$Student_A,ttest$Student_B, paired = TRUE)

    Paired t-test

data:  ttest$Student_A and ttest$Student_B
t = 2.4924, df = 19, p-value = 0.02209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002010165 0.023080060
sample estimates:
mean of the differences 
             0.01254511 

We can vary the input to perform other versions of the t-test, such as a One Sample t-test on a single column, t.test(ttest$Student_A), which tests against a mean of 0 unless you set the mu argument, or a Welch Two Sample t-test, t.test(ttest$Student_A, ttest$Student_B, paired = FALSE). More information on this function can be found at ?t.test.
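One way to see what the paired option does is to compare it with a one-sample test on the differences, which should give the same t statistic and p-value. The sketch below uses invented measurements (toy), not the workshop data.

```r
# Made-up paired measurements: each row is the same trial for two students.
toy <- data.frame(
  Student_A = c(1.01, 0.99, 1.03, 1.00, 1.02, 0.98),
  Student_B = c(0.99, 0.98, 1.01, 1.00, 1.00, 0.97)
)

# Paired test: works on the row-by-row differences.
paired <- t.test(toy$Student_A, toy$Student_B, paired = TRUE)

# Equivalent one-sample test on those differences against a mean of 0.
one_sample <- t.test(toy$Student_A - toy$Student_B, mu = 0)

# Welch two-sample test: treats the columns as independent groups.
welch <- t.test(toy$Student_A, toy$Student_B, paired = FALSE)

print(paired$statistic)
print(one_sample$statistic)
print(welch$method)
```

Because the Welch test ignores the pairing, it will generally report a different t statistic and degrees of freedom for the same numbers.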

We can also use a boxplot to graph out the values in our dataset. This can be used as a diagnostic step to help us compare two variables. To create a boxplot we use the following:

boxplot(ttest$Student_A,ttest$Student_B)

What hypothesis would you have made regarding the t-test after seeing the results of the box plot?

3.2.3 ANOVA

An Analysis of Variance (ANOVA) is a powerful tool for examining the differences among group means in a sample. To run this analysis we will use the anova dataset to examine all of the students’ trials. In order to examine the differences among the groups we will use Tukey’s HSD (honestly significant difference) test. However, because the output tables can become quite large, we are going to use two additional packages, tidyverse and broom, to help organize the data.

library(broom)
library(tidyverse)
anova.results <- aov(Quantity~Group,data=anova)
summary(anova.results)
             Df Sum Sq Mean Sq F value Pr(>F)    
Group        18  1.265 0.07026   14.91 <2e-16 ***
Residuals   359  1.691 0.00471                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From our analysis we can see there is a significant difference in the amount of liquid (quantity) being pipetted by each student (group). However, we still need to run the post-hoc test to determine exactly where those significant differences are within the data. To do that we are going to use a function from broom called tidy to help format the table that results from the TukeyHSD function.

tidy(TukeyHSD(anova.results))

You can use the arrow in the header row to move the table to see columns to the right or the numbers at the bottom to see additional rows. The resulting table has 171 rows of data, some with significant results and some without. So we can instead use the filter function from the dplyr package to select only those comparisons where the adjusted p-value is less than 0.05.

tidy(TukeyHSD(anova.results)) %>% filter(adj.p.value<0.05)

This helps reduce the table to the 46 comparisons that have significant p-values. Remember you can use the arrow to see columns to the right or the numbers to see additional rows. We can use the ggplot2 package to make a plot of the data and identify outliers in the dataset.

library(ggplot2)
# Plot the raw data frame (anova) rather than the fitted model object
ggplot(anova, aes(x = Group, y = Quantity)) + 
  geom_boxplot(colour = "black", outlier.colour="red", outlier.shape = 8,   outlier.size = 2, fatten = 1) + 
  scale_x_discrete(limits = c('Student_2','Student_3','Student_4',
                              'Student_5','Student_6','Student_7',
                              'Student_8','Student_9','Student_10',
                              'Student_11','Student_12','Student_13',
                              'Student_14','Student_15','Student_16',
                              'Student_17','Student_18','Student_19',
                              'Student_20')) +
  labs(x="Students", y="Volume (mL)", title="Box Plot of Means") +
  theme(plot.title = element_text(hjust=0.5, face="bold")) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

In this plot can you identify outliers (red symbols) in the dataset?
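The 171 rows in the Tukey table are simply the number of ways to pair up 19 groups. As a sketch of the same workflow without the extra packages, the snippet below uses PlantGrowth, a small example dataset that ships with R (it is not the workshop data), and filters the comparisons with base R subsetting.

```r
# Number of pairwise comparisons among 19 students.
n_pairs <- choose(19, 2)
print(n_pairs)   # 171

# Same ANOVA + Tukey workflow on a built-in dataset with 3 groups.
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)$group   # matrix with columns diff, lwr, upr, p adj

nrow(tukey)                    # one row per pair of groups

# Keep only comparisons with an adjusted p-value below 0.05;
# drop = FALSE keeps the result a matrix even if a single row remains.
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```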

3.2.4 Regression

You can also examine the relationship between the volume and mass of each trial in the data. To do this we will run and plot a linear regression with volume as the dependent variable and mass as the independent predictor, matching the formula Volume~Mass used below.

regression.results <- lm(Volume~Mass, data = regression)
summary(regression.results)

Call:
lm(formula = Volume ~ Mass, data = regression)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.018805 -0.003122  0.001021  0.005849  0.009918 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.007527   0.006058   1.242    0.249    
Mass        0.972383   0.009598 101.307 1.01e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008962 on 8 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9991 
F-statistic: 1.026e+04 on 1 and 8 DF,  p-value: 1.007e-13

From these results we can see there is a significant relationship between mass and volume as you likely expected. To visualize this information we can create a plot of the data.

lm.coef <- round(coef(regression.results), 2)
# Mass goes on the x-axis so the points match the fitted line Volume ~ Mass
plot(regression$Mass, regression$Volume, pch = 21, cex = 1.3, col = "black", bg = "gray",
     main = "Volume vs Mass", xlab = "Mass (g)", ylab = "Volume (mL)")
abline(regression.results, col = "red")
r2 <- round(summary(regression.results)$r.squared, 4)
mtext(bquote(y == .(lm.coef[2]) * x + .(lm.coef[1]) * "," ~~ r^2 == .(r2) * " "), line = -18, adj = 1, padj = 0)

You can see from this plot how closely volume follows mass, as the r2 value for the model is approaching 1.0.
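Once a model is fitted, its pieces can be pulled out and reused. The sketch below fits the same kind of model to a few invented mass and volume values (toy), so the coefficients it prints are not the workshop results.

```r
# Made-up mass (g) and volume (mL) pairs.
toy <- data.frame(
  Mass   = c(0.10, 0.25, 0.50, 0.75, 1.00),
  Volume = c(0.11, 0.24, 0.51, 0.74, 1.01)
)

fit <- lm(Volume ~ Mass, data = toy)

coefs <- coef(fit)                 # intercept and slope
r2    <- summary(fit)$r.squared    # proportion of variance explained

# Predict the volume for a mass that was not measured.
predicted <- predict(fit, newdata = data.frame(Mass = 0.6))

print(coefs)
print(r2)
print(predicted)
```

The column name in newdata has to match the predictor name used in the formula, here Mass.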

4 How To Get Coding

Hopefully this quick workshop gave you an idea of how quickly we can run statistical analyses using freely available software and packages in R using the RStudio IDE. As the job market and admission to graduate schools become more competitive, students need to find ways to help push their applications to the top of the pile. One way, beyond your curriculum and grade point average, is to gain at least a cursory knowledge of one or more programming languages. While there are a number of languages that would be beneficial, those that are adept at large-scale data analysis are currently some of the most sought after. Two of the more popular languages for data science are Python and R. Each has its pros and cons, so you should spend some time determining which would be most beneficial to you. There are a lot of websites and YouTube videos dedicated to teaching these languages, but there are also books that can help you get started. Check out…

R for Dummies or Python for Dummies

Each of these books will thoroughly cover the basics of its language and allow you to gain more confidence moving forward in your analyses. Just remember that, like most other skills that take time to develop, it pays to be consistent, as your skills will diminish the less engaged you are over time.

