What is R?
R is an open-source statistical language that is free to download and is maintained by the R Consortium, Inc. and the worldwide community of users, maintainers, and developers of R software. This workshop is a demonstration only and is not meant to provide an exhaustive education in the R Project for Statistical Computing. The examples were developed primarily for educational purposes and are presented here to show some of the capabilities of this open-source language and environment for statistical computing and graphics. Throughout this workshop you will see screen captures and images from RStudio, an integrated development environment (IDE) for R that is maintained by RStudio, Inc. and is available in free open-source and commercial editions. All of the code and data used in this workshop were developed or modified by the authors and have been made available for educational use through the links provided in this workbook. Additional materials in this workbook have been made freely available under Creative Commons licenses and distributed on open-source platforms.
In The Beginning
You must install R and RStudio on your PC, Mac, or Linux computer. RStudio is not a stand-alone program so please install R first:
R Download: https://mirrors.nics.utk.edu/cran/
RStudio Download: https://www.rstudio.com/products/rstudio/download/
RStudio operates as a quasi-graphical user interface (GUI) for R. RStudio typically opens in a three or four window design:
Script/Data Window | History/Environment
Code/Console | Packages, Help, etc.
Script/Data Window: This area lets you develop script to be run in the console window and also view data sets.
History/Environment Window: In this section of RStudio you can see objects you have imported or created and a history of the code you have run.
Code/Console Window: All of the code you run from scripts will appear in the console window (along with any errors highlighted in red). You can also input code directly to this area, initiate help windows on various packages, view options for certain commands, etc.
Files, Plots, Packages, Help, Viewer Window: In this area you can load and view options within various packages (a collection of formatted functions, data, and compiled code that improve the functionality of the base R functions), view and export plots/graphics that have been created, view help options for packages/functions, and view files in your current working directory (explained later).
The Basics
When beginning any project in R, it can be beneficial to start a new R Script. This creates a record of your code that can be saved for future reference and sharing. To start a script press Ctrl+Shift+N on PC, Shift+⌘+N on Mac, or select R Script from the drop-down menu below File in the menu bar.
In the scripting window there are some limited debugging features that provide suggestions regarding errors in your code. You can see an example of this feature in the image below.
However, there are a lot of resources available through the Help tab in the lower-right window or by putting a ? before the name of any package or function you are trying to run, such as ?setwd. There are also many resources available through Google searches and YouTube tutorials that walk you step by step through various packages or analyses. Throughout this workshop we will use a number of different packages. These are collections of R functions, data, and compiled code that help perform specific tasks. While R provides some basic functions, packages expand the power of R, allowing you to perform more advanced functions and analyses. For each new package you will need to use the function install.packages(...) with the name of the package in quotes between the parentheses to download it, and library(...) to load the package into your working environment. You only need to download a package once, but you will need to load it in each new R session.
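The install-once, load-every-session pattern can be sketched as a small helper. The function name use_package() is made up for this illustration and is not part of R; it is demonstrated on stats, a package that ships with R, so no download is triggered.

```r
# use_package() is a hypothetical helper written for this sketch, not a base R function.
use_package <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed yet.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time download; needs an internet connection
  }
  # library() attaches the package; repeat this in every new R session.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call loads it without downloading anything.
loaded <- use_package("stats")

# Help pages open with ? followed by a function name, for example:
# ?setwd
```

Calling use_package("readxl") would follow the same path, downloading the package only when it is missing.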
Getting Started
To start, you want to either begin with a new project or establish your working directory. If you expect the analysis to be part of a large project, or you plan to share it with other researchers, it is best to begin with a new project directory using version control. However, if you are looking to complete a quick analysis or work on a class exercise, it might be sufficient to just create a working directory. This directory is where all of the datasets used for the project should be stored and where all of the output data, plots, etc. will be stored by default. Set this directory by using the following script:
setwd("…")
Between the parentheses and quotes you should input the folder path from your computer. This can be obtained by using Windows Explorer on a PC, Command+⌘ on a Mac, or find in Linux. One thing to note with R: path names use the forward slash (/) and not the backslash (\) that is common on Windows. Here is an example of a correct setwd("...") script:
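A minimal sketch of what such a call can look like. The folder in the comment is hypothetical; the runnable part uses the session's temporary folder as a stand-in so that it works on any computer.

```r
# A Windows-style path written with forward slashes (hypothetical folder):
# setwd("C:/Users/yourname/Documents/R_Workshop")

# Runnable stand-in: use the session's temporary folder as the working directory.
old_dir <- getwd()

# normalizePath() with winslash = "/" rewrites any backslashes as forward slashes.
work_dir <- normalizePath(tempdir(), winslash = "/")
setwd(work_dir)

# getwd() reports the directory R is currently working in.
current_dir <- getwd()
print(current_dir)

# Return to the original directory so the rest of the session is unaffected.
setwd(old_dir)
```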
Once you have the script written in the scripting window, select the line and click Run to execute it. Notice that when you set the working directory, that information also appears at the top of the Console/Code window:
With the working directory set, it is time to import your data for analysis.
Importing Data
To begin, you will need to import one or more datasets into your working environment. As with all things in R and RStudio, there are a number of different ways to complete this task, including using a read function in R, using the import feature in RStudio, or using scripts to create the data. For the purpose of this workshop we begin by using a multi-tab Excel file.
So we will need to add a specific package called readxl, and to do that we will use the following script to obtain, install, and load the package:
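A minimal version of such a script might look like the sketch below. The requireNamespace() check and the repos address (a standard CRAN mirror) are additions assumed here so the lines can also run outside RStudio; inside RStudio, install.packages("readxl") followed by library(readxl) is usually enough.

```r
# Download and install readxl only if it is missing (needs an internet connection).
# The repos value is an assumption for non-interactive use; RStudio normally has a mirror set.
if (!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl", repos = "https://cloud.r-project.org")
}

# Load readxl at the start of every R session that uses it.
library(readxl)
```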
If successful, you should now have the package listed under Packages in the packages tab, and it should be checked. To get information about the package you can type ?readxl into the console or click on the package name in the packages tab. We are going to use the read_excel() function to load each tab individually as an object in our working environment. To learn more about the function, type ?read_excel in the console.
basic_stats <- read_excel("ChemData.xlsx", sheet = "central tendency")
ttest <- read_excel("ChemData.xlsx", sheet = "t-test")
anova <- read_excel("ChemData.xlsx", sheet = "anova")
correlation <- read_excel("ChemData.xlsx", sheet = "correlations")
regression <- read_excel("ChemData.xlsx", sheet = "regression")
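As an alternative sketch, the same five tabs can be read in a loop and kept together in one named list. The object names sheet_names, workbook, and chem_data are made up for this illustration, and the import is only attempted when ChemData.xlsx and readxl are both available.

```r
# Tab names taken from the read_excel() calls above.
sheet_names <- c("central tendency", "t-test", "anova", "correlations", "regression")
workbook <- "ChemData.xlsx"

# Only attempt the import when the workbook and readxl are both available.
if (file.exists(workbook) && requireNamespace("readxl", quietly = TRUE)) {
  # excel_sheets() lists the tabs that actually exist in the file.
  print(readxl::excel_sheets(workbook))
  # Read every tab into one named list of data frames.
  chem_data <- lapply(sheet_names, function(s) readxl::read_excel(workbook, sheet = s))
  names(chem_data) <- sheet_names
  str(chem_data, max.level = 1)
} else {
  message("ChemData.xlsx or readxl not found; nothing was imported.")
}
```

With this layout a single tab is reached by name, for example chem_data[["t-test"]].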
When complete you should have a list of five (5) datasets in your working environment. By clicking on the arrow in the blue circle on the left of the object name you can expand the dataset to see what sort of information it contains. By clicking the sheet icon on the far right you can open the data to view its contents.
Data Formatting and Analysis
If you examine the basic_stats dataset you will see a column labeled No, which refers to the trial number, followed by a column for each student in a class. Because we want our dataset to contain only measured values, we can use R to quickly format the dataset by removing the first column and using the row numbers to identify the trials instead.
To perform this action we will use the following command:
basic_stats refers to the dataset and [,1] refers to the first column; putting a minus sign in front of the index, as in [,-1], drops that column. If we wanted to remove the first row instead, we would use [-1,]. Using the head() function we can now view the first few rows of our newly formatted data.
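The column-dropping step can be sketched on a small made-up table. The values in demo_stats are invented for illustration and are not the workshop data.

```r
# Made-up stand-in for basic_stats: a trial-number column plus two students.
demo_stats <- data.frame(
  No = 1:5,
  Student_2 = c(0.99, 1.00, 1.01, 0.98, 1.02),
  Student_3 = c(0.95, 0.96, 0.94, 0.97, 0.95)
)

# [, -1] keeps every row and drops the first column (No).
demo_stats <- demo_stats[, -1]

# head() prints the first rows (up to six) so the result can be checked.
head(demo_stats)
```

The row numbers that R prints on the left now play the role of the trial number.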
Summary Statistics
Now we can use the dataset to examine some basic statistics such as mean, median, standard deviation (sd), variance (var), min, max, range, and quantile, or we can see a summary of the data with summary(basic_stats).
Student_2 Student_3 Student_4 Student_5
Min. :0.9933 Min. :0.9343 Min. :0.8611 Min. :0.9741
1st Qu.:0.9954 1st Qu.:0.9559 1st Qu.:0.8739 1st Qu.:1.0226
Median :0.9969 Median :0.9581 Median :0.8792 Median :1.0247
Mean :1.0457 Mean :0.9572 Mean :0.8803 Mean :1.0428
3rd Qu.:0.9981 3rd Qu.:0.9607 3rd Qu.:0.8884 3rd Qu.:1.0370
Max. :1.9768 Max. :0.9653 Max. :0.8942 Max. :1.1705
Student_6 Student_7 Student_8 Student_9
Min. :0.9018 Min. :0.9992 Min. :1.007 Min. :0.5432
1st Qu.:0.9127 1st Qu.:1.0011 1st Qu.:1.008 1st Qu.:0.9961
Median :0.9200 Median :1.0018 Median :1.009 Median :1.0006
Mean :0.9184 Mean :1.0025 Mean :1.009 Mean :0.9774
3rd Qu.:0.9248 3rd Qu.:1.0026 3rd Qu.:1.010 3rd Qu.:1.0044
Max. :0.9310 Max. :1.0155 Max. :1.012 Max. :1.3930
Student_10 Student_11 Student_12 Student_13
Min. :0.9730 Min. :0.5261 Min. :1.005 Min. :0.9559
1st Qu.:0.9896 1st Qu.:0.9723 1st Qu.:1.006 1st Qu.:0.9830
Median :0.9913 Median :0.9826 Median :1.007 Median :0.9870
Mean :0.9953 Mean :0.9567 Mean :1.008 Mean :0.9847
3rd Qu.:1.0051 3rd Qu.:0.9894 3rd Qu.:1.009 3rd Qu.:0.9893
Max. :1.0316 Max. :0.9921 Max. :1.013 Max. :0.9914
Student_14 Student_15 Student_16 Student_17
Min. :1.159 Min. :0.9591 Min. :0.9486 Min. :0.9912
1st Qu.:1.179 1st Qu.:0.9962 1st Qu.:0.9580 1st Qu.:0.9939
Median :1.183 Median :0.9972 Median :0.9636 Median :0.9952
Mean :1.181 Mean :0.9971 Mean :0.9669 Mean :0.9947
3rd Qu.:1.184 3rd Qu.:0.9982 3rd Qu.:0.9727 3rd Qu.:0.9952
Max. :1.187 Max. :1.0333 Max. :0.9876 Max. :0.9972
Student_18 Student_19 Student_20
Min. :1.004 Min. :0.8837 Min. :1.001
1st Qu.:1.006 1st Qu.:0.9815 1st Qu.:1.002
Median :1.007 Median :0.9830 Median :1.003
Mean :1.007 Mean :0.9833 Mean :1.003
3rd Qu.:1.009 3rd Qu.:0.9842 3rd Qu.:1.003
Max. :1.011 Max. :1.0856 Max. :1.004
If we want to examine statistics not included in the summary above, we can use the script sapply(basic_stats, sd) and substitute any of the functions listed above for sd. This command applies the same function, standard deviation in this example, to every column of a dataset, in this case basic_stats.
Student_2 Student_3 Student_4 Student_5 Student_6
0.2191556659 0.0063999859 0.0098162159 0.0480434591 0.0087473247
Student_7 Student_8 Student_9 Student_10 Student_11
0.0032817992 0.0014800854 0.1624048055 0.0126479638 0.1021407781
Student_12 Student_13 Student_14 Student_15 Student_16
0.0020438150 0.0079614194 0.0061589307 0.0121377380 0.0123836928
Student_17 Student_18 Student_19 Student_20
0.0014721942 0.0017635611 0.0327986545 0.0008405765
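The summary() and sapply() calls can be tried on a small made-up table. The numbers in demo_stats are invented for illustration; only the function calls mirror the workshop.

```r
# Made-up measurements for two students (not the workshop data).
demo_stats <- data.frame(
  Student_2 = c(0.99, 1.00, 1.01, 0.98, 1.02),
  Student_3 = c(0.95, 0.96, 0.94, 0.97, 0.95)
)

# Six-number summary for every column at once.
summary(demo_stats)

# sapply() applies one function to each column and returns a named vector.
sds <- sapply(demo_stats, sd)
medians <- sapply(demo_stats, median)

# Extra arguments after the function name are passed along to it.
quartiles <- sapply(demo_stats, quantile, probs = c(0.25, 0.75))

print(sds)
print(medians)
print(quartiles)
```

When the applied function returns more than one value per column, as quantile() does here, sapply() returns a matrix with one column per student.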
Paired t-test
For this next analysis we are going to use the ttest dataset to run a paired t-test, which tests whether the mean difference between paired observations is zero. To do this we are going to use the $ operator to point to specific columns in our dataset. This operator can be used with a number of different data structures to narrow down which information will be used in a calculation. To perform the t-test we will use the following script:
t.test(ttest$Student_A,ttest$Student_B, paired = TRUE)
Paired t-test
data: ttest$Student_A and ttest$Student_B
t = 2.4924, df = 19, p-value = 0.02209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.002010165 0.023080060
sample estimates:
mean of the differences
0.01254511
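We can vary the input to perform variations on the t-test, such as a One Sample t-test on a single column, t.test(ttest$Student_A, mu = 1), where mu is the hypothesized mean (1 is used here only as an illustration), or a Welch Two Sample t-test, t.test(ttest$Student_A, ttest$Student_B, paired = FALSE). More information on this function can be found at ?t.test.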
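The three variations can be compared side by side on simulated data. The vectors student_a and student_b below are randomly generated stand-ins, not the workshop measurements, and the mu value is an arbitrary target chosen for the sketch.

```r
# Simulated paired measurements (made up for illustration).
set.seed(1)
student_a <- rnorm(20, mean = 1.00, sd = 0.02)
student_b <- student_a - rnorm(20, mean = 0.01, sd = 0.02)

# Paired t-test: works on the 20 within-pair differences.
paired_res <- t.test(student_a, student_b, paired = TRUE)

# Welch two-sample t-test: treats the two columns as independent groups.
welch_res <- t.test(student_a, student_b, paired = FALSE)

# One-sample t-test: compares one column against a hypothesized mean.
one_res <- t.test(student_a, mu = 1)

# A paired test is the same as a one-sample test on the differences.
diff_res <- t.test(student_a - student_b)

print(paired_res$p.value)
print(welch_res$p.value)
print(one_res$p.value)
```

The paired test has 19 degrees of freedom here (20 pairs minus 1), the same df reported in the workshop output above.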
We can also use a boxplot to graph out the values in our dataset. This can be used as a diagnostic step to help us compare two variables. To create a boxplot we use the following:
boxplot(ttest$Student_A,ttest$Student_B)

What hypothesis would you have made regarding the t-test after seeing the results of the box plot?
ANOVA
An Analysis of Variance (ANOVA) is a powerful tool for examining the differences among group means in a sample. To run this analysis we will use the anova dataset to examine all of the students' trials. To examine the differences among the groups we will use Tukey’s HSD (honestly significant difference) test. However, because the output tables can become quite large, we are going to use two additional packages to help organize the data. These are the tidyverse and broom packages.
library(broom)
library(tidyverse)
anova.results <- aov(Quantity~Group,data=anova)
summary(anova.results)
Df Sum Sq Mean Sq F value Pr(>F)
Group 18 1.265 0.07026 14.91 <2e-16 ***
Residuals 359 1.691 0.00471
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From our analysis we can see there is a significant difference in the amount of liquid (quantity) being pipetted by each student (group). However, we still need to run the post-hoc test to determine exactly where those significant differences are within the data. To do that we are going to use a function from broom called tidy to help format the table that results from the TukeyHSD function.
tidy(TukeyHSD(anova.results))
You can use the arrow in the header row to move the table to see columns to the right, or the numbers at the bottom to see additional rows. The resulting table has 171 rows of data, one for each pair of students; some of these comparisons are significant and some are not. So we can instead use the filter function from the dplyr package to select only those comparisons where the adjusted p-value is less than 0.05.
tidy(TukeyHSD(anova.results)) %>% filter(adj.p.value<0.05)
This reduces the table to the 46 comparisons that have significant p-values. Remember you can use the arrow to see columns to the right or the numbers to see additional rows. We can use the ggplot2 package to make a plot of the data and identify outliers in the dataset.
# Plot the raw data frame (anova) rather than the fitted model object
ggplot(anova, aes(x = Group, y = Quantity)) +
  geom_boxplot(colour = "black", outlier.colour = "red", outlier.shape = 8, outlier.size = 2, fatten = 1) +
  # Put the students in numeric order on the x-axis
  scale_x_discrete(limits = c('Student_2', 'Student_3', 'Student_4',
                              'Student_5', 'Student_6', 'Student_7',
                              'Student_8', 'Student_9', 'Student_10',
                              'Student_11', 'Student_12', 'Student_13',
                              'Student_14', 'Student_15', 'Student_16',
                              'Student_17', 'Student_18', 'Student_19',
                              'Student_20')) +
  labs(x = "Students", y = "Volume (mL)", title = "Box Plot of Means") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

In this plot can you identify outliers (red symbols) in the dataset?
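The ANOVA and post-hoc filtering steps in this section can also be sketched with base R alone, without broom or dplyr. The three groups below are simulated for illustration, and the object names (demo, fit, tukey_table, significant) are made up for this sketch.

```r
# Simulated data: groups A and B share a mean, group C is clearly higher.
set.seed(42)
demo <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 10)),
  Quantity = c(rnorm(10, mean = 1.0, sd = 0.02),
               rnorm(10, mean = 1.0, sd = 0.02),
               rnorm(10, mean = 1.2, sd = 0.02))
)

# One-way ANOVA followed by Tukey's HSD post-hoc test.
fit <- aov(Quantity ~ Group, data = demo)
tukey <- TukeyHSD(fit)

# The result holds one matrix per factor; "p adj" is the adjusted p-value column.
tukey_table <- tukey$Group
significant <- tukey_table[tukey_table[, "p adj"] < 0.05, , drop = FALSE]
print(significant)

# Number of pairwise comparisons among k groups is choose(k, 2).
choose(3, 2)
choose(19, 2)
```

The last line shows where the 171 rows in the workshop table come from: 19 students give choose(19, 2) pairwise comparisons.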
Regression
You can also examine the relationship between the volume and mass of each trial in the data. To do this we will run and plot a linear regression with volume as the dependent variable and mass as the independent predictor.
regression.results <- lm(Volume~Mass, data = regression)
summary(regression.results)
Call:
lm(formula = Volume ~ Mass, data = regression)
Residuals:
Min 1Q Median 3Q Max
-0.018805 -0.003122 0.001021 0.005849 0.009918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007527 0.006058 1.242 0.249
Mass 0.972383 0.009598 101.307 1.01e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.008962 on 8 degrees of freedom
Multiple R-squared: 0.9992, Adjusted R-squared: 0.9991
F-statistic: 1.026e+04 on 1 and 8 DF, p-value: 1.007e-13
From these results we can see there is a significant relationship between mass and volume as you likely expected. To visualize this information we can create a plot of the data.
# Round the intercept and slope for the plot annotation
lm.coef <- round(coef(regression.results), 2)
# Predictor (Mass) on the x-axis and response (Volume) on the y-axis, matching lm(Volume ~ Mass)
plot(regression$Mass, regression$Volume, pch = 21, cex = 1.3, col = "black", bg = "gray", main = "Volume vs Mass", xlab = "Mass (g)", ylab = "Volume (mL)")
abline(regression.results, col = "red")
r2 <- round(summary(regression.results)$r.squared, 4)
mtext(bquote(y==.(lm.coef[2])*x+.(lm.coef[1])*","~~r^2==.(r2)*" "), line = -18, adj = 1, padj = 0)

You can see from this plot how closely volume follows mass, as the R-squared value for the model is approaching 1.0.
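The pieces of a fitted model can be pulled out individually, as in the sketch below. The mass and volume values are simulated around an assumed line (slope 0.97, intercept 0.01) purely for illustration and are not the workshop data.

```r
# Simulated measurements scattered around an assumed straight line.
set.seed(7)
mass <- seq(0.1, 1.0, by = 0.1)
volume <- 0.01 + 0.97 * mass + rnorm(length(mass), sd = 0.005)

# Response on the left of ~, predictor on the right.
demo_fit <- lm(volume ~ mass)

# coef() returns the intercept and slope; summary() holds R-squared.
intercept <- unname(coef(demo_fit)["(Intercept)"])
slope <- unname(coef(demo_fit)["mass"])
r2 <- summary(demo_fit)$r.squared

# predict() applies the fitted line to a new mass value.
predicted <- predict(demo_fit, newdata = data.frame(mass = 0.55))

print(c(intercept = intercept, slope = slope, r.squared = r2))
print(predicted)
```

The prediction is simply intercept + slope * 0.55, which is the same line that abline() draws on the plot.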
How To Get Coding
Hopefully this quick workshop gave you an idea of how quickly we can run statistical analyses using freely available software and packages in R with the RStudio IDE. As the job market and admission to graduate schools become more competitive, students need to find ways to help push their applications to the top of the pile. One way, beyond your curriculum and grade point average, is to gain at least a cursory knowledge of one or more programming languages. While there are a number of languages that would be beneficial, those suited to large-scale data analysis are currently some of the most sought after. Two of the more popular languages for data science are Python and R. Each of them has its pros and cons, so you should spend some time determining which would be most beneficial to you. There are a lot of websites and YouTube videos dedicated to teaching these languages, but there are also books that can help you get started. Check out…
R for Dummies or Python for Dummies
Each of these books thoroughly covers the basics of its language and will help you gain confidence moving forward in your analyses. Just remember that, as with most other skills that take time to develop, it pays to be consistent: your skills will diminish the less you practice over time.
