In this exercise you will examine ways to connect to external data sources such as Dryad, GBIF, or data stored in GitHub Repositories. You will also take a look at creating a simple species distribution model using data from GBIF and worldclim.org. To begin you need to install some packages you may not have used before.
There are several packages that allow you to connect to external data sources as well as format and display the data you have collect. To begin this exercise, install the following:
Often times you will find data available on websites or within a distributed data sources you want to include in your projects, but downloading and distributing the data with a markdown document or R script can be cumbersome. Additionally, the dataset may be large enough that it must be stored in your personal/business cloud storage for distribution even though it is already stored elsewhere on the web. Therefore, it is more convenient to connect to this data directly within your script and avoid secondary distribution all together.
Then you can copy the link at the top of the page to get direct access to the file. In this example, if you were to select the raw Campus_Point.kml file you would see a link that looks like this: https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml The link will provide you access to that specific files from the repository. Now you can use that file in our new project.
So for this example you are going to recreate the simple static campus map you made in the previous exercise. To do that you will need to link to the raw kml file above.
campus.kml <- st_read('https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml')
This simple line of code is all that is needed to create a spatialpointsdataframe object in our environment. To complete the map we can add the various aesthetics and options to create a custom look for your map.
campus.points <- campus.kml %>% mutate(x = unlist(map(campus.kml$geometry,1)),
y = unlist(map(campus.kml$geometry,2))) %>%
st_drop_geometry(campus.kml)
campus.map <- openmap(c(36.5360,-87.3570),c(36.5300,-87.3495), type='bing')
APSU <- openproj(campus.map, projection = "+proj=longlat +ellps=WGS84 +units=m +no_defs")
autoplot.OpenStreetMap(APSU) +
geom_point(data=campus.points, aes(x = x, y = y, color = Name), size = 4, alpha = 0.8) +
geom_text(data=campus.points,aes(x = x, y = y, label = Name), color="white", vjust=-0.75, fontface="bold") +
labs(x="Longtiude", y="Latitude") + theme(legend.position = "none")
Try it yourself! Link to the Main_Campus.kml file in that same folder and add a polygon of the main APSU campus to your example map above.
The only caveat to this method of adding data is the unknown longevity of the data source. Depending on the activity level of the website or repository, a simple change in storage structure could break the link location and necessitate a rewrite of your script to update to the new location. Additionally, changes in how GitHub handles data could also cause errors in your script.
Sites such as GBIF or Dryad aggregate data from scientists, governments, private organizations, etc. and distribute them on open-access services. These data are likely more stable than a GitHub Repository and should have consistent standards that apply to all datasets. This provides straight-forward access to a much larger catalog of data with better reproducibility.
To start, we are going to use the dismo
package to gain
access to a simple GBIF dataset. Although this is
slightly more complicated than linking to a ready-made dataset, sites
like GBIF provide access to well nearly two billion
records.
The first step in obtaining data is setting the extent of the bounding box for your search. While you can access global datasets, there are limits on the amount of information you can download and limitations on your local processing power that make doing so difficult. With the extent set, you provide the genus and species along with the following basic arguments:
geo: logical. If TRUE, only records that have a georeference (longitude and latitude values) will be downloaded
removeZeros: logical. If TRUE, all records that have a latitude OR longitude of zero will be removed if geo=TRUE; or set to NA if geo=FALSE. If FALSE, only records that have a latitude AND longitude that are zero will be removed or set to NA
download: logical. If TRUE, records will be downloaded, otherwise only the number of records will be shown
For this example you are going to obtain records for Pinus longaeva (Rocky Mountain Bristlecone Pine) within the western United States.
pilo.dismo <- gbif("pinus", species = "longaeva", ext = c(-130,-70,20,60),
geo = TRUE, download = TRUE, removeZeros = TRUE)
Once you have obtained the observations, the next step is to prepare it for display. To do this, we are going to obtain a base map of the US, locate the x,y fields in the dataset, and create our map of known PILO stands in the west.
us <- map_data("state")
ggplot() +
geom_polygon(data = us, aes(x=long, y = lat, group = group),
fill = "white", color="black") +
geom_point(data = pilo.dismo, aes(x=lon, y=lat)) +
xlab("Longitude") + ylab("Latitude") +
coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
xlab("Longitude") + ylab("Latitude") + ggtitle("Pinus longaeva Stands in the Western US") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "lightblue"))
Give it a shot! Change the species and extent information above to create a quick custom map of a species in your thesis or research project.
This is a simple example of using the GBIF options in
dismo
to create a quick map of species locations. Be
careful when using this sort of open data source. Quality control on the
datasets might not be to the standard required for a robust research
project as you will see in the next section.
While the dismo
package allowed you to create a quick
species map, there were only a few variables obtained with the data.
Using rgbif
will provide a more detailed look at the
species by including additional attributes.
For this part you will continue to look at Bristlecone Pine in the
Great Basin so you can evaluate the differences in the type of data and
distribution of PILO as compared to its relative Rocky Mountain
Bristlecone Pine (Pinus aristata). To begin with you will use
the occ_data
function to set the scientific name,
bounding coordinates, and requirements for PILO and PIAR on
GBIF. In this instance a limit is set to only include
up to the first 2000 records.
pilo.rgbif <- occ_data(scientificName = "Pinus longaeva",
hasCoordinate = TRUE, limit = 2000,
decimalLongitude = "-125, -65",
decimalLatitude = "24, 50")
piar.rgbif <- occ_data(scientificName = "Pinus aristata",
hasCoordinate = TRUE, limit = 2000,
decimalLongitude = "-125, -65",
decimalLatitude = "24, 50")
In this instance you now have several different records for PILO and
PIAR that includes more variables than the simple geographic data
collected above with dismo
. After examining the dataset,
View(pilo_rgbif)
, it is determined that only a handful of
the variables should be retained for further analysis.
pilo.rgbif.df <- cbind.data.frame(pilo.rgbif$data$species,
pilo.rgbif$data$decimalLatitude,
pilo.rgbif$data$decimalLongitude,
pilo.rgbif$data$stateProvince,
pilo.rgbif$data$verbatimLocality)
piar.rgbif.df <- cbind.data.frame(piar.rgbif$data$species,
piar.rgbif$data$decimalLatitude,
piar.rgbif$data$decimalLongitude,
piar.rgbif$data$stateProvince,
piar.rgbif$data$verbatimLocality)
colnames(pilo.rgbif.df) <- c("species","y","x","state","location")
colnames(piar.rgbif.df) <- c("species","y","x","state","location")
There are a number of steps in this code to consider. The data
obtained with rgbif
is in a large gbif/tbl_df (S3) class.
Because of the format, we will use cbind.data.frame
instead
of the standard cbind to ensure the data remains in the proper
format. Next, each of the selected variables were renamed. You can see
this simplified structure in the str(pilo.rgbif.df)
dataset.
Now you are ready to finalize the aesthetics and options for displaying the new gbif data.
ggplot() +
geom_polygon(data = us, aes(x=long, y = lat, group = group),
fill = "white", color="black") +
geom_point(data = pilo.rgbif.df, aes(x=x, y=y, color = species), size = 3) +
geom_point(data = piar.rgbif.df, aes(x=x, y=y, color = species), size = 3) +
coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
xlab("Longitude") + ylab("Latitude") + ggtitle("Bristlecone Stands in the Western US") +
guides(color=guide_legend("Legend", override.aes = list(size = 4))) +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "bottom") +
theme(legend.title.align = 0.5, legend.box.just = "center") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "lightblue"))
It is here where you can see a potential issue with the data. By viewing the Forest Service, Distribution and Occurrence information for Great Basin Bristlecone Pine you will see that it occurs “in a relatively narrow latitudinal range in California, Nevada, and Utah.” This information seems to track with the visualization above. However, viewing the same information for Rocky Mountain Bristlecone Pine you will find it “occurs in the southern Rocky Mountains of Colorado, New Mexico, and Arizona.” However, the distribution of PIAR on the visualization above extends into Utah, Nevada, and California overlapping the range of PIAL. This underlies the importance of understanding the data you are working with.
You could also use this data in leaflet
to make an
interactive version.
colors <- colorFactor(c("#F8766D","#00BA38","#619CFF"), pilo.rgbif.df$state)
leaflet(pilo.rgbif.df) %>%
addTiles() %>%
addCircleMarkers(pilo.rgbif.df$x,
pilo.rgbif.df$y,
popup = pilo.rgbif.df$species,
weight = 2,
color = colors(pilo.rgbif.df$state),
fillOpacity = 0.7) %>%
addMiniMap(position = 'topright',
width = 100,
height = 100,
toggleDisplay = FALSE) %>%
addScaleBar(position = "bottomright")
Data for this example is being downloaded from Schile, Lisa et al. (2016), Abu Dhabi Blue Carbon project, Dryad, Dataset, https://doi.org/10.15146/R3K59Z
You can download data directly from specific Dryad
publications to import into your environment. While there is a
dryad_download()
command in the rdryad
package
that allows you to link to a specific dataset based on the doi, the
function does not include robust features to allow for finite control
for where an output file will be saved. So for this example you will use
download.file()
. This function will work regardless of the
data format (*.csv, *.xlsx, *.zip) you are linking to. If you are
downloading a *.zip file you may need to use the unzip
package to open the data. For this example, the authors provided data a
*.xlsx file. Using download.file
you will identify the
link, destination path and file type, in mode information to ensure the
file is opened for writing in binary mode.
carbon <- download.file("https://datadryad.org/stash/downloads/file_stream/140080", destfile = "./carbon.xlsx", mode = "wb")
With this new dataset collected you have options on how to import the
file into your project. Using the readxl
package you can
click on the file dataset within the files tab and choose
Import Dataset… to use a GUI interface for
importing. This will open a new window where you can give the object a
name, select the specific sheet and use drop-down menus to change the
type of data, or choose to skip it on import. This can be beneficial
when you need to review the dataset prior to importing to
R. When you have selected all of the appropriate
options you can either select import or copy the code and paste it into
your script.
For this dataset the only necessary variables are: site, ecosystem, plot, core depth, latitude, and longitude. Either method above is acceptable for importing the data however only the code will be displayed below:
carbon <- readxl::read_excel("carbon.xlsx",
sheet = "plot information",
col_types = c("skip","skip", "text", "text",
"numeric", "numeric", "skip",
"skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip",
"skip", "numeric", "numeric",
"skip", "skip"))
Notice in the script the specific sheet was identified and variables were either skipped, or identified in the prefer format. With this information available you can create a simple visualization.
world <- map_data("worldHires")
uae <-map_data("worldHires", "United Arab Emirates")
main_map <- ggplot() +
geom_polygon(data = world, aes(x=long, y = lat, group = group),
fill = "gray", color="white") +
geom_polygon(data = uae, aes(x=long, y = lat, group = group),
fill = "white", color="black") +
geom_point(data = carbon, aes(Longitude, Latitude, color = Ecosystem, size = `core depth (cm)`)) +
coord_fixed(xlim = c(51.5,56.5), ylim = c(23.5,26.5)) +
xlab("Longitude") + ylab("Latitude") + ggtitle("Blue Carbon Project") +
guides(color=guide_legend("Ecosystems", override.aes = list(size = 3))) +
guides(size=guide_legend("Core Depth (cm)")) +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "right") +
theme(legend.title.align = 0.5, legend.box.just = "center") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "lightblue"))
main_map
Occasionally you may be interested in including a inset or indicator map on your visualization. These are smaller maps that are used to provide additional detail about a specific location. These maps are often small scale versions of the primary map, but sometimes can be larger scale to enhance detail in congested areas. In this case, the data is located in the United Arab Emirates. Because that area of the world is unfamiliar to some, an inset map will help them locate it geographically.
To begin this process you are going to create an additional ggplot object for countries of the world highlighting UAE.
inset <- ggplot() +
geom_polygon(data = world, aes(x=long, y = lat, group = group),
fill = "gray", color="white") +
geom_polygon(data = uae, aes(x=long, y = lat, group = group),
fill = "blue", color="black") +
coord_map(xlim = c(30,62), ylim = c(10,40), "polyconic") +
theme(panel.background = element_rect(fill = "lightblue"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(), axis.title.y=element_blank()) +
theme(plot.margin=grid::unit(c(0,0,0,0), "mm"))
inset
To create an overlay of these two maps you will use
ggdraw
from the cowplot package. In this
function draw_plot
calls to the maps created above, and x,y
and dimensions are provided to place the inset over the primary map.
Finally, using a number of the packages included above, we can examine factors that might influence the location of a particular species. For this example we will return to the PILO data and use bioclimatic variables from worldclim.org to examine the relationship between climate variables and PILO stands in the western US.
bioclim <- getData(name = "worldclim", res = 2.5, var = "bio", path = "./")
names(bioclim) <- c("Ann Mean Temp","Mean Diurnal Range","Isothermality","Temperature Seasonality",
"Max Temp Warmest Mo","Min Temp Coldest Mo","Ann Temp Range","Mean Temp Wettest Qtr",
"Mean Temp Driest Qtr","Mean Temp Warmest Qtr","Mean Temp Coldest Qtr","Annual
Precip","Precip Wettest Mo","Precip Driest Mo","Precip Seasonality","Precip Wettest
Qtr","Precip Driest Qtr","Precip Warmest Qtr","Precip Coldest Qtr")
bio.extent <- extent(x = c(
min(pilo.rgbif.df$x),
max(pilo.rgbif.df$x),
min(pilo.rgbif.df$y),
max(pilo.rgbif.df$y)))
bioclim.extent <- crop(x = bioclim, y = bio.extent)
bioclim.model <- bioclim(x = bioclim.extent, p = cbind(pilo.rgbif.df$x,pilo.rgbif.df$y))
presence.model <- dismo::predict(object = bioclim.model,
x = bioclim.extent,
ext = bio.extent)
In this script bioclimatic variable were generated and descriptive
names were provided. Because the bioclimatic variables are global, an
extent was created out of the PILO dataset to limit the analysis to
their geographic range and the extent was used to crop the variables.
Finally, a raster layer was created with a prediction based on the model
object. To view that information on a map you can use the
gplot
function which is a wrapper for ggplot from
the rasterVis package used to visualize raster
opjects.
rasterVis::gplot(presence.model) +
geom_polygon(data = us, aes(x= long, y = lat, group = group),
fill = "gray", color="black") +
geom_raster(aes(fill=value)) +
geom_polygon(data = us, aes(x= long, y = lat, group = group),
fill = NA, color="black") +
geom_point(data = pilo.rgbif.df, aes(x = x, y = y), size = 2, color = "black", alpha = 0.5) +
scale_fill_gradientn(colours=c("brown","yellow","darkgreen"), "Probability") +
coord_fixed(xlim = c(-122.5,-110.5), ylim = c(36,42.5)) +
xlab("Longitude") + ylab("Latitude") + ggtitle("Probability of PILO Occurrence") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position = "right") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "lightblue"))
This analysis included “the kitchen sink” of variables. The model
could be better specified by removing variables that are unrelated to
the growth response of PILO. Additionally you can see areas where PILO
has a high probability of occurrence but is not present. In Arizona,
this is the range of PIAR. You can also view this information in
leaflet
to examine the data interactively.
colors <- c("brown","yellow","darkgreen")
leaflet() %>%
addTiles() %>%
addRasterImage(presence.model, colors = colors, opacity = 0.8) %>%
addCircleMarkers(pilo.rgbif.df$x,
pilo.rgbif.df$y,
weight = 1,
color = "grey",
fillColor = "green",
fillOpacity = 0.7) %>%
addMiniMap(position = 'topright',
width = 100,
height = 100,
toggleDisplay = FALSE) %>%
addScaleBar(position = "bottomright")
Now it’s your turn! Although some of this might not be applicable to your thesis research, try this information out on a dataset of your choice from any of the sources above. If you can apply this to your thesis, then add it to your website! Otherwise create a new repository for your work on Thursday. Hint: If uploading this as a project to GitHub be sure to add the bioclim folder to your *.gitignore file to avoid errors relating to the size of folder exceeding GitHub standards.