1 Introduction

In this exercise you will examine ways to connect to external data sources such as Dryad, GBIF, or data stored in GitHub Repositories. You will also take a look at creating a simple species distribution model using data from GBIF and worldclim.org. To begin you need to install some packages you may not have used before.

1.1 Packages used in this exercise

There are several packages that allow you to connect to external data sources as well as format and display the data you have collect. To begin this exercise, install the following:

packages<-c("cowplot","dismo","leaflet","mapdata","OpenStreetMap","rasterVis","rdryad","rgbif","sf","tidyverse")
sapply(packages, library, character.only=T)

2 Connecting to External Data Sources

Often times you will find data available on websites or within a distributed data sources you want to include in your projects, but downloading and distributing the data with a markdown document or R script can be cumbersome. Additionally, the dataset may be large enough that it must be stored in your personal/business cloud storage for distribution even though it is already stored elsewhere on the web. Therefore, it is more convenient to connect to this data directly within your script and avoid secondary distribution all together.

2.1 Connecting to GitHub or other Websites

For example, in a previous exercise, two *.kml files of the APSU campus were included in the data folder. If you wanted to use those in a new project you could download the data, include them in your data distribution, and remind users they must download the data in order for the script to work. Alternatively you could navigate to the repository and link directly to the data. To do this, you simply go to the Data Folder in the repository, click on the file you need, and click the Raw button along the top of the script window.

Raw Button

Then you can copy the link at the top of the page to get direct access to the file. In this example, if you were to select the raw Campus_Point.kml file you would see a link that looks like this: https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml The link will provide you access to that specific files from the repository. Now you can use that file in our new project.

So for this example you are going to recreate the simple static campus map you made in the previous exercise. To do that you will need to link to the raw kml file above.

campus.kml <- st_read('https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml')

This simple line of code is all that is needed to create a spatialpointsdataframe object in our environment. To complete the map we can add the various aesthetics and options to create a custom look for your map.

campus.points <- campus.kml %>% mutate(x = unlist(map(campus.kml$geometry,1)),
                                       y = unlist(map(campus.kml$geometry,2))) %>%
                                st_drop_geometry(campus.kml)

campus.map <- openmap(c(36.5360,-87.3570),c(36.5300,-87.3495), type='bing')
APSU <- openproj(campus.map, projection = "+proj=longlat +ellps=WGS84 +units=m +no_defs")

autoplot.OpenStreetMap(APSU) +
  geom_point(data=campus.points, aes(x = x, y = y, color = Name), size = 4, alpha = 0.8) +
  geom_text(data=campus.points,aes(x = x, y = y, label = Name), color="white", vjust=-0.75, fontface="bold") +
  labs(x="Longtiude", y="Latitude") + theme(legend.position = "none")

Try it yourself! Link to the Main_Campus.kml file in that same folder and add a polygon of the main APSU campus to your example map above.

The only caveat to this method of adding data is the unknown longevity of the data source. Depending on the activity level of the website or repository, a simple change in storage structure could break the link location and necessitate a rewrite of your script to update to the new location. Additionally, changes in how GitHub handles data could also cause errors in your script.

2.2 Connecting to Data Distribution Networks

Sites such as GBIF or Dryad aggregate data from scientists, governments, private organizations, etc. and distribute them on open-access services. These data are likely more stable than a GitHub Repository and should have consistent standards that apply to all datasets. This provides straight-forward access to a much larger catalog of data with better reproducibility.

2.2.1 DISMO

To start, we are going to use the dismo package to gain access to a simple GBIF dataset. Although this is slightly more complicated than linking to a ready-made dataset, sites like GBIF provide access to well nearly two billion records.

The first step in obtaining data is setting the extent of the bounding box for your search. While you can access global datasets, there are limits on the amount of information you can download and limitations on your local processing power that make doing so difficult. With the extent set, you provide the genus and species along with the following basic arguments:

geo: logical. If TRUE, only records that have a georeference (longitude and latitude values) will be downloaded
removeZeros: logical. If TRUE, all records that have a latitude OR longitude of zero will be removed if geo=TRUE; or set to NA if geo=FALSE. If FALSE, only records that have a latitude AND longitude that are zero will be removed or set to NA
download: logical. If TRUE, records will be downloaded, otherwise only the number of records will be shown

For this example you are going to obtain records for Pinus longaeva (Rocky Mountain Bristlecone Pine) within the western United States.

pilo.dismo <- gbif("pinus", species = "longaeva", ext = c(-130,-70,20,60),
                   geo = TRUE, download = TRUE, removeZeros = TRUE)

Once you have obtained the observations, the next step is to prepare it for display. To do this, we are going to obtain a base map of the US, locate the x,y fields in the dataset, and create our map of known PILO stands in the west.

us <- map_data("state")

ggplot() +
  geom_polygon(data = us, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = pilo.dismo, aes(x=lon, y=lat)) + 
  xlab("Longitude") + ylab("Latitude") +
  coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Pinus longaeva Stands in the Western US") + 
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))

Give it a shot! Change the species and extent information above to create a quick custom map of a species in your thesis or research project.

This is a simple example of using the GBIF options in dismo to create a quick map of species locations. Be careful when using this sort of open data source. Quality control on the datasets might not be to the standard required for a robust research project as you will see in the next section.

2.2.2 RGBIF

While the dismo package allowed you to create a quick species map, there were only a few variables obtained with the data. Using rgbif will provide a more detailed look at the species by including additional attributes.

For this part you will continue to look at Bristlecone Pine in the Great Basin so you can evaluate the differences in the type of data and distribution of PILO as compared to its relative Rocky Mountain Bristlecone Pine (Pinus aristata). To begin with you will use the occ_data function to set the scientific name, bounding coordinates, and requirements for PILO and PIAR on GBIF. In this instance a limit is set to only include up to the first 2000 records.

pilo.rgbif <- occ_data(scientificName = "Pinus longaeva",
                       hasCoordinate = TRUE, limit = 2000,
                       decimalLongitude = "-125, -65", 
                       decimalLatitude = "24, 50")

piar.rgbif <- occ_data(scientificName = "Pinus aristata",
                       hasCoordinate = TRUE, limit = 2000,
                       decimalLongitude = "-125, -65", 
                       decimalLatitude = "24, 50")

In this instance you now have several different records for PILO and PIAR that includes more variables than the simple geographic data collected above with dismo. After examining the dataset, View(pilo_rgbif), it is determined that only a handful of the variables should be retained for further analysis.

pilo.rgbif.df <- cbind.data.frame(pilo.rgbif$data$species,
                                  pilo.rgbif$data$decimalLatitude,
                                  pilo.rgbif$data$decimalLongitude,
                                  pilo.rgbif$data$stateProvince,
                                  pilo.rgbif$data$verbatimLocality)

piar.rgbif.df <- cbind.data.frame(piar.rgbif$data$species,
                                  piar.rgbif$data$decimalLatitude,
                                  piar.rgbif$data$decimalLongitude,
                                  piar.rgbif$data$stateProvince,
                                  piar.rgbif$data$verbatimLocality)

colnames(pilo.rgbif.df) <- c("species","y","x","state","location")
colnames(piar.rgbif.df) <- c("species","y","x","state","location")

There are a number of steps in this code to consider. The data obtained with rgbif is in a large gbif/tbl_df (S3) class. Because of the format, we will use cbind.data.frame instead of the standard cbind to ensure the data remains in the proper format. Next, each of the selected variables were renamed. You can see this simplified structure in the str(pilo.rgbif.df) dataset.

Now you are ready to finalize the aesthetics and options for displaying the new gbif data.

ggplot() +
  geom_polygon(data = us, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = pilo.rgbif.df, aes(x=x, y=y, color = species), size = 3) +
  geom_point(data = piar.rgbif.df, aes(x=x, y=y, color = species), size = 3) +  
  coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Bristlecone Stands in the Western US") + 
  guides(color=guide_legend("Legend", override.aes = list(size = 4))) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(legend.position = "bottom") +
  theme(legend.title.align = 0.5, legend.box.just = "center") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))

It is here where you can see a potential issue with the data. By viewing the Forest Service, Distribution and Occurrence information for Great Basin Bristlecone Pine you will see that it occurs “in a relatively narrow latitudinal range in California, Nevada, and Utah.” This information seems to track with the visualization above. However, viewing the same information for Rocky Mountain Bristlecone Pine you will find it “occurs in the southern Rocky Mountains of Colorado, New Mexico, and Arizona.” However, the distribution of PIAR on the visualization above extends into Utah, Nevada, and California overlapping the range of PIAL. This underlies the importance of understanding the data you are working with.

You could also use this data in leaflet to make an interactive version.

colors <- colorFactor(c("#F8766D","#00BA38","#619CFF"), pilo.rgbif.df$state)

leaflet(pilo.rgbif.df) %>% 
  addTiles() %>% 
  addCircleMarkers(pilo.rgbif.df$x,
                   pilo.rgbif.df$y,
                   popup = pilo.rgbif.df$species,
                   weight = 2,
                   color = colors(pilo.rgbif.df$state),
                   fillOpacity = 0.7) %>%
  addMiniMap(position = 'topright',
             width = 100, 
             height = 100,
             toggleDisplay = FALSE) %>%
  addScaleBar(position = "bottomright")

2.2.3 DRYAD

Data for this example is being downloaded from Schile, Lisa et al. (2016), Abu Dhabi Blue Carbon project, Dryad, Dataset, https://doi.org/10.15146/R3K59Z

You can download data directly from specific Dryad publications to import into your environment. While there is a dryad_download() command in the rdryad package that allows you to link to a specific dataset based on the doi, the function does not include robust features to allow for finite control for where an output file will be saved. So for this example you will use download.file(). This function will work regardless of the data format (*.csv, *.xlsx, *.zip) you are linking to. If you are downloading a *.zip file you may need to use the unzip package to open the data. For this example, the authors provided data a *.xlsx file. Using download.file you will identify the link, destination path and file type, in mode information to ensure the file is opened for writing in binary mode.

carbon <- download.file("https://datadryad.org/stash/downloads/file_stream/140080", destfile = "./carbon.xlsx", mode = "wb")

With this new dataset collected you have options on how to import the file into your project. Using the readxl package you can click on the file dataset within the files tab and choose Import Dataset… to use a GUI interface for importing. This will open a new window where you can give the object a name, select the specific sheet and use drop-down menus to change the type of data, or choose to skip it on import. This can be beneficial when you need to review the dataset prior to importing to R. When you have selected all of the appropriate options you can either select import or copy the code and paste it into your script.

For this dataset the only necessary variables are: site, ecosystem, plot, core depth, latitude, and longitude. Either method above is acceptable for importing the data however only the code will be displayed below:

carbon <- readxl::read_excel("carbon.xlsx", 
                          sheet = "plot information", 
                          col_types = c("skip","skip", "text", "text", 
                                        "numeric", "numeric", "skip", 
                                        "skip", "skip", "skip", "skip", 
                                        "skip", "skip", "skip", "skip",
                                        "skip", "numeric", "numeric", 
                                        "skip", "skip"))

Notice in the script the specific sheet was identified and variables were either skipped, or identified in the prefer format. With this information available you can create a simple visualization.

world <- map_data("worldHires")
uae <-map_data("worldHires", "United Arab Emirates")

main_map <- ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group),
               fill = "gray", color="white") +
  geom_polygon(data = uae, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = carbon, aes(Longitude, Latitude, color = Ecosystem, size = `core depth (cm)`)) +
  coord_fixed(xlim = c(51.5,56.5), ylim = c(23.5,26.5)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Blue Carbon Project") + 
  guides(color=guide_legend("Ecosystems", override.aes = list(size = 3))) + 
  guides(size=guide_legend("Core Depth (cm)")) +  
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(legend.position = "right") +
  theme(legend.title.align = 0.5, legend.box.just = "center") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))
main_map

2.3 Creating an Inset Map

Occasionally you may be interested in including a inset or indicator map on your visualization. These are smaller maps that are used to provide additional detail about a specific location. These maps are often small scale versions of the primary map, but sometimes can be larger scale to enhance detail in congested areas. In this case, the data is located in the United Arab Emirates. Because that area of the world is unfamiliar to some, an inset map will help them locate it geographically.

To begin this process you are going to create an additional ggplot object for countries of the world highlighting UAE.

inset <- ggplot() + 
  geom_polygon(data = world, aes(x=long, y = lat, group = group),
               fill = "gray", color="white") +
  geom_polygon(data = uae, aes(x=long, y = lat, group = group),
               fill = "blue", color="black") +
  coord_map(xlim = c(30,62), ylim = c(10,40), "polyconic") +
  theme(panel.background = element_rect(fill = "lightblue"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(),axis.ticks=element_blank(), 
        axis.title.x=element_blank(), axis.title.y=element_blank()) +
  theme(plot.margin=grid::unit(c(0,0,0,0), "mm"))
inset

To create an overlay of these two maps you will use ggdraw from the cowplot package. In this function draw_plot calls to the maps created above, and x,y and dimensions are provided to place the inset over the primary map.

ggdraw() +
  draw_plot(main_map) + 
  draw_plot(inset, x = 0.024, y = 0.595, width = 0.25, height = 0.25)

3 Simple Species Distribution Map

Finally, using a number of the packages included above, we can examine factors that might influence the location of a particular species. For this example we will return to the PILO data and use bioclimatic variables from worldclim.org to examine the relationship between climate variables and PILO stands in the western US.

bioclim <- getData(name = "worldclim", res = 2.5, var = "bio", path = "./")

names(bioclim) <- c("Ann Mean Temp","Mean Diurnal Range","Isothermality","Temperature Seasonality",
                    "Max Temp Warmest Mo","Min Temp Coldest Mo","Ann Temp Range","Mean Temp Wettest Qtr",
                    "Mean Temp Driest Qtr","Mean Temp Warmest Qtr","Mean Temp Coldest Qtr","Annual
                    Precip","Precip Wettest Mo","Precip Driest Mo","Precip Seasonality","Precip Wettest
                    Qtr","Precip Driest Qtr","Precip Warmest Qtr","Precip Coldest Qtr")

bio.extent <- extent(x = c(
  min(pilo.rgbif.df$x),
  max(pilo.rgbif.df$x),
  min(pilo.rgbif.df$y),
  max(pilo.rgbif.df$y)))

bioclim.extent <- crop(x = bioclim, y = bio.extent)

bioclim.model <- bioclim(x = bioclim.extent, p = cbind(pilo.rgbif.df$x,pilo.rgbif.df$y))
presence.model <- dismo::predict(object = bioclim.model, 
                                 x = bioclim.extent, 
                                 ext = bio.extent)

In this script bioclimatic variable were generated and descriptive names were provided. Because the bioclimatic variables are global, an extent was created out of the PILO dataset to limit the analysis to their geographic range and the extent was used to crop the variables. Finally, a raster layer was created with a prediction based on the model object. To view that information on a map you can use the gplot function which is a wrapper for ggplot from the rasterVis package used to visualize raster opjects.

rasterVis::gplot(presence.model) + 
  geom_polygon(data = us, aes(x= long, y = lat, group = group),
               fill = "gray", color="black") +
  geom_raster(aes(fill=value)) +
  geom_polygon(data = us, aes(x= long, y = lat, group = group),
               fill = NA, color="black") +
  geom_point(data = pilo.rgbif.df, aes(x = x, y = y), size = 2, color = "black", alpha = 0.5) +
  scale_fill_gradientn(colours=c("brown","yellow","darkgreen"), "Probability") +
  coord_fixed(xlim = c(-122.5,-110.5), ylim = c(36,42.5)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Probability of PILO Occurrence") + 
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position = "right") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))

This analysis included “the kitchen sink” of variables. The model could be better specified by removing variables that are unrelated to the growth response of PILO. Additionally you can see areas where PILO has a high probability of occurrence but is not present. In Arizona, this is the range of PIAR. You can also view this information in leaflet to examine the data interactively.

colors <- c("brown","yellow","darkgreen")

leaflet() %>% 
  addTiles() %>%
  addRasterImage(presence.model, colors = colors, opacity = 0.8) %>%
  addCircleMarkers(pilo.rgbif.df$x,
                   pilo.rgbif.df$y,
                   weight = 1,
                   color = "grey",
                   fillColor = "green",
                   fillOpacity = 0.7) %>%
  addMiniMap(position = 'topright',
             width = 100, 
             height = 100,
             toggleDisplay = FALSE) %>%
  addScaleBar(position = "bottomright")

4 YOUR TURN!

Now it’s your turn! Although some of this might not be applicable to your thesis research, try this information out on a dataset of your choice from any of the sources above. If you can apply this to your thesis, then add it to your website! Otherwise create a new repository for your work on Thursday. Hint: If uploading this as a project to GitHub be sure to add the bioclim folder to your *.gitignore file to avoid errors relating to the size of folder exceeding GitHub standards.

External Data, Distribution Networks, and Simple Species Distribution Models
Advanced Data Analytics

Austin Peay State University

1 Introduction

1.1 Packages used in this exercise

2 Connecting to External Data Sources

2.1 Connecting to GitHub or other Websites

2.2 Connecting to Data Distribution Networks

2.2.1 DISMO

2.2.2 RGBIF

2.2.3 DRYAD

2.3 Creating an Inset Map

3 Simple Species Distribution Map

4 YOUR TURN!

External Data, Distribution Networks, and Simple Species Distribution Models Advanced Data Analytics

Austin Peay State University

1 Introduction

1.1 Packages used in this exercise

2 Connecting to External Data Sources

2.1 Connecting to GitHub or other Websites

2.2 Connecting to Data Distribution Networks

2.2.1 DISMO

2.2.2 RGBIF

2.2.3 DRYAD

2.3 Creating an Inset Map

3 Simple Species Distribution Map

4 YOUR TURN!

External Data, Distribution Networks, and Simple Species Distribution Models
Advanced Data Analytics