18 Raster Analysis

18.1 Learning Objectives

In this lesson, you will learn:

How to use the raster package to import geospatial data
how to combine raster and vector to conduct geospatial analysis in R

18.2 Introduction

We just have seen how to conduct spatial analysis in R using vector data. There is a second categories of geospatial data: raster data.

In a nutshell, raster data is a matrix of cells (or pixels) organized into rows and columns. Each cell / pixels stores a value. Satellite imagery, aerial imagery and radar from Earth Observation Systems are a great source of environmental data.

As discussed in the previous section, raster will have a Coordinate Reference System (CRS) defining the mathematical formula used to used to define the position of the raster pixels on earth. Few extra terms you might hear when talking about raster:

The spatial resolution of a raster refers to the size of a pixel. It is often defined as one number (e.g. 30m) assuming the pixels are squares; However it is not a requirement and pixels can be rectangular (have edges of different length).
The extent of a raster refers to the area on Earth that is covered by the entire raster. It be defined in many ways, but it will often be defined using a specific corner, the pixel size, and the number of pixels along the x and y dimensions.
The number of bands of a raster refers to a set of images over the same area but with value corresponding to different “bands” of the electromagnetic spectrum. For example, a typical RGB image will be made of 3: Red, Green, Blue. Often you want to stack those images to conduct an analysis as the area of observation matched perfectly. Satellite and other sensors can of course capture information far beyond the visible part of the electromagnetic spectrum.

18.2.1 Raster File Formats

As vector data, raster data comes in various file formats:

18.2.1.1 `TIFF/GeoTIFF`

TIFF/GeoTIFF are probably the most common raster formats. GeoTIFFs consist of one file with a specific header that encapsulates the geospatial information. There is a second way to add geospatial information to a TIFF file by adding an extra “world” file .tfw with the same name as the TIFF file. This .tfw is a simple text file that defines various parameters to define the raster geospatial information.

18.2.1.2 `IMG`

This is a proprietary form from the remote sensing software ERDAS Imagine. It has the advantage to be able to store multiple bands in one file and also to have great compression capacity.

18.2.1.3 `NetCDF`

The Network Common Data Form, or netCDF, is an interface to a library of data access functions for storing and retrieving data in the form of arrays. A Subset of this format, the Hierarchical data format (HDF4 or HDF5), is a format that is often used by satellite imagery data providers. Through its great handling of multidimensional data, it can be used to store both space and time information.

One word on ESRI geospatial database storing raster data: it is unfortunately to sue R to read this data format. R relies on open source libraries to import and process geospatial data (GDAL, GEOS, PROJ, …), and ESRI has not yet opened this data format to those libraries.

18.3 The `raster` Package

The raster package is THE package in R to import and process raster data. It can handle many of the formats mentioned above and more.

To import a raster in R, you will use the raster function for a single band raster. For multi-band raster, you can use the brick() function instead.

For this example, we are going to read a land cover map of Alaska from the The Multi-Resolution Land Characteristics (MRLC) consortium.

The original file was quite large quite large (~8GB), so we have already resampled to a lower spatial resolution of 50m and saved it in a shared directory on the server.

library(dplyr)
library(sf)
library(raster)
library(ggplot2)
library(curl)
library(scales)


# House keeping to setup data from the vector chapter
ak_regions <- read_sf("data/shapefiles/ak_regions_simp.shp")

ak_regions_3338 <- ak_regions %>%
  st_transform(crs = 3338)

pop <- read.csv("data/shapefiles/alaska_population.csv")

pop_4326 <- st_as_sf(pop, 
                  coords = c('lng', 'lat'),
                  crs = 4326,
                  remove = F)

pop_3338 <- st_transform(pop_4326, crs = 3338)

pop_joined <- st_join(pop_3338, ak_regions_3338, join = st_within)

Let us now read this raster into R:

lc_3338 <- raster("/home/shares/scientist/ds-workshop/ak_nlcd_2011_landcover_50m_3338.tif")

Look at the raster:

lc_3338

## class      : RasterLayer 
## dimensions : 40716, 74552, 3035459232  (nrow, ncol, ncell)
## resolution : 50, 50  (x, y)
## extent     : -2232595, 1495005, 344575, 2380375  (xmin, xmax, ymin, ymax)
## crs        : +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## source     : /tmp/RtmpakeRBa/file49ed3b556455 
## names      : file49ed3b556455 
## values     : 0, 95  (min, max)

It is always good to check if everything aligns well:

plot(lc_3338)
plot(pop_3338, add=TRUE)

### Extracting information from raster

OK, everything looks good. Now we want to find the most frequent land cover within a radius of 500m. To do so we can use the extract function. This function can handle, the buffering and even have a modal function that will let us find the mode.

# Compute the most frequent land cover type around the population
raster_points <- extract(lc_3338, pop_3338, buffer=500, fun=modal)

# Add those values to our data
pop_joined$land_cover <- raster_points

Note that I am using 500 for the buffer radius because we know the unit is meter from the projection definition: crs : +proj=aea +lat_0=50 +lon_0=-154 +lat_1=55 +lat_2=65 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 **+units=m**.

At this point, we only have the values for this raster and do not know yet what are the corresponding land covers. Luckily we can look at the metadata here: https://www.mrlc.gov/data/legends/national-land-cover-database-2011-nlcd2011-legend

This not a super friendly format. We have already processed this information (see the end of this chapter to learn how) and stored it in a csv:

##    ID      COUNT Red Green Blue              Land.Cover.Type
## 1   0 6539917102   0     0    0                 Unclassified
## 2  11  267506887  71   107  161                   Open Water
## 3  12   77582023 209   222  250           Perennial Ice/Snow
## 4  21     383719 222   202  202        Developed, Open Space
## 5  22    1080686 217   148  130     Developed, Low Intensity
## 6  23     145477 238     0    0  Developed, Medium Intensity
## 7  24      52478 171     0    0    Developed, High Intensity
## 8  31  146969306 179   174  163                  Barren Land
## 9  41   61258881 104   171   99             Deciduous Forest
## 10 42  259612828  28    99   48             Evergreen Forest
## 11 43   61771547 181   202  143                 Mixed Forest
## 12 51  322951293 166   140   48                  Dwarf Shrub
## 13 52  426246570 204   186  125                  Shrub/Scrub
## 14 71   32455056 227   227  194         Grassland/Herbaceous
## 15 72  107796396 202   202  120             Sedge/Herbaceous
## 16 74     582776 120   174  148                         Moss
## 17 81      51213 220   217   61                  Pasture/Hay
## 18 82     319495 171   112   40             Cultivated Crops
## 19 90   65732442 186   217  235               Woody Wetlands
## 20 95   56251009 112   163  186 Emergent Herbaceous Wetlands

# Read the land cover legend
Legend_lc <- read.csv("/home/shares/scientist/ds-workshop/legend_ak_nlcd_2011.csv", stringsAsFactors = FALSE)
Legend_lc

Now we can join this information to our main data frame and since the categories are pretty detailed, we will also aggregate them to a coarser level of information:

# It is a lot of categories, let us consolidate this
Legend_lc <- Legend_lc %>%
  mutate(main_lc = ifelse(ID %in% 40:49, "Forest",
                              ifelse(ID %in% 20:29, "Urban",
                                     ifelse(ID %in% 50:59, "Shrub",
                                             ifelse(ID %in% 70:79, "Grass",
                                                     ifelse(ID %in% 80:89, "Crops",
                                                             ifelse(ID %in% 90:99, "Wetland", Land.Cover.Type)
                                                             )
                                                     )
                                             )
                                     )
                              )
  )


# Join the LC categories to the population data
pop_3338_cover <- left_join(pop_joined, Legend_lc, by=c("land_cover"="ID")) %>% 
  dplyr::select(-Red, -Green, -Blue, -Land.Cover.Type)

18.3.1 Plotting

# Create color palette by keeping last color of each group
pal <- Legend_lc %>% 
  group_by(main_lc) %>% 
  slice(n()) %>% # Keeping the last color of the groups
  ungroup %>% 
  arrange(ID) %>%
  mutate(color_hex = rgb(Red, Green, Blue, max = 255)) %>% 
  dplyr::select(main_lc, color_hex)

# turn pallete into a list for plotting
pal_list <- pal$color_hex
names(pal_list) <- pal$main_lc


# Plot by region
ggplot(pop_3338_cover, aes(region, population, fill = main_lc)) +
    geom_col() +
    scale_y_continuous(labels = comma) +
    scale_fill_manual(values = pal_list) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle("Land cover type by region and population") +
    labs(fill = "", y = "Population", x = "")

ggplot() +
    geom_sf(data = ak_regions_3338, aes(), color = "black") +
    geom_sf(data = pop_3338_cover, aes(color = main_lc,
                                       size = population), show.legend = "point") +
    scale_size_continuous(guide = F) +
    scale_color_manual(values = pal_list) +
    theme_bw() +
    theme(legend.position = "bottom", legend.title = element_blank())

18.3.2 Calculation with raster

A Very useful function of the raster package is the function calc. As the name suggest this function helps you to do computation on raster. Continuing our example above, we want to compute a percentage of area that is forest.

Let us first select a region of interest:

copper_region_3338 <- ak_regions_3338 %>% 
  filter(region == "Copper River")

Crop the raster using the region:

copper_lc_3338 <- crop(lc_3338, copper_region_3338)

plot(copper_lc_3338)

The way calc works is that you need to create a function to be applied to the raster using calc. Here is the function we will use create a binary mask for forest:

# Function to create a binary mask for forest 
forest_masker <- function(x){
  x[x<40 | x>49] <- 0
  x[x>=40 & x<=49] <- 1
  x
}

Now we can use the calc function to apply this function to every pixels:

copper_forested <- calc(copper_lc_3338, forest_masker)
plot(copper_forested)

As previously, we can now use the extract function to count the number of forested pixels:

# Filter the population data for the copper region
copper_pop_3338 <- pop_3338_cover %>%
  filter(region == "Copper River")

# Use those locations to extract the number of pixels wth forest
forested_count <- extract(copper_forested, copper_pop_3338, buffer=500, fun=sum)

Adding the values back to the main data set:

copper_pop_3338$forest_cov <- 100 * forested_count / (20*20) # 20 pixels within the diameter
head(copper_pop_3338)

## Simple feature collection with 6 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 356847.6 ymin: 1318282 xmax: 505439 ymax: 1433557
## epsg (SRID):    3338
## proj4string:    +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   year             city      lat       lng population region_id       region
## 1 2015      Chistochina 62.57178 -144.6542         59         5 Copper River
## 2 2015          Chitina 61.51583 -144.4369         80         5 Copper River
## 3 2015    Copper Center 61.97696 -145.3297        405         5 Copper River
## 4 2015 Eureka Roadhouse 61.93861 -147.1681          0         5 Copper River
## 5 2015           Gakona 62.30194 -145.3019        205         5 Copper River
## 6 2015       Glennallen 62.10917 -145.5464        366         5 Copper River
##   mgmt_area land_cover     COUNT main_lc                 geometry forest_cov
## 1         2         90  65732442 Wetland   POINT (477473 1433557)       5.25
## 2         2         41  61258881  Forest   POINT (505439 1318282)      49.75
## 3         2         42 259612828  Forest POINT (451821.3 1362933)      65.00
## 4         2         52 426246570   Shrub POINT (356847.6 1347480)      13.00
## 5         2         41  61258881  Forest POINT (448515.1 1399150)      43.50
## 6         2         21    383719   Urban POINT (438715.9 1376141)      33.50

Plotting the percentage of forested area in the 1 km^2 surround of the population centers:

ggplot() +
    geom_sf(data = copper_region_3338, aes(), color = "black") +
    geom_sf(data = copper_pop_3338, aes(color = forest_cov,
                                       size = population)) +
    scale_size_continuous(guide = F) +
    scale_colour_gradientn(colours = terrain.colors(10, rev=TRUE)) +
    theme_bw() +
    theme(legend.position = "bottom") +
    labs(color = "Percent Forest Cover")

18.4 Other Thoughts

18.4.1 Raster processing can be slow

Processing raster in R is not fast and raster processing can quickly involve large data set. This is especially true when reprojecting large raster. It is therefore also good to learn how to use other tools to do such tasks. gdal at the command line is a great one (see gdalwrap).

It is also very important to process raster the correct way. Before starting the processing of large raster, we strongly recommend you read this information: https://rspatial.org/raster/appendix1.html

18.4.2 Preprocessing we did

Actually because the original Land Cover raster was using a img format, it was possible to store more information then the pixel values. This format can handle categorical data and attach extra information, we can get the colors as specified by the data creator as well as the categories corresponding to a specific values. The the raster object in R use a S4 format that allows it to store specific information into slots. You can access these slots using the @. Here how we created the legend csv file you used:

# Read the original raster
lc <- raster("/home/shares/scientist/ds-workshop/LC30/ak_nlcd_2011_landcover_1_15_15.img")

# Look at the table attached
lc@data@attributes

# Remove the value without any pixel
legend_lc <- lc@data@attributes[[1]] %>%
  filter(COUNT!=0) %>% 
  dplyr::select(-Opacity)

# Write the csv
write_csv(legend_lc, "/home/shares/scientist/ds-workshop/legend_ak_nlcd_2011.csv")

This is how we reprojected to original data and resampled it to a 50m resolution instead of the 30m. Note that the use the Nearest Neighbor method ngb to compute the new raster as this algorithm keeps the original values

lc_3338 <- projectRaster(lc, crs=crs(ak_regions_3338), res=50, method = 'ngb', filename="/home/shares/scientist/ds-workshop/ak_nlcd_2011_landcover_50m_3338.tif")

18.5 References

Modern Geospatial Data Analysis in R Geocomputation with R Raster analysis in R Spatial data science