2 Data Manipulation

Author

Irena Axmanová

In this chapter, we will try to explore the data, prepare subsets with only selected variables and filter them to only defined cases. We will prepare new variables and rearrange the data according to them. We will also further train importing and exporting the data.

2.1. Introducing dplyr

The term data manipulation might sound a bit tricky. However, it does not mean we plan to show you how to cheat and get better results. It simply means we want to show you how to easily handle the data and prepare it in the form you need. The functions for basic data handling, namely select(), filter(), mutate(), arrange(), and slice(), are provided by the tidyverse package dplyr. Do you remember how to find out more about the package? If nothing else, you can try ?dplyr, which actually gives you more hints where to look further.

Note that many of these operations can also be performed in some table editors (e.g., Excel) before importing to R. However, the effort and time demands would be significantly higher and would increase enormously with the size of the dataset. In contrast, in R, you can modify and rerun the steps in a single pipeline, and the data will be immediately ready for the following analysis.

We will need the following libraries:

library(tidyverse)
library(readxl)

And the forest understory dataset. Again, we will import the data just once and use the pipe to test the functions/effects without actually changing the data:

data <- read_excel("data/forest_understory/Axmanova-Forest-understory-diversity-analyses.xlsx")
names(data)
 [1] "PlotID"           "ForestType"       "ForestTypeName"   "Herbs"           
 [5] "Juveniles"        "CoverE1"          "Biomass"          "Soil_depth_categ"
 [9] "pH_KCl"           "Slope"            "Altitude"         "Canopy_E3"       
[13] "Radiation"        "Heat"             "TransDir"         "TransDif"        
[17] "TransTot"         "EIV_light"        "EIV_moisture"     "EIV_soilreaction"
[21] "EIV_nutrients"    "TWI"             

2.2 select()

select() extracts columns/variables based on their names or position. It is essential to understand the distinction between select() and filter(). select() is used to select variables (columns) we want to keep in the dataset, while filter() applies to rows depending on their values.

You can select by naming the variables. Here, you would appreciate the tidy style of names! Tidy names mean no need to use parentheses :-).

data %>% 
  select(PlotID, ForestType, ForestTypeName) 
# A tibble: 65 × 3
   PlotID ForestType ForestTypeName     
    <dbl>      <dbl> <chr>              
 1      1          2 oak hornbeam forest
 2      2          1 oak forest         
 3      3          1 oak forest         
 4      4          1 oak forest         
 5      5          1 oak forest         
 6      6          1 oak forest         
 7      7          1 oak forest         
 8      8          2 oak hornbeam forest
 9      9          2 oak hornbeam forest
10     10          1 oak forest         
# ℹ 55 more rows

You can also select by position. However, be sure it will remain the same after all the changes you might make to the data.

data %>% 
  select(1:3)
# A tibble: 65 × 3
   PlotID ForestType ForestTypeName     
    <dbl>      <dbl> <chr>              
 1      1          2 oak hornbeam forest
 2      2          1 oak forest         
 3      3          1 oak forest         
 4      4          1 oak forest         
 5      5          1 oak forest         
 6      6          1 oak forest         
 7      7          1 oak forest         
 8      8          2 oak hornbeam forest
 9      9          2 oak hornbeam forest
10     10          1 oak forest         
# ℹ 55 more rows

Sometimes, you decide you want to get rid of some variables. Either you can name all the others that you want to keep, or you can remove the unwanted ones with a minus sign. In the case of just one variable, it is easy: select(-xx). If you want to exclude two or more, you have to use select(-c(xx, xy)).

data %>% 
  select(-c(ForestType, ForestTypeName))
# A tibble: 65 × 20
   PlotID Herbs Juveniles CoverE1 Biomass Soil_depth_categ pH_KCl Slope Altitude
    <dbl> <dbl>     <dbl>   <dbl>   <dbl>            <dbl>  <dbl> <dbl>    <dbl>
 1      1    26        12      20    12.8              5     5.28     4      412
 2      2    13         3      25     9.9              4.5   3.24    24      458
 3      3    14         1      25    15.2              3     4.01    13      414
 4      4    15         5      30    16                3     3.77    21      379
 5      5    13         1      35    20.7              3     3.5      0      374
 6      6    16         3      60    46.4              6     3.8     10      380
 7      7    17         5      70    49.2              7     3.48     6      373
 8      8    21         1      70    48.7              5     3.68     0      390
 9      9    15         4      15    13.8              3.5   4.24    38      255
10     10    14         4      75    79.1              5     4.01    13      340
# ℹ 55 more rows
# ℹ 11 more variables: Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

You can also define a range of variables between two of them.

data %>%
  select(PlotID:ForestTypeName)
# A tibble: 65 × 3
   PlotID ForestType ForestTypeName     
    <dbl>      <dbl> <chr>              
 1      1          2 oak hornbeam forest
 2      2          1 oak forest         
 3      3          1 oak forest         
 4      4          1 oak forest         
 5      5          1 oak forest         
 6      6          1 oak forest         
 7      7          1 oak forest         
 8      8          2 oak hornbeam forest
 9      9          2 oak hornbeam forest
10     10          1 oak forest         
# ℹ 55 more rows

Or you can combine the approaches listed above:

data %>%
  select (PlotID, 3:6)
# A tibble: 65 × 5
   PlotID ForestTypeName      Herbs Juveniles CoverE1
    <dbl> <chr>               <dbl>     <dbl>   <dbl>
 1      1 oak hornbeam forest    26        12      20
 2      2 oak forest             13         3      25
 3      3 oak forest             14         1      25
 4      4 oak forest             15         5      30
 5      5 oak forest             13         1      35
 6      6 oak forest             16         3      60
 7      7 oak forest             17         5      70
 8      8 oak hornbeam forest    21         1      70
 9      9 oak hornbeam forest    15         4      15
10     10 oak forest             14         4      75
# ℹ 55 more rows

select() can also be used in combination with functions from the stringr package to identify a specific pattern in the names. Try several options: starts_with(), ends_with() or a more general one, contains().

data %>%
  select (PlotID, starts_with("EIV"))
# A tibble: 65 × 5
   PlotID EIV_light EIV_moisture EIV_soilreaction EIV_nutrients
    <dbl>     <dbl>        <dbl>            <dbl>         <dbl>
 1      1      5            4.38             6.68          4.31
 2      2      4.71         4.64             4.67          3.69
 3      3      4.36         4.7              4.8           3.55
 4      4      5.26         4.38             5.53          3.56
 5      5      6.14         4                5.33          3.46
 6      6      6.19         4.35             6.75          5.06
 7      7      6.19         4.25             6.09          4.33
 8      8      5.29         4.6              5.07          4.12
 9      9      5.47         4.36             5.46          3.5 
10     10      6.53         3.86             6             3   
# ℹ 55 more rows

select() can effectively help you organise the data. Imagine you have a workflow where you need only some variables, but in a specific sequence. And you import the data from different people or years. By using select() in your script, you can always order the variables in the same way, e.g., SampleID, ForestType, SpeciesNr, Productivity, even during import. You can also rename the variables using select to get exactly what you need. Here, the new name is on the left, as in rename().

data %>% 
  select(SampleID = PlotID, ForestCode = ForestType, SpeciesNr = Herbs, Productivity = Biomass) 
# A tibble: 65 × 4
   SampleID ForestCode SpeciesNr Productivity
      <dbl>      <dbl>     <dbl>        <dbl>
 1        1          2        26         12.8
 2        2          1        13          9.9
 3        3          1        14         15.2
 4        4          1        15         16  
 5        5          1        13         20.7
 6        6          1        16         46.4
 7        7          1        17         49.2
 8        8          2        21         48.7
 9        9          2        15         13.8
10       10          1        14         79.1
# ℹ 55 more rows

2.3 arrange()

This function retains the same variables and reorders the rows according to the values of the selected variables. To view the changes at a glance, we will first select only a few variables.

data <- read_excel("data/forest_understory/Axmanova-Forest-understory-diversity-analyses.xlsx") 
data <- data %>% select(PlotID, ForestType, ForestTypeName, Biomass)

Now we have decided to arrange the data by Forest type.

data %>% 
  arrange(ForestTypeName)
# A tibble: 65 × 4
   PlotID ForestType ForestTypeName  Biomass
    <dbl>      <dbl> <chr>             <dbl>
 1    101          4 alluvial forest    91.1
 2    103          4 alluvial forest   114. 
 3    104          4 alluvial forest   188. 
 4    110          4 alluvial forest   126. 
 5    111          4 alluvial forest    84.8
 6    113          4 alluvial forest    74.5
 7    125          4 alluvial forest   123. 
 8    127          4 alluvial forest   176  
 9    129          4 alluvial forest   100. 
10    131          4 alluvial forest   163. 
# ℹ 55 more rows

We can also decide to arrange the data from the highest value to the lowest, i.e. in descending order.

data %>% 
  arrange(desc(Biomass))
# A tibble: 65 × 4
   PlotID ForestType ForestTypeName      Biomass
    <dbl>      <dbl> <chr>                 <dbl>
 1    132          4 alluvial forest        287.
 2    130          3 ravine forest          238 
 3    104          4 alluvial forest        188.
 4    127          4 alluvial forest        176 
 5    131          4 alluvial forest        163.
 6    110          4 alluvial forest        126.
 7    125          4 alluvial forest        123.
 8    128          3 ravine forest          121.
 9    103          4 alluvial forest        114.
10    124          2 oak hornbeam forest    102.
# ℹ 55 more rows

Or we can arrange according to more variables. Here, the forest type is listed first, as we have decided it is the most important grouping variable. Within each type, we want to see the rows/cases ordered by biomass values.

data %>% 
  arrange(ForestType, ForestTypeName, desc(Biomass)) %>% print(n = 30)
# A tibble: 65 × 4
   PlotID ForestType ForestTypeName      Biomass
    <dbl>      <dbl> <chr>                 <dbl>
 1     10          1 oak forest             79.1
 2      7          1 oak forest             49.2
 3      6          1 oak forest             46.4
 4     44          1 oak forest             34.1
 5     82          1 oak forest             33.9
 6     28          1 oak forest             29.7
 7     42          1 oak forest             21.5
 8      5          1 oak forest             20.7
 9     81          1 oak forest             20.6
10      4          1 oak forest             16  
11     47          1 oak forest             15.5
12      3          1 oak forest             15.2
13     31          1 oak forest             14.9
14    100          1 oak forest             13.5
15      2          1 oak forest              9.9
16     11          1 oak forest              7.4
17    124          2 oak hornbeam forest   102. 
18    120          2 oak hornbeam forest    98.5
19    121          2 oak hornbeam forest    84.7
20    126          2 oak hornbeam forest    72.9
21     18          2 oak hornbeam forest    72.2
22    118          2 oak hornbeam forest    66.2
23     99          2 oak hornbeam forest    64.8
24    109          2 oak hornbeam forest    60.7
25    112          2 oak hornbeam forest    58.4
26    102          2 oak hornbeam forest    57.4
27    117          2 oak hornbeam forest    54.7
28      8          2 oak hornbeam forest    48.7
29     86          2 oak hornbeam forest    46.9
30     87          2 oak hornbeam forest    42.9
# ℹ 35 more rows

To see more rows of the resulting table, we specified their number using the print() function.

2.4 distinct()

distinct() is a function that takes your data and removes all the duplicate rows, keeping only the unique ones. There are many cases where you will truly appreciate this elegant and effortless approach. For example, we want a list of unique Plot IDs, unique combinations of two categories, and so on. Here, we aim to compile a list of forest type codes and corresponding names.

data %>%
  arrange(ForestType) %>%
  select(ForestType, ForestTypeName) %>%
  distinct()
# A tibble: 4 × 2
  ForestType ForestTypeName     
       <dbl> <chr>              
1          1 oak forest         
2          2 oak hornbeam forest
3          3 ravine forest      
4          4 alluvial forest    

The same can also be done if I skip the select tool, because distinct() works more like select() + distinct().

data %>%      
  arrange(ForestType) %>%
  distinct(ForestType, ForestTypeName)
# A tibble: 4 × 2
  ForestType ForestTypeName     
       <dbl> <chr>              
1          1 oak forest         
2          2 oak hornbeam forest
3          3 ravine forest      
4          4 alluvial forest    

Sometimes, distinct() can also be used when importing data to remove duplicate rows that were overlooked previously: data <- read_csv... %>% distinct().

2.5 filter() and slice()

When we have a large dataset, we sometimes need to create a subset of the rows/cases. First, we need to define the variable on which we will filter the rows (e.g., Forest type, soil pH) and determine which values are acceptable and which are not.

In this first example, we use a categorical variable, and we want to match the exact value, so we have to use ==

data %>% 
  filter(ForestTypeName == "alluvial forest")
# A tibble: 11 × 4
   PlotID ForestType ForestTypeName  Biomass
    <dbl>      <dbl> <chr>             <dbl>
 1    101          4 alluvial forest    91.1
 2    103          4 alluvial forest   114. 
 3    104          4 alluvial forest   188. 
 4    110          4 alluvial forest   126. 
 5    111          4 alluvial forest    84.8
 6    113          4 alluvial forest    74.5
 7    125          4 alluvial forest   123. 
 8    127          4 alluvial forest   176  
 9    129          4 alluvial forest   100. 
10    131          4 alluvial forest   163. 
11    132          4 alluvial forest   287. 

Or we might define a list of values, especially for categorical variables. The filter function will try to find rows with values exactly matching those %in% the list.

data %>% 
  filter(ForestTypeName %in% c("alluvial forest", "ravine forest"))
# A tibble: 21 × 4
   PlotID ForestType ForestTypeName  Biomass
    <dbl>      <dbl> <chr>             <dbl>
 1    101          4 alluvial forest    91.1
 2    103          4 alluvial forest   114. 
 3    104          4 alluvial forest   188. 
 4    105          3 ravine forest      65.5
 5    106          3 ravine forest      33.6
 6    108          3 ravine forest      82.3
 7    110          4 alluvial forest   126. 
 8    111          4 alluvial forest    84.8
 9    113          4 alluvial forest    74.5
10    114          3 ravine forest      71.5
# ℹ 11 more rows

When filtering a continuous variable, we can work with thresholds. For example, here the biomass of the herb layer is measured in g/m2, and we want only those rows/cases where the biomass values are higher than 80.

data <- read_excel("data/forest_understory/Axmanova-Forest-understory-diversity-analyses.xlsx") 
data %>% 
  filter(Biomass > 80) #[g/m2]
# A tibble: 17 × 22
   PlotID ForestType ForestTypeName      Herbs Juveniles CoverE1 Biomass
    <dbl>      <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
 1    101          4 alluvial forest        28         6      90    91.1
 2    103          4 alluvial forest        35         5      80   114. 
 3    104          4 alluvial forest        25        11      85   188. 
 4    108          3 ravine forest          28         7      70    82.3
 5    110          4 alluvial forest        37         4      95   126. 
 6    111          4 alluvial forest        26        10      85    84.8
 7    120          2 oak hornbeam forest    36         6      75    98.5
 8    121          2 oak hornbeam forest    30        12      80    84.7
 9    122          3 ravine forest          29         6      60   102. 
10    124          2 oak hornbeam forest    34        15      75   102. 
11    125          4 alluvial forest        25         8      95   123. 
12    127          4 alluvial forest        53         6       0   176  
13    128          3 ravine forest          20         4      70   121. 
14    129          4 alluvial forest        31         7      85   100. 
15    130          3 ravine forest          18         5      85   238  
16    131          4 alluvial forest        59         3      95   163. 
17    132          4 alluvial forest        60         7      90   287. 
# ℹ 15 more variables: Soil_depth_categ <dbl>, pH_KCl <dbl>, Slope <dbl>,
#   Altitude <dbl>, Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

You can also filter plots within a specified range and apply multiple conditions. Here, we want to keep only plots with biomass in the range of 40–80 g/m², but we also want to keep plots without a value (NA).

data %>% 
  filter((Biomass >= 40 & Biomass <= 80) | is.na(Biomass))
# A tibble: 20 × 22
   PlotID ForestType ForestTypeName      Herbs Juveniles CoverE1 Biomass
    <dbl>      <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
 1      6          1 oak forest             16         3      60    46.4
 2      7          1 oak forest             17         5      70    49.2
 3      8          2 oak hornbeam forest    21         1      70    48.7
 4     10          1 oak forest             14         4      75    79.1
 5     18          2 oak hornbeam forest    13         3      85    72.2
 6     86          2 oak hornbeam forest    32         5      40    46.9
 7     87          2 oak hornbeam forest    34         9      65    42.9
 8     99          2 oak hornbeam forest    18         6      85    64.8
 9    102          2 oak hornbeam forest    37        13      65    57.4
10    105          3 ravine forest          12        10      80    65.5
11    109          2 oak hornbeam forest    25         9      50    60.7
12    112          2 oak hornbeam forest    32        14      60    58.4
13    113          4 alluvial forest        37        10      70    74.5
14    114          3 ravine forest          12         6      50    71.5
15    115          3 ravine forest          23         2      70    54.6
16    116          3 ravine forest          16         3      45    64.5
17    117          2 oak hornbeam forest    31         7      55    54.7
18    118          2 oak hornbeam forest     8         5      45    66.2
19    123          3 ravine forest          34        11      70    53.6
20    126          2 oak hornbeam forest    27         7      65    72.9
# ℹ 15 more variables: Soil_depth_categ <dbl>, pH_KCl <dbl>, Slope <dbl>,
#   Altitude <dbl>, Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

Example of multiple conditions in one filter line. Here, we select all plots of the specified types and exclude the others, taking only those with higher biomass, because we used the OR (|) operator between the two conditions.

data %>% 
  filter((ForestTypeName %in% c("alluvial forest", "ravine forest") |
    (Biomass >= 80)))
# A tibble: 24 × 22
   PlotID ForestType ForestTypeName  Herbs Juveniles CoverE1 Biomass
    <dbl>      <dbl> <chr>           <dbl>     <dbl>   <dbl>   <dbl>
 1    101          4 alluvial forest    28         6      90    91.1
 2    103          4 alluvial forest    35         5      80   114. 
 3    104          4 alluvial forest    25        11      85   188. 
 4    105          3 ravine forest      12        10      80    65.5
 5    106          3 ravine forest      16         0      75    33.6
 6    108          3 ravine forest      28         7      70    82.3
 7    110          4 alluvial forest    37         4      95   126. 
 8    111          4 alluvial forest    26        10      85    84.8
 9    113          4 alluvial forest    37        10      70    74.5
10    114          3 ravine forest      12         6      50    71.5
# ℹ 14 more rows
# ℹ 15 more variables: Soil_depth_categ <dbl>, pH_KCl <dbl>, Slope <dbl>,
#   Altitude <dbl>, Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

Sometimes you might find it useful to filter something out, rather than specifying what should stay. This is indicated by the exclamation mark, as shown below.

data %>% 
  filter(!is.na(Juveniles))
# A tibble: 65 × 22
   PlotID ForestType ForestTypeName      Herbs Juveniles CoverE1 Biomass
    <dbl>      <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
 1      1          2 oak hornbeam forest    26        12      20    12.8
 2      2          1 oak forest             13         3      25     9.9
 3      3          1 oak forest             14         1      25    15.2
 4      4          1 oak forest             15         5      30    16  
 5      5          1 oak forest             13         1      35    20.7
 6      6          1 oak forest             16         3      60    46.4
 7      7          1 oak forest             17         5      70    49.2
 8      8          2 oak hornbeam forest    21         1      70    48.7
 9      9          2 oak hornbeam forest    15         4      15    13.8
10     10          1 oak forest             14         4      75    79.1
# ℹ 55 more rows
# ℹ 15 more variables: Soil_depth_categ <dbl>, pH_KCl <dbl>, Slope <dbl>,
#   Altitude <dbl>, Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

A specific alternative to filter() is the slice() function. Let’s say I want to get the top 3 rows/cases/vegetation samples with the highest numbers of recorded juveniles. So, we can arrange the values, determine the threshold and filter the values above it. Or we can use slice_max() to do the job for us.

data %>% 
  filter(!is.na(Juveniles)) %>%
  slice_max(Juveniles, n = 3) 
# A tibble: 3 × 22
  PlotID ForestType ForestTypeName      Herbs Juveniles CoverE1 Biomass
   <dbl>      <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
1    124          2 oak hornbeam forest    34        15      75   102. 
2    112          2 oak hornbeam forest    32        14      60    58.4
3    119          2 oak hornbeam forest    63        14      50    36.6
# ℹ 15 more variables: Soil_depth_categ <dbl>, pH_KCl <dbl>, Slope <dbl>,
#   Altitude <dbl>, Canopy_E3 <dbl>, Radiation <dbl>, Heat <dbl>,
#   TransDir <dbl>, TransDif <dbl>, TransTot <dbl>, EIV_light <dbl>,
#   EIV_moisture <dbl>, EIV_soilreaction <dbl>, EIV_nutrients <dbl>, TWI <dbl>

Tip: Try slice_min() if you need the lowest values.

2.6 mutate()

mutate() adds new variables that are functions of existing variables, e.g. round the values. It can also overwrite the original variable.

For example, in the field, I recorded juveniles of woody plants and all other herb-layer species separately. However, we now want to sum these two values for each row/case/vegetation plot so that we can speak about overall species richness. I create a new variable by simply summing these two (note how easy it is with tidy names!).

data %>% 
  mutate(SpeciesRichness = Herbs + Juveniles) %>%
  select(PlotID, SpeciesRichness, Herbs, Juveniles)
# A tibble: 65 × 4
   PlotID SpeciesRichness Herbs Juveniles
    <dbl>           <dbl> <dbl>     <dbl>
 1      1              38    26        12
 2      2              16    13         3
 3      3              15    14         1
 4      4              20    15         5
 5      5              14    13         1
 6      6              19    16         3
 7      7              22    17         5
 8      8              22    21         1
 9      9              19    15         4
10     10              18    14         4
# ℹ 55 more rows

Sometimes, you want to add a variable where all rows/cases will receive the same value. For example, because you plan to join the data with other data or because it is useful for another mutate(). This is also a straightforward method for converting data with abundances to presence/absence data.

data %>% 
  mutate(selection = 1)%>%
  select(PlotID, selection, ForestType, ForestTypeName)
# A tibble: 65 × 4
   PlotID selection ForestType ForestTypeName     
    <dbl>     <dbl>      <dbl> <chr>              
 1      1         1          2 oak hornbeam forest
 2      2         1          1 oak forest         
 3      3         1          1 oak forest         
 4      4         1          1 oak forest         
 5      5         1          1 oak forest         
 6      6         1          1 oak forest         
 7      7         1          1 oak forest         
 8      8         1          2 oak hornbeam forest
 9      9         1          2 oak hornbeam forest
10     10         1          1 oak forest         
# ℹ 55 more rows

You can also create a variable with more categories based on the values of other variables using ifelse() inside mutate(). You give the condition when it should be called ‘low’ and what to do if the condition is not met - name it ‘high’.

Tip: Inside ‘Condition’, you can also use other conditional operators, like AND (&), OR (|), or EQUAL (==).

data %>%
  mutate(Productivity = ifelse(Biomass<60,"low","high")) %>% 
  select (PlotID, ForestTypeName, Productivity, Biomass) %>%
  print(n=20)
# A tibble: 65 × 4
   PlotID ForestTypeName      Productivity Biomass
    <dbl> <chr>               <chr>          <dbl>
 1      1 oak hornbeam forest low             12.8
 2      2 oak forest          low              9.9
 3      3 oak forest          low             15.2
 4      4 oak forest          low             16  
 5      5 oak forest          low             20.7
 6      6 oak forest          low             46.4
 7      7 oak forest          low             49.2
 8      8 oak hornbeam forest low             48.7
 9      9 oak hornbeam forest low             13.8
10     10 oak forest          high            79.1
11     11 oak forest          low              7.4
12     14 oak hornbeam forest low             14.2
13     16 oak hornbeam forest low             26.2
14     18 oak hornbeam forest high            72.2
15     28 oak forest          low             29.7
16     29 oak hornbeam forest low             37.9
17     31 oak forest          low             14.9
18     32 oak hornbeam forest low             19.2
19     36 oak hornbeam forest low              3.3
20     41 oak hornbeam forest low             26.9
# ℹ 45 more rows

The same can be done with a similar approach using case_when(). Conditions are evaluated from the first (on the left) to the last (on the right side), so for each step, you take only the rows that are not yet treated by the previous condition.

data %>% 
  mutate(Productivity = case_when(Biomass <= 60 ~ "low", Biomass > 60 ~ "high")) %>% 
  select(PlotID, ForestTypeName, Productivity, Biomass) %>%
  print(n = 20)
# A tibble: 65 × 4
   PlotID ForestTypeName      Productivity Biomass
    <dbl> <chr>               <chr>          <dbl>
 1      1 oak hornbeam forest low             12.8
 2      2 oak forest          low              9.9
 3      3 oak forest          low             15.2
 4      4 oak forest          low             16  
 5      5 oak forest          low             20.7
 6      6 oak forest          low             46.4
 7      7 oak forest          low             49.2
 8      8 oak hornbeam forest low             48.7
 9      9 oak hornbeam forest low             13.8
10     10 oak forest          high            79.1
11     11 oak forest          low              7.4
12     14 oak hornbeam forest low             14.2
13     16 oak hornbeam forest low             26.2
14     18 oak hornbeam forest high            72.2
15     28 oak forest          low             29.7
16     29 oak hornbeam forest low             37.9
17     31 oak forest          low             14.9
18     32 oak hornbeam forest low             19.2
19     36 oak hornbeam forest low              3.3
20     41 oak hornbeam forest low             26.9
# ℹ 45 more rows

I have already said that we often use mutate() to transform original values directly. To work with multiple variables at once, e.g., round them, we can use mutate(across()).

data %>% 
  mutate(across(c(Biomass, pH_KCl, TransTot), ~ round(.x, digits = 1))) %>%
  select(PlotID, Biomass, pH_KCl, TransTot)
# A tibble: 65 × 4
   PlotID Biomass pH_KCl TransTot
    <dbl>   <dbl>  <dbl>    <dbl>
 1      1    12.8    5.3      6.6
 2      2     9.9    3.2      6.9
 3      3    15.2    4        7.3
 4      4    16      3.8      6.4
 5      5    20.7    3.5      6.9
 6      6    46.4    3.8      7  
 7      7    49.2    3.5      7.4
 8      8    48.7    3.7      7  
 9      9    13.8    4.2      6.5
10     10    79.1    4        6.3
# ℹ 55 more rows

.x refers to the entire column (vector) being transformed inside a function like mutate(across(...)) or map().

2.7 Save the data

If you decide that you have already prepared the dataset in the form it should be used later, you probably need to save it and export, e.g. as a csv. Remember that we were primarily trying to see what it would look like with the %>%, so you must either assign the dataset to a new name or add the write command at the end of the pipeline. See also Chapter 1.

Here we will play a bit with the tidyverse package readr.

Write a comma-delimited file:

write_csv(data, "exercisesIA/forest_selected1.csv") 

Write a semicolon-delimited file (;):

write_csv2(data, "exercisesIA/forest_selected2.csv")

Or find out more in the readr cheatsheet. Note that, in contrast to the write.csv() function, the write_csv() function adheres to the tidyverse philosophy, so row names are not included in the export.

You can also write directly to an Excel file, for example, using the writexl library.

library(writexl)
write_xlsx(data, 'exercisesIA/forestEnv.xlsx')

To create an xlsx with (multiple) named sheets, you would need to provide a list of data frames. Here, we specify that data1 should be saved as an Excel sheet named forestEnv (environmental data) and data2 as a sheet named forestSpe (species data from the same dataset).

library(writexl)
data1 <- read_excel("data/forest_understory/Axmanova-Forest-understory-diversity-analyses.xlsx")
data2 <- read_excel("data/forest_understory/Axmanova-Forest-spe.xlsx")
write_xlsx(list(forestEnv = data1, forestSpe = data2), 'exercisesIA/forest.xlsx' )

2.8 Exercises

1. Forest Data - Axmanova-Forest-understory-diversity-analyses.xlsx - download the data from the repository and save it into your project folder (Link to Github folder). This dataset is used throughout the whole chapter. Please use it to prepare your own script with remarks, copy and train what is described above.

Tip: If you copy the description and add it directly to your script, it is simply too long, right? Do you know how to read long lines of text in an R script? Go to “Code” and tick “soft wrap long lines”.

2. Directly from R Studio, create a folder results/chapter2 for saving today’s datasets, as we will train a bit of exporting the data.

3. Use the Forest Data to train more. Create variable “author” which will contain the name Axmanova > as we did before, create variable richness summing the herbs and juveniles > select variables plotID, author, richness, productivity (renamed biomass) and pH. Try to save this subset into results/chapter2.

4. Use the Forest dataset again. Find five plots with the lowest biomass from forest types other than oak forests.

5. Select PlotID and all variables that are connected to transmitted light (Trans..) > round these variables.

6. We will switch to another dataset, Species Forest Data - Axmanova-Forest-spe.xlsx - download the data from the repository and save it into your project folder (Link to Github folder). This is a long-format of the plant species recorded in the vegetation plots. Long format means that the records of species from one site are grouped by the same ID (here as RELEVE_NR). For each species, it is indicated in which vegetation layer it occurred, and an estimation of its abundance is in the form of percentage cover.

Import the data and check the structure > change it to presence-absence data (abundance set to 1) *Alternatively, prepare presence-absence data without considering different layers and save both lists on two different sheets of the Excel file.

7. Stay with the same dataset, we will try to get the list of all species occurring in the tree layer. Import the data > sort the species data alphabetically > get the list of unique species names of the tree layer (i.e. layer == 1) > print, view all the species or store them as a new dataset. Save the result into csv file.

8. iris dataset: We will use the famous iris dataset of flower measurements. Find out more details by asking R ?iris as the dataset is integrated and ready to use. Check the structure of the dataset > select variables including species and those of length measurements > arrange according to Sepal.Length > and filter 15 rows with the longest sepals. Which species prevails among these top 15?

9. squirrel data: Load data of squirrel observations from the Central Park Squirrel Census using this line: read_csv(‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-05-23/squirrel_data.csv’)

Load squirrels data, check the structure, rename variables to get tidy names > find out what are the possible colours of fur > select at least 5 variables, including ID, fur colour, location, … > filter squirrels of one fur colour, up to your preference > arrange data by location > print at least 30 rows

  1. *get back to Forest data and using case_when() define three levels of productivity: low, medium, and high.

2.9 Further reading

dplyr main web page: https://dplyr.tidyverse.org

Find dplyr Cheatsheet in Posit: https://posit.co/resources/cheatsheets/

Chapter devoted to Data transformation in the R for Data Science book: https://r4ds.hadley.nz/data-transform.html

encoding issues during import and export: https://irene.rbind.io/post/encoding-in-r/