library(tidyverse)
library(readxl)
library(janitor)
4 Wide vs.long format
4.1 Data formats
There are two main ways how the data can be organised across rows and columns. Wide or long format.
Wide format is more conservative and used in many older packages for ecological data analysis. This is the way you need to prepare your species matrix for ordinations in vegan
. However, wide format has also many cons.
One of them is the size of the file. In the example above, there are abundance of given plants in each of the site. When the species is present in just one site, here Trientalis europaea, it is still keeping space across the whole table, where there can be hundreds or thousands of sites. The table code is then of course memory demanding.
For the analyses you cannot keep the empty cells empty, since they are recognised as NAs. You have to fill them with zeros.
Another disadvantage is that you cannot add easily new information. If you for example want to separate species that are in a tree vegetation layer (recognised in vegetation ecology as 1), herb layer (6) and moss layer (9), you would have to add this information to the name of the species e.g. Picea_abies_1. Can you see a conflict with the basic principles of tidyverse?
Long format in contrast, is great for handling large datasets. Imagine we have more sites than those shown in the example above. In this format, we list all the species for Site1, then for Site2 etc. but we list only the species that are really present! By simple count of the rows belonging to each site you have the information about overall species richness.
We can also add any information, describing the data, such as vegetation layers, growth forms, native/alien status etc. After that we can very simply filter, summarise and calculate further statistics.
4.2 From long to wide
We will first start a new script for this chapter and load the libraries. Remember to keep the script tidy and to put there remarks.
Now, we will import the data. In the first example we will again use the Forest understory data
from our folder: Link to Github folder
However, this time we will upload the species data saved in a long format and we will prepare a matrix in a wide format, so that it can be used in specific ecological analyses e.g. in vegan
.
<- read_excel("data/forest_understory/Axmanova-Forest-spe.xlsx")
spe tibble(spe)
# A tibble: 2,277 × 4
RELEVE_NR Species Layer CoverPerc
<dbl> <chr> <dbl> <dbl>
1 1 Quercus petraea agg. 1 63
2 1 Acer campestre 1 18
3 1 Crataegus laevigata 4 2
4 1 Cornus mas 4 8
5 1 Lonicera xylosteum 4 3
6 1 Galium sylvaticum 6 3
7 1 Carex digitata 6 3
8 1 Melica nutans 6 2
9 1 Polygonatum odoratum 6 2
10 1 Geum urbanum 6 2
# ℹ 2,267 more rows
We can see that there are plant species names sorted by RELEVE_NR, where each number indicates a vegetation record from one specific site (can be also called vegetation plot or sample). We will rename this name to make it easier for us as PlotID. Further we need to change the species names to be in the compact format with underscores. For this we will use mutate function with str_replace (for string specification) indicating that each space should be changed to underscore and we will directly apply it to the original column.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(Species = str_replace_all(Species, " ", "_"))
# A tibble: 2,277 × 4
PlotID Species Layer CoverPerc
<dbl> <chr> <dbl> <dbl>
1 1 Quercus_petraea_agg. 1 63
2 1 Acer_campestre 1 18
3 1 Crataegus_laevigata 4 2
4 1 Cornus_mas 4 8
5 1 Lonicera_xylosteum 4 3
6 1 Galium_sylvaticum 6 3
7 1 Carex_digitata 6 3
8 1 Melica_nutans 6 2
9 1 Polygonatum_odoratum 6 2
10 1 Geum_urbanum 6 2
# ℹ 2,267 more rows
*If you want to play a bit, you can create new column (e.g. SpeciesNew) to see both the original name and changed name.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(SpeciesNew = str_replace_all(Species, " ", "_"))
# A tibble: 2,277 × 5
PlotID Species Layer CoverPerc SpeciesNew
<dbl> <chr> <dbl> <dbl> <chr>
1 1 Quercus petraea agg. 1 63 Quercus_petraea_agg.
2 1 Acer campestre 1 18 Acer_campestre
3 1 Crataegus laevigata 4 2 Crataegus_laevigata
4 1 Cornus mas 4 8 Cornus_mas
5 1 Lonicera xylosteum 4 3 Lonicera_xylosteum
6 1 Galium sylvaticum 6 3 Galium_sylvaticum
7 1 Carex digitata 6 3 Carex_digitata
8 1 Melica nutans 6 2 Melica_nutans
9 1 Polygonatum odoratum 6 2 Polygonatum_odoratum
10 1 Geum urbanum 6 2 Geum_urbanum
# ℹ 2,267 more rows
Note that in the same way as above you can change different patterns, e.g. removing part of the strings. There are even more complex things we can do, but we will keep it for later as it requires some knowledge of regex
rules. Here I created a new column where I removed indication that the species belongs to a complex, i.e.the part of the string saying it is an aggregate, I also prepared a column “check” where I compare if the new and old name are equal or not and filtered just those cases where they are not. This should show me exactly the changed rows. If I am happy with the result, I can then remove the lines with this check and I can even rewrite the original column.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(SpeciesNew = str_replace_all(Species, " agg.", ""))%>%
mutate(check=(SpeciesNew==Species))%>%
filter(check == "FALSE") %>%
select(PlotID, Species, SpeciesNew, check) #select only relevant to see at a first glance
# A tibble: 159 × 4
PlotID Species SpeciesNew check
<dbl> <chr> <chr> <lgl>
1 1 Quercus petraea agg. Quercus petraea FALSE
2 1 Rubus fruticosus agg. Rubus fruticosus FALSE
3 1 Quercus petraea agg. Quercus petraea FALSE
4 2 Quercus petraea agg. Quercus petraea FALSE
5 2 Quercus petraea agg. Quercus petraea FALSE
6 3 Quercus petraea agg. Quercus petraea FALSE
7 3 Quercus petraea agg. Quercus petraea FALSE
8 3 Galium pumilum agg. Galium pumilum FALSE
9 3 Senecio nemorensis agg. Senecio nemorensis FALSE
10 3 Veronica chamaedrys agg. Veronica chamaedrys FALSE
# ℹ 149 more rows
We have the condensed name with underscores, but there are still more variables in the table. We can either remove them or merge them to be included in the final wide format. Here we will go a bit against tidy rules and add the information about the vegetation layer directly to the variable Species using unite function from the package tidyr
which merges strings from two or more columns into a new one: A+B =A_B. Default separator is again underscore, unless you specify it differently by sep=XX argument.
Argument na.rm indicates what to do if in one of the combined columns there is no value but NA. We have set this argument to TRUE to remove the NA. If you keep it FALSE it can happen that in some data the new string will be a_NA or NA_b, or even NA_NA (see line 4 of our example).
Remove argument set to TRUE will remove the original columns which we used to combine the new one (in the example above you will have only z). In our case we will keep original columns for visual checking and we will use select function in the next step to remove them.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(Species = str_replace_all(Species, " ", "_"))%>%
unite("SpeciesLayer", Species,Layer, na.rm = TRUE, remove = FALSE)
# A tibble: 2,277 × 5
PlotID SpeciesLayer Species Layer CoverPerc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Quercus_petraea_agg._1 Quercus_petraea_agg. 1 63
2 1 Acer_campestre_1 Acer_campestre 1 18
3 1 Crataegus_laevigata_4 Crataegus_laevigata 4 2
4 1 Cornus_mas_4 Cornus_mas 4 8
5 1 Lonicera_xylosteum_4 Lonicera_xylosteum 4 3
6 1 Galium_sylvaticum_6 Galium_sylvaticum 6 3
7 1 Carex_digitata_6 Carex_digitata 6 3
8 1 Melica_nutans_6 Melica_nutans 6 2
9 1 Polygonatum_odoratum_6 Polygonatum_odoratum 6 2
10 1 Geum_urbanum_6 Geum_urbanum 6 2
# ℹ 2,267 more rows
At this point we have everything we need to use it as input for the wide format table: PlotID, SpeciesLayer and values of the abundance saved as CoverPerc. One more step is to select only these or to deselect (-) those not needed.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(Species = str_replace_all(Species, " ", "_"))%>%
unite("SpeciesLayer", Species,Layer, na.rm = TRUE, remove = FALSE) %>%
select(PlotID, Species, Layer)
# A tibble: 2,277 × 3
PlotID Species Layer
<dbl> <chr> <dbl>
1 1 Quercus_petraea_agg. 1
2 1 Acer_campestre 1
3 1 Crataegus_laevigata 4
4 1 Cornus_mas 4
5 1 Lonicera_xylosteum 4
6 1 Galium_sylvaticum 6
7 1 Carex_digitata 6
8 1 Melica_nutans 6
9 1 Polygonatum_odoratum 6
10 1 Geum_urbanum 6
# ℹ 2,267 more rows
Now we can finaly use the pivot wider function to transform the data. We have to specify from where we are taking the names of new variables (names_from) and from where we should take the values which should appear in the table (values_from).
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(Species = str_replace_all(Species, " ", "_"))%>%
unite("SpeciesLayer", Species,Layer, na.rm = TRUE, remove = TRUE) %>%
pivot_wider(names_from = SpeciesLayer, values_from = CoverPerc)
# A tibble: 65 × 370
PlotID Quercus_petraea_agg._1 Acer_campestre_1 Crataegus_laevigata_4
<dbl> <dbl> <dbl> <dbl>
1 1 63 18 2
2 2 63 NA NA
3 3 63 NA NA
4 4 38 NA NA
5 5 63 NA NA
6 6 38 NA NA
7 7 63 NA NA
8 8 38 NA NA
9 9 38 NA NA
10 10 63 NA NA
# ℹ 55 more rows
# ℹ 366 more variables: Cornus_mas_4 <dbl>, Lonicera_xylosteum_4 <dbl>,
# Galium_sylvaticum_6 <dbl>, Carex_digitata_6 <dbl>, Melica_nutans_6 <dbl>,
# Polygonatum_odoratum_6 <dbl>, Geum_urbanum_6 <dbl>,
# Anemone_species_6 <dbl>, Viola_mirabilis_6 <dbl>,
# Hieracium_murorum_6 <dbl>, Platanthera_bifolia_6 <dbl>,
# Convallaria_majalis_6 <dbl>, Hepatica_nobilis_6 <dbl>, …
There are different combinations of species in each plot, some of them are present and some not. Since we changed the format, all species, even those not occurring in that particular site/plot have to get some values. In long format abundance or some other information is not stored for absent species, so they get NAs. Therefore, one more step is to fill the empty cells by zeros using values_fill. In this case we can do that, because we know that if the species was absent its abundance was exactly 0.
%>%
spe rename(PlotID= RELEVE_NR)%>%
mutate(Species = str_replace_all(Species, " ", "_"))%>%
unite("SpeciesLayer", Species,Layer, na.rm = TRUE, remove = TRUE) %>%
pivot_wider(names_from = SpeciesLayer, values_from = CoverPerc,
values_fill = 0) -> spe_wide
4.3 From wide to long
Sometimes it is needed to transform the data from wide to long. Here we need to say which column should not be changed, which is the PlotID (cols = -PlotID). Than we specify what to do with the column names, how they should be saved (names_to). And how to name the column with values (values_to).
%>%
spe_wide pivot_longer(cols = -PlotID, names_to = 'species', values_to = 'cover')
# A tibble: 23,985 × 3
PlotID species cover
<dbl> <chr> <dbl>
1 1 Quercus_petraea_agg._1 63
2 1 Acer_campestre_1 18
3 1 Crataegus_laevigata_4 2
4 1 Cornus_mas_4 8
5 1 Lonicera_xylosteum_4 3
6 1 Galium_sylvaticum_6 3
7 1 Carex_digitata_6 3
8 1 Melica_nutans_6 2
9 1 Polygonatum_odoratum_6 2
10 1 Geum_urbanum_6 2
# ℹ 23,975 more rows
We need to remove the empty rows. Because we filled them before with zeros, we will first finish the transformation and then filter out the rows with cover equal to zero.
%>%
spe_wide pivot_longer(cols = -PlotID, names_to = 'species', values_to = 'cover')%>%
filter(!cover == 0)
# A tibble: 2,277 × 3
PlotID species cover
<dbl> <chr> <dbl>
1 1 Quercus_petraea_agg._1 63
2 1 Acer_campestre_1 18
3 1 Crataegus_laevigata_4 2
4 1 Cornus_mas_4 8
5 1 Lonicera_xylosteum_4 3
6 1 Galium_sylvaticum_6 3
7 1 Carex_digitata_6 3
8 1 Melica_nutans_6 2
9 1 Polygonatum_odoratum_6 2
10 1 Geum_urbanum_6 2
# ℹ 2,267 more rows
Sometimes we have data with NAs (not zeros) here it is very useful to use argument values_drop_na =TRUE
%>%
spe_wide pivot_longer(cols = -PlotID, names_to = 'species', values_to = 'cover', values_drop_na =TRUE)
# A tibble: 23,985 × 3
PlotID species cover
<dbl> <chr> <dbl>
1 1 Quercus_petraea_agg._1 63
2 1 Acer_campestre_1 18
3 1 Crataegus_laevigata_4 2
4 1 Cornus_mas_4 8
5 1 Lonicera_xylosteum_4 3
6 1 Galium_sylvaticum_6 3
7 1 Carex_digitata_6 3
8 1 Melica_nutans_6 2
9 1 Polygonatum_odoratum_6 2
10 1 Geum_urbanum_6 2
# ℹ 23,975 more rows
4.4 Group by, count
Using group_by we can define how the data should be arranged into groups. Then we can proceed with asking questions about these groups. Basic function how to summarise the information, is to count the number of rows.
In the first example we are working with species list in a long format. We can easily calculate number of species in each plot/sample by counting corresponding number of rows. First you have to specify what is the grouping variable, here it would be PlotID. And then you can directly count.
%>%
spe rename(PlotID= RELEVE_NR)%>%
group_by(PlotID)%>%
count()
# A tibble: 65 × 2
# Groups: PlotID [65]
PlotID n
<dbl> <int>
1 1 44
2 2 17
3 3 16
4 4 22
5 5 15
6 6 22
7 7 23
8 8 24
9 9 28
10 10 19
# ℹ 55 more rows
Count returns number of rows in a new variable called n. You can directly put extra line to your code and rename it. e.g. rename(SpeciesRichness =n). Note that in this case we simply counted all the rows. But we know some woody species might be recorded in tree vegetation layer, as shrub or as a small juvenile, so we potentially calculated such species three times. We can play a bit with distinct function and remove the layer information. Is there a difference in the resulting numbers?
%>%
spe distinct(PlotID= RELEVE_NR, Species)%>%
group_by(PlotID)%>%
count()%>%
rename(SpeciesRichness=n)
# A tibble: 65 × 2
# Groups: PlotID [65]
PlotID SpeciesRichness
<dbl> <int>
1 1 42
2 2 16
3 3 15
4 4 20
5 5 14
6 6 19
7 7 22
8 8 23
9 9 22
10 10 18
# ℹ 55 more rows
Count can be used also for other cases. For example I am interested in how many rows/plots/samples in the table are assigned to vegetation types. I will upload the table with descriptive characteristics for each forest plot, named as env, which a shortcut often used for environmental data file.
<- read_excel("data/forest_understory/Axmanova-Forest-env.xlsx")
env tibble(env)
# A tibble: 65 × 13
RELEVE_NR ForestType ForestTypeName Biomass pH_KCl Canopy_E3 Radiation Heat
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 oak hornbeam f… 12.8 5.28 80 0.881 0.857
2 2 1 oak forest 9.9 3.24 80 0.933 0.814
3 3 1 oak forest 15.2 4.01 80 0.916 0.850
4 4 1 oak forest 16 3.76 75 0.930 0.948
5 5 1 oak forest 20.7 3.50 70 0.869 0.869
6 6 1 oak forest 46.4 3.8 65 0.918 0.883
7 7 1 oak forest 49.2 3.48 65 0.829 0.803
8 8 2 oak hornbeam f… 48.7 3.68 85 0.869 0.869
9 9 2 oak hornbeam f… 13.8 4.24 80 0.614 0.423
10 10 1 oak forest 79.1 4.00 70 0.905 0.930
# ℹ 55 more rows
# ℹ 5 more variables: TWI <dbl>, Longitude <dbl>, Latitude <dbl>,
# deg_lon <dbl>, deg_lat <dbl>
And calculate the number of plots in each forest type.
%>%
env group_by(ForestTypeName)%>%
count()
# A tibble: 4 × 2
# Groups: ForestTypeName [4]
ForestTypeName n
<chr> <int>
1 alluvial forest 11
2 oak forest 16
3 oak hornbeam forest 28
4 ravine forest 10
Actually we can even skip the group_by step and directly specify what to count, which is very useful.
%>%
env count(ForestTypeName)
# A tibble: 4 × 2
ForestTypeName n
<chr> <int>
1 alluvial forest 11
2 oak forest 16
3 oak hornbeam forest 28
4 ravine forest 10
4.5 Summarise
Using group_by and summarise we can calculate e.g. mean values, or total sum of the values within the group. See more in the further reading. Here is an example of summarise for environmental data:
%>%
env group_by(ForestTypeName) %>%
summarise(meanBiomass= mean(Biomass))
# A tibble: 4 × 2
ForestTypeName meanBiomass
<chr> <dbl>
1 alluvial forest 148.
2 oak forest 26.7
3 oak hornbeam forest 44.4
4 ravine forest 79.1
Summarise is also very useful for species data. For example, we can calculate the total abundance, approximated as cover, of all plants in the plot/sample. Later on we can ask what is the relative share of parasites, endangered or alien species. Or we can calculate community mean of some traits, such as the mean height of the plants or their mean leaf area index. For this we would have to append new information, so we will leave it for next chapter.
%>%
spe group_by(PlotID=RELEVE_NR) %>%
summarise(totalCover= sum(CoverPerc))
# A tibble: 65 × 2
PlotID totalCover
<dbl> <dbl>
1 1 164
2 2 115
3 3 101
4 4 96
5 5 109
6 6 121
7 7 147
8 8 185
9 9 128
10 10 174
# ℹ 55 more rows
4.6 Exercises
4.7 Further reading
Summarise https://dplyr.tidyverse.org/reference/summarise.html
Summarise multiple columns https://dplyr.tidyverse.org/reference/summarise_all.html
Data transformation in R for data science https://r4ds.hadley.nz/data-transform.html
Data tidying including pivoting https://r4ds.hadley.nz/data-tidy.html