---
title: "Data simulation and randomization tests"
author: "James S. Santangelo"
output: pdf_document
---
## Lesson preamble
> ### Prerequisites
>
> - This module assumes that students have taken an introductory ecology course that has touched on topics in community ecology (e.g. community assembly)
> - Some knowledge of the R programming language would be helpful, although it is not required.
> - An introductory course in statistics that covered parametric hypothesis testing would additionally be helpful.
> ### Learning objectives
>
> - Learn to simulate small-scale data in R.
> - Understand the basic logic underlying randomization tests and how they differ from standard parametric statistic tests.
> - Apply randomization test to real ecological community data.
>
> ### Lesson outline
>
> Total lesson time: 2 hours
>
> - Brief intro to randomization tests (10 min)
> - Simulating small-scale data in R (30 mins)
> - General procedure for conducting a randomization test (10 mins)
> - Simulate data and perform t-test (10 mins)
> - Perform randomization test on same simulated data (30 mins)
> - Perform randomization test on real ecological data (30 mins)
>
> ### Setup
>
> - `install.packages('dplyr')` (or `tidyverse`)
> - `install.packages('ggplot2')` (or `tidyverse`)
> - `install.packages("reshape2")`
> - `install.packages("EcoSimR")`
## Introduction to randomization tests
Standard parametric statistical tests (e.g. ANOVA, standard linear regression) have numerous assumptions that should be respected prior to drawing conclusions from their results. Violation of these assumptions can lead to erroneous conclusions about the population(s) you are trying to understand. Mixed-effects models provided us with a few ways of overcoming the limitations of standard models, while still using all of our data and remaining in a statistical framework that we generally understand and are comfortable with. Randomization tests provide yet another alternative solution to parametric tests.
Many ecological and evolutionary hypotheses ask: **Is the observed pattern different than what we would expect by random chance?** In other words, the null hypothesis is randomness. Randomization tests (sometimes called permutation tests) allow us to test whether the observed data is different from a random distribution generated by reordering our observed data. If the pattern is random, then it should be just as likely as any other pattern generated by reordering the data. If it is not random, then it should occur more or less frequently than we expect under this distribution.
Randomization tests do not require that we make the same assumptions about our data as standard parametric tests. We do not require that samples be randomly drawn from the population since the goal of randomization tests is not to estimate parameters about the population (parameter estimation requires random samples to estimate means and variances). Similarly, we do not require normality of the data nor do we need equal variances among treatment groups since we are not really thinking about the population from which the data came and are not testing against a distribution (e.g. _t_ distribution) with known shape. This will all become more clear when we walk through an example. But first, an aside.
## An aside on simulating data in R
R allows us to generate fake data (i.e. simulate data) rather easily. This is useful if we just want some toy data to play around with, want to simulate data similar to the type we will collect in an experiment to test our analyses, or want to simulate data collection many times to estimate population parameters (e.g. variance or standard deviation). We can sample existing (i.e. real) data using the `sample()` function, or take random samples from known statistical distributions (e.g. Normal distribution). We will return to `sample()` later in this lesson. For now, let's focus on sampling from known distributions.
We'll start by taking random samples from a normal distribution with mean equal to 1 and standard deviation equal to 0.
```{r message=FALSE, warning=FALSE}
# Let's install all the packages we'll need
library(tidyverse)
library(reshape2)
library(EcoSimR)
```
```{r message=FALSE, warning=FALSE}
set.seed(42) # Ensure reproducibility of random results.
mean <- 0
sd <- 1
n <- 100
x <- rnorm(n, mean, sd)
# Fancy plot, if desired
ggplot(data.frame(x), aes(x=x)) +
geom_histogram(color="black", fill="white") +
annotate("text", x = -0.30, y = 9, label = round(mean(x), 4)) +
annotate("text", x = -0.30, y = 9.5, label = "Mean") +
ylab("Count") +
geom_vline(xintercept = mean(x), linetype = "dashed",
size = 1, colour = "red") +
theme_classic()
# Get mean
mean(x)
```
Note how the mean from the above distribution (0.0325) is not exactly the same as the mean we specified (0). This should not be too surprising since we took *random* samples from a normal distribution and there is likely going to be variance within these samples. However, if we were to repeat this sampling many times (e.g. 10,000), each time plotting the mean of the distribution, we would expect the mean of this new distribution (i.e. the mean of the 10,000 means) to be much closer to what we specified. Let's try this.
```{r message=FALSE, warning=FALSE}
set.seed(43)
mean <- 0
sd <- 1
size <- 100
x <- replicate(10000, mean(rnorm(size, mean, sd)))
# Fancy plot, if desired
ggplot(data.frame(x), aes(x=x)) +
geom_histogram(color="black", fill="white") +
annotate("text", x = -0.055, y = 1050, label = round(mean(x), 4)) +
annotate("text", x = -0.055, y = 1100, label = "Mean") +
ylab("Count") +
geom_vline(xintercept = mean(x), linetype = "dashed",
size = 1, colour = "red") +
theme_classic()
# Get mean
mean(x)
```
As expected, the above distribution is normal and it's mean is much closer to 0 than the distribution generated by sampling only a single time. Interestingly, the same would be true even if the original samples did not come from a normal distribution. For those who took introductory stats, you might remember that this is called the **Central Limit Theorem**, which states that the distribution of sample means approaches a normal distribution regardless of the shape of the original population's distribution.
Note that we could also simulate data from a linear model (or any other model), as the example below shows.
```{r}
set.seed(44)
y_intercept <- 5.1
beta <- 0.23
x <- rnorm(25, mean = 7.6, sd = 0.7)
error <- rnorm(x, mean = 0, sd = 0.5)
y <- y_intercept + beta*x + error
model <- lm(y ~ x)
summary(model)
plot(y ~ x)
abline(model)
```
## Walkthrough of randomization test
Now that we know how to simulate data, let's use this to demonstrate the logic underlying randomization tests. The basic procedure for a randomization test is as follows:
1. Calculate test statistic for the sample data (e.g. difference in means, difference in means, t-statistic, etc.). The test statistic is chosen by the researcher.
2. Randomly reshuffle observations among the treatment groups, each time calculating a test statistic.
3. Repeat 2 multiple times, generating a distribution of test-statistics.
4. Calculate the proportion of times the actual test statistic is outside the distribution of test-statistics.
We will perform the above approach on some simulated data. Imagine we had ventured to South America and collected 10 male and 10 female Hercules beetles. We brought the beetles back to the lab and measure the width of their bodies at the largest point. The question we are interested in is: **Do male and female Hercules beetles differ in body width?** We will assume that body width is normally distributed (i.e. we will draw from a normal distribution).
```{r message=FALSE, warning=FALSE}
set.seed(46)
# Male data
df_males <- data.frame(
width = rnorm(10, mean=16.5, sd=2.5),
sex = "Male"
)
# Female data
df_females <- data.frame(
width = rnorm(n=10, mean=15, sd=1.8),
sex = "Female"
)
# Let's combine the datasets
df_body_widths <- rbind(df_males, df_females)
# Let's look at the male and female data
ggplot(df_body_widths, aes(x = width, fill = sex)) +
ylab("Count") + xlab("Body width") +
geom_histogram(bins = 30, colour = "black") +
geom_vline(data = filter(df_body_widths, sex == "Male"), aes(xintercept = mean(width)),
size = 1, linetype = "dashed", colour = "red") +
geom_vline(data = filter(df_body_widths, sex == "Female"), aes(xintercept = mean(width)),
size = 1, linetype = "dashed", colour = "blue") +
theme_classic()
```
Males have a mean body width of `r mean(df_males$width)` while females have a mean body width of `r mean(df_females$width)`
We will use a randomization test to test if males and females differ in body width. If they do **not** differ, then each body width measurement in our dataframe should be equally likely to belong to a male or to a female. In other words, body width should independent of beetle sex. We can simulate this by randomly rearranging the body width values, effectively removing the effect of sex on body width, and then calculating a test statistic (e.g. _t_ statistic, difference in means). We can do this many times (e.g. 5,000) and ask how many times our observed test statistic falls outside the randomly generated distribution of test statistics. We will use the difference in mean body width as our test statistic so let's go ahead and calculate the _observed_ difference in body width between male and female beetles.
```{r}
mean_males <- mean(df_males$width)
mean_females <- mean(df_females$width)
diff_means_obs <- mean_males - mean_females
```
The observed difference in mean body width between male and female Hercules beetles is `r diff_means_obs`
Traditionally, if we wanted to determine if there is a significant difference in means between two groups, we could use a **t-test**. Let's perform a t-test on these data so we could compare it to the results of the randomization test that we will perform. Note the `alternative = two.sided` argument passed to `t.test()`. Since we are interested in whether males and females differ in mean body length without reference to whether one sex is _specifically_ larger or smaller than the other, we are performing a two-sided hypothesis test. Realistically, males could either have larger or smaller body widths so a two-tailed test is appropriate here.
```{r}
t.test(width ~ sex, data = df_body_widths, alternative = "two.sided")
```
From the t-test above, we see that our t-statistic has a value of 2.49 and that males and females differ significantly in body width (_P_ < 0.05). From the means reported in the t-test, it looks like males have larger body widths than females (17.36 vs. 15.03). Let's now perform a randomization test. To get a better sense of what we're doing, let's first perform a single reshuffling of the data and look at the resulting distribution of the data.
```{r message=FALSE, warning=FALSE}
# Set seed for reproducible results
set.seed(47)
# Randomly reshuffle the width column
reshuffled <- df_body_widths
reshuffled$width <- sample(reshuffled$width,
size = nrow(reshuffled), replace = FALSE)
# View data to confirm that values have been reshuffle
head(df_body_widths)
head(reshuffled)
```
Note the differences in body width values between the original and reshuffled dataframes above.
```{r message=FALSE, warning=FALSE}
# Plot histograms of reshuffled data
ggplot(reshuffled, aes(x = width, fill = sex)) +
ylab("Count") + xlab("Body width") +
geom_histogram(bins = 30, colour = "black") +
geom_vline(data = filter(reshuffled, sex == "Male"), aes(xintercept = mean(width)),
size = 1, linetype = "dashed", colour = "red") +
geom_vline(data = filter(reshuffled, sex == "Female"),aes(xintercept = mean(width)),
size = 1, linetype = "dashed", colour = "blue") +
theme_classic()
# Let get the mean difference of this single simulation
mean_males_sim1 <- mean(reshuffled[reshuffled$sex == "Male", "width"])
mean_females_sim1 <- mean(reshuffled[reshuffled$sex == "Female", "width"])
mean_diff_sim1 <- mean_males_sim1 - mean_females_sim1
```
The mean difference from a single reshuffling of the data is `r mean_diff_sim1`
We can see that the mean difference from the simulated data is considerably different from the observed mean difference. This makes sense since the body width values were randomly assigned to males and females. Let's now do this 5,000 times, each time calculating the mean difference in body width.
```{r message=FALSE, warning=FALSE}
# Set seed for reproducible results
set.seed(49)
# Initialize a list to store the simulated test-statistics
simulated_means <- list()
nreps = 5000 # 5000 iterations
for(i in 1:nreps){
# Create temporary dataframe to permute so we don't modify the original
reshuffled <- df_body_widths
# Permute the width column with the 'sample()' function.
reshuffled$width <- sample(reshuffled$width, size = nrow(reshuffled),
replace = FALSE)
# Calculate the means for each sex
mean_males_sim <- mean(reshuffled %>% filter(sex == "Male")
%>% pull(width))
mean_females_sim <- mean(reshuffled %>% filter(sex == "Female")
%>% pull(width))
# Calculate to difference between simulated male and female body width
# means
mean_diff_sim <- mean_males_sim - mean_females_sim
# Append simulated mean difference to list
simulated_means[i] <- mean_diff_sim
}
# Unlist simulated means list into numeric vector
simulated_means <- unlist(simulated_means)
# Show first 10 simulated mean differences
simulated_means[1:10]
```
We now have a numeric vector containing 5,000 simulated differences in mean body width between males and females. Let's plot a histogram of the simulated values and overlay onto the histogram our observed mean difference.
```{r}
ggplot() +
ylab("Count") + xlab("Simulated mean difference") +
geom_histogram(aes(x = simulated_means), bins = 30,
fill = "grey", alpha = 0.4, colour = "black") +
geom_vline(xintercept = diff_means_obs, size = 1,
linetype = "dashed", colour = "black") +
theme_classic()
```
Finally, to get out P-value, we calculate the number of times the simulated mean difference exceeded the observed mean difference from our data. Because we are performing a two-tailed test, this amounts to determining the number of times the simulated mean difference is **either greater or lesser than the observed difference**. We can do this by asking how many times the absolute value of the simulated mean difference is greater or equal to the absolute value of the observed mean difference.
```{r}
abs_simulated_means <- abs(simulated_means)
abs_diff_means_obs <- abs(diff_means_obs)
exceed_count <- length(abs_simulated_means[abs_simulated_means >=
abs_diff_means_obs])
p_val <- exceed_count / nreps
```
The P-value from the randomization test is `r p_val`
As we can see, the randomization test provides results that are largely consistent with the t-test. This is not surprising since we specifically sampled independent data from normal distributions with similar variances. In other words, our sampled data do not violate any of the assumptions of the t-test and in such a case the difference in means is by definition analogous to the _t_ statistic used in the t-test. There are some cases where randomization tests provide more accurate (or _exact_) P-values than parametric tests (e.g. small samples from skewed distributions) but even more diverse applications of randomization tests are found in ecology and evolutionary biology.
## Randomization tests: A real-world application
Randomization tests are applied in all corners of ecological and evolutionary research but for our purpose heres we'll turn to the field of community ecology for an example. Community ecologists have long been interested in understanding the factors that shape the organization of ecological communities. Are the species in an ecological community just a random assortment of species available from the regional species pool? Conversely, do species interactions and shared resources determine the local distribution of species, such that some species are found together more often than we would expect by chance whereas other never or only rarely co-occur. The proposition that competition and shared resources are important in driving community assembly are known as _assembly rules_ and were first proposed by Diamond (1975). An important null model based on randomization tests was later developed to test what species assemblages would look like in the absence of competition and shared resources (Connor and Simberloff 1979). We will apply one such randomization test in this section.
First, let's look at the type of data we need for this analysis.
```{r message=FALSE, warning=FALSE}
# Load in example dataset
comMatrix_example <- "https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/lec09_CommunityMatrix_Example.csv"
download.file(comMatrix_example, "CommunityMatrix_Example.csv")
comMatrix_example <- read_csv("CommunityMatrix_Example.csv")
head(comMatrix_example)
# Sliced to return a numeric Matrix, which is required for
# colSums() and rowSums()
colSums(comMatrix_example[,2:5])
rowSums(comMatrix_example[,2:5])
```
As you can see, the data is a matrix with sites (e.g. islands, quadrats, etc.) as columns and species as rows. Each cell represents whether a species is present (1) or absent (0) at that site. This type of data is often referred to as a _presence-absence matrix_. The column totals (using `colSums()`) gives us a sense of how species richness varies across the sites while the row totals (using `rowSums()`) tells us how rare or common a species is across all sites. The question we are interested in addressing is: **Are the observed species co-occurrence patterns the result of species interactions or random chance?**
Let's read in some real data before proceeding. The data are from the [National Ecological Observatory Network](https://www.neonscience.org/) and represent the presence and absence of plant species in eight square meter plots during the month of July, 2017, at a single site around Harvard, Boston, MA. Note this is only a subset of the very large dataset provided by NEON, which dates back to 2013 and includes about 46 different sites.
```{r message=FALSE, warning=FALSE}
# Load in NEON data
neon_data <- "https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/NEON_PlantPA_HARV_201707.csv"
download.file(neon_data, "NEON_PlantPA_HARV_201707.csv")
neon_data <- read_csv("NEON_PlantPA_HARV_201707.csv")
glimpse(neon_data)
```
Clearly this data is not in the form that we need it to be. As I'm sure you know by now, R has many useful libraries and functions that should allow us to change this relatively easily. We'll rely on the `reshape2` package to transform our dataset into a presence-absence matrix.
```{r message=FALSE, warning=FALSE}
neon_data_filtered <- neon_data %>%
# We only want rows with plant species and not 'otherVariables'
dplyr::filter(divDataType == "plantSpecies") %>%
# To create a presence-absence matrix, we only need the taxonID
# (i.e. species) and the plotID (i.e. Site)
dplyr::select(plotID, taxonID)
# Create Presence-Absence matrix
# Keep only unique rows so that cells represent presence-absence
# and not abundances
neon_data_filtered <- unique(neon_data_filtered)
PA_matrix <- dcast(neon_data_filtered, formula = taxonID ~ plotID, fun.aggregate = length)
head(PA_matrix)
```
Well that was painless! We now have our data in a form that is usable in our randomization test but we need to decide two things first: (1) _What is our test statistic?_ This is what we'll use to determine if co-occurrence patterns are random or not, similar to how we used the difference in means to tell if male and female beetles differed in length. (2) _How do we randomize such a large matrix?_ Do we randomly permute rows and columns so that each cell is equally likely to contain a 1 or 0? Do we constrain row permutations so that row totals (i.e. species rarity) must be preserved but site richness is free to vary? Do we constrain column permutations so that column totals (i.e. site richness) must be preserved but species rarity can vary? It turns out there are exactly 9 different ways a matrix like this can be randomized, each involving a different set of constraints and biological assumptions (Gotelli 2000).
For our purpose, we will preserve species rarity but will assume that each site is equally likely to be colonized. In other words, we are assuming that sites do not vary in habitat quality or anything else that may change the number of species that can occupy it. However, we want rare species to remain rare throughout the permutations. This model is analogous to the _SIM2_ algorithm in Gotelli (2000) and is reasonable in our case since the sites are relatively close in proximity (i.e. likely have similar habitat quality).
For our index, we will use the **C-score** (Stone and Roberts 1990). I will not go into the calculation of this score here but suffice it to say that the higher the C-score, the less co-occurrence (on average) between all pairs of species in the matrix (i.e. a more segregated matrix). Smaller C-scores indicate an aggregated matrix (i.e. more co-occurrence among species pairs). Thankfully for us, there is an R package that implements the randomization algorithm for us. Let's go ahead and conduct the analysis.
```{r message=FALSE, warning=FALSE}
set.seed(50)
# Perform co-occurrence analysis
co_oc_analysis <- cooc_null_model(PA_matrix, algo = "sim2",
nReps=1000,
metric = "c_score",
suppressProg = TRUE,
saveSeed = TRUE)
# Summarize output from co-occurrence analysis
summary(co_oc_analysis)
# Let plot the randomization test
plot = plot(co_oc_analysis,type="hist")
```
As we can see, our observed C-score (red solid line) is much lower (significantly so) than we would expect by random chance given the species in our community. This suggests the community is _aggregated_ such that some species are more likely to occur with other species in the community than we would expect if the communities were assembling themselves randomly. This can happen if there is facilitation and some species create ecological niches or favourable habitat for other species such that they occur more often. Importantly, the C-score is an average across all pairwise comparisons in the matrix and thus doesn't tell us anything about which specific species pairs may be more or less likely to co-occur. Identifying which species pairs are driving this pattern would require quite a bit more time and identifying the ecological mechanisms behind co-occurrence patterns of specific species pairs requires careful experimentation.
For the sake of completeness, we can also visualize a single permutation of the matrix compared against our observed matrix.
```{r}
plot(co_oc_analysis,type="cooc")
```
Notice how all of the rows in the two figures above always have the same number of filled squares. This is confirmation that the rarity of species (i.e. row totals) has been preserved during the permutation.
## Other uses of randomization tests
Testing for mean differences among groups and for putative rules governing community assembly are only two of the many uses for randomization tests. In this section, I briefly highlight other uses of data randomization in statistics, ecology, and evolutionary biology.
Randomization tests are additionally used numerous fields in ecology and evolutionary biology such as spatial ecology (e.g. testing spatial patterns in animal distribution and allowing for spatial autocorrelations), phylogenetics (e.g. permutations of species phylogenies), community ecology and environmental monitoring. In addition to randomization tests in the way that we've demonstrated here, randomization procedures play an important role in other aspects biology and statistics.
**Bootstrapping** involves sampling data _with replacement_ to estimate the sampling distribution of a population estimator (e.g. mean, median, standard deviation). This is often done to estimate the error (e.g. standard error, confidence interval) around the estimated population parameter.
**Cross-validation** involves randomly subsampling a dataset, which is then held as a _model validation set_ on which the predictive ability of a model can be tested. A model is tested on the remaining data (i.e. the _training set_) and used to predict the validation set. This is done many times to estimate the accuracy of the model.
**Monte-carlo methods** are a huge class of methods that generally involve simulating random numbers to estimate probability distributions and for obtaining numerical solution to really complicated problems. For our purposes, these are especially common in computational biology, genomics and in Bayesian inference (e.g. building phylogenies).
## Additional reading
1. Diamond, J.M. 1975. Assembly of species communities. p. 342-444 in: Ecology and Evolution of Communities. M.L. Cody and J.M. Diamond (eds.). Harvard University Press, Cambridge.
2. Connor, E.F. and D. Simberloff. 1979. The assembly of species communities: chance or competition? Ecology 60: 1132-1140.
3. Connor, E.F., M.D. Collins, and D. Simberloff. 2013. The checkered history of checkerboard distributions. Ecology 94: 2403-2414
4. Stone. L. and A. Roberts. 1990. The checkerboard score and species distributions. Oecologia 85: 74-79.
5. Manly, B. 2007. Randomization, bootstrap and Monte Carlo methods in biology. B. Carlin, C. Chatfield M. Tanner and J. Zidek, eds. 3rd edition. Chapman and Hall/CRC.