Upcoming Assignments/Quizzes

| Assignment | Open Time | Due Time |
|---|---|---|
| ANOVA Article Analysis Activity | October 22nd (1:00 am EST) | October 28th (11:55 pm EST) |
| Module 8 Data Quiz | October 26th (1:00 am EST) | October 28th (11:55 pm EST) |
| Module 9 Conceptual Quiz | October 26th (1:00 am EST) | October 28th (11:55 pm EST) |

Notes from Discussion Board/Office Hours

Relationship between the \(F\)-statistic, p-value, and null hypothesis

In sub-module 9.3, Dr. Baiser covers how to test hypotheses using ANOVA. To do this, we calculate our observed \(F\)-statistic as the ratio of the mean square among groups to the mean square within groups from our observed data. We then compare it to the distribution of possible \(F\)-statistics (i.e. the \(F\)-distribution), which depends on the degrees of freedom (df) in the numerator and denominator of our \(F\)-statistic, to determine the significance of our observed value.
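To make the mean-square calculation concrete, here is a minimal sketch in R. The data frame `dat` and its values are made up for illustration only; they are NOT the numbers from the 9.3 lecture, and all variable names are assumptions chosen here. The sketch assumes a balanced design (the same number of replicates in every treatment) and cross-checks the hand calculation against R's built-in ANOVA table.

```r
# Hypothetical data (NOT the lecture example): 3 treatments, 4 replicates each
dat <- data.frame(
  treatment = factor(rep(c("A", "B", "C"), each = 4)),
  response  = c(10, 12, 11, 13, 14, 15, 13, 16, 9, 8, 10, 11)
)

a <- nlevels(dat$treatment)   # number of treatments
n <- nrow(dat) / a            # replicates per treatment (balanced design assumed)

grand_mean  <- mean(dat$response)
group_means <- tapply(dat$response, dat$treatment, mean)

# Sums of squares among groups and within groups
ss_among  <- n * sum((group_means - grand_mean)^2)
ss_within <- sum((dat$response - group_means[dat$treatment])^2)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_among  <- ss_among / (a - 1)
ms_within <- ss_within / (a * (n - 1))

# Observed F-statistic: ratio of the two mean squares
f_manual <- ms_among / ms_within

# Cross-check against the F value reported by anova()
f_builtin <- anova(lm(response ~ treatment, data = dat))$`F value`[1]
all.equal(f_manual, f_builtin)
```

If the hand calculation is right, the last line should report that the two values agree.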

Let’s make some plots to visualize this comparison step-by-step. I’ll use the same example from the sub-module 9.3 lecture. Let’s start from the point where we calculate our observed \(F\)-statistic (pg. 15 from 9.3 notes), which I’ll call f_obs. Based on our calculations of the mean squares, we determined that \(F_{obs} = 5.11\).

Now let’s draw our \(F\)-distribution. Recall that this is determined by the dfs in the numerator (\(df_{num}\)) and the denominator (\(df_{den}\)) of our \(F\)-statistic. If we have \(a\) treatments and \(n\) replicates per treatment, then \(df_{num} = a - 1\) and \(df_{den} = a(n-1)\). In our example, \(a=3\) and \(n=4\) (pg. 8), therefore \(df_{num} = 2\) and \(df_{den} = 9\). With this information we can draw our \(F\)-distribution by creating a vector of possible values of \(F\) and passing those into the df() function in R.
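Before plotting, the degrees of freedom can be double-checked with a few lines of arithmetic. This is only a sketch: the variable names are chosen here for illustration, and it assumes a balanced one-way design, where the within-group df is the number of treatments times (replicates minus one).

```r
a <- 3   # number of treatments in the lecture example
n <- 4   # replicates per treatment in the lecture example

df_num   <- a - 1         # among-group (numerator) degrees of freedom
df_den   <- a * (n - 1)   # within-group (denominator) degrees of freedom
df_total <- a * n - 1     # total degrees of freedom

# The among- and within-group dfs should add up to the total df
c(df_num = df_num, df_den = df_den, df_total = df_total)
df_num + df_den == df_total
```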

library(tidyverse)
library(ggpubr)

# Possible values of F-stat:
x = seq(from = 0, to = 10, by = 0.01)

# Probability density at each possible value of F-stat
y = df(x = x, df1 = 2, df2 = 9)

ggplot() +
  geom_line(aes(x, y)) +
  labs(x = "F-Statistic", y = "Probability") +
  theme_pubclean()

This curve shows the possible values of the \(F\)-statistic (x-axis) and the probability density of those values (y-axis) if the null hypothesis were true (given the dfs we specified). We can use it to decide whether to reject or fail to reject the null hypothesis by comparing f_obs to a critical \(F\)-value determined by the significance level \(\alpha\), which you’ll recall is often set to \(\alpha = 0.05\). This critical value, which we will call f_crit, corresponds to a p-value of exactly 0.05.

It is important to note that we are working with a density function, which means that probabilities correspond to areas under the curve. We cannot simply draw a horizontal line at a height of 0.05 to find f_crit. Instead, we need to find the “quantile” that cuts off our area of interest (5%, or 0.05) in the upper tail. Luckily, the qf() function calculates quantiles of the \(F\)-distribution:

f_crit <- qf(p = 0.05, df1 = 2, df2 = 9, lower.tail = FALSE)

This gives f_crit equal to 4.26. Note that we set lower.tail = FALSE because we are using a one-tailed test on the upper end. Now we can draw the area under the curve that represents the “rejection region”:

ggplot(data.frame(x,y)) +
  geom_line(aes(x, y)) +
  stat_function(fun = df, 
                args = list(df1 = 2, df2 = 9), 
                xlim = c(f_crit, 10), 
                geom = "area", 
                fill = "red", 
                alpha = 0.6) +
  labs(x = "F-Statistic", y = "Probability") +
  theme_pubclean()
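As a rough check that the shaded region really has an area of 0.05, the area can be computed two ways: by numerically integrating the density from f_crit upward, and with the cumulative distribution function pf(). This is a sketch only; the variable names (alpha, area_integrated, area_pf, height_at_crit) are chosen here for illustration.

```r
# Same degrees of freedom and significance level as the example
df_num <- 2
df_den <- 9
alpha  <- 0.05

# Critical value: upper-tail quantile of the F-distribution
f_crit <- qf(p = alpha, df1 = df_num, df2 = df_den, lower.tail = FALSE)

# Area under the density curve from f_crit to infinity (numerical integration)
area_integrated <- integrate(df, lower = f_crit, upper = Inf,
                             df1 = df_num, df2 = df_den)$value

# The same upper-tail area from the cumulative distribution function
area_pf <- pf(f_crit, df1 = df_num, df2 = df_den, lower.tail = FALSE)

# Height of the curve at f_crit: a density value, not a probability
height_at_crit <- df(f_crit, df1 = df_num, df2 = df_den)

round(c(area_integrated = area_integrated, area_pf = area_pf), 4)
```

Both areas should come out at approximately alpha, while height_at_crit is a different number. That difference is why f_crit cannot be read off the y-axis.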

Finally, let’s add f_obs to our plot:

f_obs = 5.11

ggplot(data.frame(x,y)) +
  geom_line(aes(x, y)) +
  stat_function(fun = df, 
                args = list(df1 = 2, df2 = 9), 
                xlim = c(f_crit, 10), 
                geom = "area", 
                fill = "red", 
                alpha = 0.6) +
  geom_vline(aes(xintercept = f_obs), color = "darkblue", linetype = 2) +
  labs(x = "F-Statistic", y = "Probability") +
  theme_pubclean()

As you can see, f_obs falls in the rejection region, and therefore we will reject the null hypothesis that there is no difference between our treatments. As a final note, we can also calculate the p-value associated with f_obs using the pf() function:

p_value <- pf(f_obs, df1 = 2, df2 = 9, lower.tail = FALSE)
round(p_value, 3)
## [1] 0.033
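The two ways of reaching a decision, comparing f_obs to f_crit and comparing the p-value to \(\alpha\), should always agree, because both describe the same upper-tail area. A minimal sketch of that equivalence, with variable names (alpha, reject_by_f, reject_by_p) chosen here for illustration:

```r
alpha <- 0.05   # significance level
f_obs <- 5.11   # observed F-statistic from the lecture example

# Critical value and p-value from the same F-distribution (df1 = 2, df2 = 9)
f_crit  <- qf(p = alpha, df1 = 2, df2 = 9, lower.tail = FALSE)
p_value <- pf(f_obs, df1 = 2, df2 = 9, lower.tail = FALSE)

# Decision rule 1: is the observed statistic beyond the critical value?
reject_by_f <- f_obs > f_crit

# Decision rule 2: is the p-value below the significance level?
reject_by_p <- p_value < alpha

# Both rules should lead to the same conclusion
identical(reject_by_f, reject_by_p)
```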