6 Statistics module

6.1 General

This module of FXMATE lets you determine NOEC and LOEC values depending on the type of your data (currently, continuous and quantal data are supported) and its further characteristics.

The data type is automatically set by the selected guideline. While it can be changed and the app will try to convert your data to make it amenable to the respective methods, doing so is almost never appropriate and will likely cause erroneous results.

Results from testing normality, variance homogeneity, and monotonicity can also be overruled by the user based on expert judgement by ticking the Assume … anyway checkboxes.

FXMATE supports you visually with method choice in the Decision tree tab, where you always find your current choice of method framed in red, while the optimal choice determined by the app is highlighted in green.

R error and warning messages

For reasons of transparency, error and warning messages from the R backend are displayed in this module. Error messages indicate that a particular test could not be performed; FXMATE also informs you about this in a pop-up box. Warning messages, however, can provide important insights into testing performance. A common message, for instance, informs about the presence of ties (equal values in the data), which can affect the precision with which some methods compute p-values. If you want to know more about a warning message, you can enter “R <your test> <warning message>” in an online search engine.

Tests in this module come from R packages car, DescTools, ecotoxicology, PMCMRplus, rstatix, and survey. If sources of tests are not explicitly mentioned below, realizations were taken from base R.

Note that FXMATE generally uses an alpha of 0.05 for all null-hypothesis significance tests. Statistical significance is hence assumed if p-values are < 0.05.

6.2 Data type

6.2.1 Continuous data

Continuous data can take an infinite number of possible values. A good example is algae growth data. In the context of mortality data, it gets more complicated. While this data, in theory, comes from a binomial distribution as animals can only be either alive or dead, many test guidelines house several animals in one test unit, producing aggregated estimates of mortality, which then - in the context of NOEC and LOEC determination - are handled as continuous. A good example for this is acute Daphnia data. This is also the reason why data types in the Model and Statistics modules can differ (e.g., Daphnia data is handled as quantal in the former).

6.2.1.1 Normality tests

FXMATE offers you two ways to assess if your data is normally distributed:

As done customarily, respective tests are performed for each treatment. If one treatment shows a statistically significant test, data are assumed not to originate from a normal distribution.
When you click the checkbox Test normality globally, the normality test is applied to your pooled data after the respective treatment means have been subtracted. This option is offered because ANOVA and other so-called parametric testing methods (see Section 6.2.1.4) do not actually assume the treatment data to be normally distributed but rather the model residuals (i.e., data minus treatment means).
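
The global option can be illustrated with a minimal Python sketch (FXMATE itself runs on R; the function name `treatment_residuals` and the example data are hypothetical). It only shows how residuals are formed by subtracting treatment means and pooling the results; the normality test would then be applied to the pooled list.

```python
from statistics import mean

def treatment_residuals(groups):
    """Subtract each treatment's mean from its observations and pool the results."""
    residuals = []
    for values in groups.values():
        group_mean = mean(values)
        residuals.extend(v - group_mean for v in values)
    return residuals

# Hypothetical example data: three treatments with three replicates each
data = {
    "control": [10.0, 12.0, 11.0],
    "low": [9.0, 8.0, 10.0],
    "high": [5.0, 6.0, 4.0],
}
pooled = treatment_residuals(data)
print(pooled)  # one residual per observation, centered on zero within each treatment
```

Because the treatment means are removed, differences between treatments no longer distort the assessment of the distribution's shape.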

To support your decision about normality visually, a Q-Q plot is provided in the Normality panel. This shows you quantiles of your samples against theoretical quantiles from a normal distribution. The more closely points follow the 1:1 line, the more likely it is that your data originate from a normal distribution. A Q-Q plot is particularly useful if your sample size is very large, as even small deviations from a normal distribution can then result in significant tests when using the methods listed below.
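
As a rough illustration of what a Q-Q plot displays, the following Python sketch computes the point coordinates for a standardized sample. The function name `qq_points` and the plotting positions \((i - 0.5)/n\) are assumptions made for this example; the positions used by the app's R backend may differ.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        # Assumed plotting position (i - 0.5) / n; other conventions exist
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (value - m) / s))
    return points

points = qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
for theoretical, observed in points:
    print(round(theoretical, 3), round(observed, 3))
```

For normally distributed data, the two coordinates of each point are similar, so the points scatter around the 1:1 line.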

6.2.1.1.1 Shapiro-Wilk test

General: tests if a sample comes from a normally distributed population
Prerequisites: sample size \(\geq\) 3
Test statistic: W
Interpretation: if significant, data cannot be assumed to originate from a normal distribution
Original publication: Shapiro and Wilk (1965)

6.2.1.1.2 Kolmogorov-Smirnov test

General: tests if a sample comes from a given reference probability distribution (here: normal)
Test statistic: D
Interpretation: if significant, data cannot be assumed to originate from a normal distribution
Remarks: known to be less sensitive than the Shapiro-Wilk test
Original publication: Kolmogorov (1933)

6.2.1.2 Tests for variance homogeneity

To support your decision about variance homogeneity visually, a graph combining treatment-wise boxplots with standard deviations is provided in the Variance homogeneity panel.

6.2.1.2.1 Levene’s test

General: tests for equality of variances for a variable with two or more groups
Test statistic: F
Interpretation: if significant, variances cannot be assumed to be equal
Original publication: Levene (1960)
Realized using: car
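
To make the F statistic of Levene's test more tangible, here is a minimal Python sketch: absolute deviations from each group's center are computed and then compared with a one-way ANOVA. Centering on the group median (the default of `leveneTest` in car) is assumed here; whether FXMATE keeps that default is not confirmed by this manual. The function names are hypothetical.

```python
from statistics import mean, median

def one_way_f(groups):
    """F statistic of a one-way ANOVA (between-group / within-group mean squares)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def levene_f(groups):
    """Levene-type F statistic using absolute deviations from group medians (assumed)."""
    deviations = [[abs(v - median(g)) for v in g] for g in groups]
    return one_way_f(deviations)

# Hypothetical data: spread increases from group to group
f_value = levene_f([[1, 2, 3], [2, 4, 6], [3, 6, 9]])
print(round(f_value, 4))
```

The larger the differences in spread between groups, the larger F becomes; the p-value then follows from the F distribution with \(k - 1\) and \(N - k\) degrees of freedom.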

6.2.1.2.2 Bartlett’s test

General: tests if multiple samples are from populations with equal variances
Prerequisites: normal distribution
Test statistic: K-squared
Interpretation: if significant, variances cannot be assumed to be equal
Original publication: Bartlett (1937)

6.2.1.3 Monotonicity tests

To support your decision about monotonicity visually, a plot displaying linear connections between control and treatment means is provided in the Monotonicity panel. If the same effect direction is present across treatments, this indicates monotonicity.

6.2.1.3.1 Linear vs. quadratic contrasts

General: linear and quadratic contrasts of normalized rank statistics are constructed
Interpretation: if the quadratic trend is significant and the linear trend is not, the response is not considered monotonic, otherwise it is
Original publication: OECD (2006)
Realized using: own code

6.2.1.3.2 Kendall rank correlation coefficient

General: measures the ordinal association between two variables
Test statistic: tau
Interpretation: if significant, a monotonic relationship can be assumed with the sign of tau indicating the effect direction
Original publication: Kendall (1938)
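
The meaning of tau can be sketched in a few lines of Python: every pair of observations is classified as concordant or discordant. This is a simplified version (tau-a) that assumes no ties; the function name `kendall_tau` and the data are hypothetical, and the R backend additionally handles ties and computes the p-value.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical example: response decreases with every concentration step
print(kendall_tau([0, 1, 2, 3], [10, 8, 5, 1]))  # -1.0
```

A tau close to -1 or 1 indicates a consistently decreasing or increasing response, respectively.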

6.2.1.3.3 Spearman’s rank correlation coefficient

General: assesses how well the relationship between two variables can be described using a monotonic function
Test statistic: S/rho
Interpretation: if significant, a monotonic relationship can be assumed with the sign of rho (not displayed in app) indicating the effect direction
Original publication: Spearman (1904)
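
Since rho is not displayed in the app, the following Python sketch shows how S and rho relate when there are no ties: S is the sum of squared rank differences and \(\rho = 1 - 6S / (n^3 - n)\). The function names and data are hypothetical, and tied values would require a corrected formula.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_s_and_rho(x, y):
    """Return S (sum of squared rank differences) and rho for untied data."""
    n = len(x)
    s = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    rho = 1 - 6 * s / (n ** 3 - n)
    return s, rho

# Hypothetical example: strictly decreasing response
print(spearman_s_and_rho([0, 1, 2, 3, 4], [5, 4, 3, 2, 1]))  # (40, -1.0)
```

A small S therefore corresponds to a rho near 1 (increasing response), whereas a large S corresponds to a rho near -1 (decreasing response).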

6.2.1.4 NOEC/LOEC determination

6.2.1.4.1 Omnibus tests

These are statistical tests that determine if there is a difference between multiple groups. They do not test which groups actually differ - that’s where post-hoc tests come into play. A significant omnibus test is usually considered a prerequisite for post-hoc testing. Results of omnibus tests are displayed in the NOEC/LOEC panel.

6.2.1.4.1.1 ANOVA

General: analysis of variance testing if two or more population means are equal
Prerequisites: normal distribution, variance homogeneity
Test statistic: F
Interpretation: significance indicates that some of the group means are different
Original publication: Fisher (1921)

6.2.1.4.1.2 Welch’s ANOVA

General: see Section 6.2.1.4.1.1
Prerequisites: normal distribution
Test statistic: F
Interpretation: see Section 6.2.1.4.1.1
Original publication: Welch (1951)
Realized using: rstatix

6.2.1.4.1.3 Kruskal-Wallis test

General: tests if samples originate from the same distribution
Prerequisites: identically shaped and scaled distribution for all groups; note that this is hard to test for and usually ignored but violation may result in significant results not being due to differences in medians
Test statistic: Chi-squared
Interpretation: if significant, at least one population median of one group is different from the population median of at least one other group
Original publication: Kruskal and Wallis (1952)
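
The test statistic can be illustrated with a short Python sketch that ranks all observations jointly and compares rank sums between groups. It assumes untied data (no tie correction); the function name `kruskal_wallis_h` and the data are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for untied data (no tie correction applied)."""
    pooled = sorted(v for g in groups for v in g)
    # Assumes all values are distinct, so each value has a unique rank
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Hypothetical example: three completely separated groups
h_value = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_value, 2))  # 7.2
```

The p-value is obtained by comparing H with a chi-squared distribution with (number of groups minus one) degrees of freedom, which is why the app reports the statistic as Chi-squared.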

6.2.1.4.2 Post-hoc tests

6.2.1.4.2.1 P-value corrections

When comparing several groups to a common control, as done in post-hoc testing for determining the NOEC and LOEC values, we face the multiple comparisons problem: the probability of falsely rejecting at least one null hypothesis (here, that groups do not differ) increases with each comparison, and we need to correct for this. Many methods listed below use an internal procedure to this end, but methods designed to compare only two groups (e.g., Student’s t-test) do not, so the correction has to be applied separately.

FXMATE offers two different correction methods, which are implemented as p-value adjustments:

Bonferroni correction

This technique, developed by Bonferroni (1936), is straightforward: p-values derived from two-sample tests are simply multiplied by the number of comparisons (i.e., number of groups minus one) and capped at 1. However, this makes the Bonferroni correction quite conservative (particularly if you have many comparisons) and might result in an unnecessary loss of statistical power (i.e., the probability to detect a true effect).
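
A minimal Python sketch of this adjustment (the function name `bonferroni_adjust` and the p-values are hypothetical; adjusted values are capped at 1 because a p-value cannot exceed 1):

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical p-values from three treatment-vs-control comparisons
adjusted = bonferroni_adjust([0.01, 0.02, 0.5])
print([round(p, 4) for p in adjusted])  # [0.03, 0.06, 1.0]
```

In this example, only the first comparison remains significant at an alpha of 0.05, although the second raw p-value was below 0.05.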

Holm-Bonferroni method

This technique, developed by Holm (1979), is a bit more complicated. All p-values from the two-sample tests are sorted from lowest to highest and then compared sequentially against increasingly lenient significance thresholds:

\[ \frac{\alpha}{m}, \frac{\alpha}{m - 1}, \ldots, \frac{\alpha}{2}, \frac{\alpha}{1} \]

\(\alpha\) is the significance level, which is set to 0.05 in FXMATE, and \(m\) is the number of comparisons. The sorted p-values are checked for significance one after another and the procedure stops at the first non-significant comparison; this and all remaining comparisons are considered non-significant. Expressed as a p-value adjustment, this is equivalent to multiplying the sorted p-values by \(m, m - 1, \ldots, 1\) and comparing the results with \(\alpha\). While being a bit more complex, the method is generally more powerful than the Bonferroni method and is hence used as default in the app. If desired, you can change this in the Select p-value adjustment method dropdown menu in the sidebar.
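
The sequential procedure can be sketched in Python as a p-value adjustment. The function name `holm_adjust` and the p-values are hypothetical; the running maximum ensures that adjusted p-values never decrease along the sorted sequence, which reproduces the "stop at the first non-significant comparison" rule.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment: multiply sorted p-values by m, m-1, ..., 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[index] = running_max
    return adjusted

# Hypothetical p-values from three treatment-vs-control comparisons
raw = [0.01, 0.02, 0.04]
print([round(p, 4) for p in holm_adjust(raw)])  # [0.03, 0.04, 0.04]
```

With these example values, all three comparisons stay below 0.05 after the Holm adjustment, whereas multiplying every p-value by \(m = 3\) would leave only the first one significant.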

6.2.1.4.2.2 Student’s t-test with correction

General: tests if means of two samples differ
Prerequisites: normal distribution, variance homogeneity
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Test statistic: t
Interpretation: if significant, means of samples differ
Original publication: Student (1908)
Realized using: rstatix

6.2.1.4.2.3 Welch’s t-test with correction

General: adaptation of Student’s t-test that can handle unequal variances
Prerequisites: normal distribution
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Test statistic: t
Interpretation: if significant, means of samples differ
Original publication: Welch (1947)
Realized using: rstatix

6.2.1.4.2.4 Wilcoxon rank-sum test with correction

General: tests if samples from two populations have the same distribution
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Prerequisites: dispersions and shapes of sample distributions are equal; note that this is hard to test for and usually ignored but violation may result in significant results not being due to differences in medians
Test statistic: U/W
Interpretation: if significant, medians of samples differ
Original publication: Mann and Whitney (1947)

6.2.1.4.2.5 Dunn’s many-to-one rank comparison test

General: tests for differences in mean ranks of groups
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Prerequisites: same as for Kruskal-Wallis test
Test statistic: z
Interpretation: if significant, medians of samples differ
Original publication: Dunn (1964)
Realized using: PMCMRplus

6.2.1.4.2.6 Dunnett’s test

General: tests if any treatment mean differs from the control mean
P-value correction: internal control
Prerequisites: normal distribution, variance homogeneity
Test statistic: t
Interpretation: if significant, treatment mean differs from the control mean
Original publication: Dunnett (1955)
Realized using: PMCMRplus

6.2.1.4.2.7 Williams’ trend test

General: tests if any treatment mean differs from the control mean using a step-down procedure
P-value correction: internal control
Prerequisites: normal distribution, variance homogeneity, monotonicity
Test statistic: t
Interpretation: if significant (indicated in the app if decision is reject), treatment mean differs from the control mean
Remarks: as this test uses a step-down procedure, it stops at the first/highest concentration/dose that is not significant (indicated in the app if decision is accept); the test is a one-sided test and alternatives greater or less are automatically selected according to the selected guideline but this can be changed by the user in the Alternative dropdown menu in the sidebar
Original publication: Williams (1971)
Realized using: PMCMRplus

6.2.1.4.2.8 Jonckheere’s trend test

General: tests if population medians have an a priori ordering using a step-down procedure
P-value correction: internal control
Prerequisites: monotonicity
Test statistic: z
Interpretation: if significant, treatment differs from the control
Remarks: as this test uses a step-down procedure, it stops at the first/highest concentration/dose that is not significant; by default, a two-sided test is performed, which can be changed by the user in the Alternative dropdown menu in the sidebar
Original publication: Jonckheere (1954)
Realized using: PMCMRplus

6.2.2 Quantal data

Quantal data arises from binary (0 or 1) data. A classical example is mortality data, where animals can only be either dead or alive.

6.2.2.1 Monotonicity test

To judge if proportions arising from quantal data are monotonically increasing or decreasing, FXMATE uses the functions IsMonotonicallyIncreasing and IsMonotonicallyDecreasing from the ecotoxicology package. These simply test:

\[ p_0 \leq p_1 \leq p_2 \leq ... \leq p_m \]

\[ p_0 \geq p_1 \geq p_2 \geq ... \geq p_m \]

\(p_0\) is the proportion in the control and \(p_m\) the proportion observed at the highest dose/concentration.

You can see the result of these tests in the sidebar under Monotonicity.
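
The two checks above amount to simple pairwise comparisons of neighbouring proportions. The following Python functions are hypothetical stand-ins written for illustration, not the functions from the ecotoxicology package:

```python
def is_monotonically_increasing(proportions):
    """True if p_0 <= p_1 <= ... <= p_m."""
    return all(a <= b for a, b in zip(proportions, proportions[1:]))

def is_monotonically_decreasing(proportions):
    """True if p_0 >= p_1 >= ... >= p_m."""
    return all(a >= b for a, b in zip(proportions, proportions[1:]))

# Hypothetical mortality proportions, control first, highest concentration last
mortality = [0.0, 0.1, 0.1, 0.4, 0.9]
print(is_monotonically_increasing(mortality))  # True
print(is_monotonically_decreasing(mortality))  # False
```

Note that equal neighbouring proportions do not violate monotonicity under these definitions.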

6.2.2.2 Test for extrabinomial variance

Extrabinomial variance occurs if observed variance is larger than would be expected for a binomial distribution. As the Cochran-Armitage test is sensitive to extrabinomial variation, it needs to be tested for to decide if the Rao-Scott test is required as an alternative.

FXMATE uses Tarone’s Z test (Tarone, 1979), which is implemented in the app following suggestions discussed here.

A significant Tarone’s Z test indicates extrabinomial variation, which points to using the Rao-Scott test.
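
For orientation, a commonly used form of Tarone's Z statistic is sketched below in Python. This is an illustration under stated assumptions, not the app's code: the function name `tarone_z` and the data are hypothetical, the upper-tail p-value is an assumption (FXMATE may compute it differently), and the overall proportion must lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

def tarone_z(affected, sizes):
    """Tarone's Z for extrabinomial variation across replicates.

    affected: number of affected individuals per replicate
    sizes:    number of individuals per replicate
    """
    p_hat = sum(affected) / sum(sizes)  # pooled proportion, must be in (0, 1)
    s = sum((x - n * p_hat) ** 2 for x, n in zip(affected, sizes)) / (p_hat * (1 - p_hat))
    z = (s - sum(sizes)) / sqrt(2 * sum(n * (n - 1) for n in sizes))
    p_upper = 1 - NormalDist().cdf(z)  # assumed: large positive Z signals overdispersion
    return z, p_upper

# Hypothetical replicates of 10 animals each with strongly varying outcomes
z_over, p_over = tarone_z([0, 10, 0, 10], [10, 10, 10, 10])
print(round(z_over, 3), p_over < 0.05)
```

Replicates that vary more than a binomial distribution would predict yield a large positive Z, whereas very similar replicates yield a Z at or below zero.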

6.2.2.3 NOEC/LOEC determination

6.2.2.3.1 Cochran-Armitage test

General: tests for the presence of an association between a variable with two categories (here 0 and 1) and an ordinal variable with several categories with an ordering in the effects using a step-down procedure
P-value correction: internal control
Prerequisites: monotonicity, no extrabinomial variance
Test statistic: Z
Interpretation: if significant, treatment differs from the control
Remarks: as this test uses a step-down procedure, it stops at the first/highest concentration/dose that is not significant; by default, a two-sided test is performed, which can be changed by the user in the Alternative dropdown menu in the sidebar
Original publication: Cochran (1954)
Realized using: DescTools

6.2.2.3.2 Rao-Scott test

General: version of the Cochran-Armitage test adjusted for extrabinomial variance
P-value correction: internal control
Prerequisites: monotonicity
Test statistic: Chi-squared
Interpretation: if significant, treatment differs from the control
Remarks: as this test uses a step-down procedure, it stops at the first/highest concentration/dose that is not significant
Original publication: Rao and Scott (1992)
Realized using: survey

6.2.2.3.3 Chi-squared test with correction

General: tests if there is a difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Prerequisites: large sample size (values in Expected contingency table in the NOEC/LOEC panel need to be 5 or larger)
Test statistic: Chi-squared
Interpretation: if significant, control and treatment frequencies differ
Original publication: Pearson (1900)
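
The sample size prerequisite refers to expected, not observed, frequencies. The Python sketch below shows how expected counts and the test statistic are derived for a control-vs-treatment table. The function names and data are hypothetical, and no continuity correction is applied here, which may differ from what the R backend does.

```python
def expected_counts(table):
    """Expected frequencies: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared_statistic(table):
    """Pearson's chi-squared statistic without continuity correction."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical table: rows = control, treatment; columns = alive, dead
observed = [[18, 2], [10, 10]]
expected = expected_counts(observed)
print(expected)                                  # [[14.0, 6.0], [14.0, 6.0]]
print(min(min(row) for row in expected) >= 5)    # prerequisite fulfilled: True
print(round(chi_squared_statistic(observed), 3))
```

Although one observed cell contains only 2 animals, all expected counts are 5 or larger, so the prerequisite is met in this example.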

6.2.2.3.4 Fisher’s exact test with correction

General: tests if there is a nonrandom association between two categorical variables in a contingency table
P-value correction: Bonferroni or Holm (see Section 6.2.1.4.2.1)
Test statistic: odds ratio
Interpretation: if significant, control and treatment frequencies differ
Original publication: Fisher (1922)
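
How an exact p-value arises for a 2x2 table can be sketched in Python using the hypergeometric distribution. The function name `fisher_exact_two_sided` and the data are hypothetical; the two-sided p-value is taken here as the sum of the probabilities of all tables that are at most as likely as the observed one, and the small tolerance for floating-point comparisons is an assumption of this sketch.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-7)
    )

# Hypothetical table: rows = control, treatment; columns = alive, dead
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```

Because all probabilities are enumerated exactly, this test has no minimum sample size requirement, unlike the Chi-squared test.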

Bartlett, M.S., 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences 160, 268–282.

Bonferroni, C., 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62.

Cochran, W.G., 1954. Some methods for strengthening the common \(\chi^2\) tests. Biometrics 10, 417–451.

Dunn, O.J., 1964. Multiple comparisons using rank sums. Technometrics 6, 241–252.

Dunnett, C.W., 1955. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 50, 1096–1121.

Fisher, R.A., 1922. On the interpretation of \(\chi^2\) from contingency tables, and the calculation of P. Journal of the Royal Statistical Society 85, 87–94.

Fisher, R.A., 1921. Studies in crop variation. I. An examination of the yield of dressed grain from broadbalk. The Journal of Agricultural Science 11, 107–135.

Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics 65–70.

Jonckheere, A.R., 1954. A distribution-free k-sample test against ordered alternatives. Biometrika 41, 133–145.

Kendall, M.G., 1938. A new measure of rank correlation. Biometrika 30, 81–93.

Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari 4, 89–91.

Kruskal, W.H., Wallis, W.A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American statistical Association 47, 583–621.

Levene, H., 1960. Robust tests for equality of variances. Contributions to probability and statistics 278–292.

Mann, H.B., Whitney, D.R., 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics 50–60.

OECD, 2006. Current approaches in the statistical analysis of ecotoxicity data.

Pearson, K., 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 157–175.

Rao, J., Scott, A., 1992. A simple method for the analysis of clustered binary data. Biometrics 577–585.

Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591–611.

Spearman, C., 1904. The proof and measurement of association between two things. The American Journal of Psychology 15, 72–101.

Student, 1908. The probable error of a mean. Biometrika 1–25.

Tarone, R.E., 1979. Testing the goodness of fit of the binomial distribution. Biometrika 66, 585–590.

Welch, B.L., 1951. On the comparison of several mean values: An alternative approach. Biometrika 38, 330–336.

Welch, B.L., 1947. The generalization of “Student’s” problem when several different population variances are involved. Biometrika 34, 28–35.

Williams, D., 1971. A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics 103–117.