Received: October 18, 2024
Accepted: March 05, 2025
Available: March 31, 2025
In data analysis, validating the normality assumption is crucial for determining the suitability of applying parametric methods. The objective of this research was to compare the power and sensitivity of sixteen normality tests, classified according to various aspects. The methodology involved simulating data using the Fleishman contamination system. This approach allowed us to evaluate the tests under non-normality conditions across ten distributions with varying degrees of deviation from normality. The results obtained showed that tests based on correlation and regression, such as Shapiro-Wilk and Shapiro-Francia, outperform the others in power, especially for large samples and substantial deviations from normality. For moderate deviations, the D’Agostino-Pearson and skewness tests performed well, while for low deviations, the Robust Jarque-Bera and Jarque-Bera tests were the most effective. Additionally, some tests exhibited high power across multiple distribution types, such as Snedecor-Cochran and Chen-Ye, which performed well for both symmetric platykurtic and asymmetric leptokurtic distributions. These findings offer valuable insights for selecting appropriate normality tests based on sample characteristics, which improves the reliability of statistical inference. Finally, it is concluded that this research demonstrates scenarios in which the most commonly used statistical tests are not always the most effective.
Keywords: Distribution classification method, Fleishman’s method, Monte Carlo simulation, normality tests, power comparison.
In data analysis, validating the normality assumption is crucial for determining whether it is appropriate to apply parametric methods. The objective of this research was to compare the power and sensitivity of sixteen normality tests, classified according to various aspects. The methodology consisted of simulating data with the Fleishman contamination system in order to evaluate the tests under non-normality across ten distributions with differing degrees of deviation from normality. The results showed that correlation- and regression-based tests, such as Shapiro-Wilk and Shapiro-Francia, outperformed the others in power, especially for large samples and substantial deviations from normality. For moderate deviations, the D’Agostino-Pearson and skewness tests performed well, while for low deviations the Robust Jarque-Bera and Jarque-Bera tests stood out. In addition, some tests showed high power across several distribution types, such as Snedecor-Cochran and Chen-Ye for both symmetric platykurtic and asymmetric leptokurtic distributions. These results provide valuable guidance for selecting appropriate normality tests based on sample characteristics, helping researchers improve the reliability of statistical inference. In conclusion, this article shows scenarios in which the best-known statistical tests are not always the most effective.
Keywords: distribution classification method, Fleishman’s method, Monte Carlo simulation, normality tests, power comparison.
Normality tests play a crucial role in maintaining the accuracy of statistical results and aiding in the decision-making process during data analysis. Their primary purpose is to determine whether a dataset follows a normal distribution or comes from a population with a normal distribution pattern. This requirement is essential for many statistical analyses, highlighting its inherent significance
The validity of population inferences drawn from sample data often depends on the normality assumption, which is especially relevant when using certain parametric statistical techniques. This assumption is characterized by a symmetric bell-shaped curve with a defined mean and standard deviation. When data follows a normal distribution, we can confidently use classical and parametric statistical methods since all necessary assumptions are met
The most widely used tests to determine data normality include the
Power and sensitivity are critical properties of a normality test and are vital for data analysis. A statistical test can identify real effects or differences when they are present. For normality tests, higher power increases the likelihood of detecting deviations from normality in the data, which is crucial when interpreting the results
Different statistical functions are used to categorize normality tests, which assess the coherence of data with a normal distribution. For instance, the Shapiro-Wilk test is based on regression coefficients and compares observed quantities with anticipated quantities under the assumption of normality. The Shapiro-Wilk test is commonly used to test the hypothesis of normality in an experimental design
Statistical tests such as the Kolmogorov-Smirnov test measure the difference between theoretical and empirical cumulative distribution functions
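To make the two approaches concrete, here is a minimal sketch applying a regression-based test (Shapiro-Wilk) and an empirical-distribution test (Kolmogorov-Smirnov) to the same sample. The paper's analyses use R functions (see Table 1); the Python/scipy calls, seed, and sample size below are illustrative stand-ins only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)  # a sample that truly is normal

# Shapiro-Wilk: correlation/regression-based test (group 3 in this paper)
sw_stat, sw_p = stats.shapiro(x)

# Kolmogorov-Smirnov against a normal with parameters estimated from the data
# (estimating the parameters makes the classical KS p-value conservative;
# the Lilliefors correction used in the paper addresses this)
ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"Shapiro-Wilk:       W = {sw_stat:.3f}, p = {sw_p:.3f}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3f}")
```

Both tests return a p-value that is compared against the chosen significance level; only the internal statistic differs.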
This article is structured into several sections, including the following: Section 2 presents the methodology employed, which comprises four groups of normality tests: moment-based tests, empirical distribution, correlation, and regression. Additionally, two data generation methods are considered: Fleishman's method and the distribution classification method. Section 3 provides a comprehensive analysis of the simulation study results, along with a discussion of the findings in relation to recent research. Finally, Section 4 presents the conclusions of the study.
This section details the methodology, organized into four subsections. The first subsection introduces the selected normality tests and provides the rationale for their selection. The second introduces Fleishman's method, which is used to generate non-normal data with varying levels of contamination. The third describes the distribution classification and specifies the probability distributions employed in the study. Finally, the fourth outlines the structure of the simulation study, detailing the data generation process and the application of the normality tests, with particular emphasis on assessing their power and sensitivity.
The motivation for evaluating sixteen normality tests is based on the work conducted by
In Table 1, reference is made to the tests that comprise each of the groups of interest. Expanded information on the distributions used can be found in
| Test coding | Author | R function |
| --- | --- | --- |
| Moment-based normality tests (group 1) | | |
| DK | D’Agostino-Pearson | agostino.test(x) |
| JB | Jarque-Bera | ajb.norm.test(x) |
| RJB | Robust Jarque-Bera | rjb.test(x) |
| BS | Bonett-Seier | bonett.test(x) |
| BM | Bontemps-Meddahi | statcompute(stat.index=14, data=x) |
| SK | Bai-Ng | skewness.norm.test(x) |
| KU | Bai-Ng | kurtosis.norm.test(x) |
| Empirical distribution-based normality tests (group 2) | | |
| LL | Kolmogorov-Smirnov | lillie.test(x) |
| CS | Snedecor-Cochran | chisq.test(x) |
| G | Chen-Ye | G.test(x) |
| AD | Anderson-Darling | ad.test(x) |
| BH | Brys-Hubert-Struyf | statcompute(stat.index=16, data=x) |
| Correlation and regression-based normality tests (group 3) | | |
| SW | Shapiro-Wilk | shapiro.test(x) |
| SF | Shapiro-Francia | sf.test(x) |
| Tests with specific case specifications (group 4) | | |
| DH | Doornik-Hansen | statcompute(stat.index=8, data=x) |
| BHBS | Brys-Hubert-Struyf | statcompute(stat.index=18, data=x) |
The moment-based normality tests rely on the idea that deviations from normality can be detected through two sample moments, the skewness and the kurtosis. Conversely, the empirical distribution-based normality test group employs a common strategy: assessing the agreement between the theoretical and empirical cumulative distribution functions, as outlined in
The correlation and regression-based normality tests are based on the ratio of two scale estimations derived from order statistics. Specifically, the numerator employs a weighted least squares estimation using a scale obtained from order statistics, while the denominator uses the usual sample variance estimation
The methods employed to generate the data are outlined in subsections 2.2 and 2.3, with each of the methodologies explained in detail. The generated data are then tested for normality using the tests listed in Table 1.
In this paper, we employed the Fleishman power method
\[ Z = a + bX+cX^2+dX^3 \tag{1} \]
where Z is a variable with unknown distribution and parameters (μ = 0, σ² = 1, g1, g2), and X is a normally distributed random variable with mean 0 and variance 1. The procedure calculates the coefficients a, b, c, and d through a polynomial transformation matched to the third and fourth moments, i.e., the skewness (g1) and kurtosis (g2). The skewness and kurtosis values and the corresponding Fleishman coefficients for each level of deviation from normality are shown in Table 2
| Contamination level | Skewness (g1) | Kurtosis (g2) | Fleishman coefficients (a, b, c, d) |
| --- | --- | --- | --- |
| None | 0 | 0 | (0, 1, 0, 0) |
| Low | 0.25 | 0.75 | (-0.037, 0.933, 0.037, 0.021) |
| Moderate | 0.75 | 1 | (-0.119, 0.956, 0.119, 0.009) |
| High | 1.3 | 2 | (-0.249, 0.984, 0.249, -0.016) |
| Severe | 2 | 6 | (-0.314, 0.826, 0.314, 0.023) |
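As a rough illustration of how Equation (1) and the coefficients in Table 2 work together, the sketch below generates a "severe" contamination sample and checks that its sample moments approach the targets g1 = 2 and g2 = 6. The paper's simulations were run in R; the Python/numpy/scipy code, seed, and sample size here are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def fleishman_sample(n, a, b, c, d, rng):
    """Draw X ~ N(0, 1) and return Z = a + b*X + c*X**2 + d*X**3 (Equation 1)."""
    x = rng.standard_normal(n)
    return a + b * x + c * x**2 + d * x**3

rng = np.random.default_rng(1)
# "Severe" row of Table 2: target skewness g1 = 2 and excess kurtosis g2 = 6
z = fleishman_sample(100_000, a=-0.314, b=0.826, c=0.314, d=0.023, rng=rng)

g1_hat = stats.skew(z)      # sample skewness, should be near the target 2
g2_hat = stats.kurtosis(z)  # sample excess kurtosis, should be near the target 6
print(f"skewness = {g1_hat:.2f}, excess kurtosis = {g2_hat:.2f}")
```

Because a = -c and the coefficients satisfy the Fleishman moment equations, the transformed variable keeps mean 0 and variance 1 while acquiring the prescribed skewness and kurtosis.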
To compare the performance and power of normality tests using distribution classification, samples from ten non-normal distributions were classified according to
| Case | g1 (skewness), g2 (kurtosis) | Classification | Alternative distributions |
| --- | --- | --- | --- |
| 1 | g1 = 0, 2.5 ≤ g2 ≤ 4.5 | Symmetric mesokurtic | Weibull (4,5); Logistic (9,3) |
| 2 | g1 = 0, g2 > 4.5 | Symmetric leptokurtic | t (4); t (1); Cauchy (0,0.5) |
| 3 | g1 = 0, g2 < 2.5 | Symmetric platykurtic | Beta (2,2) |
| 4 | g1 ≠ 0, g2 > 4.5 | Asymmetric leptokurtic | Beta (1,6); Gamma (2,9); Gamma (6.5,2.8); Weibull (1,2) |
Monte Carlo methods were employed to assess the performance and power of sixteen normality tests: D’Agostino-Pearson, Jarque-Bera, Robust Jarque-Bera, Bonett-Seier, Bontemps-Meddahi, skewness, kurtosis, Lilliefors, Anderson-Darling, Snedecor-Cochran, Chen-Ye, Brys-Hubert-Struyf, Shapiro-Wilk, Shapiro-Francia, Doornik-Hansen, and the Brys-Hubert-Struyf variant for specific cases (BHBS). A theoretical comparison was deemed impractical, making simulation necessary
Two methods were used to generate samples: Fleishman’s method and distribution classification. Fleishman’s method generates non-normal data by introducing controlled deviations in skewness and kurtosis. This study considers five contamination levels: uncontaminated, low, moderate, high, and severe. The distribution classification method, on the other hand, uses different probability distributions, categorizing them into symmetric distributions (mesokurtic, leptokurtic, and platykurtic symmetry) and asymmetric distributions (leptokurtic asymmetry).
To compare the power of the normality tests, samples of sizes n=10, 20, 30, 50, 100, 200 and 500 were generated using both Fleishman’s method and the distribution classification method. The groups corresponded to those proposed by
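The power-estimation loop described above can be sketched as follows. This is an illustrative Python version (the study itself uses R), with the Shapiro-Wilk test, the t(4) alternative from Table 3, and a reduced number of replicates chosen arbitrarily for the example:

```python
import numpy as np
from scipy import stats

def power_estimate(pvalue_fn, sampler, n, alpha=0.05, reps=2000, seed=0):
    """Estimate power: the fraction of Monte Carlo replicates rejecting H0 (normality)."""
    rng = np.random.default_rng(seed)
    rejections = sum(pvalue_fn(sampler(n, rng)) < alpha for _ in range(reps))
    return rejections / reps

# p-value of the Shapiro-Wilk test (SW in Table 1); [1] indexes the p-value
sw_pvalue = lambda x: stats.shapiro(x)[1]

# Alternative from Table 3, case 2: Student's t with 4 df (symmetric leptokurtic)
t4_sampler = lambda n, rng: rng.standard_t(df=4, size=n)

powers = {n: power_estimate(sw_pvalue, t4_sampler, n) for n in (10, 30, 100)}
print(powers)  # estimated power should grow with the sample size
```

The same loop is repeated for each test, each alternative distribution, each sample size, and each significance level to build the power curves reported in Section 3.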
This section presents the results of the simulation study, which involves two methods for generating samples: the Fleishman’s method and distribution classification. These methods are described in their respective cases in Section 2 (Tables 2 and 3). In each case, the normality tests listed in Table 1 are evaluated. Furthermore, a discussion of the results of the article in the context of recent work related to normality tests is provided.
In this subsection, a comparative analysis of power was conducted according to Fleishman’s method. The methodology proposed in subsection 2.4 was applied to obtain the statistical power and sensitivity of the normality tests. The results of the study, conducted across a range of contamination levels, are presented below.
Figure 1 illustrates the power of various normality tests in the presence of low contamination levels. It is evident that, at a significance level of 10 %, the tests in group 1, with sample sizes 10≤n≤20, underestimate the significance level. However, as the sample size increases, there is an overestimation of the significance level in the BM test, while the DK, JB, RJB, BS, SK, and KU tests present estimated power values close to the significance level. Additionally, in this scenario, two main groups can be distinguished, considering the overestimation and underestimation of the significance level.
On the other hand, the CS and G tests exhibit considerable variability compared to the other tests, as they underestimate the significance level. This indicates that the null hypothesis (H0), which states that the sample comes from a population with a normal distribution, is rejected less often than it should be in these scenarios. Figure 1 shows that, when the significance level is set at 10 %, tests based on the empirical distribution exhibit distinct behaviors. For instance, the BH test underestimates the significance level compared to the other tests, while the LL and AD tests demonstrate a good fit for all sample sizes.
Similarly, groups 3 and 4 show a good fit for all sample sizes, exhibiting an overestimation of the significance levels at 1 % and 5 %, respectively. Moreover, it is notable that the estimated power displays a trend as the significance level increases for all analyzed sample sizes.
Figure 2 illustrates the power for the moderate contamination level. It can be observed that, when using a significance level of 1 %, the tests in group 1, with a sample size of n = 10, underestimate the significance level, with a value of 0.00227 for the BM test. However, as the sample size increases, this test overestimates the significance level. Conversely, the DK, JB, RJB, BS, SK, and KU tests demonstrate a satisfactory fit for all sample sizes, as their respective estimated significance levels approach 1 %. It is evident that the DK and SK tests are the most powerful in this group, followed by the BM test, and then the RJB, JB, KU, and BS tests, respectively.
In group 2, the BH test exhibits a degree of underestimation, albeit to a lesser extent than the CS and G tests, for sample sizes 10 ≤ n ≤ 30. In contrast, the LL and AD tests demonstrate a satisfactory fit for all sample sizes. However, as the sample size increases, only three of the five tests in group 2 overestimate the significance level.
For a significance level of 5 %, the tests in group 3, SW and SF, exhibit a good fit across all sample sizes. This is evidenced by their tendency to overestimate the 5 % value (see Figure 2). A similar behavior is observed with the tests in group 4. Likewise, for a significance level of 10 %, the tests in groups 3 and 4 similarly overestimate across all sample sizes (see Figure 2). It should be noted that the CS and G tests are not depicted in Figures 2, 3, and 4 because the power values reported for the various significance levels and sample sizes approach zero. This occurs because they underestimate the significance level, so H0 is rejected far less often than it should be, even though it is false.
Figure 3 illustrates that the simulated power increases as the sample size and significance level grow, reaching a maximum in some normality tests. For a significance level of 1 %, the tests in group 1 show a good fit for all sample sizes, becoming more sensitive as the degree of contamination increases. This is evidenced by the fact that their respective estimated significance levels approach 1 %, with the DK and SK tests exhibiting higher power. In group 2, it is noteworthy that at a significance level of 10 %, the LL and AD tests maintain a good fit for all sample sizes. However, as the sample size increases for the five tests in group 2, only three overestimate the significance level of 1 %.
For a significance level of 5 %, the tests in group 3, as shown in Figure 3, exhibit a good fit. This is evidenced by the fact that all sample sizes overestimate their 5 % value, reaching maximum power between sample sizes 100≤n≤500. A similar pattern is observed in group 4, where the DH test reaches its maximum power between sample sizes 200≤n≤500, and the BHB-S test reaches its maximum power at a sample size of n=500 with a significance level of 5 %. These tests similarly overestimate the significance level across all sample sizes, making them highly powerful.
Figure 4 illustrates the impact of severe contamination on test power, showing a clear trend where power increases with sample size and significance level. For instance, among the most powerful tests of group 1, DK and SK reach their maximum power from a sample size of n = 50, indicating that these moment-based normality tests have a good fit. In group 2, the most powerful test is AD, reaching its maximum power from a sample size of n = 50. For the tests in group 3, the most powerful test is SW, exhibiting a good fit across all sample sizes, as it overestimates the significance level and reaches its maximum power between sample sizes 100 ≤ n ≤ 500. In contrast, in group 4, the most powerful test is DH, which reaches its maximum power between sample sizes 200 ≤ n ≤ 500. Similarly, the BHBS test reaches its maximum power at a sample size of n = 500.
Notably, Figure 4 provides evidence in favor of using the Shapiro-Wilk test (SW), which exhibits higher power than the other tests analyzed. However, although the test shows relatively high power across all sample sizes, for small samples its power never exceeds a value that could be considered acceptable (a minimum of 0.6 on the (0,1) scale on which power is measured). From the high contamination level onwards, high power is observed only for sample sizes n ≥ 30. It is noteworthy that most of the tests analyzed become highly powerful under severe contamination. In this regard, it can be concluded that normality tests are only truly effective when the departure from the theoretical distribution is substantial.
This section presents a comparison of the power according to the classification of distributions (symmetric mesokurtic distributions, symmetric leptokurtic distributions, symmetric platykurtic distributions, and asymmetric leptokurtic distributions) using the simulation methodology outlined in Section 2. The statistical power of the normality tests is then obtained.
In this initial classification, the distributions utilized exhibit characteristics consistent with a normal distribution, as evidenced by skewness and kurtosis values close to those of a normal distribution. The Logistic (9,3) and Weibull (4,5) distributions were employed, as illustrated in Figures 5 and 6. Figure 5 shows that the BH normality test is the most powerful, while the G test is the least sensitive and powerful for the Logistic (9,3) distribution. However, for sample sizes n ≥ 30, the lowest reported power is observed with the SF test. The BH test performs optimally for all sample sizes and significance levels. Conversely, the tests with the lowest performance at a significance level of 1 % are G, CS, DK, and SF. At a significance level of 5 %, the tests with the lowest performance are G and CS. Finally, at a significance level of 10 %, the G, CS, and SF tests show the lowest performance
On the other hand, it is noteworthy that the G and CS tests exhibit good performance for all sample sizes in the case of the Weibull (4,5) distribution (see Figure 6). Conversely, at a significance level of 10 % and larger sample sizes (n≥30), the SF normality test shows low power compared to the others. Similarly, the JB and BHBS tests show low performance, particularly at a significance level of 5 %.
The results obtained from the two symmetric mesokurtic distributions (see Figures 5 and 6) indicate that, for sample sizes n ≤ 100, both the CS and G tests exhibit low power and sensitivity with the Logistic (9,3) distribution; in most cases, they fail to reject the hypothesis that the data originate from a normal distribution. In contrast, when using the Weibull (4,5) distribution, the opposite is true. It is important to note that, at a significance level of 1 %, the LL test performs well for all sample sizes and is the second most powerful test when the data originate from the Logistic (9,3) distribution. Furthermore, the remaining tests exhibited a notable decline in sensitivity and an increase in power as the sample size increased.
In this classification, the test that demonstrates the most robust performance in terms of power is the BH test with the Cauchy (0,0.5) and 𝑡 (1) distributions, followed by the G test with the 𝑡 (4) distribution. Conversely, the test with the lowest power for all sample sizes is the RJB test, followed by the CS test for sample sizes smaller than thirty (see Figures 7, 8, and 9).
As illustrated in Figure 7, it can be observed that as the significance level increases, the normality tests lose power. This indicates that the BH test is the most powerful and least sensitive compared to the others at significance levels of 1 % and 5 %. However, at a 10 % significance level, the SF test outperforms the BH test, maintaining its power above 0.883. In contrast, the RJB test is the least powerful for the Cauchy (0,0.5) distribution. For sample sizes of at least 50 (n ≥ 50), the tests with the worst power performance at a 1 % significance level are RJB, SF, AD, SW, BS, DH, BHBS, KU, JB, LL, and BM.
In the case of the 𝑡 (4) distribution (Figure 8), the excellent performance of the G, BH, and CS tests at all sample sizes is particularly noteworthy. Additionally, it is observed that the normality tests G, BH, and CS exhibit high power for the three significance levels and all sample sizes, in contrast to the rest of the tests. Conversely, the RJB and KU tests demonstrate the opposite pattern for significance levels of 1 % and 5 %, while for a significance level of 10 %, the JB and BHBS tests exhibit low power and higher sensitivity for sample sizes of (n≥100).
Figure 9 illustrates that the BH test maintains its high power, followed by the SK and DK tests, for significance levels of 1 % and 5 %. At a significance level of 10 %, however, the SF and BM tests exhibit the greatest power under this distribution, while the opposite is true for the RJB, CS, and DH tests (less powerful). Finally, the results of the power and sensitivity normality tests for a leptokurtic symmetric distribution (Figures 8 and 9) indicate that the distribution with the greatest power is the 𝑡 (4), with the normality tests G, BH, and CS performing best at all significance levels.
In the case of a symmetric platykurtic distribution, the Beta (2,2) distribution was considered. It can be observed that half of the normality tests exhibit low statistical power for small sample sizes (10 ≤ n ≤ 20) at a significance level of 10 %. This is illustrated in Figure 10. It can also be seen that all normality tests perform well for sample sizes n ≥ 200, with the exception of the SF test with α = 10 %.
Figure 10 illustrates that the CS, G, and DK tests exhibit high power, followed by the SK and BH tests at significance levels of 1 % and 5 %. At a significance level of 10 %, the SK and DK tests are among the four most powerful tests, while the BHBS test is the least powerful.
In this classification, the analyzed distributions exhibit a degree of asymmetry with long tails. It can be observed that, in all the tests analyzed, the statistical power and sensitivity increase as the sample size increases. That is, in most cases, the hypothesis that the data come from a normal distribution is correctly rejected, since it is false. The tests with the greatest statistical power are CS and G in the cases of Beta (1,6), Gamma (2,9), and Gamma (6.5, 2.8), while for Weibull (1,2) the most effective test is BS (see Figures 11-14).
Figure 11 illustrates that the power of tests CS, G, BS, and KU is high, followed by tests BH, BHBS, and JB for all significance levels. The latter stand out in their high power among the seven tests under this distribution. Conversely, the SW test becomes the least powerful for samples with a minimum of 30 observations (n≥30).
The CS, G, and DK tests exhibit high power, followed by the BH, BHBS, and JB tests for significance levels of 1 % and 5 % (see Figure 12). However, at the 10 % level, it is the BH and KU tests that are the most powerful among the seven tests. Conversely, the tests in group 3, based on correlation and regression, become the least powerful under this distribution.
Figure 13 illustrates that the CS, G, and BS tests maintain their high power for samples with at least 20 observations (n ≥ 20), followed by the BH, BHBS, and JB tests for significance levels of 1 % and 5 %. However, at the 10 % level, the power varies, with the BH test remaining the most powerful, followed by the KU and JB tests, respectively. Therefore, it can be concluded that these tests are the most powerful under the specified distribution. Conversely, the SF test exhibits the lowest power at a significance level of 10 %.
The BS, BH, JB, and BHBS tests exhibit high power, followed by the KU, G, and BM tests for all significance levels (see Figure 14). Conversely, lower power is observed in the SF test with 𝛼 = 1 % and samples (n≤50), the SK test with 𝛼 = 5 % and samples (n≤50), and the SW test with 𝛼 = 10 % and samples (n≤200).
In summary, the comparative analysis of normality tests with regard to power and sensitivity for leptokurtic asymmetric distributions indicates that the distribution with the highest power is the Gamma (6.5, 2.8) under the CS, G, and BS normality tests for all significance levels and sample sizes.
This subsection presents several normality tests used to determine whether the generated data follow a normal distribution. The null hypothesis (H0) states that the data originate from a normal distribution, while the alternative hypothesis (H1) suggests that the data deviate from normality. To generate controlled non-normal data, Fleishman’s method is employed, allowing for systematic manipulation of skewness and kurtosis. A variety of tests—each based on distinct statistical principles—have been applied to ensure a robust evaluation. These tests include:
A test rejects H0 when the p-value falls below a predefined significance level (e.g., α = 0.05), indicating that the data likely exhibit non-normal behavior. Below, we describe the key normality tests and their implementation in R.
The D’Agostino-Pearson test evaluates normality by transforming skewness and kurtosis into a joint test statistic
The Kolmogorov-Smirnov test compares the empirical cumulative distribution function of a sample with the theoretical normal distribution
The Shapiro-Wilk test is widely used for assessing normality, particularly in small sample sizes
The Jarque-Bera test evaluates normality based on skewness and kurtosis
The Bonett-Seier test provides a robust method for assessing normality based on variance estimation
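As an illustration of the rejection rule described above, the sketch below applies a battery of these tests to a clearly non-normal sample. The paper implements the tests in R; the Python/scipy calls here are illustrative stand-ins, and Bonett-Seier is omitted because it has no direct scipy equivalent:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# A clearly non-normal sample: exponential data are skewed and leptokurtic
x = rng.exponential(scale=1.0, size=200)

# p-value of each test; [1] indexes the p-value in every scipy result object
battery = {
    "D'Agostino-Pearson": lambda v: stats.normaltest(v)[1],
    "Kolmogorov-Smirnov": lambda v: stats.kstest(v, "norm", args=(v.mean(), v.std(ddof=1)))[1],
    "Shapiro-Wilk":       lambda v: stats.shapiro(v)[1],
    "Jarque-Bera":        lambda v: stats.jarque_bera(v)[1],
}

alpha = 0.05
for name, pvalue_fn in battery.items():
    p = pvalue_fn(x)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```

Running several tests side by side, as done here, is exactly how the study cross-checks the distinct statistical principles each test relies on.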
The results from the applied normality tests reveal that the generated data deviate from a normal distribution. Specifically, low p-values (p < 0.05) obtained consistently across tests—namely, the D'Agostino-Pearson, Kolmogorov-Smirnov, Shapiro-Wilk, Jarque-Bera, and Bonett-Seier tests—indicate that the Fleishman-transformed data exhibit controlled non-normal characteristics such as skewness and kurtosis. These findings underscore the importance of using multiple normality tests, as each offers a distinct perspective on data distribution, thereby ensuring a comprehensive evaluation. Researchers are encouraged to select the appropriate test based on sample size and distribution shape to enhance the reliability of statistical inferences. All analyses were conducted using the R statistical software
The statistical power and sensitivity of normality tests were evaluated using two methodologies for data generation: Fleishman’s method and the distribution classification method, each under multiple contamination scenarios. These methodologies allowed for a comprehensive comparison across symmetric and asymmetric distributions, offering insights into the behavior of various normality tests.
In low contamination scenarios (Fleishman’s method: skewness g_1=0.25 and kurtosis g_2=0.75), tests such as DK, JB, RJB, BS, SK, and KU demonstrated a good fit for all sample sizes at a 10 % significance level. This finding is consistent with those of previous studies, particularly those of
In moderate contamination scenarios (skewness g_1=0.75, and kurtosis g_2=1), RJB (Robust Jarque-Bera) demonstrated the highest power at a 1 % significance level, maintaining robustness across sample sizes. This result aligns with findings by
In high contamination scenarios (skewness g_1=1.3, kurtosis g_2=2), moment-based tests such as DK and SK began to lose sensitivity, particularly at a 10 % significance level, where tests like CS and G began to underestimate the significance level. This underestimation was exacerbated in small samples, particularly for the BM test. Under symmetric leptokurtic distributions, group 2 tests (G, BH, CS) remained the most powerful across all significance levels. This underperformance is especially critical under asymmetric distributions like Gamma (2,9) or Beta (1,6), as reported in
In scenarios involving symmetric platykurtic distributions like Beta (2,2), the tests with the highest power were DK, SK, and BM at both 1 % and 5 % significance levels, as illustrated by their consistency across all sample sizes (n≥200)
Finally, in this article, the problem discussed by
This study demonstrates that normality tests exhibit significant variation in power and sensitivity depending on the distribution type and contamination level. By employing Fleishman’s method and distribution classification, we established a robust framework for evaluating their performance under diverse conditions. In low contamination scenarios, tests such as DK, JB, RJB, BS, SK, and KU are highly effective at a 10 % significance level for all sample sizes, though their performance diminishes under stricter significance levels. Conversely, SW and SF tests consistently perform well across all scenarios, maintaining a high degree of power even at lower significance levels.
Moderate contamination scenarios highlight the robustness of the RJB test at a 1 % significance level, but its power reduces as the sample size increases and the significance level reaches 5 % or 10 %. Additionally, empirical distribution-based tests, such as CS and G, demonstrate reduced sensitivity under moderate contamination, reflecting their conservative nature in rejecting the null hypothesis. High contamination scenarios show that moment-based tests (e.g., DK and SK) struggle with sensitivity as contamination increases, particularly at higher significance levels. This is particularly pronounced in small samples where tests such as BM tend to underestimate significance, rendering them less reliable in these cases.
For symmetric distributions, particularly leptokurtic and platykurtic types, tests such as CS, G, DK, SK, and BM remain among the most powerful. This trend is consistent across distributions like Beta (2,2) and Gamma (6.5, 2.8), emphasizing the importance of sample size and distribution shape in determining test efficacy. For larger sample sizes (n≥500), the most powerful tests across asymmetric distributions, particularly Beta (1,6) and Weibull (1,2), are the CS, G, BS, and BH tests, which maintain strong performance even as significance levels rise. This suggests that these tests are particularly well-suited for detecting non-normality in heavily skewed distributions.
Future studies should further explore how sample size influences normality test performance, particularly in small-sample scenarios with high contamination levels. Investigating the use of alternative methods for calculating p-values, especially under varying degrees of skewness and kurtosis, could further refine the understanding of normality test performance.
The authors have no conflicts of interest pertaining to the findings presented in this article.
Cristian David Correa-Álvarez: Conceptualization, Research, Organization, and Writing of the manuscript.
Jessica María Rojas-Mora: Design and development of the research, as well as the review of the manuscript.
Antonio Elías Zumaqué-Ballesteros: Simulation aspects and the development of the manuscript.
Osnamir Elias Bru-Cordero: Organization of the code, Preparation of figures, and Review of the manuscript.