Received: January 10, 2024
Accepted: April 22, 2024
Available: April 29, 2024
The k-sample problem for functional data has been widely studied from theoretical and applied perspectives. In the literature, Gaussianity of the generating process is generally assumed, which may be impractical in some situations. This work proposes an extension of the Kruskal-Wallis test to functional data as an alternative when Gaussianity does not hold. The methodology consists of transforming each group's functional data into scalars using random projections and then performing classical Kruskal-Wallis tests. The main results are the extension of the Kruskal-Wallis test to functional data and the verification of its unbiasedness and consistency. Reducing dimensionality through random projections allows us to extend the classical Kruskal-Wallis test to the functional context and to address non-Gaussianity and atypical observations.
Keywords: Functional data, Random projections, Kruskal-Wallis test, Non-parametric statistics, Brownian motion
Advances in computational and analytical techniques allow for continuous monitoring of many processes. New statistical methods are needed to analyze large data sets arising from these processes. Functional data analysis (FDA) has emerged in recent decades as an alternative to statistical modeling of large data volumes. FDA is a framework for analyzing data consisting of random functions (usually curves) rather than observations of a few variables or random vectors [
To construct a functional observation 𝑋𝑖𝑗 (𝑡) from the discretely observed data one can employ a standard smoothing technique such as cubic B-splines [
This work proposes a methodology for comparing groups when the same functional variable has been observed on several individuals in each group. Specifically, a traditional nonparametric tool for the k-sample problem is adapted to the FDA setting with a functional response. Let 𝑋𝑖1(𝑡), 𝑋𝑖2(𝑡), ⋯, 𝑋𝑖𝑛(𝑡), 𝑖 = 1, 2, ⋯, K, be K random sets of functions defined over an interval T = [a,b], which come from Gaussian processes 𝐺𝑃(𝜇𝑖(𝑡), 𝛾𝑖(𝑠,𝑡)) [
$$H_0: \mu_1(t) = \mu_2(t) = \cdots = \mu_K(t), \quad t \in T, \tag{1}$$
against the alternative that at least two functional means differ. The hypothesis in (1) has been widely considered in the statistical literature. Proposed approaches include point-wise t-tests, functional ANOVA, functional principal component analysis, and permutation tests.
Some authors have extensively studied the functional ANOVA problem. For example, [
Other approaches were considered by [
This work is organized as follows. Sections 2.1 and 2.2 review the Kruskal-Wallis test and random projections. Section 3 presents an extension of the Kruskal-Wallis test for functional data and shows its respective pseudocode. In Section 4.1, we present the simulation study and in Section 4.2, we present the application with real data. Finally, we present the discussion and some conclusions.
2.1 Kruskal-Wallis test
This section briefly reviews the main statistical technique used in the analysis. Kruskal-Wallis [
(2)
which establishes that there are no significant differences among the effects of the treatments. The null hypothesis states that the distributions are equal, 𝐹1 = 𝐹2 = ⋯ = 𝐹𝑘. To calculate the Kruskal-Wallis statistic, all N = n1 + ⋯ + nk observations from the k samples are combined and ordered from smallest to largest. Let rij be the rank of Xij in this joint ranking, and let Rj and R.j be defined as
$$R_j = \sum_{i=1}^{n_j} r_{ij}, \qquad R_{\cdot j} = \frac{R_j}{n_j}, \qquad j = 1, \ldots, k. \tag{3}$$
Thus, for example, R1 is the sum of the ranks received by the observations of group 1 and R.1 is the average rank for these same observations. The Kruskal-Wallis H statistic is given by [
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} n_j \left( R_{\cdot j} - \frac{N+1}{2} \right)^2 = \left( \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} \right) - 3(N+1). \tag{4}$$
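At a significance level of α, H0 is rejected if H ≥ hα; otherwise, it is not rejected. The values of hα are given in Table A.12 of [
Under the large-sample approximation, reject H0 if 𝐻 ≥ 𝜒²_{𝑘−1,𝛼}, the upper α quantile of the chi-square distribution with k − 1 degrees of freedom; otherwise, do not reject.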
When the null hypothesis is rejected and it is concluded that at least one sample comes from a population with a different median, some post-hoc tests (e.g., Dunn's test) can be used to identify which samples differ significantly.
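As a small numerical illustration of the H statistic in (4) and its chi-square approximation, the following Python sketch computes H from the rank sums and compares it with `scipy.stats.kruskal`. The three samples are made-up values with no ties, and the helper name `kruskal_wallis_h` is hypothetical; the analysis in this paper was carried out in R.

```python
import numpy as np
from scipy import stats


def kruskal_wallis_h(groups):
    """Compute the H statistic from rank sums, assuming there are no ties."""
    sizes = np.array([len(g) for g in groups])
    n_total = sizes.sum()
    # Rank all observations jointly, then split the ranks back by group.
    ranks = stats.rankdata(np.concatenate(groups))
    rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
    rank_sums = np.array([r.sum() for r in rank_groups])
    scale = 12.0 / (n_total * (n_total + 1))
    return scale * np.sum(rank_sums**2 / sizes) - 3.0 * (n_total + 1)


# Made-up scalar samples for three groups (illustrative only).
groups = [
    np.array([1.2, 3.4, 2.2, 5.1]),
    np.array([2.9, 4.8, 6.3, 7.0]),
    np.array([8.1, 9.4, 7.7, 6.9]),
]

h_manual = kruskal_wallis_h(groups)
# Large-sample p-value from the chi-square distribution with k - 1 df.
p_manual = stats.chi2.sf(h_manual, df=len(groups) - 1)

# Reference implementation (it also applies a tie correction when needed).
h_scipy, p_scipy = stats.kruskal(*groups)
print(h_manual, p_manual, h_scipy, p_scipy)
```

Here the rank sums are 13, 24, and 41, so the hand computation and the library call should agree because the data contain no ties.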
2.2 Random Projections
The hypothesis of interest (see hypothesis in (1)) can be tested using the projections of the functions. These involve mapping high-dimensional data points into a lower-dimensional space using a randomly generated projection matrix [
Random projections are often used in situations where the dimensionality of the data makes it difficult to work with or analyze. In other words, random projections can be a handy tool for reducing the complexity of the data without losing important information. Given a set of data or a distribution in spaces of dimension greater than one, random projections consist of projecting the data or calculating the marginal of the distribution in a lower-dimensional subspace that has been chosen randomly [
Once the functional data have been projected onto a lower-dimensional space, a hypothesis test can be performed to determine whether the functional means are equal. The choice of test depends on the specific application, but a common approach is to use a t-test or an ANOVA test. One advantage of using random projections to test the equality of functional means is computational efficiency, particularly when dealing with high-dimensional functional data. The approach can also be robust to noise and outliers, as random projections can help filter out some of the noise.
This research presents an extension of the Kruskal-Wallis test for functional data based on random projections.
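To make the idea of a one-dimensional random projection concrete, the sketch below projects discretized curves onto a simulated Brownian motion. It assumes that the projection is the L2 inner product of the curve with the Brownian path, approximated by the trapezoid rule on a common grid; the function names `brownian_motion` and `project` are hypothetical and not taken from the authors' code.

```python
import numpy as np


def brownian_motion(t, rng):
    """Simulate a standard Brownian motion on the grid t, starting at 0."""
    increments = rng.normal(0.0, np.sqrt(np.diff(t)))
    return np.concatenate(([0.0], np.cumsum(increments)))


def project(curves, direction, t):
    """Approximate the integral of x(t) * v(t) dt for each row of `curves`."""
    integrand = curves * direction
    # Trapezoid rule written out explicitly on the grid t.
    return np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * np.diff(t), axis=1)


rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)

# Two illustrative curves evaluated on the grid (one curve per row).
curves = np.vstack([np.sin(2.0 * np.pi * t), np.cos(2.0 * np.pi * t)])

brownian_path = brownian_motion(t, rng)
scores = project(curves, brownian_path, t)  # one scalar per curve
print(scores)
```

Each curve is reduced to a single scalar, so any univariate test can then be applied to the projected values.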
We propose extending the Kruskal-Wallis test to the case of functional data, where the observation for each individual in the sample is a functional datum. As in the univariate case, statistical tests in functional data analysis rely on certain assumptions. When the samples are small and the curves are not generated by a Gaussian stochastic process, functional ANOVA could be inappropriate, and a non-parametric method may be a valid alternative. Specifically, a Kruskal-Wallis test for functional data based on random projections (KWFD) is proposed as an alternative to one-way functional ANOVA when the Gaussianity assumption is unrealistic. The KWFD is a non-parametric alternative for comparing the medians of functional data from three or more groups. We extend the KW test by randomly projecting the functional data onto a low-dimensional subspace.
Let Xij(t), i = 1, 2, ⋯, nj, j = 1, ⋯, k, be a functional random sample of curves, where t ∈ [a, b] is the domain (generally time), i indexes the individual, and j indexes the factor level. The functional random variables are considered independent trajectories of the stochastic processes SP(μj(t), γ(s,t)), j = 1, ⋯, k, with a common covariance function γ(s, t). Let xij(t), i = 1, 2, ⋯, nj; j = 1, ⋯, k, be the recorded set of curves under the k treatments. In the following, we describe the procedure for calculating the H statistic to test the null hypothesis in (1).
The Kruskal-Wallis test for functional data based on random projections is calculated similarly to the univariate Kruskal-Wallis test. It is based on the sum of the ranks of the projected curves within each group. The test assumes no specific distribution for the functional data and can be robust to atypical curves.
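The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: curves are observed on a common grid, the random direction is a simulated Brownian motion, the projection is an inner product approximated by the trapezoid rule, and the function name `kwfd_test` and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def kwfd_test(groups, t, rng):
    """Project each group's curves on one Brownian path, then run Kruskal-Wallis.

    groups: list of arrays of shape (n_j, len(t)), one row per curve.
    Returns the H statistic and its p-value.
    """
    dt = np.diff(t)
    # Random direction: Brownian motion on the grid, started at 0.
    direction = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt)))))
    scores = []
    for curves in groups:
        integrand = curves * direction
        # Trapezoid approximation of the inner product, one scalar per curve.
        scores.append(np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * dt, axis=1))
    result = stats.kruskal(*scores)
    return result.statistic, result.pvalue


rng = np.random.default_rng(2024)
t = np.linspace(0.0, 10.0, 201)
mean_curve = np.sin(2.0 * np.pi * t)

# Toy data (assumption): independent uniform noise at each grid point,
# with the third group shifted upwards by a constant.
groups = [
    mean_curve + rng.uniform(-1.0, 1.0, size=(30, t.size)),
    mean_curve + rng.uniform(-1.0, 1.0, size=(30, t.size)),
    mean_curve + 1.2 + rng.uniform(-1.0, 1.0, size=(30, t.size)),
]
h_stat, p_value = kwfd_test(groups, t, rng)
print(h_stat, p_value)
```

Because the test acts only on ranks of the projected values, a single atypical curve changes at most one rank in its group, which is the source of the robustness mentioned above.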
Section 4.1 presents a simulation study based on a single Brownian motion simulation. Section 4.2 shows the p-values obtained by generating 1000 random projections.
4.1 Simulation study indicators
We assess the power of the test to detect differences between medians of k-samples of functional data. To establish the performance, we show the results of a simulation study. We follow the procedure given in [15] to perform the analysis. For simplicity, just three groups of curves are considered.
In model (5), 𝜇(𝑡) = sin(2𝜋𝑡), 𝑡 ∈ (0,10), is the mean function, and the errors 𝜀𝑖𝑗(𝑡), 𝑗 = 1, 2, 3, follow a uniform distribution on [−1,1]. As an initial illustration, a Brownian motion and 120 curves simulated according to the equations in (5) are shown in Figure 1. The curves in red and green are very similar, since they come from analogous models (rows 1 and 2 of the equations in (5)), whereas the curves in blue involve an additional parameter 𝛿(𝑡) = 𝛿 = 1.2 that makes them different from the previous ones. Notice in Figure 1 that the highest periodic peaks of the blue curves are close to 3, while those of the red and green curves are close to 2, so the null hypothesis should be rejected. With data such as those presented in Figure 1, performing a hypothesis test on the functional means under the assumption that the processes are Gaussian would be inappropriate.
To evaluate the power of the test, we considered 𝛿(𝑡) = 𝛿, for all 𝑡 ∈ [0,10], with 𝛿 = 0.0, ⋯, 0.7. Four sample sizes are considered for each group (𝑛 = 10, 30, 80, 120). In each case, 1000 realizations are generated, and for each realization we perform the Kruskal-Wallis test defined in Section 3. The power of the test is estimated as the percentage of p-values less than 0.05. We used the R libraries fda.usc and stats to perform the analysis. The R code used is available at: https://github.com/frajaroco/KWfdRP/blob/main/KWtest.R
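The power computation described above can be sketched in Python as follows. This is only an illustration and not a port of the linked R script: it assumes that 𝛿 enters as an additive shift in the third group, that the uniform errors are independent across grid points, and that a single Brownian motion is reused for all realizations. The function name `simulate_power` is hypothetical and the number of replicates is reduced to keep the example fast.

```python
import numpy as np
from scipy import stats


def simulate_power(delta, n, t, direction, n_rep, rng, alpha=0.05):
    """Estimate the rejection rate of the projected Kruskal-Wallis test."""
    mean_curve = np.sin(2.0 * np.pi * t)
    shifts = (0.0, 0.0, delta)  # assumption: additive shift in group 3 only
    dt = np.diff(t)
    rejections = 0
    for _ in range(n_rep):
        scores = []
        for shift in shifts:
            noise = rng.uniform(-1.0, 1.0, size=(n, t.size))
            integrand = (mean_curve + shift + noise) * direction
            # Trapezoid approximation of the projection of each curve.
            scores.append(np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * dt, axis=1))
        if stats.kruskal(*scores).pvalue < alpha:
            rejections += 1
    return rejections / n_rep


rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 101)
# One Brownian path used as the projection direction for every realization.
direction = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(np.diff(t))))))

power_null = simulate_power(0.0, 10, t, direction, n_rep=200, rng=rng)
power_alt = simulate_power(0.7, 10, t, direction, n_rep=200, rng=rng)
print(power_null, power_alt)
```

With 𝛿 = 0 the estimated rejection rate should stay near the nominal level, while for 𝛿 > 0 it depends on the sample size and on the particular Brownian path drawn.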
4.2 Real data analysis: Temperature curves in Canada
We apply the Kruskal-Wallis test for functional data from Section 3 to a meteorological data set widely used in FDA [2
The daily temperature data for the four climatic zones were smoothed using a Fourier basis. The curves obtained after smoothing are shown in Figure 3. The interest is in determining whether there are significant differences between the mean (median) curves of these zones. For this purpose, we apply the Kruskal-Wallis test presented in Section 3. We generate random projections using (6), with 𝑖 the index of the weather station within each of the four climatic zones, 𝑗 = 1 (Arctic), 2 (Pacific), 3 (Continental), 4 (Atlantic), and 𝜈(𝑡) a Brownian motion. The number of stations in each zone is 4 (Arctic), 7 (Pacific), 9 (Continental), and 15 (Atlantic).
2. See Canada's Climate Regions at the link: https://sites.google.com/a/ocsb.ca/cgc-1d/a-unit-4-climate/1-canadas-climate-regions
(6)
After obtaining the random projections, we conduct a classical Kruskal-Wallis test on these values. In this case, a p-value of 0.00361 was obtained and, consistent with the description of Canada's climate regions above, the null hypothesis is rejected. Note that there are some atypical curves in each panel of Figure 3, so a classical ANOVA test based on random projections can be limited in this case, and a robust methodology such as the one proposed here could be more appropriate. Wilcoxon’s post-hoc tests [
| | Atlantic | Continental | Pacific |
| Continental | 0.09 | -- | -- |
| Pacific | 0.95 | 0.25 | -- |
| Arctic | 0.01 | 0.20 | 0.07 |
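Pairwise comparisons of this kind can be sketched in Python with the Wilcoxon rank-sum (Mann-Whitney) test applied to the projected scalars. The scores below are synthetic stand-ins with the group sizes reported above, not the Canadian data, and no multiplicity adjustment is applied because the text does not state which one, if any, was used. The helper name `pairwise_wilcoxon` is hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_wilcoxon(scores):
    """Two-sided rank-sum test for every pair of groups.

    scores: dict mapping a group name to a 1-D array of projected values.
    Returns a dict {(group_a, group_b): p-value}.
    """
    p_values = {}
    for name_a, name_b in combinations(scores, 2):
        result = stats.mannwhitneyu(scores[name_a], scores[name_b], alternative="two-sided")
        p_values[(name_a, name_b)] = result.pvalue
    return p_values


rng = np.random.default_rng(7)
# Synthetic projected values (illustrative only), one array per climatic zone.
scores = {
    "Atlantic": rng.normal(0.0, 1.0, size=15),
    "Continental": rng.normal(0.5, 1.0, size=9),
    "Pacific": rng.normal(0.0, 1.0, size=7),
    "Arctic": rng.normal(2.0, 1.0, size=4),
}
p_values = pairwise_wilcoxon(scores)
for pair, p in p_values.items():
    print(pair, round(p, 3))
```

With four groups there are six pairwise comparisons, matching the lower-triangular layout of the table.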
The results described above are based on random projections from a single Brownian motion (BM). The R code available at https://github.com/frajaroco/KWfdRP/blob/main/KWCanadianWeather.R shows the values found with 1000 Brownian motions, and the general conclusion is the same.
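Repeating the test over many Brownian motions can be sketched in Python as below. The example uses synthetic curves with the group sizes of the four zones rather than the temperature data, and it simply summarizes the resulting p-values; how the linked R script aggregates them is not described in the text, so the summary used here is an assumption, as is the function name `kwfd_pvalues`.

```python
import numpy as np
from scipy import stats


def kwfd_pvalues(groups, t, n_proj, rng):
    """Kruskal-Wallis p-values over n_proj independent Brownian directions."""
    dt = np.diff(t)
    p_values = np.empty(n_proj)
    for b in range(n_proj):
        # New Brownian path for each projection.
        direction = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt)))))
        scores = []
        for curves in groups:
            integrand = curves * direction
            scores.append(np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * dt, axis=1))
        p_values[b] = stats.kruskal(*scores).pvalue
    return p_values


rng = np.random.default_rng(11)
t = np.linspace(0.0, 1.0, 101)
base = -np.cos(2.0 * np.pi * t)  # crude annual cycle (illustrative only)
sizes_and_shifts = [(4, -2.0), (7, 0.5), (9, 0.0), (15, 0.3)]
groups = [
    base + shift + rng.normal(0.0, 0.5, size=(n, t.size))
    for n, shift in sizes_and_shifts
]

p_values = kwfd_pvalues(groups, t, n_proj=100, rng=rng)
print("share of projections with p < 0.05:", np.mean(p_values < 0.05))
print("median p-value:", np.median(p_values))
```

Looking at the whole set of p-values shows how sensitive the conclusion is to the particular Brownian path chosen.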
4.3 Discussion
ANOVA for functional data has been widely discussed, and several approaches have been considered [
We propose a non-parametric method for the k-sample problem with functional data, which is useful when the sample size is small, the assumption of normality is not reasonable, or there are atypical curves. We propose the use of one-dimensional random projections to solve the problem. After obtaining scalars from functions using random projections, a classical Kruskal-Wallis test can be used to test the hypothesis. The results obtained from the simulated and real data show a good performance of the methodology. The results (Figure 2) illustrate that the Kruskal-Wallis test extension performs well under the null hypothesis, and that power increases with the sample size and with the distance parameter 𝛿. This plot supports the claim that the proposed test is unbiased and consistent. Some authors consider using point-wise test statistics for functional data problems with two samples, and similarly for the k-sample problem, although these are not global tests. Our approach is a helpful alternative when the sample is small and the Gaussian assumption is inappropriate.
The authors thank the Editor and reviewers for their constructive comments, which improved the article's presentation. Francisco J. Rodríguez-Cortés and Ramón Giraldo have been partially supported by Universidad Nacional de Colombia, HERMES projects, Grant/Award Number: 612113.
The authors declare no conflict of interest.
Rafael Meléndez Surmay, Ramón Giraldo Henao, and Francisco Rodríguez Cortes performed data processing, formal analysis, investigation, methodology, and original draft writing.