
[FDA] 3. Penalized smoothing

2024. 3. 15.

Penalized smoothing

In functional data analysis, how should we choose the number of basis functions? And is there a risk of overfitting?

Intuitively, if we use more basis functions in a basis expansion, the fit follows the overall trend of the function more closely than when we use fewer basis functions.

 

However, a very flexible expansion will also chase the noise, i.e., overfit. Strategies have therefore been developed to control the flexibility of the expansion carefully, and penalized smoothing is the method we can use in this situation.

 

Fitting a basis expansion means finding the coefficient values $\mathbf{c}$. This is done by least squares:

$$ S(\mathbf{c}_n) = \sum_j(Y_{jn} - \sum_k c_{nk} B_k (t_{nj}))^2 $$

Here $n$ indexes the curves and $j$ indexes the time points, so the criterion measures the squared differences between the observed data $Y$ and the basis expansion at each time point.
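As a concrete illustration, here is a minimal Python sketch of this least-squares basis expansion. Everything in it is made up for illustration: a single noisy curve observed at $J$ points on $[0, 1]$ and a hand-rolled Fourier basis.

```python
import numpy as np

# Toy data (assumption): one noisy curve observed at J time points on [0, 1].
rng = np.random.default_rng(0)
J, K = 100, 7                          # J time points, K basis functions
t = np.linspace(0, 1, J)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.1, size=J)

def fourier_basis(t, K, T=1.0):
    """Fourier basis evaluated at t: columns are 1, sin, cos, sin, cos, ..."""
    B = np.ones((len(t), K))
    for k in range(1, K // 2 + 1):
        if 2 * k - 1 < K:
            B[:, 2 * k - 1] = np.sin(2 * np.pi * k * t / T)
        if 2 * k < K:
            B[:, 2 * k] = np.cos(2 * np.pi * k * t / T)
    return B

B = fourier_basis(t, K)                # B[j, k] = B_k(t_j)

# Ordinary least squares: minimize sum_j (y_j - sum_k c_k B_k(t_j))^2.
c_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_fit = B @ c_hat
```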

 

The penalized method adds a penalty term to this criterion:

$$S_\lambda (\mathbf{c}_n) = \sum_j (Y_{jn} - \sum_k c_{nk} B_k(t_{nj}))^2 + \lambda \int _0^1 [(L\tilde{X}_n)(t)]^2 \ dt \ \ \ \cdots \ \ \ (1)$$

Here $\tilde{X}_n(t) = \sum_k c_{nk} B_k(t)$ denotes the expanded curve. This criterion is closely related to ridge regression: as $\lambda \rightarrow 0$ it reduces to the ordinary least squares fit, while as $\lambda \rightarrow \infty$ the fit is forced into the null space of $L$ (a straight line when $L$ is the second-derivative operator).

 

What is $L$?

$L$ is some specified linear differential operator, i.e., a linear combination of derivatives of the function. A popular choice of $L(x)(t)$ for periodic data with period $T$ is:

$$L(x)(t) = \frac{4\pi^2}{T^2} x^{(1)}(t) + x^{(3)}(t)$$

$x^{(1)}$ and $x^{(3)}$ denote the first and third derivatives, and this operator is called the harmonic acceleration operator.
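To see why this is a natural choice for periodic data with period $T$, a quick check (writing $\omega = 2\pi/T$, so $\omega^2 = 4\pi^2/T^2$) shows it annihilates the basic periodic signals:

$$L(\sin(\omega t)) = \omega^2 \cdot \omega \cos(\omega t) + \big(-\omega^3 \cos(\omega t)\big) = 0,$$

and similarly $L(\cos(\omega t)) = 0$ and $L(c) = 0$ for any constant $c$. So constants and shifted sinusoids of period $T$ lie in the null space of $L$ and are not penalized; only roughness beyond that smooth periodic shape is.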

 

We can write equation (1) in matrix form as follows:

$$S_\lambda(\mathbf{c}_n) = (\mathbf{Y}_n - \mathbf{B}_n \mathbf{c}_n)^\top (\mathbf{Y}_n - \mathbf{B}_n \mathbf{c}_n) + \lambda \mathbf{c}_n^\top \mathbf{W} \mathbf{c}_n$$

where $\mathbf{W}$ is the penalty matrix with entries $\mathbf{W}_{kl} = \int_0^1 (LB_k)(t)\,(LB_l)(t) \ dt$. This criterion has a simple closed-form solution:

$$ \hat{\mathbf{c}}_n = (\mathbf{B}_n^\top \mathbf{B}_n + \lambda \mathbf{W})^{-1} \mathbf{B}_n^\top \mathbf{Y}_n$$

And the fitted values are

$$ \hat{\mathbf{Y}}_n = \mathbf{B}_n \hat{\mathbf{c}}_n$$
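Continuing the Python sketch above (reusing `fourier_basis`, `B`, `y`, and `K`), a rough numerical version of this closed-form solution might look as follows. The penalty matrix $\mathbf{W}$ is approximated on a fine grid with finite-difference derivatives, and the value of $\lambda$ is arbitrary; in practice one would compute $\mathbf{W}$ from exact basis derivatives.

```python
# Penalty matrix W[k, l] ~ integral of (L B_k)(t) (L B_l)(t) dt for the
# harmonic acceleration operator with T = 1, via finite differences.
s = np.linspace(0, 1, 2001)
ds = s[1] - s[0]
Bs = fourier_basis(s, K)
d1 = np.gradient(Bs, ds, axis=0)                            # first derivative
d3 = np.gradient(np.gradient(d1, ds, axis=0), ds, axis=0)   # third derivative
LB = (2 * np.pi) ** 2 * d1 + d3                             # (4*pi^2/T^2) x' + x'''
W = LB.T @ LB * ds                                          # Riemann-sum approximation

# Closed-form penalized solution: c_hat = (B'B + lambda*W)^{-1} B'Y.
lam = 1e-6
c_hat = np.linalg.solve(B.T @ B + lam * W, B.T @ y)
y_hat = B @ c_hat
```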

 

 

How to choose $\lambda$?

Of course, one question still remains: how do we set $\lambda$? It seems to create another optimization problem.

To choose an appropriate $\lambda$, it is helpful to use the effective "degrees of freedom", defined as

$$df = \operatorname{trace}\big(\mathbf{B}_n(\mathbf{B}_n^\top \mathbf{B}_n + \lambda \mathbf{W})^{-1} \mathbf{B}_n^\top\big)$$

 

If $\lambda = 0$, then $df = K$ (the number of basis functions) trivially.
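In code, the degrees of freedom is just the trace of the smoother ("hat") matrix. Continuing the sketch above (reusing `B`, `W`, and `lam`):

```python
# Effective degrees of freedom: trace of the smoother (hat) matrix.
S_lam = B @ np.linalg.solve(B.T @ B + lam * W, B.T)
df = np.trace(S_lam)
```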

 

With this in hand, there are several ways to choose $\lambda$; I will introduce a few of them.

First, let's define $RSS = (\mathbf{Y}_n - \hat{\mathbf{Y}}_n)^\top (\mathbf{Y}_n - \hat{\mathbf{Y}}_n)$, and let $J$ denote the number of observation time points.

 

The first is generalized cross-validation (GCV), defined as follows:

$$GCV(\lambda) = \frac{J \cdot RSS}{(J-df)^2}$$

 

The second is AIC:

$$AIC(\lambda) = J \log{(J^{-1}RSS)}+2df$$

 

The third is BIC:

$$BIC(\lambda) = J\log{(J^{-1}RSS)} + \log{(J)}df$$

 

And the last is the cross-validation method.

 

GCV is the most popular criterion in FDA since it is easy to compute.
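As a final piece of the sketch, a simple way to pick $\lambda$ by GCV is a grid search (again reusing `B`, `W`, and `y` from above; the grid bounds are arbitrary):

```python
# Grid search for lambda minimizing GCV(lambda) = J * RSS / (J - df)^2.
lambdas = np.logspace(-10, 2, 60)
gcv_values = []
for lam in lambdas:
    S_lam = B @ np.linalg.solve(B.T @ B + lam * W, B.T)
    resid = y - S_lam @ y
    rss = resid @ resid
    df = np.trace(S_lam)
    gcv_values.append(len(y) * rss / (len(y) - df) ** 2)
best_lambda = lambdas[int(np.argmin(gcv_values))]
```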
