Theoretical Analysis of the Universal Approximation Properties of GELU in Neural Networks
Keywords:
Gaussian Error Linear Unit (GELU), Universal Approximation Theorem, Activation Functions, Neural Network, Neurons, Compact Domain
Abstract
The choice of activation function is critical to a neural network’s expressive power. The Rectified Linear Unit (ReLU) became a widely adopted standard due to its computational efficiency and effectiveness in mitigating vanishing gradients. However, ReLU also has well-known theoretical limitations, including non-differentiability at zero and the "Dying ReLU" problem, which can impede training. As an alternative, the smooth Gaussian Error Linear Unit (GELU) has seen increasing adoption in state-of-the-art models. This paper provides a rigorous theoretical analysis of GELU’s universal approximation properties. We formally prove that GELU satisfies the necessary and sufficient conditions of the Universal Approximation Theorem (UAT) by demonstrating that its smoothness ensures its membership in the required function class, and that its non-terminating Taylor series expansion establishes that it is not a polynomial. To support this theoretical analysis, we present a series of targeted empirical validations that visually and quantitatively demonstrate the practical consequences of these properties. Our experiments confirm that GELU’s smoothness provides a tangible advantage over ReLU in approximating smooth functions, especially in deep neural networks; that its non-zero gradient for negative inputs prevents the neuron death seen in ReLU; and that its unboundedness makes it better suited than Tanh to modeling non-saturating functions. This work provides a complete theoretical explanation for GELU’s power as a universal approximator, bridging the abstract UAT framework with the function’s specific mathematical properties.
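As a small illustration of the gradient property mentioned in the abstract, the sketch below evaluates GELU and its derivative at a few points and compares them with ReLU's derivative. It assumes the exact form GELU(x) = x·Φ(x), where Φ is the standard normal CDF, rather than any of the common approximations. The function names `gelu`, `gelu_grad`, and `relu_grad` are chosen here for illustration and are not taken from the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU, x * Phi(x), with Phi the standard normal CDF (assumed form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Derivative of x * Phi(x): Phi(x) + x * phi(x), with phi the normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu_grad(x: float) -> float:
    """ReLU derivative, taking the common convention of 0 at x = 0."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    # Compare gradients on a few negative and positive inputs.
    for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0):
        print(f"x={x:+.1f}  gelu={gelu(x):+.6f}  "
              f"gelu'={gelu_grad(x):+.6f}  relu'={relu_grad(x):.1f}")
```

At the sampled negative inputs ReLU's derivative is exactly zero while GELU's is small but non-zero, which is the behaviour the abstract associates with avoiding dead neurons.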