Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units
Preprint   Open access

Jake Ryland Williams and Haoran Zhao
arXiv.org
13 Nov 2023
url
https://doi.org/10.48550/arxiv.2311.07510
Preprint (Author's original); arXiv.org - Non-exclusive license to distribute; Open

Abstract

Computer Science - Learning; Mathematics - Probability; Physics - Data Analysis, Statistics and Probability; Statistics - Machine Learning
Iterative approximation methods using backpropagation enable the optimization of neural networks, but they remain computationally expensive, especially when used at scale. This paper presents an efficient alternative for optimizing neural networks that reduces the costs of scaling neural networks and provides high-efficiency optimizations for low-resource applications. We discuss a general result about feed-forward neural networks and then extend this solution to compositional (multi-layer) networks, which are applied to a simplified transformer block containing feed-forward and self-attention layers. These models are used to train highly specified and complex multi-layer neural architectures that we refer to as self-attentive feed-forward unit (SAFFU) layers, which we use to develop a transformer that appears to generalize well over small, cognitively feasible volumes of data. Testing demonstrates that explicit solutions outperform models optimized by backpropagation alone. Moreover, further application of backpropagation after explicit solutions leads to better optima from smaller scales of data; that is, explicit-solution warm starts enable training effective models from much less data. We then carry out ablation experiments, training a roadmap of about 250 transformer models over 1 million tokens to determine ideal settings. We find that multiple different architectural variants produce highly performant models, and discover from this ablation that some of the best are not the most parameterized. This appears to indicate that well-generalized models could be reached with less data by using explicit solutions, and that architectural exploration using explicit solutions pays dividends in guiding the search for efficient variants with fewer parameters, which could be incorporated into low-resource hardware where AI might be embodied.
