Computer Science - Learning; Mathematics - Probability; Physics - Data Analysis, Statistics and Probability; Statistics - Machine Learning
Iterative approximation methods using backpropagation enable the optimization
of neural networks, but they remain computationally expensive, especially when
used at scale. This paper presents an efficient alternative for optimizing
neural networks that reduces the costs of scaling neural networks and provides
high-efficiency optimizations for low-resource applications. We will discuss a
general result about feed-forward neural networks and then extend this solution
to compositional (multi-layer) networks, which are applied to a simplified
transformer block containing feed-forward and self-attention layers. These
models are used to train highly specified and complex multi-layer neural
architectures that we refer to as self-attentive feed-forward unit (SAFFU)
layers, which we use to develop a transformer that appears to generalize well
over small, cognitively feasible volumes of data. Testing demonstrates that
explicit solutions outperform models optimized by backpropagation alone.
Moreover, further application of backpropagation after explicit solutions leads
to better optima from smaller scales of data: explicit-solution warm starts
enable training effective models from much less data. We then carry out
ablation experiments training a roadmap of about 250 transformer models over
1 million tokens to determine ideal settings. We find that multiple
architectural variants produce highly performant models, and discover from this
ablation that some of the best are not the most parameterized. This appears to
indicate that well-generalized models could be reached with less data by using
explicit solutions, and that architectural exploration using explicit solutions
pays dividends in guiding the search for efficient variants with fewer
parameters, which could be incorporated into low-resource hardware where AI
might be embodied.
Details
Title
Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units