At one of the meetups I frequent, we were reading the book Understanding Deep Learning by Simon J. D. Prince. The author describes the double descent phenomenon for NN models. Many participants were surprised by the plot shown in the book, where the test error first gets worse and then improves beyond its previous level as model complexity increases. For the practitioners in the group, this behavior felt counterintuitive: how can a model keep improving on test data when it is already perfect on the training data?
This gave me the idea to look for a perfect geometric solution of the fitting problem and understand how it is structured.
I googled the idea first, because if it occurred to me, surely someone had already thought about it. It turned out—surprisingly—that no one I could find had, at least not from this angle. Everyone seemed to be considering only approximate methods. I guess people just have different habits. As an abstract mathematician, I was trained to search for an exact solution first. Usually this tendency is considered impractical, but this time it turned out to be helpful. I decided to write an article about it. A preprint is available here:
DOI: 10.13140/RG.2.2.18776.61442
In this post, I give a short history of how the paper developed.
Writing an article forces you to check all your assumptions and revisit familiar concepts. I learned that although my intuition was broadly correct, I needed results from computational geometry to support it. Fortunately, I did not need to go through graduate school for this. I read a textbook chapter or two here and there, and skimmed a number of articles. I had also taken a course on computational geometry years ago—Hyperplane Arrangements. It was not about triangulations, but it taught me how to work with hyperplanes, which turned out to be very useful.
The main observation rests on the following ideas. A feed-forward NN model with ReLU activation induces a polyhedral partition of the input space. When we restrict it to the convex hull of the training data, we can further partition each of these polyhedra into d-simplices: full-dimensional polyhedra with the minimal possible number of vertices in a Euclidean space of dimension d. For this step, we needed a theorem from computational geometry: only convex polyhedra are guaranteed to admit a finite simplicial partition. Moreover, we can construct such a partition so that every data point is a vertex of some d-simplex, provided the convex hull is full-dimensional.
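As a concrete illustration (not the paper's construction, just one standard way to obtain such a partition), here is a minimal sketch using scipy's Delaunay triangulation on some hypothetical training inputs X:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
X = rng.random((20, 2))            # 20 hypothetical training inputs, d = 2

tri = Delaunay(X)                  # simplicial partition of the convex hull
print(tri.simplices.shape)         # (number of simplices, d + 1) vertex indices

# For points in general position, every training point is a vertex of some simplex:
assert set(tri.simplices.ravel()) == set(range(len(X)))
```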
On each simplex, we can define an affine function whose values at the simplex vertices match the data outputs. A simplex in dimension d has exactly d+1 vertices, and d+1 affinely independent points uniquely determine a hyperplane in dimension d+1, which is where the graph of the prediction function lives, so everything works out nicely (assuming each data point has a single output). When two simplices touch, their affine functions agree on the shared boundary because they match at the same data points, so the pieces fit together continuously, without jumps. By restricting each affine function to its simplex and gluing all these pieces together, we obtain a model defined on the convex hull of the data with zero training error: both MSE and MAE vanish, hooray! We can use the resulting function as a model and call it the Simplex-Wise Linear Interpolation (SWLI) model.
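Continuing the hypothetical sketch above (X and tri from the previous snippet, with a made-up output column y), scipy's LinearNDInterpolator builds this kind of piecewise-affine interpolant over a triangulation, so we can check that the training error vanishes:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

y = np.sin(X[:, 0]) + X[:, 1] ** 2     # hypothetical outputs, one per data point
swli = LinearNDInterpolator(tri, y)    # affine on each simplex, glued continuously

pred = swli(X)                         # evaluate back on the training inputs
print(np.max(np.abs(pred - y)))        # ~0: training MSE and MAE both vanish
```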
That part was fun. The next question was whether an NN model can be constructed to represent exactly the same function.
The simplest example to think about is the absolute value function. To represent it as an NN, we need two hidden nodes to separate the two linear branches. Once the branches are separated, in the next layer we can define any function on one branch and suppress the other by setting its coefficient to zero.
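A minimal sketch of this two-node construction, just to make it concrete: |x| = ReLU(x) + ReLU(-x).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(-2.0, 2.0, 9)
branch_pos, branch_neg = relu(x), relu(-x)   # two hidden nodes, one per branch
out = 1.0 * branch_pos + 1.0 * branch_neg    # next layer: weight each branch freely
assert np.allclose(out, np.abs(x))           # reproduces |x| exactly
```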
Now I need to replicate this idea for the simplex-wise construction. Each simplex is bounded by the hyperplanes defined by its facets. I can use the equations of the facet hyperplanes in the first hidden layer and then apply ReLU to them. Do we really need ReLU here? Yes, we do. It is helpful to think of a hyperplane equation as a formula for the signed distance from a point to the hyperplane. The sign tells us on which side of the hyperplane a point lies, so positive and negative values let us detect whether a point is inside a given simplex: points inside should lie on the same side of each facet hyperplane as the simplex vertex that does not belong to that facet. We form one node for each half-space defined by these hyperplanes. The outputs of this layer can be scaled in the next layer, so we do not need exact distances, which would require normalizing the coefficients.
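To make the facet expressions concrete, here is a small hypothetical sketch (not taken from the paper) built on barycentric coordinates: each coordinate is an affine function that vanishes on one facet and is positive at the opposite vertex, so after ReLU they act as the unnormalized signed-distance expressions described above.

```python
import numpy as np

def facet_expressions(V):
    """Affine forms W @ x + b giving the barycentric coordinates of x in the
    simplex with vertex rows V (shape (d + 1, d)); each coordinate vanishes
    on one facet and is positive at the opposite vertex."""
    d = V.shape[1]
    A = np.vstack([V.T, np.ones(d + 1)])   # columns are [v_i; 1]
    M = np.linalg.inv(A)                   # rows are the affine forms
    return M[:, :d], M[:, d]

relu = lambda z: np.maximum(z, 0.0)

V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # a 2-simplex (triangle)
W, b = facet_expressions(V)

print(relu(W @ np.array([0.2, 0.2]) + b))   # all positive -> inside the simplex
print(relu(W @ np.array([1.0, 1.0]) + b))   # one output clipped to 0 -> outside
```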
So the first-layer outputs are these ReLU'd expressions, and they become the inputs of the next layer. In the second layer, I want to reconstruct the affine function corresponding to each simplex. I create one node per simplex. For each simplex, I take the inputs formed by its facet hyperplanes and pick the ones that are positive at one of the simplex vertices. This selects d+1 inputs, because by minimality each vertex has exactly one opposite facet. The coefficients for all other inputs are set to 0. The original affine function has d+1 parameters, while an affine function in the d+1 selected inputs has d+2 parameters, so the system is solvable, although with more variables than necessary.
What about contributions from other simplices? That depends on which variables reach their nodes with nonzero coefficients. Let us assume the selected inputs are distinct across simplices. We still end up computing a constant term for each affine function, right? Sure, but I can subtract it. Better yet, I can set the constants to zero from the start, because there are enough degrees of freedom to do so. (I wrote down another exact solution to please my mathematical soul.) The formula becomes shorter, easier to check, and less error-prone, something I have always appreciated.
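To make the zero-constant choice concrete, here is a sketch for a single simplex, reusing the hypothetical facet_expressions helper from above. Inside the simplex, the ReLU'd facet expressions equal the barycentric coordinates, so a node that weights them by the vertex outputs, with zero bias, already reproduces the affine interpolant there; the cross-simplex bookkeeping is left to the paper.

```python
import numpy as np

V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # simplex vertices
y = np.array([1.0, 3.0, -2.0])                        # outputs at those vertices
W, b = facet_expressions(V)                           # hypothetical helper above

x = np.array([0.25, 0.5])                             # a point inside the simplex
first_layer = np.maximum(W @ x + b, 0.0)              # ReLU'd facet expressions
node_output = y @ first_layer                         # weights = vertex outputs, bias = 0

# Check against the affine interpolant fitted directly from f(v_i) = y_i:
A = np.hstack([V, np.ones((3, 1))])                   # rows [v_i, 1]
coef = np.linalg.solve(A, y)                          # [a_1, a_2, c]
assert np.isclose(node_output, coef[:2] @ x + coef[2])
```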
At the end, I just sum everything up. For any point in the convex hull, only one node produces a nonzero value, so the sum is the output. No ReLU is needed here.
My article contains a list of conclusions about double descent and suggested research topics. If this post is all you are willing to read about the construction and the corresponding NN model, but you are still curious about the conclusions, see Section 2 of the article. I still have a few side ideas, but I am not sure they are worth putting in the article. Here they are:
- We can implement NN layers with paired nodes whose affine functions are opposite in sign. They could be useful for creating simplicial meshes in computational geometry. In particular, together with a triangulation, a perfectly fitted NN model provides a function that computes values on the resulting mesh. Although these meshes often do not correspond to single-valued functions globally, such constructions can still work locally.
- A study to characterize the interaction between triangulation refinement and the gradient descent algorithm would be interesting, particularly its effect on overall solution quality and learning efficiency.
- I am curious about implementing an SWLI model in practice, although I do not see an immediate application beyond a proof of concept. Perhaps we can compare its computational complexity with that of NN models. In a simple setting (two variables, three records, one output column), an implemented SWLI model appears quicker to construct than an NN is to train. It would be informative to understand when gradient descent outperforms triangulation-based constructions. One possible direction is to initialize an NN using SWLI model parameters derived from a triangulation to accelerate training. Such an approach might also help computational geometry specialists in 3D vision obtain better meshes, which remains challenging even with Delaunay triangulations.
You can see an example of a 3D mesh used in computer vision on the right. Here is the source of the image:
https://freesvg.org/wireframe-head-image
I used AI to find citations for the results I needed in my article. For the most part it worked fine, although it can generate books that do not exist and point to sources that use similar words but cover unrelated concepts. At least it provides a list one can examine. I recommend adding "Do not change math" to the prompt whenever you ask it to check the grammar and level of formality of a mathematical text.
For those who got to this point, thank you for your time!
