Tuesday, January 27, 2026

Research Diary: Neural Networks from Simplex-Wise Linear Interpolations

At one of the meetups I frequent, we were reading the book Understanding Deep Learning by Simon J. D. Prince. The author describes the double descent phenomenon for NN models. Many participants were surprised by the plot shown in the book, where the test error first gets worse and then improves beyond its previous level as model complexity increases. For practitioners in the group, this behavior felt counterintuitive. In particular, how can a model continue to improve on test data when it is already perfect on the training data?

This gave me the idea to look for a perfect geometric solution of the fitting problem and understand how it is structured. 

I googled the idea first, because if it occurred to me, surely someone had already thought about it. It turned out—surprisingly—that no one I could find had, at least not from this angle. Everyone seemed to be considering only approximate methods. I guess people just have different habits. As an abstract mathematician, I was trained to search for an exact solution first. Usually this tendency is considered impractical, but this time it turned out to be helpful. I decided to write an article about it. A preprint is available here:

DOI: 10.13140/RG.2.2.18776.61442

Here is a short history of how the paper developed.

Writing an article forces you to check all your assumptions and revisit familiar concepts. I learned that although my intuition was broadly correct, I needed results from computational geometry to support it. Fortunately, I did not need to go through graduate school for this. I read a textbook chapter or two here and there, and skimmed a number of articles. I had also taken a course on computational geometry years ago—Hyperplane Arrangements. It was not about triangulations, but it taught me how to work with hyperplanes, which turned out to be very useful.

The main observation rests on the following ideas. A feed-forward NN model with ReLU activation induces a polyhedral partition of the input space. When we restrict it to the convex hull of the training data, we can further partition each of these polyhedra into d-simplices: the full-dimensional polyhedra with the minimal possible number of vertices in a Euclidean space of dimension d. For this step we need a result from computational geometry: convex polyhedra always admit a finite simplicial partition, while non-convex ones in general do not. Moreover, we can construct such a partition so that every data point is a vertex of some d-simplex, provided the convex hull is full-dimensional.

On each simplex, we can define an affine function whose values at the simplex vertices match the data outputs. A simplex in dimension d has exactly d+1 vertices, and d+1 points in general position uniquely determine an affine hyperplane in the (d+1)-dimensional space where the graph of the prediction function resides, so everything works out nicely. Well, assuming that each data point has a single output. When two simplices touch each other, the affine functions agree on the shared boundary because they match at the same data points, so the pieces fit together continuously, without jumps. By restricting each affine function to its corresponding simplex and gluing all these pieces together, we obtain a model defined on the convex hull of the data with zero training error: both MSE and MAE vanish, hooray! We can use the resulting function as a model and call it a Simplex-Wise Linear Interpolation (SWLI) model.
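To make the construction concrete, here is a minimal Python sketch, my own illustration rather than code from the paper: SciPy's LinearNDInterpolator builds a Delaunay triangulation of the inputs and interpolates affinely on each simplex, so it produces one particular SWLI model. The tiny data set below is made up.

import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Toy training data in dimension d = 2, with one output per point (made-up numbers).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.4, 0.6]])
y = np.array([0.0, 1.0, 2.0, 3.0, 1.5])

# Delaunay triangulation of X plus an affine function on each simplex.
swli = LinearNDInterpolator(X, y)

print(np.allclose(swli(X), y))       # True: zero training error at the data points
print(swli(np.array([[0.5, 0.5]])))  # a prediction inside the convex hull
print(swli(np.array([[2.0, 2.0]])))  # nan: outside the convex hull the model is undefined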

That part was fun. The next question was whether a NN model can be constructed to represent exactly the same function.

The simplest example to think about is the absolute value function. To represent it as a NN, we need two hidden nodes to separate the two linear branches. Once the branches are separated, in the next layer we can define any function on one branch and suppress the other by setting its coefficient to zero.
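Written out in plain NumPy (my own toy example), the two-node network for the absolute value is |x| = ReLU(x) + ReLU(-x); keeping only one of the two second-layer coefficients would instead reproduce a single branch.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(-2.0, 2.0, 9)
# First layer: two hidden nodes, one per linear branch.
h1, h2 = relu(x), relu(-x)
# Second layer: coefficients (1, 1) give |x|; (1, 0) would keep only the right branch.
print(np.allclose(1.0 * h1 + 1.0 * h2, np.abs(x)))   # True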

Now I need to replicate this idea for the simplex-wise construction. Each simplex is bounded by the hyperplanes defined by its facets. I can use the equations of the facet hyperplanes in the first hidden layer and then apply ReLU to them. Do we really need ReLU here? Yes, we do. It is very helpful to think of hyperplane equations as formulas for signed distances from a point to the hyperplane. The sign of the evaluated distance tells us on which side of the hyperplane a point lies, so positive and negative values let us detect whether a point is inside a given simplex. Points inside a simplex lie on the same side of each facet hyperplane as the simplex vertex that does not belong to that facet. We form one node for each half-space defined by these hyperplanes. The outputs of this layer can be scaled in the next layer, so we do not need exact distances, which would require normalizing the coefficients.

So the first-layer outputs are these expressions, and they will be the inputs for the next layer. In the second layer I want to reconstruct the affine function corresponding to each simplex, so I create one node per simplex. For each simplex I find the inputs formed by its facet hyperplanes and pick the ones that are positive at one of the simplex vertices. This selects d+1 inputs, because by minimality each vertex has exactly one opposite facet. The coefficients of all other inputs are set to 0. The original affine function had d+1 parameters, and with the selected inputs as variables I can write an affine function with d+2 parameters. So the system is solvable, although I have more variables than necessary.
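Here is a small sketch of that per-simplex node for a single 2-simplex (a triangle), again my own illustration and not code from the paper. I normalize the facet expressions into barycentric coordinates purely for convenience; as noted above, any positive scaling works because the second layer can rescale. The check only covers points inside the simplex; the behavior elsewhere is handled by the argument below.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# One 2-simplex with vertices V and target outputs (made-up numbers).
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
targets = np.array([1.0, 3.0, -2.0])

# Facet expressions h_i(x) = a_i . x + b_i with h_i = 0 on the facet opposite
# vertex i and h_i(v_i) = 1, i.e. the barycentric coordinates of the simplex.
A = np.hstack([V, np.ones((3, 1))])   # rows are (v_j, 1)
W = np.linalg.solve(A, np.eye(3))     # column i holds (a_i, b_i)

def first_layer(x):
    # ReLU of the facet expressions; all are nonnegative exactly when x is inside.
    return relu(np.append(x, 1.0) @ W)

def simplex_node(x):
    # Second-layer node: coefficients targets_i on its own d+1 inputs, zero constant term.
    return first_layer(x) @ targets

x = np.array([0.5, 0.25])                     # a point inside the simplex
affine = (np.append(x, 1.0) @ W) @ targets    # the affine interpolant at x
print(np.isclose(simplex_node(x), affine))    # True
print([simplex_node(v) for v in V])           # the prescribed outputs 1.0, 3.0, -2.0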

What about contributions from other simplices? That depends on which variables reach their nodes with nonzero coefficients. Let us assume that the inputs are distinct across simplices. We still have all the affine function constants contributing, right? Sure, but I can subtract them. Or better, I can set them to zero from the start, because I have enough degrees of freedom to do that. (I wrote down another exact solution to please my mathematical soul.) The formula becomes shorter, easier to check, and less error-prone, something I have always appreciated.

At the end, I just sum everything up. Only one node produces a nonzero value on points in the convex hull, and that sum is the output. No need for ReLU here.

My article contains a list of conclusions about double descent and suggested research topics. If this post is all you are willing to read about the structure and the corresponding NN model while still being curious about the conclusions, see Section 2 of the article. I still have a few side ideas, but I am not sure they are worth putting in the article. Here they are:

  • We can implement NN layers with paired nodes carrying opposite affine functions. They could be useful for creating simplicial meshes in computational geometry. In particular, together with a triangulation, a perfectly fitted NN model provides a function which computes values on the resulting mesh. Although these meshes are often not graphs of single-valued functions globally, such constructions can still work locally.
  • A study to characterize the interaction between triangulation refinement and the gradient descent algorithm would be interesting, particularly its effect on overall solution quality and learning efficiency.
  • I am curious about implementing a SWLI model in practice, although I do not see an immediate application beyond a proof of concept. Perhaps we can compare its computational complexity with NN models. In a simple setting (two variables, three records, one output column) an implemented SWLI model appears quicker to construct than an NN is to train. It would be informative to understand when gradient descent outperforms triangulation-based constructions. One possible direction is to initialize an NN with SWLI model parameters derived from a triangulation to accelerate training. Such an approach might also help computational geometry specialists in 3D vision obtain better meshes, which remains challenging even with Delaunay triangulations.

 You can see an example of a 3D mesh used in computer vision on the right. Here is the source of the image: 

https://freesvg.org/wireframe-head-image 

I used AI to find citations for the results I needed in my article. For the most part it worked fine, although it can generate books that do not exist and point to sources that use similar words but unrelated concepts. At least it provides a list one can examine. I recommend adding "Do not change the math" to any math text where you want AI to check grammar and the level of formality. Otherwise it might add terms that it has seen as highly correlated with yours in its training data, even though they do not suit your topic.

For those who got to this point, thank you for your time! 

Thursday, April 3, 2025

Tables in Bookdown pdf_book format

Motivation

I once spent a few days trying to tame tables in Bookdown's pdf_book format. I checked a few R packages which were supposed to help me. They were not visual, took time to learn and eventually were not very useful. When I tried to add a caption to my table, the text which should follow the table was getting pulled inside it. Now I know that it is not the package authors' fault but the way the package output was rendered in the R Markdown PDF output, and it might be fixed by the time you are reading this post. I looked up the Bookdown book by Yihui Xie, and he favors the HTML format, or at least methods which work with both CSS and LaTeX. His preferred method did not fit in the visible part of my RStudio editor window.

I decided to focus on the PDF format because I wanted a book with easy references and an index. Plus I can put it in my GitHub account, update versions and download it to read offline.

I searched the internet and discovered a lot of advice to switch to the CSS format as well. Apparently I had no choice but to do it myself and get my hands dirty. In addition, I learned that my R Markdown file is converted to TeX format before being turned into PDF, so why not write the LaTeX chunk I want directly, without relying on the correctness of the R Markdown rendering?

I already had an unsuccessful experience a few years back, when I needed a LaTeX package in R Markdown and it did not work. I decided to try again, hoping that R Markdown is more developed now. I examined different methods used in the Bookdown pdf_book format and discovered a lot of LaTeX commands and environments used as is. I tried the basic LaTeX environment for a table, tabular, and it worked! Although for a really good table you need the booktabs package.

LaTeX intro

If you do not know LaTeX at all, I provide a few tips here. Please remember that they work only for the pdf_book format. If you try to use them for text which you also want in HTML format, they won't work.

1. LaTeX commands start with a backslash, which is not printed as is but only serves to denote the command. Here is an example of a command which places everything after it on a new page:

\newpage

Here are a couple of others, which add vertical space if you want more distance between your sections, or between your picture and your table:

\vspace{5mm}

\vspace{1cm}

Notice the metric system. Both commands should be padded by empty lines before and after to work properly.

2. Any number of empty lines in LaTeX is collapsed to one, and any number of inline spaces is collapsed to one, too. So it is the same as in Markdown.

3. LaTeX has environments for specific page chunks, like pictures, tables, formulas, font types and sizes. Their start and end must be explicitly marked, and they must be properly nested. Do not forget an environment's end or confuse which environment ends where! The LaTeX gods will be furious and stop helping you until you fix your wrong ways. You are not even likely to see a helpful error message, because you are supposed to know what you did. 💣

4. Basic and many other LaTeX environments use the same command to go to the next line: a double backslash.

\\

One notable exception is math environments, which are a whole new game. Here is the basic, simple inline form where you only use dollar signs; for example, $x^2$ will be converted to x squared in math script on the PDF page.

But when you need to display a bunch of formulas together, LaTeX has a choice of environments for you. By the way, to keep LaTeX from assuming that a $ opens a math environment, put a backslash before it; otherwise you will see a processing error. Say you want to get $5.67 on your Markdown page: write it as \$5.67.
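For example, assuming the default pandoc LaTeX template (which loads the amsmath package), a raw display-math environment such as align* passes straight through to the PDF and lines up several formulas at the & signs:

\begin{align*}
(x+y)^2 &= x^2 + 2xy + y^2 \\
(x-y)^2 &= x^2 - 2xy + y^2
\end{align*}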

Back to tables!

The LaTeX environment we need here starts with \begin{tabular} and ends with \end{tabular}. Below you see an example with all the necessary elements of a basic LaTeX tabular environment, how it is rendered, and an explanation.

\begin{tabular}{l l}
column 1, row 1 & column 2, row 1 \\
column 1, row 2 & column 2, row 2 \\
\end{tabular} 

Here is what you see on your PDF page: a plain two-column, two-row table with the four entries above, left-aligned and with no borders.

Note how we correctly stated the start and the end of the environment. In addition, a few elements are mandatory.
1. As you see, I have {l l} right after the environment opening. These are the column alignment selections. I have two of them because I have two columns, and the option l means that each of my columns must be left-aligned. The other options here are r for right and c for center.
2. Each row ends with a double backslash: \\
3. Within a row, the entries of different columns are separated by the & symbol.
This table is rather bare-bones, and we do not have divider lines or borders.

Now let us add horizontal and vertical borders. For the horizontal lines we will use the booktabs commands \toprule, \bottomrule and \midrule. The first two can appear only once each in the table, while the last one can be used as many times as needed. Here is how it looks as raw text:
 \begin{tabular}{l l}
\toprule
column 1, row 1 & column 2, row 1 \\
\midrule
column 1, row 2 & column 2, row 2 \\
\bottomrule
\end{tabular} 
Here is what you will see when you process the commands: the same table, now with a horizontal rule above the first row, another between the two rows, and a third below the last row.

Now we need to add vertical borders. They are less easy to spot in the code because they are added to the alignment options, so instead of {l l} we type {|l|l|}:
 
 \begin{tabular}{|l|l|}
\toprule
column 1, row 1 & column 2, row 1 \\
\midrule
column 1, row 2 & column 2, row 2 \\
\bottomrule
\end{tabular}
This LaTeX code can be used as is in R Markdown if you have a pdf option for your output in the YAML metadata (header). If you get an "Undefined control sequence" error for \toprule, make sure the booktabs package is loaded, for example by adding \usepackage{booktabs} to the LaTeX preamble through the YAML metadata. Hope it helps!

Over-parameterization: a reprint about a geometric approach to estimating a number of parameters for Large Language and Image Processing Models

I learned from the book "Understanding Deep Learning" by Simon Prince how current Large Language and Image Processing models (object detection and classification, generation and others) can benefit from over-parameterization, although at the moment nobody knows exactly how it happens. I pondered the question and, being an abstract mathematician by training, thought about something which is usually so useless in Data Science that people as a rule ignore it completely, namely an exact solution. It turned out that in this particular case searching for an exact solution might be a reasonable approach. Here is my preprint about it: http://dx.doi.org/10.13140/RG.2.2.18776.61442

Tuesday, April 2, 2024

Neo4j Course Compilation Book, PDF, 4 courses

 I added a course "Importing CSV data into Neo4j" to the book:

https://github.com/Mathemilda/BOOKS/tree/main/Neo4j

Update: GraphAcademy representative contacted me and asked not to publish the compilation book.

Saturday, March 23, 2024

Neo4j Course Compilation Book, PDF, 3 courses

I started a book of compiled Neo4j courses that I have taken. The courses were originally published on https://graphacademy.neo4j.com/ in HTML format, with a separate webpage for each section. I wanted a reference where I can do a global search, and eventually I plan to add an index section with the main concepts. I emailed Neo4j GraphAcademy to check what they think of it, and they did not reply. So I went with "if it is not forbidden, then it is allowed". As a data scientist, I selected courses which are beneficial for me. I started with 3 courses (Neo4j Fundamentals, Cypher Fundamentals, Graph Data Modeling Fundamentals) and I plan to add others.

Here is my github link for the book folder:  

https://github.com/Mathemilda/BOOKS/tree/main/Neo4j

Do not hesitate to inform me if you find mistakes of any kind in the book. I will appreciate positive comments, too!

Update: GraphAcademy representative contacted me and asked not to publish the compilation book.

Wednesday, January 31, 2024

Explaining Neural Networks as a corporation chain of command, revisited.

I have seen the standard presentation of NNs based on brain functioning, and I know that a number of people get the impression that one should know biology to understand a NN's inner structure and workings. It is not so. I wrote a post with a simpler analogy for NNs:

Neural Networks as a Corporation Chain of Command

I discovered that, in addition, a lot of people express a desire for NN results to be more explainable by the network's inner functioning. I believe that my analogy is helpful for this too, and I added a conclusion about it to my post. You can see it below.
 
As we know, a company's success is defined by its top management. They select a company structure and define an internal company policy, which middle management adapts and directs further by the usual means. We know that an industry leader could originate by chance, from a lucky initial idea. But its subsequent performance is a learned behavior. People use the experience of other company managers, combining it with their own trial & error process.

Monday, November 6, 2023

Using Tensorflow tf.keras.layers.Layer to write a custom subclass

If you work with Tensorflow and want to write a custom layer as a subclass of tf.keras.layers.Layer, then there are some details which took me a while to figure out. 

1) There are usually instructions to create the layer for the Sequential API (tf.keras.Sequential), and one of the mandatory input parameters is input_shape. If you write your layer for the Functional API (tf.keras.Model), then the corresponding parameter is shape. Thanks to the book "TensorFlow in Action" by Thushan Ganegedara for this!
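Here is a quick side-by-side sketch of my own (not from the book) showing where each parameter goes; the layer sizes are arbitrary:

import tensorflow as tf

# Sequential API: the first layer receives input_shape.
seq_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Functional API: tf.keras.Input receives shape instead.
inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(8, activation="relu")(inputs)
func_model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(hidden))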

2) The subclass may not calculate the output tensor shape correctly when you do some transformations. If the tensor shape and/or its number of dimensions changes during the pass through your custom layer, add a method for output shape computation at the end, so your layer will call it and the next layer will work smoothly with it. The book above mentions the compute_output_shape() method, although without detail. You can see an example of it below, at the end of the class. I found the correct format on Stack Overflow:

https://stackoverflow.com/questions/60598421/tf-keras-custom-layer-output-shape-is-none

About the code below: I marked some custom rows with ellipses (no, I do not use the Python Ellipsis notation here). The variables param00 and param01 are your optional layer parameters. You may also add a build method to define initial variables or constants for the layer which are needed at the start of the layer calculations, but it is optional. The shape (or input_shape) parameter must be present if you use the layer at the start of a model or if you change a tensor shape during your layer pass.

import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):

    def __init__(self, shape, param00, param01, **kwargs):
        # Pass standard keyword arguments (name, dtype, ...) on to the base class.
        super(CustomLayer, self).__init__(**kwargs)
        self.shape = shape
        self.param00 = param00
        self.param01 = param01

        ...

    def call(self, input_tensor, training=None):
        # training is supplied by Keras automatically (see point 3 below).
        if training:
            ...

            ...

            return output_tensor
        else:
            return input_tensor

    def compute_output_shape(self, input_shape):
        # Tell the following layer what tensor shape to expect.
        return (input_shape[0], self.output_dim)

3) The training parameter has a boolean value and indicates whether the layer is being run in training mode or for prediction. For example, during training you can apply custom transformations, like adding uniform noise instead of Gaussian. The parameter is set automatically and should not be used for anything else.
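As a concrete illustration of such a layer (my own sketch, not from the book): it adds uniform noise only while training is true and passes inputs through unchanged otherwise.

import tensorflow as tf

class UniformNoise(tf.keras.layers.Layer):
    def __init__(self, scale=0.1, **kwargs):
        super().__init__(**kwargs)
        self.scale = scale

    def call(self, inputs, training=None):
        # Keras fills in training automatically: True in fit(), False in predict().
        if training:
            noise = tf.random.uniform(tf.shape(inputs), -self.scale, self.scale)
            return inputs + noise
        return inputs

# The output shape equals the input shape here, so compute_output_shape is not needed.
layer = UniformNoise(scale=0.05)
x = tf.ones((2, 3))
print(layer(x, training=True))   # noisy copy
print(layer(x, training=False))  # unchanged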

4) There are several other variables, parameters and methods which must be named just so and used in a particular way. They are inherited from the class. You can see some of them in the TensorFlow documentation on making new layers and models via subclassing:

https://www.tensorflow.org/guide/keras/making_new_layers_and_models_via_subclassing

Of course self and training are on the list, but in addition the list contains the parameter shape for the Functional API and input_shape for the Sequential API, as well as input_tensor, output_tensor and output_dim. If you keep getting weird error messages about your identifiers, this could be the reason. Here are a couple of ways to deal with it:

    a) You can look up Tensorflow code for the layer class.

    b) If you intend to use the code only for yourself, then a lazy way to fix the problem is switching your identifiers to something that does not follow PEP 8 or scikit-learn naming standards.


Friday, July 14, 2023

The Blue Planet effect: ANOVA for Twitter data


Introduction

One of my clients graciously gave me this data set, which I had worked with under her contract. It was mined from Twitter for posts containing the word “bottle”, and then a sentiment value was evaluated for each tweet. My goal was to investigate whether an influence of the “Blue Planet II” documentary can be detected on Twitter.

The “Blue Planet II” documentary is a series of 8 episodes. It debuted on 29 October 2017 in the United Kingdom, the Nordic region, Europe and Asia. In the United States, the series premiered on 20 January 2018. Debut dates for other countries are published here:

https://en.wikipedia.org/wiki/Blue_Planet_II

I have already used the data in one of my previous blog posts, and you can look at my work here: The “Blue Planet II” effect: Twitter data wrangling & ggplot2 plot

Because blogspot.com has a restriction on post size, I will skip the pre-processing and the timeline count plot here. I will hide my code this time, too.

Recap of data processing.

Our data span the years 2015-2019.

Tweets mention different kinds of bottles: plastic, metal, refillable, recyclable, hot water bottles, insulated bottles, and even seaweed pouches. Bottled water appears as well. I searched the tweet texts with regular expressions for such words to determine my categories, which are described below.

Some of the tweets were classified as “Env concerned” thanks to words like “recycl”, “refil” and such. To compensate for misspellings I used approximate matching; a sketch of this idea follows after the category descriptions.

The “Hot water bottle” category contains simple mentions of hot water bottles.

The “Water bottle” category contains posts where a water bottle kind was not specified.

The data contain very few (below 150) posts about other bottles, like insulated, vacuum or baby bottles. I dropped them.
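The original work was done in R. Purely as an illustration, here is how a similar category assignment could look in Python with the third-party regex package, whose fuzzy-matching syntax {e<=1} tolerates one edit and thus catches common misspellings; the patterns and tweets below are made up.

import regex  # third-party package: pip install regex

# Allow up to one edit (insertion, deletion or substitution) inside each stem.
env_pattern = regex.compile(r"(?:recycl){e<=1}|(?:refil){e<=1}", regex.IGNORECASE)
hot_water_pattern = regex.compile(r"hot\s+water\s+bottle", regex.IGNORECASE)

def categorize(tweet):
    if env_pattern.search(tweet):
        return "Env concerned"
    if hot_water_pattern.search(tweet):
        return "Hot water bottle"
    return "Water bottle"

print(categorize("Please recycle your bottle!"))         # Env concerned
print(categorize("recyle your bottles, people"))         # Env concerned despite the typo
print(categorize("My hot water bottle is a lifesaver"))  # Hot water bottle
print(categorize("New water bottle for the gym"))        # Water bottle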

As we see there are thousands of tweets in the data, which means that we might get reliable statistics.

Bottle type         Count
Env concerned       21207
Hot water bottle     2165
Water bottle         6184

Since the tweets came with evaluated sentiment values, we can check whether the values differ as a result of society's views on plastic. The sentiment values range from -0.375 to 0.667, and I calculated averages for each week of each year and plotted them. A vertical line represents the documentary debut date. I smoothed the curves for easier trend detection. The colored corridors around them represent how far off the true curves might be.

My client wanted to see if posters' sentiments about disposable bottles changed after the documentary. Unfortunately, before the fall of 2017 the number of posts in each category was very small, and ANOVA could not produce a reliable result to detect such a change.

Nevertheless, we can answer the following question: do people write with different feelings about different bottle purposes?

For this I will check whether the differences between the category sentiment value averages are statistically significant. At the moment our plot shows that the “Water bottle” category is the most neutral, while the others are somewhat more positive. We will check what statistical analysis can tell us here.

ANOVA for Category Sentiment Values.

ANOVA checks whether the means of distinct data categories differ. It is a generalization of the t-test to more complicated cases. First we will check the ANOVA assumptions.

  1. Interval data of the dependent variable.
    • Our dependent variable representing sentiment values is continuous.
  2. Normality
    • We can graph normal Q-Q plots for the categories to see that they are mostly normal, except for some deviations in the “Hot water bottle” category. Strictly speaking, the ANOVA numbers might be a bit off, but not by much.
  3. Homoscedasticity
    • There are 2 tests for checking homoscedasticity, that is, for ascertaining whether our groups have variances different enough to detect from the data: the Bartlett test and the Levene test. The first one is used when the variable groups have normal distributions, and the second is applied when the distributions do not look normal. I did the second one and got an F-value of 212.36 and a p-value \(< 2.2\cdot 10^{-16}\), so strictly speaking the test does detect a difference in variances, and this assumption holds only approximately (a code sketch for this check follows after the list).
  4. No multicollinearity.
    • The vectors in question have different lengths, so they cannot be multicollinear. There is sometimes a stricter requirement of independence, with the remark that it is much harder to check.
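Here is how the Levene check could be reproduced in Python with SciPy; the original analysis was done in R, and the arrays below are made-up stand-ins for the real per-category sentiment values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up stand-ins for the per-category sentiment values.
env = rng.normal(0.10, 0.20, size=300)
hot_water = rng.normal(0.15, 0.25, size=120)
water = rng.normal(0.00, 0.20, size=200)

# Levene's test (robust to non-normality); a small p-value means unequal variances.
stat, p_value = stats.levene(env, hot_water, water, center="median")
print(stat, p_value)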

Computing the ANOVA statistics yielded an F-value of 993 and a p-value \(< 2.2\cdot 10^{-16}\). Judging by these results, we can say at a 99% confidence level that the category sentiment value means are not all the same (a code sketch follows after the next paragraph).

In addition, we can look at confidence intervals for the differences. I used the Tukey method for Honest Significant Differences with a 99% confidence level. It computes confidence intervals for the mean differences. In addition to the calculations, the R test function provides a graph of the intervals, so you can see whether any of them contains 0. As we see, all the means are different, because the differences between them do not include 0.
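And here is how the one-way ANOVA and the Tukey HSD intervals could be computed in Python with SciPy (tukey_hsd requires SciPy 1.8 or later); again the arrays are made-up stand-ins, since the real data is not included in this post.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up stand-ins for the per-category sentiment values.
env = rng.normal(0.10, 0.20, size=300)
hot_water = rng.normal(0.15, 0.25, size=120)
water = rng.normal(0.00, 0.20, size=200)

# One-way ANOVA: are the category means all the same?
f_stat, p_value = stats.f_oneway(env, hot_water, water)
print(f_stat, p_value)

# Tukey's Honest Significant Differences with 99% confidence intervals:
# a pairwise interval that excludes 0 marks a significant mean difference.
res = stats.tukey_hsd(env, hot_water, water)
ci = res.confidence_interval(confidence_level=0.99)
print(ci.low)
print(ci.high)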