Mathematician in Data Science: 2023

If you work with TensorFlow and want to write a custom layer as a subclass of tf.keras.layers.Layer, there are some details that took me a while to figure out.

1) Instructions usually show how to create the layer for the Sequential API (tf.keras.Sequential), where one of the mandatory input parameters is input_shape. If you write your layer for the Functional API (tf.keras.Model), the corresponding parameter is shape. Thanks to the book "TensorFlow in Action" by Thushan Ganegedara for this!

2) The subclass may not compute the output tensor shape correctly when you apply transformations. If the tensor shape and/or its number of dimensions changes during the pass through your custom layer, add a method for output shape computation at the end, so that your layer calls it and the next layer works smoothly with the result. The above book mentions the compute_output_shape() method, although without detail. You can see an example of it at the end of the code below. I found the correct format on Stack Overflow:

https://stackoverflow.com/questions/60598421/tf-keras-custom-layer-output-shape-is-none

About the code below: I marked some custom rows with an ellipsis (no, I do not use the Python Ellipsis notation here). The variables param00 and param01 are your optional layer parameters. You may also add a build method to define initial variables or constants that the layer needs at the start of its calculations, but this is optional. The shape (or input_shape) parameter must be present if you use the layer at the start of a model or if you change a tensor shape during your layer pass.

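import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):

    def __init__(self, shape, param00, param01, **kwargs):
        # Call the parent constructor first and pass the standard
        # Keras arguments (name, dtype, ...) through to it.
        super().__init__(**kwargs)
        self.shape = shape
        self.param00 = param00
        self.param01 = param01
        # self.output_dim, used in compute_output_shape below,
        # has to be defined in the custom rows here as well.
        ...

    def call(self, input_tensor, training=None):
        # training must appear in the signature to be usable here.
        if training:
            ...
            ...
            return output_tensor
        else:
            return input_tensor

    def compute_output_shape(self, input_shape):
        # The argument name must match the name used in the body.
        return (input_shape[0], self.output_dim)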

3) The training parameter is a boolean that tells the layer whether it is being called during training or during prediction, so a layer that is needed only for training can pass its input through unchanged otherwise. For example, you can apply a custom transformation such as uniform noise instead of Gaussian noise. Keras sets the parameter automatically (for example in fit() and predict()), and it should not be used for anything else.
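To illustrate point 3, here is a minimal sketch of a layer that adds uniform noise only during training. The class name UniformNoise and its amplitude parameter are hypothetical names chosen for this illustration; they do not come from the book or from the TensorFlow documentation. Since the layer does not change the tensor shape, compute_output_shape simply returns its argument.

```python
import tensorflow as tf


class UniformNoise(tf.keras.layers.Layer):
    """Adds uniform noise in [-amplitude, amplitude] during training only."""

    def __init__(self, amplitude=0.1, **kwargs):
        super().__init__(**kwargs)
        self.amplitude = amplitude  # hypothetical layer parameter

    def call(self, inputs, training=None):
        if training:
            noise = tf.random.uniform(
                shape=tf.shape(inputs),
                minval=-self.amplitude,
                maxval=self.amplitude,
                dtype=inputs.dtype,
            )
            return inputs + noise
        # During prediction the layer passes its input through unchanged.
        return inputs

    def compute_output_shape(self, input_shape):
        # The noise does not change the shape of the tensor.
        return input_shape


x = tf.ones((4, 3))
layer = UniformNoise(amplitude=0.5)
same = layer(x, training=False)   # identical to x
noisy = layer(x, training=True)   # x plus uniform noise
```

Inside model.fit() and model.predict() you do not pass training yourself; Keras supplies the value when it calls the layer.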

4) There are several other variables, parameters and methods which must be named just so and used in a particular way. They are inherited from the base class. You can see some of them in the TensorFlow documentation for the layer:

https://www.tensorflow.org/guide/keras/making_new_layers_and_models_via_subclassing

Of course self and training are on the list, but in addition the list contains the parameters shape for the Functional API and input_shape for the Sequential API, as well as input_tensor, output_tensor and output_dim. If you keep getting weird error messages about your identifiers, this could be the reason. Here are a couple of ways to deal with it:

a) You can look up the TensorFlow source code for the Layer class.

b) If you intend to use the code only for yourself, then a lazy way to fix the problem is to switch your identifiers to names that follow the PEP 8 and scikit-learn conventions less closely.


Introduction

One of my clients graciously gave me this data set, which I worked with under her contract. It was mined from Twitter posts containing the word “bottle”, and then sentiment values were evaluated for each tweet. My goal was to investigate whether the influence of the “Blue Planet II” documentary can be detected on Twitter.

The “Blue Planet II” documentary is a series of 8 episodes. It debuted on 29 October 2017 in the United Kingdom, the Nordic regions, Europe and Asia. In the United States, the series premiered on 20 January 2018. Premiere dates for other countries are listed here:

https://en.wikipedia.org/wiki/Blue_Planet_II

I have already used the data in one of my previous blog posts, and you can look at my work here: ‘The “Blue Planet II” effect: Twitter data wrangling & ggplot2 plot’.

Because blogspot.com restricts the size of a post, I will skip the pre-processing and the timeline count plot here. I will hide my code this time, too.

Recap of data processing.

Our data span years 2015-2019.

There are different bottles mentioned in the tweets: plastic, metal, refillable, recyclable, hot water bottle, insulated, and even seaweed pouches. Bottled water appears as well. I searched the tweet texts with regular expressions for such words to determine my categories, which are described below.

Some of the tweets were classified as “Env concerned” based on word stems such as “recycl” and “refil”. To compensate for misspellings I used approximate matching.

The “Hot water bottle” category contains plain mentions of hot water bottles.

The “Water bottle” category contains posts where the kind of water bottle was not specified.

The data contain very few (fewer than 150) posts about other bottles, such as insulated, vacuum or baby bottles. I dropped them.
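The categorization described above can be sketched roughly as follows. This is not the code used for the post (that code is hidden and not shown here); it is a small Python illustration that relies only on the standard library. The similarity threshold, the function names and the rule that “Env concerned” takes priority over the other categories are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

ENV_STEMS = ("recycl", "refil")  # stems quoted in the post
THRESHOLD = 0.8                  # hypothetical similarity cut-off


def has_env_stem(text, stems=ENV_STEMS, threshold=THRESHOLD):
    """True if some word starts with something close to one of the stems."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for stem in stems:
            prefix = word[:len(stem)]
            # Approximate matching tolerates small misspellings.
            if SequenceMatcher(None, prefix, stem).ratio() >= threshold:
                return True
    return False


def categorize(text):
    """Return the category name, or None for tweets that would be dropped."""
    if has_env_stem(text):  # assumed to take priority
        return "Env concerned"
    lowered = text.lower()
    if re.search(r"hot\s+water\s+bottle", lowered):
        return "Hot water bottle"
    if re.search(r"water\s+bottle", lowered):
        return "Water bottle"
    return None


tweets = [
    "I always recycle my plastic bottle",
    "Please recyle your bottles",            # misspelled on purpose
    "Cuddling my hot water bottle tonight",
    "Forgot my water bottle at the gym",
    "Bought an insulated bottle",
]
labels = [categorize(t) for t in tweets]
```

A lower threshold catches more misspellings but also more unrelated words, so the cut-off has to be checked against real tweets.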

As the table below shows, there are thousands of tweets in the data, which means that we might get reliable statistics.

Bottle type	Count
Env concerned	21207
Hot water bottle	2165
Water bottle	6184

Since the tweets came with evaluated sentiment values, we can check whether the values differ as a result of society's views on plastic. The sentiment values range from -0.375 to 0.667, and I calculated their averages for each week and year and plotted them. A vertical line represents the documentary debut date. I smoothed the curves for easier trend detection. The colored corridors around them show how far off the true curves might be.

My client wanted to see whether posters’ sentiments about disposable bottles changed after the documentary. Unfortunately, before the fall of 2017 there were very few posts in each category, and ANOVA could not produce a reliable result to detect such a change.

Nevertheless, we can answer the following question: Do people write with different feelings about different bottle purposes?

For this I will check whether the differences between the category averages of the sentiment values are statistically significant. At the moment our plot shows that the “Water bottle” category is the most neutral, while the others are somewhat more positive. We will check what statistical analysis can tell us here.

ANOVA for Category Sentiment Values.

ANOVA tests whether the means of several categories of data differ. It generalizes the two-sample t-test to more than two groups. First we will check the ANOVA assumptions.
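To make the idea concrete, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. It compares the variation between group means with the variation inside the groups. The function name and the toy numbers are made up for illustration; they are not the Twitter data, and the real analysis was done in R.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [fmean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up numbers
f_value, df_b, df_w = one_way_anova_f(toy_groups)
```

A large F value means that the group means are spread out far more than the scatter inside the groups would explain. The p-value then comes from the F distribution with the two degrees of freedom returned above.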

Interval data of the dependent variable.
- Our dependent variable representing sentiment values is continuous.
Normality.
- We can graph Normal Q-Q plots for the categories to see that they are mostly normal, except for some deviations in the “Hot water bottle” category. Strictly speaking, the ANOVA numbers might be a bit off, but not by much.
Homoscedasticity.
- There are 2 tests for checking homoscedasticity, that is, for checking that our groups do not have detectably different variances: the Bartlett test and the Levene test. The first one is used when the variable groups have normal distributions, and the second is applied when the distributions do not look normal. I did the second one and got an F-value of 212.36 and a p-value \(< 2.2\cdot 10^{-16}\). Such a small p-value rejects the hypothesis of equal variances, so this assumption is not strictly met and the ANOVA numbers should be read with that caveat.
No multicollinearity.
- The vectors in question have different lengths, so they cannot be multicollinear. Sometimes a stricter requirement of independence is stated instead, with a remark that it is much harder to check.
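For readers who want to see what the Levene test computes, here is a minimal sketch in plain Python. It is a one-way ANOVA F statistic applied to the absolute deviations of each observation from its group center. The sketch uses the group median as the center (the Brown-Forsythe variant); which variant was run for the post is not stated, so treat this choice, the function name and the toy numbers as assumptions.

```python
from statistics import fmean, median


def levene_statistic(groups):
    """F-type statistic on absolute deviations from the group medians."""
    # Replace each observation by its distance from the group median.
    deviations = [[abs(x - median(g)) for x in g] for g in groups]

    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = sum(sum(d) for d in deviations) / n_total
    means = [fmean(d) for d in deviations]

    ss_between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, means)
    )
    ss_within = sum(
        (z - m) ** 2 for d, m in zip(deviations, means) for z in d
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up groups whose spread grows from left to right.
w_unequal = levene_statistic([[1, 2, 3], [2, 4, 6], [1, 5, 9]])
# Made-up groups with the same spread but different locations.
w_equal = levene_statistic([[1, 2, 3], [11, 12, 13]])
```

A large statistic, and therefore a small p-value, is evidence that the group variances are not equal.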

Computing the ANOVA statistics yielded an F-value of 993 and a p-value \(< 2.2\cdot 10^{-16}\). Judging by these results, we can say at the 99% confidence level that the category means of the sentiment values are not all the same.

In addition, we can look at confidence intervals for the differences. I used the Tukey method for Honest Significant Differences with a confidence level of 99%. It computes confidence intervals for the differences between means. Besides the calculations, the R function provides a graph of the intervals, which shows whether any of them contains 0. As we see, all the means are different, because none of the intervals for their differences includes 0.

Monday, November 6, 2023

Using Tensorflow tf.keras.layers.Layer to write a custom subclass

Friday, July 14, 2023

The Blue Planet effect: ANOVA for Twitter data
