Mathematician in Data Science: 2022

Wednesday, November 2, 2022

Improving my Shiny app for Law of Large Numbers

I have updated my Shiny app for the Law of Large Numbers. Here is the list of changes:

Added a Pareto distribution that has a finite mean but no finite standard deviation.
Added a choice of the interval displayed in the histogram, centered around 1.
Sample ranges are now displayed in the plot title.
Added an option to re-generate the random numbers used for mean calculations.
Made it more visually pleasing by switching the background color to a more paper-like one and the fonts to more bookish ones.
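
For readers who want to try the Pareto case outside the app, here is a minimal Python sketch. It is not the app's code (the app is written in Shiny), and the function name `running_means`, the shape value 1.5 and the seed are my own assumptions for illustration. It uses the standard library's `random.paretovariate`, which draws Pareto values with minimum 1. For a shape parameter between 1 and 2 the mean is finite but the variance is not, so the running means still approach the true mean, only slowly and with occasional jumps. The draws are divided by the true mean so that the averages are centered around 1.

```python
import random


def running_means(alpha, n, seed=0):
    """Running means of n Pareto(alpha) draws, rescaled so the true mean is 1.

    Requires alpha > 1 so that the mean alpha / (alpha - 1) is finite.
    For 1 < alpha <= 2 the variance is infinite.
    """
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for the mean to be finite")
    rng = random.Random(seed)        # fixed seed gives reproducible draws
    true_mean = alpha / (alpha - 1)  # mean of a Pareto with minimum value 1
    total = 0.0
    means = []
    for k in range(1, n + 1):
        total += rng.paretovariate(alpha) / true_mean
        means.append(total / k)
    return means


# Illustrative values only: shape 1.5 has a finite mean and infinite variance.
means = running_means(alpha=1.5, n=100_000, seed=42)
for k in (10, 1_000, 100_000):
    print(k, round(means[k - 1], 3))
```

Changing `seed` plays the same role as re-generating the random numbers: each seed gives a different path of running means toward 1.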

Here is a link to it: https://mathemilda.shinyapps.io/large-number-theorem-visuals/

I will be grateful for your feedback!

Monday, May 2, 2022

Russian Troll Tweet data, Machine Learning with accuracy 99.6%

You can read the motivation for my work here: https://myabakhova.blogspot.com/2022/01/russian-troll-data-investigation.html I published the EDA, Feature Engineering and Machine Learning on my Kaggle account in 3 parts due to the constraints of Kaggle resources:

Part 1. EDA

Part 2. Feature Engineering

Part 3. Machine Learning, with a test accuracy of 99.6%

Conclusion

Analysis of a Twitter account's behavior helps a lot in identifying paid trolls. The properties most helpful for detection are the ones related to propaganda methods. Apparently trolls have specific guidelines for their posts and they stick to them. I see this as convenient because we can set up filters for catching the most significant phenomena, and then check the whole activity of an account.

In addition, the features most important for prediction turned out to depend little on language and mostly on troll account activity. Thus the same approach can be applied to other languages and is not limited to Russian trolls posting English texts.
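
To illustrate what language-independent account activity features can look like, here is a small Python sketch. It does not reproduce the features from my Kaggle notebooks. The field names `account`, `created_at` and `is_retweet`, the function name `account_activity_features` and the three features are hypothetical, chosen only to show that such features can be computed without looking at the tweet text.

```python
from collections import defaultdict
from datetime import datetime


def account_activity_features(tweets):
    """Aggregate activity features per account without reading tweet text.

    `tweets` is an iterable of dicts with hypothetical keys:
    'account' (str), 'created_at' (ISO 8601 str), 'is_retweet' (bool).
    """
    per_account = defaultdict(list)
    for tweet in tweets:
        per_account[tweet["account"]].append(tweet)

    features = {}
    for account, items in per_account.items():
        # Distinct calendar days on which the account posted.
        active_days = {
            datetime.fromisoformat(t["created_at"]).date() for t in items
        }
        retweets = sum(1 for t in items if t["is_retweet"])
        features[account] = {
            "n_tweets": len(items),
            "retweet_share": retweets / len(items),
            "tweets_per_active_day": len(items) / len(active_days),
        }
    return features


# Made-up sample records, for illustration only.
sample = [
    {"account": "a", "created_at": "2017-01-01T10:00:00", "is_retweet": True},
    {"account": "a", "created_at": "2017-01-01T10:05:00", "is_retweet": True},
    {"account": "a", "created_at": "2017-01-02T09:00:00", "is_retweet": False},
    {"account": "b", "created_at": "2017-01-01T12:00:00", "is_retweet": False},
]
features = account_activity_features(sample)
```

A table like `features` could then be passed to any classifier, whatever language the tweets are written in.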

Please upvote it on Kaggle if you like it!

Monday, January 17, 2022

Russian Troll data investigation

I found a Russian Troll data set on Kaggle and analyzed it:

Part 1. EDA

Part 2. Feature Engineering

Part 3. Machine Learning, with a test accuracy of 99.6%

This project is personal for me. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. "Everybody always lies" was the norm. I came to the US in the 90s and was amazed that deception is not normalized here as it is in my homeland. I was surprised that Americans trusted my words, whereas Soviet people would look at me with suspicion regardless of the situation. I am saddened that Americans’ trust has been abused by paid trolls. I worry that deception will be normalized here in the States as well. I believe that propaganda is a form of psychological abuse.

My first encounter with the trolls was in Russian forums. At first I thought that these were badly informed people, and I tried to give them links to correct information. Only a few months later, when I read an article about Russian troll farms, did I realize what was going on. I saw one guy who, for a month after Donald Trump won, was still posting articles about an omnipotent Hillary Clinton set on destroying the world. You see, a government propaganda machine is bureaucratic, and it took time for them to change their instructions. Another one bragged about how they scheduled their releases of compromising materials to do the most damage; in particular, he was proud of the “pedophile ring” lies spread right before the presidential election. This quote shows a more current example: “I would believe in [Biden] Rebuild plan when the potholes on my street will be fixed”. The expectation is a Russian one, because in Russia a president actually has the power to command local authorities.

When I saw these data become available, I wanted to help prevent people from getting brainwashed like I was. I intend to study the data, to extract English tweets and to compare them with tweets from Americans. The difficulty with the data is that Russian propaganda uses the Göbbels 60/40 method: https://zarinazabrisky.medium.com/rotten-herrings-and-crucified-children-c4c278466985 It means that they mix in posts from real Americans to make their detection harder. With the tables I plan to check whether there is a way to distinguish paid Russian trolls from Americans using Machine Learning methods.