Monday, January 17, 2022

Russian Troll data investigation

I found a Russian troll data set on Kaggle and analyzed it in three parts:

Part 1. EDA

Part 2. Feature Engineering

Part 3. Machine Learning, with a test accuracy of 99.6%

This project is personal for me. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. "Everybody always lies" was the norm. I came to the US in the 90s and was amazed that deception is not normalized here the way it is in my homeland. I was surprised that Americans trusted my words, while Soviet people would look at me with suspicion regardless of the situation. I am saddened that Americans’ trust has been abused by paid trolls. I worry that deception will be normalized here in the States as well. I believe that propaganda is a form of psychological abuse.

My first encounter with the trolls was on Russian forums. At first I thought these were badly informed people, and I tried to give them links to correct information. Only a few months later, when I read an article about Russian troll farms, did I realize what was going on. I saw one guy who, for a month after Donald Trump won, was still posting articles about an omnipotent Hillary Clinton set on destroying the world. You see, a government propaganda machine is bureaucratic, and it took time for them to change their instructions. Another one bragged about how they timed their releases of compromising material to do the most damage; in particular, he was proud of the “pedophile ring” lies spread right before the presidential election. This quote shows a more recent example: “I would believe in [Biden] Rebuild plan when the potholes on my street will be fixed”. The complaint reflects a Russian way of thinking, because in Russia a president actually has the power to command local authorities.

When I saw these data become available, I wanted to help prevent people from being brainwashed like I was. I intend to study the data, extract the English tweets, and compare them with tweets from Americans. The difficulty with the data is that Russian propaganda uses the Göbbels 60/40 method: https://zarinazabrisky.medium.com/rotten-herrings-and-crucified-children-c4c278466985 It means that they mix in posts from real Americans to make detection harder. With these data I plan to check whether there is a way to distinguish paid Russian trolls from Americans using machine learning methods.
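As a rough illustration of that plan, here is a minimal sketch of the two steps: keeping only the English tweets, then training a text classifier to separate two groups of authors. It is not the pipeline used in Parts 1 to 3. The `language` and `content` field names, the `troll` and `control` labels, and the helper names are all assumptions made for the example. The sample rows are meaningless placeholder tokens, not real tweets, and a small Naive Bayes model stands in for whatever model would actually be chosen.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def english_only(rows):
    """Keep rows whose (assumed) 'language' field equals 'English'."""
    return [row for row in rows if row.get("language") == "English"]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)  # number of documents per class
        self.word_counts = {}                # per-class token counts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder rows: the tokens are arbitrary and carry no meaning.
troll_rows = [
    {"content": "alpha beta gamma", "language": "English"},
    {"content": "альфа бета", "language": "Russian"},
    {"content": "beta gamma delta", "language": "English"},
]
control_texts = ["sigma tau", "tau upsilon sigma"]

troll_texts = [row["content"] for row in english_only(troll_rows)]
texts = troll_texts + control_texts
labels = ["troll"] * len(troll_texts) + ["control"] * len(control_texts)

model = NaiveBayes().fit(texts, labels)
```

Calling `model.predict(...)` on a new string returns the more likely label. On real data the 60/40 mixing means some tweets labeled as troll content are copied from genuine users, so accuracy should be measured on a held-out test set rather than on the training tweets.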