Monday, May 2, 2022

Russian Troll Tweet data, Machine Learning with accuracy 99.6%

You can read motivation for my work here: I published EDA, Feature Engineering and Machine Learning on my Kaggle account in 3 parts due to constrains of Kaggle resources:

Part 1. EDA

Part 2. Feature Engineering

Part 3. Machine Learning, a test accuracy 99.6% 


Analysis of a Twitter account behavior helps a lot in determining paid trolls. The most helpful for detection properties are the ones related to propaganda methods. Apparently trolls have specified guidelines for their posts and they stick to them. I see it as convenient because we can set up filters for catching the most significant phenomena, and then check a whole account activity. 

In addition the most important for prediction features turned out to be not very dependable on languages but mostly on troll account activity. Thus we can do it for other languages, and do not limit it to Russian trolls posting English texts. 

Please upvote it on Kaggle if you like it!

Monday, January 17, 2022

Russian Troll data investigation

I found Russian Troll Data set on kaggle and analyzed it:  

Part 1. EDA

Part 2. Feature Engineering

Part 3. Machine Learning, a test accuracy 99.6%

This project for me is personal. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. "Everybody always lies" had been the norm. I came to the US in the 90s and I was amazed that in the US, deception is not normalized as in my homeland. I have been surprised when Americans trusted my words while Soviet people would look at me with suspicion regardless of the situation. I am saddened that Americans’ trust has been abused by paid trolls. I worry that deception will be normalized here in the States as well. I believe that propaganda is a form of psychological abuse.

My first encounter with the trolls was in Russian forums. At first I thought that these are badly informed people and I tried to give them links to correct information. Only in a few months when I read an article about Russian troll farms did I realize what was going on. I have seen a guy still posting articles about omnipotent Hilary Clinton who is set on destroying the world for a month after Donald Trump won. You see, a government propaganda machine is bureaucratic, and it took time for them to change their instructions. Or one of them bragged how they scheduled their issues of compromising materials to the most damage, in particular he was proud about “pedophile ring” lies right before the presidential election. This quote shows a more current example: “I would believe in [Biden] Rebuild plan when the potholes on my street will be fixed”, because in Russia a president actually has the power to command local authorities.

When I saw these data become available, I wanted to help prevent people getting brainwashed like I was. I intend to study the data, to extract English tweets and to compare them with tweets from Americans. The difficulty with the data is that Russian propaganda uses 60/40 Göbbels 60/40 method: It means that they mix in posts from real Americans to confuse their detection. With the tables I plan to check if there is a way to distinguish paid Russian trolls from Americans using Machine Learning methods.

Thursday, August 12, 2021

Covid_19: Vaccination Rate VS Mortality Rate


With a current COVID-19 pandemia we have a lot of discussion in media about vaccination effect. Does it help or not? We have polarized views on different sides of political spectrum. I decided to check the data which I can access on the moment.

Sources, Data Download

The data on mortality is taken from here:

The date at which the set was retrieved is printed below. I included it in the code because the information is updated regularly. I kept only numbers for the last 2 months.

library(lubridate, warn.conflicts = FALSE)
## [1] "2021-08-12"
url <- "
covid_death_full = read.csv(url, stringsAsFactors = FALSE)
## [1]  51 569
## [1] "State"       "StateFIPS"   "X2020.01.22" "X2020.01.23" "X2020.01.24"
N <- ncol(covid_death_full)
covid = covid_death_full[, c(1, (N-61):N)]
N <- ncol(covid)
## [1] 51 63

What do we have as our covid-19 data?

head(covid[, c(1, 56:61)])
##   State X2021.08.03 X2021.08.04 X2021.08.05 X2021.08.06 X2021.08.07 X2021.08.08
## 1    AL       11542       11561       11574       11600       11600       11600
## 2    AK         382         382         390         390         390         390
## 3    AZ       18282       18289       18300       18342       18342       18342
## 4    AR        6215        6230        6247        6269        6269        6269
## 5    CA       63950       64001       64055       64091       64091       64091
## 6    CO        6956        6963        6970        6970        6970        6970

Here we have numbers of accumulated daily death counts. I will use these numbers, hoping that diagnostics approaches are consistent across states. State population counts are obtained from here as a csv file:

## 'data.frame':    52 obs. of  9 variables:
##  $ ï..rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ State          : chr  "California" "Texas" "Florida" "New York" ...
##  $ Pop            : int  39613493 29730311 21944577 19299981 12804123 12569321 11714618 10830007 10701022 9992427 ...
##  $ Growth         : num  0.0038 0.0385 0.033 -0.0118 0.0003 -0.0121 0.0033 0.0303 0.0308 0.0008 ...
##  $ Pop2018        : int  39461588 28628666 21244317 19530351 12800922 12723071 11676341 10511131 10381615 9984072 ...
##  $ Pop2010        : int  37319502 25241971 18845537 19399878 12711160 12840503 11539336 9711881 9574323 9877510 ...
##  $ growthSince2010: num  0.0615 0.1778 0.1644 -0.0051 0.0073 ...
##  $ Percent        : num  0.1184 0.0889 0.0656 0.0577 0.0383 ...
##  $ density        : num  254 114 409 410 286 ...

Note that I have 52 rows and not 51, as in a previous data set. I need to drop “Puerto Rico”.

popltns= census_data[census_data$State !="Puerto Rico", c("State", "Pop", "density")]

Now for vaccination information. I worked with text information from here:

I got it on July, 18, 2021. I considered only fully vaccinated. I added delay between vaccination statistics and mortality counts because a person does not get full immunity for the first 2 weeks, plus the disease itself takes time. I wrangled the data into a csv file, but I skipped the code for this because the process was simple, and I do not want my post to be too long.


I would prefer for my files to contain full state names and abbreviations. I will combine my data in one data frame.

StateAbbrev <- read.csv("state-abbreviations.csv")
popltns <-  merge(popltns, StateAbbrev)
df <- merge(popltns,vaccinated)

Comparing with last month data.

Let us compute mortality rates for the last month.

last_month <- data.frame(st = covid$State,  
                            last_month = covid[,N] - covid[,(N-31)])
df <-  merge(df, last_month, by.x="st_abbr", by.y= "st", all =TRUE)
df$mortality_last_month <- df$last_month*1e05/df$Pop

For a plot I modified sizes of points to be proportional to state population densities. Our outliers appear mostly not densely populated or with very high density.

ggplot(data=df, aes(x=Vaccinated_percentage, y =mortality_last_month, size = density)) +
    geom_point(color ="green") + ylim(-1, 7.6) +
    geom_smooth(method='loess', level=.95, span = 0.75, color='darkgreen') +
    geom_text(aes(label=st_abbr), hjust=-.3, vjust=0, size=3
              #, check_overlap =TRUE
              ) + 
    xlab("Vaccinated percentage") + ylab("Last month mortality per 100,000")+
    theme(legend.position = "none")

As we see when percentages of vaccination are lower, it appears to be more helpful. One of possible reasons is that senior citizens got vaccines first and foremost, and they benefited greatly.

I would like to add state colors: reds, blues and purples. I tried to define colors by the last 20 years of presidential elections. If you believe that I made mistakes, please comment.

df$color[df$st_abbr=="DC"] = "blue" <- c('Alabama', 'Alaska', 'Arkansas', 'Georgia', 'Idaho', 'Kansas', 
'Kentucky', 'Louisiana', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 
'North Dakota', 'Oklahoma', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 
'Utah', 'West Virginia', 'Wyoming')

df$col[df$State %in%] <- "red"
df$col[df$st_abbr %in% c('AR', 'AZ', 'GA', 'FL', 'NC', 'OH', 'VA')] <- "purple4"

Let us check for the last month.

ggplot(data=df, aes(x=Vaccinated_percentage, y =mortality_last_month
                    , colour=col)) + 
    ylim(-1, 7.6) + 
    geom_point( size=2) + 
    scale_color_manual(breaks = c("red", "purple4", "blue"),
                        values=c("red", "purple4", "blue"))+
    geom_smooth(method='lm', level=.95, aes(fill= col)) +
    scale_fill_manual(breaks = c("red", "purple4", "blue"),
                        values=c("red", "darkmagenta", "deepskyblue"))+
    geom_text(aes(label=st_abbr), hjust=-.3, vjust=0, size=3
              #, check_overlap =TRUE
              ) + 
    xlab("Vaccinated percentage") + ylab("Last month mortality per 100,000")+
    theme(legend.position = "none")

If you are curious where the blue line intersects 0, I estimated it around 95%. Although 3 week ago it was about 71%.