Mathematician in Data Science: June 2021

TwitterData-for-Blog02.knit

Introduction

One of my clients graciously donated me the data set for my presentation. It was mined from Twitter choosing posts containing a word “bottle”.

The goal was to investigate if “Blue Planet II” documentary influence can be detected on Twitter. The series of 8 episodes debuted on 29 October 2017 in the United Kingdom, Nordic regions, Europe and in Asia. In the United States, the series premiered on 20 January 2018. Other country dates are published here:

https://en.wikipedia.org/wiki/Blue_Planet_II

I carried out the following actions with the set:

Investigated the data and removed repeated retweets of an irrelevant post
Grouped data with respect to bottle usage
Plotted a timeline with post counts
Researched and explained significant dates for the plot

R packages and data download

Let us start with downloading packages and data

library(dplyr)
library(magrittr)
library(ggplot2)
library(lubridate)
df = read.csv("tw_sentiment.csv")
str(df)

## 'data.frame':    30584 obs. of  5 variables:
##  $ Index        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ time         : chr  "01 Jan 2020 23:31:53" "01 Jan 2020 21:01:33" "01 Jan 2020 21:00:06" "01 Jan 2020 18:11:13" ...
##  $ content      : chr  "ayomikun lob arsenal jacket floor vs everton last year boot water bottl week didn t sit well" "rt jeremyjoseph g y club homeless saturday night yeah babi amp got reason re take g y thechas vixen m edg jenli"| __truncated__ "g y club homeless saturday night yeah babi amp got reason re take g y thechas vixen m edg jenlion donat hot wat"| __truncated__ "get budgen water bottl head spend budgen water bottl seen mani better one cheaper liter everywher t co qyfdwdhjyj" ...
##  $ type         : chr  "Reply" "ReTweet" "Tweet" "Tweet" ...
##  $ SentimentQDAP: num  0 0.125 0.129 0.118 0.133 ...

Our data span years 2015-2019. It was harvested from Twitter for posts containing the word “bottle” (in reality it shortened to “bottl”).

EDA

I discovered that in one week there were a lot (almost a thousand) retweets of the following message:

“aw rememb primari school type weather d start freez water bottl can ici water throughout day”

Apparently the initial post was done by some influencer. I removed them as irrelevant, because I intended to see how tweet counts had been changing with time.

df = df[-grep("aw rememb primari school type weather ", df$content) , ]
dim(df)

## [1] 29590     5

Feature Engineering

I divided the posts into the following categories:

“Env concerned” for messages which appear to refer to environmental impact
“Water bottle” for vague mentioning of water bottles
“Hot water bottle” as something not about environment and as such been a good basic trend for Twitter discussions, in case that there was a particular increase in general post activity.

df = df %>%
  mutate(Category = 
  if_else(agrepl("\\bbottl(ed)? water", content, max.distance = 1, fixed = FALSE), 
          "Env concerned",
  if_else(agrepl(" (single?.use)|((re)?fill)|(reusabl)|(biodegrad)|(dispos\\w+)", content, 
          max.distance = 1,fixed = FALSE), "Env concerned",
  if_else(agrepl("((metal)|(aluminium)|( recycl)|(seawe pouch)|(glass( water)?))", content, 
          max.distance = 1, fixed = FALSE), "Env concerned",
  if_else(grepl("\\bhot water bottl\\b", content), "Hot water bottle",
  if_else(agrepl("( (camelba(c)?k)|(tritan)|(sigg)|(loveisland))", content, 
          max.distance = 1, fixed = FALSE), "Env concerned",
  if_else(agrepl("( (infus)|(best)|(beauti))(((\\s)|(\\S))+)? bottl", content, 
          max.distance = 1, fixed = FALSE), "Env concerned",
  if_else(agrepl("plastic(\\s\\S+)? bottl", content, max.distance = 1, fixed = FALSE),
          "Env concerned",
  if_else(agrepl("water bottl", content, max.distance = 1),"Water bottle", 
          NA_character_)))))))))
df = df[complete.cases(df), ]

I need to introduce a date type column in my data.Then I want to summarize post counts by weeks because otherwise my plot will be too busy.

df = df %>%
  mutate(Date = as.Date(substring(time, 1,  11), format = "%d %b %Y"))%>%
  mutate(week= ymd(floor_date(Date, "week")))
  
countsByWeekAndYear = df %>%
  select(week,  Category) %>%
  group_by(week , Category) %>%
  count(Category, name="tweet.n")

The documentary influence on Twitter posts

Below is a time line plot for weekly tweet counts with different post topics. I addition I marked some important dates with vertical lines:

When first episode had been aired in Europe/Asia/Canada and US.
4th episode dates, discussing plastic particles in sea water
7th episode dates, covering human activity in the oceans and its consequences.
Global 24-hour boycotts for plastic packaging which were introduced in wake of the series.

Other events without definite dates:

The documentary promotion had been started more than a month in advance.
In April 2018, in response to growing public support directly linked to Blue Planet II, the British government announced that a national ban on single-use plastic products is considered.
In the summer of 2019 the documentary was shown on BBC America again. I did not find the corresponding schedule.


ggplot(data = countsByWeekAndYear, aes(week, tweet.n, color = Category))+ 
  geom_line(size = .6) +
  scale_x_date(date_labels = "%b %Y", date_breaks="3 months")+
  labs(x = NULL, y = "Weekly number of tweets")+
  theme(axis.text.x = element_text(angle=90, hjust = -.5, vjust=.5))+
  theme(plot.title=element_text(face="bold", color = "darkblue", size=12)) + 
  theme(axis.text.y=element_text(size=8, color = "black"), 
        axis.text.x=element_text(color = "black", size=8))+
  geom_vline(aes(xintercept=as.Date("2017-09-27", format="%Y-%m-%d"), 
                linetype="1st episode"), colour = "blue")+
  geom_vline(aes(xintercept=as.Date("2017-11-19", format="%Y-%m-%d"), 
                linetype ="4th episode"), size=.4)+
  geom_vline(aes(xintercept=as.Date("2017-12-10", format="%Y-%m-%d"), 
                linetype ="7th episode"), colour = "red", size=.4)+
  geom_vline(aes(xintercept=as.Date("2018-01-20", format="%Y-%m-%d"), 
                linetype="1st episode"), colour = "blue")+
  geom_vline(aes(xintercept=as.Date("2018-03-10", format="%Y-%m-%d"), 
                linetype ="4th episode"), size=.4)+
  geom_vline(aes(xintercept=as.Date("2018-04-07", format="%Y-%m-%d"), 
                linetype ="7th episode"), colour = "red", size=.4)+
  geom_vline(aes(xintercept=as.Date("2018-06-03", format="%Y-%m-%d"),
                linetype ="Boycotts on plastic"), size=.4, colour = "purple")+
  geom_vline(aes(xintercept=as.Date("2019-06-03", format="%Y-%m-%d"),
                linetype ="Boycotts on plastic"), size=.4,colour = "purple") +
  ggtitle("A Timeline for Weekly Bottle Tweet Counts") +
  scale_linetype_manual(name = 'Events',
                        values = c("1st episode" = 3,
                                "4th episode" = 2,
                                "7th episode" = 4,
                                "Boycotts on plastic"=5),
                        guide = guide_legend(override.aes = 
                                    list(colour = c('blue', "black", 'red', 
                                                    'purple'))))+
  theme(
  legend.title = element_text(color = "blue", size = 9))

I checked out Twitter user numbers and there had been no significant changes of Twitter users during these years. It appears that they became much more aware about plastic bottles after the series.

Another thing I was asked to do is to check if sentiments changed for the tweets before and after the documentary. It turned out that the amount of posts with the the word “bottl” before the documentary was not sufficient to show after ANOVA application as distinguishable.

Mathematician in Data Science

Friday, June 4, 2021

The “Blue Planet II” effect: Twitter data wrangling & ggplot2 plot

Introduction

R packages and data download

EDA

Feature Engineering

The documentary influence on Twitter posts