Analysing Twitter data with R
Part 1: Collecting Twitter statuses related to a scientific conference
In this blog article, I use the {rtweet} package to explore Twitter statuses collected during a scientific conference. I divided this article into three parts. This is part 1.
Twitter is one of the few social media platforms used in the scientific community. Users with a scientific Twitter profile communicate about recently published research articles, the tools they use, for instance software or microscopes, the seminars and conferences they attend, or their life as a scientist. On my personal Twitter account, for instance, I share my blog articles, research papers, and slides, and I retweet or like content related to R programming and bioimage analysis. Twitter archives all tweets and offers an API to search these data. The {rtweet} package provides a convenient interface between the Twitter API and R.
I collected data during the 2020 NEUBIAS conference that was held earlier this year in Bordeaux. NEUBIAS is the Network of EUropean BioImage AnalystS, a scientific network created in 2016 and supported until this year by European COST funds. Bioimage analysts extract and visualise data coming from biological images (mostly, but not exclusively, microscopy images) using image analysis algorithms and software developed by computer vision labs, to answer biological questions for their own research or for other scientists. I consider myself a bioimage analyst, and I have been an active member of NEUBIAS since 2017. Notably, I contributed to the creation of a local network of bioimage analysts during my postdoc in Heidelberg from 2016 to 2019, co-organised two NEUBIAS training schools, and gave lectures and practicals in three of them. Moreover, I recently co-created a Twitter bot called Talk_BioImg, which retweets the hashtag #BioimageAnalysis, to encourage people from this community to connect on Twitter (see “Announcing the creation of a Twitter bot retweeting #BioimageAnalysis” and “Create a Twitter bot on a Raspberry Pi 3 using R” for more information).
In this first part, I will explain how I collected and aggregated the data, mainly how:
1) I identified the conference hashtags to narrow my search.
2) I harvested Twitter statuses, i.e. tweets, retweets and quotes (retweets with comments), for these hashtags over 12 days by manually querying the Twitter API (a sketch of such a query is shown after this list).
3) I aggregated these data. As I queried the Twitter API every 2-3 days, some statuses appeared several times. I chose to keep only the most recent occurrence of each status.
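The sketch below shows what one such manual harvest could look like with the search_tweets2() function from {rtweet}, which I used (see the conclusion). Apart from #neubiasBordeaux, which appears in the collected data, the parameters here are assumptions; the file naming follows the <date>_sr_neubiasBdx.rds pattern expected later in this article.

```r
library(rtweet)

# Minimal sketch of one harvest, assuming #neubiasBordeaux as the query
# (the full conference hashtag list is not reproduced here)
sr_neubiasBdx <- search_tweets2(
  "#neubiasBordeaux",
  n = 18000,          # hypothetical upper bound on the number of statuses
  include_rts = TRUE  # keep retweets as well as tweets and quotes
)

# Save each harvest with the date in the filename
saveRDS(
  sr_neubiasBdx,
  file.path("data_neubias", paste0(Sys.Date(), "_sr_neubiasBdx.rds"))
)
```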
Libraries
To collect the Twitter statuses and take snapshots of specific statuses, I use the packages {rtweet} and {magick}. To store and read the data in the RDS format, I use {readr}. To manipulate and tidy the data, I use {dplyr}, {forcats}, {lubridate}, {purrr}, and {tidyr}. To build file paths, glue strings together and render tables, I use {here}, {glue}, and {kableExtra}. To visualise the collected data, I use the packages {ggplot2}, {ggtext}, {grid}, and {RColorBrewer}.
library(dplyr)
library(forcats)
library(here)
library(ggplot2)
if (!requireNamespace("ggtext", quietly = TRUE)){
remotes::install_github("wilkelab/ggtext")
}
library(ggtext)
library(glue)
library(grid)
library(kableExtra)
library(lubridate)
library(magick)
library(purrr)
library(RColorBrewer)
library(readr)
library(rtweet)
library(tidyr)
Plots: Theme and palette
The code below defines a common theme and colour palette for all the plots. The function theme_set() from {ggplot2} sets this theme for all subsequent plots.
# Define a personal theme
custom_plot_theme <- function(...){
theme_classic() %+replace%
theme(panel.grid = element_blank(),
axis.line = element_line(size = .7, color = "black"),
axis.text = element_text(size = 11),
axis.title = element_text(size = 12),
legend.text = element_text(size = 11),
legend.title = element_text(size = 12),
legend.key.size = unit(0.4, "cm"),
strip.text.x = element_text(size = 12, colour = "black", angle = 0),
strip.text.y = element_text(size = 12, colour = "black", angle = 90))
}
## Set theme for all plots
theme_set(custom_plot_theme())
# Define a palette for graphs
greenpal <- colorRampPalette(brewer.pal(9,"Greens"))
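The greenpal() function returned by colorRampPalette() interpolates any number of colours within the Greens palette of {RColorBrewer}. For instance, asking for five colours gives:

```r
# Five shades of green, from lightest to darkest
greenpal(5)
#> [1] "#F7FCF5" "#C7E9C0" "#74C476" "#238B45" "#00441B"
```

The darkest shade, "#00441B", is also the green used for tweets in the plot further below.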
Gathering all harvested statuses
As I queried the Twitter API every 2-3 days, most of the tweets were collected at least twice. I created the function read_and_add_harvesting_date() to add the harvesting date as an extra variable, so that I could always keep the latest harvested version of each status.
#' Read the RDS files and add the date contained in the filename
#'
#' @param RDS_file path to RDS file
#' @param split_date_pattern part of the filename which is not the date
#' @param RDS_dir path to directory containing RDS file
#'
#' @return a tibble containing the content of RDS file
#' @export
#'
#' @examples
read_and_add_harvesting_date <- function(RDS_file, split_date_pattern, RDS_dir) {
  RDS_filename <- gsub(RDS_file, pattern = paste0(RDS_dir, "/"), replacement = "")
  RDS_date <- gsub(RDS_filename, pattern = split_date_pattern, replacement = "")
  readRDS(RDS_file) %>%
    mutate(harvest_date = as.Date(RDS_date))
}
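To illustrate with a hypothetical file name (the real files follow the same <date>_sr_neubiasBdx.rds pattern), the two gsub() calls peel off the directory and the fixed suffix, leaving only the date:

```r
# Hypothetical example of the filename parsing performed above
RDS_filename <- gsub("data_neubias/2020-03-11_sr_neubiasBdx.rds",
                     pattern = "data_neubias/", replacement = "")
gsub(RDS_filename, pattern = "_sr_neubiasBdx.rds", replacement = "")
#> [1] "2020-03-11"
```

In the real pipeline, the list of RDS files is built first: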
RDS_file_sr_neubiasBdx <- grep(list.files(here("data_neubias"), full.names = TRUE),
pattern = "neubias",
value = TRUE)
split_date_pattern_sr_neubiasBdx <- "_sr_neubiasBdx.rds"
RDS_dir_name_sr_neubiasBdx <- here("data_neubias")
all_neubiasBdx <- pmap_df(list(RDS_file_sr_neubiasBdx,
split_date_pattern_sr_neubiasBdx,
RDS_dir_name_sr_neubiasBdx), read_and_add_harvesting_date)
With the read_and_add_harvesting_date() function, all harvested tweets are now gathered in one unique dataframe. I then keep only the latest harvested version of each status, identified by its unique status_id.
all_neubiasBdx_unique <- all_neubiasBdx %>%
arrange(desc(harvest_date)) %>% # take latest harvest date
distinct(status_id, .keep_all = TRUE) # .keep_all to keep all variables
write_rds(all_neubiasBdx_unique, path = file.path("data_out_neubias", "all_neubiasBdx_unique.rds"))
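To make the deduplication logic explicit, here is a toy example with invented status IDs: arranging by descending harvest_date before distinct() ensures that, for each status_id, the most recently harvested row is the one that survives.

```r
# Toy example: status "1" was harvested twice, on March 2nd and March 4th
tibble(
  status_id    = c("1", "1", "2"),
  harvest_date = as.Date(c("2020-03-04", "2020-03-02", "2020-03-02"))
) %>%
  arrange(desc(harvest_date)) %>%
  distinct(status_id, .keep_all = TRUE)
#> # A tibble: 2 x 2
#>   status_id harvest_date
#>   <chr>     <date>
#> 1 1         2020-03-04
#> 2 2         2020-03-02
```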
I now have all the Twitter statuses that I harvested over 12 days aggregated together. I use the glimpse() function from {dplyr} to display all the variables.
all_neubiasBdx_unique %>%
glimpse()
```
## Observations: 2,629
## Variables: 92
## $ user_id        <chr> "2785982550", "2785982550", "9408326384908943…
## $ status_id      <chr> "1237770746740555777", "1236050321799024641",…
## $ created_at     <dttm> 2020-03-11 16:01:57, 2020-03-06 22:05:36, 20…
## $ screen_name    <chr> "fabdechaumont", "fabdechaumont", "BlkHwk0ps"…
## $ text           <chr> "#neubiasBordeaux having a lasting impact 😂 @…
## $ source         <chr> "Twitter for Android", "Twitter for Android",…
## $ is_quote       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ is_retweet     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ favorite_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 10,…
## $ retweet_count  <int> 1, 2, 4, 4, 3, 7, 8, 2, 4, 4, 7, 3, 7, 3, 2, …
## $ hashtags       <list> ["neubiasBordeaux", "neubiasBordeaux", NA, "…
## $ query          <chr> "#neubiasBordeaux", "#neubiasBordeaux", "#neu…
## $ harvest_date   <date> 2020-03-11, 2020-03-11, 2020-03-11, 2020-03-…
## … and 79 more variables: reply_to_*, quoted_*, retweet_*, media_*,
##   place_*, geo_coords, and user profile fields (name, location,
##   followers_count, friends_count, account_created_at, …)
```
The final dataframe contains 92 variables and 2629 Twitter statuses.
total_tweet_number <- all_neubiasBdx_unique %>%
filter(!is_retweet) %>%
pull(status_id) %>%
unique() %>%
length()
total_retweet_number <- all_neubiasBdx_unique %>%
filter(is_retweet) %>%
pull(status_id) %>%
unique() %>%
length()
More precisely, among the 2629 Twitter statuses, only 661 are original tweets or quotes and 1968 are retweets.
Number of tweets and retweets during the conference
Several events happened during the conference:
- two training schools on bioimage analysis and a “taggathon” to tag resources in bioimage analysis and update the online database, from Saturday the 29th of February to Tuesday the 3rd of March (morning), which I will refer to as “training schools”
- a satellite meeting on Tuesday the 3rd of March (afternoon) and a symposium from Wednesday the 4th of March to Friday the 6th of March, which I will refer to as “symposium”
I was curious about the evolution of the number of tweets and quotes versus retweets over the conference.
nb_days <- floor(as.numeric(max(all_neubiasBdx_unique$created_at) - min(all_neubiasBdx_unique$created_at)))
df_per_slot <- all_neubiasBdx_unique %>%
mutate(
datetime = as_datetime(created_at),
slot = round_time(datetime, n = "6 hours")
) %>%
count(is_retweet, slot)
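The round_time() function from {rtweet} rounds each timestamp to the nearest multiple of the given interval, which is what creates the 6-hour slots. A quick hypothetical example:

```r
# A timestamp at 14:37 falls in the slot centred on 12:00,
# the nearest 6-hour mark (00:00, 06:00, 12:00, 18:00)
round_time(ymd_hm("2020-03-04 14:37"), n = "6 hours")
#> [1] "2020-03-04 12:00:00 UTC"
```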
df_annotate_text <- tibble(
x = c(ymd_hm(c("2020-03-03 12:00", "2020-03-05 18:00")),
mean(ymd_hm(c("2020-02-29 06:00", "2020-03-03 12:00"))),
mean(ymd_hm(c("2020-03-06 12:00", "2020-03-03 12:00")))
),
y = c(190, 180, 210, 210),
label = c("Satellite meeting", "Gala dinner", "TRAINING SCHOOLS", "SYMPOSIUM")
)
df_annotate_curve <- tibble(
x = ymd_hm(c("2020-03-03 12:00", "2020-03-05 18:00")),
y = c(190, 180)-5,
xend = x,
yend = y-20
)
ylim_max <- 225
ggplot(df_per_slot) +
aes(x = slot, y = n, color = is_retweet) +
geom_rect(aes(
xmin = ymd_hm("2020-02-29 06:00"), xmax = ymd_hm("2020-03-03 12:00"),
ymin = 0, ymax = ylim_max
),
fill = "grey80", colour = NA
) +
geom_rect(aes(
xmin = ymd_hm("2020-03-03 12:00"), xmax = ymd_hm("2020-03-06 12:00"),
ymin = 0, ymax = ylim_max
),
fill = "grey90", colour = NA
) +
geom_line(size = 1.2) +
geom_point() +
geom_text(
data = df_annotate_text, aes(x = x, y = y, label = label),
hjust = "center", size = 4, color = "grey20"
) +
geom_curve(
data = df_annotate_curve,
aes(x = x, y = y, xend = xend, yend = yend),
size = 0.6, curvature = 0,
arrow = arrow(length = unit(2, "mm")), color = "grey20"
) +
scale_x_datetime(
date_breaks = "1 day", date_labels = "%b-%d",
limits = c(as.POSIXct(NA), as.POSIXct(ymd_hm("2020-03-12 00:00"))),
guide = guide_axis(n.dodge = 2)
) +
scale_color_manual(
labels = c(`FALSE` = "Tweet", `TRUE` = "Retweet"),
values = c("#00441B", "#5DB86A")
) +
scale_y_continuous(expand = c(0, 0), limits = c(0, ylim_max)) +
labs(
x = NULL, y = NULL,
title = glue("Frequency of Twitter statuses containing NEUBIAS conference hashtags"),
subtitle = glue(
"Count of <span style = 'color:#00441B;'>tweets </span>",
"and <span style = 'color:#5DB86A;'>retweets</span> per 6 hours over {nb_days} days"
),
caption = "<i>\nSource: Data collected from Twitter's REST API via rtweet</i>",
colour = "Type"
) +
theme(
plot.subtitle = element_markdown(),
plot.caption = element_markdown(),
legend.position = "none"
)
Identifying the most retweeted tweet
As there were plenty of retweets, I was also curious to see which tweets were the most retweeted, and I wanted to display the most retweeted one.
most_retweeted <- all_neubiasBdx_unique %>%
filter(is_retweet == FALSE) %>%
arrange(desc(retweet_count))
most_retweeted %>%
select(status_id, created_at, screen_name, retweet_count, favorite_count) %>%
head(10) %>%
knitr::kable()
| status_id | created_at | screen_name | retweet_count | favorite_count |
|---|---|---|---|---|
| 1234405016603107328 | 2020-03-02 09:07:44 | pseudoobscura | 34 | 74 |
| 1234401337741316096 | 2020-03-02 08:53:07 | MarionLouveaux | 21 | 36 |
| 1236023660852514821 | 2020-03-06 20:19:39 | fab_cordelieres | 19 | 60 |
| 1235252471104229382 | 2020-03-04 17:15:13 | MarionLouveaux | 18 | 38 |
| 1234806841009475584 | 2020-03-03 11:44:27 | martinjones78 | 17 | 21 |
| 1233069442189463553 | 2020-02-27 16:40:38 | jan_eglinger | 16 | 12 |
| 1235502652693323776 | 2020-03-05 09:49:21 | matuskalas | 14 | 24 |
| 1234403570969128961 | 2020-03-02 09:02:00 | pseudoobscura | 13 | 24 |
| 1235865025157328896 | 2020-03-06 09:49:17 | martinjones78 | 13 | 30 |
| 1235167276153831425 | 2020-03-04 11:36:41 | Zahady | 13 | 45 |
To get a snapshot of the most retweeted tweet, I use the tweet_shot() function from {rtweet} and store the image as a .png file with the image_write() function from the {magick} package.
# Take a screenshot of the most retweeted status
m <- tweet_shot(statusid_or_url = most_retweeted$status_id[1])
# Save the screenshot as a PNG file
image_write(m, "tweet3.png")
Conclusion
In this first part, I explained how I collected and aggregated Twitter statuses harvested in the context of a scientific conference. First, I identified the hashtags suggested by the organisers of the conference and decided to limit my search to these hashtags. Second, I manually queried the Twitter API using the search_tweets2() function from the {rtweet} package. Third, I gathered these data in a unique dataframe and visualised the evolution of the number of tweets and retweets during the conference.
In the second and third parts of this series of blog articles, I will explore, respectively, the characteristics of the Twitter users who tweeted using these hashtags, and the content of the tweets.
Acknowledgements
I would like to thank Dr. Sébastien Rochette for his help on {ggplot2} and {magick}.
Resources
I highly recommend reading the {rtweet} vignette.
Citation:
For attribution, please cite this work as:
Louveaux M. (2020, Mar. 24). "Analysing Twitter data with R". Retrieved from https://marionlouveaux.fr/blog/twitter-analysis-part1/.
@misc{Louve2020Analy,
author = {Louveaux M},
title = {Analysing Twitter data with R},
url = {https://marionlouveaux.fr/blog/twitter-analysis-part1/},
year = {2020}
}