Analysing Twitter data with R

Part 1: Collecting Twitter statuses related to a scientific conference

March 24, 2020 Marion Louveaux

20 minutes read

In this blog article, I use the {rtweet} package to explore Twitter statuses collected during a scientific conference. I divided this article into three parts. This is part 1.

Twitter is one of the few social media used in the scientific community. Users having a scientific Twitter profile communicate about recent publication of research articles, the tools they use, for instance softwares or microscopes, the seminars and conferences they attend or their life as a scientist. For instance, on my personnal Twitter account, I share my blog articles, research papers, and slides, and I retweet or like content related to R programming and bioimage analysis topics. Twitter archives all tweets and offers an API to make searches on these data. The {rtweet} package provides a convenient interface between the Twitter API and R.

I collected data during the 2020 NEUBIAS conference that was held earlier this year in Bordeaux. NEUBIAS is the Network of EUropean BioImage AnalystS, a scientific network created in 2016 and supported till this year by European COST funds. Bioimage analysts extract and visualise data coming from biological images (mostly microscopy images but not exclusively) using image analysis algorithms and softwares developped by computer vision labs to answer biological questions for their own biology research or for other scientists. I consider myself as a bioimage analyst, and I am an active member of NEUBIAS since 2017. I notably contributed to the creation of a local network of Bioimage Analysts during my postdoc in Heidelberg from 2016 to 2019 and to the co-organisation of two NEUBIAS training schools, and I gave lectures and practicals in three NEUBIAS training schools. Moreover, I recently co-created a Twitter bot called Talk_BioImg, which retweets the hashtag #BioimageAnalysis, to encourage people from this community to connect on Twitter (see “Announcing the creation of a Twitter bot retweeting #BioimageAnalysis” and “Create a Twitter bot on a raspberry Pi 3 using R” for more information).

In this first part, I will explain how I collected and aggregated the data, mainly how:
1) I identified the conference hashtags to narrow my search.
2) I harvested Twitter statuses, i.e. tweets, retweets and quotes (retweets with comments), for these hashtags over 12 days by manually querying the Twitter API.
3) I aggregated these data. As I queried the Twitter API every 2-3 days, some statuses appeared several times. I chose to keep only the most recent occurence of each status.

Librairies

To collect the Twitter statuses and get snapshots from specific statuses, I use the package {rtweet}. To store and read the data in the RDS format I use {readr}. To manipulate and tidy the data, I use {dplyr}, {forcats}, {purrr}, and {tidyr}. To visualise the collected data, I use the packages {ggplot2}, {ggtext}, {grid}, {lubridate} and {RColorBrewer}.

library(dplyr)
library(forcats)
library(here)
library(ggplot2)
if (!requireNamespace("ggtext", quietly = TRUE)){
  remotes::install_github("wilkelab/ggtext")
}
library(ggtext)
library(glue)
library(grid)
library(kableExtra)
library(lubridate)
library(purrr)
library(RColorBrewer)
library(readr)
library(rtweet)
library(tidyr)

Plots: Theme and palette

The code below defines a common theme and color palette for all the plots. The function theme_set() from {ggplot2} sets the theme for all the plots.

# Define a personnal theme
custom_plot_theme <- function(...){
  theme_classic() %+replace%
    theme(panel.grid = element_blank(),
          axis.line = element_line(size = .7, color = "black"),
          axis.text = element_text(size = 11),
          axis.title = element_text(size = 12),
          legend.text = element_text(size = 11),
          legend.title = element_text(size = 12),
          legend.key.size = unit(0.4, "cm"),
          strip.text.x = element_text(size = 12, colour = "black", angle = 0),
          strip.text.y = element_text(size = 12, colour = "black", angle = 90))
}

## Set theme for all plots 
theme_set(custom_plot_theme())

# Define a palette for graphs
greenpal <- colorRampPalette(brewer.pal(9,"Greens"))

Identifying the hashtags for the conference

One of the organisers spontaneously used the hashtag #neubias2020 when announcing one of the event of the conference, the “call4help”, a session during the symposium devoted to the discussion of concrete cases that are difficult to analyse and come from ongoing research projects.

m <- tweet_shot(statusid_or_url = "https://twitter.com/jan_eglinger/status/1233069442189463553?s=19")
magick::image_write(m, "tweet1.png")

One day later, another organiser announced a handful of official hashtags to be used during the conference: #neubiasBordeaux, #neubias_Bdx, #neubiasBdx, and #neubias2020_Bdx.

m <- tweet_shot(statusid_or_url = "https://twitter.com/fab_cordelieres/status/1233435244851843077")
magick::image_write(m, "tweet2.png")

I decided to harvest statuses containing at least one of the following hashtags: #neubiasBordeaux, #neubias_Bdx, #neubiasBdx, #neubias2020_Bdx, or #neubias2020. From now on, I will call these five hashtags the “conference hashtags”.

Searching for Twitter statuses containing the conference hashtags

I started harvesting Twitter statuses, i.e. tweets, retweets and quotes (retweets with comments), during the conference and I stopped 5 days after the last day of the conference. The Twitter API gives access to tweets only in the past 6 to 9 days before the search. To make sure I collect as much tweets as possible, I manually searched every 2-3 days for tweets containing the conference hashtags with the function search_tweets2() from {rtweet}. I stored the search results as RDS files in a subfolder called data_Neubias. Original data can be downloaded here.

sr_neubiasBdx <- search_tweets2(c("#neubiasBordeaux", "#neubias_Bdx", "#neubiasBdx", "#neubias2020_Bdx", "neubias2020"), n = 18000, retryonratelimit = TRUE)

write_rds(sr_neubiasBdx, path = file.path("data_neubias", paste0(Sys.Date(), "_sr_neubiasBdx.rds")))

Note that to use {rtweet} you need to have a Twitter account. The first time you execute the search_tweets2() function, {rtweet} redirects you to login into your Twitter account and get a personnal token. This token should be stored in your .Renviron file. You can open and edit the .Renviron file with the command usethis::edit_r_environ().

I retrieved one dataframe per search. Each dataframe contains 91 variables, giving an important amount of information on the status, notably on its content (full text, hyperlinks, hashtags, media such as images and videos, …), on its author, and on the number of time it was retweeted, quoted or liked. For instance, let’s have a look at the first rows of the dataframe collected on the 6th of March.

sr_neubiasBdx <- read_rds(path = "/data_neubias/2020-03-06_sr_neubiasBdx.rds")

sr_neubiasBdx %>% 
  glimpse()

As there is 91 variables, I use the function glimpse() from {dplyr} to display them in a condensed manner. Click opposite on codeto see the result.


```
## Observations: 2,694
## Variables: 91
## $ user_id                  "42737563", "42737563", "42737563", "42737563…
## $ status_id                "1236031392246648832", "1236031350085431297",…
## $ created_at               2020-03-06 20:50:22, 2020-03-06 20:50:12, 20…
## $ screen_name              "fab_cordelieres", "fab_cordelieres", "fab_co…
## $ text                     "Fabrice de Chaumont @fabdechaumont starts th…
## $ source                   "TweetDeck", "TweetDeck", "TweetDeck", "Tweet…
## $ display_text_width       140, 140, 140, 130, 140, 140, 140, 140, 140, …
## $ reply_to_status_id       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_to_user_id         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_to_screen_name     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ is_quote                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ is_retweet               TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ favorite_count           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ retweet_count            1, 1, 7, 4, 4, 3, 3, 5, 7, 1, 1, 1, 1, 1, 1, …
## $ quote_count              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_count              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hashtags                 ["neubiasBordeaux", "neubiasBordeaux", NA, <…
## $ symbols                  [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ urls_url                 [NA, NA, NA, NA, "livemousetracker.org", NA,…
## $ urls_t.co                [NA, NA, NA, NA, "https://t.co/PUPMq9t49g", …
## $ urls_expanded_url        [NA, NA, NA, NA, "https://livemousetracker.o…
## $ media_url                [NA, NA, NA, "http://pbs.twimg.com/tweet_vid…
## $ media_t.co               [NA, NA, NA, "https://t.co/pn4IYRraVO", NA, …
## $ media_expanded_url       [NA, NA, NA, "https://twitter.com/martinjone…
## $ media_type               [NA, NA, NA, "photo", NA, NA, NA, NA, NA, NA…
## $ ext_media_url            [NA, NA, NA, "http://pbs.twimg.com/tweet_vid…
## $ ext_media_t.co           [NA, NA, NA, "https://t.co/pn4IYRraVO", NA, …
## $ ext_media_expanded_url   [NA, NA, NA, "https://twitter.com/martinjone…
## $ ext_media_type           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentions_user_id         [<"1189027725211820032", "2785982550">, "282…
## $ mentions_screen_name     [<"anna_medyukhina", "fabdechaumont">, "MaKa…
## $ lang                     "en", "en", "en", "en", "en", "en", "en", "en…
## $ quoted_status_id         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_text              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_created_at        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ quoted_source            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_favorite_count    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_retweet_count     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_user_id           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_screen_name       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_name              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_followers_count   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_friends_count     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_statuses_count    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_location          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_description       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_verified          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ retweet_status_id        "1235879491504869376", "1235880480416894976",…
## $ retweet_text             "Fabrice de Chaumont @fabdechaumont starts th…
## $ retweet_created_at       2020-03-06 10:46:46, 2020-03-06 10:50:42, 20…
## $ retweet_source           "Twitter for Android", "TweetDeck", "Twitter …
## $ retweet_favorite_count   6, 3, 15, 9, 7, 4, 8, 11, 13, 5, 9, 5, 7, 6, …
## $ retweet_retweet_count    1, 1, 7, 4, 4, 3, 3, 5, 7, 1, 1, 1, 1, 1, 1, …
## $ retweet_user_id          "1189027725211820032", "2829155332", "5678688…
## $ retweet_screen_name      "anna_medyukhina", "MaKaefer", "martinjones78…
## $ retweet_name             "Anna Medyukhina", "Marie - not a queue jumpe…
## $ retweet_followers_count  60, 284, 1487, 1487, 1487, 1487, 751, 81, 394…
## $ retweet_friends_count    138, 408, 3788, 3788, 3788, 3788, 2185, 93, 5…
## $ retweet_statuses_count   97, 2198, 6994, 6994, 6994, 6994, 2451, 45, 4…
## $ retweet_location         "Memphis, TN", "", "Bishopstoke, Hampshire, U…
## $ retweet_description      "scientist, bioimage analyst, runner", "Image…
## $ retweet_verified         FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ place_url                NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_name               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_full_name          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_type               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ country                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ country_code             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ geo_coords               [, , , ,  [, , , ,  [,  "https://twitter.com/fab_cordelieres/status/1…
## $ name                     "Fabrice Cordelieres", "Fabrice Cordelieres",…
## $ location                 "Bordeaux, France", "Bordeaux, France", "Bord…
## $ description              "Trained as a cell biologist, working as a Bi…
## $ url                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ protected                FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ followers_count          673, 673, 673, 673, 673, 673, 673, 673, 673, …
## $ friends_count            725, 725, 725, 725, 725, 725, 725, 725, 725, …
## $ listed_count             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ statuses_count           807, 807, 807, 807, 807, 807, 807, 807, 807, …
## $ favourites_count         1106, 1106, 1106, 1106, 1106, 1106, 1106, 110…
## $ account_created_at       2009-05-26 22:13:16, 2009-05-26 22:13:16, 20…
## $ verified                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ profile_url              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ profile_expanded_url     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ account_lang             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ profile_banner_url       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ profile_background_url   "http://abs.twimg.com/images/themes/theme1/bg…
## $ profile_image_url        "http://pbs.twimg.com/profile_images/37880000…
## $ query                    "#neubiasBordeaux", "#neubiasBordeaux", "#neu…
```

On the 6th of March, I collected 2694 Twitter statuses.

Gathering all harvested statuses

As I queried the Twitter API every 2-3 days, most of the tweets were collected at least twice. I created the function read_and_add_harvesting_date() to add the harvesting date as an extra variable and keep always the latest harvested version.

#' Read the RDS files and add the date contained in the filename
#'
#' @param RDS_file path to RDS file
#' @param split_date_pattern part of the filename which is not the date
#' @param RDS_dir  path to directory containing RDS file
#'
#' @return a tibble containing the content of RDS file
#' @export
#'
#' @examples
read_and_add_harvesting_date <- function(RDS_file, split_date_pattern, RDS_dir) {
  RDS_filename <- gsub(RDS_file, pattern = paste0(RDS_dir, "/"), replacement ="")
  RDS_date <- gsub(RDS_filename, pattern = split_date_pattern, replacement = "")
  readRDS(RDS_file) %>%
    mutate(harvest_date = as.Date(RDS_date))
}

RDS_file_sr_neubiasBdx <- grep(list.files(here("data_neubias"), full.names = TRUE),
                                 pattern = "neubias",
                                 value = TRUE)
split_date_pattern_sr_neubiasBdx <- "_sr_neubiasBdx.rds"
RDS_dir_name_sr_neubiasBdx <- here("data_neubias")


all_neubiasBdx <- pmap_df(list(RDS_file_sr_neubiasBdx,
                               split_date_pattern_sr_neubiasBdx,
                               RDS_dir_name_sr_neubiasBdx), read_and_add_harvesting_date)

With the read_and_add_harvesting_date() function, I then gathered all harvested tweets in one unique dataframe.

all_neubiasBdx_unique <- all_neubiasBdx %>% 
  arrange(desc(harvest_date)) %>% # take latest harvest date
  distinct(status_id, .keep_all = TRUE) # .keep_all to keep all variables

write_rds(all_neubiasBdx_unique, path = file.path("data_out_neubias", "all_neubiasBdx_unique.rds"))

I now have all the Twitter statuses that I harvested over 12 days aggregated together. I use again the function glimpse() to display all the variables.

all_neubiasBdx_unique %>% 
  glimpse()

Click opposite on codeto see the result.


```
## Observations: 2,629
## Variables: 92
## $ user_id                  "2785982550", "2785982550", "9408326384908943…
## $ status_id                "1237770746740555777", "1236050321799024641",…
## $ created_at               2020-03-11 16:01:57, 2020-03-06 22:05:36, 20…
## $ screen_name              "fabdechaumont", "fabdechaumont", "BlkHwk0ps"…
## $ text                     "#neubiasBordeaux having a lasting impact 😂 @…
## $ source                   "Twitter for Android", "Twitter for Android",…
## $ display_text_width       76, 140, 140, 140, 140, 140, 139, 140, 139, 1…
## $ reply_to_status_id       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_to_user_id         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_to_screen_name     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ is_quote                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ is_retweet               TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ favorite_count           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 10,…
## $ retweet_count            1, 2, 4, 4, 3, 7, 8, 2, 4, 4, 7, 3, 7, 3, 2, …
## $ quote_count              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reply_count              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hashtags                 ["neubiasBordeaux", "neubiasBordeaux", NA, "…
## $ symbols                  [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ urls_url                 [NA, NA, NA, NA, "NIS.ai", <"NIS.ai", "micro…
## $ urls_t.co                [NA, NA, NA, NA, "https://t.co/lmbELKBgwW", …
## $ urls_expanded_url        [NA, NA, NA, NA, "http://NIS.ai", <"http://N…
## $ media_url                [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ media_t.co               [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ media_expanded_url       [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ media_type               [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ext_media_url            [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ext_media_t.co           [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ext_media_expanded_url   [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ext_media_type           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentions_user_id         [<"1943817554", "2785982550">, "2829155332",…
## $ mentions_screen_name     [<"pseudoobscura", "fabdechaumont">, "MaKaef…
## $ lang                     "en", "en", "en", "en", "en", "en", "en", "en…
## $ quoted_status_id         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_text              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_created_at        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ quoted_source            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_favorite_count    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_retweet_count     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_user_id           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_screen_name       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_name              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_followers_count   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_friends_count     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_statuses_count    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_location          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_description       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_verified          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ retweet_status_id        "1237710280009945089", "1235880480416894976",…
## $ retweet_text             "#neubiasBordeaux having a lasting impact 😂 @…
## $ retweet_created_at       2020-03-11 12:01:40, 2020-03-06 10:50:42, 20…
## $ retweet_source           "Twitter Web App", "TweetDeck", "Twitter for …
## $ retweet_favorite_count   6, 6, 8, 6, 8, 16, 19, 2, 7, 9, 24, 3, 16, NA…
## $ retweet_retweet_count    1, 2, 4, 4, 3, 7, 8, 2, 4, 4, 7, 3, 7, NA, NA…
## $ retweet_user_id          "1943817554", "2829155332", "2829155332", "56…
## $ retweet_screen_name      "pseudoobscura", "MaKaefer", "MaKaefer", "mar…
## $ retweet_name             "Christopher Schmied", "Marie - not a queue j…
## $ retweet_followers_count  416, 293, 293, 1499, 293, 1499, 68, 293, 293,…
## $ retweet_friends_count    576, 423, 423, 3805, 423, 3805, 144, 423, 423…
## $ retweet_statuses_count   468, 2205, 2205, 7053, 2205, 7053, 98, 2205, …
## $ retweet_location         "Berlin, Germany", "", "", "Bishopstoke, Hamp…
## $ retweet_description      "Data Scientist | Bioimage Analyst @LeibnizFM…
## $ retweet_verified         FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ place_url                NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_name               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_full_name          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ place_type               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ country                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ country_code             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ geo_coords               [, , , ,  [, , , ,  [,  "https://twitter.com/fabdechaumont/status/123…
## $ name                     "Fabrice de Chaumont", "Fabrice de Chaumont",…
## $ location                 "", "", "Just were I have to be.", "Just were…
## $ description              "Playing with mice", "Playing with mice", "Lo…
## $ url                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ protected                FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ followers_count          77, 77, 298, 298, 298, 298, 298, 298, 298, 29…
## $ friends_count            45, 45, 204, 204, 204, 204, 204, 204, 204, 20…
## $ listed_count             0, 0, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 0, 5, 5, …
## $ statuses_count           63, 63, 15726, 15726, 15726, 15726, 15726, 15…
## $ favourites_count         71, 71, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
## $ account_created_at       2014-09-02 13:44:31, 2014-09-02 13:44:31, 20…
## $ verified                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ profile_url              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ profile_expanded_url     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ account_lang             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ profile_banner_url       "https://pbs.twimg.com/profile_banners/278598…
## $ profile_background_url   "http://abs.twimg.com/images/themes/theme1/bg…
## $ profile_image_url        "http://pbs.twimg.com/profile_images/12363584…
## $ query                    "#neubiasBordeaux", "#neubiasBordeaux", "#neu…
## $ harvest_date             2020-03-11, 2020-03-11, 2020-03-11, 2020-03-…
```

The final dataframe contains 92 variables and 2629 Twitter statuses.

total_tweet_number <- all_neubiasBdx_unique %>% 
  filter(!is_retweet) %>% 
  pull(status_id) %>% 
  unique() %>% 
  length()

total_retweet_number <- all_neubiasBdx_unique %>% 
  filter(is_retweet) %>% 
  pull(status_id) %>% 
  unique() %>% 
  length()

More precisely, among the 2629 Twitter statuses, only 661 are original tweets or quotes and 1968 are retweets.

Number of tweets and retweets during the conference

Several events happened during the conference:
- two training schools on bioimage analysis and a “taggathon” to tag ressources in bioimage analysis and update the online database from Saturday the 29th of February to Tuesday the 3rd of March (morning) which I will refer to as “training schools”
- a satellite meeting on Tuesday the 3rd of March (afternoon) and a symposium from Wednesday the 4th of March to Friday the 6th of March which I will refer to as “symposium”

I was curious about the evolution of the number of tweets and quotes versus retweets over the conference.

nb_days <- floor(as.numeric(max(all_neubiasBdx_unique$created_at) - min(all_neubiasBdx_unique$created_at)))

df_per_slot <- all_neubiasBdx_unique %>% 
  mutate(
    datetime = as_datetime(created_at),
    slot = round_time(datetime, n = "6 hours")
  ) %>% 
  count(is_retweet, slot) 

df_annotate_text <- tibble(
  x = c(ymd_hm(c("2020-03-03 12:00", "2020-03-05 18:00")),
        mean(ymd_hm(c("2020-02-29 06:00", "2020-03-03 12:00"))), 
            mean(ymd_hm(c("2020-03-06 12:00", "2020-03-03 12:00")))
  ),
  y = c(190, 180, 210, 210),
  label = c("Satellite meeting", "Gala dinner", "TRAINING SCHOOLS", "SYMPOSIUM")
)


df_annotate_curve <- tibble(
 x = ymd_hm(c("2020-03-03 12:00", "2020-03-05 18:00")),
  y = c(190, 180)-5,
 xend = x,
 yend = y-20
)

ylim_max <-  225

ggplot(df_per_slot) +
  aes(x = slot, y = n, color = is_retweet) +
  geom_rect(aes(
    xmin = ymd_hm("2020-02-29 06:00"), xmax = ymd_hm("2020-03-03 12:00"),
    ymin = 0, ymax = ylim_max
  ),
  fill = "grey80", colour = NA
  ) +
  geom_rect(aes(
    xmin = ymd_hm("2020-03-03 12:00"), xmax = ymd_hm("2020-03-06 12:00"),
    ymin = 0, ymax = ylim_max
  ),
  fill = "grey90", colour = NA
  ) +
  geom_line(size = 1.2) +
  geom_point() +
  geom_text(
    data = df_annotate_text, aes(x = x, y = y, label = label),
    hjust = "center", size = 4, color = "grey20"
  ) +
  geom_curve(
    data = df_annotate_curve,
    aes(x = x, y = y, xend = xend, yend = yend),
    size = 0.6, curvature = 0,
    arrow = arrow(length = unit(2, "mm")), color = "grey20"
  ) +
  scale_x_datetime(
    date_breaks = "1 day", date_labels = "%b-%d",
    limits = c(as.POSIXct(NA), as.POSIXct(ymd_hm("2020-03-12 00:00"))),
    guide = guide_axis(n.dodge = 2)
  ) +
  scale_color_manual(
    labels = c(`FALSE` = "Tweet", `TRUE` = "Retweet"),
    values = c("#00441B", "#5DB86A")
  ) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, ylim_max)) +
  labs(
    x = NULL, y = NULL,
    title = glue("Frequency of Twitter statuses containing NEUBIAS conference hashtags"),
    subtitle = glue(
      "Count of <span style = 'color:#00441B;'>tweets </span>",
      "and <span style = 'color:#5DB86A;'>retweets</span> per 6 hours over {nb_days} days"
    ),
    caption = "<i>\nSource: Data collected from Twitter's REST API via rtweet</i>",
    colour = "Type"
  ) +
  theme(
    plot.subtitle = element_markdown(),
    plot.caption = element_markdown(),
    legend.position = "none"
  )

Identifying the most retweeted tweet

As there was plenty of retweets, I was also curious to see which tweets were the most retweeted and I wanted to display the most retweeted tweet.

most_retweeted <- all_neubiasBdx_unique %>% 
  filter(is_retweet == FALSE) %>% 
  arrange(desc(retweet_count)) 


most_retweeted %>% 
  select(status_id, created_at, screen_name, retweet_count, favorite_count) %>% 
  head(10) %>% 
knitr::kable()

status_id	created_at	screen_name	retweet_count	favorite_count
1234405016603107328	2020-03-02 09:07:44	pseudoobscura	34	74
1234401337741316096	2020-03-02 08:53:07	MarionLouveaux	21	36
1236023660852514821	2020-03-06 20:19:39	fab_cordelieres	19	60
1235252471104229382	2020-03-04 17:15:13	MarionLouveaux	18	38
1234806841009475584	2020-03-03 11:44:27	martinjones78	17	21
1233069442189463553	2020-02-27 16:40:38	jan_eglinger	16	12
1235502652693323776	2020-03-05 09:49:21	matuskalas	14	24
1234403570969128961	2020-03-02 09:02:00	pseudoobscura	13	24
1235865025157328896	2020-03-06 09:49:17	martinjones78	13	30
1235167276153831425	2020-03-04 11:36:41	Zahady	13	45

To get a snapshot of the most retweeted tweet I use the tweet_shot() function from {rtweet} and I store the image as .png with the image_write() function from the package {magick}.

m <- tweet_shot(statusid_or_url = most_retweeted$status_id[1])
image_write(m, "tweet3.png")

Conclusion

In this first part, I explained how I collected and aggregated Twitter statuses harvested in the context of a scientific conference. First, I identified the hashtags suggested by the organisers of the conference and decided to limit my search to these hashtags. Second, I manually queried the Twitter API using the search_tweets2()function from the {rtweet} package. Third, I gathered this data in a unique dataframe and visualized the evolution of the number of tweets and retweets during the conference.

In the second and third part of this serie of blog articles, I will explore respectively the characteristics of the Twitter users who tweetted using these hashtags and the content of the tweets.

Acknowledgements

I would like to thank Dr. Sébastien Rochette for his help on {ggplot2} and {magick}.

Resources

I highly recommend reading the {rtweet} vignette.

Citation:

For attribution, please cite this work as:
Louveaux M. (2020, Mar. 24). "Analysing Twitter data with R". Retrieved from https://marionlouveaux.fr/blog/twitter-analysis-part1/.

BibTex citation:

@misc{Louve2020Analy,
    author = {Louveaux M},
    title = {Analysing Twitter data with R},
    url = {https://marionlouveaux.fr/blog/twitter-analysis-part1/},
    year = {2020}
  }

Blog

Home

About

Blog

Publications

Contact

Recent posts

Visualising GPX hiking data and photos with leaflet

Analysing Twitter data: Exploring tweets content

Analysing Twitter data: Exploring user profiles and relationships