R Script: UGC journal matching

My first bit of R for the new year. It was written for an article in The Wire, which is also available on this blog here. The script extracts a list of journals from a messy PDF, cleans the text, compares it with another list, and outputs the matches.

library(pdftools)
library(stringr)
library(readr)  ##needed for read_csv() below

##Load PDFs
text1 <- pdf_text("[link to 8919877_Journals-1.pdf]")
text2 <- pdf_text("[link to 9047119_Journals-3.pdf]")
text3 <- pdf_text("[link to 7690152_Journals-4.pdf]")
text4 <- pdf_text("[link to 6988680_Journals-2.pdf]")
text5 <- pdf_text("[link to 3554232_Journals-5.pdf]")

##pdf_text() loads each page as a separate element
##Combining all pages of each PDF into one chunk
pages1 <- paste(text1, collapse="")
pages2 <- paste(text2, collapse="")
pages3 <- paste(text3, collapse="")
pages4 <- paste(text4, collapse="")
pages5 <- paste(text5, collapse="")
all_pages <- c(pages1,pages2,pages3,pages4,pages5)

##Splitting into rows
all_lines <- unlist(strsplit(all_pages, split="\r\n",fixed = TRUE))
##We get 60442 lines instead of 40000

##Extracting the text between the serial numbers and the word 'Scopus'
sub_all_lines <- data.frame(V1 = sub(".*?[0-9](.*?)Scopus.*", "\\1", all_lines),
                            stringsAsFactors = FALSE)
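To see what that regex does, here is a made-up sample line (real lines in the PDF will differ):

```r
##A made-up example of a raw line: serial number, title, then indexing info
sample_line <- "  123  Journal of Advanced Studies   Scopus"
##The lazy .*? stops at the first digit, so part of the serial
##number survives here; the later cleaning steps strip it out
sub(".*?[0-9](.*?)Scopus.*", "\\1", sample_line)
## "23  Journal of Advanced Studies   "
```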

##converting to lower case and removing all non-alphabetical characters
sub_all_lines$V1 <- gsub("[^[:alpha:]]"," ",sub_all_lines$V1)
sub_all_lines$V1 <- tolower(sub_all_lines$V1)

sub_all_lines3 <- sub_all_lines

##Removing the trailing 'wos' label and stray whitespace
sub_all_lines3$V1 <- gsub("wos *$","",sub_all_lines3$V1)
sub_all_lines3$V1 <- str_trim(sub_all_lines3$V1)
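A quick illustration of those two steps on an invented string:

```r
library(stringr)

x <- "journal of advanced studies    wos "
x <- gsub("wos *$", "", x)   ##drop the trailing 'wos' label
str_trim(x)                  ##then strip leading/trailing spaces
## "journal of advanced studies"
```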

##loading the list of predatory journals and cleaning that too
predatory <- read_csv("[link to predatory.csv]", col_names = FALSE)
predatory$X1 <- gsub("[^[:alnum:]]"," ",predatory$X1)
predatory$X1 <- tolower(predatory$X1)

##generating the list of matches
result <- merge(sub_all_lines3, predatory, by.x="V1",by.y = "X1")
result <- unique(result)
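Since merge() keeps only rows where the strings match exactly, the matching behaves like this toy example (journal names invented). A title that differs by even one character from the predatory list will not appear in the result:

```r
##Two tiny stand-ins for the cleaned UGC list and the predatory list
ugc  <- data.frame(V1 = c("journal of x", "journal of y"), stringsAsFactors = FALSE)
pred <- data.frame(X1 = c("journal of y", "journal of z"), stringsAsFactors = FALSE)
merge(ugc, pred, by.x = "V1", by.y = "X1")
##             V1
## 1 journal of y
```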