Extracting info from HTML to R

My employer provides information on the participants in the courses I teach through a web platform. I wanted to divide the students into randomized groups, so I needed their names as data in R. This is a very specific task, but I expect to need it again, and a similar approach may be useful in other cases.

Save the web page with the list of participants using Firefox/Iceweasel. Choose the option "Web Page, HTML only".
Run grep with a suitable pattern that singles out the lines you want, then pipe the result through unhtml to strip the markup.

  grep td names.html  | grep dynamic-data | unhtml > names.txt

Import into R using read.table with a suitable sep parameter. Here sep = "\n" makes each line of the file one row.

bar <- read.table(file = "names.txt", sep = "\n")
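
If grep and unhtml are not at hand, the same filtering can be approximated inside R. This is a minimal sketch, assuming each name sits on its own line in a td cell marked dynamic-data. The sample markup below is made up, and the tag-stripping regular expression is a crude stand-in for unhtml that does not decode HTML entities such as &amp;.

```r
# Hypothetical sample of the saved page; the real markup will differ.
html <- c(
  "<table>",
  "<tr><td class=\"dynamic-data\">Anna</td>",
  "<td class=\"dynamic-data\">Andersson</td></tr>",
  "<tr><td class=\"other\">ignored</td></tr>",
  "<tr><td class=\"dynamic-data\">Bertil</td>",
  "<td class=\"dynamic-data\">Berg</td></tr>",
  "</table>"
)

# Same filter as the two grep calls: keep lines containing both strings.
keep <- grepl("td", html, fixed = TRUE) &
  grepl("dynamic-data", html, fixed = TRUE)

# Remove everything that looks like a tag, then trim whitespace.
names.raw <- trimws(gsub("<[^>]+>", "", html[keep]))
names.raw
```

With a real file, the sample vector would be replaced by html <- readLines("names.html").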

If necessary, join the surnames with the personal names. In my case, the personal names were the odd items in bar, and the surnames were the even items.

personal.name <- bar[seq(from = 1, to = nrow(bar), by = 2),]
surname <- bar[seq(from = 2, to = nrow(bar), by = 2),]
my.names <- paste(personal.name, surname)
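
The pairing silently goes wrong if a line is missing, so a small check can help. A sketch with made-up data standing in for bar, assuming the file strictly alternates personal name and surname:

```r
# Made-up stand-in for the data frame returned by read.table.
bar <- data.frame(V1 = c("Anna", "Andersson", "Bertil", "Berg"),
                  stringsAsFactors = FALSE)

# An odd number of rows means a name is missing somewhere.
stopifnot(nrow(bar) %% 2 == 0)

odd <- seq(from = 1, to = nrow(bar), by = 2)
personal.name <- bar[odd, 1]      # rows 1, 3, 5, ...
surname <- bar[odd + 1, 1]        # rows 2, 4, 6, ...
my.names <- paste(personal.name, surname)
my.names
# "Anna Andersson" "Bertil Berg"
```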

Note to self

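namn <- my.names  # "namn" is Swedish for "names"
# Random order of the students
my.index <- sample.int(length(namn))
# Five major groups of (nearly) equal size
my.major.groups <- as.numeric(cut(seq_along(namn), breaks = 5))
# Five minor groups within each major group. lapply always returns a list,
# so unlist() gives names like "11", "12", ... starting with the major group number.
my.minor.groups <- lapply(table(my.major.groups), function(x) {as.numeric(cut(seq_len(x), breaks = 5))})
my.matrix <- data.frame(namn[my.index], unlist(my.minor.groups))
# The first character of each row name is the major group
my.matrix$major <- as.numeric(substr(rownames(my.matrix), 1, 1))
my.matrix$minor <- LETTERS[my.matrix[[2]]]
my.matrix <- my.matrix[,c(1,3,4)]
# Swedish column headers: name, group, subgroup
colnames(my.matrix) <- c("Namn", "Grupp", "Undergrupp")
write.csv2(my.matrix, file = "~/groups.csv", row.names = FALSE)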
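
A possibly simpler way to get the same kind of table, which does not rely on row names, is to shuffle the names and deal them out in turn. This is an alternative sketch rather than the code above: the student list, the group counts n.major and n.minor, and the seed are all made up for illustration.

```r
set.seed(1)                               # only to make the example repeatable
students <- paste("Student", 1:50)        # hypothetical names
n.major <- 5
n.minor <- 5

shuffled <- sample(students)              # random order
# Deal the shuffled names out to the major groups in turn: 1, 2, ..., 5, 1, 2, ...
major <- rep(seq_len(n.major), length.out = length(shuffled))
# Within each major group, deal the members out to the minor groups the same way.
minor <- ave(seq_along(major), major,
             FUN = function(i) rep(seq_len(n.minor), length.out = length(i)))

groups <- data.frame(Namn = shuffled,
                     Grupp = major,
                     Undergrupp = LETTERS[minor],
                     stringsAsFactors = FALSE)
groups <- groups[order(groups$Grupp, groups$Undergrupp), ]
head(groups)
```

The result can be written out with write.csv2 in the same way as before.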


Last modified: October 17, 2019