Recursive Functions

In my travels into the realms of functional programming I have embraced the idea of immutable data. I will not write programs that modify objects; the results of evaluations are stored in new objects instead.

Now, I wanted to write a web spider that traverses a website and collects links. I don't know the number of pages on the website in advance, of course. My first attempt used a while construct, but it turned out that I could not write it without modifying objects. So I tried just a little bit harder and came to a very beautiful solution: a recursive function. The nice part here is that a function invocation offers a place to put the result of an evaluation: instead of storing that result in an existing object, we can pass it as input to a new invocation of the same function.
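
As a toy illustration of that idea (my own minimal sketch, separate from the spider): a recursive function can carry its partial result as an argument, so every step builds a new object instead of updating an old one.

## Collect the numbers 1..n without modifying any object:
## each call passes an extended copy of the accumulator onwards.
collect.upto <- function(n, i = 1, acc = c()) {
  if (i > n) {
    return(acc)
  }
  collect.upto(n, i + 1, c(acc, i))   # c() builds a new vector; acc itself is untouched
}
collect.upto(5)   # 1 2 3 4 5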

Consider the following structure on a website. We know the number of sections in advance, but we do not know the number of pages within each section. The only way to know that there is an additional page within a section is that a known page links to it, i.e. in section1, page.1 will have a link to page.2, which will link to page.3, which will have no further links.

If we knew the number of pages in each section in advance, it would be rather easy to write an rapply() call to traverse the tree, but here we only find that number out at runtime (a sketch of such a fixed-depth traversal follows the diagram below).

section1 -
           page.1 -
                    link1
                    link2
                    ...
                    linkn
                    AND a link to page.2
           page.2
                    link1
                    link2
                    ...
                    linkn
                    AND a link to page.3
           page.3
                    link1
                    link2
                    ...
                    linkn
                    NO link to further pages in the section
section2 -
           page.1 -
                    link1
                    link2
                    ...
                    linkn
                    AND a link to page.2
           page.2
                    link1
                    link2
                    ...
                    linkn
                    NO link to further pages in the section
section3 -
...
section20
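
For comparison, here is roughly what that fixed-depth traversal could look like. This is a hypothetical sketch, not the real code: it assumes a vector pages.per.section holding the known page counts and a helper get.links.on.page() that returns the links found on one page, and it uses nested lapply() calls rather than rapply().

## Hypothetical: only works when the page counts are known in advance.
## pages.per.section and get.links.on.page() are assumed objects, not real ones.
all.links <- lapply(seq_along(pages.per.section), function(section.number) {
  do.call(rbind, lapply(seq_len(pages.per.section[section.number]), function(i) {
    get.links.on.page(section.number, i)
  }))
})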

In the real world case, there were 20 sections and thousands of pages in each section.

## Function to (recursively) traverse the pages within each section.
## Assumes base.url, section.numbers and my.get.page() are defined elsewhere;
## xpathSApply() comes from the XML package.
library(XML)

try.add <- function(section.number, i, links){
  ## Try to fetch page i of the section; a failed request ends the recursion.
  error.message <- try(r <- my.get.page(paste(base.url, section.number, 'latest', i, sep = '/')), silent = TRUE)
  if(!inherits(error.message, "try-error")){
    ## Extract the attributes of the matching <a> nodes and keep the second row of attributes.
    temp.links <- xpathSApply(r, "//div[@id='some identifier']/table/tbody/tr/td[@class='some other identifier']/a", xmlAttrs, trim = TRUE)[2, ]
    if(length(temp.links) > 0){
      ## Recurse with the counter advanced and the new links appended to a fresh copy of the accumulator.
      return(try.add(section.number, i + 1, rbind(links, temp.links)))
    }
  }
  ## No page or no new links: we are past the last page, so return what we have.
  return(links)
}

## Get all links to thread pages in each section.
tree.of.links.in.sections <- sapply(section.numbers, function(section.number) {
  try.add(section.number, 1, vector())
})

The beauty of this construct lies in two places:

return(try.add(section.number, i + 1, rbind(links, temp.links)))

Here we add 1 to our counter and we add the newly found links (temp.links) to the accumulated list of links. But we never modify an existing list, since rbind() creates a new object for each invocation. If we did not find any new links to follow, we conclude that we are on the last page in the section and simply return the list of links we have.
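
A quick way to convince yourself that rbind() really does leave the old object alone (a standalone check, separate from the spider):

links <- rbind(c("link1", "link2"))
more  <- rbind(links, c("link3", "link4"))
nrow(links)   # still 1 row: the original matrix is unchanged
nrow(more)    # 2 rows: the combined result is a new object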

While testing it I added a condition so it would automatically end the search for links once it had gone through 10 pages in the current section, and it worked flawlessly. But when I ran it on real-world data it failed with a "too deeply nested" error. It turns out that R does not implement tail call optimization, so it keeps the state of every function invocation on the call stack, which soon fills up RAM. So my beautiful code would require copious amounts of RAM for no good reason.
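
The limit is easy to reproduce with a trivial tail-recursive counter (a standalone illustration, not the spider code): each recursive call is in tail position, yet R still keeps a frame for every invocation.

count.down <- function(n) {
  if (n == 0) {
    return("done")
  }
  count.down(n - 1)   # tail call, but R does not reuse the current frame
}
count.down(100)    # "done"
count.down(1e5)    # error: evaluation nested too deeply / C stack usage too close to the limit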

A real functional language like Scheme or Erlang would have no problem here. It is a pity that R does.

