Sunday, September 17, 2017

wunderscraping and S3 classes in R

I recently wrote an R package that uses generic functions.  Generic functions are how R's S3 system implements object-oriented programming, and in R they are very informal.  Everyone who uses R uses generic functions.  A familiar generic function is summary, which takes any single object and prints summary information.  How does summary know how to handle each type of object?  It doesn't.  summary is a generic function that passes the object off to a method, which is another function that does know what to do with it.  Methods MUST be named generic.class.  So, the summary method for lm objects is summary.lm.  Users can see help for summary.lm directly with ?summary.lm.
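To see the dispatch described above in action, here is a small sketch using the built-in mtcars data.  The model formula is only an illustration; any lm fit would do.

```r
# Fit a small linear model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

class(fit)                      # "lm"

# summary() dispatches on class(fit), so these two calls do the same work
s1 <- summary(fit)
s2 <- summary.lm(fit)
identical(coef(s1), coef(s2))   # TRUE

# List the methods that follow the summary.class naming pattern
methods(summary)
```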

In summary (no pun!), all you have to do to add a new method is write a function and name it generic.class.  So, if I made a new class for weather objects named Wx and I wanted to add a summary method, I'd simply name the function summary.Wx.  If I want to add a completely new generic function then I need to define it first, using UseMethod.  With UseMethod, I can make a new generic function that will route objects to their appropriate methods.  A simple generic function is foo <- function(x) UseMethod('foo'), for which a simple method is foo.bar <- function(x) print(class(x)) # -> 'bar'.  Be careful that the generic function accepts the arguments that the methods will need.  If foo.bar is foo.bar <- function(x, suffix) print(paste0(x, suffix)) then foo(barObject, 'the third') will fail with an unused argument error, because foo itself only accepts x.  Generic functions that must accept unknown arguments for future methods can use ellipses: foo <- function(x, ...) UseMethod('foo').  Notice that foo must accept every argument that any of its methods may require, so it's simplest to use ellipses unless you can be sure the methods will only ever need the class object.
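The difference between the two versions of foo can be sketched as follows.  The object barObject and its contents are made up for illustration.

```r
# Generic that accepts only x: extra arguments are rejected before dispatch
foo <- function(x) UseMethod('foo')
foo.bar <- function(x, suffix) paste0(unclass(x), suffix)

# Hypothetical object of class 'bar'
barObject <- structure('Henry ', class = 'bar')

# Capture the error message instead of stopping the script
res <- tryCatch(foo(barObject, 'the third'),
                error = function(e) conditionMessage(e))
res                           # an "unused argument" message

# Generic with ellipses: extra arguments pass through to the method
foo <- function(x, ...) UseMethod('foo')
foo(barObject, 'the third')   # "Henry the third"
```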

Below is a short concrete example that creates a scheduler class with methods to plan, clean, check, and execute the schedule.  The class uses environments and datetime objects, both of which may be unfamiliar to many R users, but with generic functions the scheduler object gets a methods interface that is easy to use:

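scheduler <- function() { ## constructor function for scheduler object
    e <- structure(new.env(), class='scheduler')
    e$count <- 0
    ## Sys.Date() uses the session time zone, so format the current time to get the New York date
    e$date <- format(Sys.time(), '%Y-%m-%d', tz='America/New_York')
    e
}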

## generic functions
check <- function(x) UseMethod('check')
clean <- function(x) UseMethod('clean')
plan <- function(x, ...) UseMethod('plan')
schedule <- function(x) UseMethod('schedule')
## default methods
check.default <- function(x) warning(paste0('check cannot handle class ', class(x)[1]))
clean.default <- function(x) warning(paste0('clean cannot handle class ', class(x)[1]))
plan.default <- function(x, ...) warning(paste0('plan cannot handle class ', class(x)[1]))
schedule.default <- function(x) warning(paste0('schedule cannot handle class ', class(x)[1]))

## scheduler methods
check.scheduler <- function(scheduler) ls.str(scheduler)

clean.scheduler <- function(scheduler) scheduler$schedule <- with(scheduler, schedule[schedule > Sys.time()])

plan.scheduler <- function(scheduler, ...) { # convenience wrapper around seq.POSIXt
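    scheduler$schedule <- seq(strptime(0, '%H'), strptime(23, '%H'), ...)
    ## FORMAT is not defined in this post; define it first as a strftime format string,
    ## e.g. FORMAT <- '%H:%M' (an assumed value)
    scheduler$times <- strftime(scheduler$schedule, FORMAT)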
}

schedule.scheduler <- function(scheduler) {} ## execute the schedule

Using the class is simple:

mysch <- scheduler()
plan(mysch, by='90 min') # using seq.POSIXt makes periodic scheduling easy
## clean(mysch) # skip clean to start now; run it to drop past times and wait till the next period

someFunction(mysch) ## someFunction is a placeholder: write your own function that uses the scheduler's methods
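To make the scheduling logic easier to follow, here is a standalone sketch of what the plan and clean methods do with seq.POSIXt.  It uses fixed UTC times (an assumption, so the result is reproducible) instead of today's date and Sys.time().

```r
# Fixed start and end of the day in UTC (the post uses today's date instead)
start <- as.POSIXct('2017-09-17 00:00:00', tz = 'UTC')
end   <- as.POSIXct('2017-09-17 23:00:00', tz = 'UTC')

# What plan() builds: one time every 90 minutes between start and end
sched <- seq(start, end, by = '90 min')
length(sched)                   # 16 times, 00:00 through 22:30

# What clean() does: keep only times after "now"
# (a fixed stand-in for Sys.time())
now <- as.POSIXct('2017-09-17 12:00:00', tz = 'UTC')
upcoming <- sched[sched > now]
format(upcoming[1], '%H:%M')    # "13:30"
```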

For an even more concrete example, check out wunderscraper here.  Make a schedule, as discussed above, and use it to scrape Wunderground using main(mysch).  All users will need to register first for a Wunderground API key!

Post Script:
What do you think about S3 classes?  They are very informal, and yet very useful in making R more user friendly and in indicating, but not enforcing, a certain structural expectation about how users should work with a class.  As long as people don't do such insane and anti-social things as changing the class of an R object, the informality of S3 classes is OK.  What is more controversial, perhaps, about S3 classes is that they are method centric rather than class centric.  Users familiar with Python or C++ understand classes as relatively independent objects, whereas R's generic functions create a framework where all objects are related by a common set of methods.  Users can completely ignore the fact that a generic function named plot exists for visualizing objects, and instead make a new generic function named graph, or some other synonym, but again that would be rather anti-social coding behaviour, and not something most people are going to do by accident.

R has more formal classes implemented in the S4 and RC systems, but I actually prefer the informality and flexibility of S3.  I particularly find S4 a poor fit with R's already byzantine typing.  RC classes look useful for reference semantics; however, as you can see in the above example, environments already provide a useful container with reference semantics.  What is particularly useful about the environment is that if the function using the environment crashes, the environment state remains as left by the function, and can be inspected or simply reused.  The wunderscraper, for example, must keep count of how many API calls it makes in a day.  If the wunderscraper crashes, it can be restarted with the same environment, and it will pick back up.  For development purposes, I can even change the scraping function, recompile the code, and restart the new code with the old environment, as long as the new code hasn't changed the environment.
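Here is a minimal sketch of that crash-and-resume behaviour.  The names state, scrape, and bump are hypothetical stand-ins, not part of wunderscraper.

```r
# Hypothetical counter, standing in for the daily API call count
state <- new.env()
state$count <- 0

# Hypothetical worker: counts each "call" and fails before the third one
scrape <- function(env, n) {
    for (i in seq_len(n)) {
        if (env$count == 2) stop('simulated crash')
        env$count <- env$count + 1
    }
    invisible(env)
}

try(scrape(state, 5), silent = TRUE)   # crashes partway through
state$count                            # 2: progress before the crash is kept

# A list has copy semantics, so the same pattern loses its updates
bump <- function(l) { l$count <- l$count + 1; l }
lst <- list(count = 0)
bump(lst)
lst$count                              # still 0: only the copy was modified
```

Because the environment is modified in place, the count survives the error and the worker can be called again with the same state.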
