Sunday, September 17, 2017

wunderscraping and S3 classes in R

I recently wrote an R package where I use generic functions.  Generic functions are how R implements object oriented programming, and for R they are very informal.  Everyone who uses R uses generic functions.  A familiar generic function is summary, which takes any single object and prints summary information.  How does summary know how to handle each type of object?  It doesn't, summary is a generic function that passes the object off to its method, which is another function that does know what to do with it.  Methods MUST be named as such: generic.class.  So, the summary method for lm objects is summary.lm.  Users can see help for summary.lm directly with ?summary.lm.

In summary (no pun!), all you have to do to add a new method is write a function and name it as generic.class.  So, if I made a new class for weather objects named Wx and I wanted to add a summary method, I'd simply name the function summary.Wx.  If I want to add a completely new method then I need to register it first, using UseMethod.  With UseMethod, I can make a new generic function that will route objects to their appropriate methods.  A simple generic function is foo <- function(x) UseMethod('foo'), for which a simple method is foo.bar <- function(x) print(class(x)) # -> 'bar'.  Be careful that the generic function accepts the arguments that the methods will need.  If foo.bar is foo.bar <- function(x, sufffix) print(paste0(x, suffix)) then foo(barObject, 'the third') will fail with an unused argument error.  Generic functions that must accept unknown arguments for future methods can use ellipses: foo <- function(x, ...) UseMethod('foo').  Notice that foo must accept all arguments that any method foo may require, so it's simple to use ellipses if you cannot be sure the generic function will only ever need the class object.

Below is a short concrete example creating a scheduler class and methods to add, clean, and execute the schedule.  The class uses environments and datetime objects, both of which can be unfamiliar to most R users, but using generic functions the scheduler object is given a methods interface that is easy to use:

scheduler <- function() { ## constructor function for scheduler object
    e <- structure(new.env(), class='scheduler')
    e $count <- 0
    e $date=format(Sys.Date(), tz='America/New_York')
    e
}

## generic functions
check <- function(x) UseMethod('check')
clean <- function(x) UseMethod('clean')
plan <- function(x, ...) UseMethod('plan')
schedule <- function(x) UseMethod('schedule')
## default methods
check.default <- function(x) warning(paste0('get cannot handle class ', class(x)))
clean.default <- function(x) warning(paste0('clean cannot handle class ', class(x)))
plan.default <- function(x) warning(paste0('set cannot handle class ', class(x)))
schedule.default <- function(x) warning(paste0('schedule cannot handle class ', class(x)))

## scheduler methods
check.scheduler <- function(scheduler) ls.str(scheduler)

clean.scheduler <- function(scheduler) scheduler $schedule <- with(scheduler, schedule[schedule>Sys.time()])

plan.scheduler <- function(scheduler, ...) { # convenience wrapper around seq.POSIXt
    scheduler $schedule <- seq(strptime(0, '%H'), strptime(23, '%H'), ...)
    scheduler $times <- strftime(scheduler $schedule, FORMAT)
}

schedule.scheduler <- function(scheduler) {} ## execute the schedule

Using the class is simple:

mysch <- scheduler()
plan(mysch, by='90 min') # using seq.POSIXt makes periodic scheduling easy
## clean(mysch) # don't clean to start now, else wait till next period

someFunction(mysch) ## write the method for the scheduler and use it from someFunc

For an even more concrete example, check out wunderscraper here.  Make a schedule, as discussed above, and use it to scrape wunderground using main(mysch).  All users will need to register first for a Wunderground API key!

Post Script:
What do you think about S3 classes?  They are very informal, and yet very useful and making R more user friendly and also indicating, but not enforcing, a certain structural expectation about how users should work with a class.  As long as people don't do such insane and anti-social things as changing the class of an R object, then the informality of S3 classes is OK.  What is more controversial, perhaps, about S3 classes is that they are method centric, rather than class centric.  Users familiar with Python or C++ understand using classes are relatively independent objects, whereas R's generic functions creates a framework where all objects are related by a similar set of methods.  Users can completely ignore the fact that a generic function named plot exists for visualizing objects, and instead make a new generic function named graph, or some other synonym, but again that would be rather anti-social coding behaviour, and not something most people are going to do by accident.

R has more formal classes implemented in the S4 anc RC classes, but I actually prefer the informality and flexibility of S3.  I particularly find S4 a poor fit with R's already byzantine typing.  RC classes look useful for reference semantics, however as you can see in the above example, environments provide a useful container with reference semantics.  What is particularly useful about the environment is that if the function using the environment crashes, the environment state remains as left by the function, and can be inspected or simply reused.  The wunderscraper, for example, must keep count of how many API calls it makes in a day.  If the wunderscraper crashes, it can be restarted with the same environment, and it will pick back up.  For development purposes, I can even change the scraping function, recompile the code, and restart the new code with the old environment, as long as the new code hasn't change the environment.
Darkest dungeon is a slow descent into madness.  It starts off innocently enough.  A path through the woods to a crumbling hamlet.  A few heroes, a torch, and a dungeon waiting to be explored.  And then dungeon conquered!  We return to a celebratory drink or a prayer at the hamlet.  More heroes show up--wide eyed and ready explore and rebuild the ruined landscape.  And the party embarks for another dungeon, and another, and another.  But soon the voices start.  The dungeons grow darker, the monsters more brutal, and the heroes more mad.  Weeks pass in the game; hours pass in real life.  The dungeon eventually massacres your favorite heroes.  Such brutality cannot go unanswered!  With maddening, obsessive, resolve redoubled the entire hamlet wails, thrashes, and prepares to lash back.  Every waking moment is dedicated to taking the dungeon.  Sanity begins to slip, for both the heroes and the players, and soon the dungeon, it's loot and denizens, are the only thing that matters.  The dungeon occupies every thought until we occupy its halls as our home; having become monsters ourselves.

Darkest Dungeon is a party management roguelike.  Players embark from a home-base hamlet to explore a series of dungeons.  After each dungeon the player returns to a hamlet where they can heal the wounded heroes in the party, upgrade hero skills and equipment, and recruit new heroes.  Heroes belong to one of thirteen classes with more classes under development.  Each class has seven skills but only four can be active at a time.  All skills can be upgraded.  All heroes have a weapon and armour that cannot be swapped but are also upgradeable.  Finally each hero can equip two trinkets that modify their stats; some trinkets can only be equipped by one class.

I've installed and deleted Darkest Dungeon from my computer no fewer than three times.  This is because the game is so good that when I start playing the only way to stop is to remove the game.  This also requires me to wait half an hour to redownload the game whenever I foolishly decide to start playing it again.  This game is a time suck.  Party management is deep, and the dungeons are fun and the hamlet-dungeon rhythm really keeps the player in a "just one more" mindset so that's hard to quit playing.

Darkest Dungeon pulls off Lovecraft Horror as good as any other game or movie.  The game is further flavoured with an Eastern European air in terms of the trappings; from the hamlet, the heroes, and the voice-acting.  The narration and quips are appropriately atmospheric and not usually overdone.  When the player kills a monster with poison the narrator speaks in a seething voice:
"slowly, gently, this is how a life is taken"
And when the player takes a critical hit the narrator exclaims:
"mortality clarified...in a single strike"

The strategy is also fun.  The player can craft different strategies around different parties.  Many effective straightforward party builds exist but the player can also put together more complex parties where heroes depend heavily on particular skills in the party.  Each party has it's own rhythm and critical flaws that become apparent only when trying to apply the build in the dungeon.

...to be continued?

Monday, January 19, 2015

Prime sieves in R

R is not a great language for writing a prime sieve.  Prime sieves do lots of looping and at large numbers need high precision and R does neither of these well.  R's number package provides a prime sieve function that on my computer will find primes up to about 10^8. Very large sieves will not fit in memory and require segmented sieving.  Segmented sieving divides the integer space into equal size segments and sieves each segment in turn.  Segmented sieving is the only sieve method capable of reaching very large numbers.  Languages like C++ can optimize the sieve size for the available memory and be very fast.  The R implementation below, however, is slow.

Below are two implementations.  The first is a traditional sieve and the second a segmented sieve.  The traditional sieve is two parts: the first is a conventional sieve of Eratosthenes to find primes up to the fourth root of and the second sieves n using the primes from the first step and then also sieving out all evens.  The two steps limit the magnitude of overlapping multiples.  The second implementation is a segmented sieve that first uses a traditional sieve to get primes up to sqrt(n) and divides n into segments and uses a refined sieve similar to the second step of the traditional sieve to find the primes in each segment.

tradsieve <- function(n) {
  ## sieve of Eratosthenes (n >= 81)
  ## sieve one (seive 2 and all odds up to sqrt(n)
  P <- c(2, seq(3, n^0.25, 2))
  P <- (1:sqrt(n))[-c(1, unique(unlist(mapply(seq, P*P, sqrt(n), P))))]
  ## seive two (seive p in P^2 up to n in steps of p*2)
  (1:n)[-unique(c(seq(6, n, 2), unlist(mapply(seq, P*P, n, P*2))))]
}

The sieve of Eratosthenes crosses off multiples of 2 and all odds.  The second sieve crosses off evens and all multiples of two times the primes from the first sieve.  Because this sieve takes the fourth root n must be 81 or larger.

segsieve <- function(n, segsize=4e2) {
    ## segmented seive (fast and solves memory problem)
    if (segsize%%2>0) segsize <- segsize + 1 ## segsize must be even
    P <- tradsieve(sqrt(n))[-1] ## n >= 6561
    P2.64 <- mpfr(P, 64)*P ## P^2 may overflow R's precision
    I <- cumsum(rle(ceiling(P2.64/segsize)) $lengths) ## segment indices
    J <- as.numeric(P2.64 %% segsize) ## prime indices within segment
    evens <- seq(2, segsize, 2) ## sequence for sieving evens
    f <- function(i) {
        P2 <- P[1:i]*2
        P <- mapply(seq, J[1:i], segsize+P2, P2)
        J[1:i] <<- sapply(P, tail, 1) - segsize ## last elements set next indices
        P <- unlist(sapply(P, head, -1)) ## last elements > segsize
        (1:segsize)[-unique(c(evens, P))]
    }
    P <- sapply(I, f)
    ## replace first element with 2
    c(2, unlist(mapply(function(i, p) p + i*segsize, 0:(length(P)-1), P))[-1])

}

The segmented sieve divides n into segments.  The traditional sieve finds primes up to sqrt(n) and we then associate the square of each prime with its segment.  Segmented sieving iterates over the segments and sieves using the multiples of primes from the current and all previous segments.  After each segment the algorithm subtracts segment size from the final multiple and sets this to the next index to continue each sieve in the next segment

An optimized C++ implementation of segmented sieving will find primes up to 10^10 in less than a second but this R implementation is much slower.  The tradsieve function will find numbers up to 10^7 in about a second but will crash when attempting 10^8.  The segsieve function takes a minute to find 10^7 but it will also work with 10^9 and higher although on even 10^9 it takes an hour--which suggests 10^10 could takes hours and 10^11 days.  Before judging this code harshly, consider that R's numbers package's prime number finding function works up only to about 10^8.  R can implement a segmented sieve but ultimately a language with strong memory management is a better option.

Finding big primes is hard and implementing these functions and learning all the obstacles, and tricks, for finding big primes has given me an understanding for why it is that these numbers are so important for cryptography.

Sunday, November 2, 2014

ssh to Windows from an iPad

My main computer is Windows, and sometimes I'd like to connect to it via ssh with an iPad and iSSH.  The reason I do this is to get access to an R console on iPad.  The App store has an R console app, but it's limited to packages that the app author included and it can't be extended.  The best way to get a full R installation on an iPad is to ssh to a computer with a full R installation.

A computer needs an active sshd (ssh daemon) to accept ssh connections. Windows doesn't provide an sshd and requires special software.  Many options are available, but most of them are limited or not free.  The best free solution that provides full functionality is OpenSSH, and the best way to get OpenSSH is through Cygwin.

I won't repeat all the steps here, but instead I will supply what I found to be the most helpful resources, and raise some potential obstacles.  NOTE that any instructions completed within a Cygwin terminal will usually require you to launch the terminal 'as administrator,' which you can do by right clicking the executable.

First go get Cygwin:

Before installing Cygwin read through the instructions for setting up an sshd with OpenSSH.  Cygwin is big and you don't need all of it; you can install just the portions you need.  The best walkthrough I've found for setting up OpenSSH's sshd is here:

If you install Cygwin, get the right packages, and start the above walkthrough, you're likely to encounter an error at the portion of the walkthrough where you invoke ssh-user-config.  You might get something to the effect of a home directory that doesn't exist.  This is because Cygwin chooses the wrong default home directory, unless you've set the correct home directory in a Windows global environment variable named HOME.  Thing is, you probably don't want to set a global environment variable for your home folder, but would rather use a user environment variable, but this isn't good enough for Cygwin.  The solution is to regenerate a special file that Cygwin uses to find things like your home directory.  This file resides in C:\\cygwin64\etc\passwd--Cygwin uses some special POSIX to Windows file system conversions, however, and will refer to this file as /etc/passwd.  You could go changing this file by hand with a text editor, but the simplest solution is to invoke mkpasswd and regenerate the /etc/passwd file to point to the correct home directory.  Launch a Cygwin terminal and enter the following:

mkpasswd -l -p "$(cygpath -H)" > /etc/passwd

This tells Cygwin to regenerate /etc/passwd.  The l switch tells mkpasswd to print the local user account--that's you--and the p switch tells it to set the home directory to the specified file path instead of using Cygwin's default method of finding the home.  The next bit, between the quotes, is the argument to p--the specified home path.  We enclose this in quotes because the home path may contain spaces, which must be enclosed in quotes.  The $( is a special syntax that starts a subshell, a shell script within a shell, and returns the results.  Here we make a subshell to invoke cygpath to get the value for the argument to the p switch of mkpasswd.  Are we confused yet?  And why can't we just type out the home path ourselves?  Cygwin is bastard child of scandalous trysts between Windows and POSIX, and while it operates in Windows it still needs POSIX style paths.  The cygpath utility helps convert between Windows and POSIX paths.  The H switch to cygpath is a convenience switch that returns the homeroot of the current user.  Lastly, the > is telling mkpasswd to write the results to the /etc/passwd file--it's a standard *nix thing.  Diligent readers may note that /etc/passwd has already been modified by the ssh-host-config command, but don't worry mkpasswd will keep all that intact and rewrite only the entry for the user's home.

If none of that worked, then there's some more resources on this problem here:

Now you can get back to the walkthrough, and you should be able to finish everything normally.

The next step is to try connecting.  If you could connect to the localhost, as described in the above walkthrough, then you are ready to go.  Launch iSSH and create a new configuration.  You will need to know your computer's IP address on the local network.  To find this launch Window's usual command line terminal--not Cygwin--and type ipconfig /all.  That will list all the IP stuff, there will be a lot of it, but there should be only one IPv4 entry, and that's what you need.  It should be four integers separated by periods, the first integer should be 192, but not always.  In iSSH enter the IPv4 value in the Host field.  Enter 22 in the Port field, and enter your Window's user name in the Login field.  You should also go ahead and enter your Window's login password in the password field.  It should be optional, but for me if I didn't do this the connection would often 'drop unexpectedly' after I entered my password at the prompt; I never had this problem after setting my password here in the configuration setup.

You will probably now encounter another problem: connection timeout.  Windows firewall likely has port 22 blocked.  To open port 22 follow these instructions:

Now try connecting again through iSSH.  If everything works you should find yourself at a Cygwin prompt.  Just type R and you're cookin' with gas!

If you royally fucked up any of the steps and feel like you need to nuke your sshd stuff and start over, then here's an entry at Superuser that can help you do that:

Configuration finished.  Have fun!


Wednesday, October 15, 2014

R's quirky nested assignment

R has some quirky features that I think exist to serve its functional paradigm.  One of the oddest is what I know only as nested assignment, as demonstrated in this function for the Collatz sequence:

collatz <- function(x) {
  cseq <- c(x)
  while (x>1) cseq <- c(cseq, x<-ifelse(x%%2<1, x/2, 3*x+1))
  cseq
}

Here in the call to combine--c--we've also assigned x a new value.  The really odd thing about this is that most general-purpose non-lisp style languages use the equal sign = as the assignment operator.  In function calls the assignment operator is slightly different, where k=v usually means "set the argument named k to the value v," and that's exactly what the equal sign in R's function calls also mean.  But since R uses a little ASCII arrow <- for the assignment operator, it is still free to make assignments outside the scope of the called function within the argument list of the called function!  Because R is a functional language, most function calls must be assigned to something.  In this sense assignments within function calls are nested.  Kinda weird, but it saves lines of code and is probably suited to a functional style where one might (tastefully) chain together a couple (anonymous) functions.

The next post will look at the functions behind R's operators and how to use them with sapply and friends.  As it turns the assignment operator is a function!  The subset operator is also a function and this enables some really elegant solutions for data management using R's base functions.

Sunday, October 5, 2014

Sliding windows in R

The sliding window is conceptually simple but sometimes tricky to implement.  R sadly lacks off-the-shelf options for these prototypical data structures--the zoo package provides rollapply, but it's an overwhelming function for simple tasks.  Fortunately it's easy to implement your own sliding window with base functions.  The two functions we'll need are rle (for: run length encoding) and cumsum (cumulative sum).

rle requires a sorted vector and will find the lengths of the runs of values--thus run length encoding.  For example let:

X <- c(5, 5, 5, 6, 6, 7, 9, 9)

Which has an rle of:

(3, 2, 1, 2)

for values:

(5, 6, 7, 9)

Next we obtain the cumulative sum of our run length encoding.  For the rle of X from the previous example:

csum <- c(0, cumsum(c(3, 2, 1, 2))) ## == c(0, 3, 5, 6, 8)

We need to add the zero on to the front of this vector, it's not part of the usual cumsum but we need it.  Now we can index the original vector, Xusing the cumulative sum; just create an integer vector from beginning and ending points in the cumulative sum vector.  For example, if the cumulative sum vector is csum and the original vector is X, then we index values 5 through 7 in X with:

X[(csum[1]+1):csum[4]] ## == (5, 5, 5, 6, 6, 7)

and sliding forward we get values 6, 7, and 9 with:

X[(csum[2]+1):csum[5]] ## == (6, 6, 7, 9, 9)

Now we have a means to slide over a dataset of something like dates, where the range of dates remains constant--for example seven days--but the number of cases within the sliding range could vary.

csum's first element is zero and each succeeding element is the total count of cases with values equal to or lesser than that element's run value.  Because we need to add zero to the beginning, csum is one longer than the number of run values and the zero does not associate with any run value.  The starting value of the integer vector for a window over a run of values is the cumulative sum of the preceding counts of the runs of values plus one; and that's why we add one to the starting value from the cumulative sum.  For example, the starting value for a window going over 6, 7, and 9 is going to be the index of the last case of the previous run value--5--plus one.  There are three 5s and they are the first three elements in all the runs, so the index of their last case is three.  Add one for four and that's the index of the first element of the next group--the 6s:

(csum[2]+1):csum[5] ## == 4:8 == (4, 5, 6, 7, 8)
X[4:8] ## == (6, 6, 7, 9, 9)

Putting this all together, what we need is a list of the integer vectors that reference the cases in each iteration of the sliding window, and we might even want the window to move forward in step sizes larger than one.  Here's an easy to use function:

slide <- function(X, width, a=1, step.size=1) {
  ## X must be sorted
  rle.X <- rle(X)
  cum.rle <- c(0, cumsum(rle.X$ lengths))
  if (length(cum.rle) < 2) return(NA)
  A <- seq(a, length(cum.rle)-width, step.size)
  Z <- seq(width+a, length(cum.rle), step.size)
  slides <- mapply(function(a, z) (cum.rle[a]+1):cum.rle[z], A, Z)
  names(slides) <- rle.X$ values[A]
  return(slides)
}

slide returns a list of integer vectors for referencing each iteration of a sliding window that you can use with lapplysapply, or similar functions:

mydframe <- data.frame(x=sample(1:10, 100, replace=TRUE), y=rnorm(100))
mydframe <- mydframe[order(mydframe$ x), ]
myslides <- slide(mydframe$ x, 3)
do.stuff.to <- function(x) x
myresults <- lapply(myslides, function(N) do.stuff.to(mydframe[N, ]))

Viola!  That's it!  Just write your own do.stuff.to function.