Wednesday, October 15, 2014

R's quirky nested assignment

R has some quirky features that I think exist to serve its functional paradigm.  One of the oddest is what I know only as nested assignment, as demonstrated in this function for the Collatz sequence:

collatz <- function(x) {
  cseq <- c(x)
  while (x>1) cseq <- c(cseq, x<-ifelse(x%%2<1, x/2, 3*x+1))
  cseq
}

Here in the call to combine--c--we've also assigned x a new value.  The really odd thing about this is that most general-purpose non-lisp style languages use the equal sign = as the assignment operator.  In function calls the assignment operator is slightly different, where k=v usually means "set the argument named k to the value v," and that's exactly what the equal sign in R's function calls also mean.  But since R uses a little ASCII arrow <- for the assignment operator, it is still free to make assignments outside the scope of the called function within the argument list of the called function!  Because R is a functional language, most function calls must be assigned to something.  In this sense assignments within function calls are nested.  Kinda weird, but it saves lines of code and is probably suited to a functional style where one might (tastefully) chain together a couple (anonymous) functions.

The next post will look at the functions behind R's operators and how to use them with sapply and friends.  As it turns the assignment operator is a function!  The subset operator is also a function and this enables some really elegant solutions for data management using R's base functions.

Sunday, October 5, 2014

Sliding windows in R

The sliding window is conceptually simple but sometimes tricky to implement.  R sadly lacks off-the-shelf options for these prototypical data structures--the zoo package provides rollapply, but it's an overwhelming function for simple tasks.  Fortunately it's easy to implement your own sliding window with base functions.  The two functions we'll need are rle (for: run length encoding) and cumsum (cumulative sum).

rle requires a sorted vector and will find the lengths of the runs of values--thus run length encoding.  For example let:

X <- c(5, 5, 5, 6, 6, 7, 9, 9)

Which has an rle of:

(3, 2, 1, 2)

for values:

(5, 6, 7, 9)

Next we obtain the cumulative sum of our run length encoding.  For the rle of X from the previous example:

csum <- c(0, cumsum(c(3, 2, 1, 2))) ## == c(0, 3, 5, 6, 8)

We need to add the zero on to the front of this vector, it's not part of the usual cumsum but we need it.  Now we can index the original vector, Xusing the cumulative sum; just create an integer vector from beginning and ending points in the cumulative sum vector.  For example, if the cumulative sum vector is csum and the original vector is X, then we index values 5 through 7 in X with:

X[(csum[1]+1):csum[4]] ## == (5, 5, 5, 6, 6, 7)

and sliding forward we get values 6, 7, and 9 with:

X[(csum[2]+1):csum[5]] ## == (6, 6, 7, 9, 9)

Now we have a means to slide over a dataset of something like dates, where the range of dates remains constant--for example seven days--but the number of cases within the sliding range could vary.

csum's first element is zero and each succeeding element is the total count of cases with values equal to or lesser than that element's run value.  Because we need to add zero to the beginning, csum is one longer than the number of run values and the zero does not associate with any run value.  The starting value of the integer vector for a window over a run of values is the cumulative sum of the preceding counts of the runs of values plus one; and that's why we add one to the starting value from the cumulative sum.  For example, the starting value for a window going over 6, 7, and 9 is going to be the index of the last case of the previous run value--5--plus one.  There are three 5s and they are the first three elements in all the runs, so the index of their last case is three.  Add one for four and that's the index of the first element of the next group--the 6s:

(csum[2]+1):csum[5] ## == 4:8 == (4, 5, 6, 7, 8)
X[4:8] ## == (6, 6, 7, 9, 9)

Putting this all together, what we need is a list of the integer vectors that reference the cases in each iteration of the sliding window, and we might even want the window to move forward in step sizes larger than one.  Here's an easy to use function:

slide <- function(X, width, a=1, step.size=1) {
  ## X must be sorted
  rle.X <- rle(X)
  cum.rle <- c(0, cumsum(rle.X$ lengths))
  if (length(cum.rle) < 2) return(NA)
  A <- seq(a, length(cum.rle)-width, step.size)
  Z <- seq(width+a, length(cum.rle), step.size)
  slides <- mapply(function(a, z) (cum.rle[a]+1):cum.rle[z], A, Z)
  names(slides) <- rle.X$ values[A]
  return(slides)
}

slide returns a list of integer vectors for referencing each iteration of a sliding window that you can use with lapplysapply, or similar functions:

mydframe <- data.frame(x=sample(1:10, 100, replace=TRUE), y=rnorm(100))
mydframe <- mydframe[order(mydframe$ x), ]
myslides <- slide(mydframe$ x, 3)
do.stuff.to <- function(x) x
myresults <- lapply(myslides, function(N) do.stuff.to(mydframe[N, ]))

Viola!  That's it!  Just write your own do.stuff.to function.