rle requires a sorted vector and will find the lengths of the runs of values--thus run length encoding. For example let:
X <- c(5, 5, 5, 6, 6, 7, 9, 9)
Which has an rle of:
(3, 2, 1, 2)
for values:
(5, 6, 7, 9)
Next we obtain the cumulative sum of our run length encoding. For the rle of X from the previous example:
csum <- c(0, cumsum(c(3, 2, 1, 2))) ## == c(0, 3, 5, 6, 8)
We need to add the zero on to the front of this vector, it's not part of the usual cumsum but we need it. Now we can index the original vector, X, using the cumulative sum; just create an integer vector from beginning and ending points in the cumulative sum vector. For example, if the cumulative sum vector is csum and the original vector is X, then we index values 5 through 7 in X with:
X[(csum[1]+1):csum[4]] ## == (5, 5, 5, 6, 6, 7)
and sliding forward we get values 6, 7, and 9 with:
X[(csum[2]+1):csum[5]] ## == (6, 6, 7, 9, 9)
Now we have a means to slide over a dataset of something like dates, where the range of dates remains constant--for example seven days--but the number of cases within the sliding range could vary.
csum's first element is zero and each succeeding element is the total count of cases with values equal to or lesser than that element's run value. Because we need to add zero to the beginning, csum is one longer than the number of run values and the zero does not associate with any run value. The starting value of the integer vector for a window over a run of values is the cumulative sum of the preceding counts of the runs of values plus one; and that's why we add one to the starting value from the cumulative sum. For example, the starting value for a window going over 6, 7, and 9 is going to be the index of the last case of the previous run value--5--plus one. There are three 5s and they are the first three elements in all the runs, so the index of their last case is three. Add one for four and that's the index of the first element of the next group--the 6s:
(csum[2]+1):csum[5] ## == 4:8 == (4, 5, 6, 7, 8)
X[4:8] ## == (6, 6, 7, 9, 9)
Putting this all together, what we need is a list of the integer vectors that reference the cases in each iteration of the sliding window, and we might even want the window to move forward in step sizes larger than one. Here's an easy to use function:
slide <- function(X, width, a=1, step.size=1) {
## X must be sorted
rle.X <- rle(X)
cum.rle <- c(0, cumsum(rle.X$ lengths))
if (length(cum.rle) < 2) return(NA)
A <- seq(a, length(cum.rle)-width, step.size)
Z <- seq(width+a, length(cum.rle), step.size)
slides <- mapply(function(a, z) (cum.rle[a]+1):cum.rle[z], A, Z)
names(slides) <- rle.X$ values[A]
return(slides)
}
slide returns a list of integer vectors for referencing each iteration of a sliding window that you can use with lapply, sapply, or similar functions:
mydframe <- data.frame(x=sample(1:10, 100, replace=TRUE), y=rnorm(100))
mydframe <- mydframe[order(mydframe$ x), ]
myslides <- slide(mydframe$ x, 3)
do.stuff.to <- function(x) x
myresults <- lapply(myslides, function(N) do.stuff.to(mydframe[N, ]))
Viola! That's it! Just write your own do.stuff.to function.
No comments:
Post a Comment