Monday, January 19, 2015

Prime sieves in R

R is not a great language for writing a prime sieve.  Prime sieves do lots of looping and at large numbers need high precision and R does neither of these well.  R's number package provides a prime sieve function that on my computer will find primes up to about 10^8. Very large sieves will not fit in memory and require segmented sieving.  Segmented sieving divides the integer space into equal size segments and sieves each segment in turn.  Segmented sieving is the only sieve method capable of reaching very large numbers.  Languages like C++ can optimize the sieve size for the available memory and be very fast.  The R implementation below, however, is slow.

Below are two implementations.  The first is a traditional sieve and the second a segmented sieve.  The traditional sieve is two parts: the first is a conventional sieve of Eratosthenes to find primes up to the fourth root of and the second sieves n using the primes from the first step and then also sieving out all evens.  The two steps limit the magnitude of overlapping multiples.  The second implementation is a segmented sieve that first uses a traditional sieve to get primes up to sqrt(n) and divides n into segments and uses a refined sieve similar to the second step of the traditional sieve to find the primes in each segment.

tradsieve <- function(n) {
  ## sieve of Eratosthenes (n >= 81)
  ## sieve one (seive 2 and all odds up to sqrt(n)
  P <- c(2, seq(3, n^0.25, 2))
  P <- (1:sqrt(n))[-c(1, unique(unlist(mapply(seq, P*P, sqrt(n), P))))]
  ## seive two (seive p in P^2 up to n in steps of p*2)
  (1:n)[-unique(c(seq(6, n, 2), unlist(mapply(seq, P*P, n, P*2))))]
}

The sieve of Eratosthenes crosses off multiples of 2 and all odds.  The second sieve crosses off evens and all multiples of two times the primes from the first sieve.  Because this sieve takes the fourth root n must be 81 or larger.

segsieve <- function(n, segsize=4e2) {
    ## segmented seive (fast and solves memory problem)
    if (segsize%%2>0) segsize <- segsize + 1 ## segsize must be even
    P <- tradsieve(sqrt(n))[-1] ## n >= 6561
    P2.64 <- mpfr(P, 64)*P ## P^2 may overflow R's precision
    I <- cumsum(rle(ceiling(P2.64/segsize)) $lengths) ## segment indices
    J <- as.numeric(P2.64 %% segsize) ## prime indices within segment
    evens <- seq(2, segsize, 2) ## sequence for sieving evens
    f <- function(i) {
        P2 <- P[1:i]*2
        P <- mapply(seq, J[1:i], segsize+P2, P2)
        J[1:i] <<- sapply(P, tail, 1) - segsize ## last elements set next indices
        P <- unlist(sapply(P, head, -1)) ## last elements > segsize
        (1:segsize)[-unique(c(evens, P))]
    }
    P <- sapply(I, f)
    ## replace first element with 2
    c(2, unlist(mapply(function(i, p) p + i*segsize, 0:(length(P)-1), P))[-1])

}

The segmented sieve divides n into segments.  The traditional sieve finds primes up to sqrt(n) and we then associate the square of each prime with its segment.  Segmented sieving iterates over the segments and sieves using the multiples of primes from the current and all previous segments.  After each segment the algorithm subtracts segment size from the final multiple and sets this to the next index to continue each sieve in the next segment

An optimized C++ implementation of segmented sieving will find primes up to 10^10 in less than a second but this R implementation is much slower.  The tradsieve function will find numbers up to 10^7 in about a second but will crash when attempting 10^8.  The segsieve function takes a minute to find 10^7 but it will also work with 10^9 and higher although on even 10^9 it takes an hour--which suggests 10^10 could takes hours and 10^11 days.  Before judging this code harshly, consider that R's numbers package's prime number finding function works up only to about 10^8.  R can implement a segmented sieve but ultimately a language with strong memory management is a better option.

Finding big primes is hard and implementing these functions and learning all the obstacles, and tricks, for finding big primes has given me an understanding for why it is that these numbers are so important for cryptography.