HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Prime factorization for integers with up to 9-digit long prime factors

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
factorswithlongdigitforintegersprimefactorization

Problem

I've become interested prime factorization since solving Project Euler problem 3 (finding the largest prime factor of 600851475143). Learning here that initializing lists with many elements to later prune them is the computational equivalent of swimming with bricks, I challenged myself to code a function that returns all prime factors as efficiently as possible, without the need for initializing a big list beforehand.

The routine takes the input n and divides it by 2 as many times as possible before leaving a remainder, redefining n to be the result of the division each time. When 2 no longer cleanly divides n, it moves to 3, but it is after 3 that the crux of the inefficiency arises.

After 3, if n does not equal 1 and there are still factors to be found, the next factor tried is the next consecutive odd integer after 3. For example, take the number \$129373200 = (2^4) × (3^5) × (5^2) × (11^3)\$. After the algorithm divides all 2's, 3's and 5's away, the next test-number is 7, which I view as efficient because 7 is prime. However, as the routine iterates, 9 is tested before 11, and I see this as inefficient because 9 is composite.

If I edit the code such that it stores all tested numbers in a list, and checks if the next test-number is a multiple of a previously tested one or not before iterating, the runtime slows down considerably. Is there an efficient way to do this sort of check via elementary functions without storing tested numbers in a list (or set, dictionary, etc.)?

TLDR: I want to understand why factoring numbers with 8-digit-long prime factors takes the code 7 seconds versus numbers with 9-digit-long prime factors that take nearly 2.5 minutes to solve. I want to reduce this large jump in runtime.

```
def pf(n):
startTime=datetime.now()
factors = [] #Initialize a list to store prime factors.
while n % 2 == 0: #While n/2 continues to yield no remainder:
factors += [2] #Append 2 to the factor list.
n /= 2 #Redefine

Solution

Code Comments

First, that is a thoroughly excessive amount of commenting. I appreciate your zeal in documentation, but you do not need to comment every line. Just a couple comments here and there is thoroughly sufficient.

Second, your function should do its job. If you want to time it, that is an external job. Use the timeit library:

def pf(n):
    # stuff that doesn't involve datetime
    return factors

print timeit.timeit('pf(val)',
                    setup='from __main__ import pf; val = 145721*(145721+2)',
                    number=10)


That will print how long it took your function to run 10 times. Great library, highly recommended.

Thirdly, factors += [p] should be spelled factors.append(p).

Algorithm Comments

The reason your code performs so poorly is because of the its algorithmic complexity. You are trying every odd number from 2 to sqrt(N), so your algorithm is O(sqrt(N)). But here N is the actual number... if we express N in terms of digits, it's O(sqrt(10^N)) = O(10^(N/2)). So when we go from 30-digit number to a 43-digit number, we would expect to see an increase in time on the order of a million in the worst case (if we were looking at twin primes). The fact that we see an increase of "only" 20x is an artifact of case selection more than anything else. I tried picking multiples of (arbitrary) twin primes, and the runtimes I got after 10 runs each were:

145721 * 145723        0.536s
1117811 * 1117813      2.716s
18363797 * 18363799   45.436s


Roughly 10x each time. Yay predictions.

So what can we do to speed this up? When our times are this bad, it's either because we have a really stupid bug or we're using a bad algorithm. There isn't anything I see that will save this particular algorithm. At best, we can improve by changing the iteration to avoid multiples of 3 and 5:

jump = 4
while p*p <= n:
    ...
    else:
        p += jump
        jump ^= 6 #so it alternates between 2 and 4


That gets us to:

145721 * 145723        0.263s
1117811 * 1117813      2.172s
18363797 * 18363799   34.740s


Still exponential, but better constants at least. Not sure how much better we're going to do than that. So let's just google some prime factorization algorithms. First one I came across was Pollard's rho. Simply copying the algorithm gives produces:

def pollard(n):
    from fractions import gcd

    def get_factor(n):
        x_fixed = 2
        x = 2
        cycle_size = 2
        factor = 1

        while factor == 1:
            for count in xrange(cycle_size):
                if factor > 1: break
                x = (x * x + 1) % n
                factor = gcd(x - x_fixed, n)

            cycle_size *= 2
            x_fixed = x

        return factor

    factors = []
    while n > 1:
        next = get_factor(n)
        factors.append(next)
        n //= next
    return factors


That gives me the correct response on all three twin primes, with this performance:

145721 * 145723          0.021s (12.5x improvement)
1117811 * 1117813        0.295s ( 7.4x)
18363797 * 18363799      1.131s (30.7x)
1500450269 * 1500450271  5.733s (est. ~1000x)


Even jumping up to 10-digits, we still see good performance (we'd expect over an hour with the previous algorithm, so we're talking 1000x improvement), and we see better growth to.

Sometimes, you just need a better algorithm.

Code Snippets

def pf(n):
    # stuff that doesn't involve datetime
    return factors

print timeit.timeit('pf(val)',
                    setup='from __main__ import pf; val = 145721*(145721+2)',
                    number=10)
145721 * 145723        0.536s
1117811 * 1117813      2.716s
18363797 * 18363799   45.436s
jump = 4
while p*p <= n:
    ...
    else:
        p += jump
        jump ^= 6 #so it alternates between 2 and 4
145721 * 145723        0.263s
1117811 * 1117813      2.172s
18363797 * 18363799   34.740s
def pollard(n):
    from fractions import gcd

    def get_factor(n):
        x_fixed = 2
        x = 2
        cycle_size = 2
        factor = 1

        while factor == 1:
            for count in xrange(cycle_size):
                if factor > 1: break
                x = (x * x + 1) % n
                factor = gcd(x - x_fixed, n)

            cycle_size *= 2
            x_fixed = x

        return factor

    factors = []
    while n > 1:
        next = get_factor(n)
        factors.append(next)
        n //= next
    return factors

Context

StackExchange Code Review Q#104381, answer score: 12

Revisions (0)

No revisions yet.