patternpythonModerate
Prime factorization for integers with up to 9-digit long prime factors
Viewed 0 times
factorswithlongdigitforintegersprimefactorization
Problem
I've become interested prime factorization since solving Project Euler problem 3 (finding the largest prime factor of 600851475143). Learning here that initializing lists with many elements to later prune them is the computational equivalent of swimming with bricks, I challenged myself to code a function that returns all prime factors as efficiently as possible, without the need for initializing a big list beforehand.
The routine takes the input n and divides it by 2 as many times as possible before leaving a remainder, redefining n to be the result of the division each time. When 2 no longer cleanly divides n, it moves to 3, but it is after 3 that the crux of the inefficiency arises.
After 3, if n does not equal 1 and there are still factors to be found, the next factor tried is the next consecutive odd integer after 3. For example, take the number \$129373200 = (2^4) × (3^5) × (5^2) × (11^3)\$. After the algorithm divides all 2's, 3's and 5's away, the next test-number is 7, which I view as efficient because 7 is prime. However, as the routine iterates, 9 is tested before 11, and I see this as inefficient because 9 is composite.
If I edit the code such that it stores all tested numbers in a list, and checks if the next test-number is a multiple of a previously tested one or not before iterating, the runtime slows down considerably. Is there an efficient way to do this sort of check via elementary functions without storing tested numbers in a list (or set, dictionary, etc.)?
TLDR: I want to understand why factoring numbers with 8-digit-long prime factors takes the code 7 seconds versus numbers with 9-digit-long prime factors that take nearly 2.5 minutes to solve. I want to reduce this large jump in runtime.
```
def pf(n):
startTime=datetime.now()
factors = [] #Initialize a list to store prime factors.
while n % 2 == 0: #While n/2 continues to yield no remainder:
factors += [2] #Append 2 to the factor list.
n /= 2 #Redefine
The routine takes the input n and divides it by 2 as many times as possible before leaving a remainder, redefining n to be the result of the division each time. When 2 no longer cleanly divides n, it moves to 3, but it is after 3 that the crux of the inefficiency arises.
After 3, if n does not equal 1 and there are still factors to be found, the next factor tried is the next consecutive odd integer after 3. For example, take the number \$129373200 = (2^4) × (3^5) × (5^2) × (11^3)\$. After the algorithm divides all 2's, 3's and 5's away, the next test-number is 7, which I view as efficient because 7 is prime. However, as the routine iterates, 9 is tested before 11, and I see this as inefficient because 9 is composite.
If I edit the code such that it stores all tested numbers in a list, and checks if the next test-number is a multiple of a previously tested one or not before iterating, the runtime slows down considerably. Is there an efficient way to do this sort of check via elementary functions without storing tested numbers in a list (or set, dictionary, etc.)?
TLDR: I want to understand why factoring numbers with 8-digit-long prime factors takes the code 7 seconds versus numbers with 9-digit-long prime factors that take nearly 2.5 minutes to solve. I want to reduce this large jump in runtime.
```
def pf(n):
startTime=datetime.now()
factors = [] #Initialize a list to store prime factors.
while n % 2 == 0: #While n/2 continues to yield no remainder:
factors += [2] #Append 2 to the factor list.
n /= 2 #Redefine
Solution
Code Comments
First, that is a thoroughly excessive amount of commenting. I appreciate your zeal in documentation, but you do not need to comment every line. Just a couple comments here and there is thoroughly sufficient.
Second, your function should do its job. If you want to time it, that is an external job. Use the
That will print how long it took your function to run 10 times. Great library, highly recommended.
Thirdly,
Algorithm Comments
The reason your code performs so poorly is because of the its algorithmic complexity. You are trying every odd number from 2 to
Roughly 10x each time. Yay predictions.
So what can we do to speed this up? When our times are this bad, it's either because we have a really stupid bug or we're using a bad algorithm. There isn't anything I see that will save this particular algorithm. At best, we can improve by changing the iteration to avoid multiples of 3 and 5:
That gets us to:
Still exponential, but better constants at least. Not sure how much better we're going to do than that. So let's just google some prime factorization algorithms. First one I came across was Pollard's rho. Simply copying the algorithm gives produces:
That gives me the correct response on all three twin primes, with this performance:
Even jumping up to 10-digits, we still see good performance (we'd expect over an hour with the previous algorithm, so we're talking 1000x improvement), and we see better growth to.
Sometimes, you just need a better algorithm.
First, that is a thoroughly excessive amount of commenting. I appreciate your zeal in documentation, but you do not need to comment every line. Just a couple comments here and there is thoroughly sufficient.
Second, your function should do its job. If you want to time it, that is an external job. Use the
timeit library:def pf(n):
# stuff that doesn't involve datetime
return factors
print timeit.timeit('pf(val)',
setup='from __main__ import pf; val = 145721*(145721+2)',
number=10)That will print how long it took your function to run 10 times. Great library, highly recommended.
Thirdly,
factors += [p] should be spelled factors.append(p). Algorithm Comments
The reason your code performs so poorly is because of the its algorithmic complexity. You are trying every odd number from 2 to
sqrt(N), so your algorithm is O(sqrt(N)). But here N is the actual number... if we express N in terms of digits, it's O(sqrt(10^N)) = O(10^(N/2)). So when we go from 30-digit number to a 43-digit number, we would expect to see an increase in time on the order of a million in the worst case (if we were looking at twin primes). The fact that we see an increase of "only" 20x is an artifact of case selection more than anything else. I tried picking multiples of (arbitrary) twin primes, and the runtimes I got after 10 runs each were:145721 * 145723 0.536s
1117811 * 1117813 2.716s
18363797 * 18363799 45.436sRoughly 10x each time. Yay predictions.
So what can we do to speed this up? When our times are this bad, it's either because we have a really stupid bug or we're using a bad algorithm. There isn't anything I see that will save this particular algorithm. At best, we can improve by changing the iteration to avoid multiples of 3 and 5:
jump = 4
while p*p <= n:
...
else:
p += jump
jump ^= 6 #so it alternates between 2 and 4That gets us to:
145721 * 145723 0.263s
1117811 * 1117813 2.172s
18363797 * 18363799 34.740sStill exponential, but better constants at least. Not sure how much better we're going to do than that. So let's just google some prime factorization algorithms. First one I came across was Pollard's rho. Simply copying the algorithm gives produces:
def pollard(n):
from fractions import gcd
def get_factor(n):
x_fixed = 2
x = 2
cycle_size = 2
factor = 1
while factor == 1:
for count in xrange(cycle_size):
if factor > 1: break
x = (x * x + 1) % n
factor = gcd(x - x_fixed, n)
cycle_size *= 2
x_fixed = x
return factor
factors = []
while n > 1:
next = get_factor(n)
factors.append(next)
n //= next
return factorsThat gives me the correct response on all three twin primes, with this performance:
145721 * 145723 0.021s (12.5x improvement)
1117811 * 1117813 0.295s ( 7.4x)
18363797 * 18363799 1.131s (30.7x)
1500450269 * 1500450271 5.733s (est. ~1000x)Even jumping up to 10-digits, we still see good performance (we'd expect over an hour with the previous algorithm, so we're talking 1000x improvement), and we see better growth to.
Sometimes, you just need a better algorithm.
Code Snippets
def pf(n):
# stuff that doesn't involve datetime
return factors
print timeit.timeit('pf(val)',
setup='from __main__ import pf; val = 145721*(145721+2)',
number=10)145721 * 145723 0.536s
1117811 * 1117813 2.716s
18363797 * 18363799 45.436sjump = 4
while p*p <= n:
...
else:
p += jump
jump ^= 6 #so it alternates between 2 and 4145721 * 145723 0.263s
1117811 * 1117813 2.172s
18363797 * 18363799 34.740sdef pollard(n):
from fractions import gcd
def get_factor(n):
x_fixed = 2
x = 2
cycle_size = 2
factor = 1
while factor == 1:
for count in xrange(cycle_size):
if factor > 1: break
x = (x * x + 1) % n
factor = gcd(x - x_fixed, n)
cycle_size *= 2
x_fixed = x
return factor
factors = []
while n > 1:
next = get_factor(n)
factors.append(next)
n //= next
return factorsContext
StackExchange Code Review Q#104381, answer score: 12
Revisions (0)
No revisions yet.