patterncCritical
Disproving Euler proposition by brute force in C
Viewed 0 times
forcedisprovingeulerbruteproposition
Problem
I wrote a little bit of code to brute force disprove the Euler proposition that states:
$$a^4 + b^4 + c^4 = d^4$$
has no solution when \$a\$, \$b\$, \$c\$, \$d\$ are positive integers.
I'm no mathematician, but reading around this, at least one solution was found by Noam Elkies in 1987 (
I ran it for a while, and worked out that it would take roughly \$85 \times 10^6\$ years for it to get to the answer above on my current machine.
I mean, maybe if I waited a few thousand years, I'd have a slightly better machine, but my question is (and I am new to C and computer science in general - so please be gentle); what strategies could I take to make the above code run faster? I thought about threading, and using some sort of bit shifting in place of the
Would there be any way (on my current machine) to get this to run in my lifetime?
$$a^4 + b^4 + c^4 = d^4$$
has no solution when \$a\$, \$b\$, \$c\$, \$d\$ are positive integers.
I'm no mathematician, but reading around this, at least one solution was found by Noam Elkies in 1987 (
a = 95800; b = 217519; c = 414560; d = 422481). I wanted to get an idea of how much firepower it would take to solve by brute force, so I wrote the following in C:#include
#include
#include
int prop(long int A, long int B, long int C, long int D) {
return (pow(A, 4) + pow(B, 4) + pow(C, 4) == pow(D, 4));
}
int main() {
long int a, b, c, d;
clock_t t;
t = clock();
for (a = 1; a < 100000; a++) {
for (b = 1; b < 300000; b++) {
for (c = 1; c < 500000; c++) {
for (d = 1; d < 500000; d++) {
if (prop(a, b, c, d))
printf("FOUND IT!\na = %ld\nb = %ld\nc = %ld\nd = %ld\n", a, b, c, d);
}
if (!(c%1000))
printf("a = %ld, b = %ld, c = %ld, time = %fs\n", a, b, c, ((double)(clock() - t))/CLOCKS_PER_SEC);
}
printf("a = %ld, b = %ld, time = %fs\n", a, b, ((double)(clock() - t))/CLOCKS_PER_SEC);
}
printf("a = %ld, time = %fs\n", a, ((double)(clock() - t))/CLOCKS_PER_SEC);
}
}I ran it for a while, and worked out that it would take roughly \$85 \times 10^6\$ years for it to get to the answer above on my current machine.
I mean, maybe if I waited a few thousand years, I'd have a slightly better machine, but my question is (and I am new to C and computer science in general - so please be gentle); what strategies could I take to make the above code run faster? I thought about threading, and using some sort of bit shifting in place of the
pow() calls.Would there be any way (on my current machine) to get this to run in my lifetime?
Solution
Note: At some point, this review drifted into the realm of assembler and GMP. An actual review is at the end of this post, whereas the first section discusses the runtime-problems concerning
No life time for run time
Would there be any way (on my current machine) to get this to run in my lifetime?
There's a great saying in carpentry: measure twice, cut once. It concerns cutting wood or other material, where you have to throw away your resources if you accidentally cut at the wrong place.
A similar saying is there for software engineers: you can't optimize what you can't measure. There are several ways to measure your code, e.g. benchmarking, profiling, or looking at the generated assembler to see how many instructions a certain part of your code will take.
Here, we will take the latter route, start with the assembler, take considerations step by step and see where we end up.
A study in assembly
Lets have a look at your code. Well, not yours, but the assembler the compiler generates. You can use
Even though you might not know assembler, you can see those four calls to
Double trouble
But wait a second. We're trying to solve a completely integral problem! Why does our generated program use those registers at all?
This should raise a red flag. And indeed, if we remember
Using proper functions
So let's use another
Note that your compiler should create something like this from that:
but if your compiler doesn't recognize this potential (micro) optimization, you can use
instead.
We also need to change
Allright. Now, before I show the times of the new program, let's check the output of your old one:
a = 1, b = 1, c = 1000, time = 114.156000s
That's when I hit ^C. How does the one using
a = 1, b = 1, c = 1000, time = 0.296000s
a = 1, b = 1, c = 2000, time = 0.578000s
a = 1, b = 1, c = 3000, time = 0.859000s
a = 1, b = 1, c = 4000, time = 1.140000s
a = 1, b = 1, c = 5000, time = 1.421000s
a = 1, b = 1, c = 6000, time = 1.703000s
a = 1, b = 1, c = 7000, time = 1.984000s
a = 1, b = 1, c = 8000, time = 2.265000s
a = 1, b = 1, c = 9000, time = 2.546000s
a = 1, b = 1, c = 10000, time = 2.828000s
a = 1, b = 1, c = 11000, time = 3.109000s
a = 1, b = 1, c = 12000, time = 3.390000s
a = 1, b = 1, c = 13000, time = 3.687000s
a = 1, b = 1, c = 14000, time = 3.968000s
a = 1, b = 1, c = 15000, time = 4.250000s
a = 1, b = 1, c = 16000, time = 4.531000s
Which is 0,2% of your original time, or a 500x speedup.
However, this comes at a cost:
pow, wrong data types and arbitrary large integers.No life time for run time
Would there be any way (on my current machine) to get this to run in my lifetime?
There's a great saying in carpentry: measure twice, cut once. It concerns cutting wood or other material, where you have to throw away your resources if you accidentally cut at the wrong place.
A similar saying is there for software engineers: you can't optimize what you can't measure. There are several ways to measure your code, e.g. benchmarking, profiling, or looking at the generated assembler to see how many instructions a certain part of your code will take.
Here, we will take the latter route, start with the assembler, take considerations step by step and see where we end up.
A study in assembly
Lets have a look at your code. Well, not yours, but the assembler the compiler generates. You can use
gcc -S -O3. On my platform, this results in the following "hot" section in main:.L6:
add rbx, 1
cmp rbx, 500000
je .L18
.L8:
mov rax, QWORD PTR .LC0[rip]
movsd xmm0, QWORD PTR [rsp+40]
movq xmm1, rax
call pow ; (1)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+8], xmm0
movsd xmm0, QWORD PTR [rsp+48]
movq xmm1, rax
call pow ; (2)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+16], xmm0
movsd xmm0, QWORD PTR [rsp+32]
movq xmm1, rax
call pow ; (3)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+24], xmm0
pxor xmm0, xmm0
cvtsi2sd xmm0, rbx
movq xmm1, rax
call pow ; (4)
movsd xmm2, QWORD PTR [rsp+8]
addsd xmm2, QWORD PTR [rsp+16]
movapd xmm1, xmm0
movsd xmm0, QWORD PTR [rsp+24]
addsd xmm0, xmm2
ucomisd xmm0, xmm1
jp .L6
jne .L6Even though you might not know assembler, you can see those four calls to
pow. The first thing you need to know is that call is slow compared to those other operations. Those four calls happen in the innermost loop. The compiler removed the call to prop and instead replaced it by its code (that's faster).mov assigns values to registers, add adds values, and so on. The registers with xmm* are double precision registers, meant for double variables. So we're basically calling pow with the right values and then add, subtract and modify our small little double values.Double trouble
But wait a second. We're trying to solve a completely integral problem! Why does our generated program use those registers at all?
This should raise a red flag. And indeed, if we remember
pow's signature, it should be clear that it's not the right tool. It takes a double base and exponent, which indicates that it's suitable for terms like \$15.12151^{3.1415926}\$. This is a total overkill for your problem.Using proper functions
So let's use another
pow version instead:long int pow4(long int x){
return x * x * x * x;
}Note that your compiler should create something like this from that:
movq %rdi, %rax
imulq %rdi, %rax
imulq %rax, %rax
retbut if your compiler doesn't recognize this potential (micro) optimization, you can use
long int pow4(long int x){
const long int t = x * x;
return t * t;
}instead.
We also need to change
prop:int prop(long int A, long int B, long int C, long int D) {
return (pow4(A) + pow4(B) + pow4(C) == pow4(D));
}Allright. Now, before I show the times of the new program, let's check the output of your old one:
a = 1, b = 1, c = 1000, time = 114.156000s
That's when I hit ^C. How does the one using
pow4 hold up?a = 1, b = 1, c = 1000, time = 0.296000s
a = 1, b = 1, c = 2000, time = 0.578000s
a = 1, b = 1, c = 3000, time = 0.859000s
a = 1, b = 1, c = 4000, time = 1.140000s
a = 1, b = 1, c = 5000, time = 1.421000s
a = 1, b = 1, c = 6000, time = 1.703000s
a = 1, b = 1, c = 7000, time = 1.984000s
a = 1, b = 1, c = 8000, time = 2.265000s
a = 1, b = 1, c = 9000, time = 2.546000s
a = 1, b = 1, c = 10000, time = 2.828000s
a = 1, b = 1, c = 11000, time = 3.109000s
a = 1, b = 1, c = 12000, time = 3.390000s
a = 1, b = 1, c = 13000, time = 3.687000s
a = 1, b = 1, c = 14000, time = 3.968000s
a = 1, b = 1, c = 15000, time = 4.250000s
a = 1, b = 1, c = 16000, time = 4.531000s
Which is 0,2% of your original time, or a 500x speedup.
However, this comes at a cost:
pow4(500000) is too large for a int64_t, since \$\log_2(500000^4) \approx 76\$. The greatest number you could check with a uint64_t is Code Snippets
.L6:
add rbx, 1
cmp rbx, 500000
je .L18
.L8:
mov rax, QWORD PTR .LC0[rip]
movsd xmm0, QWORD PTR [rsp+40]
movq xmm1, rax
call pow ; (1)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+8], xmm0
movsd xmm0, QWORD PTR [rsp+48]
movq xmm1, rax
call pow ; (2)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+16], xmm0
movsd xmm0, QWORD PTR [rsp+32]
movq xmm1, rax
call pow ; (3)
mov rax, QWORD PTR .LC0[rip]
movsd QWORD PTR [rsp+24], xmm0
pxor xmm0, xmm0
cvtsi2sd xmm0, rbx
movq xmm1, rax
call pow ; (4)
movsd xmm2, QWORD PTR [rsp+8]
addsd xmm2, QWORD PTR [rsp+16]
movapd xmm1, xmm0
movsd xmm0, QWORD PTR [rsp+24]
addsd xmm0, xmm2
ucomisd xmm0, xmm1
jp .L6
jne .L6long int pow4(long int x){
return x * x * x * x;
}movq %rdi, %rax
imulq %rdi, %rax
imulq %rax, %rax
retlong int pow4(long int x){
const long int t = x * x;
return t * t;
}int prop(long int A, long int B, long int C, long int D) {
return (pow4(A) + pow4(B) + pow4(C) == pow4(D));
}Context
StackExchange Code Review Q#145221, answer score: 151
Revisions (0)
No revisions yet.