patternMinor
Monte Carlo Pi (MASM)
Viewed 0 times
masmmontecarlo
Problem
I'm currently trying to brush up on my assembly skills and, being at the FPU section of the tutorial, I implemented a very basic version of a Monte-Carlo-Algorithm to compute pi. I deliberately use the x87 instruction set instead of SSE because that's what I'm learning at the moment. What would be the pros and cons of SSE vs x87 for that particular case?
; Monte - Carlo PI
.686
.model flat, c
.stack 4096
.mmx
.xmm
.data
align 16
rand DWORD 0
maxu DWORD 07fffffffh
one DWORD 1.0
four DWORD 4.0
maxn DWORD 50000000
count DWORD 0
result REAL8 0.
.code
includelib MSVCRT
extrn exit : near
main proc
; ecx is loop counter
xor eax, eax
mov ecx, maxn
; init FPU and store const
; values 1 and 2^-32
finit
fild dword ptr [maxu]
fld dword ptr [one]
fdivr st(1), st(0)
lp:
; load two random numbers x and y
rdrand ebx
mov dword ptr[rand], ebx
rdrand ebx
mov dword ptr[rand + 4], ebx
fild dword ptr[rand]
fild dword ptr[rand + 4]
; normalize to[0, 1] and square
fmul st(0), st(3)
fmul st(0), st(0)
fxch st(1)
fmul st(0), st(3)
fmul st(0), st(0)
fadd st(0), st(1)
; if point is inside the circle increase counter
fcomi st(0), st(2)
ja greater
inc eax
greater:
fstp result
fstp result
loop lp
; pi/4 = #points inside/#all tries
mov count, eax
fild dword ptr [count]
fidiv dword ptr [maxn]
fmul four
fstp result
push 0
call exit
main endp
end mainSolution
Here are some things I've observed that may help you improve your code.
Prefer specific instructions to data storage
There is a constant
Take advantage of instructions that pop the stack
Many instruction have a version which pops the top of the FPU stack. The code currently includes this:
However it appears that the only reason for those two
By using the
Think carefully about instruction ordering
The
Avoid branching where practical
Branching is a costly operation to the processor, so avoiding (that is conditional or unconditional jumps) saves cycles and time. In this code, we have this:
I've already addressed changing that to this by using the stack-popping versions to get this:
We can do still better, though. The carry flag is set by the
This assumes that
Use stack space instead of named variables
Having named variables is very useful for starting out and troubleshooting, but doing so prevents your routine from being reentrant. That is, if the code is called by more than one thread simultaneously, either or both will work incorrectly because there is only one copy of the variables. If you instead allocate those on the stack, this potential problem is eliminated.
Consider refining the mathematics
At the moment, the loop scales each random number, squares each, adds the products and compares. Mathematically, if \$a\$ and \$b\$ are the two random numbers, we calculate
\$(a k)^2 + (b k)^2 < 1\$ but we can certainly manipulate that because \$k\$ is a constant. Specifically, we can convert this to \$a^2 + b^2 < \frac{1}{k^2}\$. This would allow calculating the constant term \$\frac{1}{k^2}\$ once outside the loop and minimizing the operations within the loop.
Consider reformatting the code
It compiles as is, but typical formatting of assembly language code put only labels and directives in the first column and all code is indented.
Consider the merits of SSE
To answer your question, yes, SSE would speed things up. In particular, you could calculate multiple numbers in parallel. This would allow the code to effectively use fewer loops for the same precision.
Prefer specific instructions to data storage
There is a constant
1 that is loaded into one but there is actually an instruction fld1 which you could use instead. This would avoid both the storage area and the time taken to access it.Take advantage of instructions that pop the stack
Many instruction have a version which pops the top of the FPU stack. The code currently includes this:
greater:
fstp result
fstp result
loop lpHowever it appears that the only reason for those two
fstp instructions is to pop the operands off the stack -- they're not actually needed later. A better way to do that would be this:faddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(1)
ja greater
inc eax
greater:
loop lpBy using the
p suffix on both the faddp and fcomip instructions, the stack is already adjusted.Think carefully about instruction ordering
The
fxch instruction is not really needed. Instead, you could simply load each random number as needed onto the top of the stack by moving the second of the fild instructions to where the fxch is now. Avoid branching where practical
Branching is a costly operation to the processor, so avoiding (that is conditional or unconditional jumps) saves cycles and time. In this code, we have this:
fadd st(0), st(1)
; if point is inside the circle increase counter
fcomi st(0), st(2)
ja greater
inc eax
greater:
fstp result
fstp result
loop lpI've already addressed changing that to this by using the stack-popping versions to get this:
faddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(2)
ja greater
inc eax
greater:
loop lpWe can do still better, though. The carry flag is set by the
fcomip instruction if we should be incrementing eax. We can take advantage of this directly and avoid a branch:faddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(2)
adc eax, edx
loop lpThis assumes that
edx is set to 0 which can easily be done the same place that eax is initially set to zero.Use stack space instead of named variables
Having named variables is very useful for starting out and troubleshooting, but doing so prevents your routine from being reentrant. That is, if the code is called by more than one thread simultaneously, either or both will work incorrectly because there is only one copy of the variables. If you instead allocate those on the stack, this potential problem is eliminated.
Consider refining the mathematics
At the moment, the loop scales each random number, squares each, adds the products and compares. Mathematically, if \$a\$ and \$b\$ are the two random numbers, we calculate
\$(a k)^2 + (b k)^2 < 1\$ but we can certainly manipulate that because \$k\$ is a constant. Specifically, we can convert this to \$a^2 + b^2 < \frac{1}{k^2}\$. This would allow calculating the constant term \$\frac{1}{k^2}\$ once outside the loop and minimizing the operations within the loop.
Consider reformatting the code
It compiles as is, but typical formatting of assembly language code put only labels and directives in the first column and all code is indented.
Consider the merits of SSE
To answer your question, yes, SSE would speed things up. In particular, you could calculate multiple numbers in parallel. This would allow the code to effectively use fewer loops for the same precision.
Code Snippets
greater:
fstp result
fstp result
loop lpfaddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(1)
ja greater
inc eax
greater:
loop lpfadd st(0), st(1)
; if point is inside the circle increase counter
fcomi st(0), st(2)
ja greater
inc eax
greater:
fstp result
fstp result
loop lpfaddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(2)
ja greater
inc eax
greater:
loop lpfaddp st(1), st(0)
; if point is inside the circle increase counter
fcomip st(0), st(2)
adc eax, edx
loop lpContext
StackExchange Code Review Q#94162, answer score: 6
Revisions (0)
No revisions yet.