patternMinor

Monte Carlo Pi (MASM)

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

masmmontecarlo

Problem

I'm currently trying to brush up on my assembly skills and, being at the FPU section of the tutorial, I implemented a very basic version of a Monte-Carlo-Algorithm to compute pi. I deliberately use the x87 instruction set instead of SSE because that's what I'm learning at the moment. What would be the pros and cons of SSE vs x87 for that particular case?

; Monte - Carlo PI

.686
.model flat, c
.stack 4096
.mmx
.xmm

.data
align 16
rand DWORD 0
maxu DWORD 07fffffffh
one DWORD 1.0
four DWORD 4.0
maxn DWORD 50000000
count DWORD 0
result REAL8 0.

.code
includelib MSVCRT
extrn exit : near

main proc

; ecx is loop counter
xor eax, eax
mov ecx, maxn

; init FPU and store const
; values 1 and 2^-32
finit
fild dword ptr [maxu]
fld dword ptr [one]
fdivr st(1), st(0)

lp:
; load two random numbers x and y
rdrand ebx
mov dword ptr[rand], ebx
rdrand ebx
mov dword ptr[rand + 4], ebx
fild dword ptr[rand]
fild dword ptr[rand + 4]

; normalize to[0, 1] and square
fmul st(0), st(3)
fmul st(0), st(0)
fxch st(1)
fmul st(0), st(3)
fmul st(0), st(0)
fadd st(0), st(1)

; if point is inside the circle increase counter
fcomi st(0), st(2)
ja greater
inc eax
greater:
fstp result
fstp result
loop lp

; pi/4 = #points inside/#all tries
mov count, eax
fild dword ptr [count]
fidiv dword ptr [maxn]
fmul four
fstp result

push 0
call exit

main endp
end main

Solution

Here are some things I've observed that may help you improve your code.

Prefer specific instructions to data storage

There is a constant 1 that is loaded into one but there is actually an instruction fld1 which you could use instead. This would avoid both the storage area and the time taken to access it.

Take advantage of instructions that pop the stack

Many instruction have a version which pops the top of the FPU stack. The code currently includes this:

greater:
    fstp result
    fstp result
    loop lp

However it appears that the only reason for those two fstp instructions is to pop the operands off the stack -- they're not actually needed later. A better way to do that would be this:

faddp st(1), st(0)

    ; if point is inside the circle increase counter
    fcomip st(0), st(1)
    ja greater
    inc eax
greater:
    loop lp

By using the p suffix on both the faddp and fcomip instructions, the stack is already adjusted.

Think carefully about instruction ordering

The fxch instruction is not really needed. Instead, you could simply load each random number as needed onto the top of the stack by moving the second of the fild instructions to where the fxch is now.

Avoid branching where practical

Branching is a costly operation to the processor, so avoiding (that is conditional or unconditional jumps) saves cycles and time. In this code, we have this:

fadd st(0), st(1)
    ; if point is inside the circle increase counter
    fcomi st(0), st(2)
    ja greater
    inc eax
greater:
    fstp result
    fstp result
    loop lp

I've already addressed changing that to this by using the stack-popping versions to get this:

faddp st(1), st(0)
    ; if point is inside the circle increase counter
    fcomip st(0), st(2)
    ja greater
    inc eax
greater:
    loop lp

We can do still better, though. The carry flag is set by the fcomip instruction if we should be incrementing eax. We can take advantage of this directly and avoid a branch:

faddp st(1), st(0)
    ; if point is inside the circle increase counter
    fcomip st(0), st(2)
    adc eax, edx
    loop lp

This assumes that edx is set to 0 which can easily be done the same place that eax is initially set to zero.

Use stack space instead of named variables

Having named variables is very useful for starting out and troubleshooting, but doing so prevents your routine from being reentrant. That is, if the code is called by more than one thread simultaneously, either or both will work incorrectly because there is only one copy of the variables. If you instead allocate those on the stack, this potential problem is eliminated.

Consider refining the mathematics

At the moment, the loop scales each random number, squares each, adds the products and compares. Mathematically, if \$a\$ and \$b\$ are the two random numbers, we calculate

\$(a k)^2 + (b k)^2 < 1\$ but we can certainly manipulate that because \$k\$ is a constant. Specifically, we can convert this to \$a^2 + b^2 < \frac{1}{k^2}\$. This would allow calculating the constant term \$\frac{1}{k^2}\$ once outside the loop and minimizing the operations within the loop.

Consider reformatting the code

It compiles as is, but typical formatting of assembly language code put only labels and directives in the first column and all code is indented.

Consider the merits of SSE

To answer your question, yes, SSE would speed things up. In particular, you could calculate multiple numbers in parallel. This would allow the code to effectively use fewer loops for the same precision.

Code Snippets

greater:
    fstp result
    fstp result
    loop lp

faddp st(1), st(0)

    ; if point is inside the circle increase counter
    fcomip st(0), st(1)
    ja greater
    inc eax
greater:
    loop lp

fadd st(0), st(1)
    ; if point is inside the circle increase counter
    fcomi st(0), st(2)
    ja greater
    inc eax
greater:
    fstp result
    fstp result
    loop lp

faddp st(1), st(0)
    ; if point is inside the circle increase counter
    fcomip st(0), st(2)
    ja greater
    inc eax
greater:
    loop lp

faddp st(1), st(0)
    ; if point is inside the circle increase counter
    fcomip st(0), st(2)
    adc eax, edx
    loop lp

Context

StackExchange Code Review Q#94162, answer score: 6

Revisions (0)

No revisions yet.