patternMinor

Efficient communication across multiple processors for finite elements

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

finiteelementsprocessorscommunicationefficientformultipleacross

Problem

I am implementing finite element code using MPI. Obviously, when using MPI I will have to include communication between different nodes so that all processors have all relevant information. As this is my first time doing something like this, I have implemented a very naive algorithm to get updated volume information for each element as shown below:

```
!loop over all of the processors
do n=0,numProcs-1
!for each iteration of the loop, one processor will hand out relevant information
if( my_id .eq. n ) then
do i=0,numProcs-1
if( i .ne. my_id ) then
currProc=i
!get the number of volumes processor i needs from processor n
call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id, &
MPI_COMM_WORLD,status,ierr )

!for each volume that processor i needs:
do j=1,numNeeded
!get the element number
call MPI_RECV( currNeeded,1,MPI_INTEGER,i,j, &
MPI_COMM_WORLD,status,ierr )
!return the volume of that element
call MPI_SEND( myElements(elementGlobalToLocal(currNeeded))%xsj,1, &
MPI_REAL8,i,j,MPI_COMM_WORLD,status,ierr )
enddo
endif
enddo
else
!tell processor n how many volumes are needed from it, calculated elsewhere
call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n, &
MPI_COMM_WORLD,status,ierr )

!for each volume needed
do j=1,numElementsNeeded(n+1)
!send the element number, calculated elsewhere
call MPI_SEND( elementsNeeded(n+1, j),1,MPI_INTEGER,n,j, &
MPI_COMM_WORLD,status,ierr )
!receive the appropriate volume
call MPI_RECV( myExternalXsj(currReceived),1,MPI_REAL8,n,j, &
MPI_COMM_WORLD,status,ierr

Solution

The answer was staring me right in the face, I just wasn't paying attention, apparently. Instead of sending one integer and one real*8 at a time, I should have been sending all of them at once, as:

do n=0,numProcs-1
    if( my_id .eq. n ) then
        do i=0,numProcs-1
            if( i .ne. my_id ) then
                currProc=i
                call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id*2, & 
                               MPI_COMM_WORLD,status,ierr )

                allocate( currNeeded(numNeeded) )
                allocate( volsNeeded(numNeeded) )

                call MPI_RECV( currNeeded,numNeeded,MPI_INTEGER,currProc,my_id*2+1, & 
                               MPI_COMM_WORLD,status,ierr )

                do j=1,numNeeded
                    volsNeeded(j)=myElements(elementGlobalToLocal(currNeeded(j)))%xsj
                enddo

                call MPI_SEND( volsNeeded,numNeeded,MPI_REAL8,currProc,my_id, & 
                               MPI_COMM_WORLD,status,ierr )

                deallocate( currNeeded )
                deallocate( volsNeeded )
            endif
        enddo
    else
        call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n*2, & 
                       MPI_COMM_WORLD,status,ierr )

        call MPI_SEND( elementsNeeded(n+1, 1:numElementsNeeded(n+1)), & 
                       numElementsNeeded(n+1),MPI_INTEGER,n,n*2+1, & 
                       MPI_COMM_WORLD,status,ierr )

        call MPI_RECV( myExternalXsj(currReceived:currReceived+numElementsNeeded(n+1)), & 
                       numElementsNeeded(n+1),MPI_REAL8,n,n,MPI_COMM_WORLD,status,ierr )

        do j=1,numElementsNeeded(n+1)
            elementGlobalToLocal(elementsNeeded(n+1, j))=currReceived
            currReceived=currReceived+1
        enddo
    endif
enddo

Reducing the MPI call times drastically reduces the time spent in communication from (on 8 cores) roughly 269 seconds to roughly 13 seconds, a speedup by a factor of more than 20. This still doesn't change the problem of low scalability, though perhaps using non-blocking send/receive operations (with a barrier after the end of the loop to ensure all messages are received/sent) will do that.

Code Snippets

do n=0,numProcs-1
    if( my_id .eq. n ) then
        do i=0,numProcs-1
            if( i .ne. my_id ) then
                currProc=i
                call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id*2, & 
                               MPI_COMM_WORLD,status,ierr )

                allocate( currNeeded(numNeeded) )
                allocate( volsNeeded(numNeeded) )

                call MPI_RECV( currNeeded,numNeeded,MPI_INTEGER,currProc,my_id*2+1, & 
                               MPI_COMM_WORLD,status,ierr )

                do j=1,numNeeded
                    volsNeeded(j)=myElements(elementGlobalToLocal(currNeeded(j)))%xsj
                enddo

                call MPI_SEND( volsNeeded,numNeeded,MPI_REAL8,currProc,my_id, & 
                               MPI_COMM_WORLD,status,ierr )

                deallocate( currNeeded )
                deallocate( volsNeeded )
            endif
        enddo
    else
        call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n*2, & 
                       MPI_COMM_WORLD,status,ierr )

        call MPI_SEND( elementsNeeded(n+1, 1:numElementsNeeded(n+1)), & 
                       numElementsNeeded(n+1),MPI_INTEGER,n,n*2+1, & 
                       MPI_COMM_WORLD,status,ierr )

        call MPI_RECV( myExternalXsj(currReceived:currReceived+numElementsNeeded(n+1)), & 
                       numElementsNeeded(n+1),MPI_REAL8,n,n,MPI_COMM_WORLD,status,ierr )

        do j=1,numElementsNeeded(n+1)
            elementGlobalToLocal(elementsNeeded(n+1, j))=currReceived
            currReceived=currReceived+1
        enddo
    endif
enddo

Context

StackExchange Code Review Q#64178, answer score: 4

Revisions (0)

No revisions yet.