patternMinor
Efficient communication across multiple processors for finite elements
Viewed 0 times
finiteelementsprocessorscommunicationefficientformultipleacross
Problem
I am implementing finite element code using MPI. Obviously, when using MPI I will have to include communication between different nodes so that all processors have all relevant information. As this is my first time doing something like this, I have implemented a very naive algorithm to get updated volume information for each element as shown below:
```
!loop over all of the processors
do n=0,numProcs-1
!for each iteration of the loop, one processor will hand out relevant information
if( my_id .eq. n ) then
do i=0,numProcs-1
if( i .ne. my_id ) then
currProc=i
!get the number of volumes processor i needs from processor n
call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id, &
MPI_COMM_WORLD,status,ierr )
!for each volume that processor i needs:
do j=1,numNeeded
!get the element number
call MPI_RECV( currNeeded,1,MPI_INTEGER,i,j, &
MPI_COMM_WORLD,status,ierr )
!return the volume of that element
call MPI_SEND( myElements(elementGlobalToLocal(currNeeded))%xsj,1, &
MPI_REAL8,i,j,MPI_COMM_WORLD,status,ierr )
enddo
endif
enddo
else
!tell processor n how many volumes are needed from it, calculated elsewhere
call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n, &
MPI_COMM_WORLD,status,ierr )
!for each volume needed
do j=1,numElementsNeeded(n+1)
!send the element number, calculated elsewhere
call MPI_SEND( elementsNeeded(n+1, j),1,MPI_INTEGER,n,j, &
MPI_COMM_WORLD,status,ierr )
!receive the appropriate volume
call MPI_RECV( myExternalXsj(currReceived),1,MPI_REAL8,n,j, &
MPI_COMM_WORLD,status,ierr
```
!loop over all of the processors
do n=0,numProcs-1
!for each iteration of the loop, one processor will hand out relevant information
if( my_id .eq. n ) then
do i=0,numProcs-1
if( i .ne. my_id ) then
currProc=i
!get the number of volumes processor i needs from processor n
call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id, &
MPI_COMM_WORLD,status,ierr )
!for each volume that processor i needs:
do j=1,numNeeded
!get the element number
call MPI_RECV( currNeeded,1,MPI_INTEGER,i,j, &
MPI_COMM_WORLD,status,ierr )
!return the volume of that element
call MPI_SEND( myElements(elementGlobalToLocal(currNeeded))%xsj,1, &
MPI_REAL8,i,j,MPI_COMM_WORLD,status,ierr )
enddo
endif
enddo
else
!tell processor n how many volumes are needed from it, calculated elsewhere
call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n, &
MPI_COMM_WORLD,status,ierr )
!for each volume needed
do j=1,numElementsNeeded(n+1)
!send the element number, calculated elsewhere
call MPI_SEND( elementsNeeded(n+1, j),1,MPI_INTEGER,n,j, &
MPI_COMM_WORLD,status,ierr )
!receive the appropriate volume
call MPI_RECV( myExternalXsj(currReceived),1,MPI_REAL8,n,j, &
MPI_COMM_WORLD,status,ierr
Solution
The answer was staring me right in the face, I just wasn't paying attention, apparently. Instead of sending one
Reducing the MPI call times drastically reduces the time spent in communication from (on 8 cores) roughly 269 seconds to roughly 13 seconds, a speedup by a factor of more than 20. This still doesn't change the problem of low scalability, though perhaps using non-blocking send/receive operations (with a barrier after the end of the loop to ensure all messages are received/sent) will do that.
integer and one real*8 at a time, I should have been sending all of them at once, as:do n=0,numProcs-1
if( my_id .eq. n ) then
do i=0,numProcs-1
if( i .ne. my_id ) then
currProc=i
call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id*2, &
MPI_COMM_WORLD,status,ierr )
allocate( currNeeded(numNeeded) )
allocate( volsNeeded(numNeeded) )
call MPI_RECV( currNeeded,numNeeded,MPI_INTEGER,currProc,my_id*2+1, &
MPI_COMM_WORLD,status,ierr )
do j=1,numNeeded
volsNeeded(j)=myElements(elementGlobalToLocal(currNeeded(j)))%xsj
enddo
call MPI_SEND( volsNeeded,numNeeded,MPI_REAL8,currProc,my_id, &
MPI_COMM_WORLD,status,ierr )
deallocate( currNeeded )
deallocate( volsNeeded )
endif
enddo
else
call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n*2, &
MPI_COMM_WORLD,status,ierr )
call MPI_SEND( elementsNeeded(n+1, 1:numElementsNeeded(n+1)), &
numElementsNeeded(n+1),MPI_INTEGER,n,n*2+1, &
MPI_COMM_WORLD,status,ierr )
call MPI_RECV( myExternalXsj(currReceived:currReceived+numElementsNeeded(n+1)), &
numElementsNeeded(n+1),MPI_REAL8,n,n,MPI_COMM_WORLD,status,ierr )
do j=1,numElementsNeeded(n+1)
elementGlobalToLocal(elementsNeeded(n+1, j))=currReceived
currReceived=currReceived+1
enddo
endif
enddoReducing the MPI call times drastically reduces the time spent in communication from (on 8 cores) roughly 269 seconds to roughly 13 seconds, a speedup by a factor of more than 20. This still doesn't change the problem of low scalability, though perhaps using non-blocking send/receive operations (with a barrier after the end of the loop to ensure all messages are received/sent) will do that.
Code Snippets
do n=0,numProcs-1
if( my_id .eq. n ) then
do i=0,numProcs-1
if( i .ne. my_id ) then
currProc=i
call MPI_RECV( numNeeded,1,MPI_INTEGER,i,my_id*2, &
MPI_COMM_WORLD,status,ierr )
allocate( currNeeded(numNeeded) )
allocate( volsNeeded(numNeeded) )
call MPI_RECV( currNeeded,numNeeded,MPI_INTEGER,currProc,my_id*2+1, &
MPI_COMM_WORLD,status,ierr )
do j=1,numNeeded
volsNeeded(j)=myElements(elementGlobalToLocal(currNeeded(j)))%xsj
enddo
call MPI_SEND( volsNeeded,numNeeded,MPI_REAL8,currProc,my_id, &
MPI_COMM_WORLD,status,ierr )
deallocate( currNeeded )
deallocate( volsNeeded )
endif
enddo
else
call MPI_SEND( numElementsNeeded(n+1),1,MPI_INTEGER,n,n*2, &
MPI_COMM_WORLD,status,ierr )
call MPI_SEND( elementsNeeded(n+1, 1:numElementsNeeded(n+1)), &
numElementsNeeded(n+1),MPI_INTEGER,n,n*2+1, &
MPI_COMM_WORLD,status,ierr )
call MPI_RECV( myExternalXsj(currReceived:currReceived+numElementsNeeded(n+1)), &
numElementsNeeded(n+1),MPI_REAL8,n,n,MPI_COMM_WORLD,status,ierr )
do j=1,numElementsNeeded(n+1)
elementGlobalToLocal(elementsNeeded(n+1, j))=currReceived
currReceived=currReceived+1
enddo
endif
enddoContext
StackExchange Code Review Q#64178, answer score: 4
Revisions (0)
No revisions yet.