Azure virtual machine stalls in R due to parallel package

Submitted by: @import:stackexchange-devops

Problem

I am writing an R package with tests using the testthat package. The tests pass locally and on Travis.

I want to plot the benefit of parallelisation on up to 24 cores, so I set up a virtual machine programmatically on Azure:

az vm create \
   --resource-group <resource group> \
   --name <name> \
   --image microsoft-dsvm:linux-data-science-vm-ubuntu:linuxdsvmubuntu:18.12.01 \
   --size Standard_NV24 \
   --admin-username <azure user> \
   --generate-ssh-keys


I call devtools::test() and the virtual machine gets stuck at testthat for hours with 0% CPU usage:

✔  checking for unstated dependencies in ‘tests’ ...
─  checking tests ...
   Running ‘testthat.R’


devtools::test() has no specific arguments to print some output, and it calls testthat::test_dir(), which also has no arguments to print output.
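One way to keep a silent hang from blocking a session for hours is to cap the run with the coreutils `timeout` command. The following is a minimal sketch, not part of the original question: the helper name `run_capped` and the 600-second limit in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# run_capped SECONDS CMD...: run CMD, but stop it if it runs longer than SECONDS.
# Returns the exit status of CMD, or 124 (the status `timeout` uses) on a timeout.
run_capped() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Hypothetical usage: give the package tests at most 10 minutes.
# run_capped 600 Rscript -e 'devtools::test()'
```

A status of 124 then distinguishes a stall from an ordinary test failure. Worker processes forked by R may outlive the parent, so it is worth checking for leftovers after a timeout.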

When I remove the tests, the code stalls in a function call. I stopped and relaunched multiple times, so I think it's a problem with Azure.

Other machine sizes, Standard_F16s_v2 and Standard_DS3_v2, have the same problem.

How can I fix it?

Update

I stumbled upon the culprit when I was running another piece of code. The VM hangs when calling the R parallel package on multiple cores (it works fine on just one core).

I added some printouts around that call:

[1] "num_cores = 2"
[1] "Entering parallel"
[1] "Drawing 1"
[1] "Drawing 2"


and then it hangs. Following a suggestion to use strace, I launched it again and saw this output around the last printout:

```
[pid 8571] 21:41:48.718625 write(1, "\n", 1[1] "num_cores = 2"
) = 1
[pid 8571] 21:41:48.719486 write(1, "[1]", 3) = 3
[pid 8571] 21:41:48.719550 write(1, " \"num_cores = 2\"", 16) = 16
[pid 8571] 21:41:48.719615 write(1, "\n", 1) = 1
[pid 8703] 21:41:48.730707 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL
[pid 8705] 21:41:48.730745 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL
[pid 8701] 21:41:48.730757 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL
[pid 8704] 21:41:48.730794 futex(0x49
```
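To see at a glance which processes are stuck, a saved trace can be filtered for futex waits that never show a return value. This is a rough sketch that assumes the trace was captured from stderr in the `[pid N]` format shown above (for example with `strace -f -tt -p <pid> 2> trace.log`); the helper name `blocked_futex_pids` and the file name are hypothetical.

```shell
#!/bin/sh
# blocked_futex_pids LOGFILE: print the pids that have a futex(... FUTEX_WAIT ...)
# line with no return value (no ") = N") in the trace.
# It does not check whether the call was resumed later in the log.
blocked_futex_pids() {
    awk '
        /futex\(/ && /FUTEX_WAIT/ && !/\) = / {
            pid = $2            # second field looks like "8703]"
            sub(/\]/, "", pid)  # strip the closing bracket
            print pid
        }
    ' "$1" | sort -u
}

# Hypothetical usage:
# blocked_futex_pids trace.log
```

Several pids waiting on the same futex address, as in the trace above, suggest that the worker threads or processes are all blocked on one lock rather than doing any work, which matches the 0% CPU usage.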

Solution

The problem is with the image
microsoft-dsvm:linux-data-science-vm-ubuntu:linuxdsvmubuntu:18.12.01 ("Linux
Data Science VM" on the Azure portal). This is an old image that ships R 3.4
instead of the current release, 3.5.
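If the R version is indeed what differs between a working and a stalling machine, it can be checked before running anything expensive. The sketch below compares dotted version strings with GNU `sort -V`; the helper name `version_at_least` is an assumption, and the 3.5.0 threshold in the usage comment is simply the release mentioned above, not a verified minimum.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed if dotted version HAVE is >= WANT.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical usage on the VM:
# r_version=$(Rscript -e 'cat(as.character(getRversion()))')
# version_at_least "$r_version" 3.5.0 || echo "R $r_version is older than 3.5.0" >&2
```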

I succeeded with the following virtual machine (Ubuntu Server 19.04, size Standard_D64_v3):

az vm create \
       --resource-group <resource group> \
       --name <name> \
       --image Canonical:UbuntuServer:19.04:19.04.201906280 \
       --size Standard_D64_v3 \
       --admin-username <azure user> \
       --generate-ssh-keys


and installed R, RStan, and the SSL libraries needed to install curl and devtools with:

sudo apt update
sudo apt -y install r-base
sudo apt -y install r-cran-rstan

# Add LibSSL for installing curl and devtools, see:
# https://stackoverflow.com/questions/44228055/r-rstudio-install-devtools-fails
sudo apt-get install libcurl4-openssl-dev libssl-dev


and the machine can use multiple cores:

Welcome to PosteriorBootstrap, a parallel approach for adaptive non-parametric learning
[1] "Speedup performance"
[1] "n_bootstrap = 100"
[1] "num_cores = 1"
[1] "Finished sampling"
[1] "num_cores = 2"
[1] "Finished sampling"
...
[1] "num_cores = 64"
[1] "Finished sampling"


I also checked the duration and the speedup is as I expected.
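A duration check of this kind can be scripted by timing the same command at different core counts and dividing the single-core time by the multi-core time. The sketch below is illustrative only: the helper names `elapsed_seconds` and `speedup` and the script `speedup.R` in the usage comment are hypothetical, and timings are in whole seconds.

```shell
#!/bin/sh
# elapsed_seconds CMD...: run CMD (output discarded) and print its
# wall-clock duration in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# speedup T1 TN: print T1 / TN to two decimals, where T1 is the single-core
# time and TN the time on N cores. Fails if TN is not positive.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { if (tn <= 0) exit 1; printf "%.2f\n", t1 / tn }'
}

# Hypothetical usage, assuming a script that takes the core count as an argument:
# t1=$(elapsed_seconds Rscript speedup.R 1)
# t8=$(elapsed_seconds Rscript speedup.R 8)
# speedup "$t1" "$t8"
```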

Context

StackExchange DevOps Q#8064, answer score: 1