Cluster Deployment and Power State Control

This is the last part of the course, almost there! You’ll deploy your cluster, validate it with some smoke tests and learn a bit of Slurm along the way, and at last, understand BurrMill power modes and their use. This is enough to start your Kaldi experiments!

We are still assuming your current directory ./ is the root BurrMill directory.

Deploy your cluster with the bm-deploy tool

I hope I’ve kept you busy long enough that you already have your quota. Now we’re at the last step: hit that button, and you are ready to crunch! It’s really just one command away. Remember to commit your cluster template, but do not push yet. The deployment is done with the bm-deploy new command.

Hint to a seasoned Slurmer: You can add the -d1 option (shorthand for --debug=1) to see the final generated slurm.conf. If you are not one, ignore this as an implementation detail.

$ bm-deploy new xc
bm-deploy: Checking for an existing deployment of cluster 'xc'
bm-deploy: Reading configuration from /home/kkm/work/burrmill/etc/cluster/xc.cluster.yaml
bm-deploy: Confirming machine type availability in zone 'us-central1-c' for main nodes in 'etc/cluster/xc.cluster.yaml'
bm-deploy: Using machine type 'e2-medium' for low-power mode.
bm-deploy: Creating new cluster 'xc' in zone 'us-central1-c', project 'solo-mill-12pc'
bm-deploy: Verifying prerequisites: project, compute image, CNS snapshot.
Summary of the new deployment:
------------------------------------------------
Project . . . . . . . . . . . . . solo-mill-12pc
Cluster name  . . . . . . . . . . xc
Zone  . . . . . . . . . . . . . . us-central1-c
Common node software disk . . . . burrmill-cns-v009-200330
Shared NFS disk size  . . . . . . 1280GB
Periodic snapshot of NFS disk . . false
NFS server  . . . . . . . . . . . xc-filer
NFS server full power machine . . e2-highmem-8
NFS server low-power machine  . . e2-medium
Login node  . . . . . . . . . . . xc-login
Login node full power machine . . e2-standard-2
Login node low-power machine  . . e2-medium
------------------------------------------------
Is this a configuration you want to proceed with? [Y/n]:Y
bm-deploy: Writing Slurm configuration to project metadatum xc_slurm_config
bm-deploy: Deploying Slurm config from the work tree
bm-deploy: Writing GCE node class definition nodeclass/c2.gclass:
bm-deploy: Writing GCE node class definition nodeclass/n1.gclass:
bm-deploy: Writing GCE node class definition nodeclass/n2.gclass:
bm-deploy: Writing GCE node class definition nodeclass/p100.gclass:
bm-deploy: Writing GCE node class definition nodeclass/t4.gclass:
bm-deploy: Writing GCE node class definition nodeclass/v100.gclass:
bm-deploy: Adding Slurm node and partition definitions to slurm.conf:
bm-deploy: Packaging Slurm configuration files at revision mark '6e6675683': ***[1]
-rw-rw-r-- root/60000      136 2020-04-07 16:25 cgroup.conf
-rw-rw-r-- root/60000       41 2020-04-07 16:25 gres.conf
drwxrwxr-x root/60000        0 2020-04-07 16:25 nodeclass/
-rw-rw-r-- root/60000       77 2020-04-07 16:25 nodeclass/t4.gclass
-rw-rw-r-- root/60000       30 2020-04-07 16:25 nodeclass/c2.gclass
-rw-rw-r-- root/60000       79 2020-04-07 16:25 nodeclass/p100.gclass
-rw-rw-r-- root/60000       35 2020-04-07 16:25 nodeclass/n2.gclass
-rw-rw-r-- root/60000       69 2020-04-07 16:25 nodeclass/n1.gclass
-rw-rw-r-- root/60000       79 2020-04-07 16:25 nodeclass/v100.gclass
-rw-rw-r-- root/60000     2464 2020-04-07 16:25 slurm.conf
bm-deploy: Writing configuration to metadata 'xc_slurm_config' of project 'solo-mill-12pc'. This takes a minute, wait.
Updated [https://www.googleapis.com/compute/v1/projects/solo-mill-12pc].
bm-deploy: Deploying cluster xc. This may take up to 5 minutes.
The fingerprint of the deployment is xMhreGX_OfqJjlCJzZIP4g==
Waiting for create [operation-1586301955136-5a2bbb48addb4-f81f62ac-db443537]...done.
Create operation operation-1586301955136-5a2bbb48addb4-f81f62ac-db443537 completed successfully.
NAME                      TYPE                                        STATE      ERRORS  INTENT
burrmill-cns-v009-200330  compute.v1.disks                            COMPLETED  []
burrmill-xc-intranet      compute.v1.firewalls                        COMPLETED  []
cluster-xc                compute.v1.subnetworks                      COMPLETED  []
cns-source-snapshot       gcp-types/compute-v1:compute.snapshots.get  COMPLETED  []
runtimeconfig-xc          runtimeconfig.v1beta1.config                COMPLETED  []
xc-boot-control           compute.v1.disks                            COMPLETED  []
xc-boot-filer             compute.v1.disks                            COMPLETED  []
xc-boot-login             compute.v1.disks                            COMPLETED  []
xc-control                compute.v1.instance                         COMPLETED  []
xc-filer                  compute.v1.instance                         COMPLETED  []
xc-login                  compute.v1.instance                         COMPLETED  []
xc-shared-nfs-disk        compute.v1.disks                            COMPLETED  []
Created [https://runtimeconfig.googleapis.com/v1beta1/projects/solo-mill-12pc/configs/runtimeconfig-xc/variables/config].
Created [https://runtimeconfig.googleapis.com/v1beta1/projects/solo-mill-12pc/configs/runtimeconfig-xc/variables/zone].
bm-deploy: Waiting for the init completion signal from new machines
bm-deploy: slurmctld is ready on xc-control ***[2]
bm-deploy: nfsd is ready on xc-filer ***[2]
bm-deploy: The cluster machines are currently running. Use 'bm-power off xc' to turn them off later.
Deployment complete. Note that the cluster might not be in entirely consistent low-power state. ***[3]
... Next time you power it on, use 'bm-power [low|high] xc' to make it consistent.

Pay attention to the highlighted lines; the note numbers at the end of each are explained below:

  1. If you try to deploy an uncommitted configuration ./etc/cluster/xc.cluster.yaml, bm-deploy will get angry at you. The recommended order is: commit the change, deploy, and then, if anything goes wrong, update the file and amend the commit. Since it’s just a single, already committed file, you can do
$ git commit --amend --no-edit etc/cluster/xc.cluster.yaml
  2. Watch for these messages reporting the ready status. It is not uncommon for slurmctld to fail to start because of a botched configuration.
  3. This simply means that the cluster is now essentially in the low-power state, except that the controller is also running, which it normally is not in low-power mode. The power states are explained below.

This is it. Your hard work has paid off!

If you are about to go to your favorite pub and celebrate, please hang with me for another moment. You’re paying between $0.12 and $0.16 per hour for all three cluster nodes running. But do not shut them down yet.

Log in and smoke-test compute nodes

The main thing that can, and often does, go wrong is the Slurm node configuration. You need to make sure that the Slurm controller is responsive, and that it can spawn every type of node that you have declared. To validate the setup, log in to the cluster and force Slurm to spawn a node of each type.

First of all, be aware of two little things:

1. The nasty interplay between SSH, scp and the pty

If you are like me, the first thing you want to do is copy a few files from your home directory to your new cluster’s login node right away, to feel at home there. Unfortunately, this may give you a bit of a headache.

Do not do this on the initial login:

$ scp .bash_profile .bashrc .profile .gitconfig .inputrc .vimrc .dircolors .lessfilter .tmux.conf xc-login:

but instead do this: connect interactively, disconnect, copy files, and you are good to go.

$ ssh xc-login
. . .
han_solo_gmail_com@xc-login:~$ ^D
$ scp .bash_profile .bashrc .profile .gitconfig .inputrc .vimrc .dircolors .lessfilter .tmux.conf xc-login:
. . . file copy progress . . .
$ ssh xc-login

If you do the thing I warned you not to do, your SSH agent will not be forwarded. If that happens, disconnect, reset the connection with ssh -O exit xc-login and reconnect. I could explain at length why it works this way, but the practical takeaway is all you need: the first connection that you establish with the login node must be an interactive, systemd-controlled session for everything to work properly. This is an SSH shortcoming that is not easy to work around.
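
For example, here is the recovery sequence, assuming your SSH setup multiplexes connections to xc-login through a control master (which is the reason ssh -O exit works at all); ssh-add -l is just a quick way to confirm the agent made it across:

$ ssh -O exit xc-login      # tear down the stale shared connection
$ ssh xc-login              # start over with an interactive session
han_solo_gmail_com@xc-login:~$ ssh-add -l   # should list your keys if the agent is forwarded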

2. Cryptic message from the tunneling API

The message below means that you are trying to connect too early, while the SSH server on the host is not yet accepting connections. Retry once every few seconds, and you’ll get connected. It’s a GCE API bug (an incorrect diagnostic response) and does not indicate an issue with IAP authorization.

$ ssh xc-login
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: u'not authorized'].
ssh_exchange_identification: Connection closed by remote host

If your host becomes unhealthy, login sessions might not be allowed until something times out, and in that case you may continue to see this message for quite a while (I have not seen timeouts longer than 5 minutes). You will eventually get connected to the broken host; just wait a few minutes. This is not related to IAP tunneling, but rather to how systemd dependencies work: login is not enabled until critical system startup services are ready, and these are given generous timeouts.
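
If you’d rather not keep retyping the command, a trivial shell loop works too; this is just a sketch, not a BurrMill tool, and you can abandon it with Ctrl-C at any time:

$ until ssh xc-login; do sleep 5; done   # retry every 5 seconds until sshd starts accepting connections

When the connection finally succeeds you drop straight into the interactive session, and the loop ends when you exit it.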

Log in and start each type of compute node

We should probably automate this with a script, but for now, you do it manually. bm-deploy checks that the machine types you specify for the filer and the login node are available, but it cannot do that for compute nodes: that is Slurm’s business. After you connect to the login node, use a few Slurm commands to verify that the controller is healthy, and that it spawns nodes when requested.

The following is part hands-on walkthrough, with interludes explaining how Slurm and GCE machine states are related, and part demonstration of what can, and often does, go wrong, and how to diagnose it. You won’t encounter this error when reusing the stock node templates. For the demonstration I defined two types of CPU node and one GPU node; one of these won’t start, and we will diagnose why. This is a very common error, so you are likely to run into it at some point if you define different node shapes. Unless you want a full hands-on experiment, do not redefine your template like this; just mark this spot to return to later if you hit the issue. You should still perform most of the checks on a new cluster, though.

  n2:
    Count: 48
    CoresPerSocket: 5
    RealMemory: 15000
    Gres: gcecpu:n2:no_consume
    GCE:
      machine-type: n2-custom-10-15360

  c2:
    Count: 240
    CoresPerSocket: 2
    RealMemory: 16240
    Gres: gcecpu:c2:no_consume
    GCE:
      machine-type: c2-standard-4

  p100:
    Count: 24
    CoresPerSocket: 2
    RealMemory: 5960
    Gres: cuda:p100:1
    GCE:
      machine-type: n1-custom-4-6144
      accelerator: type=nvidia-tesla-p100,count=1

Connect to the login node, and run the sinfo command to make sure the controller is responding, and to see the summary status of all nodes. The STATE column, quite predictably, shows the node state: idle means nothing runs on the node, and the ~ suffix indicates that it is powered off. The output uses the compressed node name notation, where name-[1-48] stands for nodes name-1, name-2, …, name-48. Without arguments, the command groups nodes by the STATE column for more concise output.

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2    240       idle~ xc-node-c2-[1-240]
      std*      gcecpu:n2:1       1      5     48       idle~ xc-node-n2-[1-48]
      gpu       cuda:p100:1       1      2     24       idle~ xc-node-p100-[1-24]
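
If you ever need the expanded list rather than the compressed notation (for scripting, say), Slurm can unroll it for you; scontrol show hostnames prints one name per line, something like this:

xc-login:~$ scontrol show hostnames 'xc-node-n2-[1-4]'
xc-node-n2-1
xc-node-n2-2
xc-node-n2-3
xc-node-n2-4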

At the same time, use a gcloud command to see all machines in the project. This particular command is likely the one you will use most often; remember it. For one, it shows all machines, whether shut down (TERMINATED) or active (RUNNING), so you can check whether you forgot to turn off that 32-CPU node that is not doing anything useful. For now we run it just to establish a baseline. Since you’ve deployed your only cluster, you’ll see its 3 permanent nodes, all running, and nothing else.

xc-login:~$ gcloud compute instances list
NAME        ZONE           MACHINE_TYPE  PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
xc-control  us-central1-c  e2-highcpu-2               10.125.48.4                  RUNNING
xc-filer    us-central1-c  e2-medium                  10.125.48.2                  RUNNING
xc-login    us-central1-c  e2-medium                  10.125.48.3  35.239.207.127  RUNNING
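
Later, when the list grows longer, you may want to narrow it down. gcloud list commands accept a --filter expression; a couple of sketches (the filter syntax is gcloud’s own, not BurrMill’s):

$ gcloud compute instances list --filter='status=RUNNING'      # only machines that are currently up
$ gcloud compute instances list --filter='name ~ ^xc-node-'    # only compute nodes of cluster xc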

Now order Slurm to power up one node of each type, to make sure they can be started. The command must be run as the user slurm, or it will be rejected.

xc-login:~$ sudo -u slurm scontrol update nodename=xc-node-c2-1,xc-node-n2-1,xc-node-p100-1 state=POWER_UP

Quickly run sinfo again. The # suffix indicates that the node is powering up, and Slurm is waiting for its readiness signal.

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2      1       idle# xc-node-c2-1
      std*      gcecpu:n2:1       1      5      1       idle# xc-node-n2-1
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     47       idle~ xc-node-n2-[2-48]
      gpu       cuda:p100:1       1      2      1       idle# xc-node-p100-1
      gpu       cuda:p100:1       1      2     23       idle~ xc-node-p100-[2-24]

See what’s going on with instances in GCE. You should see each of the nodes you just started. CPU machines take a few seconds to boot, so they are already RUNNING (this only reflects the “power switch” state of the virtual “hardware”, a low-level state). GPU nodes take more time to set up, so the GPU node is still STAGING, not yet even powered on.

xc-login:~$ gcloud compute instances list
NAME            ZONE           MACHINE_TYPE                 PREEMPTIBLE  INTERNAL_IP   EXTERNAL_IP     STATUS
xc-control      us-central1-c  e2-highcpu-2                              10.125.48.4                   RUNNING
xc-filer        us-central1-c  e2-medium                                 10.125.48.2                   RUNNING
xc-login        us-central1-c  e2-medium                                 10.125.48.3   35.239.207.127  RUNNING
xc-node-c2-1    us-central1-c  c2-standard-4                true         10.125.48.14                  RUNNING
xc-node-n2-1    us-central1-c  custom (10 vCPU, 15.00 GiB)  true         10.125.48.16                  RUNNING
xc-node-p100-1  us-central1-c  custom (4 vCPU, 6.00 GiB)    true         10.125.48.15                  STAGING

Run sinfo a few times, once every 5-10 seconds. As soon as you see the STATE column showing idle with no suffix, the smoke test has passed; these nodes are in a usable state. In our what-if demo, the c2 node type has a configuration problem, so Slurm drained it and will not schedule any jobs on it.

  • drain indicates that the node was created, but has a configuration issue such that Slurm cannot use it.
  • down state in 2 minutes or less after the start request means the node cannot be started at all. The most likely cause for this state is that either the CPU or GPU is not offered in this zone. This is very easy to spot, so I won’t cover this case. Occasionally zones do genuinely run out of resources, but it’s a rare event.

In any case, you are waiting for output with no “clarifying” suffix character after the state.

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     47       idle~ xc-node-n2-[2-48]
      std*      gcecpu:c2:1       1      2      1       drain xc-node-c2-1
      std*      gcecpu:n2:1       1      5      1        idle xc-node-n2-1
      gpu       cuda:p100:1       1      2     23       idle~ xc-node-p100-[2-24]
      gpu       cuda:p100:1       1      2      1        idle xc-node-p100-1

When you see a problem like a drained node, it is nearly certain that the CPU count, the memory size, or both, specified in the template for Slurm’s use do not match the actual values, which depend on the GCE machine spec. The best place to peek is the log of the Slurm control daemon, slurmctld. Just grab the last few (-n30) errors (-p4) that it has reported:

xc-login:~$ ssh xc-control 'journalctl -b -p4 -u slurmctld -n30'
-- Logs begin at Tue 2020-04-07 16:27:07 PDT, end at Wed 2020-04-08 14:09:50 PDT. --
Apr 08 14:09:37 xc-control slurmctld[619]: error: Node xc-node-c2-1 has low real_memory size (16041 < 16240)
Apr 08 14:09:37 xc-control slurmctld[619]: error: _slurm_rpc_node_registration node=xc-node-c2-1: Invalid argument
 . . .

You can see here that the node reported that it in fact has less available RAM than specified in the Slurm configuration. This is what has to be fixed in the cluster template.
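
The same reason is also recorded by Slurm on the node itself, so you do not always have to go to the controller log; either of these quick checks (plain Slurm commands, nothing BurrMill-specific) shows it:

xc-login:~$ sinfo -R                                    # reasons for all drained/down nodes
xc-login:~$ scontrol show node xc-node-c2-1 | grep -i reason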

Let’s now step back from the problem we just discovered, and see what happens next. To get a feel for how the whole rig works, let the nodes time out (they stay up for 6 minutes if no jobs are sent to them). Repeat sinfo every 10-15 seconds. At some point the idle nodes are “powered down” (BurrMill actually fulfills this request by deleting the VMs), and you will see output like that below. You may also spot the idle% state, which means the node is in the process of powering down; it will change to idle~ in a minute or two. We set this timeout with a good margin: it’s okay if Slurm won’t use a node for a while because it’s still %, not ~, but it’s pretty bad if the node goes into the ~ state before GCE finishes deleting the machine: the name is still in use, and Slurm will fail to power up a node with the same name if it decides to. This is why you should have extra nodes in each group: if you plan to go up to 20-GPU parallelism, reserve 24 GPU nodes; similarly, reserve a few extra CPU-only nodes beyond what you plan to use.

Returning to our problem, you can run sinfo until Slurm settles on declaring the problem with the node permanent. This is reflected in the third possible node state suffix character, the * (not to be confused with the * after the std partition name, which indicates the default partition).

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     48       idle~ xc-node-n2-[1-48]
      std*      gcecpu:c2:1       1      2      1      drain* xc-node-c2-1
      gpu       cuda:p100:1       1      2     24       idle~ xc-node-p100-[1-24]

If you wait a bit more, about a minute, the drained node will change to the idle~ state. This happens when our trigger, a script that Slurm binds to several different types of event, kicks in and resets the faulted node. The main purpose of the trigger is to recover nodes after they have been preempted by GCE: to Slurm, it looks like the node shut down on its own or crashed, so it is no longer useful for scheduling. Such nodes would be flagged with the * indefinitely, until an administrator intervenes, which makes perfect sense for a hardware node. Under BurrMill, the trigger script intervenes instead: we expect that some nodes will be preempted, and such nodes are simply deleted and marked healthy again by the trigger.

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2    240       idle~ xc-node-c2-[1-240]
      std*      gcecpu:n2:1       1      5     48       idle~ xc-node-n2-[1-48]
      gpu       cuda:p100:1       1      2     24       idle~ xc-node-p100-[1-24]

As for fixing the too-little-RAM issue, it’s clear: the error is in the estimate of RealMemory for the c2 node type. If you were really fixing this, you’d go the normal route: change the template by lowering the declared RAM to 16000 (which, by no coincidence at all, happens to be the actual default in the stock template!), reserving about 41MB in case the image is rebuilt and the free RAM ends up slightly lower; then check it in, or rather, amend the top commit with it, if it’s still the one adding your new template.
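
For reference, this is how the corrected c2 entry would look in the cluster template; only RealMemory changes, everything else stays as defined above:

  c2:
    Count: 240
    CoresPerSocket: 2
    RealMemory: 16000           # was 16240; the node reported only 16041
    Gres: gcecpu:c2:no_consume
    GCE:
      machine-type: c2-standard-4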

A catch here is that bm-deploy conf -s xc will not let you update a cluster config while the cluster is powered on. This would be unsafe if an actual computation were in progress, but we are still debugging a new installation. The check can be overridden with the -f/--force switch; this is exactly its intended use case.

$ git commit --amend --no-edit etc/cluster/xc.cluster.yaml
[master d35d3b8] Define configuration for cluster xc
 . . . .
$ bm-deploy conf -s -f xc
bm-deploy: Reading deployment record of cluster 'xc' in project 'solo-mill-12pc'...
bm-deploy: Reading configuration from /home/kkm/work/burrmill/etc/cluster/xc.cluster.yaml
bm-deploy:[WARNING]: Cluster 'xc' must be powered off. Changing Slurm configuration while the
... cluster is online will create inconsistencies between controller and compute nodes.
... Make sure to either reboot 'xc-control' or run  'systemctl restart slurmctld' on it
Verifying/updating Slurm configuration
 . . .
bm-deploy: Writing configuration to metadata 'xc_slurm_config' of project 'solo-mill-12pc'. This takes a minute, wait.
Updated [https://www.googleapis.com/compute/v1/projects/solo-mill-12pc].

Now you should force both the control and login nodes to re-read the configuration: Slurm requires identical configuration on all nodes, so you would not be able to control it from the login node unless you had updated it too. On the controller, reading the configuration is part of the slurmctld start and restart routines. On the login node, a special systemd service slurmconf handles this.

The reload is easily done from the login node:

xc-login:~$ ssh xc-control 'sudo systemctl restart slurmctld'
xc-login:~$ sudo systemctl restart slurmconf

Then repeat the power-up test. This is how it should look when it passes on the first attempt, which will most likely be your case:

xc-login:~$ sudo -u slurm scontrol update nodename=xc-node-c2-1,xc-node-n2-1,xc-node-p100-1 state=POWER_UP
xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2      1       idle# xc-node-c2-1
      std*      gcecpu:n2:1       1      5      1       idle# xc-node-n2-1
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     47       idle~ xc-node-n2-[2-48]
      gpu       cuda:p100:1       1      2      1       idle# xc-node-p100-1
      gpu       cuda:p100:1       1      2     23       idle~ xc-node-p100-[2-24]
xc-login:~$ gcloud compute instances list
NAME            ZONE           MACHINE_TYPE                 PREEMPTIBLE  INTERNAL_IP   EXTERNAL_IP     STATUS
xc-control      us-central1-c  e2-highcpu-2                              10.125.48.4                   RUNNING
xc-filer        us-central1-c  e2-medium                                 10.125.48.2                   RUNNING
xc-login        us-central1-c  e2-medium                                 10.125.48.3   35.239.207.127  RUNNING
xc-node-c2-1    us-central1-c  c2-standard-4                true         10.125.48.17                  RUNNING
xc-node-n2-1    us-central1-c  custom (10 vCPU, 15.00 GiB)  true         10.125.48.19                  STAGING
xc-node-p100-1  us-central1-c  custom (4 vCPU, 6.00 GiB)    true         10.125.48.18                  STAGING

Repeat the sinfo command a few times, every 5-10 seconds, until you see the node “powering up” indicator # disappear:

xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     47       idle~ xc-node-n2-[2-48]
      std*      gcecpu:c2:1       1      2      1        idle xc-node-c2-1
      std*      gcecpu:n2:1       1      5      1        idle xc-node-n2-1
      gpu       cuda:p100:1       1      2     23       idle~ xc-node-p100-[2-24]
      gpu       cuda:p100:1       1      2      1        idle xc-node-p100-1

All nodes requested to power up did so, so the smoke test has been successful.
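
If you want one more check while a node is still up, you can run a trivial job with plain Slurm commands; this is standard Slurm usage, not anything BurrMill-specific, and it confirms that job scheduling works, not just node power-up:

xc-login:~$ srun -p std -N1 -n1 hostname    # should print the name of whichever idle node Slurm picks

If Slurm happens to pick a powered-down node instead, the command will simply pause while that node is spun up.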

You may leave the nodes to expire in 6 minutes, or ask Slurm to power them down:

xc-login:~$ sudo -u slurm scontrol update nodename=xc-node-c2-1,xc-node-n2-1,xc-node-p100-1 state=POWER_DOWN
xc-login:~$ sinfo
      PARTITION GRES         WEIGHT  CORES  NODES       STATE NODELIST
      std*      gcecpu:c2:1       1      2      1       idle% xc-node-c2-1
      std*      gcecpu:n2:1       1      5      1       idle% xc-node-n2-1
      std*      gcecpu:c2:1       1      2    239       idle~ xc-node-c2-[2-240]
      std*      gcecpu:n2:1       1      5     47       idle~ xc-node-n2-[2-48]
      gpu       cuda:p100:1       1      2      1       idle% xc-node-p100-1
      gpu       cuda:p100:1       1      2     23       idle~ xc-node-p100-[2-24]

The % suffix indicates that Slurm is waiting for the nodes to report that they have powered down, and then waits a bit longer before the names can be reused.

Finally, make sure that the shared NFS disk is accessible and set up. There may be a short delay of about 0.5s on first access, because /mill is an automount: the NFS share is not mounted until it is first accessed. This allows the cluster’s main nodes to boot independently, in any order.

xc-login:~$ ls -lR /mill
/mill:
total 12
drwxr-xr-x 2 user burrmill 4096 2020-04-07 16:27:07 data
drwxr-xr-x 2 user burrmill 4096 2020-04-07 16:27:07 etc
drwxr-xr-x 2 user burrmill 4096 2020-04-07 18:21:08 home

/mill/data:
total 0

/mill/etc:
total 0

/mill/home:
total 0

Note that everyone can read and write the shared disk from any machine. Security has been turned off for it entirely, by design.

Now you can disconnect from the login node and power down the cluster:

xc-login:~$ ^D
$ bm-power off
bm-power: Reading deployment record of cluster 'xc' in project 'solo-mill-12pc'...
bm-power: Stopping nodes: xc-control xc-filer xc-login
bm-power: Asynchronous stop request submitted

And the last thing you need to know is how to control the low/high power states of the cluster.

Power control

The “low power” and “high power” mode names are actually euphemisms for “cheap” and “expensive.” When you power on the cluster in low-power mode, only the NFS server and the login node are started, both using the e2-medium machine type by default. This is the mode you normally use to prepare experiments, analyze results or run single-node test jobs. The high-power mode is used for running actual experiments: the machine types are upgraded to their full specification, and the control node is also started.

The bm-power command, the most boring and the most frequently used, has 4 subcommands that control on/off and high/low power states:

  • bm-power off: Turn all machines off.
  • bm-power low: Boot the cluster into low-power mode. The cluster must be powered off for this command to work.
  • bm-power high: Boot from power-off into high-power mode.
  • bm-power on: Start the cluster in its last high/low power state.

bm-power without arguments will show the current power state of your cluster.
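
A typical working day with the cluster then looks something like this (the cluster name is shown explicitly here; substitute your own, and remember that the low/high commands boot from the powered-off state):

$ bm-power low xc     # morning: start only the filer and the login node, cheaply
$ ssh xc-login        # prepare data, hack on scripts, look at results
. . .
$ bm-power off xc     # switch modes from the powered-off state
$ bm-power high xc    # run the experiment: full machine specs, controller on
. . .
$ bm-power off xc     # done for the day: stop paying for the main nodes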


This. Is. It. You made it! You now know everything you need to make your work with Kaldi in GCP efficient and money-saving. Two appendices follow: a concise cheatsheet of the tasks you’ve just completed, and a cheatsheet of gcloud and gsutil commands you may need to get started. In a later post, we’ll create a Stackdriver dashboard to monitor cluster load, shared disk throttling and node preemption event counts.
